### Create your own dataset for text classification. It should contain at least 1000 words in total and at least two categories with at least 100 examples per category. You can create it by scraping the web or using some of the documents you have on your computer (do not use anything confidential) or ChatGPT.

 **Importing required libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import tensorflow as tf
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
import os
import urllib.request
from bs4 import BeautifulSoup
import time


**Web Scraping articles that are related to food**

In [None]:
# Load CSV file containing URLs
csv_file_path = "food.csv"
df = pd.read_csv(csv_file_path)

# Create a folder to store the text files
output_folder = "food_text_files"
os.makedirs(output_folder, exist_ok=True)

# Iterate through each URL in the CSV file
for index, row in df.iterrows():
    url = row['url']

    # Set user-agent
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    try:
        # Retrieve the HTML content from the URL
        request = urllib.request.Request(url, headers=headers)
        html = urllib.request.urlopen(request).read()

        # Create a BeautifulSoup object to parse the HTML
        soup = BeautifulSoup(html, 'html.parser')

        # Extract text from paragraphs and write to a text file
        output_file_path = os.path.join(output_folder, f"food_{index + 1}.txt")  # Output file path based on index
        with open(output_file_path, "w", encoding="utf-8") as file:  # match the utf-8 encoding used when reading the files back
            for data in soup.find_all("p"):
                text = data.get_text()
                file.write(text + "\n")

        # Print a message indicating successful execution
        print(f"Text from {url} saved to {output_file_path}.")

    except Exception as e:
        print(f"Failed to retrieve {url}. Error: {e}")


Text from https://www.epicurious.com/recipes-menus/most-saved-recipes saved to food_text_files/food_1.txt.
Text from https://www.epicurious.com/holidays-events/punch-history saved to food_text_files/food_2.txt.
Text from https://www.epicurious.com/recipes-menus/make-mix-and-match-tutti-frutti-thumbprint-cookies saved to food_text_files/food_3.txt.
Text from https://pinchofyum.com/salmon-tacos saved to food_text_files/food_4.txt.
Text from https://pinchofyum.com/mushroom-bowls-with-kale-pesto saved to food_text_files/food_5.txt.
Text from https://iamafoodblog.com/garlic-lobster-pasta/ saved to food_text_files/food_6.txt.
Text from https://iamafoodblog.com/best-mashed-potatoes/ saved to food_text_files/food_7.txt.
Text from https://smittenkitchen.com/2023/11/olive-oil-brownies/ saved to food_text_files/food_8.txt.
Text from https://smittenkitchen.com/2023/09/chicken-rice-with-buttered-onions/ saved to food_text_files/food_9.txt.
Text from https://www.thekitchn.com/milk-tea-recipe-2361234

- This cell scrapes every URL listed in the 'url' column of "food.csv", a CSV file of links to food-related articles.
- For each page, BeautifulSoup extracts the text of every HTML paragraph and writes it to a separate text file. The output folder "food_text_files" is created to store the resulting files.
- Each download is wrapped in a try/except block, so a failed request prints an error message and the loop continues with the next URL.
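
To make concrete what "extracting text from HTML paragraphs" keeps and what it drops, here is a minimal sketch that uses Python's built-in html.parser as a stand-in for BeautifulSoup's find_all("p") and get_text(). The class ParagraphExtractor, the helper extract_paragraphs, and the sample HTML are made up for illustration and are not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0      # how many <p> elements are currently open
        self._chunks = []    # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            if self._depth == 1:
                self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append("".join(self._chunks))

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph
        if self._depth > 0:
            self._chunks.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Invented sample page, not taken from any of the scraped sites
sample_html = """
<html><body>
<nav>Home | Recipes</nav>
<h1>Olive Oil Brownies</h1>
<p>These brownies use <b>olive oil</b> instead of butter.</p>
<p>Advertisement</p>
<div>Share this page</div>
<p>Bake for 25 minutes.</p>
</body></html>
"""

for paragraph in extract_paragraphs(sample_html):
    print(paragraph)
```

Text outside paragraph tags (navigation, headings, share widgets) is dropped, but anything a site places inside a paragraph is kept. That is consistent with the "Advertisement" and "Supported by" lines that show up in some of the scraped files.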

**Web Scraping articles that are related to sports**

In [None]:
csv_file_path = "sports.csv"
df = pd.read_csv(csv_file_path)

# Create a folder to store the text files
output_folder = "sports_text_files"
os.makedirs(output_folder, exist_ok=True)

# Iterate through each URL in the CSV file
for index, row in df.iterrows():
    url = row['url']

    # Set user-agent
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

    try:
        # Retrieve the HTML content from the URL
        request = urllib.request.Request(url, headers=headers)
        html = urllib.request.urlopen(request).read()

        # Create a BeautifulSoup object to parse the HTML
        soup = BeautifulSoup(html, 'html.parser')

        # Extract text from paragraphs and write to a text file
        output_file_path = os.path.join(output_folder, f"sports_{index + 1}.txt")  # Output file path based on index
        with open(output_file_path, "w", encoding="utf-8") as file:  # match the utf-8 encoding used when reading the files back
            for data in soup.find_all("p"):
                text = data.get_text()
                file.write(text + "\n")

        # Print a message indicating successful execution
        print(f"Text from {url} saved to {output_file_path}.")

    except Exception as e:
        print(f"Failed to retrieve {url}. Error: {e}")


Text from https://www.cricbuzz.com/cricket-news/128558/stats-travis-head-joins-elite-list-after-world-cup-final-heroics saved to sports_text_files/sports_1.txt.
Text from https://theathletic.com/5111175/2023/12/06/pwhl-womens-hockey-league-history/?source=nyt_sports saved to sports_text_files/sports_2.txt.
Text from https://apnews.com/article/pacers-lakers-haliburton-lebron-nba-tournament-c71a82580486d59cd3ae050403767188 saved to sports_text_files/sports_3.txt.
Text from https://www.cnn.com/2023/12/08/sport/new-england-patriots-pittsburgh-steelers-tnf-nfl-spt-intl/index.html saved to sports_text_files/sports_4.txt.
Text from https://abcnews.go.com/Sports/us-mexico-submit-joint-bid-host-2027-womens/story?id=105502115 saved to sports_text_files/sports_5.txt.
Text from https://www.espncricinfo.com/story/ian-chappell-captains-should-be-suspended-if-their-teams-can-t-bowl-90-overs-a-day-1325032 saved to sports_text_files/sports_6.txt.
Text from https://www.timeforkids.com/g56/big-win-gauff/

- Similarly, this cell scrapes every URL listed in the 'url' column of "sports.csv", a CSV file of links to sports-related articles.
- For each page, BeautifulSoup extracts the text of every HTML paragraph and writes it to a separate text file. The output folder "sports_text_files" is created to store the resulting files.
- As before, each download is wrapped in a try/except block, so a failed request prints an error message and the loop continues with the next URL.

**Creating a dataset that stores all the web-scraped articles related to Sports**

In [None]:
text_files_folder = "sports_text_files"

# Create an empty list to store data
data = []
total_words = 0  # Variable to store the total number of words

# Iterate through each file in the folder
for filename in sorted(os.listdir(text_files_folder)):
    file_path = os.path.join(text_files_folder, filename)

    # Read the content of the file
    with open(file_path, "r", encoding="utf-8") as file:
        document = file.read()

    # Extract information from the filename
    category = filename.split('_')[0]
    title = filename.split('.')[0]

    # Calculate the number of words in the document
    num_words = len(document.split())
    total_words += num_words

    # Append data to the list
    data.append([title, document, category, num_words])

# Create a DataFrame from the list
df = pd.DataFrame(data, columns=['Title', 'Document', 'Class', 'Num_Words'])

# Print the sorted dataset
print("Sorted Dataset:")
print(df.sort_values(by='Title'))

# Print the total number of examples per category
total_per_category = df['Class'].value_counts()
print("\nTotal Examples per Category:")
print(total_per_category)

# Print the number of words per example
print("\nNumber of Words per Example:")
print(df[['Title', 'Num_Words']])

# Print the total number of words in the entire dataset
print(f"\nTotal Words in the Dataset: {total_words}")

Sorted Dataset:
         Title                                           Document   Class  \
0     sports_1  \n{{suggest.tag}}\nSearch for “”\n Stats highl...  sports   
1    sports_10  That this moment unfolded, on July 21, 2023, i...  sports   
2   sports_100  HH\nSS\nSixers won by 6 wickets (with 4 balls ...  sports   
3    sports_11  \n{{suggest.tag}}\nSearch for “”\n As many as ...  sports   
4    sports_12  \n      This material may not be published, br...  sports   
..         ...                                                ...     ...   
95   sports_95  Advertisement\nSupported by\nThe On Soccer New...  sports   
96   sports_96  \n{{suggest.tag}}\nSearch for “”\n The Indian ...  sports   
97   sports_97  Penn State defensive coordinator Manny Diaz ha...  sports   
98   sports_98  Saints quarterback Derek Carr received the OK ...  sports   
99   sports_99  \n      The Texas Rangers defeated the Houston...  sports   

    Num_Words  
0         643  
1        3157  
2         1

- This cell reads the text files in the "sports_text_files" folder, derives the title and class from each filename (for example, "sports_12.txt" gives the title "sports_12" and the class "sports"), and collects everything in a DataFrame (df).
- It then counts the words in each document, prints the dataset sorted by title, and reports the number of examples per category, the number of words per example, and the total number of words in the dataset.
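
The filename handling and word counting described above can be tried in isolation. The sketch below also shows why the table lists sports_1, sports_10, sports_100 before sports_11: sorted() compares filenames as strings. The helper names describe_file and numeric_index are illustrative assumptions, not part of the notebook.

```python
def describe_file(filename, document):
    """Return (title, category, word count) the same way the cell above does."""
    title = filename.split('.')[0]        # "sports_12.txt" -> "sports_12"
    category = filename.split('_')[0]     # "sports_12.txt" -> "sports"
    num_words = len(document.split())     # split on any whitespace
    return title, category, num_words


def numeric_index(filename):
    """Sort key that orders files by their number: "sports_12.txt" -> 12."""
    return int(filename.split('.')[0].split('_')[1])


names = ["sports_10.txt", "sports_2.txt", "sports_1.txt", "sports_100.txt"]

lexicographic = sorted(names)                 # string order, as in the notebook
natural = sorted(names, key=numeric_index)    # numeric order

print(describe_file("sports_12.txt", "one two  three\nfour"))
print(lexicographic)
print(natural)
```

The ordering does not affect the classifier, since the rows are shuffled when the data is split, but the numeric key makes the printed table easier to scan.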

**Creating a dataset that stores all the web-scraped articles related to Food**

In [None]:
text_files_folder = "food_text_files"

# Create an empty list to store data
data = []
total_words = 0  # Variable to store the total number of words

# Iterate through each file in the folder
for filename in sorted(os.listdir(text_files_folder)):
    file_path = os.path.join(text_files_folder, filename)

    # Read the content of the file
    with open(file_path, "r", encoding="utf-8") as file:
        document = file.read()

    # Extract information from the filename
    category = filename.split('_')[0]
    title = filename.split('.')[0]

    # Calculate the number of words in the document
    num_words = len(document.split())
    total_words += num_words

    # Append data to the list
    data.append([title, document, category, num_words])

# Create a DataFrame from the list
df = pd.DataFrame(data, columns=['Title', 'Document', 'Class', 'Num_Words'])

# Print the sorted dataset
print("Sorted Dataset:")
print(df.sort_values(by='Title'))

# Print the total number of examples per category
total_per_category = df['Class'].value_counts()
print("\nTotal Examples per Category:")
print(total_per_category)

# Print the number of words per example
print("\nNumber of Words per Example:")
print(df[['Title', 'Num_Words']])

# Print the total number of words in the entire dataset
print(f"\nTotal Words in the Dataset: {total_words}")

Sorted Dataset:
       Title                                           Document Class  \
0     food_1  To revisit this recipe, visit My Account, then...  food   
1    food_10  This Hong Kong-style milk tea skips the conden...  food   
2   food_100  Kourabiedes are one of the most popular Greek ...  food   
3    food_11  This Christmas cowboy cookie recipe from The H...  food   
4    food_12  Jammy fruit cobbler is usually relegated to su...  food   
..       ...                                                ...   ...   
95   food_95  Published by Lori Rasmussen · Updated Oct 10, ...  food   
96   food_96  Dinner, then Dessert\nTrending Now\nUltimate S...  food   
97   food_97  Dinner, then Dessert\nTrending Now\nUltimate S...  food   
98   food_98  Butternut Squash is best in fall, just underne...  food   
99   food_99  This recipe is a perfect combination of sweet ...  food   

    Num_Words  
0         719  
1         799  
2         253  
3         428  
4         161  
..        .

- Similarly, this cell reads the text files in the "food_text_files" folder, derives the title and class from each filename, and collects the food data in a DataFrame (df).
- As with the sports data, it counts the words in each document, prints the dataset sorted by title, and reports the number of examples per category, the number of words per example, and the total number of words in the dataset.

**Showing a sample text from the food dataset**

In [None]:
# Folders containing the text files
food_text_files_folder = "food_text_files"

def read_text_file(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()

# Display one file from food_text_files
food_filename = os.listdir(food_text_files_folder)[0]
food_file_path = os.path.join(food_text_files_folder, food_filename)
food_document = read_text_file(food_file_path)

print(f"\nContent of a file from food_text_files ({food_filename}):")
print(food_document)


Content of a file from food_text_files (food_79.txt):
Stop wondering what's for dinner! Checkout our Meal Plans
Healthy Eating  Made Easy
Stress Free Weekly Meal Plans for Busy Homes.
with nutritional information
 Stress Free Weekly Meal Plans for Busy Homes. 
 Easy Cabbage Fried Rice makes a delicious low carb side dish that tastes just like your favorite Chinese takeout. Jump to Recipe and Video 
Cabbage rice with all the traditional flavors of Chinese fried rice will be your new favorite healthy side dish. It comes together in less than 20 minutes, is full of veggies, and satisfies any craving for takeout.
Personally, I love cauliflower rice and especially cauliflower fried rice, but recently after hearing a friend boldly state that she couldn't eat one more bite of cauliflower anything (she has been Paleo for 2 years), I set out to create another veggie version of her favorite fried rice.
Putting our heads together we considered carrots, jicama, celeriac, and squash; but finally d

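Here we read one file and print it as an example of a scraped food recipe stored as text. Because os.listdir returns filenames in arbitrary order, the first file it returns happens to be food_79.txt rather than food_1.txt.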

**Showing a sample text from the sports dataset**

In [None]:
# Display one file from sports_text_files
sports_text_files_folder = "sports_text_files"

sports_filename = os.listdir(sports_text_files_folder)[0]
sports_file_path = os.path.join(sports_text_files_folder, sports_filename)
sports_document = read_text_file(sports_file_path)

print(f"Content of a file from sports_text_files ({sports_filename}):")
print(sports_document)

Content of a file from sports_text_files (sports_28.txt):
Advertisement
Supported by
The year was full of unlikely winners and exciting team competitions.
By Cindy Shmerler
There was no champagne courtside. So, as Matteo Berrettini embraced Jannik Sinner after Sinner’s victory over Alex de Minaur last month to clinch Italy’s first Davis Cup title in 47 years, their teammate, Matteo Arnaldi, did the next best thing: He shook a water bottle and poured it over Sinner and Berrettini.
Sinner, 22, ended the season with his 20th win in his last 23 matches. This year, he had a 64-15 record, won four tournaments, reached the semifinals at Wimbledon and was runner-up at the ATP Finals in Turin, Italy. He had wins over the three top-ranked players — Novak Djokovic, whom he beat twice in two weeks, Carlos Alcaraz and Daniil Medvedev. Starting 2023 at No. 15, he ended it at No. 4.
Djokovic sorely wanted to lead Serbia to just its second Davis Cup title. But in the semifinals, he fell to Sinner afte

Similarly, this cell prints one of the scraped sports articles as stored in its text file.

**Combining the food and sports datasets for text classification**

In [None]:
# Create an empty list to store data
data = []

# Iterate through each file in the food_text_files folder
for filename in sorted(os.listdir(food_text_files_folder)):
    file_path = os.path.join(food_text_files_folder, filename)
    document = read_text_file(file_path)
    category = filename.split('_')[0]
    title = filename.split('.')[0]
    data.append([title, document, category])

# Iterate through each file in the sports_text_files folder
for filename in sorted(os.listdir(sports_text_files_folder)):
    file_path = os.path.join(sports_text_files_folder, filename)
    document = read_text_file(file_path)
    category = filename.split('_')[0]
    title = filename.split('.')[0]
    data.append([title, document, category])

# Create a DataFrame from the list
combined_df = pd.DataFrame(data, columns=['Title', 'Document', 'Class'])

# Print the combined dataset
print("Combined Dataset:")
print(combined_df)


Combined Dataset:
         Title                                           Document   Class
0       food_1  To revisit this recipe, visit My Account, then...    food
1      food_10  This Hong Kong-style milk tea skips the conden...    food
2     food_100  Kourabiedes are one of the most popular Greek ...    food
3      food_11  This Christmas cowboy cookie recipe from The H...    food
4      food_12  Jammy fruit cobbler is usually relegated to su...    food
..         ...                                                ...     ...
195  sports_95  Advertisement\nSupported by\nThe On Soccer New...  sports
196  sports_96  \n{{suggest.tag}}\nSearch for “”\n The Indian ...  sports
197  sports_97  Penn State defensive coordinator Manny Diaz ha...  sports
198  sports_98  Saints quarterback Derek Carr received the OK ...  sports
199  sports_99  \n      The Texas Rangers defeated the Houston...  sports

[200 rows x 3 columns]


- This code combines text data from two folders, "food_text_files" and "sports_text_files," into a pandas DataFrame named combined_df.
- It iterates through each file in both folders, reads the content, extracts information from filenames, and appends the data (title, document, category) to the list.
- Finally, it creates a DataFrame using the accumulated data and prints the combined dataset.

### Split the dataset into training (at least 160 examples) and test (at least 40 examples) sets.

**Splitting dataset into training and test sets**

In [None]:
# Split the dataset into training (80%) and test (20%) sets
train_df, test_df = train_test_split(combined_df, test_size=0.2, random_state=42)

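# Check that the split meets the size requirements (at least 160 training and 40 test examples).
# Re-running the split with the same random_state would return the same result,
# so a failed check is reported instead of retried in a loop.
assert len(train_df) >= 160 and len(test_df) >= 40, "Dataset is too small for the required split sizes"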

# Print the shapes of the training and test sets
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)

Training set shape: (160, 3)
Test set shape: (40, 3)


- The goal here is to split the dataset into training and test sets with enough examples for both training and evaluation. The training set contains 160 examples, which are used to fine-tune the model, and the test set contains 40 examples, which are held out for evaluation. Because the split is random rather than stratified, the two classes are only approximately balanced within each set.
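
If exact class balance in both sets is wanted, the split can be stratified by class. In the notebook this would be a matter of passing stratify=combined_df['Class'] to train_test_split. The dependency-free sketch below only illustrates the idea; the function stratified_split and the toy label list are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split example indices so each class contributes the same test fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the same shape as the combined dataset: 100 per class
labels = ["food"] * 100 + ["sports"] * 100
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] == "food" for i in test_idx), "food examples in the test set")
```

With 100 examples per class and a 20% test fraction, each class contributes exactly 20 test examples, so accuracy on the test set is not skewed toward either class.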


### Fine tune a pretrained language model capable of generating text (e.g., GPT) that you can take from the Hugging Face Transformers library with the dataset you created (this tutorial could be very helpful: https://huggingface.co/docs/transformers/training). Report the test accuracy. Discuss what could be done to improve accuracy.

**Installing and importing required libraries**

In [None]:
!pip install accelerate==0.20.1
!pip install transformers[torch]
import torch
from torch.utils.data import DataLoader, Dataset
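from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of the PyTorch implementation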
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tqdm import tqdm

Collecting accelerate==0.20.1
  Downloading accelerate-0.20.1-py3-none-any.whl (227 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.5/227.5 kB 2.0 MB/s eta 0:00:00
Installing collected packages: accelerate
Successfully installed accelerate-0.20.1
Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265.7/265.7 kB 1.6 MB/s eta 0:00:00
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.20.1
    Uninstalling accelerate-0.20.1:
      Successfully uninstalled accelerate-0.20.1
Successfully installed accelerate-0.25.0


**Using a label encoder to convert string labels to numerical labels**

In [None]:
label_encoder = LabelEncoder()
train_df['Class'] = label_encoder.fit_transform(train_df['Class'])
test_df['Class'] = label_encoder.transform(test_df['Class'])

- The class labels are stored as strings, so we use LabelEncoder to convert them to the integer labels the model expects.
- LabelEncoder assigns a unique integer to each class in sorted order, so food = 0 and sports = 1.
- The encoder is fitted on the training labels and then applied to the test labels, so both sets share the same mapping.
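
The mapping that LabelEncoder produces can be reproduced in a few lines of plain Python, which makes the "sorted order" rule explicit. This is a sketch of the idea only; fit_label_mapping and the short label list are hypothetical and not part of the notebook.

```python
def fit_label_mapping(labels):
    """Map each distinct class name to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


# Toy labels; the order of appearance does not influence the mapping
train_labels_text = ["sports", "food", "food", "sports"]

mapping = fit_label_mapping(train_labels_text)
encoded = [mapping[name] for name in train_labels_text]

# The inverse mapping plays the role of label_encoder.inverse_transform
inverse = {index: name for name, index in mapping.items()}
decoded = [inverse[index] for index in encoded]

print(mapping)
print(encoded)
print(decoded)
```

Even though "sports" appears first in the list, "food" still receives 0 because the class names are sorted before numbering.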

**Loading pre-trained BERT model and tokenizer for sequence classification**


In [None]:
model_name = 'bert-base-uncased'
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

- For the text classification task, we use the pre-trained "bert-base-uncased" model, the base-size BERT model pre-trained on lowercased English text. The warning above is expected: the classification layer ('classifier.weight', 'classifier.bias') is newly initialized and is learned during fine-tuning.
- We also load BERT's matching tokenizer, which splits input text into tokens and converts them to the input IDs the model expects.

**Tokenizing the training and test sets**


In [None]:
train_encodings = tokenizer(list(train_df['Document']), truncation=True, padding=True, return_tensors='pt')
test_encodings = tokenizer(list(test_df['Document']), truncation=True, padding=True, return_tensors='pt')

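- Here, we prepare the input data for the BERT model by tokenizing the documents. With truncation=True, each document is cut to the model's maximum input length (512 tokens for bert-base-uncased), and with padding=True, shorter documents are padded to the longest sequence in the set.
- The resulting train_encodings and test_encodings hold the encoded sequences (including the input IDs and attention masks) as PyTorch tensors, ready to be fed to the BERT model for training and evaluation.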
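
Truncation and padding are easier to follow on a toy example. The sketch below works on ready-made lists of token IDs and is a simplification: the real tokenizer also adds the special [CLS] and [SEP] tokens and keeps them when truncating. The function pad_and_truncate, the pad ID of 0, and the toy batch are assumptions for illustration.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=0):
    """Cut every sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Three toy "documents" of different lengths, limited to 4 tokens each
batch = [[5, 6, 7], [8], [1, 2, 3, 4, 5, 6]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=4)

print(input_ids)
print(attention_mask)
```

One consequence for this dataset: several articles are far longer than 512 tokens (one sports document has 3157 words), so the model only sees the beginning of those articles.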

**Converting labels to tensors**


In [None]:
train_labels = torch.tensor(train_df['Class'].values)
test_labels = torch.tensor(test_df['Class'].values)

- Here, the numeric class labels are converted to PyTorch tensors; the encoded text sequences are already tensors because the tokenizer was called with return_tensors='pt'.
- These label tensors serve as the targets during training and as the ground truth when computing the test accuracy.

**Defining and creating a custom dataset**


In [None]:
class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
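        # The encodings and labels are already tensors, so index them directly
        # (wrapping them in torch.tensor again triggers a UserWarning)
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]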
        return item

train_dataset = CustomDataset(train_encodings, train_labels)
test_dataset = CustomDataset(test_encodings, test_labels)

- By organizing the data into a custom dataset, it becomes compatible with PyTorch's DataLoader, facilitating efficient batch processing during model training and evaluation.

**Setting up training parameters**

In [None]:
optimizer = AdamW(model.parameters(), lr=1e-5)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)



- Here we set up the optimizer, the device, and the DataLoader needed to train the BERT model on the training data.
- The DataLoader yields shuffled batches of 4 examples during training, and the AdamW optimizer updates the model's parameters from the computed gradients with a learning rate of 1e-5.

**Fine-tuning and training the model**

In [None]:
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch in tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs}'):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

Epoch 1/3: 100%|██████████| 40/40 [17:34<00:00, 26.35s/it]
Epoch 2/3: 100%|██████████| 40/40 [17:02<00:00, 25.57s/it]
Epoch 3/3: 100%|██████████| 40/40 [17:37<00:00, 26.43s/it]


- The training loop fine-tunes the BERT model with the standard steps: iterate over batches, reset the gradients, move the batch to the selected device, run the forward pass, read the loss, backpropagate, and take an optimizer step.
- The loop repeats for 3 epochs. In each epoch the model sees every training batch once and updates its parameters to reduce the classification loss.
- Each epoch took roughly 17 minutes to complete, at about 26 seconds per batch of 4 documents.
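
The numbers in the progress bars follow directly from the dataset size and the batch size. The short calculation below reproduces them; the per-batch time is copied from the "Epoch 1/3" progress bar above, and the variable names are chosen for this example.

```python
import math

train_examples = 160
test_examples = 40
batch_size = 4

# Number of iterations shown in the progress bars (40/40 and 10/10)
train_batches = math.ceil(train_examples / batch_size)
test_batches = math.ceil(test_examples / batch_size)

# Duration of one epoch from the reported time per batch
seconds_per_batch = 26.35
epoch_seconds = train_batches * seconds_per_batch
minutes, seconds = divmod(round(epoch_seconds), 60)

print(f"{train_batches} training batches, {test_batches} test batches")
print(f"one epoch takes about {minutes} min {seconds} s")
```

The result matches the 17:34 reported for the first epoch, so total training time scales with the number of batches and the time per batch.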

**Evaluating on the test set**


In [None]:
model.eval()
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)
all_predictions = []

with torch.no_grad():
    for batch in tqdm(test_loader, desc='Evaluating'):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs.logits, dim=1).cpu().numpy()
        all_predictions.extend(predictions)

Evaluating: 100%|██████████| 10/10 [01:12<00:00,  7.24s/it]


- We ran the fine-tuned BERT model on the test set with gradient tracking disabled.
- For each test example, the predicted class is the index of the largest logit, and the predictions are collected in all_predictions.


**Reporting test accuracy**


In [None]:
test_true_labels = test_labels.cpu().numpy()
test_accuracy = accuracy_score(test_true_labels, all_predictions)
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.975


- We calculated the model accuracy by comparing the true class labels with the predicted labels.
- The model achieved an accuracy of 97.5% on the test dataset.
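
On a test set of 40 examples, an accuracy of 0.975 corresponds to 39 correct predictions and a single mistake. The sketch below recomputes that figure from made-up labels (the notebook's actual predictions are not reproduced here); accuracy is a plain-Python equivalent of the accuracy_score call, and the label lists are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: 40 test examples with exactly one wrong prediction
true_labels = [0] * 20 + [1] * 20
predicted_labels = list(true_labels)
predicted_labels[5] = 1   # one food article predicted as sports

score = accuracy(true_labels, predicted_labels)
step = 1 / len(true_labels)   # change in accuracy caused by a single example

print(f"accuracy: {score}")
print(f"each test example is worth {step:.3f} accuracy")
```

Because each test example moves the score by 2.5 percentage points, small differences in accuracy on this test set should be interpreted with caution.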


**Predicting texts into classes using some examples**

In [None]:
def predict_class(text):
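    # Tokenize the input text and move the tensors to the same device as the model
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {key: val.to(device) for key, val in inputs.items()}

    # Make the prediction
    with torch.no_grad():
        outputs = model(**inputs)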

    # Get the predicted class
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

    # Decode the predicted class using the label encoder
    decoded_class = label_encoder.inverse_transform([predicted_class])[0]

    return decoded_class

In [None]:
# Example text for classifying food text
example_text = "The kitchen was alive with the sizzle of garlic in olive oil and the aroma of herbs dancing in the air. A symphony of flavors unfolded as the chef crafted a masterpiece, blending fresh ingredients into a culinary delight."

# Predict the class
predicted_class = predict_class(example_text)

# Print the result
print(f"Predicted Class: {predicted_class}")


Predicted Class: food


In [None]:
# Example text for classifying sports text
example_text = "The stadium echoed with the cheers of fans as the athletes sprinted down the track, each step bringing them closer to the finish line. The intensity of the game heightened, showcasing the sheer determination and skill that define the world of competitive sports."

# Predict the class
predicted_class = predict_class(example_text)

# Print the result
print(f"Predicted Class: {predicted_class}")


Predicted Class: sports


- The predict_class function tokenizes an input text, runs it through the fine-tuned BERT model, and maps the predicted class index back to its label name.
- Both example texts are classified correctly as 'food' and 'sports'. Two hand-written examples are only a sanity check; the test accuracy above is the more reliable measure.
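
predict_class returns only the most likely label. Applying a softmax to the logits turns them into probabilities, which shows how confident the model is; in the notebook this could be done with torch.softmax(outputs.logits, dim=1). The plain-Python sketch below uses invented logit values, so the printed numbers say nothing about the actual model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["food", "sports"]      # same order as the label encoder
hypothetical_logits = [2.0, -1.0]     # invented values for illustration

probabilities = softmax(hypothetical_logits)
best = max(range(len(probabilities)), key=probabilities.__getitem__)

for name, prob in zip(class_names, probabilities):
    print(f"{name}: {prob:.3f}")
print(f"Predicted Class: {class_names[best]}")
```

A probability close to 0.5 would flag a borderline text, which the label alone cannot reveal.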

**Discussion on improving accuracy of BERT Model**

The following steps can be used to further improve the test accuracy:

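1. We could use a learning rate scheduler from the torch.optim.lr_scheduler module to adjust the learning rate during training, for example a warmup phase followed by a linear decay. Lowering the rate as training progresses often makes fine-tuning more stable than keeping it fixed at 1e-5.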

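2. We could implement gradient clipping to prevent exploding gradients. In PyTorch this is done by calling torch.nn.utils.clip_grad_norm_ on the model's parameters between loss.backward() and optimizer.step(), rather than through an optimizer setting.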

3. We could tune the batch size and train for more epochs until the loss converges. However, training for more epochs increases the risk of overfitting the training data.

4. We could tune regularization, such as the dropout rate BERT already applies or the optimizer's weight decay, to reduce overfitting. Separately, data augmentation techniques such as adding noise, paraphrasing, or back-translation can enlarge the 160-example training set.

5. We can try different transformer-based architecture models like DistilBERT, RoBERTa, or XLNet that can perform better for certain tasks.
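
The first two suggestions can be illustrated without any deep learning library. The sketch below computes a warmup-then-linear-decay learning rate and clips a gradient vector by its global L2 norm. It is a simplified illustration under stated assumptions: the function names, the warmup length of 12 steps, and the toy gradient are invented here. In PyTorch the corresponding tools are torch.optim.lr_scheduler.LambdaLR and torch.nn.utils.clip_grad_norm_.

```python
import math


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier for the base learning rate at a given training step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # ramp up from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


def clip_by_global_norm(gradients, max_norm):
    """Scale gradients down so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients)
    scale = max_norm / total_norm
    return [g * scale for g in gradients]


base_lr = 1e-5
total_steps = 3 * 40      # 3 epochs x 40 batches, as in the training run above
warmup_steps = 12         # assumed: 10% of all steps


def lr_at(step):
    return base_lr * linear_warmup_decay(step, warmup_steps, total_steps)


for step in (0, 12, 66, 120):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")

# A toy gradient with norm 5 is rescaled to norm 1
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
```

The schedule starts at zero, reaches the base rate of 1e-5 after the warmup, and falls back to zero at the last step; clipping leaves the gradient direction unchanged and only limits its length.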