# Problem Statement: **AtliQ's Social Media Monitoring for COVID-19 Insights using BERT**

### AtliQ is launching a social media monitoring system to identify tweets related to COVID-19, aiming to track trends, misinformation, and public sentiment. Your task is to fine-tune a pre-trained BERT model using PyTorch to classify whether a given tweet is about COVID-19 or not.

**References:**

1. **Attention Is All You Need:** [Click Here](https://arxiv.org/abs/1706.03762)

2. **BERT:** [Click Here](https://arxiv.org/abs/1810.04805)





---



Imports and CUDA

In [None]:
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, TensorDataset, random_split
from torchvision import datasets, transforms
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
import random
from tqdm.notebook import tqdm
from transformers import BertForSequenceClassification
from transformers import AutoTokenizer

# Check if CUDA (GPU) is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


**Step1:** Dataset Overview

* We use the **covid_twitter_dataset_codebasics_DL** dataset, which has 5,287 rows.
* Columns: ID, text and target
* The feature column (text) contains the tweet content, and the target column (target) contains the binary labels indicating whether a tweet is COVID-19-related (**1**) or not (**0**).
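
As a rough sketch of this step, the snippet below reads a CSV with `pandas.read_csv` and checks that the three documented columns are present. The dataset's file name and location are not given here, so the example reads a tiny in-memory stand-in with made-up rows. `load_tweets` and `SAMPLE_CSV` are hypothetical names used only for illustration.

```python
import io
import pandas as pd

# Made-up stand-in rows with the documented columns (ID, text, target).
# The real file name and location are not given here, so pass your own path.
SAMPLE_CSV = io.StringIO(
    "ID,text,target\n"
    "1,Hospitals report more covid19 cases this week,1\n"
    "2,Lovely sunset at the beach today,0\n"
)

def load_tweets(source):
    """Read the tweets and check that the expected columns are present."""
    frame = pd.read_csv(source)
    missing = {"ID", "text", "target"} - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return frame

sample = load_tweets(SAMPLE_CSV)
print(sample.shape)                     # (rows, columns)
print(sample["target"].value_counts())  # class balance
```

Checking `value_counts()` on the target column early is a cheap way to see whether the two classes are balanced before splitting.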


In [None]:
data = # code here
data.head()

In [None]:
data.info()



---



**Step2:** Split the dataset

* Train:Test :: 70:30
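
One possible way to get the 70:30 split is `train_test_split` with `test_size=0.3`. The sketch below runs on made-up stand-ins for the text and target columns; `random_state=42` and `stratify` are assumptions added for reproducibility and class balance, not requirements of the task.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for data["text"] and data["target"] (made-up values).
texts = [f"tweet {i}" for i in range(10)]
labels = [0, 1] * 5

# test_size=0.3 gives the 70:30 split; random_state and stratify are
# optional choices assumed here for reproducibility and class balance.
train_X, test_X, train_Y, test_Y = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
print(len(train_X), len(test_X))
```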



---



In [None]:
train_X, test_X, train_Y, test_Y = # code here

**Step3:** Tokenization

* The AutoTokenizer from the Hugging Face library is used to load the pre-trained BERT tokenizer (bert-base-cased).


In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
train_tokens = tokenizer(list(train_X), padding=True, truncation=True)
test_tokens = # code here

In [None]:
train_tokens.keys()

In [None]:
print(train_tokens['input_ids'][0])
print(tokenizer.decode(train_tokens['input_ids'][0]))
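
To make the effect of `padding=True` and `truncation=True` concrete, here is a simplified pure-Python illustration of what happens to a batch of token-id lists. It is not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, the ids are arbitrary, and real truncation also takes care of the special tokens.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) token-id lists to one shared length."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

This is why `train_tokens` holds both `input_ids` and `attention_mask`: every row has the same length, and the mask records which positions are padding.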



---



**Step4:** Create a dataset class
* Implement a custom `TokenData` class inheriting from `torch.utils.data.Dataset`.
* Use the `train` argument to toggle between the training data (`train_tokens`, `train_Y`) and the testing data (`test_tokens`, `test_Y`).
* Define `__len__` to return the dataset length and `__getitem__` to provide the tokenized inputs (`input_ids`, `attention_mask`) and the labels (`labels`) as tensors.
* Ensure the class works correctly by retrieving a sample and verifying the output format.
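
The points above can be sketched on toy data as follows. `ToyTokenData` is a hypothetical variant that takes its tokens and labels as arguments instead of reading the notebook's globals, so it runs on its own; the token ids are made up.

```python
import torch
from torch.utils.data import Dataset

class ToyTokenData(Dataset):
    """Same layout as TokenData, but the tokens and labels are passed in."""

    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = list(labels)

    def __len__(self):
        # One sample per label.
        return len(self.labels)

    def __getitem__(self, idx):
        sample = {k: torch.tensor(v[idx]) for k, v in self.tokens.items()}
        sample["labels"] = torch.tensor(self.labels[idx])
        return sample

toy_tokens = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_dataset = ToyTokenData(toy_tokens, labels=[1, 0])
first = toy_dataset[0]
print(len(toy_dataset), sorted(first.keys()))
print(first["input_ids"].shape, first["labels"].dtype)
```

Retrieving one sample and printing its keys, shapes, and dtypes is usually enough to confirm the format before building the loaders.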

In [None]:
class TokenData(Dataset):
    def __init__(self, train = False):
        if train:
            self.text_data = train_X
            self.tokens = train_tokens
            self.labels = list(train_Y)
        else:
            self.text_data = test_X
            self.tokens = test_tokens
            self.labels = list(test_Y)

    def __len__(self):
        # code here

    def __getitem__(self, idx):
        sample = {}
        for k, v in self.tokens.items():
            sample[k] = torch.tensor(v[idx])
        sample['labels'] = torch.tensor(self.labels[idx])
        return sample



---



**Step5:** Create Dataloaders

* Instantiate `TokenData` twice to create `train_dataset` (with `train=True`) and `test_dataset`.
* Set `batch_size=30` and enable shuffling for the training loader so that each epoch sees the samples in a different order.
* Verify that `train_loader` and `test_loader` correctly return batches of tokenized inputs and labels.
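
A minimal sketch of the loader setup, assuming a `TensorDataset` of made-up tensors in place of `TokenData` so that it runs without the tokenized tweets. Only the training loader is shuffled.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 70 made-up samples of 8 token ids each, with alternating 0/1 labels.
features = torch.zeros(70, 8, dtype=torch.long)
targets = torch.arange(70) % 2
toy_dataset = TensorDataset(features, targets)

batch_size = 30
train_loader = DataLoader(toy_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(toy_dataset, batch_size=batch_size, shuffle=False)

# The last batch is smaller when the dataset size is not a multiple of 30.
batch_shapes = [tuple(x.shape) for x, _ in train_loader]
print(len(train_loader), batch_shapes)
```

With a dictionary-style dataset such as `TokenData`, each batch is a dictionary of stacked tensors instead of a tuple, but the batch sizes follow the same rule.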

In [None]:
batch_size = 30
train_dataset = # code here
test_dataset = # code here

train_loader = # code here
test_loader = # code here



---



**Step6:** Define Model
* Use `BertForSequenceClassification` with `bert-base-cased` to load a pre-trained BERT model with a sequence classification head.
* Use the `AdamW` optimizer with a learning rate of `1e-5` to fine-tune the model parameters. This optimizer is commonly used for transformer-based models.
* Use `CrossEntropyLoss` to compute the loss. The model outputs one logit per class, so this binary task is handled as a two-class classification problem.
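
The optimizer and loss setup can be tried out without downloading BERT. In the sketch below a small `nn.Linear` layer stands in for the model (an assumption made purely so the example runs offline), and the loss is evaluated on made-up logits.

```python
import math
import torch
import torch.nn as nn

# A small linear layer stands in for BERT's two-logit classification head.
stand_in_model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(stand_in_model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Equal logits mean the model is undecided, so the loss is ln(2).
logits = torch.tensor([[0.0, 0.0], [0.0, 0.0]])
labels = torch.tensor([0, 1])
undecided_loss = loss_fn(logits, labels).item()
print(round(undecided_loss, 4), round(math.log(2), 4))
```

A training loss near 0.69 therefore means the classifier is doing no better than guessing. For the real model, `BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)` is one way to load it; `num_labels=2` matches the two classes here.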

In [None]:
bert_model =  # code here
optimizer = # code here
loss_fn = # code here



---



* Set `num_epochs = 3`.
* Use `bert_model.to(device)` to move the model to the GPU if one is available; otherwise it stays on the CPU. Fine-tuning a model as large as BERT is much faster on a GPU.

In [None]:
num_epochs = 3
bert_model.to(device) # Transfer model to GPU if available



---



**Step7:** Model Training

In [None]:
for epoch in range(num_epochs):
    bert_model.train()
    total_train_loss = 0.0  # Initialize total loss for each epoch

    for i, batch in enumerate(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}

        # Set gradients to zero
        optimizer.zero_grad()

        # Pass data to the model
        outputs = bert_model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])

        # Get logits and calculate the loss
        pred = outputs.logits
        loss = # code here

        # Backpropagation
        loss.backward()

        # Optimizing model parameters
        optimizer.step()

        # Accumulate the loss for the epoch
        total_train_loss += # code here

    # Calculate and log the average loss for the epoch
    avg_train_loss = # code here
    print(f"Epoch {epoch + 1}/{num_epochs} - Training Loss: {avg_train_loss:.4f}")
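
The same loop structure can be tried on a toy problem. This is a sketch under stated assumptions: a linear layer stands in for `bert_model`, the data is random, and the learning rate is raised to `1e-2` only because the stand-in is tiny. It shows one way to compute the loss, accumulate it, and average it per epoch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Made-up features whose label depends on the sign of the first column.
features = torch.randn(60, 4)
targets = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, targets), batch_size=30, shuffle=True)

model = nn.Linear(4, 2)  # stand-in for bert_model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(3):
    model.train()
    total_train_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)      # logits vs. integer class labels
        loss.backward()
        optimizer.step()
        total_train_loss += loss.item()  # .item() converts to a Python float
    epoch_losses.append(total_train_loss / len(loader))  # mean over batches
print(epoch_losses)
```

Dividing by `len(loader)` gives the mean of the per-batch losses, which is the quantity the training cell prints as the epoch's training loss.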




---



**Step8:** Model Evaluation

In [None]:
# Set model to evaluation mode
bert_model.eval()

# Variables for tracking accuracy and loss
correct = 0
total = 0
total_test_loss = 0.0
all_preds = []
all_labels = []

# Disable gradient computation during testing
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = bert_model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])

        # Get logits
        logits = outputs.logits

        # Calculate loss
        loss = # code here
        total_test_loss += # code here

        # Get predictions
        preds = logits.argmax(dim=1)

        # Update the correct count and total number of samples
        correct += (preds == batch['labels']).sum().item()
        total += batch['labels'].size(0)

        # Store predictions and labels for confusion matrix
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch['labels'].cpu().numpy())

# Calculate the average test loss
avg_test_loss = # code here

# Calculate overall accuracy
accuracy = # code here

# Print overall results
print(f"Test Set - Loss: {avg_test_loss:.4f}, Accuracy: {accuracy:.2f}%")

**BONUS**: Visualize the Confusion Matrix
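
A minimal sketch of the metrics, using made-up labels and predictions in place of `all_labels` and `all_preds`. It computes the accuracy as a percentage, matching the `%` in the print statement above, and builds the confusion matrix with scikit-learn.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions standing in for all_labels and all_preds.
toy_labels = [0, 0, 1, 1, 1]
toy_preds = [0, 1, 1, 1, 0]

accuracy = accuracy_score(toy_labels, toy_preds) * 100  # percentage
matrix = confusion_matrix(toy_labels, toy_preds)        # rows: true, columns: predicted
print(f"Accuracy: {accuracy:.2f}%")
print(matrix)
```

To draw the matrix, `ConfusionMatrixDisplay(matrix).plot()` from `sklearn.metrics` followed by `plt.show()` is one option.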



---



**Step9:** Inference with new Tweets

* Run this inference code to predict whether each tweet in `test_tweets` is COVID-related.

In [None]:
# Test tweets
test_tweets = [
    "The number of COVID-19 cases has increased in the past week.",
    "Just watched an amazing sunset at the beach!",
    "New vaccination centers are opening up to combat the spread of coronavirus.",
    "A virus has impacted the global ecosystem a lot",
    "India lost the match against Australia at Adelaide"
]

# Predict function for a single tweet
def predict_tweet(tweet_text):
    # Tokenize
    tokens = tokenizer(tweet_text, padding=True, truncation=True, return_tensors='pt')
    tokens = {k: v.to(device) for k, v in tokens.items()}

    # Predict
    bert_model.eval()
    with torch.no_grad():
        outputs = bert_model(**tokens)
        probs = torch.softmax(outputs.logits, dim=1)
        prediction = torch.argmax(probs, dim=1)

    return {
        'text': tweet_text,
        'is_covid': bool(prediction.item()),
        'confidence': probs[0][prediction[0]].item() * 100
    }

# Make predictions for test tweets
print("Sample Tweet Predictions:\n")
for tweet in test_tweets:
    result = predict_tweet(tweet)
    print(f"Tweet: {result['text']}")
    print(f"Prediction: {'COVID-related' if result['is_covid'] else 'Not COVID-related'}")
    print(f"Confidence: {result['confidence']:.2f}%")
    print("-" * 50 + "\n")
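
To see how the confidence value in `predict_tweet` is derived, the sketch below applies the same softmax and argmax steps to made-up logits for a single tweet, with no model involved.

```python
import torch

# Made-up logits for one tweet: index 0 = not COVID, index 1 = COVID.
logits = torch.tensor([[1.0, 3.0]])

probs = torch.softmax(logits, dim=1)     # each row sums to 1
prediction = torch.argmax(probs, dim=1)  # index of the larger probability
confidence = probs[0][prediction[0]].item() * 100
print(bool(prediction.item()), round(confidence, 2))
```

The confidence is simply the softmax probability of the predicted class, so with two classes it never falls below 50%.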