# Homework 9: BERT Sentiment Analysis

*Exercise*: Build your own BERT classifier network and try to achieve the highest possible accuracy on the Twitter dataset (check the TODO items). See also the [Sentiment_Bert](https://colab.research.google.com/github/serivan/DeepLearning/blob/master/10-AttentionMechanisms/pytorch/Sentiment_Bert.ipynb) notebook.

You can also use grid search, randomized search, or Optuna to choose the hyperparameters.

# **Twitter Sentiment Analysis**

*My ridiculous dog is amazing. [sentiment: positive]*

With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will lift a company's or a person's brand by going viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in these times, when decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description? In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment.

Help build your skills in this important area with this broad dataset of tweets. Work on your technique to grab a top spot in this competition. What words in tweets support a positive, negative, or neutral sentiment? How can you help make that determination using machine learning tools?


Useful links:


*   [Dataset](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset)
*   [Competition](https://www.kaggle.com/competitions/tweet-sentiment-extraction/overview)



Import the dataset and install libraries

In [None]:
import torch

# if present, use the gpu as a device

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


In [None]:
%%capture

!pip install transformers
!wget https://raw.githubusercontent.com/serivan/DeepLearning/master/Datasets/Tweets.csv

# **Dataset preprocessing**

In [None]:
import pandas as pd

df = pd.read_csv('Tweets.csv')

df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

df['labels'] = [2 if v == "positive" else 0 if v == "negative" else 1 for v in df.sentiment.tolist()]
df.drop(["textID", "selected_text", "sentiment"], axis=1, inplace=True)

df

Unnamed: 0,text,labels
0,"I`d have responded, if I were going",1
1,Sooo SAD I will miss you here in San Diego!!!,0
2,my boss is bullying me...,0
3,what interview! leave me alone,0
4,"Sons of ****, why couldn`t they put them on t...",0
...,...,...
27476,wish we could come see u on Denver husband l...,0
27477,I`ve wondered about rake to. The client has ...,0
27478,Yay good for both of you. Enjoy the break - y...,2
27479,But it was worth it ****.,2


# Tokenizer and data processing

In [None]:
from transformers import BertTokenizer

# Create a function to tokenize a set of texts
def preprocessing_for_bert(data, max_len):
    input_ids = []
    attention_masks = []

    for sent in data:
        encoded_sent = tokenizer(sent,
                                 padding='max_length',  
                                 truncation=True,       
                                 max_length=max_len) 
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks


#TODO: Load the BERT tokenizer
tokenizer =

In [None]:
from sklearn.model_selection import train_test_split

X = df.text.tolist()
y = df.labels.tolist()

# TODO: Split the data in train, validation and test set

X_train, X_test, y_train, y_test = 
X_train, X_val, y_train, y_val = 
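
One possible way to fill in the split above is to call `train_test_split` twice: first to hold out a test set, then to carve a validation set out of what remains. The sketch below runs on toy data; the 70/10/20 proportions, the `random_state`, and the `_d` variable names are assumptions made for illustration, not values prescribed by the exercise. Passing `stratify` keeps the class proportions similar across the three sets.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's X (texts) and y (labels 0/1/2)
X_demo = [f"tweet {i}" for i in range(100)]
y_demo = [i % 3 for i in range(100)]

# Step 1: hold out 20% of the data as the test set
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Step 2: take 12.5% of the remaining 80% (10% of the total) for validation
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(
    X_train_d, y_train_d, test_size=0.125, random_state=42, stratify=y_train_d
)

print(len(X_train_d), len(X_val_d), len(X_test_d))
```
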

In [None]:
#TODO: Compute a max_len (different strategies are possible...)
max_len = 

# Run function `preprocessing_for_bert` on the train set, the validation set and the test set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train, max_len)
val_inputs, val_masks = preprocessing_for_bert(X_val,  max_len)
test_inputs, test_masks = preprocessing_for_bert(X_test, max_len)
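
One of the possible strategies for `max_len` is to pick the shortest length that still covers most of the training tweets, instead of the single longest one. The sketch below is a minimal illustration: `choose_max_len` is a hypothetical helper, the 95% coverage is an arbitrary choice, and whitespace splitting plus two special tokens is only a stand-in for the real tokenizer (in the notebook the lengths would come from something like `len(tokenizer.encode(text))`).

```python
import math

def choose_max_len(token_lengths, coverage=0.95, hard_cap=512):
    """Smallest length that fits a `coverage` fraction of the samples,
    capped at `hard_cap` (512 is the position limit of the standard BERT models)."""
    ordered = sorted(token_lengths)
    # Nearest-rank percentile: 1-based rank of the sample at the requested coverage
    rank = max(1, math.ceil(coverage * len(ordered)))
    return min(ordered[rank - 1], hard_cap)

# Assumption: whitespace tokens + 2 special tokens ([CLS], [SEP]) approximate
# the length that the real WordPiece tokenizer would produce
demo_texts = [
    "my boss is bullying me...",
    "what interview! leave me alone",
    "Sooo SAD I will miss you here in San Diego!!!",
]
demo_lengths = [len(text.split()) + 2 for text in demo_texts]
demo_max_len = choose_max_len(demo_lengths, coverage=0.95)
print(demo_max_len)
```

Tweets longer than the chosen value are truncated by `preprocessing_for_bert`, so a lower coverage trades a little information for less padding and faster training.
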

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)
test_labels = torch.tensor(y_test)

# For fine-tuning BERT, a batch size of 16 or 32 is recommended.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

# Create the DataLoader for our test set
test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=1)

# Define the model

\#TODO Set the hyperparameters manually or using **grid search**
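
A grid search can be sketched in a few lines of plain Python: enumerate every combination of candidate values and keep the one with the best validation score. Everything named below is hypothetical (the `grid_search` helper, the search space, and especially `placeholder_objective`, which is a made-up formula standing in for "train a model with these parameters and return its validation accuracy").

```python
import itertools

def grid_search(param_grid, objective):
    """Evaluate every combination in `param_grid`; return the best one and its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:  # strict comparison: the first of several ties wins
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space, chosen only for illustration
param_grid = {"lr": [1e-5, 2e-5, 5e-5], "dropout": [0.1, 0.3], "batch_size": [16, 32]}

def placeholder_objective(params):
    # Placeholder only: a real objective would train the classifier with `params`
    # and return the validation accuracy measured by `evaluate`.
    return -abs(params["lr"] - 2e-5) * 1e5 - params["dropout"]

best_params, best_score = grid_search(param_grid, placeholder_objective)
print(best_params, best_score)
```

Since each evaluation means fine-tuning BERT, the grid above would already cost twelve training runs; randomized search or Optuna explore the same kind of space with a fixed budget of trials.
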

In [None]:
import torch
import torch.nn as nn
from transformers import BertModel

# Create the BertClassifier class
class BertClassifier(nn.Module):
    def __init__(self, freeze_bert=False):
        super(BertClassifier, self).__init__()

        #TODO: define the model
        pass
        
    def forward(self, input_ids, attention_mask):

        # TODO: compute the forward pass

        pass
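
As a starting point for the TODOs above, one common design is a BERT encoder followed by a small feed-forward head applied to the `[CLS]` token representation. The sketch below is one possible layout, not the expected solution: the class name `SketchBertClassifier`, the head size, and the dropout value are assumptions. To run without downloading weights it receives the encoder as an argument and is demonstrated on a tiny randomly initialised BERT; in the notebook the encoder would more likely be created inside `__init__`, for example with `BertModel.from_pretrained('bert-base-uncased')`.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class SketchBertClassifier(nn.Module):
    """BERT encoder followed by a small feed-forward classification head."""

    def __init__(self, bert, n_classes=3, hidden=50, dropout=0.1, freeze_bert=False):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Sequential(
            nn.Linear(bert.config.hidden_size, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, n_classes),
        )
        # Optionally train only the head and keep the encoder weights fixed
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # outputs[0] is the last hidden state; position 0 holds the [CLS] token
        cls_embedding = outputs[0][:, 0, :]
        return self.classifier(cls_embedding)

# Tiny random BERT (assumption: used only so the sketch runs offline)
tiny_config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                         num_attention_heads=2, intermediate_size=64)
demo_model = SketchBertClassifier(BertModel(tiny_config))
demo_model.eval()

demo_ids = torch.randint(0, 100, (4, 8))         # batch of 4 sequences, 8 tokens each
demo_mask = torch.ones(4, 8, dtype=torch.long)   # no padding in this toy batch
with torch.no_grad():
    demo_logits = demo_model(demo_ids, demo_mask)
print(demo_logits.shape)
```

The forward pass returns one row of three logits per tweet, which is the shape that `evaluate` and `train` expect when they call `torch.argmax(logits, dim=1)`.
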

In [None]:
import random
import time
import numpy as np
from tqdm import tqdm

#Training procedure

#TODO: Specify loss function
loss_fn = 
scaler = torch.cuda.amp.GradScaler()


def set_seed(seed_value=42):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

#Used in train
def evaluate(model, val_dataloader):
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy


def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False, accumulation=16):
    print("Start training...\n")

    #list to be returned
    train_losses=[]
    train_accs=[]
    val_losses=[]
    val_accs=[]

    #For the first step of evaluation
    eval_loss=10000
    best_model=None

    for epoch_i in range(epochs):
        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0
        acc = 0
        total = 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(tqdm(train_dataloader)):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)


            with torch.cuda.amp.autocast():
                # Perform a forward pass. This will return logits.
                logits = model(b_input_ids, b_attn_mask)
                # Compute the loss and scale it for gradient accumulation
                loss = loss_fn(logits, b_labels)
                scaled_loss = loss / accumulation

            # Track the unscaled loss so it is comparable with the validation loss
            batch_loss += loss.item()
            total_loss += loss.item()

            # Accumulate the training accuracy over every batch of the epoch
            predictions = torch.argmax(logits, dim=1)
            acc += (predictions == b_labels).sum().item()
            total += len(b_labels)

            # Perform a backward pass to calculate gradients
            scaler.scale(scaled_loss).backward()

            # Accumulation step. It is fundamental when training BERT on Colab GPUs: it helps avoid CUDA out-of-memory errors
            if (step + 1) % accumulation == 0 or step + 1 == len(train_dataloader):
                scaler.step(optimizer)
                scaler.update()
                # Reset the gradients only after the optimizer step, so they accumulate across batches
                model.zero_grad()


        # Print training results
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9}")
        print("-"*70)

        # Reset batch tracking variables
        batch_loss, batch_counts = 0, 0

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)
        train_losses.append(avg_train_loss)

        # Training accuracy over the whole epoch
        epoch_acc = acc*100/total
        train_accs.append(epoch_acc)
        print("-"*70)


        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.

            val_loss, val_accuracy = evaluate(model, val_dataloader)
            val_losses.append(val_loss)
            val_accs.append(val_accuracy)
            if val_loss <= eval_loss:
                import copy  # needed for the snapshot below
                eval_loss = val_loss
                # Snapshot the weights: `best_model = model` would only alias the model that keeps training
                best_model = copy.deepcopy(model)

            # Print performance over the entire training data            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f}")
            print("-"*70)
        print("\n")

    
    print("Training complete!")
    return train_losses, train_accs, val_losses, val_accs, best_model
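
The gradient accumulation used in `train` relies on a simple identity: if each micro-batch loss is divided by the number of accumulated batches and the gradients are summed before a single optimizer step, the result matches the gradient of one large batch. This holds only if the gradients are not cleared between micro-batches. The sketch below checks the identity on a toy linear model on the CPU; all `toy_` names are hypothetical and unrelated to the notebook's variables.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_loss_fn = nn.CrossEntropyLoss()
toy_x = torch.randn(8, 4)
toy_y = torch.randint(0, 3, (8,))

# Reference: gradient of the mean loss over the full batch of 8 samples
toy_model.zero_grad()
toy_loss_fn(toy_model(toy_x), toy_y).backward()
full_grad = toy_model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 samples, each loss divided by 4
toy_accumulation = 4
toy_model.zero_grad()
for xb, yb in zip(toy_x.chunk(toy_accumulation), toy_y.chunk(toy_accumulation)):
    # backward() adds to the existing .grad, so gradients sum across micro-batches
    (toy_loss_fn(toy_model(xb), yb) / toy_accumulation).backward()
accumulated_grad = toy_model.weight.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-6))
```

With `batch_size = 16` and `accumulation = 16`, each optimizer step in the notebook therefore behaves like a step on 256 tweets while only 16 are held in GPU memory at a time.
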


In [None]:
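# The AdamW implementation in transformers is deprecated; use the PyTorch one instead
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup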

def initialize_model():
    bert_classifier = BertClassifier()
    bert_classifier.to(device)

    #TODO: Choose an optimizer
    optimizer = 

    return bert_classifier, optimizer
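
A common choice for fine-tuning BERT is AdamW with a small learning rate, optionally combined with a linearly decaying schedule. The sketch below shows the two pieces on a toy layer; the learning rate of `2e-5`, the `eps` value, the absence of warmup, and all `_demo` names are assumptions for illustration. Note that `train` as written never steps a scheduler, so using one would also require a `scheduler.step()` call after `scaler.step(optimizer)`, with the number of training steps counted in optimizer steps rather than batches.

```python
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

toy_layer = nn.Linear(4, 3)  # stands in for the BERT classifier
optimizer_demo = AdamW(toy_layer.parameters(), lr=2e-5, eps=1e-8)

# Assumption: 10 optimizer steps in total and no warmup phase
total_steps_demo = 10
scheduler_demo = get_linear_schedule_with_warmup(
    optimizer_demo, num_warmup_steps=0, num_training_steps=total_steps_demo
)

lrs_demo = []
for _ in range(total_steps_demo):
    optimizer_demo.step()   # no gradients here, so the weights stay unchanged
    scheduler_demo.step()   # the learning rate decays linearly towards 0
    lrs_demo.append(scheduler_demo.get_last_lr()[0])

print(lrs_demo[0], lrs_demo[-1])
```
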

In [None]:
set_seed(42)

epochs = 4
accumulation = 16
evaluation = True

bert_classifier, optimizer = initialize_model()
train_losses, train_accs, val_losses, val_accs, best_model = train(bert_classifier, train_dataloader, val_dataloader, epochs, evaluation, accumulation)

# Test your model

In [None]:
import matplotlib.pyplot as plt


plt.plot(range(1, epochs+1), train_losses, label="Training")
plt.plot(range(1, epochs+1), val_losses, label="Validation")
plt.xlabel("No. of Epoch")
plt.ylabel("Loss")
plt.title("Training and Validation Loss")
plt.legend(loc="center right")
plt.show()

In [None]:
plt.plot(range(1, epochs+1), train_accs, label="Training")
plt.plot(range(1, epochs+1), val_accs, label="Validation")
plt.xlabel("No. of Epoch")
plt.ylabel("Accuracy %")
plt.title("Training and Validation Accuracy")
plt.legend(loc="upper left")
plt.show()

In [None]:
def predict(model, dataloader):
    model.eval()
    predictions = []

    for batch in dataloader:
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Collect one prediction per sample, so any batch size works
        preds = torch.argmax(logits, dim=1).flatten()
        predictions.extend(preds.cpu().tolist())

    return predictions

In [None]:
from sklearn.metrics import classification_report

predictions = predict(bert_classifier, test_dataloader)

print(classification_report(y_test, predictions))

In [None]:
best_model.to(device)
predictions = predict(best_model, test_dataloader)

print(classification_report(y_test, predictions))