# Sentiment Analysis with BERT

In this notebook we will train BERT to perform a sentiment analysis task. The same methodology extends to any kind of text classification task. As the original [paper](https://arxiv.org/abs/1810.04805) suggests, we add a single feed-forward layer on top of BERT and fine-tune it for our task. Before going through this notebook, please make sure you have worked through and understood the "BERT - The Basics" notebook.



First we need to install and import the necessary packages.


In [None]:
!pip install matplotlib
!pip install numpy
!pip install scipy
!pip install scikit-learn
!pip install pandas
!pip install torch
!pip install transformers
!pip install tqdm

In [None]:
import numpy as np
import pandas as pd
import torch 
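# Import only the names this notebook uses instead of a wildcard import
from transformers import (BertTokenizer,
                          BertForSequenceClassification,
                          get_linear_schedule_with_warmup)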
import matplotlib.pyplot as plt
import tqdm.notebook  # explicitly load the submodule used as tqdm.notebook.tqdm below
import random

In [None]:
RANDOM_SEED = 0

torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Data and Pre-processing

We're going to work with a dataset of fine foods reviews from Amazon. The original dataset spans a period of more than 10 years, including ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. The `Score` column consists of scores ranging from 1 to 5. The original dataset can be found [here](https://www.kaggle.com/snap/amazon-fine-food-reviews).

Run the following cells to load the data into a Pandas DataFrame.

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
data_id = "1AuYKU1xhGTAUoFbZe3kg3jsKqepfXO_F"
df_downloaded = drive.CreateFile({"id": data_id}) 
df_downloaded.GetContentFile("Reviews.csv")
df = pd.read_csv("Reviews.csv")

Run the cell below to see what the data looks like.

In [None]:
df.head()

### Pre-processing

We subtract one from each score so that, rather than ranging from 1 to 5, the labels range from 0 to 4. We also keep track of the number of classes for later.

In [None]:
df["Score"] -= 1
N_CLASSES = len(df["Score"].unique())

There are a few things we want to consider during pre-processing. First, we want to make sure that there is a reasonably even number of samples for each score 0, 1, 2, 3, 4. If this is not the case, it may not be reasonable to expect our model to infer the score at this level of granularity. We can easily plot the class counts below.

In [None]:
def plot_scores(df):
  score_counts = df["Score"].value_counts().sort_index().values
  plt.bar(range(5), score_counts)

plot_scores(df)

We see there is an overwhelming number of five-star reviews (label 4 after the shift). This type of imbalance can make it harder for the model to learn to distinguish between the five classes, as it may rely on exploiting the prior class probabilities to optimize its loss objective. Thankfully, in our case we have a very large amount of data. Many examples of fine-tuning BERT use only thousands or tens of thousands of samples. Due to our abundance of data we can afford to throw out some of it (in fact, we will need to in order to make training take a reasonable amount of time). We will do so in a way that evens out the number of samples per class. Precisely, we will pick a number $n$ and randomly select $n$ samples from each class.

Note that this is not necessarily the most efficient or best way of dealing with class imbalance:

1. We don't need to perfectly balance the data. In fact, doing so means the data we're training on IS NOT representative of the actual population of reviews. Depending on our goal, it may be important for the model to learn about the prior probability of each class.

2. The data we throw out could be kept for use in our test or validation set.

3. For all we know, this imbalance might not result in worse training/validation performance. We'd have to experiment to find out.

If you're curious about these sorts of questions, we encourage you to tinker with this part of the pre-processing stage on your own time (more on this at the end of the notebook).

In [None]:

# Randomly select min_count samples (without replacement) for each score
def randomly_balance(df, min_count=0):
  score_counts = df["Score"].value_counts().sort_index().values
  if min_count <= 0:
    min_count = min(score_counts)
  chosen_indices = []
  for score in sorted(df["Score"].unique()):
    score_indices = df[df["Score"] == score].index
    # replace=False so the same review is never picked twice; duplicates
    # could otherwise end up in both the training and the test split
    chosen_indices.extend(np.random.choice(score_indices, min_count, replace=False))
  df_balanced = df.loc[chosen_indices]
  return df_balanced

df_balanced = randomly_balance(df, 1000)
plot_scores(df_balanced)
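
For comparison, here is a minimal sketch of the same down-sampling idea in plain NumPy, run on a made-up label vector rather than on the review data. The helper name `balanced_indices` and the toy labels are assumptions for illustration only. The point is that sampling with `replace=False` yields exactly $n$ distinct samples per class.

```python
import numpy as np

def balanced_indices(labels, n_per_class, seed=0):
    """Return indices selecting n_per_class distinct samples for every label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # replace=False guarantees that no sample is drawn twice
        chosen.append(rng.choice(cls_idx, size=n_per_class, replace=False))
    return np.concatenate(chosen)

# Toy, heavily imbalanced label vector (hypothetical data)
toy_labels = np.array([4] * 50 + [3] * 20 + [2] * 10 + [1] * 8 + [0] * 6)
idx = balanced_indices(toy_labels, n_per_class=5)
print(np.bincount(toy_labels[idx]))  # five samples for each of the five classes
```

Because the indices are distinct, a later train/validation/test split cannot place the same sample in two different sets.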

We'll go ahead and split our data into training, validation, and test sets. Additionally we'll discard unnecessary columns. 

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_not_train = train_test_split(df_balanced,
                                          test_size=0.2,
                                          random_state=RANDOM_SEED)
df_val, df_test = train_test_split(df_not_train,
                                   test_size=0.5,
                                   random_state=RANDOM_SEED)

# Get rid of unnecessary columns
df_train = df_train[["Text", "Score"]]
df_val = df_val[["Text", "Score"]]
df_test = df_test[["Text", "Score"]]

print("Number of training samples: ", len(df_train))
print("Number of validation samples: ", len(df_val))
print("Number of test samples: ", len(df_test))

Lastly, BERTbase (the default model we'll try out) can only take in sequences of up to 512 tokens. Longer sequences have to be shortened with some workaround, for example truncation. We want to make sure that, for the most part, BERTbase can take in our reviews without truncating away too much information. To check, we plot a histogram of the number of tokens per review in our training set.

In [None]:
PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

In [None]:
tokenized_text_dict = tokenizer.batch_encode_plus(df_train["Text"].tolist(),
                                                  max_length=None, 
                                                  pad_to_max_length=False,
                                                  return_token_type_ids=True)

In [None]:
sample_token_lengths = [len(tokens) for tokens in tokenized_text_dict['input_ids']]
# Multiply by 100 so the printed value really is a percentage
print("Percentage of samples with >512 tokens: ", 100 * sum(x > 512 for x in sample_token_lengths) / len(sample_token_lengths))
plt.hist(sample_token_lengths, bins=100)
plt.show()

We see that the vast majority of samples have $\leq$ 512 tokens, so we can proceed as planned. As an added comfort, we can be reasonably confident that, even for the longer reviews, the "sentiment" of the review will be captured in the first 512 tokens. With this in mind, we could even consider using a smaller `MAX_LENGTH`. For efficiency/compute reasons, it's best practice to use as small a `MAX_LENGTH` as possible without damaging model performance.
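
One way to pick a smaller `MAX_LENGTH` is to choose the shortest length that still covers a chosen fraction of the reviews. The sketch below assumes a hypothetical helper `smallest_covering_length` and toy token counts. In the notebook you would pass `sample_token_lengths` instead.

```python
import numpy as np

def smallest_covering_length(token_lengths, coverage=0.95, cap=512):
    """Smallest length such that at least `coverage` of samples fit, capped at `cap`."""
    lengths = np.sort(np.asarray(token_lengths))
    # Position of the first sorted length that covers the requested fraction
    k = int(np.ceil(coverage * len(lengths))) - 1
    return int(min(lengths[k], cap))

# Made-up token counts for ten reviews
toy_lengths = [30, 45, 60, 80, 90, 120, 150, 200, 300, 700]
print(smallest_covering_length(toy_lengths, coverage=0.5))
print(smallest_covering_length(toy_lengths, coverage=1.0))
```

Whether the shorter length hurts accuracy still has to be checked empirically on the validation set.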

If you're curious what to do when exceeding BERT's 512-token limit, check out [this](https://stackoverflow.com/questions/58636587/how-to-use-bert-for-long-text-classification) StackOverflow post or this [blog's](https://medium.com/dataseries/why-does-xlnet-outperform-bert-da98a8503d5b) discussion of how XLNet handles this issue.
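
One simple alternative to cutting off the end of a long review is to keep both its beginning and its end. The sketch below works on plain lists of token ids. The helper `head_tail_truncate` and the choice of 128 head tokens are assumptions for illustration, and the budget of 510 leaves room for the `[CLS]` and `[SEP]` tokens.

```python
def head_tail_truncate(token_ids, max_len, head_len):
    """Keep the first head_len and the last (max_len - head_len) tokens."""
    if not 0 <= head_len <= max_len:
        raise ValueError("head_len must be between 0 and max_len")
    if len(token_ids) <= max_len:
        return list(token_ids)  # short enough, nothing to cut
    tail_len = max_len - head_len
    tail = list(token_ids[-tail_len:]) if tail_len > 0 else []
    return list(token_ids[:head_len]) + tail

# Stand-in for a 1000-token review (hypothetical ids)
long_review = list(range(1000))
short = head_tail_truncate(long_review, max_len=510, head_len=128)
print(len(short), short[:3], short[-3:])
```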

### Making a Torch DataLoader

Now that we're going to be working with PyTorch objects, we create a device variable which will be used to represent which device a torch tensor is on (`cpu` or `cuda`). We are running this notebook on a GPU, so you should expect the output to be `cuda`. If it is not, go to Runtime -> Change runtime type and select GPU.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

We're going to construct a PyTorch Dataset and DataLoader. The DataLoader will help us load in minibatches when we train and evaluate the model. Read more about these objects [here](https://pytorch.org/docs/stable/data.html).

In [None]:
from torch.utils.data import Dataset, DataLoader

class AmazonReviewDataset(Dataset):

  def __init__(self, reviews, labels, tokenizer, max_len):
    self.reviews = reviews
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.reviews)
  
  def __getitem__(self, item):
    review = str(self.reviews[item])
    label = self.labels[item]
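    encoding = self.tokenizer.encode_plus(review,
                                          add_special_tokens=True,
                                          max_length=self.max_len,
                                          truncation=True,  # explicitly cut reviews longer than max_len
                                          padding='max_length',  # replaces the deprecated pad_to_max_length=True (transformers >= 3.0)
                                          return_token_type_ids=False,
                                          return_attention_mask=True,
                                          return_tensors='pt')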
    item_dict = {'review_text': review,
                 'input_ids': encoding['input_ids'].flatten(),
                 'attention_mask': encoding['attention_mask'].flatten(),
                 'labels': torch.tensor(label, dtype=torch.long)}
    return item_dict

def create_data_loader(df, tokenizer, max_len, batch_size):
  ds = AmazonReviewDataset(reviews=df["Text"].to_numpy(),
                           labels=df["Score"].to_numpy(),
                           tokenizer=tokenizer,
                           max_len=max_len)
  dl = DataLoader(ds, batch_size=batch_size, num_workers=4)
  return dl

We construct a DataLoader for the training, validation, and test set. Due to RAM constraints we can only afford to take batches of size 8 (any more will crash the notebook) when fine-tuning BERT. This means for large datasets, training for many epochs can take quite a long time! If we freeze the weights of BERT we can work with a larger batch size. 

In [None]:
MAX_LEN = 512
BATCH_SIZE = 8

N_TRAINING_SAMPLES = len(df_train)
N_VAL_SAMPLES = len(df_val)
N_TEST_SAMPLES = len(df_test)

train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)

We can take a peek at an example batch from our training data loader to make sure everything is in order. 

In [None]:
data = next(iter(train_data_loader))
print(data.keys())
print(data['input_ids'].shape)
print(data['attention_mask'].shape)
print(data['labels'].shape)
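
To make the relationship between `input_ids` and `attention_mask` concrete, here is a minimal hand-rolled sketch of padding a single sequence to a fixed length. The helper `pad_and_mask` is hypothetical, and the two middle ids are arbitrary. For `bert-base-uncased`, 101 and 102 are the `[CLS]` and `[SEP]` ids and 0 is the padding id.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad (or cut) token_ids to max_len and mark real tokens with 1, padding with 0."""
    # Note: a real tokenizer truncates before adding [CLS]/[SEP]; this toy version just cuts
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

input_ids, attention_mask = pad_and_mask([101, 2307, 4031, 102], max_len=8)
print(input_ids)       # [101, 2307, 4031, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask tells the model which positions to attend to, so the padding does not influence the prediction.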

# Building a Model

We're going to use the built-in `BertForSequenceClassification` model. The model applies dropout (as regularization) to the pooled representation of the [CLS] token and feeds it to a linear layer, which outputs one logit per class.
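
As a rough illustration of what such a classification head computes, the NumPy sketch below applies (inverted) dropout and a linear layer to a batch of made-up [CLS] vectors. The function `classification_head`, the random weights and the batch size are assumptions for illustration. Only the hidden size of 768 (BERTbase) and the five classes come from our setup.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_SIZE, NUM_CLASSES, BATCH = 768, 5, 4

def classification_head(pooled, W, b, dropout_prob=0.1, training=False):
    """Dropout followed by a linear layer; returns one logit per class."""
    if training:
        # Inverted dropout: zero random units and rescale the survivors
        keep = rng.random(pooled.shape) >= dropout_prob
        pooled = pooled * keep / (1.0 - dropout_prob)
    return pooled @ W + b

pooled = rng.normal(size=(BATCH, HIDDEN_SIZE))             # stand-in for BERT's output
W = rng.normal(scale=0.02, size=(HIDDEN_SIZE, NUM_CLASSES))  # randomly initialized head
b = np.zeros(NUM_CLASSES)

logits = classification_head(pooled, W, b)  # eval mode: dropout is disabled
preds = logits.argmax(axis=1)               # predicted class per sample
print(logits.shape, preds)
```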

### Training

We define two helper functions that will train/evaluate the model over one epoch (one loop through the dataset that is passed in). Along the way we keep track of useful statistics which help us monitor our model during training and study the training process afterwards. 

In [None]:
from torch import nn, optim
from sklearn.metrics import confusion_matrix, classification_report

# Trains the model over one epoch

def train_epoch(model, data_loader, device, optimizer, scheduler):
  
  # Put model in training mode
  model = model.train()
  
  losses = []
  correct_predictions = 0
  num_samples = 0
  all_preds = []
  all_labels = []

  for d in tqdm.notebook.tqdm(data_loader):

    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    labels = d["labels"].to(device)
    num_samples += len(input_ids)

    # Forward pass. When we feed in labels, output[0] is the loss and output[1] is the logits
    output = model(input_ids=input_ids,
                   attention_mask=attention_mask,
                   labels=labels)
    
    _, preds = torch.max(output[1], dim=1)
    correct_predictions += torch.sum(preds == labels)
   
    all_preds.extend(preds.tolist())
    all_labels.extend(labels.tolist())

    loss = output[0]
    losses.append(loss.item())
    
    # Take gradient step
    loss.backward()
    MAX_NORM = 1
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=MAX_NORM)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    
  acc =  correct_predictions.double() / num_samples
  mean_loss = np.mean(losses)
  conf_mat = confusion_matrix(all_labels, all_preds)

  return acc, mean_loss, conf_mat

def eval_epoch(model, data_loader, device):
  
  # Put model in eval mode
  model = model.eval()

  losses = []
  correct_predictions = 0
  num_samples = 0
  all_preds = []
  all_labels = []

  with torch.no_grad():
    for d in tqdm.notebook.tqdm(data_loader):

      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      labels = d["labels"].to(device)
      num_samples += len(input_ids)

      # Forward pass. When we feed in labels, output[0] is the loss and output[1] is the logits
      output = model(input_ids=input_ids,
                     attention_mask=attention_mask,
                     labels=labels)
    
      _, preds = torch.max(output[1], dim=1)
      correct_predictions += torch.sum(preds == labels)

      loss = output[0]
      losses.append(loss.item())
   
      all_preds.extend(preds.tolist())
      all_labels.extend(labels.tolist())

  acc =  correct_predictions.double() / num_samples
  mean_loss = np.mean(losses)
  conf_mat = confusion_matrix(all_labels, all_preds)

  return acc, mean_loss, conf_mat
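
`train_epoch` clips the gradient norm before every optimizer step. As a rough illustration of what clipping by global norm does, the NumPy sketch below rescales a list of made-up gradient arrays so that their combined L2 norm does not exceed `max_norm`. The helper `clip_by_global_norm` is hypothetical and only mimics the idea behind `nn.utils.clip_grad_norm_`.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one common factor so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Never scale up: gradients that are already small are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two toy "parameter gradients" with a joint norm of 5
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

All gradients share the same scaling factor, so the direction of the update is preserved and only its length changes.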

Now we can go ahead and train the model! We fix the number of epochs and the learning rate (both tunable hyper-parameters). Best practice would be to save the model weights that achieve the best validation performance during training (we don't do this). With our current configuration, training takes roughly 30 to 60 minutes. To decrease training time we can either train on less data or freeze the pre-trained BERT weights and increase our batch size.

**If you get a CUDA out of memory error while training, restart the notebook (Runtime -> Factory reset runtime) and try again. If that doesn't work, decrease the batch size.**

In [None]:
LEARNING_RATE = 2e-5
NUM_EPOCHS = 5

# We reseed when making a model so the weights are initialized identically each time
torch.manual_seed(RANDOM_SEED)
model = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, 
                                                      num_labels=N_CLASSES)
model = model.to(device) # moves the model to the GPU if one is available

# We change the dropout probability from the default config

DROPOUT_PROB = 0.5
model.dropout = nn.Dropout(DROPOUT_PROB)

# Uncomment this to freeze the pre-trained BERT weights
#for param in model.bert.parameters():
#    param.requires_grad = False

# Passing only those parameters that explicitly require grad in case we freeze BERT weights
optimizer = optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=LEARNING_RATE)

#optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, correct_bias=False)
total_steps = len(train_data_loader) * NUM_EPOCHS

# This function comes from the transformers library
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps) 
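
To see what a linear schedule does to the learning rate, the following sketch recomputes the multiplier by hand. The function `linear_schedule_multiplier` is a hypothetical re-implementation written for illustration, not the library code, and the step count assumes the 4,000 training samples, batch size of 8 and 5 epochs used in this notebook.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

BASE_LR = 2e-5
TOTAL_STEPS = 2500  # 4000 samples / batch size 8 * 5 epochs

for step in (0, 625, 1250, 2500):
    print(step, BASE_LR * linear_schedule_multiplier(step, 0, TOTAL_STEPS))
```

With zero warmup steps the learning rate starts at its full value and shrinks linearly to zero by the last step, which is why `scheduler.step()` is called once per batch rather than once per epoch.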

In [None]:
from collections import defaultdict

# Training time!

history = defaultdict(list)

for epoch in range(NUM_EPOCHS):
  print(f'Epoch {epoch + 1}/{NUM_EPOCHS}')
  print('-' * 10)
  
  train_acc, train_loss, train_conf = train_epoch(model,
                                                  train_data_loader,
                                                  device,
                                                  optimizer,
                                                  scheduler)
  print(f'Train loss: {train_loss}, Train accuracy: {train_acc}')

  val_acc, val_loss, val_conf = eval_epoch(model,
                                           val_data_loader,
                                           device)
  print(f'Val loss: {val_loss}, Val accuracy {val_acc}')
  
  print()
  print()

  history['train_acc'].append(train_acc)
  history['val_acc'].append(val_acc)
  history['train_loss'].append(train_loss)
  history['val_loss'].append(val_loss)
  history['train_conf'].append(train_conf)
  history['val_conf'].append(val_conf)
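
The loop above does not keep the weights from the best epoch. A minimal sketch of the bookkeeping that would be needed is shown below, using plain dictionaries as stand-ins for model weights. The helper `select_best_epoch` and the toy numbers are assumptions for illustration.

```python
import copy

def select_best_epoch(val_accs, states):
    """Return (epoch, accuracy, state) for the epoch with the best validation accuracy."""
    best_epoch, best_acc, best_state = None, float("-inf"), None
    for epoch, (acc, state) in enumerate(zip(val_accs, states)):
        if acc > best_acc:
            # Deep copy so later training steps cannot overwrite the snapshot.
            # With a real model this would be copy.deepcopy(model.state_dict()).
            best_epoch, best_acc, best_state = epoch, acc, copy.deepcopy(state)
    return best_epoch, best_acc, best_state

# Made-up validation accuracies and stand-in "weights" for five epochs
toy_val_accs = [0.48, 0.55, 0.57, 0.56, 0.54]
toy_states = [{"classifier.weight": [float(e)]} for e in range(5)]

best_epoch, best_acc, best_state = select_best_epoch(toy_val_accs, toy_states)
print(best_epoch, best_acc, best_state)
```

In the training loop this amounts to comparing `val_acc` with the best value so far after each call to `eval_epoch` and, at the end, restoring the snapshot with `model.load_state_dict(best_state)`.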

### Analysis

We can use our ``history`` dict to study the results of our training. First we'll plot our accuracy and loss over time.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

def plot_training_stats(train_list, val_list, title):
  plt.plot(train_list, label='train')
  plt.plot(val_list, label='val')
  plt.title(title)
  plt.ylabel(title)
  plt.xlabel('Epoch')
  plt.legend()
  plt.show()
  plt.clf()

plot_training_stats(history['train_loss'],
                    history['val_loss'],
                    'Loss')
plot_training_stats(history['train_acc'],
                    history['val_acc'],
                    'Accuracy')

Accuracy might not be the best way to gauge how well our model is generalizing. Not all mistakes are equally bad. For example, classifying a 1 as a 2 is not as inexcusable as classifying a 1 as a 5. We present two alternative metrics for gauging model performance and plot them:

1. Off by one accuracy: So long as our predicted label is within one of the true label, we'll consider the prediction correct. With uniform class priors, we'd expect a model that assigns labels uniformly at random to get this right 13/25 = 52% of the time.

2. Extreme accuracy: We only consider samples whose true and predicted labels are both among 1, 2, 4, and 5. So long as we don't classify a 1 or 2 as a 4 or 5 or vice versa, we consider the classification correct. A model that randomly assigns labels would get this right 50% of the time.

We also print out the last confusion matrix on the validation set.

In [None]:
from scipy import linalg

def compute_off_acc(conf_mat):
  n = len(conf_mat)
  off_diag = np.zeros(n)
  off_diag[0] = 1
  off_diag[1] = 1
  M = linalg.toeplitz(off_diag, off_diag)
  return np.sum(conf_mat * M)/np.sum(conf_mat)

# hard-coded for 5 classes

def compute_ext_acc(conf_mat):
  conf_mat = np.delete(conf_mat, 2, 0)
  conf_mat = np.delete(conf_mat, 2, 1)
  return (np.sum(conf_mat[0:2, 0:2]) + np.sum(conf_mat[2:4, 2:4]))/np.sum(conf_mat)


off_train_acc = []
off_val_acc = []
ext_train_acc = []
ext_val_acc = []

for i in range(NUM_EPOCHS):
  train_conf = history['train_conf'][i]
  val_conf = history['val_conf'][i]
  off_train_acc.append(compute_off_acc(train_conf))
  off_val_acc.append(compute_off_acc(val_conf))
  ext_train_acc.append(compute_ext_acc(train_conf))
  ext_val_acc.append(compute_ext_acc(val_conf))

plot_training_stats(off_train_acc,
                    off_val_acc,
                    'Off by One Accuracy')
plot_training_stats(ext_train_acc,
                    ext_val_acc,
                    'Extreme Accuracy')

print(history['val_conf'][-1])
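
The random-guess baselines quoted for the two metrics can be checked on a uniform confusion matrix, which is what a model guessing uniformly at random would produce on balanced data in expectation. The sketch below re-implements both metrics in plain NumPy. The function names `off_by_one_acc` and `extreme_acc` are chosen for this illustration only.

```python
import numpy as np

def off_by_one_acc(conf_mat):
    """Share of samples whose predicted label is within one of the true label."""
    n = len(conf_mat)
    i, j = np.indices((n, n))
    return np.sum(conf_mat[np.abs(i - j) <= 1]) / np.sum(conf_mat)

def extreme_acc(conf_mat):
    """Ignore the middle class; count low-vs-high confusions as the only errors."""
    keep = [0, 1, 3, 4]  # hard-coded for 5 classes
    sub = conf_mat[np.ix_(keep, keep)]
    return (np.sum(sub[:2, :2]) + np.sum(sub[2:, 2:])) / np.sum(sub)

uniform = np.ones((5, 5))       # expected outcome of uniformly random guessing
print(off_by_one_acc(uniform))  # 13/25 = 0.52
print(extreme_acc(uniform))     # 8/16 = 0.5
```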

For the sake of comparison, we evaluate the model before any fine-tuning (pre-trained BERT weights with a freshly initialized classification layer). Our fine-tuned model is clearly learning quite a bit, and it does so after only 1-2 epochs of training.

In [None]:
torch.manual_seed(RANDOM_SEED)
random_model = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, 
                                                             num_labels=N_CLASSES)
random_model = random_model.to(device)
r_val_acc, r_val_loss, r_val_conf = eval_epoch(random_model,
                                               val_data_loader,
                                               device)
print("Un-trained acc.: ", r_val_acc.item())
print("Un-trained off by one acc.: ",  compute_off_acc(r_val_conf))
print("Un-trained extreme acc.: ",  compute_ext_acc(r_val_conf))

#### <font color='red'>Experiment yourself!</font>

Although we've fully walked through training a BERT classification model, there are still a number of things to be done. Our model has the ability to fit data extremely well and performs well by some metrics, but it's clearly overfitting the training data and there's quite a lot of room for improvement. Ideally we would train on a lot more of our data with a larger batch size, but we're limited by Colab's resources. We suggest a number of feasible avenues for further experimentation/exploration below. The 5th is the most time-consuming, but will give you the best understanding of the material.

*Note: Prior to experimenting one should hold out a test data set. Since some of these experiments involve altering how the data is processed you should be extra careful about this.*

1. Find the examples where the model is misclassifying and print them out. Can you figure out why the model is making mistakes? Are the mistakes reasonable?
2. In pre-processing, don't balance the classes. See how this impacts model performance and the confusion matrices during training. 
3. There are a number of hyper-parameters to tune (e.g. amount of training data, number of epochs, batch size, max length, dropout, learning rate). Experiment with how changing these affects your results. A reasonable place to start is adding a lot more training data and training for a very small number of epochs.
4. To a human, the difference between a 1 and 2 star or 4 and 5 star review may be immaterial. We chose to train the model on all five classes to give an example of multi-class classification. Predicting at this level of granularity is a hard, noisy task. As we hinted when we came up with different metrics for evaluating model performance, it isn't necessarily the most reasonable thing to do. Try labelling reviews as negative (1 and 2 star) and positive (4 and 5 star) and re-train the model on the newly labelled data. See how the model performs on this binary classification task.
5. Instead of using `BertForSequenceClassification`, make your own model! Here's some motivation: Our model is likely overfitting because we're training such a large number of parameters on a small dataset. One solution is to keep the same model but freeze BERT's pre-trained weights. This greatly reduces the number of trainable parameters and also allows us to significantly increase batch size without crashing the notebook, meaning we can train on a lot more data. If you try this out, you'll see pretty poor results. This is likely because, without fine-tuning, the final hidden state of the [CLS] token isn't necessarily a good representation of the sentence. Try defining your own model class which feeds a different sentence representation into the final linear layer rather than the outputted [CLS] hidden state. Then try training your model with BERT's pre-trained weights frozen. Make sure the output of your model class's `forward()` function is compatible with `train_epoch()` and `eval_epoch()`. As an example of an alternate sentence representation, you can average the final hidden states of all the tokens in the sequence.
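
As a starting point for the 5th suggestion, here is a minimal NumPy sketch of averaging the final hidden states while ignoring padding positions. The helper `masked_mean_pool` and the toy arrays are assumptions for illustration. In your own model class the same arithmetic would be applied to BERT's last hidden states and the `attention_mask` tensor.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, counting only positions where the mask is 1.

    hidden_states: (batch, seq_len, hidden) array; attention_mask: (batch, seq_len) array.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an (unlikely) all-padding sequence
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# One toy sequence with two real tokens and one padding position
toy_hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])
print(masked_mean_pool(toy_hidden, toy_mask))  # [[2. 2.]]
```

Without the mask, the padding vectors would be averaged in as well, so short reviews would get a representation dominated by padding.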

