# **Sentiment Analysis : Fine-Tuning using BERT**

## **Table of Contents**

1. [Dataset Customization](#dataset-customization)
2. [BERT Model](#bert-model)
3. [Training Preparation](#training-preparation)
4. [Training Loop](#training-loop)

### Connect with `Google Drive`

In [None]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
/gdrive


### Install the `transformers` package
> This package is available from [**Hugging Face**](https://huggingface.co/); it gives us a PyTorch interface for working with BERT.

In [None]:
!pip install --upgrade transformers

Requirement already up-to-date: transformers in /usr/local/lib/python3.7/dist-packages (4.4.2)


### Required Libraries

In [None]:
# Linear algebra
import numpy as np

# Data processing
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from tqdm import tqdm

import transformers
from transformers import BertTokenizer, BertModel, BertForMaskedLM, AdamW, get_linear_schedule_with_warmup

from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
df = pd.read_csv('/gdrive/MyDrive/Movie Review.csv')

In [None]:
df.head(2)

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |


In [None]:
df.groupby(['sentiment']).count()

| sentiment | review |
|-----------|--------|
| negative | 25000 |
| positive | 25000 |


### **Dataset Customization**

> Here I transform the dataset into a format that our BERT model can be trained on. To do this, I will:
* Use **`BertTokenizer`** to load the pre-trained BERT tokenizer. Before a review is fed to BERT, it *must be split into tokens*, and *these tokens must be mapped to their indices in the tokenizer vocabulary*. The tokenization must be performed with the tokenizer that ships with BERT.
* Format the reviews by adding the **special tokens** `[CLS]` at the start and `[SEP]` at the end of each review, then bringing every review to a single constant length, since our reviews vary in length. Padding is done with a special `[PAD]` token, which has index 0 in the BERT vocabulary, and the **attention mask** differentiates `real tokens` from `padding tokens`. Note that I do not request truncation explicitly; as the tokenizer warning further down shows, reviews longer than the maximum length are still truncated by the default `longest_first` strategy.

Please note that the BERT model has two constraints:
1. Sentences must be `padded` or `truncated` to a fixed length. In my case, the fixed length is 64 tokens.
2. The maximum sentence length is 512 tokens.
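
To make the formatting steps concrete, here is a minimal pure-Python sketch of what padding, truncation, and the attention mask amount to. It is not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, the token ids in the usage lines are arbitrary illustrative values, and the ids 101, 102, and 0 are assumed to be `[CLS]`, `[SEP]`, and `[PAD]`, as suggested by the encoded sample printed further down.

```python
# Special-token ids assumed for the bert-base-uncased vocabulary
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_len):
    """Add [CLS]/[SEP], truncate or pad to max_len, and build the attention mask."""
    # Reserve two positions for the special tokens, truncating if needed
    body = token_ids[: max_len - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    # Every real token (including the special ones) gets mask value 1
    mask = [1] * len(ids)
    # Right-pad with [PAD]; padded positions get mask value 0
    n_pad = max_len - len(ids)
    ids = ids + [PAD_ID] * n_pad
    mask = mask + [0] * n_pad
    return ids, mask


# A short sequence is padded, a long one is truncated (illustrative ids only)
short_ids, short_mask = pad_and_mask([2023, 3185, 2001, 2307], max_len=8)
long_ids, long_mask = pad_and_mask(list(range(1000, 1020)), max_len=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have exactly `max_len` entries, which is what lets reviews of different lengths be stacked into one batch tensor.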

In [None]:
class MovieReviewDatSet:
  def __init__(self, data_path):
    self.data = pd.read_csv(data_path).fillna('none')
    self.data.sentiment = self.data.sentiment.apply(lambda x: 1 if x == "positive" else 0)
    self.data = self.data.reset_index(drop=True)

    # Load the BERT tokenizer using the uncased version
    self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    self.max_len = 64
    
    self.review = self.data.review.values
    self.label = self.data.sentiment.values


  def __len__(self):
    return len(self.review)


  def __getitem__(self, item):
    review = str(self.review[item])
    review = " ".join(review.split())

    '''
    Apply the tokenizer to the review, then
    convert the tokens to ids based on the BERT
    vocabulary.
    1. Split the sentence into tokens
    2. Add special tokens
    3. Map tokens to their IDs
    4. Pad reviews to the same length (64 tokens)
    5. Create the attention mask, which helps us differentiate
      real tokens from padding tokens marked with [PAD],
      indexed with 0 in the vocabulary.
    '''
    inputs = self.tokenizer.encode_plus(
        review,
        None,
        add_special_tokens = True, # Add special Token [SEP], [CLS]
        max_length = self.max_len,
        pad_to_max_length = True, # Pad all reviews
    )

    # Review tokens IDs
    ids = inputs["input_ids"]
    mask = inputs["attention_mask"]
    token_type_ids = inputs["token_type_ids"]

    # Convert everything to tensors
    samples = {
        "ids" : torch.tensor(ids, dtype=torch.long),
        "mask" : torch.tensor(mask, dtype=torch.long),
        "token_type_ids" : torch.tensor(token_type_ids, dtype=torch.long),
        "labels" : torch.tensor(self.label[item], dtype=torch.float)
    }

    return samples


#### Split the Customized Dataset
> Here, I load and instantiate the dataset, customizing it with the `MovieReviewDatSet` class.
Then I split it into training data (`90%`) and validation data (`10%`).
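
As a rough illustration of what a 90/10 split does, the sketch below shuffles sample indices with a fixed seed and cuts them into two groups. `split_indices` is a hypothetical helper written for this note, not the scikit-learn implementation; with `train_test_split` itself, passing `random_state` plays the role of the seed and makes the split reproducible.

```python
import random


def split_indices(n_samples, valid_fraction=0.1, seed=42):
    """Shuffle indices reproducibly and split them into train/valid lists."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the global random state untouched
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_samples * valid_fraction))
    # The first n_valid shuffled indices form the validation set
    return indices[n_valid:], indices[:n_valid]


# The dataset above contains 25000 negative + 25000 positive reviews
train_idx, valid_idx = split_indices(50_000)
print(len(train_idx), len(valid_idx))
```

The two index lists are disjoint and together cover the whole dataset, so no review ends up in both subsets.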

In [None]:
data_path = '/gdrive/MyDrive/Movie Review.csv'

# Instantiate dataset
dataset = MovieReviewDatSet(data_path)

# Split data into train and valid subsets
train_data, valid_data = train_test_split(dataset, test_size = 0.1)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [None]:
print(train_data[0])

{'ids': tensor([  101,  1000,  2054,  5650,  2179,  1000,  2001,  1037,  8242,  5456,
         1012,  2004,  2517,  1998,  2856,  2011,  1037,  1012,  4670,  4330,
         1010,  2023,  2003,  5257,  1997,  1037,  2346,  3185,  2007,  1037,
        14046,  5649,  6925,  1010,  2004,  2092,  2004,  1037,  8774,  1997,
         5456,  1012,  1026,  7987,  1013,  1028,  1026,  7987,  1013,  1028,
         2065,  2017,  4033,  1005,  1056,  2464,  1996,  2143,  1010,  2672,
         2017,  2323,  2644,   102]), 'mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'labels': tensor(1.)}


* `Attention Mask` is just an array of **1s** and **0s** indicating which tokens are padded and which are not.

#### DataLoader
Here, I create DataLoaders for the training and validation sets. The DataLoader needs to know the `batch size` for each set, so:
* I set the batch size to 8 for training and 4 for validation.
* I set `num_workers` to 4 for training and 1 for validation. It indicates how many subprocesses to use for data loading.
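
The batch size also determines how many steps one pass over the data takes. The small sketch below computes that count under the assumption that the last, possibly incomplete, batch is kept (the DataLoader default); `num_batches` is a hypothetical helper, and the sample counts assume the 90/10 split of the 50000 reviews.

```python
import math


def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for one pass over n_samples items."""
    if drop_last:
        # An incomplete final batch is discarded
        return n_samples // batch_size
    # An incomplete final batch still counts as one batch
    return math.ceil(n_samples / batch_size)


train_batches = num_batches(45_000, batch_size=8)
valid_batches = num_batches(5_000, batch_size=4)
print(train_batches, valid_batches)
```

These counts correspond to the totals shown by the progress bars in the training loop output.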

In [None]:
# Train data
train_dataloader = DataLoader(train_data,
                              batch_size = 8,
                              num_workers = 4)

# Valid data
valid_dataloader = DataLoader(valid_data,
                              batch_size = 4,
                              num_workers = 1)

In [None]:
type(train_dataloader)

torch.utils.data.dataloader.DataLoader

### **BERT Model**
Now that our dataset is ready and customized, it's time to fine-tune the BERT model. To do so, we need to adapt the pre-trained `BertModel` by modifying its output for our classification task, and then train it on our dataset.
As you might know, there are many classes we can use to fine-tune BERT, such as `BertModel`, `BertForPreTraining`, `BertForMaskedLM`, and more. In our case, we will use **`BertModel`**.

In [None]:
class BERTModel(nn.Module):
  def __init__(self):
    super(BERTModel, self).__init__()

    # Load pre-trained model (weights)
    self.bert = BertModel.from_pretrained('bert-base-uncased')
    self.bert_drop = nn.Dropout(0.3)

    # Linear classifier layer on top
    self.out = nn.Linear(768, 1)

  # def forward(self, ids, mask, token_type_ids):
  #   _, o2 = self.bert(ids, attention_mask=mask, token_type_ids = token_type_ids)
  #   bo = self.bert_drop(o2)
  #   output = self.out(bo)
  
  def forward(self, ids, mask):
    bert_outputs = self.bert(ids, attention_mask=mask)
    pooled_output = bert_outputs['pooler_output']
    bo = self.bert_drop(pooled_output)
    output = self.out(bo)

    return output

In [None]:
# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BERTModel()
model.to(device) 

BERTModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  

> If you take a close look at the model summary above, you will see:
1. The **`Embedding layer`**, including the `token embedding`, `segment embedding`, and `positional embedding`.
1. The **`12 transformer layers`** (0->11)
1. The **`Output layer`**
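
The shapes printed in the summary are enough to count the parameters of the embedding block and of the classifier head by hand. The sketch below is plain arithmetic on those shapes; it assumes the `LayerNorm` holds one weight and one bias vector (`elementwise_affine=True`) and that the `nn.Linear(768, 1)` head has a bias term.

```python
HIDDEN = 768

# Shapes taken from the model summary: (rows, columns) of each embedding table
embedding_tables = {
    "word_embeddings": (30522, HIDDEN),
    "position_embeddings": (512, HIDDEN),
    "token_type_embeddings": (2, HIDDEN),
}

table_params = sum(rows * cols for rows, cols in embedding_tables.values())
# LayerNorm with elementwise_affine=True: one weight and one bias per feature
layer_norm_params = 2 * HIDDEN
embedding_params = table_params + layer_norm_params

# Classifier head nn.Linear(768, 1): 768 weights plus 1 bias
head_params = HIDDEN * 1 + 1

print(embedding_params, head_params)
```

The comparison shows how small the added classification layer is next to the pre-trained weights it sits on.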

### **Optimizer & Scheduler**
> Our model is loaded and modified successfully, so we now have to choose the `optimizer` and `parameterize` our scheduler.
If you go back to the [original paper](https://arxiv.org/pdf/1810.04805.pdf), the authors recommend the following values:
* **`Batch Size`**: 16, 32
* **`Learning Rate`**: 5e-5, 3e-5, 2e-5
* **`Number of epochs`**: 2, 3, 4

In our case, I chose:
- A **batch size** of 8, set when creating the DataLoader.
- The `AdamW` optimizer, a class from the Hugging Face library, with **Lr**: 2e-5, as shown in the script below.
- **Epochs** equal to 4.
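
To see how a linear schedule with warmup shapes the learning rate, here is a small pure-Python sketch of the multiplier it applies to the base rate. `linear_schedule_factor` is a hypothetical re-implementation written for illustration, and the 5625 steps per epoch are an assumption taken from the training progress bars. The multiplier reaches zero at `num_training_steps`, so that value should cover every optimizer step across all epochs.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps, clipped at 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


BASE_LR = 2e-5
STEPS_PER_EPOCH = 5625  # assumption: 45000 training samples / batch size 8
EPOCHS = 4

# Compare a schedule spanning one epoch with one spanning all epochs
for total in (STEPS_PER_EPOCH, STEPS_PER_EPOCH * EPOCHS):
    lrs = [
        BASE_LR * linear_schedule_factor(epoch * STEPS_PER_EPOCH, total)
        for epoch in range(EPOCHS)
    ]
    print(f"num_training_steps={total}: lr at the start of each epoch = {lrs}")
```

With a schedule that spans only one epoch, the learning rate is already zero when the second epoch starts, so later epochs no longer change the weights.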


In [None]:
# Optimizer
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  correct_bias = False) # Do not apply Adam's bias correction

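# Total number of training steps = batches per epoch * number of epochs.
# The schedule must span all 4 epochs of the training loop; otherwise the
# learning rate reaches 0 after the first epoch.
total_steps = len(train_dataloader) * 4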

# create the learning rate Scheduler
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps = 0,
    num_training_steps=total_steps
)

# loss_fn = nn.CrossEntropyLoss().to(device)

### **Training Preparation**

> Here, I define some helper functions:
1. **`loss_fn`** for calculating the loss.
2. **`train_fn`** for the training phase, where I load the data onto the GPU, feed it through the network, perform the backward pass, and update the parameters with the optimizer and the learning rate scheduler.
3. **`eval_fn`** for the validation phase, where I run the forward pass without gradients and collect the predicted probabilities and the labels.
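
Since the model emits one raw logit per review, the loss is a binary cross-entropy computed directly on logits. The sketch below works through that formula in pure Python for a single example and for a batch average; `bce_with_logits` and `mean_bce_with_logits` are hypothetical helpers, and the numerically stable form used here is one common way to write it, not necessarily how PyTorch implements it internally.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy for one raw logit and a 0/1 label (stable form)."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, labels):
    """Average the per-example losses over a batch."""
    losses = [bce_with_logits(z, y) for z, y in zip(logits, labels)]
    return sum(losses) / len(losses)


# A confident correct prediction costs little, a confident wrong one costs a lot
print(bce_with_logits(2.0, 1.0))
print(bce_with_logits(2.0, 0.0))
print(mean_bce_with_logits([2.0, -1.0, 0.0], [1.0, 0.0, 1.0]))
```

Working on logits rather than probabilities avoids taking the logarithm of values that round to 0 or 1.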

In [None]:
def loss_fn(outputs, labels):
  return nn.BCEWithLogitsLoss()(outputs, labels.view(-1, 1))

def train_fn(train_dataloader, model, optimizer, device, scheduler):
  model.train()

  for bi, d in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
    ids = d["ids"]
    # token_type_ids = d["token_type_ids"]
    mask = d["mask"]
    labels = d["labels"]

    # Load data onto the GPU
    ids = ids.to(device, dtype=torch.long)
   # token_type_ids = token_type_ids.to(device, dtype=torch.long)
    mask = mask.to(device, dtype = torch.long)
    labels = labels.to(device, dtype = torch.float)

    # Clear out the gradients of the previous pass
    optimizer.zero_grad()

    # Forward & backward passes
    outputs = model(ids=ids,
                   # token_type_ids = token_type_ids,
                    mask=mask)
    loss = loss_fn(outputs, labels)
    loss.backward()

    # Update the parameters 
    optimizer.step()

    # Update the learning rate
    scheduler.step()


def eval_fn(dataloader, model, device):
  model.eval()
  fin_labels = []
  fin_outputs = []
  with torch.no_grad():
    for bi, d in tqdm(enumerate(dataloader), total=len(dataloader)):
      ids = d["ids"]
      #token_type_ids = d["token_type_ids"]
      mask = d["mask"]
      labels = d["labels"]

      # Load the data onto the GPU
      ids = ids.to(device, dtype=torch.long)
      #token_type_ids = token_type_ids.to(device, dtype=torch.long)
      mask = mask.to(device, dtype=torch.long)
      labels = labels.to(device, dtype=torch.float)

      # Forward pass
      outputs = model(ids=ids, 
                      #token_type_ids=token_type_ids, 
                      mask=mask)
      fin_labels.extend(labels.cpu().detach().numpy().tolist())
      fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
  return fin_outputs, fin_labels

### **Training loop**

> Finally, we come to the last phase: in each epoch of the loop, I run a **`training`** phase followed by a **`validation`** phase, using the helper functions defined above.

In [None]:
EPOCHS = 4
best_accuracy = 0
for epoch in range(EPOCHS):
  train_fn(train_dataloader, model, optimizer, device, scheduler)
  outputs, labels = eval_fn(valid_dataloader, model, device)
  outputs = np.array(outputs) >= 0.5

  # Calculate the accuracy for each pass
  accuracy = metrics.accuracy_score(labels, outputs)
  print(f"The Accuracy Score is = {accuracy}")
  
  # Save our best model having the best accuracy
  if accuracy > best_accuracy:
    torch.save(model.state_dict(), "/gdrive/MyDrive/BERTModel.bin")
    best_accuracy = accuracy

100%|██████████| 5625/5625 [06:51<00:00, 13.68it/s]
100%|██████████| 1250/1250 [00:16<00:00, 76.70it/s]


The Accuracy Score is = 0.8356


100%|██████████| 5625/5625 [06:51<00:00, 13.66it/s]
100%|██████████| 1250/1250 [00:16<00:00, 76.82it/s]

The Accuracy Score is = 0.8356



100%|██████████| 5625/5625 [06:51<00:00, 13.67it/s]
100%|██████████| 1250/1250 [00:16<00:00, 76.83it/s]

The Accuracy Score is = 0.8356



100%|██████████| 5625/5625 [06:52<00:00, 13.65it/s]
100%|██████████| 1250/1250 [00:16<00:00, 76.81it/s]

The Accuracy Score is = 0.8356
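
The validation step in the loop above turns sigmoid probabilities into class predictions with a 0.5 threshold and then scores them. Here is a minimal pure-Python sketch of that computation; `accuracy_from_probabilities` is a hypothetical helper and the probabilities are made-up values for illustration, not model outputs.

```python
def accuracy_from_probabilities(probs, labels, threshold=0.5):
    """Threshold probabilities into 0/1 predictions and return the accuracy."""
    # A probability at or above the threshold is predicted as positive (1.0)
    preds = [1.0 if p >= threshold else 0.0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, labels) if pred == y)
    return correct / len(labels)


# Illustrative values only: two of the four predictions match their labels
example_probs = [0.91, 0.12, 0.55, 0.40]
example_labels = [1.0, 0.0, 0.0, 1.0]
example_accuracy = accuracy_from_probabilities(example_probs, example_labels)
print(example_accuracy)
```

On a balanced dataset like this one, a classifier that always predicts the same class would score 0.5, which is a useful baseline when reading the accuracy values.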



