# Fine-Tuning BERT for Sentiment Analysis

Following the steps in "A Complete Guide to BERT with Code".

In [1]:
from huggingface_hub import notebook_login
notebook_login()

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data/IMDB Dataset.csv')
df.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## 1 Load and Preprocess the Dataset

### Cleaning the reviews

In [4]:
print(f"Original:\n{df.iloc[1]['review'][0:72]}")

#Remove HTML break tags <br />
df['review_cleaned'] = df['review'].apply(lambda x: x.replace('<br />', ''))
print(f"With no break tags:\n{df.iloc[1]['review_cleaned'][0:72]}")

#Remove unnecessary whitespace
df['review_cleaned'] = df['review_cleaned'].replace(r'\s+', ' ', regex=True)
print(f"Cleaned:\n{df.iloc[1]['review_cleaned'][0:72]}")

Original:
A wonderful little production. <br /><br />The filming technique is very
With no break tags:
A wonderful little production. The filming technique is very unassuming-
Cleaned:
A wonderful little production. The filming technique is very unassuming-
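
Removing `<br />` with an empty string works for the sample above because the tag is preceded by a space, but it would join two words whenever a review has no whitespace around the tag. A minimal sketch of a variant that replaces the tag with a space first and then collapses whitespace; `clean_review` is a hypothetical helper name, not part of the original notebook:

```python
import re

def clean_review(text: str) -> str:
    """Replace HTML break tags with a space, then collapse runs of whitespace."""
    # Matches <br>, <br/> and <br /> in any letter case
    without_tags = re.sub(r'<br\s*/?>', ' ', text, flags=re.IGNORECASE)
    # Collapse any run of whitespace into a single space and trim the ends
    return re.sub(r'\s+', ' ', without_tags).strip()

print(clean_review("Great start.<br /><br />Weak   ending."))
# Great start. Weak ending.
```

It could be applied to the DataFrame with `df['review'].apply(clean_review)`.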


### Encoding the sentiment

In [5]:
df['sentiment_encoded'] = df['sentiment'].apply(lambda x:0 if x == 'negative' else 1)
df.head(5)

Unnamed: 0,review,sentiment,review_cleaned,sentiment_encoded
0,One of the other reviewers has mentioned that ...,positive,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,positive,A wonderful little production. The filming tec...,1
2,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,negative,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"Petter Mattei's ""Love in the Time of Money"" is...",1
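
The lambda above maps every value other than 'negative' to 1, so a misspelled or missing label would silently count as positive. A small sketch of a stricter encoding with an explicit mapping; `LABEL_TO_ID` and `encode_sentiment` are illustrative names, not from the original notebook:

```python
LABEL_TO_ID = {'negative': 0, 'positive': 1}

def encode_sentiment(label: str) -> int:
    """Return the integer class for a sentiment label, failing on unknown values."""
    try:
        return LABEL_TO_ID[label]
    except KeyError:
        raise ValueError(f"Unexpected sentiment label: {label!r}") from None

print([encode_sentiment(s) for s in ['positive', 'negative', 'positive']])
# [1, 0, 1]
```

It could be used as `df['sentiment'].apply(encode_sentiment)`, which stops with an error on the first unexpected label instead of mislabeling it.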


## 2 Tokenize the Data

In [6]:
from transformers import BertTokenizer

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

### Visualization for a single review

In [8]:
#Encode a sample sentence
sample = "I liked this movie"
#We can return the token_ids tensor in PyTorch, NumPy or TensorFlow format
token_ids = tokenizer.encode(sample, return_tensors='np')[0]
print(f"Token IDs:\n{token_ids}")

#Convert token_ids back to tokens. Just to visualize special tokens added
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(f"Tokens:\n{tokens}")

Token IDs:
[ 101 1045 4669 2023 3185  102]
Tokens:
['[CLS]', 'i', 'liked', 'this', 'movie', '[SEP]']


In [9]:
review = df['review_cleaned'].iloc[0]
token_ids = tokenizer.encode(review, max_length = 512, padding = 'max_length', truncation = True, return_tensors = 'pt')
token_ids #Padding starts at the 393rd token

tensor([[  101,  2028,  1997,  1996,  2060, 15814,  2038,  3855,  2008,  2044,
          3666,  2074,  1015, 11472,  2792,  2017,  1005,  2222,  2022, 13322,
          1012,  2027,  2024,  2157,  1010,  2004,  2023,  2003,  3599,  2054,
          3047,  2007,  2033,  1012,  1996,  2034,  2518,  2008,  4930,  2033,
          2055, 11472,  2001,  2049, 24083,  1998,  4895, 10258,  2378,  8450,
          5019,  1997,  4808,  1010,  2029,  2275,  1999,  2157,  2013,  1996,
          2773,  2175,  1012,  3404,  2033,  1010,  2023,  2003,  2025,  1037,
          2265,  2005,  1996,  8143, 18627,  2030,  5199,  3593,  1012,  2023,
          2265,  8005,  2053, 17957,  2007, 12362,  2000,  5850,  1010,  3348,
          2030,  4808,  1012,  2049,  2003, 13076,  1010,  1999,  1996,  4438,
          2224,  1997,  1996,  2773,  1012,  2009,  2003,  2170, 11472,  2004,
          2008,  2003,  1996,  8367,  2445,  2000,  1996, 17411,  4555,  3036,
          2110,  7279,  4221, 12380,  2854,  1012,  

Note: tokenizer.encode() can return the token IDs as a NumPy array, a PyTorch tensor or a TensorFlow tensor. For this kind of project the PyTorch format is recommended, because the model is a PyTorch model and PyTorch tensors can be moved to a CUDA device directly.

In [11]:
review = df['review_cleaned'].iloc[0]
batch_encoder = tokenizer.encode_plus(review, max_length = 512, padding = 'max_length', truncation = True, return_tensors = 'pt')

AttributeError: BertTokenizer has no attribute encode_plus

**Note: In the current version of the transformers library, encode_plus() is no longer used. What encode_plus() used to do is now handled by calling the tokenizer object directly, tokenizer(...).**

In [10]:
review = df['review_cleaned'].iloc[0]
batch_encoder = tokenizer(review, max_length = 512, padding = 'max_length', truncation = True, return_tensors = 'pt')

In [11]:
print(f"Batch encoder keys:\n{batch_encoder.keys()}\n")

Batch encoder keys:
KeysView({'input_ids': tensor([[  101,  2028,  1997,  1996,  2060, 15814,  2038,  3855,  2008,  2044,
          3666,  2074,  1015, 11472,  2792,  2017,  1005,  2222,  2022, 13322,
          1012,  2027,  2024,  2157,  1010,  2004,  2023,  2003,  3599,  2054,
          3047,  2007,  2033,  1012,  1996,  2034,  2518,  2008,  4930,  2033,
          2055, 11472,  2001,  2049, 24083,  1998,  4895, 10258,  2378,  8450,
          5019,  1997,  4808,  1010,  2029,  2275,  1999,  2157,  2013,  1996,
          2773,  2175,  1012,  3404,  2033,  1010,  2023,  2003,  2025,  1037,
          2265,  2005,  1996,  8143, 18627,  2030,  5199,  3593,  1012,  2023,
          2265,  8005,  2053, 17957,  2007, 12362,  2000,  5850,  1010,  3348,
          2030,  4808,  1012,  2049,  2003, 13076,  1010,  1999,  1996,  4438,
          2224,  1997,  1996,  2773,  1012,  2009,  2003,  2170, 11472,  2004,
          2008,  2003,  1996,  8367,  2445,  2000,  1996, 17411,  4555,  3036,
         

In [12]:
print(f"Attention mask:\n{batch_encoder['attention_mask']}")

Attention mask:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 
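
The attention mask marks real tokens with 1 and padding with 0, so the model ignores the `[PAD]` positions. A pure-Python sketch of the idea, using the token IDs of "I liked this movie" printed earlier and a shortened maximum length of 10 for readability; `pad_with_mask` is a hypothetical helper, not the tokenizer's actual implementation:

```python
PAD_ID = 0  # ID of the [PAD] token in the bert-base-uncased vocabulary

def pad_with_mask(token_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    kept = list(token_ids[:max_length])
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 for every real token, 0 for every padding position
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad_with_mask([101, 1045, 4669, 2023, 3185, 102], max_length=10)
print(ids)   # [101, 1045, 4669, 2023, 3185, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Unlike the real tokenizer, this sketch simply cuts off long inputs and does not re-append `[SEP]` when truncating.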

### Encoding all reviews

In [13]:
import torch

In [14]:
%%time

token_ids = []
attention_masks = []

#Encode each review
for review in df['review_cleaned']:
    batch_encoder = tokenizer(review, max_length = 512, padding = 'max_length', truncation = True, return_tensors = 'pt')
    token_ids.append(batch_encoder['input_ids'])
    attention_masks.append(batch_encoder['attention_mask'])

#Concatenate the per-review tensors into two tensors of shape (number of reviews, 512)
token_ids = torch.cat(token_ids, dim = 0)
attention_masks = torch.cat(attention_masks, dim = 0)

CPU times: user 51.1 s, sys: 420 ms, total: 51.5 s
Wall time: 51.5 s
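
The loop above tokenizes one review per call. The tokenizer object also accepts a list of strings and then returns tensors with one row per text, so the reviews could be encoded in chunks instead. A sketch of the chunking logic only; `encode_in_chunks` and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the example runs without downloading a model:

```python
def encode_in_chunks(texts, encode_batch, chunk_size=1000):
    """Call encode_batch on consecutive slices of texts and collect the results."""
    results = []
    for start in range(0, len(texts), chunk_size):
        results.append(encode_batch(texts[start:start + chunk_size]))
    return results

# Stand-in encoder (assumption for illustration): counts words per text.
# With the real tokenizer this would be something like:
#   lambda chunk: tokenizer(chunk, max_length=512, padding='max_length',
#                           truncation=True, return_tensors='pt')
fake_encode = lambda chunk: [len(text.split()) for text in chunk]

chunks = encode_in_chunks(["a b", "c", "d e f", "g", "h i"], fake_encode, chunk_size=2)
print(chunks)  # [[2, 1], [3, 1], [2]]
```

With the real tokenizer, the `input_ids` and `attention_mask` of each chunk would then be joined with `torch.cat(..., dim = 0)`, as in the cell above.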


## 3 Create the Train and Validation DataLoaders

In [15]:
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

In [16]:
val_size = 0.1

#Split the token IDs
train_ids, val_ids = train_test_split(token_ids, test_size=val_size, shuffle=False)

#Split the attention masks
train_masks, val_masks = train_test_split(attention_masks, test_size=val_size, shuffle=False)

#Split the labels
labels = torch.tensor(df['sentiment_encoded'].values)
train_labels, val_labels = train_test_split(labels, test_size=val_size, shuffle=False)

#Create the DataLoaders
train_data = TensorDataset(train_ids, train_masks, train_labels)
train_dataloader = DataLoader(train_data, shuffle = True, batch_size = 16)
val_data = TensorDataset(val_ids, val_masks, val_labels)
val_dataloader = DataLoader(val_data, batch_size = 16)

Note: **About shuffle**
   - The three splits are made in separate calls, so it is important not to shuffle the elements (shuffle=False). Otherwise each call would reorder its array independently and we would lose the association between IDs, masks and labels.
   - train_dataloader uses shuffle=True because the goal of the training set is for the model to learn the patterns in the text and ignore any information tied to the position of the examples.
   - Shuffling val_dataloader is unnecessary because the model does not learn from it; it is only used to evaluate performance.
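
The three separate calls stay aligned only because nothing is shuffled. An alternative is to draw one permutation of row indices and apply it to every array; `train_test_split` itself also accepts several arrays in a single call and splits them consistently. A pure-Python sketch of the index-based idea, where `split_indices` is a hypothetical helper and the seed value is arbitrary:

```python
import random

def split_indices(n_rows, val_size=0.1, seed=42):
    """Shuffle row indices once and split them into train and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * val_size))
    return indices[n_val:], indices[:n_val]

# Toy data (made up): row i has token row 'ids<i>' and label i % 2
token_rows = ['ids0', 'ids1', 'ids2', 'ids3', 'ids4', 'ids5', 'ids6', 'ids7', 'ids8', 'ids9']
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

train_idx, val_idx = split_indices(len(token_rows))

# The same indices select matching rows from every array, so pairs stay aligned
train_pairs = [(token_rows[i], labels[i]) for i in train_idx]
val_pairs = [(token_rows[i], labels[i]) for i in val_idx]
print(len(train_pairs), len(val_pairs))  # 9 1
```

With PyTorch tensors the same index lists could be used directly, for example `token_ids[train_idx]` and `labels[train_idx]`.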

## 4 Instantiate a BERT model

In [17]:
from transformers import BertForSequenceClassification

In [18]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


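Note: Although it looks like an error, this status report is normal. The UNEXPECTED keys are the pre-training heads stored in the checkpoint, which the classification model does not use. The MISSING keys are the weights of the linear classification layer: the model has not yet been trained for our task, so they are not in the checkpoint and are newly initialized until fine-tuning gives them useful values.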

## 5 Instantiate an Optimizer, Loss Function and Scheduler

In [19]:
from torch.optim import AdamW
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

In [20]:
EPOCHS = 2

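#Optimizer
#AdamW defaults to lr=1e-3, which is usually too high for fine-tuning BERT.
#lr=2e-5 is a common choice (an assumption here, not a value from the original run)
optimizer = AdamW(model.parameters(), lr=2e-5)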

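#Loss function (kept for reference: the model already returns the cross-entropy
#loss when labels are passed to it, which is what the training loop uses)
loss_function = nn.CrossEntropyLoss()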

#Scheduler: To gradually reduce the learning rate as the training process continues
num_training_steps = EPOCHS * len(train_dataloader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps = 0, num_training_steps = num_training_steps
)
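
With `num_warmup_steps = 0`, the scheduler multiplies the optimizer's base learning rate by a factor that falls linearly from 1 at the first step to 0 at the last one. A sketch of that factor in plain Python; `linear_factor` is an illustrative name, and the 10 total steps are a made-up value chosen for readability:

```python
def linear_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: ramp down from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

factors = [linear_factor(s, 0, 10) for s in range(0, 11, 5)]
print(factors)  # [1.0, 0.5, 0.0]
```

The effective learning rate at a given step is the optimizer's base learning rate times this factor.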

## 6 Fine-Tuning Loop

Note: To get better training times we can use the power of the GPU through the CUDA platform.
- Using CUDA requires the NVIDIA drivers and a CUDA-enabled build of PyTorch to be installed.

In [21]:
#Check if GPUs with CUDA are available
if torch.cuda.is_available():
    print("CUDA available")
    device = torch.device('cuda:0')
else:
    print("CUDA not available")
    device = torch.device('cpu')

CUDA available


In [22]:
import numpy as np

In [24]:
def calculate_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis = 1).flatten()
    labels_flat = labels.flatten()
    accuracy = np.sum(pred_flat == labels_flat) / len(labels_flat)

    return accuracy
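
As a worked example of what `calculate_accuracy` computes, the sketch below takes the index of the largest logit in each row as the predicted class and compares it with the true label. It is written in plain Python with made-up logits; `accuracy_from_logits` is a hypothetical name used only for this illustration:

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the index of the true label."""
    # argmax per row: the position of the highest score is the predicted class
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```

Averaging this value over batches, as the validation loop does, matches the accuracy over the whole validation set only when every batch has the same size.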

In [26]:
model.to(device)

for epoch in range(0, EPOCHS):
    #1. Switch the model to be in "Train Mode". Activates the dropout layer
    model.train()

    #2. Save training loss to track it over subsequent epochs.
    #It should decrease with each epoch if training is successful
    training_loss = 0

    for batch in train_dataloader:

        #3. Move token IDs, attention masks and labels to GPU (if available)
        batch_token_ids = batch[0].to(device)
        batch_attention_mask = batch[1].to(device)
        batch_labels = batch[2].to(device)

        #4. Reset the calculated gradients from the previous iteration loop
        model.zero_grad()

        #5. Pass the batch to the model to calculate the logits (predictions based
        # on the current classifier parameters (weights and biases)) and the loss
        loss, logits = model (
            batch_token_ids,
            token_type_ids = None,
            attention_mask = batch_attention_mask,
            labels = batch_labels,
            return_dict = False
        )

        #6. Accumulate the batch loss into the running total for the epoch
        #Loss is returned as a PyTorch tensor, .item() extracts its float value
        training_loss += loss.item()

        #7. Perform a backward pass of the model to propagate the loss through all
        #trainable layers. The gradients let the model adjust its parameters to improve performance
        loss.backward()

        #8. Clip the gradient norm to be no larger than 1.0
        # so the model does not suffer from the exploding gradients problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        #9. Call the optimizer to update the parameters using the gradients from the
        #backward pass, then advance the learning-rate scheduler
        optimizer.step()
        scheduler.step()

    #10. Calculate the average training loss for the epoch
    average_train_loss = training_loss / len(train_dataloader)

    #Validation step for the epoch

    #11. Switch the model to "Evaluation Mode". Deactivates the dropout layer
    model.eval()

    #12. Set the validation loss to 0.
    val_loss = 0
    val_accuracy = 0

    #13. Split the validation data into batches (already done)

    for batch in val_dataloader:

        #14. Move token IDs, attention masks and labels to GPU (if available)
        batch_token_ids = batch[0].to(device)
        batch_attention_mask = batch[1].to(device)
        batch_labels = batch[2].to(device)

        #15. Invoke no_grad() to instruct the model not to calculate the gradients
        #since we will not be performing any optimization, only inference
        with torch.no_grad():
            #16. Pass the batch to the model to calculate the logits and the loss
            loss, logits = model (
                batch_token_ids,
                token_type_ids = None,
                attention_mask = batch_attention_mask,
                labels = batch_labels,
                return_dict = False
            )

        #17. Extract the logits and labels from the model and move them to the CPU
        logits = logits.detach().cpu().numpy()
        label_ids = batch_labels.to('cpu').numpy()

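        #18. Accumulate the validation loss
        val_loss += loss.item()

        #19. Accumulate the batch accuracy based on the true labels
        val_accuracy += calculate_accuracy(logits, label_ids)

    #20. Calculate the average validation loss and accuracy, then report the epoch
    average_val_loss = val_loss / len(val_dataloader)
    average_val_accuracy = val_accuracy / len(val_dataloader)
    print(f"Epoch {epoch + 1}: train loss {average_train_loss:.4f}, "
          f"val loss {average_val_loss:.4f}, val accuracy {average_val_accuracy:.4f}")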


OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 3.68 GiB of which 79.25 MiB is free. Including non-PyTorch memory, this process has 3.59 GiB memory in use. Of the allocated memory 3.46 GiB is allocated by PyTorch, and 47.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Note: **About CUDA and the GPU**
For training to work, every element, including the model itself, must be on the same device, ideally the GPU. That is why the calls to .to(device) are needed.

Note:
With the current hyperparameters (batch_size=16 in the DataLoaders and max_length=512 when encoding), my computer (RTX 3050 with 4 GB of VRAM) cannot train the model because it runs out of memory. An estimated 12 GB of VRAM would be needed. To complete the training I have to reduce these hyperparameters or train a smaller model.
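
A rough back-of-the-envelope estimate shows why 4 GB is tight even before activations are counted. The sketch assumes about 110 million parameters for bert-base-uncased (the commonly cited figure), 32-bit floats, and AdamW keeping two extra values per parameter. Activation memory, which grows with batch_size and max_length, is not included:

```python
N_PARAMS = 110_000_000   # assumption: approximate parameter count of bert-base
BYTES_PER_FLOAT32 = 4

weights = N_PARAMS * BYTES_PER_FLOAT32
gradients = weights            # one gradient value per parameter
adamw_state = 2 * weights      # AdamW tracks two moving averages per parameter

total_gib = (weights + gradients + adamw_state) / 2**30
print(f"{total_gib:.2f} GiB before activations")  # 1.64 GiB before activations
```

Under these assumptions only about 2 GiB of the 3.68 GiB reported in the error remain for activations, which is why lowering batch_size or max_length is usually the first thing to try.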