# Finetuning BERT for sentiment analysis
Let's explore how to finetune the pre-trained BERT model for a sentiment analysis task with the IMDB dataset. The IMDB dataset consists of movie reviews along with their sentiment labels. The dataset used in this section can be downloaded from here. [link TBA]

## Import the dependencies

First, let's install the necessary libraries:

In [None]:
%%capture
!pip install nlp==0.4.0
!pip install transformers==3.5.1



Import the necessary modules:

In [None]:
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from nlp import load_dataset
import torch
import numpy as np


## Load the model and dataset
First, let's download and load the dataset using the nlp library:

In [None]:
!gdown https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
dataset = load_dataset('csv', data_files='./imdbs.csv', split='train')



Let us check the datatype:

In [None]:
type(dataset)


Next, let's split the dataset into train and test sets:

In [None]:
dataset = dataset.train_test_split(test_size=0.3)


Let's print the dataset:

In [None]:
dataset

{'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 70),
 'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 30)}


Now, we create the train and test sets:

In [None]:
train_set = dataset['train']
test_set = dataset['test']

In [None]:
train_set[0]

{'label': 0,
 'text': "The Hills Have Eyes II is what you would expect it to be and nothing more. Of course it's not going to be an Oscar nominated film, it's just pure entertainment which you can just lose yourself in for 90 minutes.<br /><br />The plot is basically about a group of National Guard trainees who find themselves battling against the notorious mutated hillbillies on their last day of training in the desert. It's just them fighting back throughout the whole film, which includes a lot of violence (which is basically the whole film) as blood and guts are constantly flying around throughout the whole thing, and also yet another graphic rape scene which is pointlessly thrown in to shock the audience.<br /><br />I'd give the Hills Have Eyes II 4 out of 10 for pure entertainment, and that only. Although even then I found myself looking at my watch more and more as the film went on, as it began to drag due to the fact it continued to try and shock the audience with graphic gore a


Next, let's download and load the pre-trained BERT model. In this example, we use the pre-trained bert-base-uncased model. As we can observe below, since we are performing sequence classification, we use the BertForSequenceClassification class:


In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Next, we download and load the tokenizer that was used for pretraining the bert-base-uncased model.
As we can observe, we create the tokenizer using the BertTokenizerFast class instead of BertTokenizer. The BertTokenizerFast class has many advantages compared to BertTokenizer. We will learn about this in the next section:


In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')



Now that we have loaded the dataset and the model, let's preprocess the dataset.

## Preprocess the dataset
We can preprocess the dataset in a quicker way using our tokenizer. For example, consider the sentence: 'I love Paris'.  

First, we tokenize the sentence and add the [CLS] token at the beginning and [SEP] token at the end as shown below:


tokens = [ [CLS], I, love, Paris, [SEP] ]


Next, we map the tokens to the unique input ids (token ids). Suppose the following are the unique input ids (token ids):


input_ids = [101, 1045, 2293, 3000, 102]

Then, we need to add the segment ids (token type ids). Wait, what are segment ids? Suppose we have two sentences in the input. In that case, segment ids are used to distinguish one sentence from the other. All the tokens from the first sentence will be mapped to 0 and all the tokens from the second sentence will be mapped to 1. Since here we have only one sentence, all the tokens will be mapped to 0 as shown below:


token_type_ids = [0, 0, 0, 0, 0]
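
To make the two-sentence case concrete, here is a minimal sketch in plain Python. The helper name build_pair_inputs and the second sentence are chosen only for illustration; the sketch assumes the usual BERT layout in which [CLS], the first sentence, and its [SEP] belong to segment 0, while the second sentence and its closing [SEP] belong to segment 1:

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Join two tokenized sentences in BERT's layout and mark each segment.

    Hypothetical helper for illustration; the real tokenizer does this internally.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and the first [SEP] are mapped to segment 0
    token_type_ids = [0] * (len(tokens_a) + 2)
    # sentence B and the final [SEP] are mapped to segment 1
    token_type_ids += [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(["i", "love", "paris"], ["birds", "fly"])
print(tokens)
print(token_type_ids)
```

With a single sentence there is no second segment, which is why every entry is 0 in our example.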


Now, we need to create the attention mask. We know that an attention mask is used to differentiate the actual tokens from the [PAD] tokens. It maps all the actual tokens to 1 and the [PAD] tokens to 0. Suppose the required length of our tokens list is 5. Our tokens list already has 5 tokens, so we don't have to add any [PAD] tokens. Then our attention mask will become:


attention_mask = [1, 1, 1, 1, 1]


That's it. But instead of doing all the above steps manually, our tokenizer will do these steps for us. We just need to pass the sentence to the tokenizer as shown below:


In [None]:
tokenizer('I love Paris')

{'input_ids': [101, 1045, 2293, 3000, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}


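With the tokenizer, we can also pass any number of sentences and perform padding dynamically. To do that, we set padding to True, which pads every sentence to the length of the longest sentence in the batch. We can also set the maximum sequence length, max_length; note that it takes effect only together with truncation=True or padding='max_length'. For instance, as shown below, we pass three sentences and set max_length to 5, which here equals the length of the longest sentence:


In [None]:
tokenizer(['I love Paris', 'birds fly', 'snow fall'], padding=True, max_length=5)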



{'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]}
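
The padding behaviour seen in this output can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation; the helper name pad_batch is hypothetical, and the sketch assumes that [PAD] has the id 0, as in the output above:

```python
PAD_ID = 0  # id of the [PAD] token, as seen in the tokenizer output above


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks an actual token, 0 marks a [PAD] token
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1045, 2293, 3000, 102],  # 'I love Paris'
    [101, 5055, 4875, 102],        # 'birds fly'
    [101, 4586, 2991, 102],        # 'snow fall'
])
print(batch)
```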


That's it, with the tokenizer, we can easily preprocess our dataset. So we define a function called preprocess for processing the dataset as shown below:


In [None]:
def preprocess(data):
    return tokenizer(data['text'], padding=True, truncation=True)
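
Movie reviews can be long, which is why the function also sets truncation=True. A rough sketch of what truncation does, assuming the 512-token limit of bert-base-uncased and the [CLS] and [SEP] ids 101 and 102 shown earlier (the helper name truncate_ids is hypothetical):

```python
MAX_LEN = 512              # maximum sequence length of bert-base-uncased
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP], as shown earlier


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Cut a review's token ids so that, with [CLS] and [SEP], it fits max_len.

    token_ids holds the ids of the review text without special tokens.
    """
    kept = token_ids[: max_len - 2]  # reserve two positions for [CLS] and [SEP]
    return [CLS_ID] + kept + [SEP_ID]


long_review = list(range(1000, 1600))  # 600 made-up token ids
short_review = [1045, 2293, 3000]

print(len(truncate_ids(long_review)))
print(truncate_ids(short_review))
```

Tokens beyond the limit are simply dropped, so the model never sees the end of a very long review.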


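Now, we preprocess the train and test sets using the preprocess function. We pass batch_size=len(train_set) (and likewise for the test set) so that each set is tokenized as a single batch and all of its examples are padded to the same length. The cell below first pins dill to version 0.3.4, presumably because the older nlp release used here does not work with newer dill versions:


In [None]:
!pip install dill==0.3.4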



In [None]:
train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))



Next, we use the set_format function to select the columns we need in our dataset and the format in which we need them, as shown below:


In [None]:
train_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

In [None]:
train_set

Dataset(features: {'label': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 70)

That's it. Now that we have the dataset ready, let's train the model.

## Training the model


Define the batch size and the number of epochs:

In [None]:
batch_size = 8
epochs = 2


Define the warmup steps and weight decay:

In [None]:
warmup_steps = 500
weight_decay = 0.01
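
It is worth checking how these values interact with the size of our dataset. The sketch below is plain Python, not the Trainer's own code. It assumes a single device, no gradient accumulation, the default TrainingArguments learning rate of 5e-5, and a linear warmup followed by a linear decay; the function names are chosen for illustration:

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps for a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total = total_training_steps(num_examples=70, batch_size=8, epochs=2)
peak_lr = 5e-5  # assumed: default learning_rate of TrainingArguments
final_lr = linear_warmup_lr(total, peak_lr, warmup_steps=500, total_steps=total)

print(total)     # 18
print(final_lr)  # a small fraction of peak_lr
```

With 70 training examples, training lasts only 18 steps, so under these assumptions a warmup of 500 steps never finishes and the learning rate stays far below its peak. For a dataset this small, a much smaller warmup_steps value is probably more appropriate.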


Define the training arguments:

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    #evaluate_during_training=True,
    logging_dir='./logs',
)



Now define the trainer:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=test_set
)
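
A Trainer defined this way reports only the loss during evaluation. As a minimal sketch, assuming we also want the accuracy, we could write a metric function and pass it to the Trainer through its compute_metrics argument. The function below expects an object with predictions (the logits) and label_ids attributes; FakeEvalPrediction is a hypothetical stand-in used only to try the function without training:

```python
from collections import namedtuple

import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from the logits and the true labels."""
    # predictions has shape (num_examples, num_labels); pick the highest-scoring label
    preds = np.argmax(eval_pred.predictions, axis=-1)
    accuracy = float((preds == eval_pred.label_ids).mean())
    return {"accuracy": accuracy}


# Hypothetical stand-in for the object the Trainer passes to compute_metrics
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])

demo = FakeEvalPrediction(
    predictions=np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0]]),
    label_ids=np.array([0, 1, 1]),
)
metrics = compute_metrics(demo)
print(metrics)
```

To use it, we would add compute_metrics=compute_metrics to the Trainer call; the evaluation output should then include the accuracy alongside the loss.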



Start training the model:

In [None]:
trainer.train()


TrainOutput(global_step=18, training_loss=0.6738674375745985, metrics={'train_runtime': 856.8889, 'train_samples_per_second': 0.163, 'train_steps_per_second': 0.021, 'total_flos': 36835547750400.0, 'train_loss': 0.6738674375745985, 'epoch': 2.0})


After training, we can evaluate the model using the evaluate function:

In [None]:
trainer.evaluate()

{'eval_loss': 0.6962141394615173,
 'eval_runtime': 68.9175,
 'eval_samples_per_second': 0.435,
 'eval_steps_per_second': 0.058,
 'epoch': 2.0}
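
To put the evaluation loss into perspective, we can compare it with the loss of a model that is completely undecided between the two classes. The sketch below assumes the cross-entropy loss, which BertForSequenceClassification uses for classification, and a binary task; the function name is chosen for illustration:

```python
import math


def cross_entropy(prob_of_true_label):
    """Cross-entropy loss for one example, given the probability of its true label."""
    return -math.log(prob_of_true_label)


# A model that assigns probability 0.5 to each of the two classes
chance_loss = cross_entropy(0.5)
print(round(chance_loss, 4))  # 0.6931
```

Our eval_loss of about 0.696 is very close to this value, which suggests that after this short run on 70 examples the model is still predicting roughly at chance level. Training on more data or for more steps would likely be needed to see a real improvement.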


In this way, we can finetune the pre-trained BERT model. Now that we have learned how to finetune BERT for a text classification task, in the next section, let's see how to finetune it for the natural language inference task.