<a href="https://colab.research.google.com/github/rahiakela/getting-started-with-google-bert/blob/main/3-getting-hands-on-with-BERT/3_fine_tuning_BERT_for_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tuning BERT for sentiment analysis

Pre-training BERT from scratch is computationally expensive, so we can instead download a pre-trained BERT model and use it. Google has open-sourced its pre-trained BERT models, and we can download them from Google Research's GitHub repository: https://github.com/google-research/bert. The models are released in various configurations.

The pre-trained model is also available in the BERT-uncased and BERT-cased formats. In BERT-uncased, all the tokens are lowercased, but in BERT-cased, the tokens are not lowercased and are used directly for training. 

The BERT-uncased model is the one that is most commonly used, but if we are working on certain tasks such as **Named Entity Recognition (NER)** where we have to preserve the case, then we should use the BERT-cased model. Along with these, Google also released pre-trained BERT models trained using the whole word masking method.

We can use the pre-trained model in the following two ways:
- As a feature extractor by extracting embeddings
- By fine-tuning the pre-trained BERT model on downstream tasks such as text classification, question-answering, and more

## Setup

**Hugging Face transformers**

Hugging Face is an organization that aims to democratize AI through natural language. Their open-source library, `transformers`, is very popular in the NLP community and is useful for many NLP and NLU tasks. It includes thousands of pre-trained models in more than 100 languages. One of the many advantages of the `transformers` library is that it is compatible with both PyTorch and TensorFlow.

In [1]:
%%capture
!pip install torch==1.4.0
!pip install nlp==0.4.0
!pip install transformers==3.5.1

In [2]:
import torch
import numpy as np

from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from nlp import load_dataset

## Fine-tuning BERT for downstream tasks

Now, let's learn how to fine-tune the pre-trained BERT model for downstream tasks. Note that fine-tuning implies that we are not training BERT from scratch; instead, we are using the pre-trained BERT and updating its weights according to our task.

**Text classification**

Let's learn how to fine-tune the pre-trained BERT model for a text classification task. Say we are performing sentiment analysis. In the sentiment analysis task, our goal is to classify whether a sentence is positive or negative. Suppose we have a dataset containing sentences along with their labels.

Consider a sentence: `I love Paris`. First, we tokenize the sentence, add the `[CLS]` token at the beginning, and add the `[SEP]` token at the end of the sentence. Then, we feed the tokens as an input to the pre-trained BERT model and get the embeddings of all the tokens.

Next, we ignore the embedding of all other tokens and take only the embedding of `[CLS]` token, which is $R_{[CLS]}$. The embedding of the `[CLS]` token will hold the aggregate representation of the sentence. We feed $R_{[CLS]}$ to a classifier (feed-forward network with softmax function) and train the classifier to perform sentiment analysis.
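
The classification step can be sketched in plain Python. This is only a minimal illustration, not the library's implementation: the 4-dimensional `r_cls` vector, the weight matrix `W`, and the bias `b` are made-up values (BERT-base produces 768-dimensional embeddings, and the real weights are learned during training).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(r_cls, W, b):
    """Feed-forward layer followed by softmax: softmax(W @ r_cls + b)."""
    logits = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, r_cls)) + b_i
        for row, b_i in zip(W, b)
    ]
    return softmax(logits)

# Hypothetical [CLS] embedding (illustrative values only)
r_cls = [0.2, -0.1, 0.4, 0.3]
# Hypothetical classifier weights: one row per class (negative, positive)
W = [[-0.5, 0.1, -0.3, -0.2],
     [0.5, -0.1, 0.3, 0.2]]
b = [0.0, 0.0]

probs = classify(r_cls, W, b)
print(probs)  # [P(negative), P(positive)]
```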

How does fine-tuning the pre-trained BERT model differ from using the pre-trained BERT model as a feature extractor?

We learned that after extracting the embedding $R_{[CLS]}$ of a sentence, we feed $R_{[CLS]}$ to a classifier and train the classifier to perform classification. Similarly, during fine-tuning, we feed the embedding $R_{[CLS]}$ to a classifier and train the classifier to perform classification.

The difference is that when we fine-tune the pre-trained BERT model, we update the weights of the model along with the classifier. But when we use the pre-trained BERT model as a feature extractor, we update only the weights of the classifier and not those of the pre-trained BERT model.

During fine-tuning, we can adjust the weights of the model in the following two ways:

- Update the weights of the pre-trained BERT model along with the classification layer.
- Update only the weights of the classification layer and not the pre-trained BERT model. When we do this, it becomes the same as using the pre-trained BERT model as a feature extractor.
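
The two options differ only in which parameters the optimizer is allowed to change. The following is a schematic sketch in plain Python, with hypothetical parameter names and made-up gradient values. In PyTorch, the second option is commonly achieved by setting `requires_grad = False` on the encoder's parameters.

```python
def sgd_step(params, grads, lr, frozen_prefixes=()):
    """Return updated parameters, skipping names that start with a frozen prefix."""
    updated = {}
    for name, value in params.items():
        if name.startswith(tuple(frozen_prefixes)):
            updated[name] = value                     # frozen: left unchanged
        else:
            updated[name] = value - lr * grads[name]  # trainable: gradient step
    return updated

# Hypothetical scalar "parameters" standing in for whole weight tensors
params = {"bert.layer0.weight": 0.50, "classifier.weight": -0.20}
grads = {"bert.layer0.weight": 0.10, "classifier.weight": 0.30}

# Option 1: update BERT and the classification layer together
fine_tuned = sgd_step(params, grads, lr=0.1)

# Option 2: freeze BERT and update only the classification layer
feature_extractor = sgd_step(params, grads, lr=0.1, frozen_prefixes=("bert.",))

print(fine_tuned)
print(feature_extractor)
```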

<img src='https://github.com/rahiakela/img-repo/blob/master/getting-started-with-google-bert/fine-tuning-text-classification.png?raw=1' width='800'/>

We feed the tokens to the pre-trained BERT model and get the embeddings of all the tokens. We take the embedding of the `[CLS]` token and feed it to a feedforward network with a softmax function and perform classification.



## Loading the model and dataset

Load the model and dataset. First, let's download and load the dataset using the nlp library:

In [None]:
!gdown https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-

dataset = load_dataset("csv", data_files="./imdbs.csv", split="train")

Let us check the datatype.

In [4]:
type(dataset)

nlp.arrow_dataset.Dataset

Next, let's split the dataset into train and test set.

In [5]:
dataset = dataset.train_test_split(test_size=0.3)

dataset

{'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 30),
 'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 70)}

Now, we create the train and test sets.

In [6]:
train_set = dataset["train"]
test_set = dataset["test"]

print(train_set.shape, test_set.shape)

(70, 2) (30, 2)
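
Conceptually, a 70/30 split shuffles the rows and holds out 30% of them for testing. The following plain-Python sketch shows the idea. The helper `simple_split` and its `seed` argument are hypothetical and are not how the `nlp` library implements `train_test_split`.

```python
import random

def simple_split(rows, test_size=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for the 100 rows of the dataset
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))  # 70 30
```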


Next, let's download and load the pre-trained BERT model. In this example, we use the pre-trained `bert-base-uncased` model. Since we are performing sequence classification, we use the `BertForSequenceClassification` class:

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

Next, we download and load the tokenizer that was used to pre-train the `bert-base-uncased` model.

We create the tokenizer using the `BertTokenizerFast` class instead of `BertTokenizer`. `BertTokenizerFast` is backed by Hugging Face's Rust-based `tokenizers` library, which makes it considerably faster than `BertTokenizer`, particularly when tokenizing in batches.

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

## Preprocessing the dataset

We can preprocess the dataset quickly using our tokenizer. For example, consider the sentence `I love Paris`.

First, we tokenize the sentence and add a `[CLS]` token at the beginning and a `[SEP]` token at the end.

```
tokens = [[CLS], I, love, Paris, [SEP]]
```

Next, we map the tokens to the unique input IDs (token IDs). Suppose the following are the unique input IDs (token IDs):

```
input_ids = [101, 1045, 2293, 3000, 102]
```

Then, we need to add the segment IDs (token type IDs). Wait, what are segment IDs?

Suppose we have two sentences in the input. In that case, segment IDs are used to distinguish one sentence from the other. All the tokens from the first sentence will be mapped to 0 and all the tokens from the second sentence will be mapped to 1. Since here we have only one sentence, all the tokens will be mapped to 0 as shown here:

```
token_type_ids = [0, 0, 0, 0, 0]
```

Now, we need to create the attention mask. We know that an attention mask is used to differentiate the actual tokens from the `[PAD]` tokens. It maps all the actual tokens to 1 and the `[PAD]` tokens to 0. Suppose the sequence length should be 5. Our tokens list already has five tokens, so we don't have to add a `[PAD]` token. Our attention mask will become the following:

```
attention_mask = [1, 1, 1, 1, 1]
```
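
The manual steps above can be sketched in plain Python. This is a simplified illustration: the toy `vocab` contains only the tokens from this example (with the IDs shown above), and whitespace splitting stands in for BERT's WordPiece tokenization. The `encode` helper is hypothetical.

```python
# Toy vocabulary limited to this example; [PAD] has ID 0 in bert-base-uncased
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "i": 1045, "love": 2293, "paris": 3000}

def encode(sentence, max_length):
    """Build input_ids, token_type_ids, and attention_mask for one sentence."""
    # Lowercase and split on whitespace (a stand-in for WordPiece)
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    attention_mask = [1] * len(tokens)  # 1 marks an actual token
    n_pad = max_length - len(tokens)
    tokens += ["[PAD]"] * n_pad
    attention_mask += [0] * n_pad       # 0 marks a [PAD] token
    return {
        "input_ids": [vocab[t] for t in tokens],
        "token_type_ids": [0] * len(tokens),  # single sentence: all zeros
        "attention_mask": attention_mask,
    }

encoded = encode("I love Paris", max_length=5)
print(encoded)
```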

That's it. **But instead of doing all the aforementioned steps manually, our tokenizer will do these steps for us.**

In [9]:
tokenizer("I love Paris")

{'input_ids': [101, 1045, 2293, 3000, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

As we can see, our input sentence is tokenized and mapped to `input_ids`, `token_type_ids`, and also `attention_mask`:

```
{
  'input_ids': [101, 1045, 2293, 3000, 102],
  'token_type_ids': [0, 0, 0, 0, 0],
  'attention_mask': [1, 1, 1, 1, 1]
}
```

With the tokenizer, we can also pass any number of sentences and perform padding dynamically. To do that, we need to set `padding` to `True` and also specify the maximum sequence length.

For instance, we pass three sentences and we set the maximum sequence length, `max_length`, to 5:

In [10]:
tokenizer(
    [
      "I love Paris", 
      "birds fly",
      "snow fall"        
    ],
    padding=True,
    max_length=5
)

{'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]}

As we can see, all the sentences are mapped to `input_ids`, `token_type_ids`, and `attention_mask`. The second and third sentences have only two tokens, and after adding `[CLS]` and `[SEP]`, they have four tokens. With `padding=True`, shorter sequences are padded to the length of the longest sequence in the batch (five tokens here, which matches `max_length`), so one `[PAD]` token (ID 0) is added to the second and third sentences. That's why the attention masks of the second and third sentences end in 0:

```
{
  'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]],
  'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
  'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]
}
```
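
Padding to the longest sequence in a batch can be sketched as follows. The `pad_batch` helper is hypothetical and only mimics the behavior seen in the output above; the input IDs are the ones the tokenizer returned.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 1045, 2293, 3000, 102],  # "I love Paris"
    [101, 5055, 4875, 102],        # "birds fly"
    [101, 4586, 2991, 102],        # "snow fall"
]
padded = pad_batch(batch)
print(padded)
```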

That's it – **with the tokenizer, we can easily preprocess our dataset.** 

So, we define a function called preprocess to process the dataset as follows:

In [11]:
def preprocess(data):
  return tokenizer(data["text"], padding=True, truncation=True)

Now, we preprocess the train and test sets using the preprocess function:

In [12]:
train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))

Next, we use the `set_format` method to select the columns that we need in our dataset and the format we need them in.

In [13]:
train_set.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_set.set_format("torch", columns=["input_ids", "attention_mask", "label"])

That's it. Now that we have the dataset ready, let's train the model.

## Training the model

In [None]:
# Define the batch size and the number of epochs
batch_size = 8
epochs = 2

# Define the warmup steps and weight decay
warmup_steps = 500
weight_decay = 0.01

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    evaluate_during_training=True,
    logging_dir="./logs"
)

Now define the trainer.

In [15]:
trainer = Trainer(model=model, args=training_args, train_dataset=train_set, eval_dataset=test_set)

Start training the model.

In [16]:
trainer.train()


TrainOutput(global_step=18, training_loss=0.6994585990905762)
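
The `global_step=18` value follows from the dataset size: 70 training examples at a batch size of 8 give 9 steps per epoch, or 18 steps over 2 epochs. With `warmup_steps = 500`, a linear warmup would therefore never finish. The sketch below works through the arithmetic. It assumes a single device, a base learning rate of 5e-5, and a schedule of linear warmup followed by linear decay; these are assumptions about the defaults, not values taken from the run above, and `linear_warmup_lr` is a hypothetical helper.

```python
import math

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

num_examples, batch_size, epochs = 70, 8, 2
steps_per_epoch = math.ceil(num_examples / batch_size)  # 9
total_steps = steps_per_epoch * epochs                  # 18

base_lr = 5e-5  # assumed default learning rate
peak_lr = max(
    linear_warmup_lr(s, base_lr, warmup_steps=500, total_steps=total_steps)
    for s in range(total_steps + 1)
)
print(total_steps, peak_lr)
```

Under these assumptions, the learning rate reaches only about 1.8e-6 by the final step, so the weights barely move. For a dataset this small, a much smaller `warmup_steps` value may be worth trying.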

After training, we can evaluate the model using the `evaluate` method.

In [17]:
trainer.evaluate()

{'epoch': 2.0, 'eval_loss': 0.6513782143592834}
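
The evaluation above reports only the loss. `Trainer` also accepts a `compute_metrics` callable that receives an object with `predictions` (the logits) and `label_ids`, which can be used to report accuracy. The sketch below is one possible version: the `FakeEvalPrediction` stand-in and its values are made up so that the function can be tried without running the model.

```python
from collections import namedtuple

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

def compute_metrics(eval_pred):
    """Callable in the shape Trainer expects: takes predictions and label_ids."""
    return {"accuracy": accuracy_from_logits(eval_pred.predictions,
                                             eval_pred.label_ids)}

# Hypothetical stand-in for the object Trainer passes in (made-up logits)
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])
example = FakeEvalPrediction(
    predictions=[[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [0.9, 0.8]],
    label_ids=[1, 0, 0, 0],
)
print(compute_metrics(example))  # {'accuracy': 0.75}
```

Assuming this wiring, the function would be supplied as `Trainer(..., compute_metrics=compute_metrics)` so that `trainer.evaluate()` includes the metric alongside the loss.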

In this way, we can fine-tune the pre-trained BERT model.

## Natural language inference

In NLI, the goal of our model is to determine whether a hypothesis is an entailment (true), a contradiction (false), or undetermined (neutral) given a premise. Let's learn how to perform NLI by fine-tuning BERT.

Consider the sample dataset; as we can see, we have a premise and a hypothesis with a label indicating whether they are entailment, contradiction, or undetermined:

<img src='https://github.com/rahiakela/img-repo/blob/master/getting-started-with-google-bert/sample-NLI-dataset.png?raw=1' width='800'/>

Now, the goal of our model is to determine whether a sentence pair (premise-hypothesis pair) is an entailment, a contradiction, or undetermined. 


Let's understand how to do this with an example. Consider the following premise-hypothesis pair:

```
Premise: He is playing
Hypothesis: He is sleeping
```

First, we tokenize the sentence pair, then add a `[CLS]` token at the beginning of the first sentence and a `[SEP]` token at the end of every sentence.

```
tokens = [[CLS], He, is, playing, [SEP], He, is, sleeping, [SEP]]
```
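
For a sentence pair, the segment IDs (token type IDs) mark which sentence each token belongs to: 0 for the premise, including `[CLS]` and the first `[SEP]`, and 1 for the hypothesis and the final `[SEP]`. The sketch below builds them by hand; `build_pair_inputs` is a hypothetical helper, and the word-level tokens are a simplification of WordPiece. In practice, calling the tokenizer with both sentences, as in `tokenizer(premise, hypothesis)`, produces these fields.

```python
def build_pair_inputs(premise_tokens, hypothesis_tokens):
    """Join two token lists with special tokens and assign segment IDs."""
    first = ["[CLS]"] + premise_tokens + ["[SEP]"]
    second = hypothesis_tokens + ["[SEP]"]
    return {
        "tokens": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }

pair = build_pair_inputs(["He", "is", "playing"], ["He", "is", "sleeping"])
print(pair["tokens"])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```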

Now, we feed the tokens to the pre-trained BERT model and get the embedding of each token. We learned that the representation of the `[CLS]` token holds the aggregate representation.

So, we take the representation of the `[CLS]` token, which is $R_{[CLS]}$, and feed it to a classifier (feedforward + softmax), which returns the probability of the sentence pair being a contradiction, an entailment, or neutral. Our results will not be accurate in the initial iteration, but over the course of multiple iterations, we will get better results.
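
To illustrate how the predictions improve over iterations, here is a toy three-class classifier trained with gradient descent on a single example. It is only a sketch: the `r_cls` vector is made up, only the classifier weights are updated (a real fine-tuning run would also update BERT), and `train_classifier` is a hypothetical helper.

```python
import math

LABELS = ["entailment", "contradiction", "neutral"]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_classifier(r_cls, label, steps=20, lr=0.5):
    """Gradient descent on the cross-entropy loss of a linear softmax classifier."""
    W = [[0.0] * len(r_cls) for _ in LABELS]  # one weight row per class
    losses = []
    for _ in range(steps):
        logits = [sum(w * x for w, x in zip(row, r_cls)) for row in W]
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
        for k in range(len(LABELS)):
            # Gradient of the loss w.r.t. row k is (p_k - y_k) * r_cls
            err = probs[k] - (1.0 if k == label else 0.0)
            W[k] = [w - lr * err * x for w, x in zip(W[k], r_cls)]
    return W, losses

r_cls = [0.2, -0.1, 0.4, 0.3]  # hypothetical [CLS] representation
W, losses = train_classifier(r_cls, label=LABELS.index("contradiction"))
print(losses[0], losses[-1])  # the loss starts at ln(3) and decreases
```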

<img src='https://github.com/rahiakela/img-repo/blob/master/getting-started-with-google-bert/fine-tuning-NLI.png?raw=1' width='800'/>

In this way, we have learned how to fine-tune BERT for NLI.