# Sentiment analysis by fine-tuning BERT on the IMDb dataset

We will prepare and tokenize the IMDb movie review dataset and fine-tune a distilled BERT model to perform sentiment classification.

The DistilBERT model we are using is a lightweight transformer model created by distilling a pre-trained BERT base model. The original uncased BERT base model contains roughly 110 million parameters, while DistilBERT has 40 percent fewer. DistilBERT also runs 60 percent faster while preserving 95 percent of BERT’s performance on the GLUE language understanding benchmark.
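
As a quick sanity check on these figures, the arithmetic can be sketched in a few lines. The BERT base count used here is the rounded value quoted above, so the result is only approximate.

```python
# Rough parameter count implied by the figures quoted in the text.
bert_base_params = 110_000_000   # rounded count for BERT base (from the text)
reduction_percent = 40           # DistilBERT has 40 percent fewer parameters

# Integer arithmetic avoids floating-point rounding in the percentage.
distilbert_params = bert_base_params * (100 - reduction_percent) // 100
print(f"approx. DistilBERT parameters: {distilbert_params:,}")
```

This gives roughly 66 million parameters for DistilBERT.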

Below, we import all the packages we will use.

In [1]:
import time
import pandas as pd
import torch
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

We will set some variables here that will be used later on.

In [3]:
#specify some general settings
torch.backends.cudnn.deterministic = True
random_seed = 42
torch.manual_seed(random_seed) #for reproducibility

### Data preprocessing

Now, we can load the IMDb dataset, which consists of $50000$ reviews, each of which is labeled as having positive sentiment or not.

In [4]:
#load IMDb dataset
df = pd.read_csv('movie_data.csv')
df.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


Next, we split the dataset into separate training, validation, and test sets. We use $70\%$ of the reviews for training, $10\%$ for validation, and the remaining $20\%$ for testing.

In [5]:
#create 70-10-20 training-validation-test split
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values
valid_texts = df.iloc[35000:40000]['review'].values
valid_labels = df.iloc[35000:40000]['sentiment'].values
test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values
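
The hard-coded boundaries above can be derived from the percentages. The sketch below is a hypothetical helper (`split_bounds` is not part of the notebook) that computes the row indices for a sequential split. Note that a sequential slice only yields representative splits if the rows of `movie_data.csv` are already in random order, which is assumed here rather than checked.

```python
def split_bounds(n_rows, train_pct, valid_pct):
    """Return (train_end, valid_end) row indices for a sequential split."""
    if train_pct < 0 or valid_pct < 0 or train_pct + valid_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    # Integer arithmetic keeps the boundaries exact.
    train_end = n_rows * train_pct // 100
    valid_end = n_rows * (train_pct + valid_pct) // 100
    return train_end, valid_end

train_end, valid_end = split_bounds(50_000, 70, 10)
sizes = (train_end, valid_end - train_end, 50_000 - valid_end)
print(train_end, valid_end, sizes)
```

For $50000$ rows this reproduces the boundaries $35000$ and $40000$ used in the cell above, giving splits of $35000$, $5000$, and $10000$ reviews.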

Now, we tokenize the texts using the tokenizer that accompanies the pre-trained model. To do so, we employ the `DistilBertTokenizerFast` class, which first splits text on punctuation and whitespace, then splits each word into subword units (often referred to as wordpieces).
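
To make the subword step concrete, here is a minimal sketch of greedy longest-match-first wordpiece splitting for a single word. It is a toy illustration, not the library's implementation: the function `wordpiece` and the vocabulary `toy_vocab` are made up for this example, and the real tokenizer also lowercases text and adds special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the model's actual vocab.txt.
toy_vocab = {"un", "##watch", "##able", "watch", "film", "##s"}
print(wordpiece("unwatchable", toy_vocab))
print(wordpiece("films", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

With this toy vocabulary, a word such as "unwatchable" is broken into a leading piece and two continuation pieces, while a word with no matching pieces falls back to the unknown token.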

In [6]:
#load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

#tokenize training, validation, and test texts
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
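
The `truncation=True` and `padding=True` arguments cut overly long sequences down to the model's maximum length and pad shorter ones to the longest sequence in the list passed to the tokenizer. The sketch below mimics that behaviour on plain lists of token ids. It is a simplified illustration under stated assumptions: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and special tokens (which the real tokenizer preserves when truncating) are ignored.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length is tiny here purely for illustration.
encoded = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because each split is tokenized in a single call, every review in that split ends up padded to the same length, and the attention mask is what lets the model skip the padding positions.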

We put each of the datasets (training, validation, and test) into a custom `Dataset` class and create corresponding data loaders.

In [8]:
from torch.utils.data import Dataset, DataLoader

#create custom Dataset class
class IMDbDataset(Dataset):
    def __init__(self, encodings, labels): 
        self.encodings = encodings 
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        
        return item
    
    def __len__(self):
        return len(self.labels)
    
#create Dataset object for each of training, validation, test sets
train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

#create DataLoader object with batch size 16 for each of training, validation, test sets
batch_size = 16
train_dl = DataLoader(train_dataset, batch_size, shuffle=True)
valid_dl = DataLoader(valid_dataset, batch_size, shuffle=False)
test_dl = DataLoader(test_dataset, batch_size, shuffle=False)
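
As a small check on what these loaders produce, the number of batches per epoch follows directly from the dataset size and the batch size. The sketch below uses a hypothetical helper, `num_batches`, and assumes the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a loader yields per pass over the data."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

train_batches = num_batches(35_000, 16)
valid_batches = num_batches(5_000, 16)
test_batches = num_batches(10_000, 16)
# Size of the final training batch when it is not dropped.
last_train_batch = 35_000 - (train_batches - 1) * 16
print(train_batches, valid_batches, test_batches, last_train_batch)
```

The training count of $2188$ matches the batch total shown in the training log further down, and the last training batch contains only $8$ reviews.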

### Loading and fine-tuning a pre-trained BERT model

Now, we will load the pre-trained DistilBERT model and fine-tune it on the dataset created above. We specify the downstream task as sequence classification by employing the `DistilBertForSequenceClassification` class. Note as well that "uncased" denotes that the model does not distinguish between upper-case and lower-case characters.

In [9]:
#load pre-trained DistilBert model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

Having loaded the model, we define the Adam optimizer we will be using, and specify that we will fine-tune the model for $3$ epochs. We also define a function that allows us to compute classification accuracy batch by batch to work around memory limitations.

In [10]:
#define Adam optimizer with learning rate 0.00005
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

#set number of epochs
num_epochs = 3

#define function to compute accuracy batch by batch
def compute_accuracy(model, data_loader, device):

    with torch.no_grad(): #don't compute gradients
        correct_pred, num_examples = 0, 0
        for batch_idx, batch in enumerate(data_loader):

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs['logits']
            predicted_labels = torch.argmax(logits, 1)
            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum() 
    
    return correct_pred.float()/num_examples * 100
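
The function above keeps running counts of correct predictions and examples rather than averaging per-batch accuracies. The difference matters when the last batch is smaller than the others. The following sketch illustrates this with plain Python lists; `accuracy_from_batches` and the label values are made up for the example.

```python
def accuracy_from_batches(batches):
    """batches: iterable of (predicted_labels, true_labels) list pairs."""
    correct, total = 0, 0
    for predicted, actual in batches:
        correct += sum(p == a for p, a in zip(predicted, actual))
        total += len(actual)
    return correct / total * 100

batches = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),  # 3 of 4 correct
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # 4 of 4 correct
    ([1], [0]),                    # 0 of 1 correct (short final batch)
]
pooled = accuracy_from_batches(batches)
# Naive alternative: average the accuracy of each batch separately.
mean_of_batches = sum(accuracy_from_batches([b]) for b in batches) / len(batches)
print(f"pooled: {pooled:.2f}%, mean of batches: {mean_of_batches:.2f}%")
```

Pooling the counts weights every example equally, whereas averaging per-batch accuracies lets the single example in the short final batch count as much as a full batch.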

Finally, we execute the fine-tuning loop (warning: this took a long time to run on my CPU).

In [15]:
#track training time
start_time = time.time()

for epoch in range(num_epochs):
    
    model.train()
    
    for batch_idx, batch in enumerate(train_dl):
        
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        #forward pass
        outputs = model(input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        loss, logits = outputs['loss'], outputs['logits']
        
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        #logging
        if not batch_idx % 250:
            print(f'Epoch: {epoch+1:04d}/{num_epochs:04d}'
                  f' | Batch'
                  f'{batch_idx:04d}/'
                  f'{len(train_dl):04d} | '
                  f'Loss: {loss:.4f}')
    model.eval()
    
    with torch.set_grad_enabled(False): 
        print(f'Training accuracy: '
              f'{compute_accuracy(model, train_dl, device):.2f}%'
              f'\n Valid accuracy: '
              f'{compute_accuracy(model, valid_dl, device):.2f}%')
    
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')

print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_dl, device):.2f}%')

Epoch: 0001/0003 | Batch0000/2188 | Loss: 0.7027
Epoch: 0001/0003 | Batch0250/2188 | Loss: 0.1315
Epoch: 0001/0003 | Batch0500/2188 | Loss: 0.1684
Epoch: 0001/0003 | Batch0750/2188 | Loss: 0.3161
Epoch: 0001/0003 | Batch1000/2188 | Loss: 0.1067
Epoch: 0001/0003 | Batch1250/2188 | Loss: 0.5375
Epoch: 0001/0003 | Batch1500/2188 | Loss: 0.2163
Epoch: 0001/0003 | Batch1750/2188 | Loss: 0.1709
Epoch: 0001/0003 | Batch2000/2188 | Loss: 0.1655
Training accuracy: 96.34%
 Valid accuracy: 92.66%
Time elapsed: 577.81 min
Epoch: 0002/0003 | Batch0000/2188 | Loss: 0.0816
Epoch: 0002/0003 | Batch0250/2188 | Loss: 0.0810
Epoch: 0002/0003 | Batch0500/2188 | Loss: 0.0603
Epoch: 0002/0003 | Batch0750/2188 | Loss: 0.0445
Epoch: 0002/0003 | Batch1000/2188 | Loss: 0.1183
Epoch: 0002/0003 | Batch1250/2188 | Loss: 0.0400
Epoch: 0002/0003 | Batch1500/2188 | Loss: 0.3024
Epoch: 0002/0003 | Batch1750/2188 | Loss: 0.3865
Epoch: 0002/0003 | Batch2000/2188 | Loss: 0.0166
Training accuracy: 98.73%
 Valid accuracy: 
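
The first logged loss, $0.7027$, is close to $\ln 2 \approx 0.693$, which is what a two-class classifier scores when it has no preference between the classes. The sketch below computes the softmax cross-entropy for a single example in plain Python to show this. It assumes that the loss returned by the model when integer `labels` are passed is the standard cross-entropy over the two logits; the function `cross_entropy` and the logit values are illustrative only.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example; label is the true class index."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

uninformed = cross_entropy([0.0, 0.0], 1)          # no preference: ln 2
confident_right = cross_entropy([-2.0, 2.0], 1)    # confident and correct
confident_wrong = cross_entropy([-2.0, 2.0], 0)    # confident and wrong
print(f"{uninformed:.4f} {confident_right:.4f} {confident_wrong:.4f}")
```

A freshly initialized classification head therefore starts near $\ln 2$, and the much smaller losses later in the log reflect increasingly confident correct predictions.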

As we can see above, the model overfits the training data to some degree, but it still achieves a classification accuracy of $92.35\%$ on the test set, which is higher than that of either of the other two methods (logistic regression with bag-of-words, and an RNN). Notably, validation accuracy decreased slightly over the three epochs. With more computational power, it would be interesting to train the model for more epochs to see whether validation and test accuracy improve with further training.