# Fine-tuning DistilBERT for movie review classification

Docs: https://huggingface.co/docs/transformers/model_doc/distilbert 

> The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than bert-base-uncased, runs 60% faster while preserving over 95% of BERT’s performances as measured on the GLUE language understanding benchmark.

In [1]:
%pip install watermark
%pip install torch
%pip install transformers


In [2]:
%load_ext watermark
%watermark -a 'rpharale' -v -p torch,transformers

Author: rpharale

Python implementation: CPython
Python version       : 3.8.10
IPython version      : 7.13.0

torch       : 1.12.1
transformers: 4.25.1



In [3]:
import torch
import transformers
import pandas as pd
import numpy as np
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import time
import os
import requests
import zipfile
import io
import urllib.request

## Env Settings

In [4]:
torch.backends.cudnn.deterministic = True
RANDOM_SEED = 142
torch.manual_seed(RANDOM_SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

NUM_EPOCHS = 3
model_checkpoint = "distilbert-base-uncased"

cuda
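The cell above seeds only PyTorch's generator. As a minimal sketch, assuming one also wants Python's `random` module and NumPy seeded, the settings could be gathered in a single helper. The name `seed_everything` is hypothetical and not part of any library used here.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Request deterministic cuDNN kernels; this can cost some speed.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding with the same value should reproduce the same random draw.
seed_everything(142)
first = torch.rand(3)
seed_everything(142)
second = torch.rand(3)
print(torch.equal(first, second))
```

Even with these settings, some GPU operations may remain non-deterministic, so small run-to-run differences are still possible.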


## Fetch Dataset

The original IMDB movie review dataset: https://ai.stanford.edu/~amaas/data/sentiment/ \
A CSV version of the same dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download \
The cell below downloads a zipped copy of the CSV from GitHub and extracts it.

In [5]:
url = "https://raw.githubusercontent.com/rpharale/data/main/dataset/NLP/IMDB_Movie_Review/IMDB_Dataset.csv.zip"
csv_filepath = 'IMDB_Dataset.csv'

filename = os.path.basename(url)

# Download the zip file
urllib.request.urlretrieve(url, filename)

# Extract the CSV from the archive into the working directory
with zipfile.ZipFile(filename, 'r') as f_zip:
    with open(csv_filepath, 'wb') as f_csv:
        f_csv.write(f_zip.read(csv_filepath))
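The cell above writes the archive to disk before extracting the CSV. A minimal sketch, assuming the archive fits in memory, reads one member straight from the downloaded bytes instead. The helper `read_zip_member` is a hypothetical name, and the example builds a tiny archive in memory so it runs without network access.

```python
import io
import zipfile


def read_zip_member(zip_bytes: bytes, member: str) -> bytes:
    """Return the raw bytes of one file stored inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        if member not in archive.namelist():
            raise KeyError(f"{member!r} not found in archive")
        return archive.read(member)


# Build a tiny archive in memory so the example runs offline.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("IMDB_Dataset.csv", "review,sentiment\ngreat film,positive\n")

csv_bytes = read_zip_member(buffer.getvalue(), "IMDB_Dataset.csv")
print(csv_bytes.decode("utf-8").splitlines()[0])
```

With the real download, the bytes could come from `urllib.request.urlopen(url).read()`, and `pd.read_csv(io.BytesIO(csv_bytes))` would load them into a DataFrame.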

In [6]:
df = pd.read_csv(csv_filepath)

In [7]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [8]:
df.loc[df['sentiment'] == 'positive', 'sentiment'] = 1
df.loc[df['sentiment'] == 'negative', 'sentiment'] = 0
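The two `.loc` assignments above swap the strings for integers, but the column typically keeps its `object` dtype. A minimal sketch of an alternative, assuming an explicit label dictionary is acceptable, uses `Series.map` and fails loudly on unexpected values. `LABEL_TO_ID` and `toy_df` are illustrative names, not part of the notebook.

```python
import pandas as pd

LABEL_TO_ID = {"negative": 0, "positive": 1}

toy_df = pd.DataFrame({
    "review": ["loved it", "dull and slow", "a classic"],
    "sentiment": ["positive", "negative", "positive"],
})

# map() returns NaN for any value missing from the dictionary,
# so unexpected labels are caught before the integer cast.
mapped = toy_df["sentiment"].map(LABEL_TO_ID)
if mapped.isna().any():
    raise ValueError("found a sentiment value outside LABEL_TO_ID")
toy_df["sentiment"] = mapped.astype("int64")

print(toy_df["sentiment"].tolist())
print(toy_df["sentiment"].dtype)
```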

In [9]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [10]:
df.shape

(50000, 2)

## Split the dataset into train, validation and test sets

In [11]:
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values

val_texts = df.iloc[35000:40000]['review'].values
val_labels = df.iloc[35000:40000]['sentiment'].values

test_texts = df.iloc[40000:50000]['review'].values
test_labels = df.iloc[40000:50000]['sentiment'].values
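The slices above take rows in file order (35,000 / 5,000 / 10,000). If the file were sorted by label or source, a shuffled split would be safer. A minimal sketch, assuming the same split sizes; the helper `split_indices` is a hypothetical name.

```python
import numpy as np


def split_indices(n_rows: int, n_train: int, n_val: int, seed: int):
    """Shuffle row indices once and cut them into train, val and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(50_000, 35_000, 5_000, seed=142)
print(len(train_idx), len(val_idx), len(test_idx))  # 35000 5000 10000
```

The index arrays could then be used as `df.iloc[train_idx]['review'].values`, and likewise for the labels and the other two splits.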

## Tokenization

In [12]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_checkpoint)
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
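In the output above, `101` and `102` are the ids of the special tokens `[CLS]` and `[SEP]` in the uncased BERT vocabulary. DistilBERT does not use token type ids, which is why only `input_ids` and `attention_mask` appear. The following pure-Python sketch illustrates how a sentence pair is laid out. The helper `build_pair_inputs` is hypothetical and is not how the tokenizer is implemented internally; the word-piece ids are copied from the output above.

```python
CLS_ID = 101  # [CLS] in the distilbert-base-uncased vocabulary
SEP_ID = 102  # [SEP]


def build_pair_inputs(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] and an all-ones attention mask."""
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID] + list(ids_b) + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Word-piece ids copied from the tokenizer output shown above.
first_sentence = [7592, 1010, 2023, 2028, 6251, 999]
second_sentence = [1998, 2023, 6251, 3632, 2007, 2009, 1012]

encoded = build_pair_inputs(first_sentence, second_sentence)
print(encoded["input_ids"])
print(len(encoded["attention_mask"]))  # 16
```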

In [13]:
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
val_encodings = tokenizer(list(val_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
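With `truncation=True`, sequences longer than the model's maximum length (512 tokens for this checkpoint) are cut. With `padding=True`, shorter sequences are padded to the longest one in the call, and the attention mask marks padding positions with `0`. The sketch below illustrates the idea in plain Python. It is a simplification: the helper `pad_and_truncate` is hypothetical, and unlike the real tokenizer it does not keep the closing `[SEP]` token when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7592, 102], [101, 7592, 1010, 2023, 2028, 102]]
result = pad_and_truncate(toy_batch, max_length=5)
print(result["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 1010, 2023, 2028]]
print(result["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because every review is padded to the longest one in its split, a few very long reviews make all sequences long. Padding per batch is a common alternative when memory is tight.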

In [14]:
# Printing all 35,000 encodings exceeds Jupyter's IOPub data rate limit,
# so inspect the available keys and a short slice of one example instead.
print(train_encodings.keys())
print(train_encodings['input_ids'][0][:10])


## Dataset Class and Loader

In [15]:
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels)

train_dataset = IMDBDataset(train_encodings, train_labels)
val_dataset = IMDBDataset(val_encodings, val_labels)
test_dataset = IMDBDataset(test_encodings, test_labels)
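To see what one item and one batch look like, the same dataset pattern can be exercised on a few hand-written encodings. This is a sketch under stated assumptions: `EncodedTextDataset` repeats the class above so the example is self-contained, and the toy ids and labels are made up for illustration.

```python
import torch


class EncodedTextDataset(torch.utils.data.Dataset):
    """Same pattern as IMDBDataset above, repeated so the example is self-contained."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Hand-written toy encodings standing in for the tokenizer output.
toy_encodings = {
    "input_ids": [[101, 7592, 102, 0], [101, 2023, 2028, 102], [101, 6251, 102, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_labels = [1, 0, 1]

dataset = EncodedTextDataset(toy_encodings, toy_labels)
loader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=False)

batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 4])
print(batch["labels"])           # tensor([1, 0])
```

The default collate function stacks the per-item dictionaries key by key, which is why each batch can be indexed with `batch['input_ids']`, `batch['attention_mask']` and `batch['labels']`.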

In [16]:
# Shuffle only the training data; evaluation order does not affect accuracy
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)

## Load Model

In [17]:
model = DistilBertForSequenceClassification.from_pretrained(model_checkpoint)
model.to(device)
model.train()

optim = torch.optim.Adam(model.parameters(), lr=5e-5)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classi

## Train model

In [18]:
def eval_model(model, data_loader, device):
    """Return the classification accuracy (in percent) of `model` on `data_loader`."""
    model.eval()  # disable dropout so predictions are deterministic
    correct_pred, num_examples = 0, 0

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Only the logits are needed, so labels are not passed to the model
            outputs = model(input_ids, attention_mask=attention_mask)
            pred_labels = torch.argmax(outputs['logits'], dim=1)

            correct_pred += (pred_labels == labels).sum().item()
            num_examples += len(labels)

    return correct_pred / num_examples * 100.0
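The core of the accuracy computation is taking the highest-scoring class per row and comparing it with the label. A minimal sketch on hand-written logits, assuming two classes; `batch_accuracy` is a hypothetical helper name and the values are made up for illustration.

```python
import torch


def batch_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Percentage of rows whose highest-scoring class matches the label."""
    pred_labels = logits.argmax(dim=1)
    return (pred_labels == labels).float().mean().item() * 100.0


toy_logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]])
toy_labels = torch.tensor([0, 1, 1, 1])

# Predictions are [0, 1, 0, 1], so three of four rows match.
print(batch_accuracy(toy_logits, toy_labels))  # 75.0
```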

In [19]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):

    model.train()

    for batch_idx, batch in enumerate(train_loader):

        # Move the batch to the target device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass (passing labels makes the model return the loss)
        output = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = output['loss'], output['logits']

        # Backward pass
        optim.zero_grad()
        loss.backward()
        optim.step()

        # Logging
        if batch_idx % 200 == 0:
            print(f"Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d}, "
                  f"Batch: {batch_idx+1:04d}/{len(train_loader):04d}, "
                  f"Loss: {loss.item():.4f}")

    # Evaluate in eval mode (dropout off) without tracking gradients
    model.eval()
    with torch.no_grad():
        print(f"Training accuracy: {eval_model(model, train_loader, device):.2f}%, "
              f"Val accuracy: {eval_model(model, val_loader, device):.2f}%")

    print(f"Time elapsed: {(time.time() - start_time)/60.0:.2f} min")

print(f"Total training time: {(time.time() - start_time)/60.0:.2f} min")
print(f"Test accuracy: {eval_model(model, test_loader, device):.2f}%")

Epoch: 0001/0003, Batch: 0001/2188, Loss: 0.6950
Epoch: 0001/0003, Batch: 0201/2188, Loss: 0.4441
Epoch: 0001/0003, Batch: 0401/2188, Loss: 0.2589
Epoch: 0001/0003, Batch: 0601/2188, Loss: 0.0580
Epoch: 0001/0003, Batch: 0801/2188, Loss: 0.5265
Epoch: 0001/0003, Batch: 1001/2188, Loss: 0.3484
Epoch: 0001/0003, Batch: 1201/2188, Loss: 1.0605
Epoch: 0001/0003, Batch: 1401/2188, Loss: 0.2074
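The per-batch losses above fluctuate because each value comes from a single batch of 16 reviews. The update pattern itself (`zero_grad`, `backward`, `step`) can be checked quickly on a toy problem. This is a sketch under stated assumptions: a small linear classifier and synthetic data stand in for DistilBERT and the reviews, and full-batch updates replace the mini-batches.

```python
import torch

torch.manual_seed(0)

# Toy, linearly separable data: the label is 1 when the first feature is positive.
features = torch.randn(64, 2)
labels = (features[:, 0] > 0).long()

model = torch.nn.Linear(2, 2)
optim = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

losses = []
for step in range(50):
    logits = model(features)
    loss = loss_fn(logits, labels)

    optim.zero_grad()   # clear gradients left over from the previous step
    loss.backward()     # compute gradients of the loss w.r.t. the parameters
    optim.step()        # update the parameters

    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Averaging the logged loss over a window of batches, rather than printing a single batch value, would give a smoother picture of training progress.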
                

In [20]:
# Save the model checkpoint
model.save_pretrained('./model/')
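A saved checkpoint can be restored with `from_pretrained` pointed at the same directory. The sketch below checks the round trip on a deliberately tiny, randomly initialised model so that nothing is downloaded. The config values are illustrative assumptions, not those of `distilbert-base-uncased`.

```python
import tempfile

import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny config chosen only to keep the example fast (assumed values).
tiny_config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64, num_labels=2
)
tiny_model = DistilBertForSequenceClassification(tiny_config)

with tempfile.TemporaryDirectory() as checkpoint_dir:
    tiny_model.save_pretrained(checkpoint_dir)
    reloaded = DistilBertForSequenceClassification.from_pretrained(checkpoint_dir)

reloaded.eval()  # make sure dropout is off before inference
same_weights = torch.equal(tiny_model.classifier.weight, reloaded.classifier.weight)
print(same_weights)
```

For the fine-tuned model, saving the tokenizer next to it with `tokenizer.save_pretrained('./model/')` keeps everything needed for inference in one directory.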