# GPT-2 for Sentiment Analysis on IMDb movie reviews

## Table of Contents
1. [Introduction](#Introduction)
2. [Data exploration](#Data-Exploration)
3. [Zero-shot classification](#Zero-shot-classification)

## Introduction

The [IMDb](https://ai.stanford.edu/~amaas/data/sentiment/) dataset is a binary sentiment classification dataset consisting of 50k labeled movie reviews (25k positive and 25k negative). The dataset is split into train and test sets containing 25k reviews each.

In this notebook, my goals are:
1. Understand and implement [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Run GPT-2 on the IMDb classification task.
2. Fine-tune GPT-2 for sentiment classification in under about 30 minutes on an 8GB Apple M2 MacBook Air (faster if you have an NVIDIA GPU).
3. Understand how [LoRA](https://arxiv.org/abs/2106.09685) is implemented and use it to fine-tune GPT-2 for sentiment classification.

## Data-Exploration
Get a summary of the dataset, i.e.:
1. Number of samples
2. Number of positive / negative samples
3. Length of the movie reviews



In [1]:
import pandas
import torch
from torch.utils.data import Dataset

from gpt_config import GPTConfig
from sentiment_classification.reviewsDataset import reviewsDataset
from sentiment_classification.eval import Eval
from sentiment_classification.eval_config import EvalConfig
from sentiment_classification.train import Trainer
from sentiment_classification.train_config import TrainConfig


In [None]:
# Dataset exploration

imdb_train = reviewsDataset("train", max_length=10000)
imdb_test = reviewsDataset("test", max_length=10000)


def format_data(dataset: Dataset) -> pandas.DataFrame:
    # Collect one row per review: token count, label and source file.
    # Note: the "input_ids" column stores the number of tokens, not the tokens themselves.
    data = []
    for sample in dataset:
        data.append({"input_ids": len(sample["input_ids"]),
                     "label": sample["label"],
                     "filename": sample["fpath"]})

    return pandas.DataFrame(data)

train_data = format_data(imdb_train)
test_data = format_data(imdb_test)


*Summary statistics of the dataset*

In [None]:
def summary(data: pandas.DataFrame) -> None:
    print(f"Number of reviews: {len(data)}")
    print(f"Positive Reviews: {data[data['label'] == 1]['label'].count()}")
    print(f"Negative Reviews: {data[data['label'] == 0]['label'].count()}")
    print(f"Max Review Length: {data['input_ids'].max()}\nMin Review Length: {data['input_ids'].min()}")
    print(f"Median Review Length: {data['input_ids'].median()}\nMean Review Length: {data['input_ids'].mean()}")

print("Train\n--------------")
summary(train_data)
print("Test\n---------------")
summary(test_data)

*Length of reviews (measured by the number of tokens)*

In [None]:
from matplotlib import pyplot as plt


def plot_hist(title: str, df: pandas.DataFrame) -> None:
    # Histogram of review lengths, measured in tokens.
    plt.figure()
    plt.hist(df["input_ids"], bins=100)
    plt.xlabel("Number of tokens")
    plt.ylabel("Count")
    plt.title(title)

plot_hist(title="Train Data", df=train_data)
plot_hist(title="Test Data", df=test_data)

In [None]:
plot_hist(title="Positive Reviews Test",df=test_data[test_data['label']==1])
plot_hist(title="Negative Reviews Test",df=test_data[test_data['label']==0])

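Run `test.py` in `sentiment_classification` and write the results to a file. The helper below reads those results and reports precision, recall and accuracy, both overall and per review-length bin.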

In [2]:
def get_metrics_by_bin(results, bins, threshold=0.5):
    # Overall confusion-matrix counts; labels and predictions share the same threshold.
    TP = len(results[(results["label"] >= threshold) & (results["prediction"] >= threshold)])
    FP = len(results[(results["label"] < threshold) & (results["prediction"] >= threshold)])
    TN = len(results[(results["label"] < threshold) & (results["prediction"] < threshold)])
    FN = len(results[(results["label"] >= threshold) & (results["prediction"] < threshold)])

    print("Metrics")
    print(f"Precision: {TP/(TP+FP)}\nRecall: {TP/(TP+FN)}\nAccuracy: {(TP+TN)/len(results)}")
    # Bucket reviews by token length using the bin edges passed in by the caller.
    results["bin"] = pandas.cut(results['length'], bins)
    metrics_by_bin = results.groupby('bin').apply(lambda x: pandas.Series({"TP": ((x["label"] >= threshold) & (x["prediction"] >= threshold)).sum(),
                                                                            "FP":((x["label"] < threshold) & (x["prediction"] >= threshold)).sum(),
                                                                            "FN": ((x["label"] >= threshold) & (x["prediction"] < threshold)).sum(),
                                                                            "TN": ((x["label"] < threshold) & (x["prediction"] < threshold)).sum()}))

    metrics_by_bin["accuracy"] = (metrics_by_bin["TP"] + metrics_by_bin["TN"])/(metrics_by_bin["TP"] + metrics_by_bin["TN"]+ metrics_by_bin["FP"]+ metrics_by_bin["FN"])
    metrics_by_bin["precision"] = metrics_by_bin["TP"]/(metrics_by_bin["TP"] + metrics_by_bin["FP"])
    metrics_by_bin["recall"] = metrics_by_bin["TP"]/(metrics_by_bin["TP"] + metrics_by_bin["FN"])
    print("Metrics by bin")
    print(metrics_by_bin.to_markdown())
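
The metrics computed by `get_metrics_by_bin` follow the usual confusion-matrix definitions. The sketch below restates them in plain Python so they can be checked by hand. The helper names `safe_div` and `confusion_metrics` are hypothetical and not part of the notebook's code; the first set of counts is copied from the `(0, 128]` row of the zero-shot results table further down.

```python
import math


def safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision and recall from confusion-matrix counts."""
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
    }


# Counts from the (0, 128] row of the zero-shot results table.
metrics = confusion_metrics(tp=605, fp=284, fn=320, tn=520)
print({name: round(value, 6) for name, value in metrics.items()})

# A bin with no positive labels and no positive predictions has
# undefined precision and recall, which is where NaN entries come from.
print(confusion_metrics(tp=0, fp=0, fn=0, tn=1))
```

Returning NaN for an empty denominator mirrors what pandas does when it divides two zero-valued columns, which is why sparsely populated bins can show `nan` in the per-bin tables.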

## Zero-shot-classification
Predict the next word given the following prompt:
 
'''
Review: The movie was awesome. Sentiment: Positive. 
Review: The performances were disappointing. Sentiment: Negative. 
Review: {review} Sentiment:
'''
I compute the probabilities of " Positive" and " Negative" as the next word and classify the review according to whichever probability is greater.
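
The decision rule above can be sketched in a few lines. This is a minimal illustration in plain Python, not the notebook's `Eval` implementation: the function names are hypothetical, the logits are made up, and the token ids 3 and 4 merely stand in for " Positive" and " Negative" (they are not GPT-2's real vocabulary ids).

```python
import math

# Prompt template copied from the text above.
PROMPT_TEMPLATE = (
    "Review: The movie was awesome. Sentiment: Positive. \n"
    "Review: The performances were disappointing. Sentiment: Negative. \n"
    "Review: {review} Sentiment:"
)


def build_prompt(review: str) -> str:
    """Insert the review into the prompt template."""
    return PROMPT_TEMPLATE.format(review=review)


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(next_token_logits: list[float], positive_id: int, negative_id: int) -> int:
    """Return 1 (positive) or 0 (negative) by comparing the two label-token probabilities."""
    probs = softmax(next_token_logits)
    return 1 if probs[positive_id] > probs[negative_id] else 0


# Toy 5-token vocabulary; ids 3 and 4 are hypothetical label-token ids.
toy_logits = [0.1, -1.2, 0.3, 2.0, 1.5]
print(build_prompt("A delightful film."))
print(classify(toy_logits, positive_id=3, negative_id=4))
```

Because softmax is monotonic, comparing the two probabilities gives the same decision as comparing the two raw logits; the probabilities are only needed if a score is written to the results file.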

**Run evaluation for the zero shot approach**

In [11]:
model_config = GPTConfig(block_size=128,use_lora=False,binary_classification_head=False)
eval_config = EvalConfig(results_path="zero_shot_128.txt",subset=False,batch_size=2)
test_set = reviewsDataset(split="test")
evaluator = Eval(test_set=test_set,eval_config=eval_config,model_config=model_config)
evaluator.evaluate()

Loading pre-trained weights for gpt2
Number of parameters: 123.65M


100%|██████████| 12500/12500 [08:31<00:00, 24.43it/s]


In [12]:
res_file = pandas.read_csv("zero_shot_128.txt")
bins = range(0,1500,128)
get_metrics_by_bin(res_file,bins,threshold=0.5)

Metrics
Precision: 0.6153087115872569
Recall: 0.6984
Accuracy: 0.63088
Metrics by bin
| bin          |   TP |   FP |   FN |   TN |   accuracy |   precision |   recall |
|:-------------|-----:|-----:|-----:|-----:|-----------:|------------:|---------:|
| (0, 128]     |  605 |  284 |  320 |  520 |   0.650665 |    0.68054  | 0.654054 |
| (128, 256]   | 2132 | 1336 |  831 | 1737 |   0.640987 |    0.614764 | 0.719541 |
| (256, 384]   |  746 |  495 |  339 |  624 |   0.621597 |    0.601128 | 0.687558 |
| (384, 512]   |  364 |  267 |  161 |  292 |   0.605166 |    0.576862 | 0.693333 |
| (512, 640]   |  210 |  154 |   92 |  148 |   0.592715 |    0.576923 | 0.695364 |
| (640, 768]   |  113 |   80 |   57 |   82 |   0.587349 |    0.585492 | 0.664706 |
| (768, 896]   |   86 |   52 |   31 |   52 |   0.624434 |    0.623188 | 0.735043 |
| (896, 1024]  |   42 |   24 |   23 |   28 |   0.598291 |    0.636364 | 0.646154 |
| (1024, 1152] |   38 |   18 |   13 |   20 |   0.651685 |    0.678571 | 0.745098 |


**Fine-tuning without LoRA**

In [None]:
train_config = TrainConfig(out_dir="run/dropout/",init_from="resume",checkpoint_name="finetune_no_lora.ckpt")
model_config = GPTConfig(use_lora=False)
rd = reviewsDataset(split="train",max_length=model_config.block_size)
train_set, val_set = torch.utils.data.random_split(rd,[0.85,0.15])
trainer = Trainer(train_set,val_set,train_config,model_config)
trainer.train()

**Run eval using the fine-tuned model**

In [6]:
model_config = GPTConfig(block_size=128,use_lora=False,load_from_checkpoint=True,checkpoint_path="run/dropout/finetune_no_lora.ckpt")
eval_config = EvalConfig(results_path="finetuned_no_lora.txt",subset=True)
# NOTE: this loads the train split, so the metrics below are not held-out test results.
test_set = reviewsDataset(split="train")
evaluator = Eval(test_set=test_set,eval_config=eval_config,model_config=model_config)
evaluator.evaluate()

Loading pre-trained weights for gpt2
Number of parameters: 123.65M


100%|██████████| 125/125 [00:04<00:00, 26.20it/s]


**Test the performance of the fine-tuned model**

In [7]:
res_file = pandas.read_csv("finetuned_no_lora.txt")
bins = range(0,1500,128)
get_metrics_by_bin(res_file,bins,threshold=0.5)

Metrics
Precision: 0.7832167832167832
Recall: 0.896
Accuracy: 0.824
Metrics by bin
| bin          |   TP |   FP |   FN |   TN |   accuracy |   precision |     recall |
|:-------------|-----:|-----:|-----:|-----:|-----------:|------------:|-----------:|
| (0, 128]     |   19 |    4 |    2 |   16 |   0.853659 |    0.826087 |   0.904762 |
| (128, 256]   |   47 |   10 |    4 |   47 |   0.87037  |    0.824561 |   0.921569 |
| (256, 384]   |   22 |   10 |    2 |   10 |   0.727273 |    0.6875   |   0.916667 |
| (384, 512]   |    9 |    4 |    1 |    9 |   0.782609 |    0.692308 |   0.9      |
| (512, 640]   |    7 |    1 |    2 |    8 |   0.833333 |    0.875    |   0.777778 |
| (640, 768]   |    4 |    1 |    0 |    1 |   0.833333 |    0.8      |   1        |
| (768, 896]   |    2 |    0 |    2 |    1 |   0.6      |    1        |   0.5      |
| (896, 1024]  |    0 |    1 |    0 |    0 |   0        |    0        | nan        |
| (1024, 1152] |    0 |    0 |    0 |    1 |   1        |  nan        | nan        |



**Run training using LoRA**
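
As background for this run: LoRA keeps a pre-trained weight matrix `W` frozen and learns a low-rank update, so a layer computes `W x + (alpha / r) * B (A x)`, where `A` is `r x d_in` and `B` is `d_out x r`. The sketch below follows the formulation in the LoRA paper using plain Python lists for readability. The helper names (`matvec`, `lora_forward`) and the toy matrices are illustrative assumptions and do not reflect how this repository implements LoRA.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, r: int) -> list[float]:
    """Frozen path W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen pre-trained weights
    update = matvec(B, matvec(A, x))  # trainable path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d_in = 3, d_out = 2, r = 1 (illustrative values only).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, -0.5, 1.0]]      # r x d_in, randomly initialised in the paper
B_zero = [[0.0], [0.0]]     # d_out x r, initialised to zero in the paper
B_trained = [[1.0], [2.0]]  # stand-in for B after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, alpha=1.0, r=1))     # identical to W x
print(lora_forward(W, A, B_trained, x, alpha=1.0, r=1))
```

With `B` initialised to zero the adapted layer starts out identical to the pre-trained one, and only `A` and `B` receive gradients, which is why the number of trainable parameters is so small compared with the 123.65M total.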

In [20]:
# torch.manual_seed(1335)
train_config = TrainConfig(out_dir="run/dropout_low_lr",checkpoint_name="finetune_lora.ckpt",init_from="resume",learning_rate=1e-4,max_iters=80000,lr_decay_iters=80000)
model_config = GPTConfig(block_size=128,use_lora=True,r=8,binary_classification_head=True)
rd = reviewsDataset(split="train",max_length=model_config.block_size)
train_set, val_set = torch.utils.data.random_split(rd,[0.85,0.15])
trainer = Trainer(train_set,val_set,train_config,model_config)
trainer.train()

Loading pre-trained weights for gpt2
Number of parameters: 123.65M
Resuming training from run/dropout_low_lr/finetune_lora.ckpt




num decayed parameter tensors: 25, with 295,680 parameters
num non-decayed parameter tensors: 74, with 102,145 parameters


  0%|          | 0/32000 [00:00<?, ?it/s]

Step: 48000
 Train Loss: 0.598097562789917
Validation Loss: 0.4696851074695587
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


  6%|▋         | 2000/32000 [07:53<2:14:21,  3.72it/s] 

Step: 50000
 Train Loss: 0.404984712600708
Validation Loss: 0.43395259976387024
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 12%|█▎        | 4000/32000 [16:03<1:31:21,  5.11it/s]  

Step: 52000
 Train Loss: 0.5191661715507507
Validation Loss: 0.4795258045196533
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 19%|█▉        | 6000/32000 [24:01<1:27:26,  4.96it/s] 

Step: 54000
 Train Loss: 0.5285136699676514
Validation Loss: 0.6277976632118225
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 25%|██▌       | 8000/32000 [32:23<1:29:25,  4.47it/s] 

Step: 56000
 Train Loss: 0.5062193274497986
Validation Loss: 0.529377281665802
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 31%|███▏      | 10000/32000 [40:48<1:14:37,  4.91it/s]

Step: 58000
 Train Loss: 0.4908844828605652
Validation Loss: 0.45829907059669495
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 38%|███▊      | 12000/32000 [48:54<1:05:16,  5.11it/s] 

Step: 60000
 Train Loss: 0.6196596026420593
Validation Loss: 0.4461834728717804
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 44%|████▍     | 14000/32000 [57:19<1:17:42,  3.86it/s] 

Step: 62000
 Train Loss: 0.5475892424583435
Validation Loss: 0.5505332946777344
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 50%|█████     | 16000/32000 [1:05:49<57:06,  4.67it/s]  

Step: 64000
 Train Loss: 0.4960448145866394
Validation Loss: 0.6434656381607056
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 56%|█████▋    | 18000/32000 [1:13:57<45:35,  5.12it/s]   

Step: 66000
 Train Loss: 0.4511506259441376
Validation Loss: 0.4789242446422577
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 62%|██████▎   | 20000/32000 [1:22:37<52:21,  3.82it/s]   

Step: 68000
 Train Loss: 0.5850903391838074
Validation Loss: 0.6562095880508423
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 69%|██████▉   | 22000/32000 [1:31:05<32:33,  5.12it/s]   

Step: 70000
 Train Loss: 0.4576529264450073
Validation Loss: 0.5525995492935181
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 75%|███████▌  | 24000/32000 [1:39:07<25:49,  5.16it/s]   

Step: 72000
 Train Loss: 0.5173817873001099
Validation Loss: 0.4392203390598297
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 81%|████████▏ | 26000/32000 [1:47:08<20:02,  4.99it/s]   

Step: 74000
 Train Loss: 0.5350633263587952
Validation Loss: 0.5186405777931213
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 88%|████████▊ | 28000/32000 [1:55:15<13:20,  5.00it/s]   

Step: 76000
 Train Loss: 0.5807334780693054
Validation Loss: 0.4768508970737457
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


 94%|█████████▍| 30000/32000 [2:03:10<06:21,  5.25it/s]   

Step: 78000
 Train Loss: 0.47857531905174255
Validation Loss: 0.4488178491592407
Saving checkpoint to run/dropout_low_lr/finetune_lora.ckpt


100%|██████████| 32000/32000 [2:11:16<00:00,  4.06it/s]  
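
The optimizer summary near the top of this training log reports 25 decayed tensors holding 295,680 parameters. As a sanity check, the arithmetic below reproduces both numbers under an assumption the notebook does not state: that LoRA with `r=8` wraps only the fused 768 to 2304 attention projection in each of GPT-2 small's 12 blocks, and that the binary classification head contributes a single 768 x 1 weight matrix. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


n_layer, n_embd = 12, 768  # GPT-2 small
r = 8                      # from GPTConfig(r=8)

# Assumption: LoRA is applied only to the fused query/key/value projection.
per_layer = lora_param_count(d_in=n_embd, d_out=3 * n_embd, r=r)

# Assumption: the binary head has one 768 -> 1 weight matrix (bias not decayed).
head_weights = n_embd * 1

total_decayed = n_layer * per_layer + head_weights
decayed_tensors = 2 * n_layer + 1  # one A and one B per block, plus the head

print(per_layer, total_decayed, decayed_tensors)
```

If the repository applies LoRA to other layers, the match would be a coincidence, so treat this as a plausibility check rather than a description of the code.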


**Evaluate using the LoRA fine-tuned model**

In [23]:
model_config = GPTConfig(use_lora=True,binary_classification_head=True,block_size=128,load_from_checkpoint=True,checkpoint_path="run/dropout_low_lr/finetune_lora.ckpt")
eval_config = EvalConfig(results_path="finetuned_lora.txt",batch_size=2,subset=False)
test_set = reviewsDataset(split="test")
evaluator = Eval(test_set=test_set,eval_config=eval_config,model_config=model_config)
evaluator.evaluate()

Loading pre-trained weights for gpt2
Number of parameters: 123.65M


100%|██████████| 12500/12500 [10:03<00:00, 20.72it/s] 


**Test the performance of the LoRA fine-tuned model**

In [24]:
res_file = pandas.read_csv("finetuned_lora.txt")
bins = range(0,1500,128)
get_metrics_by_bin(res_file,bins,threshold=0.5)

Metrics
Precision: 0.8346602481022033
Recall: 0.72128
Accuracy: 0.7892
Metrics by bin
| bin          |   TP |   FP |   FN |   TN |   accuracy |   precision |   recall |
|:-------------|-----:|-----:|-----:|-----:|-----------:|------------:|---------:|
| (0, 128]     | 1553 |  202 |  297 | 1387 |   0.8549   |    0.8849   | 0.839459 |
| (128, 256]   | 4369 |  680 | 1532 | 5374 |   0.814973 |    0.86532  | 0.740383 |
| (256, 384]   | 1466 |  343 |  702 | 1976 |   0.767105 |    0.810392 | 0.676199 |
| (384, 512]   |  663 |  252 |  394 |  894 |   0.706764 |    0.72459  | 0.627247 |
| (512, 640]   |  380 |  124 |  223 |  452 |   0.705683 |    0.753968 | 0.630182 |
| (640, 768]   |  225 |   64 |  137 |  258 |   0.70614  |    0.778547 | 0.621547 |
| (768, 896]   |  137 |   52 |   76 |  155 |   0.695238 |    0.724868 | 0.643192 |
| (896, 1024]  |   90 |   32 |   49 |   86 |   0.684825 |    0.737705 | 0.647482 |
| (1024, 1152] |   65 |   17 |   29 |   69 |   0.744444 |    0.792683 | 0.691489 |
