# GPT-2 for Sentiment Analysis on IMDb movie reviews

## Table of Contents
1. [Introduction](#Introduction)
2. [Data exploration](#Data-Exploration)
3. [Zero-shot classification](#Zero-shot-classification)

## Introduction

The [IMDb](https://ai.stanford.edu/~amaas/data/sentiment/) dataset is a binary sentiment classification dataset consisting of 50k labeled movie reviews (25k positive and 25k negative). It is split into train and test sets containing 25k reviews each.

In this notebook, my goals are:
1. Understand and implement [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Run GPT-2 on the IMDb classification task.
2. Fine-tune GPT-2 for sentiment classification in about 30 minutes on an 8GB Apple M2 MacBook Air (faster if you have an NVIDIA GPU).
3. Understand how [LoRA](https://arxiv.org/abs/2106.09685) is implemented and use it to fine-tune GPT-2 for sentiment classification.

## Data-Exploration
Get a summary of the dataset, i.e.:
1. Number of samples
2. Number of positive / negative samples
3. Length of the movie reviews



In [1]:
import pandas
import torch
from torch.utils.data import Dataset

from gpt_config import GPTConfig
from sentiment_classification.reviewsDataset import reviewsDataset
from sentiment_classification.eval import Eval
from sentiment_classification.eval_config import EvalConfig
from sentiment_classification.train import Trainer
from sentiment_classification.train_config import TrainConfig


In [None]:
# Dataset exploration

imdb_train = reviewsDataset("train",max_length=10000)
imdb_test = reviewsDataset("test",max_length=10000)


def format_data(dataset: Dataset) -> pandas.DataFrame:
    """Collect the token count, label and file path of every review."""
    rows = []
    for sample in dataset:
        rows.append({"input_ids": len(sample["input_ids"]),  # review length in tokens
                     "label": sample["label"],
                     "filename": sample["fpath"]})

    return pandas.DataFrame(rows)

train_data = format_data(imdb_train)
test_data = format_data(imdb_test)


*Summary statistics of the dataset*

In [None]:
def summary(data: pandas.DataFrame) -> None:
    print(f"Number of reviews: {len(data)}")
    print(f"Positive Reviews: {data[data['label'] == 1]['label'].count()}")
    print(f"Negative Reviews: {data[data['label'] == 0]['label'].count()}")
    print(f"Max Review Length: {data['input_ids'].max()}\nMin Review Length: {data['input_ids'].min()}")
    print(f"Median Review Length: {data['input_ids'].median()}\nMean Review Length: {data['input_ids'].mean()}")

print("Train\n--------------")
summary(train_data)
print("Test\n---------------")
summary(test_data)

*Length of reviews (measured by the number of tokens)*

In [None]:
from matplotlib import pyplot as plt

def plot_hist(title: str, df: pandas.DataFrame) -> None:
    # Histogram of review lengths, measured in tokens
    plt.figure()
    plt.hist(df["input_ids"], bins=100)
    plt.xlabel("Number of tokens")
    plt.ylabel("Count")
    plt.title(title)

plot_hist(title='Train Data', df=train_data) 
plot_hist(title="Test Data", df=test_data)   

In [None]:
plot_hist(title="Positive Reviews Test",df=test_data[test_data['label']==1])
plot_hist(title="Negative Reviews Test",df=test_data[test_data['label']==0])

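Run `test.py` in `sentiment_classification` and write the results to a file. The helper below reads such a results file and reports the overall metrics as well as the metrics per review-length bin.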

In [2]:
def get_metrics_by_bin(results, bins, threshold=0.5):
    """Print overall and per-length-bin precision, recall and accuracy."""

    def confusion_counts(df):
        # A value >= threshold counts as positive, for labels and predictions alike
        # (the original FN count used a strict '>' on the label, unlike the other three).
        actual = df["label"] >= threshold
        predicted = df["prediction"] >= threshold
        return pandas.Series({"TP": (actual & predicted).sum(),
                              "FP": (~actual & predicted).sum(),
                              "FN": (actual & ~predicted).sum(),
                              "TN": (~actual & ~predicted).sum()})

    TP, FP, FN, TN = (int(v) for v in confusion_counts(results)[["TP", "FP", "FN", "TN"]])

    print("Metrics")
    print(f"Precision: {TP/(TP+FP)}\nRecall: {TP/(TP+FN)}\nAccuracy: {(TP+TN)/len(results)}")

    # Use the bins passed in by the caller instead of overwriting them here.
    results["bin"] = pandas.cut(results["length"], bins)
    # observed=False keeps empty length bins in the output (their ratios are NaN).
    metrics_by_bin = results.groupby("bin", observed=False).apply(confusion_counts)

    total = metrics_by_bin[["TP", "TN", "FP", "FN"]].sum(axis=1)
    metrics_by_bin["accuracy"] = (metrics_by_bin["TP"] + metrics_by_bin["TN"]) / total
    metrics_by_bin["precision"] = metrics_by_bin["TP"] / (metrics_by_bin["TP"] + metrics_by_bin["FP"])
    metrics_by_bin["recall"] = metrics_by_bin["TP"] / (metrics_by_bin["TP"] + metrics_by_bin["FN"])
    print("Metrics by bin")
    print(metrics_by_bin.to_markdown())

## Zero-shot-classification

GPT-2 is asked to predict the next word given the following prompt:
 
'''
Review: The movie was awesome. Sentiment: Positive. 
Review: The performances were disappointing. Sentiment: Negative. 
Review: {review} Sentiment:
'''
I compute the probabilities of the next word being " Positive" and " Negative" and classify the review according to whichever probability is greater.
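
The decision rule can be sketched in a few lines. This is only an illustration, not the code in `sentiment_classification`: `build_prompt` and `classify_from_logits` are hypothetical helper names, and the logits passed in are assumed to be the model's next-token scores for the two candidate words. Renormalising over just these two candidates gives a score in [0, 1], which is one plausible way to obtain a value that can be thresholded at 0.5.

```python
import math

# Prompt template copied from the text above; {review} is filled in per sample.
PROMPT_TEMPLATE = (
    "Review: The movie was awesome. Sentiment: Positive. \n"
    "Review: The performances were disappointing. Sentiment: Negative. \n"
    "Review: {review} Sentiment:"
)


def build_prompt(review: str) -> str:
    """Insert a review into the prompt (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(review=review)


def classify_from_logits(next_token_logits: dict) -> tuple:
    """Return (label, p_positive) from the logits of the two candidate words.

    next_token_logits is assumed to map " Positive" and " Negative" to the
    model's next-token logits. A softmax restricted to the two candidates
    preserves their ordering, so the label matches "pick the more probable word".
    """
    pos = next_token_logits[" Positive"]
    neg = next_token_logits[" Negative"]
    m = max(pos, neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(pos - m), math.exp(neg - m)
    p_pos = e_pos / (e_pos + e_neg)
    return (1 if p_pos >= 0.5 else 0), p_pos


# Made-up logits, purely for illustration.
label, p_pos = classify_from_logits({" Positive": 2.0, " Negative": 0.0})
print(label, round(p_pos, 3))
```

Comparing the two probabilities directly and comparing the renormalised score against 0.5 yield the same label; the renormalised form is just convenient when a single prediction score per review is stored.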

**Run evaluation for the zero shot approach**

In [None]:
model_config = GPTConfig(block_size=128,use_lora=False)
eval_config = EvalConfig(results_path="zero_shot_128.txt",subset=False)
test_set = reviewsDataset(split="test")
evaluator = Eval(test_set=test_set,eval_config=eval_config,model_config=model_config)
evaluator.evaluate()

In [None]:
res_file = pandas.read_csv("zero_shot_128.txt")
bins = range(0,1500,128)
get_metrics_by_bin(res_file,bins,threshold=0.5)

**Fine-tuning without LoRA**

In [None]:
train_config = TrainConfig(out_dir="run/first_part_review_no_lora/",init_from="gpt2",checkpoint_name="finetune_no_lora.ckpt")
model_config = GPTConfig(use_lora=False)
rd = reviewsDataset(split="train",max_length=model_config.block_size)
train_set, val_set = torch.utils.data.random_split(rd,[0.85,0.15])
trainer = Trainer(train_set,val_set,train_config,model_config)
trainer.train()

**Run eval using the fine-tuned model**

In [None]:
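# Checkpoint path assumed to be out_dir + checkpoint_name from the training cell above
model_config = GPTConfig(block_size=128,use_lora=False,load_from_checkpoint=True,checkpoint_path="run/first_part_review_no_lora/finetune_no_lora.ckpt")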
eval_config = EvalConfig(results_path="finetuned_no_lora.txt",subset=False)
test_set = reviewsDataset(split="test")
evaluator = Eval(test_set=test_set,eval_config=eval_config,model_config=model_config)
evaluator.evaluate()

**Test the performance of the fine-tuned model**

In [None]:
res_file = pandas.read_csv("finetuned_no_lora.txt")
bins = range(0,1500,128)
get_metrics_by_bin(res_file,bins,threshold=0.5)

**Run training using LoRA**

In [4]:
train_config = TrainConfig(out_dir="run/first_part_review",checkpoint_name="finetune_lora.ckpt",init_from="resume",learning_rate=5e-3)
model_config = GPTConfig(block_size=128,use_lora=True,r=8,binary_classification_head=True)
rd = reviewsDataset(split="train",max_length=model_config.block_size)
train_set, val_set = torch.utils.data.random_split(rd,[0.85,0.15])
trainer = Trainer(train_set,val_set,train_config,model_config)
trainer.train()

Loading pre-trained weights for gpt2
Number of parameters: 123.65M
Resuming training from run/first_part_review/finetune_lora.ckpt




num decayed parameter tensors: 25, with 295,680 parameters
num non-decayed parameter tensors: 0, with 0 parameters


  0%|          | 0/52000 [00:00<?, ?it/s]

Step: 8000
 Train Loss: 0.5172512531280518
Validation Loss: 0.5576334595680237
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


  4%|▍         | 2000/52000 [07:09<2:42:08,  5.14it/s]

Step: 10000
 Train Loss: 0.41479671001434326
Validation Loss: 0.4026747941970825
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


  8%|▊         | 4000/52000 [14:14<2:49:27,  4.72it/s] 

Step: 12000
 Train Loss: 0.48558348417282104
Validation Loss: 0.4565548300743103
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 12%|█▏        | 6000/52000 [21:21<2:35:48,  4.92it/s] 

Step: 14000
 Train Loss: 0.47223418951034546
Validation Loss: 0.41489166021347046
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 15%|█▌        | 8000/52000 [28:32<2:34:47,  4.74it/s] 

Step: 16000
 Train Loss: 0.35626357793807983
Validation Loss: 0.4491492509841919
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 19%|█▉        | 10000/52000 [35:37<2:31:28,  4.62it/s]

Step: 18000
 Train Loss: 0.4400196373462677
Validation Loss: 0.3557663857936859
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 23%|██▎       | 12000/52000 [42:47<2:14:25,  4.96it/s] 

Step: 20000
 Train Loss: 0.40503746271133423
Validation Loss: 0.3974704444408417
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 27%|██▋       | 14000/52000 [49:58<2:12:03,  4.80it/s] 

Step: 22000
 Train Loss: 0.35306239128112793
Validation Loss: 0.48951295018196106
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 31%|███       | 16000/52000 [57:10<1:54:25,  5.24it/s] 

Step: 24000
 Train Loss: 0.4627331495285034
Validation Loss: 0.4639752209186554
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 35%|███▍      | 18000/52000 [1:04:21<2:02:55,  4.61it/s]

Step: 26000
 Train Loss: 0.4530695080757141
Validation Loss: 0.40638023614883423
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 38%|███▊      | 20000/52000 [1:11:36<1:48:04,  4.93it/s] 

Step: 28000
 Train Loss: 0.36468324065208435
Validation Loss: 0.5254888534545898
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 42%|████▏     | 22000/52000 [1:18:48<1:45:40,  4.73it/s] 

Step: 30000
 Train Loss: 0.31557270884513855
Validation Loss: 0.35610634088516235
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 46%|████▌     | 24000/52000 [1:26:00<1:42:10,  4.57it/s] 

Step: 32000
 Train Loss: 0.3332388997077942
Validation Loss: 0.4747641086578369
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 50%|█████     | 26000/52000 [1:33:12<1:35:10,  4.55it/s] 

Step: 34000
 Train Loss: 0.38842177391052246
Validation Loss: 0.34253016114234924
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 54%|█████▍    | 28000/52000 [1:40:23<1:07:21,  5.94it/s] 

Step: 36000
 Train Loss: 0.33774372935295105
Validation Loss: 0.5448333024978638
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 58%|█████▊    | 30000/52000 [1:47:35<1:22:49,  4.43it/s] 

Step: 38000
 Train Loss: 0.26978909969329834
Validation Loss: 0.450605571269989
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 62%|██████▏   | 32000/52000 [1:54:41<1:05:12,  5.11it/s] 

Step: 40000
 Train Loss: 0.3024047911167145
Validation Loss: 0.39471179246902466
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 65%|██████▌   | 34000/52000 [2:01:49<1:05:28,  4.58it/s] 

Step: 42000
 Train Loss: 0.21444904804229736
Validation Loss: 0.339535117149353
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 69%|██████▉   | 36000/52000 [2:08:54<55:24,  4.81it/s]   

Step: 44000
 Train Loss: 0.3342539966106415
Validation Loss: 0.4235295355319977
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 73%|███████▎  | 38000/52000 [2:16:05<52:29,  4.45it/s]   

Step: 46000
 Train Loss: 0.33049532771110535
Validation Loss: 0.3862728774547577
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 77%|███████▋  | 40000/52000 [2:23:22<40:18,  4.96it/s]   

Step: 48000
 Train Loss: 0.3497768044471741
Validation Loss: 0.49080517888069153
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 81%|████████  | 42000/52000 [2:30:36<32:16,  5.16it/s]  

Step: 50000
 Train Loss: 0.2059875875711441
Validation Loss: 0.4776913821697235
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 85%|████████▍ | 44000/52000 [2:37:49<26:56,  4.95it/s]  

Step: 52000
 Train Loss: 0.2734302282333374
Validation Loss: 0.5537917613983154
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 88%|████████▊ | 46000/52000 [2:44:58<22:56,  4.36it/s]  

Step: 54000
 Train Loss: 0.2576228082180023
Validation Loss: 0.324387788772583
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 92%|█████████▏| 48000/52000 [2:52:12<14:00,  4.76it/s]  

Step: 56000
 Train Loss: 0.20802640914916992
Validation Loss: 0.4687481224536896
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


 96%|█████████▌| 50000/52000 [2:59:23<07:09,  4.66it/s]  

Step: 58000
 Train Loss: 0.24838979542255402
Validation Loss: 0.44210702180862427
Saving checkpoint to run/first_part_review/finetune_lora.ckpt


100%|██████████| 52000/52000 [3:06:39<00:00,  4.64it/s]  
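
The trainable-parameter count printed at the start of the log (25 tensors, 295,680 parameters) can be reproduced with a little arithmetic. The sketch below is one decomposition that matches those numbers, not a description of the actual implementation: it assumes a rank-8 LoRA pair (A and B) on the fused query/key/value projection of each of GPT-2 small's 12 blocks (width 768), plus a bias-free classification head with a single output.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


n_layer = 12   # transformer blocks in GPT-2 small
n_embd = 768   # embedding width of GPT-2 small
r = 8          # LoRA rank, as in GPTConfig(r=8)

# Assumption: LoRA is attached to the fused QKV projection (768 -> 3 * 768).
per_layer = lora_param_count(n_embd, 3 * n_embd, r)

# Assumption: the binary classification head is one weight vector without bias.
head = n_embd

total = n_layer * per_layer + head
tensors = n_layer * 2 + 1  # one A and one B per block, plus the head

print(f"{total:,} parameters in {tensors} tensors")
```

Under these assumptions only about 0.24% of the 123.65M parameters reported in the log are updated during training.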


**Evaluate using the LoRA fine-tuned model**

In [7]:
model_config = GPTConfig(use_lora=True,binary_classification_head=True,block_size=128,load_from_checkpoint=True,checkpoint_path="run/first_part_review/finetune_lora.ckpt")
eval_config = EvalConfig(results_path="finetuned_lora.txt",batch_size=2,subset=True)
test_set = reviewsDataset(split="test")
evaluator = Eval(test_set=test_set,eval_config=eval_config,model_config=model_config)
evaluator.evaluate()

Loading pre-trained weights for gpt2
Number of parameters: 123.65M


100%|██████████| 125/125 [00:04<00:00, 27.29it/s]


**Test the performance of the LoRA fine-tuned model**

In [8]:
res_file = pandas.read_csv("finetuned_lora.txt")
bins = range(0,1500,128)
get_metrics_by_bin(res_file,bins,threshold=0.5)

Metrics
Precision: 0.7698412698412699
Recall: 0.776
Accuracy: 0.772
Metrics by bin
| bin          |   TP |   FP |   FN |   TN |   accuracy |   precision |     recall |
|:-------------|-----:|-----:|-----:|-----:|-----------:|------------:|-----------:|
| (0, 128]     |   20 |    5 |    3 |   10 |   0.789474 |    0.8      |   0.869565 |
| (128, 256]   |   47 |   13 |   11 |   54 |   0.808    |    0.783333 |   0.810345 |
| (256, 384]   |   16 |    4 |    5 |   15 |   0.775    |    0.8      |   0.761905 |
| (384, 512]   |    5 |    5 |    4 |   12 |   0.653846 |    0.5      |   0.555556 |
| (512, 640]   |    7 |    0 |    3 |    2 |   0.75     |    1        |   0.7      |
| (640, 768]   |    1 |    1 |    1 |    2 |   0.6      |    0.5      |   0.5      |
| (768, 896]   |    0 |    1 |    1 |    1 |   0.333333 |    0        |   0        |
| (896, 1024]  |    1 |    0 |    0 |    0 |   1        |    1        |   1        |
| (1024, 1152] |    0 |    0 |    0 |    0 | nan        |  nan        | nan        |
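
As a sanity check, the headline metrics can be re-derived from the per-bin counts in the table. The snippet below is a small standalone sketch (the helper name `binary_metrics` is mine, not part of the repository); the counts are copied from the rows above.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall and accuracy from confusion-matrix counts."""

    def ratio(num: int, den: int) -> float:
        # An empty denominator (e.g. a length bin with no reviews) yields NaN.
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# (TP, FP, FN, TN) per length bin, copied from the table above.
bin_counts = [
    (20, 5, 3, 10), (47, 13, 11, 54), (16, 4, 5, 15), (5, 5, 4, 12),
    (7, 0, 3, 2), (1, 1, 1, 2), (0, 1, 1, 1), (1, 0, 0, 0), (0, 0, 0, 0),
]

# Sum each column to get the overall confusion matrix.
tp, fp, fn, tn = (sum(col) for col in zip(*bin_counts))
overall = binary_metrics(tp, fp, fn, tn)

print(tp, fp, fn, tn)
print(overall)
```

The summed counts are TP=97, FP=29, FN=28, TN=96 over 250 reviews, which gives the precision (97/126), recall (97/125) and accuracy (193/250) printed above. Bins with only a handful of reviews, such as everything beyond 512 tokens here, are too small to draw conclusions from.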

