# IMDB Sentiment Analysis with BERT (Hugging Face Transformers)

#### This notebook demonstrates a complete deep learning workflow for sentiment analysis on the IMDB movie reviews dataset using a BERT-based model from Hugging Face Transformers. It covers data loading, preprocessing, tokenization, model setup, selective layer fine-tuning, training, evaluation, and error analysis. The notebook includes advanced features such as dynamic padding, early stopping, and saving the trained model and tokenizer. Misclassified examples are analyzed to help understand model limitations and guide further improvements.


## Check Device (CPU/GPU)

In [None]:
# Check if CUDA (GPU) is available and set device
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Install Required Packages

In [None]:
pip install -U "transformers==4.44.2" "datasets==2.21.0" "accelerate>=0.33.0"

Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting datasets==2.21.0
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.44.2)
  Downloading tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting fsspec<=2024.6.1,>=2023.1.0 (from fsspec[http]<=2024.6.1,>=2023.1.0->datasets==2.21.0)
  Downloading fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)


## Import Libraries

In [None]:
# Import all required libraries for data processing, modeling, and evaluation
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset, Dataset
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from torch.cuda.amp import GradScaler, autocast
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer, EarlyStoppingCallback)


## Load Dataset

In [None]:
# Load the IMDB Sentiment dataset and split into train/test/validation sets
ds = load_dataset("Kwaai/IMDB_Sentiment")

df_train = ds["train"].to_pandas()[["text","label"]]
df_test = ds["test"].to_pandas()[["text","label"]]

# 80% train, 20% validation, stratified by label
df_train, df_val = train_test_split(
    df_train,
    test_size=0.2,
    stratify=df_train["label"],
    random_state=42
)

print("Train shape:", df_train.shape, "| Test shape:", df_test.shape)
df_train.head()


Train shape: (20000, 2) | Test shape: (25000, 2)


| index | text | label |
|---|---|---|
| 20022 | I have always been a huge James Bond fanatic! ... | 1 |
| 4993 | I am a Christian and I say this movie had terr... | 0 |
| 24760 | Neatly sandwiched between THE STRANGER, a smal... | 1 |
| 13775 | Years ago I did follow a soap on TV. So I was ... | 1 |
| 20504 | Here's a gritty, get-the-bad guys revenge stor... | 1 |


## Set Random Seed

In [None]:
# Set random seed for reproducibility
import random

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7f46cc144ab0>

## Model and Tokenizer Setup 

In [None]:
# Define model parameters and load tokenizer. 
MODEL_NAME = "bert-base-uncased"
NUM_CLASSES = 2
MAX_LEN = 256
UNFREEZE_LAST_K = 2    # Unfreeze the last K transformer blocks (plus pooler, classifier, and LayerNorms)

# Load Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_NAME)



## Convert dataframes to Hugging Face datasets 

In [None]:
# Convert Pandas DataFrames to HuggingFace Datasets. 
def to_hfds(df):
    ds = Dataset.from_pandas(df[["text", "label"]].reset_index(drop=True))
    ds = ds.rename_column("label", "labels")
    return ds

train_hf = to_hfds(df_train)
val_hf   = to_hfds(df_val)
test_hf  = to_hfds(df_test)

## Tokenize data 

In [None]:
# Tokenize datasets
def tokenize_batch(batch):
    return tok(batch["text"], truncation=True, max_length=MAX_LEN)

train_hf = train_hf.map(tokenize_batch, batched=True, remove_columns=["text"])
val_hf   = val_hf.map(tokenize_batch, batched=True, remove_columns=["text"])
test_hf  = test_hf.map(tokenize_batch, batched=True, remove_columns=["text"])

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

## Format Datasets for PyTorch and Load Model

In [None]:
# Set dataset format for PyTorch
cols = ["input_ids", "attention_mask", "labels"]
for split in (train_hf, val_hf, test_hf):  # avoid shadowing the `ds` loaded earlier
    split.set_format(type="torch", columns=cols)

# Load model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Freeze/Unfreeze Model Layers 

In [None]:
# Freeze all layers, then unfreeze the last k encoder blocks, pooler, and classifier
def freeze_all(m):
    for p in m.parameters():
        p.requires_grad = False

def unfreeze_last_k_bert_layers(m, k: int):
    # Unfreeze last k BERT encoder layers. 
    for layer in m.bert.encoder.layer[-k:]:
        for p in layer.parameters():
            p.requires_grad = True
    # unfreeze pooler
    if hasattr(m.bert, "pooler") and m.bert.pooler is not None:
        for p in m.bert.pooler.parameters():
            p.requires_grad = True
    # Always unfreeze classifier head
    for p in m.classifier.parameters():
        p.requires_grad = True

freeze_all(model)
unfreeze_last_k_bert_layers(model, UNFREEZE_LAST_K)

# unfreeze all LayerNorms for stability
for n, p in model.named_parameters():
    if "LayerNorm" in n:
        p.requires_grad = True

## Check Trainable Parameters

In [None]:
# Print number of trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable/1e6:.2f}M / {total/1e6:.2f}M")


Trainable params: 14.80M / 109.48M
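
The printed figure can be cross-checked by hand. The sketch below is a back-of-envelope count, assuming the standard BERT-base sizes (hidden size 768, 12 layers, feed-forward size 3072, vocabulary 30522, 512 positions, 2 token types). These values are assumptions about the checkpoint's config rather than something read from the model object. The count applies the same freezing rules as the cell above.

```python
# Back-of-envelope parameter count for bert-base-uncased with a 2-class head.
# Architecture sizes are assumptions taken from the standard BERT-base config.
H, LAYERS, FFN = 768, 12, 3072
VOCAB, MAX_POS, TYPES, NUM_CLASSES = 30522, 512, 2, 2
UNFREEZE_LAST_K = 2


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * H  # scale and shift vectors
embeddings = (VOCAB + MAX_POS + TYPES) * H + layer_norm
encoder_layer = (
    4 * linear(H, H)      # query, key, value, attention output
    + layer_norm          # LayerNorm after attention
    + linear(H, FFN)      # feed-forward expansion
    + linear(FFN, H)      # feed-forward projection
    + layer_norm          # LayerNorm after feed-forward
)
pooler = linear(H, H)
classifier = linear(H, NUM_CLASSES)

total = embeddings + LAYERS * encoder_layer + pooler + classifier

frozen_layers = LAYERS - UNFREEZE_LAST_K
trainable = (
    UNFREEZE_LAST_K * encoder_layer
    + pooler
    + classifier
    + frozen_layers * 2 * layer_norm   # LayerNorms inside the frozen blocks
    + layer_norm                       # embedding LayerNorm
)

print(f"Trainable params: {trainable/1e6:.2f}M / {total/1e6:.2f}M")
```

Almost all of the trainable budget comes from the two unfrozen encoder blocks, about 7.09M parameters each. The unfrozen LayerNorms in the other blocks add only about 0.03M.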


## Define metrics

In [None]:
# Define metrics for evaluation 
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }

## Set Up Data Collator

In [None]:
# Set up data collator for dynamic padding
collator = DataCollatorWithPadding(tokenizer=tok)
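
To illustrate what dynamic padding means, here is a minimal pure-Python sketch. It is not the `DataCollatorWithPadding` implementation. The helper `pad_batch` and the token ids are hypothetical, and sequences are assumed to be truncated already, as done in the tokenization step.

```python
def pad_batch(sequences, pad_id=0, fixed_len=None):
    """Pad token-id lists to a common length and build attention masks.

    With fixed_len=None the batch is padded to its own longest sequence
    (dynamic padding); otherwise every sequence is padded to fixed_len.
    Sequences are assumed to be no longer than the target length.
    """
    target = fixed_len if fixed_len is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return input_ids, attention_mask


# Hypothetical token ids for three short reviews (101/102 stand in for [CLS]/[SEP])
batch = [[101, 2023, 102], [101, 2307, 3185, 999, 102], [101, 102]]

dyn_ids, dyn_mask = pad_batch(batch)                  # pad to longest in batch
fix_ids, fix_mask = pad_batch(batch, fixed_len=256)   # pad to MAX_LEN

print("dynamic width:", len(dyn_ids[0]))
print("fixed width:  ", len(fix_ids[0]))
print("mask of first sequence:", dyn_mask[0])
```

With dynamic padding the batch is only as wide as its longest member, so batches of short reviews do much less work on padding tokens than with a fixed width of `MAX_LEN`.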

## Training arguments and Trainer setup

In [None]:
# Define training arguments and initialise trainer
args = TrainingArguments(
    output_dir="bert_lastk",
    eval_strategy="epoch",               # named evaluation_strategy before transformers 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    num_train_epochs=20,                 # upper bound; 3–5 is typical and early stopping ends the run sooner
    per_device_train_batch_size=128,     # larger than the 32–64 suggested for a 22.5GB GPU @ len=256; reduce if memory runs out
    per_device_eval_batch_size=128,
    learning_rate=2e-5,                  # conservative when partially freezing
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=50,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_hf,
    eval_dataset=val_hf,
    tokenizer=tok,
    data_collator=collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.1519 | 0.237339 | 0.9106 | 0.910832 |
| 2 | 0.1772 | 0.228266 | 0.9116 | 0.912752 |
| 3 | 0.1679 | 0.224609 | 0.9182 | 0.91851 |
| 4 | 0.158 | 0.232339 | 0.9162 | 0.916683 |
| 5 | 0.1414 | 0.243678 | 0.9156 | 0.915768 |


TrainOutput(global_step=785, training_loss=0.16795349607042445, metrics={'train_runtime': 600.2131, 'train_samples_per_second': 666.43, 'train_steps_per_second': 5.231, 'total_flos': 1.3155552768e+16, 'train_loss': 0.16795349607042445, 'epoch': 5.0})
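
The step count and the stopping point in this output can be reproduced with a little arithmetic. The sketch below is a simplified model of patience-based early stopping, not the `EarlyStoppingCallback` source. It assumes a single device, no gradient accumulation, and a zero improvement threshold. The helper `early_stop_epoch` is hypothetical.

```python
import math

TRAIN_SIZE, BATCH_SIZE = 20_000, 128
MAX_EPOCHS, PATIENCE = 20, 2

# Optimizer steps per epoch and for the full planned run
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
planned_steps = steps_per_epoch * MAX_EPOCHS


def early_stop_epoch(scores, patience):
    """Return (stop_epoch, best_epoch) for a metric where higher is better."""
    best, best_epoch, bad_evals = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best, best_epoch, bad_evals = score, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch, best_epoch
    return len(scores), best_epoch


# Validation F1 per epoch, copied from the training log above
val_f1 = [0.910832, 0.912752, 0.91851, 0.916683, 0.915768]
stop_epoch, best_epoch = early_stop_epoch(val_f1, PATIENCE)

print("steps per epoch:", steps_per_epoch)
print("planned steps:  ", planned_steps)
print("stopped after epoch", stop_epoch, "with best epoch", best_epoch)
print("steps actually run:", stop_epoch * steps_per_epoch)
```

F1 peaks at epoch 3 and fails to improve in epochs 4 and 5, so the run stops after 5 × 157 = 785 steps, matching `global_step=785`. Assuming the Trainer sizes warmup from the planned step count, `warmup_ratio=0.1` corresponds to roughly 314 warmup steps, a sizeable share of the steps actually run.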

## Evaluate on Test Set

In [None]:
# Evaluate Model On Test Set 
test_metrics = trainer.evaluate(test_hf)
print(test_metrics)  # accuracy / f1 you defined in compute_metrics


{'eval_loss': 0.2202177494764328, 'eval_accuracy': 0.91528, 'eval_f1': 0.9155704376943316, 'eval_runtime': 50.9052, 'eval_samples_per_second': 491.109, 'eval_steps_per_second': 3.85, 'epoch': 5.0}


## Classification Report and Confusion Matrix

In [None]:
# Generate classification report and confusion matrix 
pred = trainer.predict(test_hf)
y_true = pred.label_ids
y_pred = pred.predictions.argmax(axis=1)

# classification_report and confusion_matrix are already imported above
print(classification_report(y_true, y_pred, digits=4))
print(confusion_matrix(y_true, y_pred))

              precision    recall  f1-score   support

           0     0.9182    0.9118    0.9150     12500
           1     0.9124    0.9187    0.9156     12500

    accuracy                         0.9153     25000
   macro avg     0.9153    0.9153    0.9153     25000
weighted avg     0.9153    0.9153    0.9153     25000

[[11398  1102]
 [ 1016 11484]]
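
The figures in the report follow directly from the four cells of the confusion matrix. As a worked check, the sketch below recomputes them by hand, assuming scikit-learn's convention that rows are true labels and columns are predictions.

```python
# Confusion matrix from the test run above: rows = true label, columns = prediction
tn, fp = 11398, 1102
fn, tp = 1016, 11484

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Positive class (label 1)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class (label 0): the roles of the cells are swapped
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy = {accuracy:.4f}")
print(f"class 1: precision = {precision_pos:.4f}, recall = {recall_pos:.4f}, f1 = {f1_pos:.4f}")
print(f"class 0: precision = {precision_neg:.4f}, recall = {recall_neg:.4f}, f1 = {f1_neg:.4f}")
```

The positive-class F1 computed this way matches the `eval_f1` value reported by `trainer.evaluate`, because `f1_score` defaults to the positive class for binary labels.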


## Save model and tokenizer 

In [None]:
# Save trained model and tokenizer 
save_dir = "bert_model"
trainer.save_model(save_dir)          # saves model + config
tok.save_pretrained(save_dir)         # save tokenizer 

('bert_model/tokenizer_config.json',
 'bert_model/special_tokens_map.json',
 'bert_model/vocab.txt',
 'bert_model/added_tokens.json',
 'bert_model/tokenizer.json')
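
Once saved, the model and tokenizer can be reloaded from the same directory for inference. The sketch below is one possible way to do that. The helper `predict_sentiment` and the two example sentences are hypothetical. The call is guarded so that it only runs when the `bert_model` directory exists.

```python
import os


def predict_sentiment(texts, model_dir="bert_model", max_length=256):
    """Load the saved model and tokenizer and return one label id per text."""
    # Imported lazily so the helper can be defined without the heavy dependencies
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference

    # Pad within the batch and truncate to the length used during training
    batch = tokenizer(texts, truncation=True, max_length=max_length,
                      padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return logits.argmax(dim=-1).tolist()


# Only run when the directory written by the save cell is present
if os.path.isdir("bert_model"):
    print(predict_sentiment(["A captivating story line.", "Terrible acting."]))
```

The helper runs on CPU by default. Moving the model and the batch to `device` would be the natural extension when a GPU is available.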

## Analyse false positives and negatives

In [None]:
# Analyse false positives and negatives


# Get predictions on test set 
pred = trainer.predict(test_hf)
logits = pred.predictions                 # (N, num_classes)
y_true = pred.label_ids                   # (N,)

# Softmax → probs, preds, confidence of predicted class
probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
y_pred = probs.argmax(axis=1)
conf   = probs.max(axis=1)

tbl = pd.DataFrame({
    "text":  df_test["text"].reset_index(drop=True),
    "y_true": y_true,
    "y_pred": y_pred,
    "conf":  conf
})
# For binary tasks, also include P(class=1)
if probs.shape[1] == 2:
    tbl["p1"] = probs[:, 1]

# Select FPs (pred=1, true=0) and FNs (pred=0, true=1)
fps = tbl[(tbl.y_true == 0) & (tbl.y_pred == 1)].sort_values("conf", ascending=False).head(10)
fns = tbl[(tbl.y_true == 1) & (tbl.y_pred == 0)].sort_values("conf", ascending=False).head(10)


def show_examples(title, rows, n_chars=400):
    # Print each example with its confidence (and P(class=1) when available)
    print(title)
    for _, r in rows.iterrows():
        if "p1" in r:
            print(f"[conf={r.conf:.3f} | p1={r.p1:.3f}]  {r.text[:n_chars]}")
        else:
            print(f"[conf={r.conf:.3f}]  {r.text[:n_chars]}")
        print()

show_examples("==== FALSE POSITIVES (pred=1, label=0) ====", fps)
show_examples("==== FALSE NEGATIVES (pred=0, label=1) ====", fns)


==== FALSE POSITIVES (pred=1, label=0) ====
[conf=0.998 | p1=0.998]  I really liked this quirky movie. The characters are not the bland beautiful people that show up in so many movies and on TV. It has a realistic edge, with a captivating story line. The main title sequence alone makes this movie fun to watch.

[conf=0.997 | p1=0.997]  This has to be one of, if not THE greatest Mob/Crime films of all time. Every thing about this movie is great, the acting in this film is of true quality; Master P's acting skills make you actually believe he is Italian! The cinematography is excellent too, probably the best ever. This movie was great; and I have the brain capacity of an earth worm.

[conf=0.997 | p1=0.997]  This has to be one of the all time greatest horror movies. Charles Band made the best movie of 96' in this little seen gem. Highly realistic and , incredibly stylised- with a visual flair David Fincher would envy, its not hard to see why Band went on to make such classics as 'Killjoy