# ModernBERT: Fine-Tuning on the IMDB Dataset

This notebook explores how to fine-tune [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) using the [IMDB reviews dataset](https://huggingface.co/datasets/stanfordnlp/imdb).

## Classification Task

The IMDB reviews dataset contains `50,000` labeled movie reviews, split evenly into `25,000` for training and `25,000` for testing, each coded as `negative` or `positive` sentiment. The task is to fine-tune a language model that can accurately predict whether a review is `positive` or `negative`.

The features of the dataset are:

| Feature | Description                  |
|---------|------------------------------|
| text    | str: the text of the review. |
| label   | int: 1 = pos, 0 = neg        |


In [1]:
!pip install -U "transformers>=4.48.0"
!pip install datasets
!pip install evaluate



In [2]:
# imports
import numpy as np
import pandas as pd

import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

from datasets import load_dataset

**Grab dataset:**

In [3]:
# grab data:
train_dataset = load_dataset("imdb", split="train")
test_dataset = load_dataset("imdb", split="test")

In [4]:
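# draw a class-balanced random subsample from each split:
train_class_sample_size = 750  # reviews per class in the training sample
test_class_sample_size = 250   # reviews per class in the test sample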

# training sample:
pos_train_idx = np.where(np.array(train_dataset["label"]) == 1)
neg_train_idx = np.where(np.array(train_dataset["label"]) == 0)
pos_train_idx = np.random.choice(pos_train_idx[0], train_class_sample_size, replace=False)
neg_train_idx = np.random.choice(neg_train_idx[0], train_class_sample_size, replace=False)
train_dataset = train_dataset.select(np.sort(np.concatenate([pos_train_idx, neg_train_idx])))

# testing sample:
pos_test_idx = np.where(np.array(test_dataset["label"]) == 1)
neg_test_idx = np.where(np.array(test_dataset["label"]) == 0)
pos_test_idx = np.random.choice(pos_test_idx[0], test_class_sample_size, replace=False)
neg_test_idx = np.random.choice(neg_test_idx[0], test_class_sample_size, replace=False)
test_dataset = test_dataset.select(np.sort(np.concatenate([pos_test_idx, neg_test_idx])))
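The sampling above draws a different subset on every run because no random seed is set. A minimal sketch of a seeded, class-balanced sampler follows. The helper name `balanced_sample_indices` and the seed value are illustrative assumptions, not part of the original notebook, and a toy label list stands in for `train_dataset["label"]`.

```python
import numpy as np


def balanced_sample_indices(labels, per_class, seed=42):
    """Return sorted indices holding `per_class` examples of each label.

    Hypothetical helper: the name and the default seed are illustrative.
    """
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)  # positions of this class
        chosen.append(rng.choice(cls_idx, size=per_class, replace=False))
    return np.sort(np.concatenate(chosen))


# Toy stand-in for train_dataset["label"]: 10 negatives followed by 10 positives.
toy_labels = [0] * 10 + [1] * 10
idx = balanced_sample_indices(toy_labels, per_class=3)
print(idx, np.asarray(toy_labels)[idx])
```

With the real data, the returned indices could be passed to `train_dataset.select(...)` in the same way as the concatenated index arrays in the cell above.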

In [5]:
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 1500
})

In [6]:
test_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 500
})

In [7]:
train_dataset["label"][0], train_dataset["text"][0]

(0,
 "I have read all of the Love Come Softly books. Knowing full well that movies can not use all aspects of the book,but generally they at least have the main point of the book. I was highly disappointed in this movie. The only thing that they have in this movie that is in the book is that Missy's father comes to visit,(although in the book both parents come). That is all. The story line was so twisted and far fetch and yes, sad, from the book, that I just couldn't enjoy it. Even if I didn't read the book it was too sad. I do know that Pioneer life was rough,but the whole movie was a downer. The rating is for having the same family orientation of the film that makes them great.")

In [8]:
train_dataset["label"][-1], train_dataset["text"][-1]

(1,
 "Very smart, sometimes shocking, I just love it. It shoved one more side of David's brilliant talent. He impressed me greatly! David is the best. The movie captivates your attention for every second.")

In [9]:
test_dataset["label"][0], test_dataset["text"][0]

(0,
 "STAR RATING: ***** Saturday Night **** Friday Night *** Friday Morning ** Sunday Night * Monday Morning <br /><br />Former New Orleans homicide cop Jack Robideaux (Jean Claude Van Damme) is re-assigned to Columbus, a small but violent town in Mexico to help the police there with their efforts to stop a major heroin smuggling operation into their town. The culprits turn out to be ex-military, lead by former commander Benjamin Meyers (Stephen Lord, otherwise known as Jase from East Enders) who is using a special method he learned in Afghanistan to fight off his opponents. But Jack has a more personal reason for taking him down, that draws the two men into an explosive final showdown where only one will walk away alive.<br /><br />After Until Death, Van Damme appeared to be on a high, showing he could make the best straight to video films in the action market. While that was a far more drama oriented film, with The Shepherd he has returned to the high-kicking, no brainer action that

In [10]:
test_dataset["label"][-1], test_dataset["text"][-1]

(1,
 'H.G. Cluozot had difficulties working in France after he had made "Le Corbeau" in 1943 which was produced by the German company and later judged by French as a piece of anti-French propaganda. Louis Jouvet, an admirer of Clouzot\'s work, invited him to direct a thriller "Quai des Orfevres" where he played an ambiguous police inspector investigating a murder that happened in Paris Music Hall. Without each other knowledge, the seductive cabaret singer Jenny Lamoure (Suzy Delair) and her jealous piano-accompanist husband Maurice who is madly in love with her (Bertrand Blier, father of director Bertrand Blier) trying to cover up (without each other\'s knowledge) what they believe to be their involvement in the murder? Enters tenacious policeman (Louis Jouvet) who is determined to discover the truth. Jouvet practically stole the movie with wonderfully cynic and sentimental in the same time performance. "His character, his eagle-like profile and his unique way of speaking made him unfo

### Create classification label mappings:

In [11]:
num_labels = 2
id2label = {0: "neg", 1: "pos"}
label2id = {v:k for k,v in id2label.items()}

In [12]:
id2label

{0: 'neg', 1: 'pos'}

In [13]:
label2id

{'neg': 0, 'pos': 1}
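As a small illustration of how these mappings are used at prediction time, the sketch below converts raw logits into label names. The logit values are made up for the example; only `id2label` and `label2id` come from the notebook.

```python
import numpy as np

id2label = {0: "neg", 1: "pos"}
label2id = {v: k for k, v in id2label.items()}

# Made-up logits for three reviews, one row per review: [neg score, pos score].
toy_logits = np.array([[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]])

pred_ids = np.argmax(toy_logits, axis=1)            # index of the larger logit per row
pred_labels = [id2label[int(i)] for i in pred_ids]  # map class ids to names

print(pred_labels)  # ['neg', 'pos', 'neg']
```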

### Model Features

- Load the model, `ModernBERT-base`.
- Load the model's tokenizer.

In [14]:
# model ID on Hugging Face:
model_id = "answerdotai/ModernBERT-base"

# model itself:
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, id2label=id2label, label2id=label2id
)

# tokenizer:
tokenizer = AutoTokenizer.from_pretrained(model_id)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
model.to("cuda")

ModernBertForSequenceClassification(
  (model): ModernBertModel(
    (embeddings): ModernBertEmbeddings(
      (tok_embeddings): Embedding(50368, 768, padding_idx=50283)
      (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (layers): ModuleList(
      (0): ModernBertEncoderLayer(
        (attn_norm): Identity()
        (attn): ModernBertAttention(
          (Wqkv): Linear(in_features=768, out_features=2304, bias=False)
          (rotary_emb): ModernBertRotaryEmbedding()
          (Wo): Linear(in_features=768, out_features=768, bias=False)
          (out_drop): Identity()
        )
        (mlp_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): ModernBertMLP(
          (Wi): Linear(in_features=768, out_features=2304, bias=False)
          (act): GELUActivation()
          (drop): Dropout(p=0.0, inplace=False)
          (Wo): Linear(in_features=1152, out_features=768, bias=False)
        )
      

In [16]:
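def tokenize_function(examples):
    # Truncate long reviews and pad every example to the same fixed length,
    # so the Trainer's default collator can stack them into one tensor.
    # map() stores plain lists, so return_tensors is not needed here.
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=1024,
    )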
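To make the `truncation`, `padding`, and `max_length` arguments concrete, here is a pure-Python sketch of what they do to a batch of token-id lists. It is not the tokenizer's actual implementation (real tokenizers also handle special tokens). The function name `truncate_and_pad`, the pad id of `0`, and the tiny `max_length` are assumptions chosen for readability.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Cut each sequence to `max_length`, then pad to the longest one left.

    Illustrative stand-in for truncation=True with padding=True;
    padding="max_length" would instead pad every sequence to `max_length`.
    """
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
out = truncate_and_pad(toy_batch, max_length=4)
print(out["input_ids"])       # [[5, 6, 7, 8], [11, 12, 0, 0], [13, 14, 15, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
```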

In [17]:
training_tokenized = train_dataset.map(tokenize_function, batched=True, batch_size=1000)
testing_tokenized = test_dataset.map(tokenize_function, batched=True, batch_size=1000)


### Train

In [18]:
# metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1_score = evaluate.load("f1")

In [19]:
def compute_metrics(eval_pred):
    # logits, labels, and predictions:
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)

    acc = accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    pre = precision.compute(predictions=predictions, references=labels)["precision"]
    rec = recall.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_score.compute(predictions=predictions, references=labels, average="binary")["f1"]

    return {"accuracy": acc, "precision": pre, "recall": rec, "f1": f1}

In [20]:
# training arguments
training_args = TrainingArguments(
    output_dir="ModernBERT-imdb-classifier",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    learning_rate=8e-5,
    num_train_epochs=3,
    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    report_to="none"
)

In [21]:
# trainer:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_tokenized,
    eval_dataset=testing_tokenized,
    compute_metrics=compute_metrics
)

In [22]:
trainer.train()

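| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1       |
|-------|---------------|-----------------|----------|-----------|--------|----------|
| 1     | 0.391         | 0.460599        | 0.896    | 0.84375   | 0.972  | 0.903346 |
| 2     | 0.1557        | 0.294641        | 0.948    | 0.940945  | 0.956  | 0.948413 |
| 3     | 0.0667        | 0.281353        | 0.954    | 0.959514  | 0.948  | 0.953722 |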


TrainOutput(global_step=1125, training_loss=0.24709596106720466, metrics={'train_runtime': 1774.1529, 'train_samples_per_second': 2.536, 'train_steps_per_second': 0.634, 'total_flos': 3066820614144000.0, 'train_loss': 0.24709596106720466, 'epoch': 3.0})
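The `global_step=1125` reported above can be checked by hand from the sample size, batch size, and epoch count. The sketch below also evaluates a linear learning-rate decay without warmup, which is assumed here to match the `Trainer` default when no scheduler is configured. The function name `linear_lr` is illustrative, and a single training device is assumed.

```python
import math

train_examples = 1500   # 750 positive + 750 negative reviews
batch_size = 4          # per_device_train_batch_size (single device assumed)
epochs = 3              # num_train_epochs
base_lr = 8e-5          # learning_rate

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step, base_lr=base_lr, total_steps=total_steps):
    """Linearly decay from base_lr at step 0 to 0 at the final step (no warmup)."""
    return base_lr * max(0.0, 1 - step / total_steps)


print(steps_per_epoch, total_steps)  # 375 1125
print(f"{linear_lr(100):.1e}")       # 7.3e-05
```

Under these assumptions, the value at step 100 agrees with the learning rate of `7.3e-05` that the trainer logged at that step.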

In [23]:
trainer.save_model("ModernBERT-imdb-classifier")

In [24]:
metric_df = pd.DataFrame(trainer.state.log_history)
metric_df

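|   | loss | grad_norm | learning_rate | epoch | step | eval_loss | eval_accuracy | eval_precision | eval_recall | eval_f1 | eval_runtime | eval_samples_per_second | eval_steps_per_second | train_runtime | train_samples_per_second | train_steps_per_second | total_flos | train_loss |
|---|------|-----------|---------------|-------|------|-----------|---------------|----------------|-------------|---------|--------------|-------------------------|-----------------------|---------------|--------------------------|------------------------|------------|------------|
| 0 | 0.8877 | 71.058342 | 7.3e-05 | 0.266667 | 100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 0.5248 | 25.423561 | 6.6e-05 | 0.533333 | 200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 0.391 | 51.821434 | 5.9e-05 | 0.8 | 300 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | 1.0 | 375 | 0.460599 | 0.896 | 0.84375 | 0.972 | 0.903346 | 53.034 | 9.428 | 4.714 | NaN | NaN | NaN | NaN | NaN |
| 4 | 0.3223 | 8.487826 | 5.2e-05 | 1.066667 | 400 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 0.1133 | 0.008962 | 4.4e-05 | 1.333333 | 500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 0.2199 | 0.014823 | 3.7e-05 | 1.6 | 600 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 0.1557 | 0.00699 | 3e-05 | 1.866667 | 700 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | 2.0 | 750 | 0.294641 | 0.948 | 0.940945 | 0.956 | 0.948413 | 52.2923 | 9.562 | 4.781 | NaN | NaN | NaN | NaN | NaN |
| 9 | 0.0622 | 0.00413 | 2.3e-05 | 2.133333 | 800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |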


In [25]:
trainer.evaluate()

{'eval_loss': 0.28135251998901367,
 'eval_accuracy': 0.954,
 'eval_precision': 0.9595141700404858,
 'eval_recall': 0.948,
 'eval_f1': 0.9537223340040242,
 'eval_runtime': 52.0983,
 'eval_samples_per_second': 9.597,
 'eval_steps_per_second': 4.799,
 'epoch': 3.0}
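
Because the test sample holds exactly 250 reviews per class, the confusion-matrix counts behind these scores can be recovered from precision and recall alone. The sketch below does that arithmetic and recomputes accuracy and F1 as a consistency check. The variable names are illustrative, and the positive class is taken to be label `1`.

```python
n_pos = n_neg = 250             # test_class_sample_size per class
precision = 0.9595141700404858  # eval_precision
recall = 0.948                  # eval_recall

tp = round(recall * n_pos)       # true positives:  recall = tp / n_pos
fn = n_pos - tp                  # false negatives
fp = round(tp / precision) - tp  # false positives: precision = tp / (tp + fp)
tn = n_neg - fp                  # true negatives

accuracy = (tp + tn) / (n_pos + n_neg)
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fn, fp, tn)           # 237 13 10 240
print(accuracy, round(f1, 6))   # 0.954 0.953722
```

The recomputed accuracy and F1 match the reported `eval_accuracy` and `eval_f1`, so the model misclassified 23 of the 500 sampled test reviews: 13 positive reviews predicted as negative and 10 negative reviews predicted as positive.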