
# Assignment 4

> 1. *Fine-tune answerdotai/ModernBERT-large on the clinc_oos (load_dataset("clinc_oos", "plus")) dataset. Show the main training results and show the result from giving the model the following query: “Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in Paris and I need a 15 passenger van”.*
> 2. *Follow the example from Chapter 8 in Natural Language Processing with Transformers and produce a plot where you compare the accuracy, latency and size of 1) your fine tuned ModernBert-large baseline, 2) a distilled version, 3) a distilled version using Optuna and 4) a quantized-distilled version. Think carefully about the appropriate student model to select.*

## Setup

In [8]:
# Installing packages
!pip install evaluate



In [9]:
# Loading libraries
import evaluate
import torch
import numpy as np
import os

# Loading functions
from pathlib import Path
from time import perf_counter
from transformers import pipeline
from transformers import DataCollatorForLanguageModeling
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import TrainingArguments
from transformers import Trainer
from google.colab import userdata
from huggingface_hub import login as login_hf
from wandb import login as login_wandb


In [2]:
# Logging in to Huggingface and wandb
os.environ['HF_TOKEN'] = userdata.get('HF')
os.environ['WANDB_TOKEN'] = userdata.get('wandb')
login_hf(os.environ['HF_TOKEN'])
login_wandb(key = os.environ['WANDB_TOKEN'])

Note: Environment variable `HF_TOKEN` is set and is the current active token independently from the token you've just configured.
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: simonmellergaard (simonmellergaard-aarhus-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [3]:
# Setting up the device (GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

## 1 Fine-tuning ModernBERT

The [CLINC150](https://huggingface.co/datasets/clinc/clinc_oos) dataset is loaded. It contains queries labeled with 150 different intents plus one out-of-scope category, for a total of 151 labels:

In [4]:
from datasets import load_dataset

dataset_clinc = load_dataset("clinc_oos", "plus")



Generating train split: 15250 examples
Generating validation split: 3100 examples
Generating test split: 5500 examples

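The tokenizer for [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) is loaded and applied to the dataset, the `intent` column is renamed to `label` as the `Trainer` expects, and the base model is loaded with a 151-way classification head.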

In [15]:
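# Tokenizing
model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize_function(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized_datasets = dataset_clinc.map(tokenize_function, batched=True)

# Read the label names before renaming the column: after the rename,
# the "intent" column no longer exists in tokenized_datasets
intents = dataset_clinc["train"].features["intent"]
num_labels = intents.num_classes  # 151
id2label = dict(enumerate(intents.names))
label2id = {name: idx for idx, name in id2label.items()}

tokenized_datasets = tokenized_datasets.rename_column("intent", "label")

# Loading the base model; the label mappings let the pipeline return intent names
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, id2label=id2label, label2id=label2id
).to(device)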

During training, accuracy and the weighted F1 score are computed on the evaluation set.

In [20]:
f1_score = evaluate.load("f1")
accuracy = evaluate.load("accuracy")

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Compute accuracy and F1 Score
    acc_result = accuracy.compute(references=labels, predictions=preds)
    acc = acc_result["accuracy"]

    f1_result = f1_score.compute(
        references=labels, predictions=preds, average="weighted"
    )
    f1 = f1_result["f1"]

    return {"accuracy": acc, "f1": f1}
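To make the two metrics concrete, here is a minimal pure-Python sketch of what accuracy and support-weighted F1 compute. The helper `accuracy_and_weighted_f1` and the toy labels are hypothetical and only for illustration; the notebook itself relies on the `evaluate` library.

```python
from collections import Counter

def accuracy_and_weighted_f1(labels, preds):
    """Accuracy and support-weighted F1 for integer class labels."""
    n = len(labels)
    accuracy = sum(l == p for l, p in zip(labels, preds)) / n
    support = Counter(labels)  # number of true examples per class
    weighted_f1 = 0.0
    for cls, n_true in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        n_pred = sum(p == cls for p in preds)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        # Each class contributes in proportion to its share of true labels
        weighted_f1 += f1 * n_true / n
    return {"accuracy": accuracy, "f1": weighted_f1}

# Hypothetical toy data: 4 examples, 3 classes, one mistake
toy_labels = [0, 0, 1, 2]
toy_preds = [0, 1, 1, 2]
toy_metrics = accuracy_and_weighted_f1(toy_labels, toy_preds)
print(toy_metrics)
```

Weighting by support means that frequent classes dominate the score, which matters for the test split, where the out-of-scope class is much larger than any single intent.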


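The training arguments are defined below. The model is trained for 4 epochs with a batch size of 32, a learning rate of 7e-5 on a cosine schedule, and a weight decay of 0.1. The validation set is used for evaluation during training, and the checkpoint with the lowest validation loss is loaded at the end.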

In [40]:
training_args = TrainingArguments(
    "modernbert-clinc",
    # push_to_hub=True,
    per_device_train_batch_size=32,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    learning_rate=7e-5,
    num_train_epochs=4,  # 4 epochs, as in the logged run (1908 steps)
    eval_steps=300,
    logging_steps=300,
    eval_strategy="steps",
    save_steps=300,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_datasets["train"], # Using all training data
    eval_dataset=tokenized_datasets["validation"],
)
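With `lr_scheduler_type="cosine"` and no warmup steps configured, the learning rate should decay along a half cosine from 7e-5 towards 0 over the course of training. The sketch below illustrates that shape. The helper `cosine_lr` is hypothetical (it is not the library's scheduler), and the total of 1908 steps is an assumption corresponding to 4 epochs of 477 steps each.

```python
import math

def cosine_lr(step, total_steps, base_lr=7e-5):
    """Half-cosine decay from base_lr to 0, assuming no warmup."""
    progress = min(step / total_steps, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 4 * 477  # assumption: 4 epochs of 477 optimizer steps
for step in (0, 477, 954, 1431, total_steps):
    print(f"step {step:4d}: lr = {cosine_lr(step, total_steps):.2e}")
```

The rate stays close to its peak early on and drops fastest around the midpoint, so most of the large updates happen in the first half of training.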


In [41]:
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy | F1 |
|------|---------------|-----------------|----------|----|
| 100 | 2.712 | 2.65992 | 0.374516 | 0.328377 |
| 200 | 2.0817 | 2.260207 | 0.45129 | 0.425979 |
| 300 | 1.6662 | 1.935358 | 0.518065 | 0.495253 |
| 400 | 1.4309 | 1.644865 | 0.601935 | 0.587358 |
| 500 | 1.1682 | 1.453648 | 0.64871 | 0.637002 |
| 600 | 0.8532 | 1.309207 | 0.684516 | 0.678649 |
| 700 | 0.7879 | 1.265814 | 0.696129 | 0.693231 |
| 800 | 0.6966 | 1.144547 | 0.733871 | 0.727975 |
| 900 | 0.6659 | 1.118481 | 0.736452 | 0.732372 |
| 1000 | 0.498 | 1.052789 | 0.74871 | 0.748719 |


TrainOutput(global_step=1908, training_loss=0.7961263657865785, metrics={'train_runtime': 2204.5966, 'train_samples_per_second': 27.669, 'train_steps_per_second': 0.865, 'total_flos': 2650336549787568.0, 'train_loss': 0.7961263657865785, 'epoch': 4.0})
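The figures in `TrainOutput` can be cross-checked against the dataset size and the batch size. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The variable names are illustrative.

```python
import math

train_examples = 15250       # size of the train split
batch_size = 32              # per_device_train_batch_size
epochs = 4                   # 'epoch': 4.0 in TrainOutput
train_runtime_s = 2204.5966  # 'train_runtime' in TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the computed totals agree with `global_step=1908`, `train_samples_per_second=27.669` and `train_steps_per_second=0.865`, which confirms that the logged run covered 4 epochs.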

In [42]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/simon-mellergaard/modernbert-clinc/commit/10686c67532f11e756c1582c931d59426886f5c3', commit_message='End of training', commit_description='', oid='10686c67532f11e756c1582c931d59426886f5c3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/simon-mellergaard/modernbert-clinc', endpoint='https://huggingface.co', repo_type='model', repo_id='simon-mellergaard/modernbert-clinc'), pr_revision=None, pr_num=None)

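In the logged evaluation steps, validation accuracy rises from 0.37 at step 100 to 0.75 at step 1000, while validation loss falls from 2.66 to 1.05. The fine-tuned model, pushed to the Hub as `simon-mellergaard/modernbert-clinc`, can be loaded for prediction: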

In [54]:
# Loading the model
bert_ckpt = "simon-mellergaard/modernbert-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)

Device set to use cuda:0


With the model loaded, it is applied to the query from the assignment:

In [55]:
query = "Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in Paris and I need a 15 passenger van"
pipe(query)

[{'label': 'car_rental', 'score': 0.9931524991989136}]

The model predicts `car_rental` with a score of 0.993, which matches the intent of the query.
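For a single-label classifier like this one, the reported `score` is typically the softmax probability of the highest-scoring class. The sketch below shows that step in plain Python. The helper names, the three-class label map, and the logits are hypothetical; the real model produces 151 logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical 3-class example with made-up logits
toy_id2label = {0: "car_rental", 1: "book_flight", 2: "oos"}
toy_result = top_prediction([6.0, 1.0, 0.5], toy_id2label)
print(toy_result)
```

A score close to 1 therefore means that one logit dominates all others, not that the prediction is guaranteed to be correct.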

## 2 Improving models and comparing

The following class, adapted from Chapter 8 of *Natural Language Processing with Transformers*, is used to benchmark the models:

In [59]:
accuracy_score = evaluate.load('accuracy')

class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type

    def compute_accuracy(self):
        intents = dataset_clinc["test"].features["intent"]
        preds, labels = [], []
        for example in self.dataset:
            pred = self.pipeline(example["text"])[0]["label"]
            label = example["intent"]
            preds.append(intents.str2int(pred))
            labels.append(label)
        accuracy = accuracy_score.compute(predictions=preds, references=labels)
        print(f"Accuracy on test set - {accuracy['accuracy']:.3f}")
        return accuracy

    def compute_size(self):
        state_dict = self.pipeline.model.state_dict()
        tmp_path = Path("model.pt")
        torch.save(state_dict, tmp_path)
        # Calculate size in megabytes
        size_mb = Path(tmp_path).stat().st_size / (1024 * 1024)
        # Delete temporary file
        tmp_path.unlink()
        print(f"Model size (MB) - {size_mb:.2f}")
        return {"size_mb": size_mb}

    def time_pipeline(self):
        latencies = []
        # Warmup
        for _ in range(10):
            _ = self.pipeline(query)
        # Timed run
        for _ in range(100):
            start_time = perf_counter()
            _ = self.pipeline(query)
            latency = perf_counter() - start_time
            latencies.append(latency)
        # Compute run statistics
        time_avg_ms = 1000 * np.mean(latencies)
        time_std_ms = 1000 * np.std(latencies)
        print(f"Average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f}")
        return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}

    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics


The class measures the size, latency and accuracy of each model. To start, the benchmark is run on the baseline, the fine-tuned ModernBERT-large:

In [62]:
pb = PerformanceBenchmark(pipe, dataset_clinc["test"])
perf_metrics = pb.run_benchmark()

Model size (MB) - 1510.63
Average latency (ms) - 33.97 +\- 2.40
Accuracy on test set - 0.715
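As a rough sanity check of these numbers, the file size can be converted into an approximate parameter count and the latency into a throughput. This is a back-of-the-envelope sketch that assumes every weight is stored as a 32-bit float and ignores serialization overhead. The variable names are illustrative.

```python
size_mb = 1510.63     # "Model size (MB)" from the benchmark output
time_avg_ms = 33.97   # "Average latency (ms)" from the benchmark output
bytes_per_weight = 4  # assumption: all weights are float32

approx_params = size_mb * 1024 * 1024 / bytes_per_weight
queries_per_second = 1000 / time_avg_ms

print(f"~{approx_params / 1e6:.0f}M parameters")
print(f"~{queries_per_second:.1f} queries per second (sequential, batch size 1)")
```

The estimate of roughly 396M parameters gives a reference point for choosing a student model: the smaller the student's parameter count relative to this, the larger the expected reduction in size.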


### Making a distilled version

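Knowledge distillation trains a smaller student model to reproduce the predictions of the fine-tuned teacher. With the baseline at 1510.63 MB and roughly 34 ms per query, a well-chosen student can reduce both size and latency considerably.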
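The usual distillation objective combines the ordinary cross-entropy on the true label with a KL divergence between temperature-softened teacher and student distributions. The sketch below writes this out in plain Python for a single example. The function names are hypothetical, and `alpha=0.5` and `temperature=2.0` are illustrative defaults, not tuned values from this notebook.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Cross-entropy of the student against the true label (temperature 1)
    ce = -log_softmax(student_logits)[label]
    # KL divergence between the softened teacher and student distributions
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl

# Hypothetical logits for a 3-class problem, true label 0
teacher = [4.0, 1.0, 0.0]
student = [2.0, 1.5, 0.5]
loss = distillation_loss(student, teacher, label=0)
print(f"{loss:.4f}")
```

The factor `temperature ** 2` compensates for the smaller gradients produced by the softened distributions, and `alpha` controls how much the student listens to the labels versus the teacher. Both are natural candidates for a hyperparameter search.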