# Getting Started with ModernBERT & GLUE

Created by: [Wayde Gilliam](https://twitter.com/waydegilliam)

## Encoders Strike Back!

Like many, I have fond memories of finetuning deberta, roberta and bert models for a number of Kaggle comps and real-world problems (e.g., NER, sentiment analysis, etc.).  Encoder models were "the thing" back in the day and continue to be the primary workhorse for many ML pipelines today though they have been eclipsed by recent advancements in LLMs which typically are based on decoder-only architectures. Long have we awaited a return to an encoder model for the modern world. With ModernBERT, that wait is over! ModernBERT is a new encoder-only model that incorporates the latest features in making neural networks more efficient, faster, and better at handling tasks that encoder models have long excelled at such at text classification.  In addition, ModernBERT allows us to break out of that max 512 token limit with their long context capabilities which give us 8,192 tokens to play with.

In this tutorial, we'll go through the steps of fine-tuning ModernBERT for one of the GLUE tasks, MRPC.  We'll cover some key settings required to use it with the HuggingFace trainer and include with some recommended hyperparameters that have served us well in fine-tuning ModernBERT for GLUE.  We'll also see how to use the model for inference and cleanup the model from the GPU to free up resources.

As an aside, I'm running all this code on a single 3090 with plenty of GPU memory to spare.

Though not strictly necessary, **ModernBERT trains better with FlashAttention!**. Training and inference will be much faster with it installed. See below:

ModernBERT is built on top of FlashAttention which is a highly optimized implementation of the attention mechanism that is faster and more memory efficient than the standard implementation.  ***The beauty of this is all you need to do is install it for ModernBERT to work with it!***  Here's how ...

For NVIDIA GPUs with compute capability 8.0+ (Ampere/Ada/Hopper architecture - A100, A6000, RTX 3090, RTX 4090, H100 etc):
```python
pip install flash-attn --no-build-isolation
```

For older NVIDIA GPUs (pre-Ampere):
```python
pip install flash-attn --no-deps
```


In [None]:
#! pip install setuptools transformers datasets accelerate scikit-learn -Uqq
# install setuptools and do this before installing flash-attn
# pip install flash-attn --no-build-isolation


In [1]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"


In [2]:
import numpy as np
import pandas as pd
import torch
from functools import partial
import gc

from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    TrainerCallback,
)

from sklearn.metrics import matthews_corrcoef, accuracy_score, f1_score
from scipy.stats import pearsonr, spearmanr

os.environ["TOKENIZERS_PARALLELISM"] = "false"


  from .autonotebook import tqdm as notebook_tqdm


## What is GLUE?

The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine diverse natural language understanding tasks designed to evaluate and compare the performance of NLP models across various language comprehension challenges. By providing a standardized framework, GLUE facilitates the development of models that generalize well across multiple tasks, promoting advancements in creating robust and versatile language understanding systems. 

Let's put this all these tasks in a dictionary along with some other helpful metadata about each one that might prove useful to iteratting over all of them.



In [12]:
glue_tasks = {

    "stsb": {
        "abbr": "SST-B",
        "name": "Semantic Textual Similarity Benchmark",
        "description": "Predict the similarity score for two sentences on a scale from 1 to 5",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Misc.",
        "size": "7k",
        "metrics": "Pearson/Spearman corr.",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [pearsonr, spearmanr],
        "n_labels": 1,
    },
    "qqp": {
        "abbr": "QQP",
        "name": "Quora question pair",
        "description": "Predict if two questions are a paraphrase of one another",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Social QA questions",
        "size": "364k",
        "metrics": "F1/Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question1", "question2"],
        "target": "label",
        "metric_funcs": [f1_score, accuracy_score],
        "n_labels": 2,
    },
    "mnli-matched": {
        "abbr": "MNLI",
        "name": "Mulit-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_matched", "test": "test_matched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
    },
    "mnli-mismatched": {
        "abbr": "MNLI",
        "name": "Mulit-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_mismatched", "test": "test_mismatched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
    },
    "qnli": {
        "abbr": "QNLI",
        "name": "Stanford Question Answering Dataset",
        "description": "Predict whether the context sentence contains the answer to the question",
        "task_type": "Inference Tasks",
        "domain": "Wikipedia",
        "size": "105k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question", "sentence"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
    },
    "rte": {
        "abbr": "RTE",
        "name": "Recognize Textual Entailment",
        "description": "Predict whether one sentece entails another",
        "task_type": "Inference Tasks",
        "domain": "News, Wikipedia",
        "size": "2.5k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
    },
    "wnli": {
        "abbr": "WNLI",
        "name": "Winograd Schema Challenge",
        "description": "Predict if the sentence with the pronoun substituted is entailed by the original sentence",
        "task_type": "Inference Tasks",
        "domain": "Fiction books",
        "size": "634",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
    },
}

# for v in glue_tasks.values(): print(v)
glue_tasks.values()

glue_df = pd.DataFrame(glue_tasks.values(), columns=["abbr", "name", "task_type", "description", "size", "metrics"])
glue_df.columns = glue_df.columns.str.replace("_", " ").str.capitalize()
display(glue_df.style.set_properties(**{"text-align": "left"}))


Unnamed: 0,Abbr,Name,Task type,Description,Size,Metrics
0,SST-B,Semantic Textual Similarity Benchmark,Similarity and Paraphrase Tasks,Predict the similarity score for two sentences on a scale from 1 to 5,7k,Pearson/Spearman corr.
1,QQP,Quora question pair,Similarity and Paraphrase Tasks,Predict if two questions are a paraphrase of one another,364k,F1/Accuracy
2,MNLI,Mulit-Genre Natural Language Inference,Inference Tasks,"Predict whether the premise entails, contradicts or is neutral to the hypothesis",393k,Accuracy
3,MNLI,Mulit-Genre Natural Language Inference,Inference Tasks,"Predict whether the premise entails, contradicts or is neutral to the hypothesis",393k,Accuracy
4,QNLI,Stanford Question Answering Dataset,Inference Tasks,Predict whether the context sentence contains the answer to the question,105k,Accuracy
5,RTE,Recognize Textual Entailment,Inference Tasks,Predict whether one sentece entails another,2.5k,Accuracy
6,WNLI,Winograd Schema Challenge,Inference Tasks,Predict if the sentence with the pronoun substituted is entailed by the original sentence,634,Accuracy


## Let's Fine-Tune ModernBERT for MRPC

### Configuration

ModernBERT currently comes in two flavors, base and large. To keep things lean and mean, we'll use the "answerdotai/ModernBERT-base" checkpoint for this example.

In [7]:
task = "mrpc"
task_meta = glue_tasks[task]
train_ds_name = task_meta["dataset_names"]["train"]
valid_ds_name = task_meta["dataset_names"]["valid"]
test_ds_name = task_meta["dataset_names"]["test"]

task_inputs = task_meta["inputs"]
task_target = task_meta["target"]
n_labels = task_meta["n_labels"]
task_metrics = task_meta["metric_funcs"]

checkpoint = "output_model/modernbert-diffusion-1b"  # "answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"

### Data

We'll use the `Datasets` library to load the data.  As its always recommended to "look at your data" before we get training, we'll also print out a single example to see what we're working with as well as the features of the dataset.

In [8]:
raw_datasets = load_dataset("glue", task)

print(f"{raw_datasets}\n")
print(f"{raw_datasets[train_ds_name][0]}\n")
print(f"{raw_datasets[train_ds_name].features}\n")

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}

{'sentence1': Value('string'), 'sentence2': Value('string'), 'label': ClassLabel(names=['not_equivalent', 'equivalent']), 'idx': Value('int32')}



We can use the following dictionaries when building our model with `AutoModelForSequenceClassification` to map between the label ids and names.

In [4]:
def get_label_maps(raw_datasets, train_ds_name):
    labels = raw_datasets[train_ds_name].features["label"]

    id2label = {idx: name.upper() for idx, name in enumerate(labels.names)} if hasattr(labels, "names") else None
    label2id = {name.upper(): idx for idx, name in enumerate(labels.names)} if hasattr(labels, "names") else None

    return id2label, label2id

In [10]:
id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

print(f"{id2label}")
print(f"{label2id}")


{0: 'NOT_EQUIVALENT', 1: 'EQUIVALENT'}
{'NOT_EQUIVALENT': 0, 'EQUIVALENT': 1}


MRPC is a sentence-pair classification task where we're given two sentences and asked to predict whether they are paraphrases of one another.  The dataset is split into train, validation and test sets. We'll need to keep all this in mind when we set up tokenization next with `AutoTokenizer`.

### Tokenizer

Next we define our Tokenizer and a preprocess function to create the input_ids, attention_mask, and token_type_ids the model nees to train.  For this example, including `truncation=True` is enough as we'll rely on our data collation function below to put our batches into the correct shape.

In [11]:
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [12]:
task_inputs

['sentence1', 'sentence2']

In [5]:
def preprocess_function(examples, task_inputs):
    inps = [examples[inp] for inp in task_inputs]
    tokenized = hf_tokenizer(*inps, truncation=True)
    return tokenized

In [14]:
tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

print(f"{tokenized_datasets}\n")
print(f"{tokenized_datasets[train_ds_name][0]}\n")
print(f"{tokenized_datasets[train_ds_name].features}\n")

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map: 100%|██████████| 408/408 [00:00<00:00, 12356.59 examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 1725
    })
})

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [50281, 8096, 287, 9877, 10145, 521, 4929, 1157, 5207, 344, 1925, 346, 253, 5517, 346, 1157, 273, 21547, 940, 12655, 521, 1941, 964, 50282, 7676, 24247, 281, 779, 347, 760, 346, 253, 5517, 346, 1157, 3052, 287, 9877, 10145, 521, 4929, 273, 21547, 940, 12655, 521,




It's always good to see what the tokenizer is doing to our data to ensure the special tokens are where we expect them to be!

In [15]:
hf_tokenizer.decode(tokenized_datasets[train_ds_name][0]["input_ids"])

'[CLS]Amrozi accused his brother, whom he called " the witness ", of deliberately distorting his evidence.[SEP]Referring to him as only " the witness ", Amrozi accused his brother of deliberately distorting his evidence.[SEP]'

### Metrics

We'll use our `task_metrics` to compute the metrics for our model.  We'll return a dictionary of the metric name and value for each metric we're interested in.

In [6]:
def compute_metrics(eval_pred, task_metrics):
    predictions, labels = eval_pred

    metrics_d = {}
    for metric_func in task_metrics:
        metric_name = metric_func.__name__
        if metric_name in ["pearsonr", "spearmanr"]:
            score = metric_func(labels, np.squeeze(predictions))
        else:
            score = metric_func(np.argmax(predictions, axis=-1), labels)

        if isinstance(score, tuple):
            metrics_d[metric_func.__name__] = score[0]
        else:
            metrics_d[metric_func.__name__] = score

    return metrics_d

### Train

This is where the fun begins! Here we setup a few hyperparameters than have proven to work well for us in fine-tuning ModernBERT-base on GLUE tasks.  We'll also setup our model, data collator, and training arguments.

In [17]:
train_bsz, val_bsz = 32, 32
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6

When configuring `AutoModelForSequenceClassification`, two settings are critical to get things working with the HuggingFace `Trainer`. One is the `num_labels` we're expecting and the other is to set `compile=False` to avoid using the `torch.compile` function which is not supported in Transformers at the time of this writing.

In [18]:
hf_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=n_labels, id2label=id2label, label2id=label2id
)


Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertForSequenceClassification is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)`
Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)`
Some weights of ModernBertForSequenceClassification were not initialized from the model ch

Collation is easy for GLUE tasks as we can use the `DataCollatorWithPadding` class to pad our input_ids and attention_mask to the max length in the batch.

**Note**: If you have installed Flash Attention, ModernBERT removes the padding internally, which makes it the fastest version. SPDA and Eager mode will be slower.

In [19]:
hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

With all the pieces in place, we can now setup our `TrainingArguments` and `Trainer` and get to training! Lots of customization is possible here and it is recommended to play with different schedulers and the hyperparameters we've started y'all off with above to improve results.

In [20]:
training_args = TrainingArguments(
    output_dir=f"aai_ModernBERT_{task}_ft",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
)

We define `TrainerCallback` so that we can capture all the training and evaluation logs and store them for later analysis. By default, the `Trainer` class will only keep the latest logs.


In [7]:
class MetricsCallback(TrainerCallback):
    def __init__(self):
        self.training_history = {"train": [], "eval": []}

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            if "loss" in logs:  # Training logs
                self.training_history["train"].append(logs)
            elif "eval_loss" in logs:  # Evaluation logs
                self.training_history["eval"].append(logs)

In [22]:
trainer = Trainer(
    model=hf_model,
    args=training_args,
    train_dataset=tokenized_datasets[train_ds_name],
    eval_dataset=tokenized_datasets[valid_ds_name],
    processing_class=hf_tokenizer,
    data_collator=hf_data_collator,
    compute_metrics=partial(compute_metrics, task_metrics=task_metrics),
)

metrics_callback = MetricsCallback()
trainer.add_callback(metrics_callback)

trainer.train()

train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
train_history_df = train_history_df.add_prefix("train_")
eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

args_df = pd.DataFrame([training_args.to_dict()])

display(train_res_df)
display(args_df)

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,0.7283,0.599436,0.718137,0.821151
2,0.5146,0.588149,0.715686,0.813505


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score,eval_f1_score
0,0.7283,4.591911,4.034783e-05,1.0,0.599436,0.718137,0.821151
1,0.5146,3.610186,3.478261e-07,2.0,0.588149,0.715686,0.813505


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mrpc_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


### Inference

There'a number of options for inference within the HuggingFace ecosystem.  We'll go a bit old school here and just use the `forward` method of the model. We're not uploading this model to the hub, but this is an easy enough task for you to try out on your own should you like to share your ModernBERT finetune :).

In [23]:
ex_1 = "The quick brown fox jumps over the lazy dog."
ex_2 = "I love lamp!"

inf_inputs = hf_tokenizer(ex_1, ex_2, return_tensors="pt")
inf_inputs = inf_inputs.to("cuda")

with torch.no_grad():
    logits = hf_model(**inf_inputs).logits

print(logits)
print(f"Prediction: {hf_model.config.id2label[logits.argmax().item()]}")


tensor([[-1.1406, -1.3672]], device='cuda:0')
Prediction: NOT_EQUIVALENT


### Cleanup

In [8]:
def cleanup(things_to_delete: list | None = None):
    if things_to_delete is not None:
        for thing in things_to_delete:
            if thing is not None:
                del thing

    gc.collect()
    torch.cuda.empty_cache()


In [25]:
cleanup(things_to_delete=[hf_model, trainer])

## Train all the GLUE!

If you got this far you're probably wondering why I put together that dictionary of GLUE tasks if all we're doing is finetuning a single model. The answer is basically that I'm a good and lazy programmer who would like to easily run hyperparameter sweeps and/or fine-tunes on all the GLUE tasks. So ... let's do that!

We'll run with the training hyperparameters specified above and I leave it to the reader to improve the method below to be able to override these values should folks be looking for something to do :)

In [26]:


def finetune_glue_task(
    task: str, checkpoint: str = "answerdotai/ModernBERT-base", train_subset: int | None = None, do_cleanup: bool = True
):  # 1. Load the task metadata
    task_meta = glue_tasks[task]
    train_ds_name = task_meta["dataset_names"]["train"]
    valid_ds_name = task_meta["dataset_names"]["valid"]

    task_inputs = task_meta["inputs"]
    n_labels = task_meta["n_labels"]
    task_metrics = task_meta["metric_funcs"]

    # 2. Load the dataset
    raw_datasets = load_dataset("glue", task.split("-")[0] if "-" in task else task)
    if train_subset is not None and len(raw_datasets["train"]) > train_subset:
        raw_datasets["train"] = raw_datasets["train"].shuffle(seed=42).select(range(train_subset))

    id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

    # 3. Load the tokenizer
    hf_tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
    tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

    # 4. Define the compute metrics function
    task_compute_metrics = partial(compute_metrics, task_metrics=task_metrics)

    # 5. Load the model and data collator
    model_additional_kwargs = {"id2label": id2label, "label2id": label2id} if id2label and label2id else {}
    hf_model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=n_labels, **model_additional_kwargs
    )

    hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

    # 6. Define the training arguments and trainer
    training_args = TrainingArguments(
        output_dir=f"aai_ModernBERT_{task}_ft",
        learning_rate=lr,
        per_device_train_batch_size=train_bsz,
        per_device_eval_batch_size=val_bsz,
        num_train_epochs=n_epochs,
        lr_scheduler_type="linear",
        optim="adamw_torch",
        adam_beta1=betas[0],
        adam_beta2=betas[1],
        adam_epsilon=eps,
        logging_strategy="epoch",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        bf16=True,
        bf16_full_eval=True,
        push_to_hub=False,
    )

    trainer = Trainer(
        model=hf_model,
        args=training_args,
        train_dataset=tokenized_datasets[train_ds_name],
        eval_dataset=tokenized_datasets[valid_ds_name],
        processing_class=hf_tokenizer,
        data_collator=hf_data_collator,
        compute_metrics=task_compute_metrics,
    )

    # Add callback to trainer
    metrics_callback = MetricsCallback()
    trainer.add_callback(metrics_callback)

    trainer.train()

    # 7. Get the training results and hyperparameters
    train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
    train_history_df = train_history_df.add_prefix("train_")
    eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
    train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

    args_df = pd.DataFrame([training_args.to_dict()])

    # 8. Cleanup (optional)
    if do_cleanup:
        cleanup(things_to_delete=[trainer, hf_model, hf_tokenizer, tokenized_datasets, raw_datasets])

    return train_res_df, args_df, hf_model, hf_tokenizer

This helpful function encapsulates all the steps we've been through above and allows us to easily run a fine-tune on a single task. In addition to the HuggingFace objects, it returns the training results, training hyperparameters (all potentially helpful for performing sweeps and or documenting your results).

Let's give it a go on both MRPC and CoLA.


In [24]:
train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
    "mrpc", checkpoint=checkpoint, do_cleanup=True
)

display(train_res_df)
display(args_df)


Map: 100%|██████████| 3668/3668 [00:00<00:00, 13951.57 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 13413.67 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 14059.15 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,0.7243,0.569719,0.730392,0.832827
2,0.5081,0.557317,0.740196,0.833856


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score,eval_f1_score
0,0.7243,3.341632,4.034783e-05,1.0,0.569719,0.730392,0.832827
1,0.5081,4.132639,3.478261e-07,2.0,0.557317,0.740196,0.833856


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mrpc_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


In [25]:
train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
    "cola", checkpoint=checkpoint, do_cleanup=True
)

display(train_res_df)
display(args_df)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Matthews Corrcoef,Accuracy Score
1,0.6401,0.599142,0.098772,0.690316
2,0.5104,0.607198,0.123063,0.689358


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_matthews_corrcoef,eval_accuracy_score
0,0.6401,3.88066,4.014925e-05,1.0,0.599142,0.098772,0.690316
1,0.5104,4.748318,1.492537e-07,2.0,0.607198,0.123063,0.689358


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_cola_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


**Send it!**

Grab yourself a good cup of coffee, take your pups out for a walk, or whatever as your GPU purrs along while finetuning all things GLUE!

Note the `train_subset` parameter which allows us to train on a subset of the dataset. This is helpful for quickly testing the model on a small dataset to make sure all the bits work as expected.  Feel free to set it to `None` for a full send!

In [26]:
for task in glue_tasks.keys():
    print(f"----- Finetuning {task} -----")
    train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
        task, checkpoint=checkpoint, train_subset=None, do_cleanup=True
    )

    print(":: Results ::")
    display(train_res_df)
    display(args_df)


----- Finetuning cola -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Matthews Corrcoef,Accuracy Score
1,0.6402,0.59916,0.109335,0.692234
2,0.5105,0.606568,0.120251,0.688399


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_matthews_corrcoef,eval_accuracy_score
0,0.6402,3.848574,4.014925e-05,1.0,0.59916,0.109335,0.692234
1,0.5105,4.70933,1.492537e-07,2.0,0.606568,0.120251,0.688399


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_cola_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning sst2 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.3691,0.40196,0.832569
2,0.1786,0.45999,0.833716


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.3691,2.681734,4.0019e-05,1.0,0.40196,0.832569
1,0.1786,2.931861,1.900238e-08,2.0,0.45999,0.833716


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_sst2_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mrpc -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,0.7242,0.569723,0.727941,0.83105
2,0.5081,0.557451,0.740196,0.833856


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score,eval_f1_score
0,0.7242,3.333426,4.034783e-05,1.0,0.569723,0.727941,0.83105
1,0.5081,4.196676,3.478261e-07,2.0,0.557451,0.740196,0.833856


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mrpc_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr
1,2.7346,1.888843,0.504394,0.501757
2,1.3771,1.676585,0.529895,0.523065


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr
0,2.7346,35.054741,4.022222e-05,1.0,1.888843,0.504394,0.501757
1,1.3771,17.149046,2.222222e-07,2.0,1.676585,0.529895,0.523065


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.3961,0.344694,0.789399,0.842172
2,0.2687,0.331557,0.803835,0.854242


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.3961,7.801034,4.000352e-05,1.0,0.344694,0.789399,0.842172
1,0.2687,9.245846,3.517721e-09,2.0,0.331557,0.803835,0.854242


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.8384,0.745386,0.67458
2,0.6661,0.697327,0.703821


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.8384,3.123421,4.000326e-05,1.0,0.745386,0.67458
1,0.6661,3.934601,3.259452e-09,2.0,0.697327,0.703821


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.8384,0.722191,0.688059
2,0.6658,0.675425,0.713792


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.8384,3.106269,4.000326e-05,1.0,0.722191,0.688059
1,0.6658,4.120486,3.259452e-09,2.0,0.675425,0.713792


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.5913,0.511567,0.749039
2,0.4674,0.499597,0.753066


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.5913,7.092668,4.001222e-05,1.0,0.511567,0.749039
1,0.4674,5.704651,1.221747e-08,2.0,0.499597,0.753066


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.8394,0.773243,0.516245
2,0.6313,0.729293,0.505415


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.8394,16.39027,4.051282e-05,1.0,0.773243,0.516245
1,0.6313,8.232226,5.128205e-07,2.0,0.729293,0.505415


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning wnli -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.0604,0.911257,0.478873
2,0.7852,0.921793,0.394366


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.0604,14.32718,4.2e-05,1.0,0.911257,0.478873
1,0.7852,6.899503,2e-06,2.0,0.921793,0.394366


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_wnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


## Conclusion

With ModernBERT encoders are back baby!  We've seen that ModernBERT-base can compete with the best of them on GLUE tasks and with a little more tuning, we'll see that ModernBERT-large can do even better.  I'm excited to see what the community will do with this model and I'm looking forward to seeing what you all build with it! We'll be exploring more of the capabilities of ModernBERT in future tutorials.

Until next time, happy coding!


# Test All models

In [9]:
train_bsz, val_bsz = 32, 32
betas = (0.9, 0.98)
n_epochs = 10
eps = 1e-6

def finetune_glue_task(
    lr, wd, task: str, checkpoint: str = "answerdotai/ModernBERT-base", train_subset: int | None = None, do_cleanup: bool = True
):  # 1. Load the task metadata
    task_meta = glue_tasks[task]
    train_ds_name = task_meta["dataset_names"]["train"]
    valid_ds_name = task_meta["dataset_names"]["valid"]

    task_inputs = task_meta["inputs"]
    n_labels = task_meta["n_labels"]
    task_metrics = task_meta["metric_funcs"]

    # 2. Load the dataset
    raw_datasets = load_dataset("glue", task.split("-")[0] if "-" in task else task)
    if train_subset is not None and len(raw_datasets["train"]) > train_subset:
        raw_datasets["train"] = raw_datasets["train"].shuffle(seed=42).select(range(train_subset))

    id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

    # 3. Load the tokenizer
    hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

    # 4. Define the compute metrics function
    task_compute_metrics = partial(compute_metrics, task_metrics=task_metrics)

    # 5. Load the model and data collator
    model_additional_kwargs = {"id2label": id2label, "label2id": label2id} if id2label and label2id else {}
    hf_model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=n_labels, **model_additional_kwargs
    )

    hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

    # 6. Define the training arguments and trainer
    training_args = TrainingArguments(
        output_dir=f"aai_ModernBERT_{task}_ft",
        learning_rate=lr,
        per_device_train_batch_size=train_bsz,
        per_device_eval_batch_size=val_bsz,
        num_train_epochs=n_epochs,
        lr_scheduler_type="linear",
        optim="adamw_torch",
        adam_beta1=betas[0],
        adam_beta2=betas[1],
        adam_epsilon=eps,
        weight_decay=wd,
        logging_strategy="epoch",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        bf16=True,
        bf16_full_eval=True,
        push_to_hub=False,
    )

    trainer = Trainer(
        model=hf_model,
        args=training_args,
        train_dataset=tokenized_datasets[train_ds_name],
        eval_dataset=tokenized_datasets[valid_ds_name],
        processing_class=hf_tokenizer,
        data_collator=hf_data_collator,
        compute_metrics=task_compute_metrics,
    )

    # Add callback to trainer
    metrics_callback = MetricsCallback()
    trainer.add_callback(metrics_callback)

    trainer.train()

    # 7. Get the training results and hyperparameters
    train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
    train_history_df = train_history_df.add_prefix("train_")
    eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
    train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

    args_df = pd.DataFrame([training_args.to_dict()])

    # 8. Cleanup (optional)
    if do_cleanup:
        cleanup(things_to_delete=[trainer, hf_model, hf_tokenizer, tokenized_datasets, raw_datasets])

    return train_res_df, args_df, hf_model, hf_tokenizer

In [11]:
checkpoint = "output_model/modernbert-kan-1.5b"  # "answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"
learning_rates = [1e-5, 3e-5, 5e-5, 8e-5]
weight_decays = [1e-6, 5e-6, 8e-6, 1e-5]

hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
for task in glue_tasks.keys():
    for lr in learning_rates:
        for wd in weight_decays:
            print(f"----- Finetuning {task} | lr={lr} | wd={wd} -----")
            train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
                lr, wd, task, checkpoint=checkpoint, train_subset=None, do_cleanup=True
            )

            print(":: Results ::")
            display(train_res_df)
            display(args_df)

----- Finetuning cola | lr=1e-05 | wd=1e-06 -----


Map: 100%|██████████| 1063/1063 [00:00<00:00, 18573.02 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model co

Epoch,Training Loss,Validation Loss,Matthews Corrcoef,Accuracy Score
1,0.6786,0.634761,-0.018741,0.657718
2,0.6203,0.619031,0.023867,0.688399
3,0.6054,0.616659,0.025127,0.681687
4,0.6005,0.62085,0.046356,0.692234


KeyboardInterrupt: 

# K-Folds

In [24]:
from sklearn.model_selection import StratifiedKFold, KFold
from transformers import set_seed
import warnings
warnings.filterwarnings("ignore")

def kfold_glue_task(
    task: str, 
    checkpoint: str = "answerdotai/ModernBERT-base", 
    k_folds: int = 5,
    lr: float = 8e-5,
    wd: float = 8e-6,
    n_epochs: int = 3,
    train_subset: int | None = None,
    random_seed: int = 42
):
    """
    Perform k-fold cross-validation for a GLUE task
    """
    set_seed(random_seed)
    
    # 1. Load the task metadata
    task_meta = glue_tasks[task]
    train_ds_name = task_meta["dataset_names"]["train"]
    task_inputs = task_meta["inputs"]
    n_labels = task_meta["n_labels"]
    task_metrics = task_meta["metric_funcs"]
    
    # 2. Load the dataset
    raw_datasets = load_dataset("glue", task.split("-")[0] if "-" in task else task)
    train_dataset = raw_datasets[train_ds_name]
    
    # Get label maps from original dataset BEFORE any splitting
    id2label, label2id = get_label_maps(raw_datasets, train_ds_name)
    
    if train_subset is not None and len(train_dataset) > train_subset:
        train_dataset = train_dataset.shuffle(seed=random_seed).select(range(train_subset))
    
    # 3. Convert to pandas for easier splitting
    train_df = train_dataset.to_pandas()
    
    # 4. Set up k-fold split (stratified for classification, regular for regression)
    if task == "stsb":  # regression task
        kfold = KFold(n_splits=k_folds, shuffle=True, random_state=random_seed)
        splits = list(kfold.split(train_df))
    else:  # classification tasks
        kfold = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=random_seed)
        splits = list(kfold.split(train_df, train_df['label']))
    
    # 5. Initialize results storage
    fold_results = []
    
    print(f"Starting {k_folds}-fold CV for {task}")
    
    # 6. Perform k-fold training
    for fold, (train_idx, val_idx) in enumerate(splits):
        print(f"\n--- Fold {fold + 1}/{k_folds} ---")
        
        # Create fold datasets
        fold_train_df = train_df.iloc[train_idx].reset_index(drop=True)
        fold_val_df = train_df.iloc[val_idx].reset_index(drop=True)
        
        # Convert back to HF datasets and preserve original features
        from datasets import Dataset
        fold_train_dataset = Dataset.from_pandas(fold_train_df)
        fold_val_dataset = Dataset.from_pandas(fold_val_df)
        
        # Preserve original features structure
        if hasattr(train_dataset.features['label'], 'names'):
            from datasets.features import ClassLabel
            label_feature = ClassLabel(names=train_dataset.features['label'].names)
            fold_train_dataset = fold_train_dataset.cast_column('label', label_feature)
            fold_val_dataset = fold_val_dataset.cast_column('label', label_feature)
        
        # Tokenize datasets
        hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        fold_train_tokenized = fold_train_dataset.map(
            partial(preprocess_function, task_inputs=task_inputs), batched=True
        )
        fold_val_tokenized = fold_val_dataset.map(
            partial(preprocess_function, task_inputs=task_inputs), batched=True
        )
        
        # Load model with consistent label mapping
        model_additional_kwargs = {"id2label": id2label, "label2id": label2id} if id2label and label2id else {}
        hf_model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=n_labels, **model_additional_kwargs
        )
        
        # Data collator
        hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)
        
        # Training arguments
        training_args = TrainingArguments(
            output_dir=f"kfold_{task}_fold_{fold}",
            learning_rate=lr,
            per_device_train_batch_size=32,
            per_device_eval_batch_size=32,
            num_train_epochs=n_epochs,
            lr_scheduler_type="linear",
            optim="adamw_torch",
            adam_beta1=0.9,
            adam_beta2=0.98,
            adam_epsilon=1e-6,
            weight_decay=wd,
            logging_strategy="epoch",
            eval_strategy="epoch",
            save_strategy="no",
            load_best_model_at_end=False,
            bf16=True,
            bf16_full_eval=True,
            push_to_hub=False,
            report_to=None,
            seed=random_seed,
        )
        
        # Trainer
        trainer = Trainer(
            model=hf_model,
            args=training_args,
            train_dataset=fold_train_tokenized,
            eval_dataset=fold_val_tokenized,
            processing_class=hf_tokenizer,
            data_collator=hf_data_collator,
            compute_metrics=partial(compute_metrics, task_metrics=task_metrics),
        )
        
        # Train
        train_result = trainer.train()
        
        # Evaluate
        eval_result = trainer.evaluate()
        
        # Store results
        fold_result = {
            'fold': fold + 1,
            'train_loss': train_result.training_loss,
            'eval_loss': eval_result['eval_loss'],
        }
        
        # Add task-specific metrics
        for metric_name in [m.__name__ for m in task_metrics]:
            if f'eval_{metric_name}' in eval_result:
                fold_result[f'eval_{metric_name}'] = eval_result[f'eval_{metric_name}']
        
        fold_results.append(fold_result)
        
        print(f"Fold {fold + 1} results: {fold_result}")
        
        # Cleanup
        cleanup(things_to_delete=[trainer, hf_model, hf_tokenizer])
    
    
    # 7. Aggregate results
    results_df = pd.DataFrame(fold_results)
    
    # Calculate mean and std for each metric
    summary_stats = {}
    for col in results_df.columns:
        if col != 'fold':
            summary_stats[f'{col}_mean'] = results_df[col].mean()
            summary_stats[f'{col}_std'] = results_df[col].std()
    
    summary_df = pd.DataFrame([summary_stats])
    
    print(f"\n=== {task} K-Fold Results Summary ===")
    display(results_df)
    print("\nMean ± Std:")
    display(summary_df)
    
    return results_df, summary_df

def run_kfold_all_glue(
    checkpoint: str = "answerdotai/ModernBERT-base",
    k_folds: int = 5,
    lr: float = 8e-5,
    wd: float = 8e-6,
    train_subset: int | None = None
):
    """
    Run k-fold CV on all GLUE tasks
    """
    all_results = {}
    all_summaries = {}
    
    for task in glue_tasks.keys():
        print(f"\n{'='*50}")
        print(f"Running K-Fold CV for {task.upper()}")
        print(f"{'='*50}")
        
        try:
            results_df, summary_df = kfold_glue_task(
                task=task,
                checkpoint=checkpoint,
                k_folds=k_folds,
                lr=lr,
                wd=wd,
                train_subset=train_subset
            )
            
            all_results[task] = results_df
            all_summaries[task] = summary_df
            
        except Exception as e:
            print(f"Error with task {task}: {e}")
            continue
    
    return all_results, all_summaries

# Example usage:
# Run k-fold on a single task
results_df, summary_df = kfold_glue_task(
    task="cola",
    checkpoint=checkpoint,
    k_folds=10,
    lr=5e-5,
    wd=8e-6,
    train_subset=None  # Use subset for testing
)

# Run k-fold on all GLUE tasks
# all_results, all_summaries = run_kfold_all_glue(
#     checkpoint="output_model/modernbert-mask-1b",
#     k_folds=5,
#     lr=5e-5,
#     wd=8e-6,
#     train_subset=None  # Use full datasets
# )

AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
