# Getting Started with ModernBERT & GLUE

Created by: [Wayde Gilliam](https://twitter.com/waydegilliam)

## Encoders Strike Back!

Like many, I have fond memories of finetuning deberta, roberta and bert models for a number of Kaggle comps and real-world problems (e.g., NER, sentiment analysis, etc.).  Encoder models were "the thing" back in the day and continue to be the primary workhorse for many ML pipelines today though they have been eclipsed by recent advancements in LLMs which typically are based on decoder-only architectures. Long have we awaited a return to an encoder model for the modern world. With ModernBERT, that wait is over! ModernBERT is a new encoder-only model that incorporates the latest features in making neural networks more efficient, faster, and better at handling tasks that encoder models have long excelled at such at text classification.  In addition, ModernBERT allows us to break out of that max 512 token limit with their long context capabilities which give us 8,192 tokens to play with.

In this tutorial, we'll go through the steps of fine-tuning ModernBERT for one of the GLUE tasks, MRPC.  We'll cover some key settings required to use it with the HuggingFace trainer and include with some recommended hyperparameters that have served us well in fine-tuning ModernBERT for GLUE.  We'll also see how to use the model for inference and cleanup the model from the GPU to free up resources.

As an aside, I'm running all this code on a single 3090 with plenty of GPU memory to spare.

Though not strictly necessary, **ModernBERT trains better with FlashAttention!**. Training and inference will be much faster with it installed. See below:

ModernBERT is built on top of FlashAttention which is a highly optimized implementation of the attention mechanism that is faster and more memory efficient than the standard implementation.  ***The beauty of this is all you need to do is install it for ModernBERT to work with it!***  Here's how ...

For NVIDIA GPUs with compute capability 8.0+ (Ampere/Ada/Hopper architecture - A100, A6000, RTX 3090, RTX 4090, H100 etc):
```python
pip install flash-attn --no-build-isolation
```

For older NVIDIA GPUs (pre-Ampere):
```python
pip install flash-attn --no-deps
```


In [None]:
#! pip install setuptools transformers datasets accelerate scikit-learn -Uqq
# install setuptools and do this before installing flash-attn
# pip install flash-attn --no-build-isolation


In [2]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"


In [3]:
import numpy as np
import pandas as pd
import torch
from functools import partial
import gc

from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    TrainerCallback,
)

from sklearn.metrics import matthews_corrcoef, accuracy_score, f1_score
from scipy.stats import pearsonr, spearmanr

os.environ["TOKENIZERS_PARALLELISM"] = "false"


  from .autonotebook import tqdm as notebook_tqdm


## What is GLUE?

The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine diverse natural language understanding tasks designed to evaluate and compare the performance of NLP models across various language comprehension challenges. By providing a standardized framework, GLUE facilitates the development of models that generalize well across multiple tasks, promoting advancements in creating robust and versatile language understanding systems. 

Let's put this all these tasks in a dictionary along with some other helpful metadata about each one that might prove useful to iteratting over all of them.



In [29]:
# Diffusion 1.5B
glue_tasks = {
    "cola": {
        "abbr": "CoLA",
        "name": "Corpus of Linguistic Acceptability",
        "description": "Predict whether a sequence is a grammatical English sentence",
        "task_type": "Single-Sentence Task",
        "domain": "Misc.",
        "size": "8.5k",
        "metrics": "Matthews corr.",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence"],
        "target": "label",
        "metric_funcs": [matthews_corrcoef, accuracy_score],
        "n_labels": 2,
        "learning_rate": 5e-5,
        "weight_decay": 8e-6,
        "epochs": 2
    },
    "sst2": {
        "abbr": "SST-2",
        "name": "Stanford Sentiment Treebank",
        "description": "Predict the sentiment of a given sentence",
        "task_type": "Single-Sentence Task",
        "domain": "Movie reviews",
        "size": "67k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-5,
        "epochs": 2
    },
    "mrpc": {
        "abbr": "MRPC",
        "name": "Microsoft Research Paraphrase Corpus",
        "description": "Predict whether two sentences are semantically equivalent",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "News",
        "size": "3.7k",
        "metrics": "F1/Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score, f1_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-6,
        "epochs": 2
    },
    "stsb": {
        "abbr": "SST-B",
        "name": "Semantic Textual Similarity Benchmark",
        "description": "Predict the similarity score for two sentences on a scale from 1 to 5",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Misc.",
        "size": "7k",
        "metrics": "Pearson/Spearman corr.",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [pearsonr, spearmanr, accuracy_score],
        "n_labels": 1,
        "learning_rate": 8e-5,
        "weight_decay": 1e-6,
        "epochs": 7
    },
    "qqp": {
        "abbr": "QQP",
        "name": "Quora question pair",
        "description": "Predict if two questions are a paraphrase of one another",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Social QA questions",
        "size": "364k",
        "metrics": "F1/Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question1", "question2"],
        "target": "label",
        "metric_funcs": [f1_score, accuracy_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-5,
        "epochs": 10
    },
    "mnli-matched": {
        "abbr": "MNLI",
        "name": "Mulit-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_matched", "test": "test_matched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
        "learning_rate": 8e-5,
        "weight_decay": 8e-6,
        "epochs": 2
    },
    "mnli-mismatched": {
        "abbr": "MNLI",
        "name": "Mulit-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_mismatched", "test": "test_mismatched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
        "learning_rate": 8e-5,
        "weight_decay": 1e-5,
        "epochs": 2
    },
    "qnli": {
        "abbr": "QNLI",
        "name": "Stanford Question Answering Dataset",
        "description": "Predict whether the context sentence contains the answer to the question",
        "task_type": "Inference Tasks",
        "domain": "Wikipedia",
        "size": "105k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question", "sentence"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-5,
        "epochs": 3
    },
    "rte": {
        "abbr": "RTE",
        "name": "Recognize Textual Entailment",
        "description": "Predict whether one sentece entails another",
        "task_type": "Inference Tasks",
        "domain": "News, Wikipedia",
        "size": "2.5k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-6,
        "epochs": 10
    },
}

# for v in glue_tasks.values(): print(v)
glue_tasks.values()

glue_df = pd.DataFrame(glue_tasks.values(), columns=["abbr", "name", "task_type", "description", "size", "metrics"])
glue_df.columns = glue_df.columns.str.replace("_", " ").str.capitalize()
display(glue_df.style.set_properties(**{"text-align": "left"}))

Unnamed: 0,Abbr,Name,Task type,Description,Size,Metrics
0,CoLA,Corpus of Linguistic Acceptability,Single-Sentence Task,Predict whether a sequence is a grammatical English sentence,8.5k,Matthews corr.
1,SST-2,Stanford Sentiment Treebank,Single-Sentence Task,Predict the sentiment of a given sentence,67k,Accuracy
2,MRPC,Microsoft Research Paraphrase Corpus,Similarity and Paraphrase Tasks,Predict whether two sentences are semantically equivalent,3.7k,F1/Accuracy
3,SST-B,Semantic Textual Similarity Benchmark,Similarity and Paraphrase Tasks,Predict the similarity score for two sentences on a scale from 1 to 5,7k,Pearson/Spearman corr.
4,QQP,Quora question pair,Similarity and Paraphrase Tasks,Predict if two questions are a paraphrase of one another,364k,F1/Accuracy
5,MNLI,Mulit-Genre Natural Language Inference,Inference Tasks,"Predict whether the premise entails, contradicts or is neutral to the hypothesis",393k,Accuracy
6,MNLI,Mulit-Genre Natural Language Inference,Inference Tasks,"Predict whether the premise entails, contradicts or is neutral to the hypothesis",393k,Accuracy
7,QNLI,Stanford Question Answering Dataset,Inference Tasks,Predict whether the context sentence contains the answer to the question,105k,Accuracy
8,RTE,Recognize Textual Entailment,Inference Tasks,Predict whether one sentece entails another,2.5k,Accuracy


In [None]:
# Mask 1.5B
glue_tasks = {
    "cola": {
        "abbr": "CoLA",
        "name": "Corpus of Linguistic Acceptability",
        "description": "Predict whether a sequence is a grammatical English sentence",
        "task_type": "Single-Sentence Task",
        "domain": "Misc.",
        "size": "8.5k",
        "metrics": "Matthews corr.",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence"],
        "target": "label",
        "metric_funcs": [matthews_corrcoef, accuracy_score],
        "n_labels": 2,
        "learning_rate": 1e-5,
        "weight_decay": 1e-6,
        "epochs": 9
    },
    "sst2": {
        "abbr": "SST-2",
        "name": "Stanford Sentiment Treebank",
        "description": "Predict the sentiment of a given sentence",
        "task_type": "Single-Sentence Task",
        "domain": "Movie reviews",
        "size": "67k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
        "learning_rate": 5e-5,
        "weight_decay": 1e-5,
        "epochs": 6
    },
    "mrpc": {
        "abbr": "MRPC",
        "name": "Microsoft Research Paraphrase Corpus",
        "description": "Predict whether two sentences are semantically equivalent",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "News",
        "size": "3.7k",
        "metrics": "F1/Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score, f1_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-6,
        "epochs": 2
    },
    "stsb": {
        "abbr": "SST-B",
        "name": "Semantic Textual Similarity Benchmark",
        "description": "Predict the similarity score for two sentences on a scale from 1 to 5",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Misc.",
        "size": "7k",
        "metrics": "Pearson/Spearman corr.",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [pearsonr, spearmanr, accuracy_score],
        "n_labels": 1,
        "learning_rate": 8e-5,
        "weight_decay": 8e-6,
        "epochs": 9
    },
    "qqp": {
        "abbr": "QQP",
        "name": "Quora question pair",
        "description": "Predict if two questions are a paraphrase of one another",
        "task_type": "Similarity and Paraphrase Tasks",
        "domain": "Social QA questions",
        "size": "364k",
        "metrics": "F1/Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question1", "question2"],
        "target": "label",
        "metric_funcs": [f1_score, accuracy_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-5,
        "epochs": 10
    },
    "mnli-matched": {
        "abbr": "MNLI",
        "name": "Mulit-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_matched", "test": "test_matched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
        "learning_rate": 8e-5,
        "weight_decay": 1e-6,
        "epochs": 5
    },
    "mnli-mismatched": {
        "abbr": "MNLI",
        "name": "Mulit-Genre Natural Language Inference",
        "description": "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
        "task_type": "Inference Tasks",
        "domain": "Misc.",
        "size": "393k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation_mismatched", "test": "test_mismatched"},
        "inputs": ["premise", "hypothesis"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 3,
        "learning_rate": 8e-5,
        "weight_decay": 1e-6,
        "epochs": 4
    },
    "qnli": {
        "abbr": "QNLI",
        "name": "Stanford Question Answering Dataset",
        "description": "Predict whether the context sentence contains the answer to the question",
        "task_type": "Inference Tasks",
        "domain": "Wikipedia",
        "size": "105k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["question", "sentence"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
        "learning_rate": 5e-5,
        "weight_decay": 5e-6,
        "epochs": 4
    },
    "rte": {
        "abbr": "RTE",
        "name": "Recognize Textual Entailment",
        "description": "Predict whether one sentece entails another",
        "task_type": "Inference Tasks",
        "domain": "News, Wikipedia",
        "size": "2.5k",
        "metrics": "Accuracy",
        "dataset_names": {"train": "train", "valid": "validation", "test": "test"},
        "inputs": ["sentence1", "sentence2"],
        "target": "label",
        "metric_funcs": [accuracy_score],
        "n_labels": 2,
        "learning_rate": 8e-5,
        "weight_decay": 1e-6,
        "epochs": 4
    },
}

# for v in glue_tasks.values(): print(v)
glue_tasks.values()

glue_df = pd.DataFrame(glue_tasks.values(), columns=["abbr", "name", "task_type", "description", "size", "metrics"])
glue_df.columns = glue_df.columns.str.replace("_", " ").str.capitalize()
display(glue_df.style.set_properties(**{"text-align": "left"}))

Unnamed: 0,Abbr,Name,Task type,Description,Size,Metrics
0,CoLA,Corpus of Linguistic Acceptability,Single-Sentence Task,Predict whether a sequence is a grammatical English sentence,8.5k,Matthews corr.
1,SST-2,Stanford Sentiment Treebank,Single-Sentence Task,Predict the sentiment of a given sentence,67k,Accuracy
2,MRPC,Microsoft Research Paraphrase Corpus,Similarity and Paraphrase Tasks,Predict whether two sentences are semantically equivalent,3.7k,F1/Accuracy
3,SST-B,Semantic Textual Similarity Benchmark,Similarity and Paraphrase Tasks,Predict the similarity score for two sentences on a scale from 1 to 5,7k,Pearson/Spearman corr.
4,QQP,Quora question pair,Similarity and Paraphrase Tasks,Predict if two questions are a paraphrase of one another,364k,F1/Accuracy
5,MNLI,Mulit-Genre Natural Language Inference,Inference Tasks,"Predict whether the premise entails, contradicts or is neutral to the hypothesis",393k,Accuracy
6,MNLI,Mulit-Genre Natural Language Inference,Inference Tasks,"Predict whether the premise entails, contradicts or is neutral to the hypothesis",393k,Accuracy
7,QNLI,Stanford Question Answering Dataset,Inference Tasks,Predict whether the context sentence contains the answer to the question,105k,Accuracy
8,RTE,Recognize Textual Entailment,Inference Tasks,Predict whether one sentece entails another,2.5k,Accuracy


## Let's Fine-Tune ModernBERT for MRPC

### Configuration

ModernBERT currently comes in two flavors, base and large. To keep things lean and mean, we'll use the "answerdotai/ModernBERT-base" checkpoint for this example.

In [5]:
task = "mrpc"
task_meta = glue_tasks[task]
train_ds_name = task_meta["dataset_names"]["train"]
valid_ds_name = task_meta["dataset_names"]["valid"]
test_ds_name = task_meta["dataset_names"]["test"]

task_inputs = task_meta["inputs"]
task_target = task_meta["target"]
n_labels = task_meta["n_labels"]
task_metrics = task_meta["metric_funcs"]

checkpoint = "output_model/modernbert-diffusion-1b"  # "answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"

KeyError: 'mrpc'

### Data

We'll use the `Datasets` library to load the data.  As its always recommended to "look at your data" before we get training, we'll also print out a single example to see what we're working with as well as the features of the dataset.

In [5]:
raw_datasets = load_dataset("glue", task)

print(f"{raw_datasets}\n")
print(f"{raw_datasets[train_ds_name][0]}\n")
print(f"{raw_datasets[train_ds_name].features}\n")

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}

{'sentence1': Value('string'), 'sentence2': Value('string'), 'label': ClassLabel(names=['not_equivalent', 'equivalent']), 'idx': Value('int32')}



We can use the following dictionaries when building our model with `AutoModelForSequenceClassification` to map between the label ids and names.

In [5]:
def get_label_maps(raw_datasets, train_ds_name):
    labels = raw_datasets[train_ds_name].features["label"]

    id2label = {idx: name.upper() for idx, name in enumerate(labels.names)} if hasattr(labels, "names") else None
    label2id = {name.upper(): idx for idx, name in enumerate(labels.names)} if hasattr(labels, "names") else None

    return id2label, label2id

In [7]:
id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

print(f"{id2label}")
print(f"{label2id}")


{0: 'NOT_EQUIVALENT', 1: 'EQUIVALENT'}
{'NOT_EQUIVALENT': 0, 'EQUIVALENT': 1}


MRPC is a sentence-pair classification task where we're given two sentences and asked to predict whether they are paraphrases of one another.  The dataset is split into train, validation and test sets. We'll need to keep all this in mind when we set up tokenization next with `AutoTokenizer`.

### Tokenizer

Next we define our Tokenizer and a preprocess function to create the input_ids, attention_mask, and token_type_ids the model nees to train.  For this example, including `truncation=True` is enough as we'll rely on our data collation function below to put our batches into the correct shape.

In [8]:
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [9]:
task_inputs

['sentence1', 'sentence2']

In [6]:
def preprocess_function(examples, task_inputs):
    inps = [examples[inp] for inp in task_inputs]
    tokenized = hf_tokenizer(*inps, truncation=True)
    return tokenized

In [12]:
tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

print(f"{tokenized_datasets}\n")
print(f"{tokenized_datasets[train_ds_name][0]}\n")
print(f"{tokenized_datasets[train_ds_name].features}\n")

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 1725
    })
})

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [50281, 8096, 287, 9877, 10145, 521, 4929, 1157, 5207, 344, 1925, 346, 253, 5517, 346, 1157, 273, 21547, 940, 12655, 521, 1941, 964, 50282, 7676, 24247, 281, 779, 347, 760, 346, 253, 5517, 346, 1157, 3052, 287, 9877, 10145, 521, 4929, 273, 21547, 940, 12655, 521,

It's always good to see what the tokenizer is doing to our data to ensure the special tokens are where we expect them to be!

In [13]:
hf_tokenizer.decode(tokenized_datasets[train_ds_name][0]["input_ids"])

'[CLS]Amrozi accused his brother, whom he called " the witness ", of deliberately distorting his evidence.[SEP]Referring to him as only " the witness ", Amrozi accused his brother of deliberately distorting his evidence.[SEP]'

### Metrics

We'll use our `task_metrics` to compute the metrics for our model.  We'll return a dictionary of the metric name and value for each metric we're interested in.

In [19]:
def compute_metrics(eval_pred, task_metrics):
    predictions, labels = eval_pred

    metrics_d = {}
    for metric_func in task_metrics:
        metric_name = metric_func.__name__
        if metric_name in ["pearsonr", "spearmanr"]:
            score = metric_func(labels, np.squeeze(predictions))
        elif metric_name == "accuracy_score":
            # For STSB: threshold-based binary accuracy
            if len(predictions.shape) == 1 or predictions.shape[1] == 1:
                # Regression task (STSB) - convert to binary classification
                pred_binary = (np.squeeze(predictions) >= 3.0).astype(int)
                label_binary = (labels >= 3.0).astype(int)
                score = metric_func(label_binary, pred_binary)
            else:
                # Classification task - use argmax
                score = metric_func(np.argmax(predictions, axis=-1), labels)
        else:
            # Other classification metrics
            if len(predictions.shape) > 1 and predictions.shape[1] > 1:
                score = metric_func(np.argmax(predictions, axis=-1), labels)
            else:
                # For regression, most metrics don't apply
                continue

        if isinstance(score, tuple):
            metrics_d[metric_func.__name__] = score[0]
        else:
            metrics_d[metric_func.__name__] = score

    return metrics_d

### Train

This is where the fun begins! Here we setup a few hyperparameters than have proven to work well for us in fine-tuning ModernBERT-base on GLUE tasks.  We'll also setup our model, data collator, and training arguments.

In [15]:
train_bsz, val_bsz = 32, 32
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6

When configuring `AutoModelForSequenceClassification`, two settings are critical to get things working with the HuggingFace `Trainer`. One is the `num_labels` we're expecting and the other is to set `compile=False` to avoid using the `torch.compile` function which is not supported in Transformers at the time of this writing.

In [16]:
hf_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=n_labels, id2label=id2label, label2id=label2id
)


Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertForSequenceClassification is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)`
Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)`
Some weights of ModernBertForSequenceClassification were not initialized from the model ch

Collation is easy for GLUE tasks as we can use the `DataCollatorWithPadding` class to pad our input_ids and attention_mask to the max length in the batch.

**Note**: If you have installed Flash Attention, ModernBERT removes the padding internally, which makes it the fastest version. SPDA and Eager mode will be slower.

In [17]:
hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

With all the pieces in place, we can now setup our `TrainingArguments` and `Trainer` and get to training! Lots of customization is possible here and it is recommended to play with different schedulers and the hyperparameters we've started y'all off with above to improve results.

In [18]:
training_args = TrainingArguments(
    output_dir=f"aai_ModernBERT_{task}_ft",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
)

We define `TrainerCallback` so that we can capture all the training and evaluation logs and store them for later analysis. By default, the `Trainer` class will only keep the latest logs.


In [8]:
class MetricsCallback(TrainerCallback):
    def __init__(self):
        self.training_history = {"train": [], "eval": []}

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            if "loss" in logs:  # Training logs
                self.training_history["train"].append(logs)
            elif "eval_loss" in logs:  # Evaluation logs
                self.training_history["eval"].append(logs)

In [20]:
trainer = Trainer(
    model=hf_model,
    args=training_args,
    train_dataset=tokenized_datasets[train_ds_name],
    eval_dataset=tokenized_datasets[valid_ds_name],
    processing_class=hf_tokenizer,
    data_collator=hf_data_collator,
    compute_metrics=partial(compute_metrics, task_metrics=task_metrics),
)

metrics_callback = MetricsCallback()
trainer.add_callback(metrics_callback)

trainer.train()

train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
train_history_df = train_history_df.add_prefix("train_")
eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

args_df = pd.DataFrame([training_args.to_dict()])

display(train_res_df)
display(args_df)

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,0.7004,0.612666,0.705882,0.815385
2,0.5139,0.601638,0.705882,0.807692


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score,eval_f1_score
0,0.7004,3.449601,4.034783e-05,1.0,0.612666,0.705882,0.815385
1,0.5139,3.151908,3.478261e-07,2.0,0.601638,0.705882,0.807692


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mrpc_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


### Inference

There'a number of options for inference within the HuggingFace ecosystem.  We'll go a bit old school here and just use the `forward` method of the model. We're not uploading this model to the hub, but this is an easy enough task for you to try out on your own should you like to share your ModernBERT finetune :).

In [21]:
ex_1 = "The quick brown fox jumps over the lazy dog."
ex_2 = "I love lamp!"

inf_inputs = hf_tokenizer(ex_1, ex_2, return_tensors="pt")
inf_inputs = inf_inputs.to("cuda")

with torch.no_grad():
    logits = hf_model(**inf_inputs).logits

print(logits)
print(f"Prediction: {hf_model.config.id2label[logits.argmax().item()]}")


tensor([[2.8281, 2.3281]], device='cuda:0')
Prediction: NOT_EQUIVALENT


### Cleanup

In [9]:
def cleanup(things_to_delete: list | None = None):
    if things_to_delete is not None:
        for thing in things_to_delete:
            if thing is not None:
                del thing

    gc.collect()
    torch.cuda.empty_cache()


In [9]:
cleanup(things_to_delete=[hf_model, trainer])

NameError: name 'hf_model' is not defined

## Train all the GLUE!

If you got this far you're probably wondering why I put together that dictionary of GLUE tasks if all we're doing is finetuning a single model. The answer is basically that I'm a good and lazy programmer who would like to easily run hyperparameter sweeps and/or fine-tunes on all the GLUE tasks. So ... let's do that!

We'll run with the training hyperparameters specified above and I leave it to the reader to improve the method below to be able to override these values should folks be looking for something to do :)

In [13]:


def finetune_glue_task(
    task: str, checkpoint: str = "answerdotai/ModernBERT-base", train_subset: int | None = None, do_cleanup: bool = True
):  # 1. Load the task metadata
    task_meta = glue_tasks[task]
    train_ds_name = task_meta["dataset_names"]["train"]
    valid_ds_name = task_meta["dataset_names"]["valid"]

    task_inputs = task_meta["inputs"]
    n_labels = task_meta["n_labels"]
    task_metrics = task_meta["metric_funcs"]

    # 2. Load the dataset
    raw_datasets = load_dataset("glue", task.split("-")[0] if "-" in task else task)
    if train_subset is not None and len(raw_datasets["train"]) > train_subset:
        raw_datasets["train"] = raw_datasets["train"].shuffle(seed=42).select(range(train_subset))

    id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

    # 3. Load the tokenizer
    hf_tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
    tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

    # 4. Define the compute metrics function
    task_compute_metrics = partial(compute_metrics, task_metrics=task_metrics)

    # 5. Load the model and data collator
    model_additional_kwargs = {"id2label": id2label, "label2id": label2id} if id2label and label2id else {}
    hf_model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=n_labels, **model_additional_kwargs
    )

    hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

    # 6. Define the training arguments and trainer
    training_args = TrainingArguments(
        output_dir=f"aai_ModernBERT_{task}_ft",
        learning_rate=lr,
        per_device_train_batch_size=train_bsz,
        per_device_eval_batch_size=val_bsz,
        num_train_epochs=n_epochs,
        lr_scheduler_type="linear",
        optim="adamw_torch",
        adam_beta1=betas[0],
        adam_beta2=betas[1],
        adam_epsilon=eps,
        logging_strategy="epoch",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        bf16=True,
        bf16_full_eval=True,
        push_to_hub=False,
    )

    trainer = Trainer(
        model=hf_model,
        args=training_args,
        train_dataset=tokenized_datasets[train_ds_name],
        eval_dataset=tokenized_datasets[valid_ds_name],
        processing_class=hf_tokenizer,
        data_collator=hf_data_collator,
        compute_metrics=task_compute_metrics,
    )

    # Add callback to trainer
    metrics_callback = MetricsCallback()
    trainer.add_callback(metrics_callback)

    trainer.train()

    # 7. Get the training results and hyperparameters
    train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
    train_history_df = train_history_df.add_prefix("train_")
    eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
    train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

    args_df = pd.DataFrame([training_args.to_dict()])

    # 8. Cleanup (optional)
    if do_cleanup:
        cleanup(things_to_delete=[trainer, hf_model, hf_tokenizer, tokenized_datasets, raw_datasets])

    return train_res_df, args_df, hf_model, hf_tokenizer

This helpful function encapsulates all the steps we've been through above and allows us to easily run a fine-tune on a single task. In addition to the HuggingFace objects, it returns the training results, training hyperparameters (all potentially helpful for performing sweeps and or documenting your results).

Let's give it a go on both MRPC and CoLA.


In [24]:
train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
    "mrpc", checkpoint=checkpoint, do_cleanup=True
)

display(train_res_df)
display(args_df)


Map: 100%|██████████| 3668/3668 [00:00<00:00, 13951.57 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 13413.67 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 14059.15 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,0.7243,0.569719,0.730392,0.832827
2,0.5081,0.557317,0.740196,0.833856


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score,eval_f1_score
0,0.7243,3.341632,4.034783e-05,1.0,0.569719,0.730392,0.832827
1,0.5081,4.132639,3.478261e-07,2.0,0.557317,0.740196,0.833856


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mrpc_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


In [25]:
train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
    "cola", checkpoint=checkpoint, do_cleanup=True
)

display(train_res_df)
display(args_df)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Matthews Corrcoef,Accuracy Score
1,0.6401,0.599142,0.098772,0.690316
2,0.5104,0.607198,0.123063,0.689358


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_matthews_corrcoef,eval_accuracy_score
0,0.6401,3.88066,4.014925e-05,1.0,0.599142,0.098772,0.690316
1,0.5104,4.748318,1.492537e-07,2.0,0.607198,0.123063,0.689358


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_cola_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


**Send it!**

Grab yourself a good cup of coffee, take your pups out for a walk, or whatever as your GPU purrs along while finetuning all things GLUE!

Note the `train_subset` parameter which allows us to train on a subset of the dataset. This is helpful for quickly testing the model on a small dataset to make sure all the bits work as expected.  Feel free to set it to `None` for a full send!

In [26]:
for task in glue_tasks.keys():
    print(f"----- Finetuning {task} -----")
    train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
        task, checkpoint=checkpoint, train_subset=None, do_cleanup=True
    )

    print(":: Results ::")
    display(train_res_df)
    display(args_df)


----- Finetuning cola -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Matthews Corrcoef,Accuracy Score
1,0.6402,0.59916,0.109335,0.692234
2,0.5105,0.606568,0.120251,0.688399


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_matthews_corrcoef,eval_accuracy_score
0,0.6402,3.848574,4.014925e-05,1.0,0.59916,0.109335,0.692234
1,0.5105,4.70933,1.492537e-07,2.0,0.606568,0.120251,0.688399


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_cola_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning sst2 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.3691,0.40196,0.832569
2,0.1786,0.45999,0.833716


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.3691,2.681734,4.0019e-05,1.0,0.40196,0.832569
1,0.1786,2.931861,1.900238e-08,2.0,0.45999,0.833716


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_sst2_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mrpc -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,0.7242,0.569723,0.727941,0.83105
2,0.5081,0.557451,0.740196,0.833856


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score,eval_f1_score
0,0.7242,3.333426,4.034783e-05,1.0,0.569723,0.727941,0.83105
1,0.5081,4.196676,3.478261e-07,2.0,0.557451,0.740196,0.833856


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mrpc_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr
1,2.7346,1.888843,0.504394,0.501757
2,1.3771,1.676585,0.529895,0.523065


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr
0,2.7346,35.054741,4.022222e-05,1.0,1.888843,0.504394,0.501757
1,1.3771,17.149046,2.222222e-07,2.0,1.676585,0.529895,0.523065


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.3961,0.344694,0.789399,0.842172
2,0.2687,0.331557,0.803835,0.854242


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.3961,7.801034,4.000352e-05,1.0,0.344694,0.789399,0.842172
1,0.2687,9.245846,3.517721e-09,2.0,0.331557,0.803835,0.854242


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.8384,0.745386,0.67458
2,0.6661,0.697327,0.703821


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.8384,3.123421,4.000326e-05,1.0,0.745386,0.67458
1,0.6661,3.934601,3.259452e-09,2.0,0.697327,0.703821


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.8384,0.722191,0.688059
2,0.6658,0.675425,0.713792


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.8384,3.106269,4.000326e-05,1.0,0.722191,0.688059
1,0.6658,4.120486,3.259452e-09,2.0,0.675425,0.713792


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.5913,0.511567,0.749039
2,0.4674,0.499597,0.753066


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.5913,7.092668,4.001222e-05,1.0,0.511567,0.749039
1,0.4674,5.704651,1.221747e-08,2.0,0.499597,0.753066


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.8394,0.773243,0.516245
2,0.6313,0.729293,0.505415


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.8394,16.39027,4.051282e-05,1.0,0.773243,0.516245
1,0.6313,8.232226,5.128205e-07,2.0,0.729293,0.505415


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning wnli -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.0604,0.911257,0.478873
2,0.7852,0.921793,0.394366


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.0604,14.32718,4.2e-05,1.0,0.911257,0.478873
1,0.7852,6.899503,2e-06,2.0,0.921793,0.394366


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_wnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


## Conclusion

With ModernBERT encoders are back baby!  We've seen that ModernBERT-base can compete with the best of them on GLUE tasks and with a little more tuning, we'll see that ModernBERT-large can do even better.  I'm excited to see what the community will do with this model and I'm looking forward to seeing what you all build with it! We'll be exploring more of the capabilities of ModernBERT in future tutorials.

Until next time, happy coding!


# Test All models

In [10]:
train_bsz, val_bsz = 32, 32
betas = (0.9, 0.98)
n_epochs = 10
eps = 1e-6

def finetune_glue_task(
    lr, wd, task: str, checkpoint: str = "answerdotai/ModernBERT-base", train_subset: int | None = None, do_cleanup: bool = True
):  # 1. Load the task metadata
    task_meta = glue_tasks[task]
    train_ds_name = task_meta["dataset_names"]["train"]
    valid_ds_name = task_meta["dataset_names"]["valid"]

    task_inputs = task_meta["inputs"]
    n_labels = task_meta["n_labels"]
    task_metrics = task_meta["metric_funcs"]

    # 2. Load the dataset
    raw_datasets = load_dataset("glue", task.split("-")[0] if "-" in task else task)
    if train_subset is not None and len(raw_datasets["train"]) > train_subset:
        raw_datasets["train"] = raw_datasets["train"].shuffle(seed=42).select(range(train_subset))

    id2label, label2id = get_label_maps(raw_datasets, train_ds_name)

    # 3. Load the tokenizer
    hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenized_datasets = raw_datasets.map(partial(preprocess_function, task_inputs=task_inputs), batched=True)

    # 4. Define the compute metrics function
    task_compute_metrics = partial(compute_metrics, task_metrics=task_metrics)

    # 5. Load the model and data collator
    model_additional_kwargs = {"id2label": id2label, "label2id": label2id} if id2label and label2id else {}
    hf_model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=n_labels, **model_additional_kwargs
    )

    hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

    # 6. Define the training arguments and trainer
    training_args = TrainingArguments(
        output_dir=f"aai_ModernBERT_{task}_ft",
        learning_rate=lr,
        per_device_train_batch_size=train_bsz,
        per_device_eval_batch_size=val_bsz,
        num_train_epochs=n_epochs,
        lr_scheduler_type="linear",
        optim="adamw_torch",
        adam_beta1=betas[0],
        adam_beta2=betas[1],
        adam_epsilon=eps,
        weight_decay=wd,
        logging_strategy="epoch",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        bf16=True,
        bf16_full_eval=True,
        push_to_hub=False,
    )

    trainer = Trainer(
        model=hf_model,
        args=training_args,
        train_dataset=tokenized_datasets[train_ds_name],
        eval_dataset=tokenized_datasets[valid_ds_name],
        processing_class=hf_tokenizer,
        data_collator=hf_data_collator,
        compute_metrics=task_compute_metrics,
    )

    # Add callback to trainer
    metrics_callback = MetricsCallback()
    trainer.add_callback(metrics_callback)

    trainer.train()

    # 7. Get the training results and hyperparameters
    train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
    train_history_df = train_history_df.add_prefix("train_")
    eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
    train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

    args_df = pd.DataFrame([training_args.to_dict()])

    # 8. Cleanup (optional)
    if do_cleanup:
        cleanup(things_to_delete=[trainer, hf_model, hf_tokenizer, tokenized_datasets, raw_datasets])

    return train_res_df, args_df, hf_model, hf_tokenizer

In [None]:
checkpoint = "output_model/modernbert-mask-1b"  # "answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"
learning_rates = [1e-5, 5e-5, 8e-5]
weight_decays = [1e-6, 5e-6, 8e-6, 1e-5]

In [13]:
checkpoint = "output_model/modernbert-kan-1.5b"  # "answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"
learning_rates = [5e-5, 8e-5]
weight_decays = [1e-6, 5e-6, 8e-6, 1e-5]

hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
for task in glue_tasks.keys():
    for lr in learning_rates:
        for wd in weight_decays:
            print(f"----- Finetuning {task} | lr={lr} | wd={wd} -----")
            train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
                lr, wd, task, checkpoint=checkpoint, train_subset=None, do_cleanup=True
            )

            print(":: Results ::")
            display(train_res_df)
            display(args_df)

----- Finetuning qqp | lr=5e-05 | wd=1e-06 -----


Map: 100%|██████████| 363846/363846 [00:16<00:00, 21933.52 examples/s]
Map: 100%|██████████| 40430/40430 [00:01<00:00, 21500.95 examples/s]
Map: 100%|██████████| 390965/390965 [00:17<00:00, 21860.12 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5307,0.503569,0.632384,0.751571
2,0.4814,0.478566,0.644241,0.76562
3,0.4489,0.456189,0.697899,0.778457
4,0.4213,0.441895,0.705699,0.78976
5,0.401,0.432526,0.720123,0.793149
6,0.3839,0.426735,0.721795,0.800618
7,0.369,0.425275,0.730682,0.8
8,0.3568,0.422153,0.726984,0.804304
9,0.347,0.421607,0.73135,0.802968
10,0.3396,0.425973,0.733045,0.804897


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5307,4.428421,4.500044e-05,1.0,0.503569,0.632384,0.751571
1,0.4814,10.119098,4.000044e-05,2.0,0.478566,0.644241,0.76562
2,0.4489,9.667716,3.500044e-05,3.0,0.456189,0.697899,0.778457
3,0.4213,17.647268,3.000044e-05,4.0,0.441895,0.705699,0.78976
4,0.401,5.257286,2.500044e-05,5.0,0.432526,0.720123,0.793149
5,0.3839,6.184324,2.000044e-05,6.0,0.426735,0.721795,0.800618
6,0.369,5.671434,1.500044e-05,7.0,0.425275,0.730682,0.8
7,0.3568,9.904021,1.000044e-05,8.0,0.422153,0.726984,0.804304
8,0.347,3.979645,5.00044e-06,9.0,0.421607,0.73135,0.802968
9,0.3396,14.293823,4.397151e-10,10.0,0.425973,0.733045,0.804897


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp | lr=5e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5323,0.505151,0.626749,0.751299
2,0.4846,0.481477,0.647364,0.764729
3,0.4519,0.458186,0.703451,0.777863
4,0.4246,0.44692,0.699257,0.78788
5,0.4047,0.432011,0.714885,0.795053
6,0.3881,0.428477,0.724371,0.799629
7,0.3745,0.425463,0.732054,0.800668
8,0.3635,0.422293,0.727981,0.803215
9,0.3544,0.422579,0.731023,0.802622
10,0.3476,0.425055,0.732,0.804452


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5323,4.710798,4.500044e-05,1.0,0.505151,0.626749,0.751299
1,0.4846,10.757133,4.000044e-05,2.0,0.481477,0.647364,0.764729
2,0.4519,10.788968,3.500044e-05,3.0,0.458186,0.703451,0.777863
3,0.4246,18.314741,3.000044e-05,4.0,0.44692,0.699257,0.78788
4,0.4047,7.543452,2.500044e-05,5.0,0.432011,0.714885,0.795053
5,0.3881,14.390788,2.000044e-05,6.0,0.428477,0.724371,0.799629
6,0.3745,7.527752,1.500044e-05,7.0,0.425463,0.732054,0.800668
7,0.3635,4.046223,1.000044e-05,8.0,0.422293,0.727981,0.803215
8,0.3544,2.161394,5.00044e-06,9.0,0.422579,0.731023,0.802622
9,0.3476,4.879679,4.397151e-10,10.0,0.425055,0.732,0.804452


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp | lr=5e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5323,0.505134,0.626565,0.751397
2,0.4846,0.481539,0.64664,0.764333
3,0.452,0.458339,0.703381,0.777789
4,0.4247,0.446896,0.699471,0.787905
5,0.4047,0.431987,0.715273,0.795226
6,0.3882,0.428357,0.724657,0.799728
7,0.3745,0.425636,0.733079,0.801113
8,0.3635,0.422301,0.727298,0.802943
9,0.3544,0.422437,0.731257,0.802548
10,0.3476,0.424934,0.731848,0.804427


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5323,4.740328,4.500044e-05,1.0,0.505134,0.626565,0.751397
1,0.4846,10.788855,4.000044e-05,2.0,0.481539,0.64664,0.764333
2,0.452,10.790817,3.500044e-05,3.0,0.458339,0.703381,0.777789
3,0.4247,18.333187,3.000044e-05,4.0,0.446896,0.699471,0.787905
4,0.4047,7.476973,2.500044e-05,5.0,0.431987,0.715273,0.795226
5,0.3882,14.278922,2.000044e-05,6.0,0.428357,0.724657,0.799728
6,0.3745,7.332953,1.500044e-05,7.0,0.425636,0.733079,0.801113
7,0.3635,4.036319,1.000044e-05,8.0,0.422301,0.727298,0.802943
8,0.3544,2.180851,5.00044e-06,9.0,0.422437,0.731257,0.802548
9,0.3476,5.10205,4.397151e-10,10.0,0.424934,0.731848,0.804427


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp | lr=5e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5323,0.505172,0.626489,0.751051
2,0.4846,0.481508,0.646808,0.764507
3,0.4519,0.458406,0.703617,0.777665
4,0.4247,0.447022,0.698962,0.787583
5,0.4048,0.432001,0.715494,0.79535
6,0.3882,0.428514,0.724176,0.799753
7,0.3746,0.425696,0.73213,0.800618
8,0.3635,0.422434,0.72764,0.803265
9,0.3544,0.422626,0.731017,0.802597
10,0.3476,0.425106,0.731712,0.804328


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5323,4.702584,4.500044e-05,1.0,0.505172,0.626489,0.751051
1,0.4846,10.802587,4.000044e-05,2.0,0.481508,0.646808,0.764507
2,0.4519,10.758867,3.500044e-05,3.0,0.458406,0.703617,0.777665
3,0.4247,18.346371,3.000044e-05,4.0,0.447022,0.698962,0.787583
4,0.4048,7.506412,2.500044e-05,5.0,0.432001,0.715494,0.79535
5,0.3882,14.403512,2.000044e-05,6.0,0.428514,0.724176,0.799753
6,0.3746,7.513264,1.500044e-05,7.0,0.425696,0.73213,0.800618
7,0.3635,4.061352,1.000044e-05,8.0,0.422434,0.72764,0.803265
8,0.3544,2.17712,5.00044e-06,9.0,0.422626,0.731017,0.802597
9,0.3476,4.928557,4.397151e-10,10.0,0.425106,0.731712,0.804328


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp | lr=8e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5231,0.492053,0.638948,0.759016
2,0.4605,0.451205,0.681063,0.780089
3,0.4208,0.433369,0.719821,0.792703
4,0.3923,0.423607,0.721837,0.801632
5,0.3695,0.412856,0.727972,0.807569
6,0.3489,0.410698,0.738998,0.809894
7,0.3301,0.418147,0.745401,0.809325
8,0.3135,0.413474,0.745024,0.813703
9,0.2987,0.423655,0.745566,0.812293
10,0.2867,0.434248,0.74538,0.813579


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5231,3.334714,7.20007e-05,1.0,0.492053,0.638948,0.759016
1,0.4605,9.670066,6.40007e-05,2.0,0.451205,0.681063,0.780089
2,0.4208,9.020641,5.60007e-05,3.0,0.433369,0.719821,0.792703
3,0.3923,16.027401,4.80007e-05,4.0,0.423607,0.721837,0.801632
4,0.3695,5.745407,4.00007e-05,5.0,0.412856,0.727972,0.807569
5,0.3489,8.776779,3.20007e-05,6.0,0.410698,0.738998,0.809894
6,0.3301,6.643105,2.40007e-05,7.0,0.418147,0.745401,0.809325
7,0.3135,1.20899,1.60007e-05,8.0,0.413474,0.745024,0.813703
8,0.2987,6.614382,8.000704e-06,9.0,0.423655,0.745566,0.812293
9,0.2867,4.180085,7.035441e-10,10.0,0.434248,0.74538,0.813579


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp | lr=8e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5231,0.49206,0.639583,0.759065
2,0.4604,0.451224,0.681184,0.780188
3,0.4207,0.433691,0.719615,0.79253
4,0.3924,0.42364,0.720901,0.800841
5,0.3695,0.413275,0.728219,0.807569
6,0.3488,0.410677,0.738882,0.809745
7,0.3301,0.417981,0.744246,0.808434
8,0.3135,0.413934,0.74261,0.81306
9,0.2987,0.424152,0.743975,0.811872
10,0.2867,0.434773,0.74504,0.813431


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5231,3.350904,7.20007e-05,1.0,0.49206,0.639583,0.759065
1,0.4604,9.691531,6.40007e-05,2.0,0.451224,0.681184,0.780188
2,0.4207,8.927352,5.60007e-05,3.0,0.433691,0.719615,0.79253
3,0.3924,15.902912,4.80007e-05,4.0,0.42364,0.720901,0.800841
4,0.3695,5.729167,4.00007e-05,5.0,0.413275,0.728219,0.807569
5,0.3488,8.829821,3.20007e-05,6.0,0.410677,0.738882,0.809745
6,0.3301,6.465679,2.40007e-05,7.0,0.417981,0.744246,0.808434
7,0.3135,1.147583,1.60007e-05,8.0,0.413934,0.74261,0.81306
8,0.2987,6.329071,8.000704e-06,9.0,0.424152,0.743975,0.811872
9,0.2867,3.21252,7.035441e-10,10.0,0.434773,0.74504,0.813431


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp | lr=8e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5231,0.492146,0.639731,0.759164
2,0.4605,0.451172,0.680926,0.780089
3,0.4208,0.43352,0.720134,0.792852
4,0.3924,0.4236,0.722468,0.801534
5,0.3695,0.412672,0.728143,0.807569
6,0.3488,0.409813,0.740167,0.810784
7,0.33,0.418172,0.74591,0.809077
8,0.3134,0.413819,0.744352,0.813604
9,0.2986,0.422251,0.744378,0.812763
10,0.2866,0.433934,0.745175,0.81353


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5231,3.315574,7.20007e-05,1.0,0.492146,0.639731,0.759164
1,0.4605,9.853072,6.40007e-05,2.0,0.451172,0.680926,0.780089
2,0.4208,8.992543,5.60007e-05,3.0,0.43352,0.720134,0.792852
3,0.3924,16.184109,4.80007e-05,4.0,0.4236,0.722468,0.801534
4,0.3695,5.791317,4.00007e-05,5.0,0.412672,0.728143,0.807569
5,0.3488,9.00701,3.20007e-05,6.0,0.409813,0.740167,0.810784
6,0.33,6.49174,2.40007e-05,7.0,0.418172,0.74591,0.809077
7,0.3134,1.067463,1.60007e-05,8.0,0.413819,0.744352,0.813604
8,0.2986,7.509246,8.000704e-06,9.0,0.422251,0.744378,0.812763
9,0.2866,5.144277,7.035441e-10,10.0,0.433934,0.745175,0.81353


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qqp | lr=8e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,F1 Score,Accuracy Score
1,0.5231,0.492085,0.639547,0.759139
2,0.4604,0.451147,0.682126,0.780633
3,0.4207,0.433295,0.720032,0.792901
4,0.3924,0.423874,0.72222,0.801707
5,0.3696,0.412838,0.728306,0.807321
6,0.3488,0.409997,0.739707,0.810314
7,0.33,0.417222,0.744677,0.808682
8,0.3135,0.412975,0.744705,0.81395
9,0.2987,0.421362,0.745492,0.813233
10,0.2868,0.433481,0.745363,0.813579


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_f1_score,eval_accuracy_score
0,0.5231,3.309764,7.20007e-05,1.0,0.492085,0.639547,0.759139
1,0.4604,9.806194,6.40007e-05,2.0,0.451147,0.682126,0.780633
2,0.4207,9.026125,5.60007e-05,3.0,0.433295,0.720032,0.792901
3,0.3924,15.939571,4.80007e-05,4.0,0.423874,0.72222,0.801707
4,0.3696,5.70437,4.00007e-05,5.0,0.412838,0.728306,0.807321
5,0.3488,8.955913,3.20007e-05,6.0,0.409997,0.739707,0.810314
6,0.33,6.122435,2.40007e-05,7.0,0.417222,0.744677,0.808682
7,0.3135,1.249936,1.60007e-05,8.0,0.412975,0.744705,0.81395
8,0.2987,6.999358,8.000704e-06,9.0,0.421362,0.745492,0.813233
9,0.2868,4.196815,7.035441e-10,10.0,0.433481,0.745363,0.813579


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qqp_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=5e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.009,0.935116,0.536933
2,0.8977,0.874996,0.591544
3,0.8476,0.87696,0.596332
4,0.8076,0.818901,0.633316
5,0.7735,0.798099,0.650127
6,0.7447,0.786284,0.653693
7,0.7205,0.791442,0.658788
8,0.7008,0.781774,0.664493
9,0.6857,0.77709,0.663271
10,0.6748,0.778343,0.665716


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.009,2.578614,4.500041e-05,1.0,0.935116,0.536933
1,0.8977,3.837679,4.000041e-05,2.0,0.874996,0.591544
2,0.8476,1.850675,3.500041e-05,3.0,0.87696,0.596332
3,0.8076,4.085679,3.000041e-05,4.0,0.818901,0.633316
4,0.7735,5.90219,2.500041e-05,5.0,0.798099,0.650127
5,0.7447,3.694272,2.000041e-05,6.0,0.786284,0.653693
6,0.7205,5.231406,1.500041e-05,7.0,0.791442,0.658788
7,0.7008,4.766755,1.000041e-05,8.0,0.781774,0.664493
8,0.6857,4.450247,5.000407e-06,9.0,0.77709,0.663271
9,0.6748,7.655546,4.074316e-10,10.0,0.778343,0.665716


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=5e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.009,0.934976,0.537545
2,0.8977,0.875155,0.59134
3,0.8476,0.877187,0.595925
4,0.8077,0.81906,0.633011
5,0.7735,0.797942,0.649822
6,0.7448,0.786471,0.652165
7,0.7206,0.791321,0.658788
8,0.7009,0.781869,0.665104
9,0.6859,0.777331,0.663576
10,0.675,0.778486,0.665512


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.009,2.590229,4.500041e-05,1.0,0.934976,0.537545
1,0.8977,3.83303,4.000041e-05,2.0,0.875155,0.59134
2,0.8476,1.850501,3.500041e-05,3.0,0.877187,0.595925
3,0.8077,4.081359,3.000041e-05,4.0,0.81906,0.633011
4,0.7735,5.879056,2.500041e-05,5.0,0.797942,0.649822
5,0.7448,3.70153,2.000041e-05,6.0,0.786471,0.652165
6,0.7206,5.210847,1.500041e-05,7.0,0.791321,0.658788
7,0.7009,4.732634,1.000041e-05,8.0,0.781869,0.665104
8,0.6859,4.466926,5.000407e-06,9.0,0.777331,0.663576
9,0.675,7.591711,4.074316e-10,10.0,0.778486,0.665512


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=5e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.009,0.935188,0.536831
2,0.8977,0.875142,0.591645
3,0.8476,0.877401,0.595925
4,0.8077,0.818981,0.632705
5,0.7735,0.797968,0.650331
6,0.7448,0.786301,0.653591
7,0.7205,0.791311,0.658074
8,0.7008,0.781917,0.663882
9,0.6858,0.777081,0.66378
10,0.6749,0.7784,0.666021


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.009,2.569683,4.500041e-05,1.0,0.935188,0.536831
1,0.8977,3.828856,4.000041e-05,2.0,0.875142,0.591645
2,0.8476,1.858199,3.500041e-05,3.0,0.877401,0.595925
3,0.8077,4.084092,3.000041e-05,4.0,0.818981,0.632705
4,0.7735,5.897169,2.500041e-05,5.0,0.797968,0.650331
5,0.7448,3.684602,2.000041e-05,6.0,0.786301,0.653591
6,0.7205,5.232413,1.500041e-05,7.0,0.791311,0.658074
7,0.7008,4.745811,1.000041e-05,8.0,0.781917,0.663882
8,0.6858,4.461806,5.000407e-06,9.0,0.777081,0.66378
9,0.6749,7.617694,4.074316e-10,10.0,0.7784,0.666021


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=5e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.0089,0.934977,0.537239
2,0.8976,0.875153,0.591136
3,0.8476,0.876632,0.595415
4,0.8076,0.818827,0.632399
5,0.7735,0.798046,0.649516
6,0.7447,0.786338,0.653184
7,0.7205,0.791242,0.658482
8,0.7007,0.781688,0.664799
9,0.6857,0.777076,0.662761
10,0.6748,0.778526,0.665003


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.0089,2.577267,4.500041e-05,1.0,0.934977,0.537239
1,0.8976,3.841139,4.000041e-05,2.0,0.875153,0.591136
2,0.8476,1.852755,3.500041e-05,3.0,0.876632,0.595415
3,0.8076,4.07824,3.000041e-05,4.0,0.818827,0.632399
4,0.7735,5.917461,2.500041e-05,5.0,0.798046,0.649516
5,0.7447,3.676414,2.000041e-05,6.0,0.786338,0.653184
6,0.7205,5.275321,1.500041e-05,7.0,0.791242,0.658482
7,0.7007,4.770028,1.000041e-05,8.0,0.781688,0.664799
8,0.6857,4.495145,5.000407e-06,9.0,0.777076,0.662761
9,0.6748,7.630964,4.074316e-10,10.0,0.778526,0.665003


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=8e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9812,0.906343,0.564646
2,0.8665,0.845631,0.610902
3,0.8078,0.858952,0.606521
4,0.7593,0.793422,0.647071
5,0.7189,0.77618,0.659297
6,0.6843,0.776728,0.666633
7,0.6549,0.786673,0.666021
8,0.6294,0.801664,0.666225
9,0.6076,0.803186,0.664391
10,0.5905,0.815904,0.664799


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9812,3.546556,7.200065e-05,1.0,0.906343,0.564646
1,0.8665,3.198687,6.400065e-05,2.0,0.845631,0.610902
2,0.8078,2.002473,5.600065e-05,3.0,0.858952,0.606521
3,0.7593,4.242486,4.800065e-05,4.0,0.793422,0.647071
4,0.7189,3.970252,4.000065e-05,5.0,0.77618,0.659297
5,0.6843,3.422607,3.200065e-05,6.0,0.776728,0.666633
6,0.6549,4.519104,2.400065e-05,7.0,0.786673,0.666021
7,0.6294,3.727036,1.600065e-05,8.0,0.801664,0.666225
8,0.6076,3.481067,8.000652e-06,9.0,0.803186,0.664391
9,0.5905,5.653382,6.518905e-10,10.0,0.815904,0.664799


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=8e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9811,0.90583,0.564137
2,0.8665,0.845789,0.6108
3,0.8083,0.861281,0.604279
4,0.76,0.793795,0.647682
5,0.7197,0.776992,0.660418
6,0.6851,0.778101,0.66755
7,0.6558,0.786749,0.665614
8,0.6305,0.801278,0.666429
9,0.6087,0.802875,0.666021
10,0.5915,0.815959,0.664901


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9811,3.538836,7.200065e-05,1.0,0.90583,0.564137
1,0.8665,3.215607,6.400065e-05,2.0,0.845789,0.6108
2,0.8083,1.972062,5.600065e-05,3.0,0.861281,0.604279
3,0.76,4.279224,4.800065e-05,4.0,0.793795,0.647682
4,0.7197,3.860758,4.000065e-05,5.0,0.776992,0.660418
5,0.6851,3.421545,3.200065e-05,6.0,0.778101,0.66755
6,0.6558,4.354138,2.400065e-05,7.0,0.786749,0.665614
7,0.6305,3.672747,1.600065e-05,8.0,0.801278,0.666429
8,0.6087,3.522782,8.000652e-06,9.0,0.802875,0.666021
9,0.5915,5.453065,6.518905e-10,10.0,0.815959,0.664901


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=8e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9811,0.905966,0.564035
2,0.8664,0.845919,0.610087
3,0.8082,0.860771,0.605298
4,0.7598,0.793538,0.646969
5,0.7194,0.77657,0.659908
6,0.6848,0.778178,0.666531
7,0.6556,0.785973,0.666429
8,0.6302,0.800875,0.666735
9,0.6084,0.802548,0.665716
10,0.5912,0.815734,0.664391


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9811,3.526989,7.200065e-05,1.0,0.905966,0.564035
1,0.8664,3.221545,6.400065e-05,2.0,0.845919,0.610087
2,0.8082,1.94329,5.600065e-05,3.0,0.860771,0.605298
3,0.7598,4.262865,4.800065e-05,4.0,0.793538,0.646969
4,0.7194,3.906947,4.000065e-05,5.0,0.77657,0.659908
5,0.6848,3.462326,3.200065e-05,6.0,0.778178,0.666531
6,0.6556,4.370262,2.400065e-05,7.0,0.785973,0.666429
7,0.6302,3.711073,1.600065e-05,8.0,0.800875,0.666735
8,0.6084,3.548995,8.000652e-06,9.0,0.802548,0.665716
9,0.5912,5.478768,6.518905e-10,10.0,0.815734,0.664391


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-matched | lr=8e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9811,0.906207,0.565257
2,0.8665,0.845961,0.610902
3,0.808,0.859638,0.605909
4,0.7596,0.793787,0.647784
5,0.7193,0.776758,0.658686
6,0.6847,0.7768,0.666531
7,0.6554,0.787606,0.664697
8,0.6299,0.802114,0.66541
9,0.608,0.8037,0.663678
10,0.5908,0.816591,0.664799


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9811,3.538939,7.200065e-05,1.0,0.906207,0.565257
1,0.8665,3.209416,6.400065e-05,2.0,0.845961,0.610902
2,0.808,2.037462,5.600065e-05,3.0,0.859638,0.605909
3,0.7596,4.279786,4.800065e-05,4.0,0.793787,0.647784
4,0.7193,3.921552,4.000065e-05,5.0,0.776758,0.658686
5,0.6847,3.397992,3.200065e-05,6.0,0.7768,0.666531
6,0.6554,4.569971,2.400065e-05,7.0,0.787606,0.664697
7,0.6299,3.677959,1.600065e-05,8.0,0.802114,0.66541
8,0.608,3.522162,8.000652e-06,9.0,0.8037,0.663678
9,0.5908,5.612611,6.518905e-10,10.0,0.816591,0.664799


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-matched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=5e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.0089,0.92693,0.53926
2,0.8976,0.876518,0.587571
3,0.8475,0.887806,0.583808
4,0.8076,0.822498,0.631408
5,0.7735,0.799199,0.650122
6,0.7447,0.788827,0.655411
7,0.7203,0.789704,0.659072
8,0.7006,0.782481,0.661819
9,0.6856,0.776104,0.662429
10,0.6747,0.778114,0.664361


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.0089,2.584716,4.500041e-05,1.0,0.92693,0.53926
1,0.8976,3.836071,4.000041e-05,2.0,0.876518,0.587571
2,0.8475,1.85236,3.500041e-05,3.0,0.887806,0.583808
3,0.8076,4.075291,3.000041e-05,4.0,0.822498,0.631408
4,0.7735,5.890446,2.500041e-05,5.0,0.799199,0.650122
5,0.7447,3.693268,2.000041e-05,6.0,0.788827,0.655411
6,0.7203,5.235649,1.500041e-05,7.0,0.789704,0.659072
7,0.7006,4.738777,1.000041e-05,8.0,0.782481,0.661819
8,0.6856,4.485253,5.000407e-06,9.0,0.776104,0.662429
9,0.6747,7.598418,4.074316e-10,10.0,0.778114,0.664361


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=5e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.009,0.927024,0.539666
2,0.8978,0.876757,0.587368
3,0.8476,0.88757,0.583503
4,0.8077,0.822488,0.630391
5,0.7736,0.799478,0.649614
6,0.745,0.789435,0.655207
7,0.7208,0.790333,0.658259
8,0.7011,0.783013,0.661412
9,0.6861,0.776657,0.662531
10,0.6752,0.778491,0.665887


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.009,2.56865,4.500041e-05,1.0,0.927024,0.539666
1,0.8978,3.832816,4.000041e-05,2.0,0.876757,0.587368
2,0.8476,1.856597,3.500041e-05,3.0,0.88757,0.583503
3,0.8077,4.059544,3.000041e-05,4.0,0.822488,0.630391
4,0.7736,5.879162,2.500041e-05,5.0,0.799478,0.649614
5,0.745,3.694394,2.000041e-05,6.0,0.789435,0.655207
6,0.7208,5.228707,1.500041e-05,7.0,0.790333,0.658259
7,0.7011,4.757323,1.000041e-05,8.0,0.783013,0.661412
8,0.6861,4.432555,5.000407e-06,9.0,0.776657,0.662531
9,0.6752,7.704016,4.074316e-10,10.0,0.778491,0.665887


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=5e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.009,0.926891,0.539361
2,0.8977,0.876734,0.587673
3,0.8475,0.887642,0.584113
4,0.8076,0.822823,0.630085
5,0.7735,0.799506,0.649105
6,0.7448,0.789351,0.655411
7,0.7205,0.790097,0.657547
8,0.7008,0.782905,0.661717
9,0.6858,0.776622,0.661717
10,0.6749,0.77847,0.665378


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.009,2.564945,4.500041e-05,1.0,0.926891,0.539361
1,0.8977,3.838824,4.000041e-05,2.0,0.876734,0.587673
2,0.8475,1.855172,3.500041e-05,3.0,0.887642,0.584113
3,0.8076,4.063422,3.000041e-05,4.0,0.822823,0.630085
4,0.7735,5.904194,2.500041e-05,5.0,0.799506,0.649105
5,0.7448,3.699251,2.000041e-05,6.0,0.789351,0.655411
6,0.7205,5.225732,1.500041e-05,7.0,0.790097,0.657547
7,0.7008,4.74455,1.000041e-05,8.0,0.782905,0.661717
8,0.6858,4.463809,5.000407e-06,9.0,0.776622,0.661717
9,0.6749,7.661366,4.074316e-10,10.0,0.77847,0.665378


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=5e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,1.009,0.926992,0.538954
2,0.8977,0.876633,0.587571
3,0.8476,0.887644,0.584215
4,0.8077,0.822408,0.631306
5,0.7736,0.799545,0.649105
6,0.745,0.78957,0.655207
7,0.7209,0.790285,0.657954
8,0.7012,0.783158,0.662022
9,0.6862,0.776738,0.662124
10,0.6753,0.778697,0.66548


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,1.009,2.577824,4.500041e-05,1.0,0.926992,0.538954
1,0.8977,3.822299,4.000041e-05,2.0,0.876633,0.587571
2,0.8476,1.871962,3.500041e-05,3.0,0.887644,0.584215
3,0.8077,4.070261,3.000041e-05,4.0,0.822408,0.631306
4,0.7736,5.895524,2.500041e-05,5.0,0.799545,0.649105
5,0.745,3.690662,2.000041e-05,6.0,0.78957,0.655207
6,0.7209,5.206637,1.500041e-05,7.0,0.790285,0.657954
7,0.7012,4.733611,1.000041e-05,8.0,0.783158,0.662022
8,0.6862,4.455688,5.000407e-06,9.0,0.776738,0.662124
9,0.6753,7.649499,4.074316e-10,10.0,0.778697,0.66548


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=8e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9812,0.898294,0.569264
2,0.8667,0.84709,0.608218
3,0.8083,0.871084,0.600386
4,0.7598,0.79695,0.648495
5,0.7193,0.774195,0.659581
6,0.6847,0.77691,0.663039
7,0.6553,0.781376,0.662327
8,0.6299,0.790818,0.661107
9,0.608,0.794919,0.657445
10,0.5909,0.805664,0.659581


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9812,3.535302,7.200065e-05,1.0,0.898294,0.569264
1,0.8667,3.191069,6.400065e-05,2.0,0.84709,0.608218
2,0.8083,2.019267,5.600065e-05,3.0,0.871084,0.600386
3,0.7598,4.276262,4.800065e-05,4.0,0.79695,0.648495
4,0.7193,3.914962,4.000065e-05,5.0,0.774195,0.659581
5,0.6847,3.42803,3.200065e-05,6.0,0.77691,0.663039
6,0.6553,4.275761,2.400065e-05,7.0,0.781376,0.662327
7,0.6299,3.783288,1.600065e-05,8.0,0.790818,0.661107
8,0.608,3.483202,8.000652e-06,9.0,0.794919,0.657445
9,0.5909,5.517904,6.518905e-10,10.0,0.805664,0.659581


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=8e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9812,0.898162,0.56906
2,0.8665,0.846882,0.609947
3,0.8078,0.871985,0.599471
4,0.7594,0.796542,0.647884
5,0.719,0.774343,0.661717
6,0.6845,0.777051,0.663954
7,0.6552,0.781168,0.664463
8,0.6297,0.791132,0.662632
9,0.6079,0.795742,0.6607
10,0.5907,0.805215,0.662632


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9812,3.527708,7.200065e-05,1.0,0.898162,0.56906
1,0.8665,3.232081,6.400065e-05,2.0,0.846882,0.609947
2,0.8078,2.020553,5.600065e-05,3.0,0.871985,0.599471
3,0.7594,4.233264,4.800065e-05,4.0,0.796542,0.647884
4,0.719,3.975698,4.000065e-05,5.0,0.774343,0.661717
5,0.6845,3.44208,3.200065e-05,6.0,0.777051,0.663954
6,0.6552,4.591588,2.400065e-05,7.0,0.781168,0.664463
7,0.6297,3.672906,1.600065e-05,8.0,0.791132,0.662632
8,0.6079,3.510344,8.000652e-06,9.0,0.795742,0.6607
9,0.5907,5.692046,6.518905e-10,10.0,0.805215,0.662632


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=8e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9812,0.898389,0.568857
2,0.8665,0.84654,0.608218
3,0.8076,0.871909,0.599369
4,0.7592,0.796539,0.647783
5,0.7189,0.773872,0.660598
6,0.6843,0.77626,0.664361
7,0.6549,0.781261,0.66487
8,0.6294,0.790519,0.660496
9,0.6076,0.795198,0.661513
10,0.5904,0.804615,0.661513


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9812,3.524212,7.200065e-05,1.0,0.898389,0.568857
1,0.8665,3.206391,6.400065e-05,2.0,0.84654,0.608218
2,0.8076,2.056675,5.600065e-05,3.0,0.871909,0.599369
3,0.7592,4.191925,4.800065e-05,4.0,0.796539,0.647783
4,0.7189,4.060833,4.000065e-05,5.0,0.773872,0.660598
5,0.6843,3.430127,3.200065e-05,6.0,0.77626,0.664361
6,0.6549,4.556543,2.400065e-05,7.0,0.781261,0.66487
7,0.6294,3.708026,1.600065e-05,8.0,0.790519,0.660496
8,0.6076,3.509227,8.000652e-06,9.0,0.795198,0.661513
9,0.5904,5.745219,6.518905e-10,10.0,0.804615,0.661513


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning mnli-mismatched | lr=8e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.9811,0.898107,0.568959
2,0.8664,0.846593,0.610456
3,0.8079,0.872233,0.599064
4,0.7595,0.796586,0.647783
5,0.7192,0.774721,0.66131
6,0.6846,0.777834,0.662836
7,0.6553,0.781031,0.66548
8,0.6298,0.791101,0.661615
9,0.608,0.795726,0.659378
10,0.5909,0.80534,0.66192


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.9811,3.536294,7.200065e-05,1.0,0.898107,0.568959
1,0.8664,3.217821,6.400065e-05,2.0,0.846593,0.610456
2,0.8079,1.984881,5.600065e-05,3.0,0.872233,0.599064
3,0.7595,4.219394,4.800065e-05,4.0,0.796586,0.647783
4,0.7192,3.954699,4.000065e-05,5.0,0.774721,0.66131
5,0.6846,3.460091,3.200065e-05,6.0,0.777834,0.662836
6,0.6553,4.329827,2.400065e-05,7.0,0.781031,0.66548
7,0.6298,3.666073,1.600065e-05,8.0,0.791101,0.661615
8,0.608,3.522502,8.000652e-06,9.0,0.795726,0.659378
9,0.5909,5.617026,6.518905e-10,10.0,0.80534,0.66192


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_mnli-mismatched_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=5e-05 | wd=1e-06 -----


Map: 100%|██████████| 104743/104743 [00:08<00:00, 12570.24 examples/s]
Map: 100%|██████████| 5463/5463 [00:00<00:00, 13716.14 examples/s]
Map: 100%|██████████| 5463/5463 [00:00<00:00, 13660.44 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a d

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6766,0.665938,0.590152
2,0.6601,0.65313,0.605345
3,0.6459,0.650207,0.616694
4,0.6316,0.661561,0.610104
5,0.6163,0.680468,0.592532
6,0.6029,0.66567,0.607359
7,0.5917,0.669339,0.604979
8,0.5822,0.67926,0.597291
9,0.5745,0.686738,0.598938
10,0.5686,0.694734,0.598755


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6766,4.169216,4.500153e-05,1.0,0.665938,0.590152
1,0.6601,5.564226,4.000153e-05,2.0,0.65313,0.605345
2,0.6459,6.162128,3.500153e-05,3.0,0.650207,0.616694
3,0.6316,2.949098,3.000153e-05,4.0,0.661561,0.610104
4,0.6163,5.523984,2.500153e-05,5.0,0.680468,0.592532
5,0.6029,2.222785,2.000153e-05,6.0,0.66567,0.607359
6,0.5917,4.548532,1.500153e-05,7.0,0.669339,0.604979
7,0.5822,11.992335,1.000153e-05,8.0,0.67926,0.597291
8,0.5745,4.744533,5.001527e-06,9.0,0.686738,0.598938
9,0.5686,11.097974,1.527184e-09,10.0,0.694734,0.598755


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=5e-05 | wd=5e-06 -----


Map: 100%|██████████| 5463/5463 [00:00<00:00, 13130.87 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model co

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6766,0.666048,0.58942
2,0.6601,0.653148,0.604979
3,0.6459,0.650231,0.616145
4,0.6315,0.661289,0.610104
5,0.6163,0.680866,0.591982
6,0.6029,0.665752,0.609006
7,0.5917,0.669376,0.605162
8,0.5822,0.679213,0.597474
9,0.5746,0.686758,0.598206
10,0.5686,0.694822,0.598389


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6766,4.161219,4.500153e-05,1.0,0.666048,0.58942
1,0.6601,5.514252,4.000153e-05,2.0,0.653148,0.604979
2,0.6459,6.150647,3.500153e-05,3.0,0.650231,0.616145
3,0.6315,2.960356,3.000153e-05,4.0,0.661289,0.610104
4,0.6163,5.526634,2.500153e-05,5.0,0.680866,0.591982
5,0.6029,2.216413,2.000153e-05,6.0,0.665752,0.609006
6,0.5917,4.575377,1.500153e-05,7.0,0.669376,0.605162
7,0.5822,11.993254,1.000153e-05,8.0,0.679213,0.597474
8,0.5746,4.740321,5.001527e-06,9.0,0.686758,0.598206
9,0.5686,11.124548,1.527184e-09,10.0,0.694822,0.598389


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=5e-05 | wd=8e-06 -----


Map: 100%|██████████| 5463/5463 [00:00<00:00, 14268.82 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model co

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6766,0.665839,0.589786
2,0.6601,0.653165,0.605162
3,0.6459,0.650079,0.615962
4,0.6315,0.661289,0.609921
5,0.6163,0.680639,0.591982
6,0.6029,0.665694,0.608091
7,0.5917,0.669351,0.605345
8,0.5822,0.679049,0.597657
9,0.5746,0.686564,0.598389
10,0.5686,0.694764,0.598755


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6766,4.176658,4.500153e-05,1.0,0.665839,0.589786
1,0.6601,5.549973,4.000153e-05,2.0,0.653165,0.605162
2,0.6459,6.164438,3.500153e-05,3.0,0.650079,0.615962
3,0.6315,2.961268,3.000153e-05,4.0,0.661289,0.609921
4,0.6163,5.526524,2.500153e-05,5.0,0.680639,0.591982
5,0.6029,2.216169,2.000153e-05,6.0,0.665694,0.608091
6,0.5917,4.531575,1.500153e-05,7.0,0.669351,0.605345
7,0.5822,11.992459,1.000153e-05,8.0,0.679049,0.597657
8,0.5746,4.742888,5.001527e-06,9.0,0.686564,0.598389
9,0.5686,11.047654,1.527184e-09,10.0,0.694764,0.598755


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=5e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6766,0.665975,0.589054
2,0.6601,0.653117,0.605528
3,0.6459,0.650024,0.616877
4,0.6316,0.661687,0.610287
5,0.6163,0.680369,0.592532
6,0.6029,0.665703,0.609372
7,0.5917,0.669395,0.605162
8,0.5822,0.679188,0.597474
9,0.5746,0.686646,0.598206
10,0.5686,0.694795,0.598206


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6766,4.143357,4.500153e-05,1.0,0.665975,0.589054
1,0.6601,5.570116,4.000153e-05,2.0,0.653117,0.605528
2,0.6459,6.210609,3.500153e-05,3.0,0.650024,0.616877
3,0.6316,2.931389,3.000153e-05,4.0,0.661687,0.610287
4,0.6163,5.530557,2.500153e-05,5.0,0.680369,0.592532
5,0.6029,2.214881,2.000153e-05,6.0,0.665703,0.609372
6,0.5917,4.522542,1.500153e-05,7.0,0.669395,0.605162
7,0.5822,11.975261,1.000153e-05,8.0,0.679188,0.597474
8,0.5746,4.760415,5.001527e-06,9.0,0.686646,0.598206
9,0.5686,11.109859,1.527184e-09,10.0,0.694795,0.598206


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=8e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6755,0.663102,0.590518
2,0.6544,0.646174,0.615779
3,0.6347,0.646472,0.625114
4,0.6128,0.660493,0.617792
5,0.5896,0.679491,0.596742
6,0.5685,0.667924,0.625847
7,0.5511,0.659652,0.621087
8,0.5361,0.675028,0.617243
9,0.524,0.691598,0.617243
10,0.5149,0.701193,0.617426


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6755,2.436867,7.200244e-05,1.0,0.663102,0.590518
1,0.6544,4.142736,6.400244e-05,2.0,0.646174,0.615779
2,0.6347,4.681017,5.600244e-05,3.0,0.646472,0.625114
3,0.6128,2.260216,4.800244e-05,4.0,0.660493,0.617792
4,0.5896,6.055718,4.000244e-05,5.0,0.679491,0.596742
5,0.5685,4.990744,3.200244e-05,6.0,0.667924,0.625847
6,0.5511,3.062626,2.400244e-05,7.0,0.659652,0.621087
7,0.5361,10.58942,1.600244e-05,8.0,0.675028,0.617243
8,0.524,2.737286,8.002443e-06,9.0,0.691598,0.617243
9,0.5149,7.533902,2.443494e-09,10.0,0.701193,0.617426


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=8e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6755,0.662997,0.590335
2,0.6543,0.646191,0.616145
3,0.6347,0.646261,0.624565
4,0.6127,0.660386,0.616877
5,0.5896,0.679448,0.596742
6,0.5684,0.667749,0.626213
7,0.551,0.659496,0.620904
8,0.5359,0.674646,0.617792
9,0.5239,0.691548,0.618159
10,0.5148,0.700962,0.615413


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6755,2.498394,7.200244e-05,1.0,0.662997,0.590335
1,0.6543,4.155247,6.400244e-05,2.0,0.646191,0.616145
2,0.6347,4.737343,5.600244e-05,3.0,0.646261,0.624565
3,0.6127,2.28723,4.800244e-05,4.0,0.660386,0.616877
4,0.5896,6.075803,4.000244e-05,5.0,0.679448,0.596742
5,0.5684,5.022288,3.200244e-05,6.0,0.667749,0.626213
6,0.551,3.088498,2.400244e-05,7.0,0.659496,0.620904
7,0.5359,10.587175,1.600244e-05,8.0,0.674646,0.617792
8,0.5239,2.735348,8.002443e-06,9.0,0.691548,0.618159
9,0.5148,7.478501,2.443494e-09,10.0,0.700962,0.615413


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=8e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6755,0.662961,0.590884
2,0.6543,0.646172,0.615962
3,0.6347,0.646359,0.624382
4,0.6127,0.660081,0.617243
5,0.5897,0.679801,0.596925
6,0.5685,0.667986,0.626579
7,0.551,0.659524,0.620355
8,0.536,0.674679,0.618342
9,0.524,0.691564,0.617426
10,0.5149,0.701162,0.616694


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6755,2.469314,7.200244e-05,1.0,0.662961,0.590884
1,0.6543,4.125077,6.400244e-05,2.0,0.646172,0.615962
2,0.6347,4.718029,5.600244e-05,3.0,0.646359,0.624382
3,0.6127,2.263319,4.800244e-05,4.0,0.660081,0.617243
4,0.5897,6.035298,4.000244e-05,5.0,0.679801,0.596925
5,0.5685,5.058905,3.200244e-05,6.0,0.667986,0.626579
6,0.551,3.064452,2.400244e-05,7.0,0.659524,0.620355
7,0.536,10.546488,1.600244e-05,8.0,0.674679,0.618342
8,0.524,2.745219,8.002443e-06,9.0,0.691564,0.617426
9,0.5149,7.5273,2.443494e-09,10.0,0.701162,0.616694


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning qnli | lr=8e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.6755,0.663032,0.58942
2,0.6544,0.646177,0.615962
3,0.6347,0.646243,0.624748
4,0.6128,0.660379,0.61706
5,0.5897,0.680014,0.596925
6,0.5686,0.668329,0.624748
7,0.5512,0.659564,0.619989
8,0.5362,0.675583,0.617426
9,0.5242,0.691667,0.616145
10,0.515,0.701173,0.617975


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.6755,2.447531,7.200244e-05,1.0,0.663032,0.58942
1,0.6544,4.103011,6.400244e-05,2.0,0.646177,0.615962
2,0.6347,4.718501,5.600244e-05,3.0,0.646243,0.624748
3,0.6128,2.250936,4.800244e-05,4.0,0.660379,0.61706
4,0.5897,6.062785,4.000244e-05,5.0,0.680014,0.596925
5,0.5686,4.968988,3.200244e-05,6.0,0.668329,0.624748
6,0.5512,3.053701,2.400244e-05,7.0,0.659564,0.619989
7,0.5362,10.594063,1.600244e-05,8.0,0.675583,0.617426
8,0.5242,2.73493,8.002443e-06,9.0,0.691667,0.616145
9,0.515,7.551405,2.443494e-09,10.0,0.701173,0.617975


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_qnli_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=5e-05 | wd=1e-06 -----


Map: 100%|██████████| 2490/2490 [00:00<00:00, 10634.83 examples/s]
Map: 100%|██████████| 277/277 [00:00<00:00, 9665.42 examples/s]
Map: 100%|██████████| 3000/3000 [00:00<00:00, 11466.93 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-str

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7644,0.700365,0.570397
2,0.6917,0.69349,0.527076
3,0.658,0.683079,0.566787
4,0.6336,0.690317,0.555957
5,0.6005,0.70297,0.541516
6,0.5748,0.722307,0.534296
7,0.5352,0.740572,0.527076
8,0.5073,0.759274,0.534296
9,0.4812,0.766976,0.501805
10,0.4668,0.772057,0.523466


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7644,26.885214,4.50641e-05,1.0,0.700365,0.570397
1,0.6917,12.119349,4.00641e-05,2.0,0.69349,0.527076
2,0.658,33.428204,3.50641e-05,3.0,0.683079,0.566787
3,0.6336,21.194365,3.00641e-05,4.0,0.690317,0.555957
4,0.6005,20.462158,2.50641e-05,5.0,0.70297,0.541516
5,0.5748,18.529848,2.00641e-05,6.0,0.722307,0.534296
6,0.5352,18.906212,1.50641e-05,7.0,0.740572,0.527076
7,0.5073,29.870289,1.00641e-05,8.0,0.759274,0.534296
8,0.4812,16.95437,5.064103e-06,9.0,0.766976,0.501805
9,0.4668,20.501951,6.410256e-08,10.0,0.772057,0.523466


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=5e-05 | wd=5e-06 -----


Map: 100%|██████████| 277/277 [00:00<00:00, 9799.94 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model confi

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7643,0.700424,0.574007
2,0.6917,0.693204,0.523466
3,0.658,0.683351,0.563177
4,0.6335,0.689753,0.548736
5,0.6005,0.703677,0.537906
6,0.5747,0.722615,0.537906
7,0.5351,0.740576,0.527076
8,0.5072,0.758788,0.530686
9,0.481,0.767085,0.505415
10,0.4667,0.77186,0.527076


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7643,26.9991,4.50641e-05,1.0,0.700424,0.574007
1,0.6917,12.12096,4.00641e-05,2.0,0.693204,0.523466
2,0.658,33.432587,3.50641e-05,3.0,0.683351,0.563177
3,0.6335,21.320621,3.00641e-05,4.0,0.689753,0.548736
4,0.6005,20.63851,2.50641e-05,5.0,0.703677,0.537906
5,0.5747,18.821421,2.00641e-05,6.0,0.722615,0.537906
6,0.5351,18.882942,1.50641e-05,7.0,0.740576,0.527076
7,0.5072,29.785284,1.00641e-05,8.0,0.758788,0.530686
8,0.481,16.960321,5.064103e-06,9.0,0.767085,0.505415
9,0.4667,20.466106,6.410256e-08,10.0,0.77186,0.527076


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=5e-05 | wd=8e-06 -----


Map: 100%|██████████| 3000/3000 [00:00<00:00, 11184.88 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model co

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7643,0.700511,0.570397
2,0.6916,0.693426,0.527076
3,0.658,0.683146,0.566787
4,0.6336,0.689978,0.545126
5,0.6005,0.703379,0.541516
6,0.5745,0.723202,0.534296
7,0.5351,0.740728,0.530686
8,0.5072,0.759584,0.523466
9,0.4811,0.767062,0.505415
10,0.4667,0.772027,0.523466


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7643,27.043472,4.50641e-05,1.0,0.700511,0.570397
1,0.6916,12.131401,4.00641e-05,2.0,0.693426,0.527076
2,0.658,33.318375,3.50641e-05,3.0,0.683146,0.566787
3,0.6336,21.168537,3.00641e-05,4.0,0.689978,0.545126
4,0.6005,20.638136,2.50641e-05,5.0,0.703379,0.541516
5,0.5745,18.932919,2.00641e-05,6.0,0.723202,0.534296
6,0.5351,18.862265,1.50641e-05,7.0,0.740728,0.530686
7,0.5072,29.823963,1.00641e-05,8.0,0.759584,0.523466
8,0.4811,16.942871,5.064103e-06,9.0,0.767062,0.505415
9,0.4667,20.500526,6.410256e-08,10.0,0.772027,0.523466


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=5e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7644,0.699852,0.574007
2,0.6916,0.693405,0.523466
3,0.658,0.683104,0.566787
4,0.6335,0.689964,0.548736
5,0.6004,0.703527,0.541516
6,0.5746,0.723081,0.534296
7,0.535,0.740705,0.527076
8,0.5072,0.759146,0.527076
9,0.481,0.767122,0.505415
10,0.4667,0.772132,0.523466


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7644,27.028467,4.50641e-05,1.0,0.699852,0.574007
1,0.6916,12.121691,4.00641e-05,2.0,0.693405,0.523466
2,0.658,33.478798,3.50641e-05,3.0,0.683104,0.566787
3,0.6335,21.174776,3.00641e-05,4.0,0.689964,0.548736
4,0.6004,20.580406,2.50641e-05,5.0,0.703527,0.541516
5,0.5746,19.06591,2.00641e-05,6.0,0.723081,0.534296
6,0.535,18.949591,1.50641e-05,7.0,0.740705,0.527076
7,0.5072,29.90778,1.00641e-05,8.0,0.759146,0.527076
8,0.481,16.951477,5.064103e-06,9.0,0.767122,0.505415
9,0.4667,20.513948,6.410256e-08,10.0,0.772132,0.523466


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=8e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7496,0.692422,0.563177
2,0.6816,0.715882,0.527076
3,0.6429,0.693878,0.545126
4,0.5922,0.727257,0.534296
5,0.5132,0.806019,0.512635
6,0.4458,0.893839,0.509025
7,0.3741,1.00128,0.494585
8,0.3239,1.06606,0.519856
9,0.293,1.071061,0.498195
10,0.2647,1.089689,0.501805


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7496,19.068108,7.210256e-05,1.0,0.692422,0.563177
1,0.6816,9.640022,6.410256e-05,2.0,0.715882,0.527076
2,0.6429,27.514086,5.610256e-05,3.0,0.693878,0.545126
3,0.5922,19.665632,4.810256e-05,4.0,0.727257,0.534296
4,0.5132,14.659527,4.010256e-05,5.0,0.806019,0.512635
5,0.4458,19.10935,3.210256e-05,6.0,0.893839,0.509025
6,0.3741,46.404461,2.410256e-05,7.0,1.00128,0.494585
7,0.3239,42.112499,1.610256e-05,8.0,1.06606,0.519856
8,0.293,14.147307,8.102564e-06,9.0,1.071061,0.498195
9,0.2647,18.373932,1.025641e-07,10.0,1.089689,0.501805


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=8e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7497,0.692065,0.559567
2,0.6817,0.716212,0.527076
3,0.643,0.693768,0.545126
4,0.5922,0.727068,0.530686
5,0.5133,0.8058,0.516245
6,0.4456,0.893939,0.509025
7,0.3739,1.002034,0.494585
8,0.3237,1.067384,0.516245
9,0.2931,1.071728,0.498195
10,0.2648,1.090465,0.498195


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7497,19.12307,7.210256e-05,1.0,0.692065,0.559567
1,0.6817,9.642652,6.410256e-05,2.0,0.716212,0.527076
2,0.643,27.553505,5.610256e-05,3.0,0.693768,0.545126
3,0.5922,19.589676,4.810256e-05,4.0,0.727068,0.530686
4,0.5133,14.159642,4.010256e-05,5.0,0.8058,0.516245
5,0.4456,19.151899,3.210256e-05,6.0,0.893939,0.509025
6,0.3739,46.528721,2.410256e-05,7.0,1.002034,0.494585
7,0.3237,42.550983,1.610256e-05,8.0,1.067384,0.516245
8,0.2931,14.019723,8.102564e-06,9.0,1.071728,0.498195
9,0.2648,18.407572,1.025641e-07,10.0,1.090465,0.498195


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=8e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7499,0.692584,0.570397
2,0.6815,0.716159,0.527076
3,0.643,0.693825,0.548736
4,0.5923,0.726893,0.534296
5,0.5134,0.805326,0.519856
6,0.4459,0.892883,0.512635
7,0.3743,1.000291,0.494585
8,0.324,1.065859,0.516245
9,0.2931,1.070727,0.501805
10,0.2648,1.090263,0.494585


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7499,19.096334,7.210256e-05,1.0,0.692584,0.570397
1,0.6815,9.655931,6.410256e-05,2.0,0.716159,0.527076
2,0.643,27.532532,5.610256e-05,3.0,0.693825,0.548736
3,0.5923,19.582499,4.810256e-05,4.0,0.726893,0.534296
4,0.5134,14.56604,4.010256e-05,5.0,0.805326,0.519856
5,0.4459,19.046404,3.210256e-05,6.0,0.892883,0.512635
6,0.3743,45.274406,2.410256e-05,7.0,1.000291,0.494585
7,0.324,42.754986,1.610256e-05,8.0,1.065859,0.516245
8,0.2931,14.046024,8.102564e-06,9.0,1.070727,0.501805
9,0.2648,18.366131,1.025641e-07,10.0,1.090263,0.494585


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning rte | lr=8e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Accuracy Score
1,0.7498,0.691836,0.559567
2,0.6817,0.715767,0.527076
3,0.643,0.693733,0.545126
4,0.5922,0.727088,0.534296
5,0.5132,0.805836,0.519856
6,0.446,0.892868,0.509025
7,0.3744,1.00151,0.494585
8,0.3242,1.066062,0.519856
9,0.2933,1.070396,0.498195
10,0.2649,1.090363,0.501805


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_accuracy_score
0,0.7498,19.173941,7.210256e-05,1.0,0.691836,0.559567
1,0.6817,9.617178,6.410256e-05,2.0,0.715767,0.527076
2,0.643,27.365223,5.610256e-05,3.0,0.693733,0.545126
3,0.5922,19.501883,4.810256e-05,4.0,0.727088,0.534296
4,0.5132,14.722438,4.010256e-05,5.0,0.805836,0.519856
5,0.446,19.065058,3.210256e-05,6.0,0.892868,0.509025
6,0.3744,45.435341,2.410256e-05,7.0,1.00151,0.494585
7,0.3242,42.566277,1.610256e-05,8.0,1.066062,0.519856
8,0.2933,14.105376,8.102564e-06,9.0,1.070396,0.498195
9,0.2649,18.517746,1.025641e-07,10.0,1.090363,0.501805


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_rte_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


In [22]:
checkpoint = "output_model/modernbert-kan-1.5b"  # "answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"
learning_rates = [5e-5, 8e-5]
weight_decays = [1e-6, 5e-6, 8e-6, 1e-5]

hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
for task in glue_tasks.keys():
    for lr in learning_rates:
        for wd in weight_decays:
            print(f"----- Finetuning {task} | lr={lr} | wd={wd} -----")
            train_res_df, args_df, hf_model, hf_tokenizer = finetune_glue_task(
                lr, wd, task, checkpoint=checkpoint, train_subset=None, do_cleanup=True
            )

            print(":: Results ::")
            display(train_res_df)
            display(args_df)

----- Finetuning stsb | lr=5e-05 | wd=1e-06 -----


Map: 100%|██████████| 1500/1500 [00:00<00:00, 20970.05 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model co

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.5194,2.66522,0.119624,0.133326,0.571333
2,2.0372,2.620943,0.121742,0.119475,0.561333
3,1.8966,2.611137,0.077178,0.080723,0.552667
4,1.8026,2.562148,0.144315,0.137203,0.578
5,1.7013,2.536248,0.154153,0.144463,0.580667
6,1.5869,2.630199,0.130979,0.13234,0.581333
7,1.5137,2.650815,0.134476,0.132864,0.578667
8,1.4483,2.705155,0.152244,0.151331,0.58
9,1.3972,2.611263,0.147366,0.146839,0.594667
10,1.3587,2.650681,0.14567,0.145455,0.588


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.5194,132.235825,4.502778e-05,1.0,2.66522,0.119624,0.133326,0.571333
1,2.0372,110.535271,4.002778e-05,2.0,2.620943,0.121742,0.119475,0.561333
2,1.8966,91.087616,3.502778e-05,3.0,2.611137,0.077178,0.080723,0.552667
3,1.8026,72.223007,3.002778e-05,4.0,2.562148,0.144315,0.137203,0.578
4,1.7013,55.210609,2.502778e-05,5.0,2.536248,0.154153,0.144463,0.580667
5,1.5869,86.483932,2.002778e-05,6.0,2.630199,0.130979,0.13234,0.581333
6,1.5137,55.07832,1.502778e-05,7.0,2.650815,0.134476,0.132864,0.578667
7,1.4483,68.377365,1.002778e-05,8.0,2.705155,0.152244,0.151331,0.58
8,1.3972,74.766937,5.027778e-06,9.0,2.611263,0.147366,0.146839,0.594667
9,1.3587,55.79776,2.777778e-08,10.0,2.650681,0.14567,0.145455,0.588


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb | lr=5e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.5199,2.664297,0.119637,0.133457,0.572
2,2.0373,2.621564,0.121521,0.119443,0.562667
3,1.8966,2.609884,0.077236,0.080826,0.550667
4,1.8024,2.565258,0.143933,0.136719,0.573333
5,1.7016,2.539867,0.153564,0.143782,0.581333
6,1.5866,2.63079,0.1312,0.132713,0.582
7,1.5137,2.651563,0.13413,0.132416,0.578667
8,1.4484,2.704077,0.152551,0.151644,0.579333
9,1.3972,2.612683,0.146924,0.146365,0.596
10,1.3592,2.652444,0.14544,0.145351,0.587333


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.5199,132.231476,4.502778e-05,1.0,2.664297,0.119637,0.133457,0.572
1,2.0373,110.738533,4.002778e-05,2.0,2.621564,0.121521,0.119443,0.562667
2,1.8966,91.124069,3.502778e-05,3.0,2.609884,0.077236,0.080826,0.550667
3,1.8024,72.141853,3.002778e-05,4.0,2.565258,0.143933,0.136719,0.573333
4,1.7016,55.173386,2.502778e-05,5.0,2.539867,0.153564,0.143782,0.581333
5,1.5866,85.76403,2.002778e-05,6.0,2.63079,0.1312,0.132713,0.582
6,1.5137,54.876011,1.502778e-05,7.0,2.651563,0.13413,0.132416,0.578667
7,1.4484,68.229424,1.002778e-05,8.0,2.704077,0.152551,0.151644,0.579333
8,1.3972,74.803551,5.027778e-06,9.0,2.612683,0.146924,0.146365,0.596
9,1.3592,55.96146,2.777778e-08,10.0,2.652444,0.14544,0.145351,0.587333


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb | lr=5e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.5193,2.66629,0.119741,0.133308,0.572
2,2.0371,2.618818,0.121986,0.119674,0.564
3,1.8963,2.609417,0.077275,0.080908,0.550667
4,1.8025,2.562283,0.143942,0.136862,0.574667
5,1.7015,2.540158,0.15368,0.143638,0.579333
6,1.5868,2.630633,0.130807,0.132446,0.578667
7,1.5137,2.652283,0.13402,0.132265,0.578667
8,1.448,2.706071,0.152453,0.151582,0.579333
9,1.3967,2.611745,0.14696,0.146378,0.595333
10,1.3591,2.652384,0.145377,0.144998,0.588


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.5193,132.365219,4.502778e-05,1.0,2.66629,0.119741,0.133308,0.572
1,2.0371,110.191948,4.002778e-05,2.0,2.618818,0.121986,0.119674,0.564
2,1.8963,91.049316,3.502778e-05,3.0,2.609417,0.077275,0.080908,0.550667
3,1.8025,72.072731,3.002778e-05,4.0,2.562283,0.143942,0.136862,0.574667
4,1.7015,54.782871,2.502778e-05,5.0,2.540158,0.15368,0.143638,0.579333
5,1.5868,86.314751,2.002778e-05,6.0,2.630633,0.130807,0.132446,0.578667
6,1.5137,54.980927,1.502778e-05,7.0,2.652283,0.13402,0.132265,0.578667
7,1.448,68.507263,1.002778e-05,8.0,2.706071,0.152453,0.151582,0.579333
8,1.3967,74.718155,5.027778e-06,9.0,2.611745,0.14696,0.146378,0.595333
9,1.3591,55.867165,2.777778e-08,10.0,2.652384,0.145377,0.144998,0.588


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb | lr=5e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.5198,2.666361,0.119926,0.134007,0.572
2,2.0372,2.621025,0.121753,0.119745,0.562667
3,1.8961,2.609121,0.077936,0.081606,0.552667
4,1.8027,2.564941,0.144225,0.137047,0.575333
5,1.7015,2.538934,0.153877,0.14393,0.582
6,1.5865,2.627382,0.131419,0.132987,0.581333
7,1.5139,2.651285,0.134154,0.132294,0.579333
8,1.4481,2.704526,0.152581,0.151646,0.578667
9,1.3975,2.610531,0.147115,0.146426,0.594667
10,1.3589,2.652246,0.145513,0.145305,0.587333


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.5198,132.307938,4.502778e-05,1.0,2.666361,0.119926,0.134007,0.572
1,2.0372,110.468018,4.002778e-05,2.0,2.621025,0.121753,0.119745,0.562667
2,1.8961,91.267456,3.502778e-05,3.0,2.609121,0.077936,0.081606,0.552667
3,1.8027,72.085251,3.002778e-05,4.0,2.564941,0.144225,0.137047,0.575333
4,1.7015,55.432133,2.502778e-05,5.0,2.538934,0.153877,0.14393,0.582
5,1.5865,86.575432,2.002778e-05,6.0,2.627382,0.131419,0.132987,0.581333
6,1.5139,54.981133,1.502778e-05,7.0,2.651285,0.134154,0.132294,0.579333
7,1.4481,68.408943,1.002778e-05,8.0,2.704526,0.152581,0.151646,0.578667
8,1.3975,74.79673,5.027778e-06,9.0,2.610531,0.147115,0.146426,0.594667
9,1.3589,56.004341,2.777778e-08,10.0,2.652246,0.145513,0.145305,0.587333


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb | lr=8e-05 | wd=1e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.4442,2.597105,0.131317,0.134895,0.580667
2,1.9676,2.598578,0.138797,0.129879,0.552667
3,1.8065,2.820398,0.094564,0.101574,0.540667
4,1.6742,2.430531,0.170869,0.164241,0.603333
5,1.5313,2.741854,0.167054,0.162614,0.559333
6,1.3738,2.572315,0.161559,0.166701,0.596667
7,1.2648,2.842278,0.156287,0.158705,0.570667
8,1.181,2.837043,0.168898,0.175631,0.586
9,1.1169,2.71425,0.168111,0.173921,0.594
10,1.0738,2.737685,0.164007,0.171227,0.594667


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.4442,106.570923,7.204444e-05,1.0,2.597105,0.131317,0.134895,0.580667
1,1.9676,84.577385,6.404444e-05,2.0,2.598578,0.138797,0.129879,0.552667
2,1.8065,54.393162,5.604444e-05,3.0,2.820398,0.094564,0.101574,0.540667
3,1.6742,35.490131,4.804444e-05,4.0,2.430531,0.170869,0.164241,0.603333
4,1.5313,45.608967,4.004444e-05,5.0,2.741854,0.167054,0.162614,0.559333
5,1.3738,75.718262,3.204444e-05,6.0,2.572315,0.161559,0.166701,0.596667
6,1.2648,43.930325,2.404444e-05,7.0,2.842278,0.156287,0.158705,0.570667
7,1.181,49.65617,1.604444e-05,8.0,2.837043,0.168898,0.175631,0.586
8,1.1169,44.615856,8.044444e-06,9.0,2.71425,0.168111,0.173921,0.594
9,1.0738,42.222725,4.444444e-08,10.0,2.737685,0.164007,0.171227,0.594667


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb | lr=8e-05 | wd=5e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.4445,2.598882,0.130944,0.134324,0.580667
2,1.9675,2.595374,0.138398,0.129578,0.552667
3,1.8058,2.814128,0.094801,0.101845,0.541333
4,1.6741,2.428748,0.171056,0.164473,0.604
5,1.5308,2.741552,0.167524,0.163298,0.56
6,1.3735,2.572128,0.161815,0.167013,0.597333
7,1.2645,2.83995,0.156272,0.158719,0.570667
8,1.1808,2.833787,0.169453,0.176348,0.587333
9,1.1163,2.715747,0.167868,0.173626,0.593333
10,1.0737,2.737839,0.163962,0.17114,0.594667


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.4445,106.085449,7.204444e-05,1.0,2.598882,0.130944,0.134324,0.580667
1,1.9675,84.318825,6.404444e-05,2.0,2.595374,0.138398,0.129578,0.552667
2,1.8058,54.334564,5.604444e-05,3.0,2.814128,0.094801,0.101845,0.541333
3,1.6741,35.611324,4.804444e-05,4.0,2.428748,0.171056,0.164473,0.604
4,1.5308,45.21714,4.004444e-05,5.0,2.741552,0.167524,0.163298,0.56
5,1.3735,75.574455,3.204444e-05,6.0,2.572128,0.161815,0.167013,0.597333
6,1.2645,43.954777,2.404444e-05,7.0,2.83995,0.156272,0.158719,0.570667
7,1.1808,49.637066,1.604444e-05,8.0,2.833787,0.169453,0.176348,0.587333
8,1.1163,44.456734,8.044444e-06,9.0,2.715747,0.167868,0.173626,0.593333
9,1.0737,42.139095,4.444444e-08,10.0,2.737839,0.163962,0.17114,0.594667


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb | lr=8e-05 | wd=8e-06 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.4441,2.596758,0.13099,0.134673,0.580667
2,1.9678,2.598824,0.139148,0.130411,0.553333
3,1.8059,2.812067,0.094655,0.10164,0.544667
4,1.6745,2.431303,0.170731,0.163961,0.603333
5,1.5313,2.739719,0.167568,0.163246,0.560667
6,1.3738,2.57202,0.161474,0.166764,0.597333
7,1.2644,2.843688,0.156318,0.158832,0.57
8,1.181,2.833473,0.16935,0.17613,0.587333
9,1.1164,2.71441,0.168098,0.173883,0.594667
10,1.0735,2.738597,0.163864,0.171061,0.595333


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.4441,106.188232,7.204444e-05,1.0,2.596758,0.13099,0.134673,0.580667
1,1.9678,84.680916,6.404444e-05,2.0,2.598824,0.139148,0.130411,0.553333
2,1.8059,54.272522,5.604444e-05,3.0,2.812067,0.094655,0.10164,0.544667
3,1.6745,36.110775,4.804444e-05,4.0,2.431303,0.170731,0.163961,0.603333
4,1.5313,45.565807,4.004444e-05,5.0,2.739719,0.167568,0.163246,0.560667
5,1.3738,75.69326,3.204444e-05,6.0,2.57202,0.161474,0.166764,0.597333
6,1.2644,43.97332,2.404444e-05,7.0,2.843688,0.156318,0.158832,0.57
7,1.181,49.575882,1.604444e-05,8.0,2.833473,0.16935,0.17613,0.587333
8,1.1164,44.582687,8.044444e-06,9.0,2.71441,0.168098,0.173883,0.594667
9,1.0735,42.176983,4.444444e-08,10.0,2.738597,0.163864,0.171061,0.595333


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


----- Finetuning stsb | lr=8e-05 | wd=1e-05 -----


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-kan-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight', 'head.dense.weight', 'model.layers.0.attn.Wo.weight', 'model.layers.0.attn.Wqkv.weight', 'model.layers.0.mlp.Wi.weight', 'model.layers.0.mlp.Wo.weight', 'model.layers.1.attn.Wo.weight', 'model.layers.1.attn.Wqkv.weight', 'model.layers.1.mlp.Wi.weight', 'model.layers.1.mlp.Wo.weight', 'model.layers.2.attn.Wo.weight', 'model.layers.2.attn.Wqkv.weight', 'model.layers.2.mlp.Wi.weight', 'model.layers.2.mlp.Wo.weight', 'model.layers.3.attn.Wo.weight', 'model.layers.3.attn.Wqkv.weight', 'model.layers.3.mlp.Wi.weight', 'model.layers.3.mlp.Wo.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config 

Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,2.4442,2.596641,0.130662,0.134036,0.58
2,1.9677,2.599096,0.138391,0.129635,0.554
3,1.8059,2.81728,0.094159,0.101228,0.541333
4,1.6742,2.431752,0.170331,0.163641,0.604
5,1.5311,2.740855,0.167199,0.162911,0.559333
6,1.3739,2.573315,0.161535,0.166688,0.598667
7,1.2646,2.841943,0.156453,0.158819,0.57
8,1.1811,2.834309,0.169436,0.176334,0.586
9,1.1164,2.71679,0.167857,0.173849,0.594
10,1.0738,2.738267,0.164031,0.171321,0.595333


:: Results ::


Unnamed: 0,train_loss,train_grad_norm,train_learning_rate,train_epoch,eval_loss,eval_pearsonr,eval_spearmanr,eval_accuracy_score
0,2.4442,106.219749,7.204444e-05,1.0,2.596641,0.130662,0.134036,0.58
1,1.9677,84.785568,6.404444e-05,2.0,2.599096,0.138391,0.129635,0.554
2,1.8059,54.432022,5.604444e-05,3.0,2.81728,0.094159,0.101228,0.541333
3,1.6742,35.723022,4.804444e-05,4.0,2.431752,0.170331,0.163641,0.604
4,1.5311,45.546944,4.004444e-05,5.0,2.740855,0.167199,0.162911,0.559333
5,1.3739,75.478722,3.204444e-05,6.0,2.573315,0.161535,0.166688,0.598667
6,1.2646,43.953743,2.404444e-05,7.0,2.841943,0.156453,0.158819,0.57
7,1.1811,49.666836,1.604444e-05,8.0,2.834309,0.169436,0.176334,0.586
8,1.1164,44.424091,8.044444e-06,9.0,2.71679,0.167857,0.173849,0.594
9,1.0738,42.30283,4.444444e-08,10.0,2.738267,0.164031,0.171321,0.595333


Unnamed: 0,output_dir,overwrite_output_dir,do_train,do_eval,do_predict,eval_strategy,prediction_loss_only,per_device_train_batch_size,per_device_eval_batch_size,per_gpu_train_batch_size,...,include_tokens_per_second,include_num_input_tokens_seen,neftune_noise_alpha,optim_target_modules,batch_eval_metrics,eval_on_start,use_liger_kernel,liger_kernel_config,eval_use_gather_object,average_tokens_across_devices
0,aai_ModernBERT_stsb_ft,False,False,True,False,epoch,False,32,32,,...,False,False,,,False,False,False,,False,False


# K-Folds

In [30]:
from sklearn.model_selection import StratifiedKFold, KFold
from transformers import set_seed
import tempfile
from datasets import concatenate_datasets
import os

def finetune_glue_task_kfold(
    lr, wd, task: str, 
    checkpoint: str = "answerdotai/ModernBERT-base", 
    train_subset: int | None = None, 
    do_cleanup: bool = True,
    k_folds: int = 5,
    random_seed: int = 42,
    train_bsz: int = 32,  # ✅ Add as parameter
    val_bsz: int = 32,    # ✅ Add as parameter  
    n_epochs: int = 2     # ✅ Add as parameter
):
    """Finetune GLUE task with k-fold cross-validation"""
    set_seed(random_seed)
    
    # ✅ Check if checkpoint path exists
    if not os.path.exists(checkpoint):
        print(f"Error: Checkpoint path '{checkpoint}' does not exist!")
        print("Please check the path or use a valid HuggingFace model identifier.")
        return None, None, None, None
    
    # 1. Load the task metadata
    task_meta = glue_tasks[task]
    train_ds_name = task_meta["dataset_names"]["train"]
    valid_ds_name = task_meta["dataset_names"]["valid"]
    task_inputs = task_meta["inputs"]
    n_labels = task_meta["n_labels"]
    task_metrics = task_meta["metric_funcs"]

    # 2. Load and combine datasets
    raw_datasets = load_dataset("glue", task.split("-")[0] if "-" in task else task)
    
    # ✅ Combine train + validation datasets
    combined_dataset = concatenate_datasets([
        raw_datasets[train_ds_name], 
        raw_datasets[valid_ds_name]
    ])
    
    print(f"Combined dataset size: {len(combined_dataset)}")
    print(f"Original train: {len(raw_datasets[train_ds_name])}, val: {len(raw_datasets[valid_ds_name])}")
    
    if train_subset is not None and len(combined_dataset) > train_subset:
        combined_dataset = combined_dataset.shuffle(seed=random_seed).select(range(train_subset))

    # Get label maps from original training set
    id2label, label2id = get_label_maps(raw_datasets, train_ds_name)
    
    # ✅ Create k-fold splits WITHOUT pandas - use dataset.select() directly
    if task == "stsb":
        # Regression - use regular KFold
        kfold = KFold(n_splits=k_folds, shuffle=True, random_state=random_seed)
        indices = list(range(len(combined_dataset)))
        splits = list(kfold.split(indices))
    else:
        # Classification - use StratifiedKFold
        labels = combined_dataset['label']
        kfold = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=random_seed)
        splits = list(kfold.split(range(len(combined_dataset)), labels))
    
    fold_results = []
    all_train_dfs = []
    all_args_dfs = []
    
    print(f"Starting {k_folds}-fold CV for {task}")
    
    # Clear original dataset references
    del raw_datasets
    gc.collect()
    
    # Process each fold
    for fold, (train_idx, val_idx) in enumerate(splits):
        print(f"\n--- Fold {fold + 1}/{k_folds} ---")
        print(f"Train size: {len(train_idx)}, Val size: {len(val_idx)}")
        
        try:
            # ✅ Use efficient dataset.select() - no pandas conversion needed!
            fold_train_dataset = combined_dataset.select(train_idx)
            fold_val_dataset = combined_dataset.select(val_idx)
            
            # Run single fold training
            train_res_df, args_df, model, tokenizer = _single_fold_training(
                lr, wd, task, checkpoint, fold_train_dataset, fold_val_dataset,
                task_inputs, n_labels, task_metrics, id2label, label2id, 
                do_cleanup, random_seed, fold + 1, train_bsz, val_bsz, n_epochs  # ✅ Pass parameters
            )
            
            # Store results
            train_res_df['fold'] = fold + 1
            args_df['fold'] = fold + 1
            
            all_train_dfs.append(train_res_df)
            all_args_dfs.append(args_df)
            
            # Extract final metrics for summary
            if len(train_res_df) > 0:
                final_metrics = train_res_df.iloc[-1].to_dict()
                final_metrics['fold'] = fold + 1
                fold_results.append(final_metrics)
            
            # Cleanup fold-specific data
            del fold_train_dataset, fold_val_dataset
            del model, tokenizer
            gc.collect()
            torch.cuda.empty_cache()
            
        except Exception as e:
            print(f"Error in fold {fold + 1}: {e}")
            import traceback
            traceback.print_exc()
    
    # Cleanup combined dataset
    del combined_dataset
    gc.collect()
    
    # Aggregate results
    if fold_results:
        results_df = pd.DataFrame(fold_results)
        combined_train_df = pd.concat(all_train_dfs, ignore_index=True)
        combined_args_df = pd.concat(all_args_dfs, ignore_index=True)
        
        # Calculate summary statistics
        summary_stats = {}
        numeric_cols = results_df.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if col != 'fold':
                summary_stats[f'{col}_mean'] = results_df[col].mean()
                summary_stats[f'{col}_std'] = results_df[col].std()
        
        summary_df = pd.DataFrame([summary_stats])
        
        print(f"\n=== {task} K-Fold Results Summary ===")
        print("Individual Fold Results:")
        display(results_df)
        print("\nMean ± Std:")
        display(summary_df)
        
        return combined_train_df, combined_args_df, results_df, summary_df
    else:
        print("No successful folds completed!")
        return None, None, None, None


def _single_fold_training(
    lr, wd, task, checkpoint, train_dataset, val_dataset,
    task_inputs, n_labels, task_metrics, id2label, label2id, 
    do_cleanup, random_seed, fold_num, train_bsz, val_bsz, n_epochs  # ✅ Add missing parameters
):
    """Helper function for single fold training"""
    
    # ✅ Check checkpoint path again before loading
    if not os.path.exists(checkpoint):
        raise ValueError(f"Checkpoint path '{checkpoint}' does not exist!")
    
    # Load tokenizer
    hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
    # Create fresh preprocessing function
    def fold_preprocess_function(examples):
        inps = [examples[inp] for inp in task_inputs]
        tokenized = hf_tokenizer(*inps, truncation=True)
        return tokenized
    
    # Tokenize datasets
    fold_train_tokenized = train_dataset.map(
        fold_preprocess_function, 
        batched=True,
        remove_columns=[col for col in train_dataset.column_names if col != 'label'],
        desc="Tokenizing train"
    )
    fold_val_tokenized = val_dataset.map(
        fold_preprocess_function, 
        batched=True,
        remove_columns=[col for col in val_dataset.column_names if col != 'label'],
        desc="Tokenizing val"
    )

    # Load model
    model_additional_kwargs = {"id2label": id2label, "label2id": label2id} if id2label and label2id else {}
    hf_model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, 
        num_labels=n_labels, 
        **model_additional_kwargs,
        dtype=torch.bfloat16,  # ✅ Use 'dtype' instead of 'torch_dtype'
        local_files_only=True  # ✅ Force local loading
    )

    hf_data_collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

    # ✅ Define betas and eps locally
    betas = (0.9, 0.98)
    eps = 1e-6

    # Training arguments
    with tempfile.TemporaryDirectory() as temp_dir:
        training_args = TrainingArguments(
            output_dir=temp_dir,
            learning_rate=lr,
            per_device_train_batch_size=train_bsz,  # ✅ Now available as parameter
            per_device_eval_batch_size=val_bsz,     # ✅ Now available as parameter
            num_train_epochs=n_epochs,              # ✅ Now available as parameter
            lr_scheduler_type="linear",
            optim="adamw_torch",
            adam_beta1=betas[0],
            adam_beta2=betas[1],
            adam_epsilon=eps,
            weight_decay=wd,
            logging_strategy="no",
            eval_strategy="epoch",
            save_strategy="no",
            load_best_model_at_end=False,
            bf16=True,
            bf16_full_eval=True,
            push_to_hub=False,
            report_to=None,
            seed=random_seed,
            dataloader_pin_memory=False,
            dataloader_num_workers=0,
        )

        trainer = Trainer(
            model=hf_model,
            args=training_args,
            train_dataset=fold_train_tokenized,
            eval_dataset=fold_val_tokenized,
            processing_class=hf_tokenizer,
            data_collator=hf_data_collator,
            compute_metrics=partial(compute_metrics, task_metrics=task_metrics),
        )

        # Add callback
        metrics_callback = MetricsCallback()
        trainer.add_callback(metrics_callback)

        trainer.train()

        # Get results
        train_history_df = pd.DataFrame(metrics_callback.training_history["train"])
        train_history_df = train_history_df.add_prefix("train_")
        eval_history_df = pd.DataFrame(metrics_callback.training_history["eval"])
        train_res_df = pd.concat([train_history_df, eval_history_df], axis=1)

        args_df = pd.DataFrame([training_args.to_dict()])

        # Cleanup
        if do_cleanup:
            if hasattr(hf_model, 'cpu'):
                hf_model.cpu()
            cleanup_items = [trainer, hf_tokenizer, fold_train_tokenized, fold_val_tokenized]
            for item in cleanup_items:
                try:
                    del item
                except:
                    pass
            gc.collect()
            torch.cuda.empty_cache()

        return train_res_df, args_df, hf_model, hf_tokenizer


# ✅ Run with correct parameter passing
checkpoint = "output_model/modernbert-diffusion-1.5b"

# Check if the path exists
if os.path.exists(checkpoint):
    print(f"✅ Checkpoint found at: {checkpoint}")
    
    # Run k-fold cross-validation with explicit parameters
    for task in glue_tasks.keys():
        lr = glue_tasks[task]['learning_rate']
        wd = glue_tasks[task]['weight_decay']
        epochs = glue_tasks[task]['epochs']
        print(f"----- Finetuning {task} | lr={lr} | wd={wd} | epochs={epochs} -----")
        train_res_df, args_df, fold_results, summary_stats = finetune_glue_task_kfold(
            lr=lr,
            wd=wd,
            task=task,
            checkpoint=checkpoint,
            k_folds=2,
            train_subset=None,  # Use all data
            train_bsz=32,       # ✅ Explicit batch size
            val_bsz=32,         # ✅ Explicit batch size
            n_epochs=epochs
        )
else:
    print(f"❌ Checkpoint NOT found at: {checkpoint}")
    print("Available options:")
    print("1. Use a HuggingFace model: checkpoint = 'answerdotai/ModernBERT-base'")
    print("2. Fix the local path")
    print("3. Check if the model was saved correctly")
    
    # Let's check if there are any models in the output_model directory
    output_dir = "output_model"
    if os.path.exists(output_dir):
        print(f"\nContents of {output_dir}:")
        for item in os.listdir(output_dir):
            item_path = os.path.join(output_dir, item)
            if os.path.isdir(item_path):
                print(f"  📁 {item}/")
                # Check if it contains model files
                sub_files = os.listdir(item_path)
                if any(f.endswith(('.bin', '.safetensors')) for f in sub_files):
                    print(f"    ✅ Contains model files: {[f for f in sub_files if f.endswith(('.bin', '.safetensors', '.json'))]}")
            else:
                print(f"  📄 {item}")

✅ Checkpoint found at: output_model/modernbert-diffusion-1.5b
----- Finetuning cola | lr=5e-05 | wd=8e-06 | epochs=2 -----
Combined dataset size: 9594
Original train: 8551, val: 1043
Starting 2-fold CV for cola


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.



--- Fold 1/2 ---
Train size: 4797, Val size: 4797


Epoch,Training Loss,Validation Loss,Matthews Corrcoef,Accuracy Score
1,No log,1.091711,0.020737,0.608297
2,No log,1.071679,0.024101,0.607255


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.



--- Fold 2/2 ---
Train size: 4797, Val size: 4797


Epoch,Training Loss,Validation Loss,Matthews Corrcoef,Accuracy Score
1,No log,0.987844,0.021081,0.609339
2,No log,0.977411,0.023343,0.608922



=== cola K-Fold Results Summary ===
Individual Fold Results:


Unnamed: 0,eval_loss,eval_matthews_corrcoef,eval_accuracy_score,fold
0,1.071679,0.024101,0.607255,1
1,0.977411,0.023343,0.608922,2



Mean ± Std:


Unnamed: 0,eval_loss_mean,eval_loss_std,eval_matthews_corrcoef_mean,eval_matthews_corrcoef_std,eval_accuracy_score_mean,eval_accuracy_score_std
0,1.024545,0.066657,0.023722,0.000536,0.608088,0.001179


----- Finetuning sst2 | lr=8e-05 | wd=1e-05 | epochs=2 -----
Combined dataset size: 68221
Original train: 67349, val: 872


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Starting 2-fold CV for sst2

--- Fold 1/2 ---
Train size: 34110, Val size: 34111


Epoch,Training Loss,Validation Loss,Accuracy Score
1,No log,0.570407,0.698719
2,No log,0.562826,0.705989


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.



--- Fold 2/2 ---
Train size: 34111, Val size: 34110


Epoch,Training Loss,Validation Loss,Accuracy Score
1,No log,0.565224,0.70686
2,No log,0.557856,0.712724



=== sst2 K-Fold Results Summary ===
Individual Fold Results:


Unnamed: 0,eval_loss,eval_accuracy_score,fold
0,0.562826,0.705989,1
1,0.557856,0.712724,2



Mean ± Std:


Unnamed: 0,eval_loss_mean,eval_loss_std,eval_accuracy_score_mean,eval_accuracy_score_std
0,0.560341,0.003514,0.709356,0.004762


----- Finetuning mrpc | lr=8e-05 | wd=1e-06 | epochs=2 -----
Combined dataset size: 4076
Original train: 3668, val: 408
Starting 2-fold CV for mrpc


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.



--- Fold 1/2 ---
Train size: 2038, Val size: 2038


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,No log,0.958694,0.630029,0.736364
2,No log,0.94719,0.63052,0.736437


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.



--- Fold 2/2 ---
Train size: 2038, Val size: 2038


Epoch,Training Loss,Validation Loss,Accuracy Score,F1 Score
1,No log,0.913576,0.603042,0.717034
2,No log,0.897718,0.603042,0.717627



=== mrpc K-Fold Results Summary ===
Individual Fold Results:


Unnamed: 0,eval_loss,eval_accuracy_score,eval_f1_score,fold
0,0.94719,0.63052,0.736437,1
1,0.897718,0.603042,0.717627,2



Mean ± Std:


Unnamed: 0,eval_loss_mean,eval_loss_std,eval_accuracy_score_mean,eval_accuracy_score_std,eval_f1_score_mean,eval_f1_score_std
0,0.922454,0.034982,0.616781,0.01943,0.727032,0.013301


----- Finetuning stsb | lr=8e-05 | wd=1e-06 | epochs=7 -----
Combined dataset size: 7249
Original train: 5749, val: 1500
Starting 2-fold CV for stsb


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at output_model/modernbert-diffusion-1.5b and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.



--- Fold 1/2 ---
Train size: 3624, Val size: 3625


Epoch,Training Loss,Validation Loss,Pearsonr,Spearmanr,Accuracy Score
1,No log,3.244274,0.144766,0.145722,0.554759
2,No log,2.753454,0.191066,0.192461,0.575448


KeyboardInterrupt: 

# KAN Classification

In [None]:
from transformers import PreTrainedModel, AutoModel, AutoModelForSequenceClassification
from transformers.models.auto import AutoConfig
import torch
import torch.nn as nn


# Add this before your model loading
def load_kan_model_for_classification(checkpoint, num_labels, **kwargs):
    try:
        # Try standard loading first
        return AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=num_labels, **kwargs
        )
    except Exception as e:
        print(f"Standard loading failed: {e}")
        print("Attempting to load as custom KAN model...")
        
        # Load base model and add classification head
        base_model = AutoModel.from_pretrained(checkpoint)
        
        # Create a simple wrapper
        class KANSequenceClassifier(torch.nn.Module):
            def __init__(self, base_model, num_labels):
                super().__init__()
                self.base_model = base_model
                self.classifier = torch.nn.Linear(base_model.config.hidden_size, num_labels)
                self.config = base_model.config
                self.config.num_labels = num_labels
                if 'id2label' in kwargs: self.config.id2label = kwargs['id2label']
                if 'label2id' in kwargs: self.config.label2id = kwargs['label2id']
            
            def forward(self, **inputs):
                outputs = self.base_model(**inputs)
                pooled_output = outputs.pooler_output
                logits = self.classifier(pooled_output)
                return type('obj', (object,), {'logits': logits, 'loss': None})()
        
        return KANSequenceClassifier(base_model, num_labels)

# Then replace your model loading with:
hf_model = load_kan_model_for_classification(
    checkpoint, num_labels=n_labels, **model_additional_kwargs
)