# Fine-Tuning ModernBERT for arXiv Paper Recommendation

Using a manually annotated dataset, we build a binary classifier that predicts whether or not a paper will be of interest.

In [23]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer, TrainingArguments
from datasets import Dataset


df = pd.read_csv("Dataset.csv")
df['submission_date'] = pd.to_datetime(df['submission_date'])
df.head()

| Unnamed: 0 | paper_id | title | abstract | categories | submission_date | interest |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2411.1862 | Cross-modal Information Flow in Multimodal Lar... | The recent advancements in auto-regressive m... | cs.AI, cs.CL, cs.CV | 2024-11-27 18:59:26+00:00 | 0 |
| 1 | 2411.18616 | Diffusion Self-Distillation for Zero-Shot Cust... | Text-to-image diffusion models produce impre... | cs.CV, cs.AI, cs.GR, cs.LG | 2024-11-27 18:58:52+00:00 | 0 |
| 2 | 2411.18615 | Proactive Gradient Conflict Mitigation in Mult... | Advancing towards generalist agents necessit... | cs.LG, cs.AI, cs.CV | 2024-11-27 18:58:22+00:00 | 0 |
| 3 | 2411.18612 | Robust Offline Reinforcement Learning with Lin... | The Distributionally Robust Markov Decision ... | cs.LG, cs.AI, cs.RO, stat.ML | 2024-11-27 18:57:03+00:00 | 0 |
| 4 | 2411.18583 | Automated Literature Review Using NLP Techniqu... | This research presents and compares multiple... | cs.CL, cs.AI, cs.IR, cs.LG | 2024-11-27 18:27:07+00:00 | 1 |


The classes are imbalanced:

In [24]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Proportion of positive class : {100 * df['interest'].mean():.2f}%")

Proportion of positive class : 7.03%


This means that we will have to take the imbalance into account during training.

## Preparation

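The abstract of the paper is the only feature used for classification. We sort the papers by submission date and keep the earliest 75% for training, so the validation set contains only papers more recent than those seen during training.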

In [25]:
df = df.sort_values('submission_date')
train_size = 0.75
max_train_index = int(train_size * len(df))

abstracts = df['abstract'].tolist()
labels = df['interest'].tolist()

train_texts, val_texts = abstracts[:max_train_index], abstracts[max_train_index:]
train_labels, val_labels = labels[:max_train_index], labels[max_train_index:]

train_dataset = Dataset.from_dict({'text': train_texts, 'label': train_labels})
validation_dataset = Dataset.from_dict({'text': val_texts, 'label': val_labels})

Let's check the imbalance in both datasets:

In [30]:
for (vector, name) in zip([train_labels, val_labels], ["Train", "Validation"]):
    ratio = 100 * np.array(vector).mean()
    print(f"{name}: {ratio:.2f}%")

Train: 7.34%
Validation: 6.11%


In this notebook we'll fine-tune the ModernBERT-base model. Change the `model_name` variable to try another model from the Hugging Face Hub.

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Before training, we need to complete three steps:
1. Tokenization: transform the text into an array of tokens, each representing a sequence of characters.
2. Define a custom trainer class that takes the class imbalance into account.
3. Define a function that measures the performance of the model.

In [5]:
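import torch

# Device used for the class-weight tensor defined further down.
device = 'mps' if torch.backends.mps.is_available() else 'cpu'

def tokenize_function(examples):
    # Pad or truncate every abstract to exactly 128 tokens. The result stays as
    # plain Python lists: the Trainer moves each batch to the device by itself.
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)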

train_dataset = train_dataset.map(tokenize_function, batched=True)
validation_dataset = validation_dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 2943/2943 [00:00<00:00, 7017.25 examples/s]
Map: 100%|██████████| 982/982 [00:00<00:00, 7357.25 examples/s]
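
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch in plain Python. It uses a whitespace split and a made-up vocabulary instead of the real ModernBERT tokenizer; `toy_encode`, `PAD_ID` and the vocabulary are hypothetical and only illustrate the fixed-length output and the attention mask.

```python
PAD_ID = 0  # hypothetical padding id, not ModernBERT's real one


def toy_encode(text, vocab, max_length):
    """Map words to integer ids, then truncate or pad to exactly max_length."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
    ids = ids[:max_length]             # truncation=True
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)
    ids += [PAD_ID] * padding          # padding='max_length'
    mask += [0] * padding              # 0 marks padding, ignored by attention
    return {"input_ids": ids, "attention_mask": mask}


vocab = {}
short = toy_encode("Bitcoin has increased", vocab, max_length=5)
long = toy_encode("a b c d e f g", vocab, max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the trainer stack examples into rectangular batches. With `max_length=128`, abstracts longer than 128 tokens lose their tail.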


The standard `Trainer` class from Hugging Face does not natively handle class imbalance. We therefore subclass it to use a class-weighted `CrossEntropyLoss`, and we make sure the weights live on the same device as the model.

In [6]:
# 'balanced' weights are inversely proportional to the class frequencies.
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)


class CustomTrainer(Trainer):
    """Trainer whose loss is a class-weighted cross-entropy."""

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get('logits')
        # Errors on the rare positive class are penalised more heavily.
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
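
As a sanity check on what the weighting does, the sketch below reproduces the two ingredients in plain Python: the `'balanced'` heuristic, which sets each class weight to `n_samples / (n_classes * class_count)`, and the weighted mean that `CrossEntropyLoss(weight=...)` computes with its default `reduction='mean'`. The helper names and the toy counts (93 negatives, 7 positives) are illustrative assumptions, not values from the dataset.

```python
import math


def balanced_weights(labels):
    """Weight of class c is n_samples / (n_classes * count_c)."""
    classes = sorted(set(labels))
    return {c: len(labels) / (len(classes) * labels.count(c)) for c in classes}


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-example losses, normalised by the sum of the weights."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        log_z = math.log(sum(math.exp(v) for v in row))  # log of the softmax denominator
        nll = log_z - row[y]                              # -log softmax(row)[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


toy_labels = [0] * 93 + [1] * 7  # toy imbalance, roughly 7% positives
weights = balanced_weights(toy_labels)
print(weights)

# One negative and one positive example; the model predicts class 0 for both,
# then class 1 for both, so exactly one of the two is wrong each time.
loss_pos_wrong = weighted_cross_entropy([[2.0, -2.0], [2.0, -2.0]], [0, 1], weights)
loss_neg_wrong = weighted_cross_entropy([[-2.0, 2.0], [-2.0, 2.0]], [0, 1], weights)
print(f"positive misclassified: {loss_pos_wrong:.3f}")
print(f"negative misclassified: {loss_neg_wrong:.3f}")
```

The same mistake costs far more when it is made on the rare class, which pushes the model away from the trivial "always predict 0" solution.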

Since the classes are imbalanced, accuracy alone is not a reliable measure of performance. We compute precision, recall and F1-score at the threshold that achieves the best F1-score, and we track accuracy, precision, recall, F1 and that threshold. Because the threshold is tuned on the validation set itself, the reported scores are somewhat optimistic.

In [7]:
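def compute_metrics(p):
    logits, y_true = p
    # Score for the positive class: sigmoid of the class-1 logit.
    y_proba = 1 / (1 + np.exp(-logits[:, 1]))

    best_threshold = 0
    best_f1 = 0
    probabilities = set(y_proba)

    # Try every predicted score as a threshold and keep the one with the best F1.
    for threshold in probabilities:
        y_pred = (y_proba >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    # Recompute all metrics at the selected threshold.
    y_pred = (y_proba >= best_threshold).astype(int)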
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
    answer = {
        'threshold': best_threshold,
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
    return answer
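
One caveat on the score used in `compute_metrics`: it applies a sigmoid to the class-1 logit alone. Since the model is trained with a two-class cross-entropy, its own estimate of the probability of the positive class is the softmax over both logits, which equals a sigmoid of the logit difference. The two scores can rank papers differently, so the best threshold and F1 may change. The plain-Python sketch below only illustrates the difference on made-up logits; it is a possible variant, not what the run reported in this notebook used.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax_positive(logit_0, logit_1):
    """P(class 1) under a two-class softmax, i.e. sigmoid(logit_1 - logit_0)."""
    return sigmoid(logit_1 - logit_0)


# Made-up (class 0, class 1) logits for two papers.
paper_a = (3.0, 1.0)    # class-1 logit is high, but class 0 is higher still
paper_b = (-2.0, 0.5)   # class-1 logit is lower, yet it clearly beats class 0

for name, (l0, l1) in [("a", paper_a), ("b", paper_b)]:
    print(name, f"sigmoid(l1)={sigmoid(l1):.3f}", f"softmax={softmax_positive(l0, l1):.3f}")
```

Here paper `a` scores higher than paper `b` under the sigmoid of the class-1 logit, while the softmax ranks them the other way round.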

With these three steps complete, we can start training.

## Training

We pass a [TrainingArguments](https://huggingface.co/docs/transformers/v4.48.2/en/main_classes/trainer#transformers.TrainingArguments) object to the `CustomTrainer` defined above. All logs and checkpoints are saved to the `./results` folder.

In [9]:
training_arguments = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.05,
    learning_rate=5e-6,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to=[],
    use_mps_device=True
)

trainer = CustomTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=compute_metrics
)


trainer.train()



| Epoch | Training Loss | Validation Loss | Threshold | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | No log | 0.51246 | 0.339706 | 0.89002 | 0.419355 | 0.333333 | 0.565217 |
| 2 | No log | 0.445514 | 0.348267 | 0.91446 | 0.493976 | 0.42268 | 0.594203 |
| 3 | No log | 0.462675 | 0.436915 | 0.919552 | 0.496815 | 0.443182 | 0.565217 |
| 4 | No log | 0.502517 | 0.310582 | 0.905295 | 0.486188 | 0.392857 | 0.637681 |
| 5 | No log | 0.718026 | 0.143553 | 0.907332 | 0.485876 | 0.398148 | 0.623188 |
| 6 | 0.369300 | 0.714898 | 0.176245 | 0.912424 | 0.494118 | 0.415842 | 0.608696 |
| 7 | 0.369300 | 0.712178 | 0.187776 | 0.915479 | 0.49697 | 0.427083 | 0.594203 |


TrainOutput(global_step=644, training_loss=0.32439060685057075, metrics={'train_runtime': 1773.8054, 'train_samples_per_second': 11.614, 'train_steps_per_second': 0.363, 'total_flos': 1754988096443904.0, 'train_loss': 0.32439060685057075, 'epoch': 7.0})
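
The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes the 2,943 training examples shown in the tokenization output, a single device, no gradient accumulation, and the `Trainer` defaults of a linear schedule and `logging_steps=500`, with the warmup step count rounded up. `lr_at` is a hypothetical helper that mirrors a linear warmup followed by a linear decay; it is not a function from the library.

```python
import math

n_train, batch_size, epochs = 2943, 32, 7
base_lr, warmup_ratio, logging_steps = 5e-6, 0.1, 500

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)


def lr_at(step):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warmup steps: {warmup_steps}, first training-loss log in epoch {first_logged_epoch}")
print(f"lr at start, peak, end: {lr_at(0)}, {lr_at(warmup_steps)}, {lr_at(total_steps)}")
```

Under these assumptions the total comes to 644 steps, which matches `global_step=644` in the output, and the first training-loss log at step 500 falls in epoch 6, which is consistent with the `No log` entries for epochs 1 to 5.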

With training finished, we can evaluate the best model on the validation set:

In [10]:
evaluation = trainer.evaluate()
print(evaluation)

{'eval_loss': 0.7121782302856445, 'eval_threshold': 0.18777596950531006, 'eval_accuracy': 0.9154786150712831, 'eval_f1': 0.49696969696969695, 'eval_precision': 0.4270833333333333, 'eval_recall': 0.5942028985507246, 'eval_runtime': 30.7812, 'eval_samples_per_second': 31.903, 'eval_steps_per_second': 2.014, 'epoch': 7.0}


## Errors on validation

We manually inspect the false positives on the validation set to decide whether each one is a prediction error or a labeling error.

In [12]:
best_threshold = evaluation["eval_threshold"]
y_pred = trainer.predict(validation_dataset)
# Same positive-class score as in compute_metrics.
y_proba = 1 / (1 + np.exp(-y_pred.predictions[:, 1]))

# Print the false positives: flagged as interesting by the model but labelled 0.
for (index, proba) in enumerate(y_proba):
    if proba >= best_threshold and val_labels[index] == 0:
        print(f"Proba={proba:.4} and label={val_labels[index]}")
        print(val_texts[index])
        print("-" * 25)

Proba=0.2457 and label=0
  The metaphor studies community has developed numerous valuable labelled
corpora in various languages over the years. Many of these resources are not
only unknown to the NLP community, but are also often not easily shared among
the researchers. Both in human sciences and in NLP, researchers could benefit
from a centralised database of labelled resources, easily accessible and
unified under an identical format. To facilitate this, we present
MetaphorShare, a website to integrate metaphor datasets making them open and
accessible. With this effort, our aim is to encourage researchers to share and
upload more datasets in any language in order to facilitate metaphor studies
and the development of future metaphor processing NLP systems. The website is
accessible at www.metaphorshare.com.

-------------------------
Proba=0.2969 and label=0
  Bitcoin has increased investment interests in people during the last decade.
We have seen an increase in the number of posts on

## To be continued

This is the end of the notebook. Feel free to adapt it to fine-tune a model on your own use case!