# Notebook: CamemBERT Finetuning on Natural Language Inference (NLI) 

- **Date**: January 2025  
- **Description**: This notebook focuses on fine-tuning a pre-trained **CamemBERT** model on the **Natural Language Inference (NLI)** task using the **XNLI** dataset. 
- **Data Source**: [Hugging Face Datasets - NLI](https://huggingface.co/datasets/xnli) (392,702 examples in French).  
- **Tools Used**: PyTorch, PyTorch Lightning, Transformers & Datasets libraries.  
- **GPU Used**: Quadro RTX 6000  

___

# Natural Language Inference (NLI)

### What’s the Deal?
NLI is all about figuring out the relationship between two sentences:  
- A **premise** (what you already know), and  
- A **hypothesis** (what you're trying to verify).

Think of it as your model playing detective: Does the hypothesis make sense, is it unrelated, or is it flat-out wrong?

### Why Should We Care?  
NLI is a big deal in NLP because it tests how well a model actually *understands* language, instead of just parroting back what it’s seen.  

In the real world, NLI powers applications like **chatbots** that can handle nuanced customer support queries and **search engines** that understand the intent behind your questions to provide smarter, context-aware results.  

Let’s keep it simple: The goal is to fine-tune the model and check how often it nails the correct verdict.

In [44]:
# Suppress unnecessary warnings and set verbosity for Transformers
import warnings
import transformers
transformers.logging.set_verbosity_error()
warnings.filterwarnings("ignore")

# PyTorch core libraries
import torch
from torch import nn, Tensor
from torch.nn.functional import softmax
from torch.utils.data import Dataset, DataLoader
from torchmetrics import Accuracy

# Transformers and Datasets
from transformers import CamembertModel, CamembertTokenizer, CamembertConfig
from datasets import load_dataset

# PyTorch Lightning and Metrics
import pytorch_lightning as pl
from torchmetrics import Accuracy
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

# Visualization and DataFrame utilities
import matplotlib.pyplot as plt
import pandas as pd


This section prepares the XNLI dataset for training. But before diving in, let’s explore the dataset: how to download it and take a look at some examples.

The XNLI (Cross-lingual Natural Language Inference) dataset is used for classification tasks involving logical relationships between pairs of sentences. It contains three main columns:

1. **Premises**:
   - The base sentence (premise) that serves as the starting point for inference.

2. **Hypotheses**:
   - The hypothesis sentence to be compared with the premise.

3. **Label**:
   - Indicates the logical relationship between the premise and the hypothesis. The possible labels are:
     - **0: entailment** (the premise implies the hypothesis).
     - **1: neutral** (the premise and hypothesis are unrelated or have no direct logical link).
     - **2: contradiction** (the premise contradicts the hypothesis).

### Example:

| Premises                       | Hypotheses                       | Label             |
|--------------------------------|----------------------------------|-------------------|
| "Cats sleep a lot."            | "Cats never sleep."              | 2 (contradiction) |
| "The sun is shining."          | "It’s a beautiful day outside."  | 0 (entailment)    |
| "A man is reading a book."     | "A woman is watching TV."        | 1 (neutral)       |

In [5]:
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

The following class provides a PyTorch-compatible dataset wrapper for the XNLI dataset. 
It handles loading, tokenizing, and preparing premise-hypothesis pairs with their corresponding labels for training, validation, or testing. 
The dataset is dynamically downloaded and cached if not already present.

In [6]:
class XNLIDataset(Dataset):
    def __init__(self, cache_directory, split="train", language="fr", tokenizer=tokenizer, max_length=64):
        """
        PyTorch-compatible dataset for the XNLI dataset.

        Args:
            split (str): Data split to load ("train", "test", or "validation").
            language (str): Target language for the dataset.
            cache_directory (str): Directory to cache the downloaded dataset.
            max_length (int): Maximum sequence length for padding/truncation.
        """
        super(XNLIDataset, self).__init__()
        self.split = split
        self.language = language
        self.cache_directory = cache_directory
        self.max_length = max_length

        # Load the data and the tokenizer
        self.data = load_dataset(
            "facebook/xnli",
            name=self.language,
            cache_dir=self.cache_directory
        )[self.split]  # Load the specified data split

        self.tokenizer = tokenizer  # CamembertTokenizer.from_pretrained("camembert-base")

    def __len__(self):
        """Returns the size of the dataset."""
        return len(self.data)

    def __getitem__(self, idx):
        """
        Retrieve a specific sample from the dataset.

        Args:
            idx (int): Index of the sample.

        Returns:
            dict: Contains `input_ids`, `attention_mask`, and `label`.
        """
        example = self.data[idx]
        inputs = self.tokenizer(
            example["premise"],
            example["hypothesis"],
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )

        # Add the labels
        inputs = {key: val.squeeze(0) for key, val in inputs.items()}  # Remove batch dimension
        inputs["label"] = torch.tensor(example["label"], dtype=torch.long)

        return inputs

### Understanding the Role of the Dataset Class:
Let’s explain how our **XNLIDataset** class works and how it processes the raw data. To do this, we’ll take a raw example from the dataset before it’s passed through the class.

In [47]:
dataset = load_dataset("facebook/xnli", name='fr', cache_dir="../../../data/xnli")

# Display some examples from the dataset
print(f"Premise : {dataset['validation'][2]['premise']}")
print(f"Hypothesis : {dataset['validation'][2]['hypothesis']}")
print(f"Label : {dataset['validation'][2]['label']} (entailment)")

Premise : Et il a dit, maman, je suis à la maison.
Hypothesis : Il a dit à sa mère qu'il était rentré.
Label : 0 (entailment)


This snippet displays the premise, hypothesis, and label before any preprocessing is applied.  
Here, the label `0` corresponds to **entailment**, meaning the hypothesis logically follows from the premise.  

However, the model cannot directly process raw text like the premise and hypothesis. These sentences must be transformed into numerical tensors that the model can understand.  

This is where the `XNLIDataset` class comes into play:  

- It **tokenizes** both the premise and hypothesis into a single input.  
- The two sentences are separated by a special token, `</s>`.  
- The `</s>` token acts as a delimiter, marking the end of the premise and the beginning of the hypothesis.  

This transformation ensures the model can effectively reason about the relationship between the sentences.

In [48]:
data_path = "../../../data/xnli"

xnli_train_dataset = XNLIDataset(split="train", language="fr", cache_directory=data_path, max_length=64)
xnli_val_dataset = XNLIDataset(split="validation", language="fr", cache_directory=data_path, max_length=64)

train_loader = DataLoader(xnli_train_dataset, batch_size=256, shuffle=True)
val_loader = DataLoader(xnli_val_dataset, batch_size=256, shuffle=False)


batch = next(iter(val_loader))
print(f"Batch shape: {batch['input_ids'].shape}")
print(f"Token IDs (example):\n{batch['input_ids'][2]} \n")
decoded_text = tokenizer.decode(batch['input_ids'][2])
print(f"Decoded text (example):\n{decoded_text}")

Batch shape: torch.Size([256, 64])
Token IDs (example):
tensor([    5,   139,    51,    33,   227,     7,  2699,     7,    50,   146,
           15,    13,   269,     9,     6,     6,    69,    33,   227,    15,
           77,   907,    46,    11,    62,   149, 10540,     9,     6,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1]) 

Decoded text (example):
<s> Et il a dit, maman, je suis à la maison.</s></s> Il a dit à sa mère qu'il était rentré.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


The above code showcases the preprocessing of the XNLI dataset after being tokenized and loaded into a DataLoader. Here's the breakdown:

1. **Shape of the Batch**:  
   `torch.Size([256, 64])` indicates that the batch contains 256 samples, each with a maximum sequence length of 64 tokens. This padding ensures that all sequences are of uniform length for batch processing.

2. **Tokenized Example**:  
   - The tensor output represents the tokenized version of a sentence pair (premise and hypothesis) using the CamemBERT tokenizer. Each number corresponds to a specific token ID in the tokenizer's vocabulary:
     ```
     5,   139,    51,    33,   227,     7,  2699,     7,    50,   146,
     15,    13,   269,     9,     6,     6,    69,    33,   227,    15,
     77,   907,    46,    11,    62,   149, 10540,     9,     6,     1,
      1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
      1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
      1,     1,     1,     1,     1,     1,     1,     1,     1,     1
     ```
   - Notice the padding tokens (`1`) at the end to fill sequences shorter than 128 tokens.

3. **Decoded Tokens**:  
   - Using `tokenizer.decode`, the token IDs are converted back into a human-readable format:  
     ```
     <s> Et il a dit, maman, je suis à la maison.</s></s> Il a dit à sa mère qu'il était rentré.</s><pad><pad><pad>...
     ```
   - The `<s>` and `</s>` tags indicate sentence boundaries, while `<pad>` represents the padding added for uniformity.

4. **Key Insight**:  
   This example demonstrates how the premise and hypothesis are combined into a single input separated by special tokens (`</s>`). This structure ensures the model processes the relationship between the two sentences effectively.

**Once we have these token IDs and the attention mask (to ignore the padding tokens), we can feed them into our model for training or inference.**

## 2. Prepare the model :

In [49]:
class CamemBERTBaseModel(nn.Module):
    def __init__(self, model_path: str, trainable: bool = False):
        """
        Initialize the base CamemBERT model.
        param model_path: Path to the pre-trained CamemBERT model.
        """
        super(CamemBERTBaseModel, self).__init__()
        self.base_model = CamembertModel.from_pretrained(model_path)
        self.tranaible = trainable
        self.config = CamembertConfig()
 
        if not trainable:
            for param in self.base_model.parameters():
                param.requires_grad = False
            self.base_model.eval()
        else :
            self.base_model.train()

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the base model.
        param input_ids: Tensor of token IDs.
        param attention_mask: Tensor of attention masks.
        return: Last hidden states from the base model.
        """
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.last_hidden_state

    def get_hidden_size(self) -> int:
        return self.config.hidden_size

In [50]:
class NLIFinetuningModel(nn.Module):
    def __init__(self, base_model: CamemBERTBaseModel, num_labels: int = 3):
        """
        Initialize the NLI fine-tuning model.
        :param base_model: Instance of the base CamemBERT model.
        :param num_labels: Number of labels for NLI.
        """
        super(NLIFinetuningModel, self).__init__()
        self.base_model = base_model 

        self.hidden_size = base_model.get_hidden_size()
        self.num_labels = num_labels

        self.nli_head = nn.Linear(self.hidden_size, num_labels)
        # self.nli_head = NLIHead(base_model.get_hidden_size(), num_labels)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor, labels: torch.Tensor = None):
        """
        Forward pass for NLI fine-tuning.
        :param input_ids: Tensor of token IDs.
        :param attention_mask: Tensor of attention masks.
        :param labels: Optional tensor of labels (batch_size).
        :return: Dictionary containing logits and optionally loss.
        """
        # Get last hidden states from the base model
        hidden_states = self.base_model(input_ids=input_ids, attention_mask=attention_mask) # (batch_size, seq_len, hidden_size) -> (batch_size, seq_len, hidden_size)

        # Extract the [CLS] token's representation
        cls_output = hidden_states[:, 0, :]  # Shape: (batch_size, hidden_size)

        # Pass through the NLI head
        logits = self.nli_head(cls_output)  # Shape: (batch_size, num_labels)

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)

        return {"logits": logits, "loss": loss}

### Model Design Logic:

#### 1. **Base Model: `CamemBERTBaseModel`**
The first class, `CamemBERTBaseModel`, handles the **base model**, which is essentially the output of the pretraining phase (like Masked Language Modeling - MLM). 

- It takes care of loading the pre-trained weights for CamemBERT, either from **Hugging Face** or a model we fine-tuned ourselves.
- By including the option to **freeze** the base model's parameters (`trainable=False`), we can easily switch between:
  - Using the pre-trained features directly.
  - Fine-tuning the base model for downstream tasks.
- The `get_hidden_size` method allows us to dynamically retrieve the size of the hidden states, making it flexible for different architectures.

#### 2. **Fine-tuning Model: `NLIFinetuningModel`**
The second class, `NLIFinetuningModel`, builds on the base model to create the full architecture needed for the **Natural Language Inference (NLI)** task.

- It adds a **classification head** (`nli_head`), which is a simple linear layer mapping the hidden size to the number of NLI labels (3 in this case: Entailment, Neutral, Contradiction).
- The forward pass processes the tokenized input using the base model, extracts the hidden state of the `[CLS]` token (first token), and passes it through the classification head.

#### 3. **Why This Design?**
We chose to split the logic into two classes for **modularity** and **flexibility**:
- It allows us to test **different base models** (e.g., the CamemBERT downloaded from Hugging Face versus one we pre-trained ourselves).
- It simplifies experimentation, making it easy to:
  - Swap out the base model.
  - Use the same fine-tuning logic for other pre-trained models.  

This structure ensures our code is clean, reusable, and ready for a variety of tasks or models.


## 3. Finetune the model for Natural Language Inference :

### Why PyTorch Lightning?

We decided to use PyTorch Lightning because it simplifies model training by abstracting repetitive tasks like logging, checkpointing, and optimizer configuration. This allows us to focus on the logic of our model and training steps without getting bogged down by boilerplate code.

In [51]:
class NLI(pl.LightningModule):
    def __init__(self, model, lr=5e-5):
        """
        NLI model for training with PyTorch Lightning.
        :param model: Instance of the fine-tuning model.
        :param lr: Learning rate.
        """
        super(NLI, self).__init__()
        self.model = model
        self.lr = lr

        # Accuracy metrics for training and validation
        self.train_accuracy = Accuracy(task="multiclass", num_classes=3)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=3)

        # Metrics tracked per step
        self.train_losses_step = []
        self.train_accuracies_step = []
        self.val_losses_step = []
        self.val_accuracies_step = []

        # Metrics tracked per epoch
        self.train_losses_epoch = []
        self.train_accuracies_epoch = []
        self.val_losses_epoch = []
        self.val_accuracies_epoch = []

    def forward(self, batch):
        """
        Forward pass for inference.
        :param batch: Input batch containing input IDs, attention masks, and labels.
        :return: Model logits.
        """
        input_ids, attention_mask, labels = batch
        outputs = self.model(input_ids, attention_mask, labels)
        return outputs["logits"]

    def training_step(self, batch, batch_index):
        """
        Performs a single training step.
        :param batch: Input batch containing input IDs, attention masks, and labels.
        :param batch_index: Index of the batch.
        :return: Training loss.
        """
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        outputs = self.model(input_ids, attention_mask, labels)
        loss = outputs["loss"]

        # Compute accuracy
        preds = torch.argmax(outputs["logits"], dim=1)
        acc = self.train_accuracy(preds, labels)

        # Store step metrics
        self.train_losses_step.append(loss.item())
        self.train_accuracies_step.append(acc.item())

        # Log metrics for progress bar
        self.log("train_loss", loss, prog_bar=True, on_step=True, on_epoch=False)
        self.log("train_acc", acc, prog_bar=True, on_step=True, on_epoch=False)

        return loss

    def on_train_epoch_end(self):
        """
        Computes and stores epoch-level training metrics at the end of each epoch.
        """
        avg_loss = torch.tensor(self.train_losses_step).mean().item()
        avg_acc = torch.tensor(self.train_accuracies_step).mean().item()

        self.train_losses_epoch.append(avg_loss)
        self.train_accuracies_epoch.append(avg_acc)

        # Display epoch results
        print(f"[Epoch {self.current_epoch}] Train Loss: {avg_loss:.4f}, Train Accuracy: {avg_acc:.4f}")

        # Clear step metrics to prepare for the next epoch
        self.train_losses_step.clear()
        self.train_accuracies_step.clear()

    def validation_step(self, batch, batch_index):
        """
        Performs a single validation step.
        :param batch: Input batch containing input IDs, attention masks, and labels.
        :param batch_index: Index of the batch.
        :return: Validation loss.
        """
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        outputs = self.model(input_ids, attention_mask, labels)
        loss = outputs["loss"]

        # Compute accuracy
        preds = torch.argmax(outputs["logits"], dim=1)
        acc = self.val_accuracy(preds, labels)

        # Store step metrics
        self.val_losses_step.append(loss.item())
        self.val_accuracies_step.append(acc.item())

        # Log metrics for progress bar
        self.log("val_loss", loss, prog_bar=True, on_step=False, on_epoch=True)
        self.log("val_acc", acc, prog_bar=True, on_step=False, on_epoch=True)

        return loss

    def on_validation_epoch_end(self):
        """
        Computes and stores epoch-level validation metrics at the end of each epoch.
        """
        avg_loss = torch.tensor(self.val_losses_step).mean().item()
        avg_acc = torch.tensor(self.val_accuracies_step).mean().item()

        self.val_losses_epoch.append(avg_loss)
        self.val_accuracies_epoch.append(avg_acc)

        # Display epoch results
        print(f"[Epoch {self.current_epoch}] Val Loss: {avg_loss:.4f}, Val Accuracy: {avg_acc:.4f}")

        # Clear step metrics to prepare for the next epoch
        self.val_losses_step.clear()
        self.val_accuracies_step.clear()

    def configure_optimizers(self):
        """
        Configures the optimizer and learning rate scheduler.
        :return: Dictionary containing the optimizer and scheduler configurations.
        """
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)

        # Dynamically calculate the total number of steps
        steps_per_epoch = 1534
        total_steps = steps_per_epoch * self.trainer.max_epochs

        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=self.lr,
            total_steps=total_steps,
            pct_start=0.1,
            anneal_strategy="linear",
        )
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}

### Logic Behind the `NLI` Class

The `NLI` class is built to manage the training and validation process for our NLI model efficiently:

1. **Modularity**: 
   - The class integrates the fine-tuned model and handles both forward passes and the training logic.
   - Metrics (loss and accuracy) are tracked at both step and epoch levels, enabling detailed analysis of model performance.

2. **Training and Validation**: 
   - The `training_step` and `validation_step` methods define the behavior for each batch during training and validation, respectively. 
   - These steps include loss computation, accuracy calculation, and logging metrics for real-time progress updates.

3. **Epoch-Level Summaries**: 
   - At the end of each epoch, average metrics are computed (`on_train_epoch_end` and `on_validation_epoch_end`).
   - This ensures we monitor model performance holistically over entire epochs.

4. **Optimizer and Scheduler**: 
   - The `configure_optimizers` method sets up AdamW as the optimizer and a `OneCycleLR` learning rate scheduler, which dynamically adjusts the learning rate for smoother convergence.

In [53]:
steps_per_epoch = len(train_loader)

# Define the path to the pre-trained model
model_path = "../../../models/oscar_4gb"  # Path to the Hugging Face pre-trained model

# Step 1: Create the fine-tuning model
nli_camembert = NLIFinetuningModel(
    base_model=CamemBERTBaseModel(model_path, trainable=True),  # Use a trainable base model
    num_labels=3  # Classes: entailment, neutral, contradiction
)

# Step 2: Configure the PyTorch Lightning module for training
nb_epochs = 3
nb_steps_per_epoch = len(train_loader)
pl_camembert = NLI(
    model=nli_camembert
)

# Step 3: Set up model checkpointing
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",           # Metric to monitor (validation loss)
    dirpath="checkpoints/",       # Directory for saving model checkpoints
    filename="nli-{epoch:02d}-{val_loss:.2f}",  # Format for checkpoint filenames
    save_top_k=2,                 # Save only the top 2 models with the best validation loss
    mode="min"                    # Minimize the monitored metric
)

# Step 4: Set up TensorBoard logging
logger = TensorBoardLogger(
    save_dir="my_logs",           # Directory for saving logs
    name="final-experiment"       # Name for the experiment
)

# Step 5: Initialize the PyTorch Lightning Trainer
trainer = pl.Trainer(
    max_epochs=5,                 # Number of training epochs
    accelerator="gpu",            # Use GPU for training (if available)
    devices=1,                    # Number of GPUs to use
    callbacks=[checkpoint_callback],  # Add checkpointing callback
    logger=logger                 # Use the configured TensorBoard logger
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [10]:
xnli_train_dataset = XNLIDataset(split="train", language="fr", cache_directory=data_path, max_length=64)
xnli_val_dataset = XNLIDataset(split="validation", language="fr", cache_directory=data_path, max_length=64)

train_loader = DataLoader(xnli_train_dataset, batch_size=256, shuffle=True)
val_loader = DataLoader(xnli_val_dataset, batch_size=256, shuffle=False)

steps_per_epoch = len(train_loader)
steps_per_epoch

1534

In [14]:
trainer.fit(pl_camembert, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type               | Params | Mode 
--------------------------------------------------------------
0 | model          | NLIFinetuningModel | 110 M  | train
1 | train_accuracy | MulticlassAccuracy | 0      | train
2 | val_accuracy   | MulticlassAccuracy | 0      | train
--------------------------------------------------------------
110 M     Trainable params
0         Non-trainable params
110 M     Total params
442.497   Total estimated model params size (MB)
233       Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

[Epoch 0] Val Loss: 1.1032, Val Accuracy: 0.3340


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

[Epoch 0] Val Loss: 0.5163, Val Accuracy: 0.7956
[Epoch 0] Train Loss: 0.6583, Train Accuracy: 0.7151


Validation: |          | 0/? [00:00<?, ?it/s]

[Epoch 1] Val Loss: 0.4847, Val Accuracy: 0.8191
[Epoch 1] Train Loss: 0.4853, Train Accuracy: 0.8083


IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



Validation: |          | 0/? [00:00<?, ?it/s]

[Epoch 3] Val Loss: 0.5698, Val Accuracy: 0.8073
[Epoch 3] Train Loss: 0.3073, Train Accuracy: 0.8847


IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=5` reached.


[Epoch 4] Val Loss: 0.6112, Val Accuracy: 0.8077
[Epoch 4] Train Loss: 0.2436, Train Accuracy: 0.9118


In [22]:
import pandas as pd

def generate_epoch_table(model):
    """
    Generate and display a table of training and validation metrics by epoch.
    """
    epochs = range(1, len(model.train_losses_epoch) + 1)

    # Créer un DataFrame pour les résultats
    results_df = pd.DataFrame({
        "Epoch": epochs,
        "Train Loss": model.train_losses_epoch,
        "Val Loss": model.val_losses_epoch[1:],  # Ignorer la première validation
        "Train Accuracy": model.train_accuracies_epoch,
        "Val Accuracy": model.val_accuracies_epoch[1:]  # Ignorer la première validation
    })

    # Afficher la table
    print("Training Results by Epoch:")
    return results_df


table = generate_epoch_table(pl_camembert)
display(table)


Training Results by Epoch:


Unnamed: 0,Epoch,Train Loss,Val Loss,Train Accuracy,Val Accuracy
0,1,0.658302,0.516347,0.715135,0.795632
1,2,0.48529,0.484715,0.808292,0.819124
2,3,0.390026,0.530201,0.849947,0.811261
3,4,0.307304,0.569847,0.884712,0.807304
4,5,0.243646,0.611163,0.911783,0.807745


### Analysis of Training Results

Fine-tuning requires fewer epochs since the model starts with pre-trained weights, focusing on adapting to the specific task rather than learning from scratch. From the results, we observe that the second epoch achieves the best validation accuracy. Beyond this point, the model begins to overfit, as indicated by the increasing training accuracy and declining validation performance.

To ensure optimal generalization, we will use the model from the second epoch, which strikes the best balance between training and validation accuracy. Thanks to PyTorch Lightning, this model can be easily retrieved from the saved checkpoints.


## 3. Get Test Accuracy Score :

Using the same process as for training, we prepare the test dataset to evaluate the final accuracy. The test set consists of 5010 examples.


In [58]:
xnli_test_dataset = XNLIDataset(split="test", language="fr", cache_directory=data_path, max_length=64)
test_loader = DataLoader(xnli_train_dataset, batch_size=256, shuffle=False)
print(f"We have {len(xnli_test_dataset)} examples in the test dataset")

We have 5010 examples in the test dataset


Load the best model from checkpoints :

In [60]:
best_checkpoint = '../../../notebooks/trainings/nli_train/checkpoints/nli-epoch=01-val_loss=0.49.ckpt'

model = NLI.load_from_checkpoint(best_checkpoint, model=nli_camembert, lr=5e-5)
model.eval()
print("Checkpoint loaded successfully!")

Checkpoint loaded successfully!


In [61]:
# Initialize the accuracy metric
test_accuracy = Accuracy(task="multiclass", num_classes=3).to(device)

# Set the model to evaluation mode
model.eval()

# Disable gradient calculations for inference
with torch.no_grad():
    for batch in test_loader:
        # Move batch data to the same device as the model
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        # Get predictions from the model
        logits = model.model(input_ids=input_ids, attention_mask=attention_mask)["logits"]
        preds = torch.argmax(logits, dim=1)

        # Update the accuracy metric
        test_accuracy.update(preds, labels)

# Compute the final accuracy
final_accuracy = test_accuracy.compute().item()
print(f"Final Accuracy: {final_accuracy:.4f}")

Final Accuracy: 0.8568


The final accuracy obtained on the test dataset is **85.68%**, which is higher than the reported score of **82.06%** in the paper. This improvement may be attributed to differences in fine-tuning strategies, hyperparameter choices, or pre-processing steps.

### Final Results

The final accuracy obtained on the test dataset is **85.68%**, which is significantly higher than the reported score of **82.06%** in the paper. This discrepancy could be explained by differences in fine-tuning strategies. Specifically, the increase in batch size to 256 and the fine-tuning performed on the full dataset containing over 392,702 examples may have contributed to the improved performance.

## Additional Comments:

- We use accuracy as the evaluation metric because the classes in the NLI dataset are balanced, as stated in the dataset documentation on Hugging Face. If this were not the case, alternative metrics like the F1 score or a confusion matrix would be more appropriate.

