
In this notebook, we fine-tune our custom pretrained model on the Natural Language Inference (NLI) task: given a premise and a hypothesis, the model must classify their logical relationship as entailment, contradiction, or neutral. The goal is to evaluate how well the pretrained model adapts to this downstream task and to compare its performance with existing benchmarks and similar models. The results indicate how well the model captures semantic relationships between sentences.

In [1]:
# Suppress unnecessary warnings and set verbosity for Transformers
import warnings
import transformers
transformers.logging.set_verbosity_error()
warnings.filterwarnings("ignore")

# PyTorch core libraries
import torch
from torch import nn, Tensor
from torch.nn.functional import softmax
from torch.utils.data import Dataset, DataLoader

# Transformers and Datasets
# (only the tokenizer comes from Transformers; the config and model classes are defined in this notebook)
from transformers import CamembertTokenizer
from datasets import load_dataset

# PyTorch Lightning and Metrics
import pytorch_lightning as pl
from torchmetrics import Accuracy
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

# Visualization and DataFrame utilities
import matplotlib.pyplot as plt
import pandas as pd



## Data

In [3]:
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

In [12]:
class XNLIDataset(Dataset):
    def __init__(self, cache_directory, split="train", language="fr", tokenizer=tokenizer, max_length=64):
        """
        PyTorch-compatible dataset for the XNLI dataset.

        Args:
            cache_directory (str): Directory to cache the downloaded dataset.
            split (str): Data split to load ("train", "test", or "validation").
            language (str): Target language for the dataset.
            tokenizer: Tokenizer used to encode premise/hypothesis pairs
                (defaults to the module-level CamemBERT tokenizer).
            max_length (int): Maximum sequence length for padding/truncation.
        """
        super(XNLIDataset, self).__init__()
        self.split = split
        self.language = language
        self.cache_directory = cache_directory
        self.max_length = max_length

        # Load the requested split of the dataset
        self.data = load_dataset(
            "facebook/xnli",
            name=self.language,
            cache_dir=self.cache_directory
        )[self.split]

        # Keep a reference to the tokenizer passed in
        self.tokenizer = tokenizer

    def __len__(self):
        """Returns the size of the dataset."""
        return len(self.data)

    def __getitem__(self, idx):
        """
        Retrieve a specific sample from the dataset.

        Args:
            idx (int): Index of the sample.

        Returns:
            dict: Contains `input_ids`, `attention_mask`, and `label`.
        """
        example = self.data[idx]
        inputs = self.tokenizer(
            example["premise"],
            example["hypothesis"],
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )

        # Remove the batch dimension added by return_tensors="pt", then add the label
        inputs = {key: val.squeeze(0) for key, val in inputs.items()}
        inputs["label"] = torch.tensor(example["label"], dtype=torch.long)

        return inputs

In [13]:
dataset = load_dataset("facebook/xnli", name='fr', cache_dir="Noureddine/MLA-CamemBERT/data/XNLI")

# Display some examples from the dataset
print(f"Premise : {dataset['validation'][2]['premise']}")
print(f"Hypothesis : {dataset['validation'][2]['hypothesis']}")
print(f"Label : {dataset['validation'][2]['label']} (entailment)")

Premise : Et il a dit, maman, je suis à la maison.
Hypothesis : Il a dit à sa mère qu'il était rentré.
Label : 0 (entailment)
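
The integer labels can be mapped back to class names for readability. The following is a minimal sketch, assuming the usual XNLI order (0 = entailment, 1 = neutral, 2 = contradiction); `XNLI_LABELS` and `label_name` are illustrative names, and the order is worth confirming against `dataset['validation'].features['label'].names`.

```python
# Assumed XNLI label order; confirm it against the dataset's ClassLabel feature.
XNLI_LABELS = ("entailment", "neutral", "contradiction")

def label_name(label_id: int) -> str:
    """Map an integer XNLI label to its (assumed) class name."""
    if not 0 <= label_id < len(XNLI_LABELS):
        raise ValueError(f"unexpected label id: {label_id}")
    return XNLI_LABELS[label_id]

print(label_name(0))  # entailment
```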


In [14]:
data_path = "data/xnli"

xnli_train_dataset = XNLIDataset(split="train", language="fr", cache_directory=data_path, max_length=64)
xnli_val_dataset = XNLIDataset(split="validation", language="fr", cache_directory=data_path, max_length=64)

train_loader = DataLoader(xnli_train_dataset, batch_size=256, shuffle=True)
val_loader = DataLoader(xnli_val_dataset, batch_size=256, shuffle=False)


batch = next(iter(val_loader))
print(f"Batch shape: {batch['input_ids'].shape}")
print(f"Token IDs (example):\n{batch['input_ids'][2]} \n")
decoded_text = tokenizer.decode(batch['input_ids'][2])
print(f"Decoded text (example):\n{decoded_text}")

Batch shape: torch.Size([256, 64])
Token IDs (example):
tensor([    5,   139,    51,    33,   227,     7,  2699,     7,    50,   146,
           15,    13,   269,     9,     6,     6,    69,    33,   227,    15,
           77,   907,    46,    11,    62,   149, 10540,     9,     6,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1]) 

Decoded text (example):
<s> Et il a dit, maman, je suis à la maison.</s></s> Il a dit à sa mère qu'il était rentré.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
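
The decoded text shows the pair layout `<s> premise </s></s> hypothesis </s>` followed by padding up to `max_length`. The plain-Python sketch below reuses the ids printed above to recover the attention mask and the two segments; the special-token ids (5, 6, 1) are read off that output, and the helper names are illustrative only.

```python
# Special-token ids as they appear in the printed example above.
BOS_ID, PAD_ID, EOS_ID = 5, 1, 6

# The 29 non-padding ids printed above, followed by 35 <pad> ids (64 in total).
example_ids = [
    5, 139, 51, 33, 227, 7, 2699, 7, 50, 146,
    15, 13, 269, 9, 6, 6, 69, 33, 227, 15,
    77, 907, 46, 11, 62, 149, 10540, 9, 6,
] + [PAD_ID] * 35

def attention_mask_from_ids(ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding."""
    return [0 if token == pad_id else 1 for token in ids]

def split_pair(ids, pad_id=PAD_ID, eos_id=EOS_ID):
    """Split `<s> A </s></s> B </s>` into the ids of A and B."""
    real = [token for token in ids if token != pad_id]
    sep = real.index(eos_id)            # first </s> closes the premise
    return real[1:sep], real[sep + 2:-1]

mask = attention_mask_from_ids(example_ids)
premise_ids, hypothesis_ids = split_pair(example_ids)
print(len(example_ids), sum(mask))            # 64 29
print(len(premise_ids), len(hypothesis_ids))  # 13 12
```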


## Model

In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class CamembertConfig:
    def __init__(self):
        self.vocab_size = 32005
        self.hidden_size = 768
        self.num_hidden_layers = 12
        self.num_attention_heads = 12
        self.intermediate_size = 3072
        self.hidden_act = "gelu"
        self.hidden_dropout_prob = 0.1
        self.attention_probs_dropout_prob = 0.1
        self.max_position_embeddings = 514
        self.type_vocab_size = 1
        self.initializer_range = 0.02
        self.layer_norm_eps = 1e-5
        self.pad_token_id = 1
        self.head_type = "MLM"

class CamembertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None, position_ids=None):
        input_shape = input_ids.size()
        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=input_ids.device)

        inputs_embeds = self.word_embeddings(input_ids)
        position_embeds = self.position_embeddings(position_ids)
        token_type_embeds = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + position_embeds + token_type_embeds
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)

        # Debug prints
        # print(f"Embeddings NaN: {torch.isnan(embeddings).any()}")

        return embeddings

class CamembertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.hidden_size // config.num_attention_heads
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)
        self.dropout = nn.Dropout(0.2)  # Increased dropout rate

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        return x.view(new_x_shape).permute(0, 2, 1, 3)

    def forward(self, hidden_states, attention_mask=None):
        query_layer = self.transpose_for_scores(self.query(hidden_states))
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))

        # Debug query, key, value
        # print(f"Query NaN: {torch.isnan(query_layer).any()}")
        # print(f"Key NaN: {torch.isnan(key_layer).any()}")
        # print(f"Value NaN: {torch.isnan(value_layer).any()}")

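        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores /= math.sqrt(self.attention_head_size)

        # NOTE: attention_mask is accepted but not added to the scores in this
        # implementation, so padding positions still receive attention.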

        # Clamp scores to prevent overflow
        attention_scores = torch.clamp(attention_scores, min=-1e9, max=1e9)
        attention_probs = nn.functional.softmax(attention_scores, dim=-1) + 1e-9
        attention_probs = self.dropout(attention_probs)

        # Debug attention scores and probabilities
        # print(f"Attention Scores NaN Before Clamp: {torch.isnan(attention_scores).any()}")
        # print(f"Attention Probs NaN: {torch.isnan(attention_probs).any()}")

        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        context_layer = context_layer.view(context_layer.size(0), -1, self.all_head_size)

        # Debug context layer
        # print(f"Context Layer NaN: {torch.isnan(context_layer).any()}")

        return context_layer



class CamembertFeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.activation = F.gelu if config.hidden_act == "gelu" else nn.ReLU()
        self.dense_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(0.2)  # Increased dropout rate

    def forward(self, hidden_states):
        intermediate_output = self.activation(self.dense_1(hidden_states))
        intermediate_output = torch.clamp(intermediate_output, min=-1e9, max=1e9)

        output = self.dense_2(intermediate_output)
        output = self.dropout(output)
        output = self.LayerNorm(output + hidden_states)

        return output


class CamembertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = CamembertSelfAttention(config)
        self.attention_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.feed_forward = CamembertFeedForward(config)

    def forward(self, hidden_states, attention_mask=None):
        attention_output = self.attention(hidden_states, attention_mask)
        hidden_states = self.attention_norm(hidden_states + attention_output)
        return self.feed_forward(hidden_states)

class CamembertEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layers = nn.ModuleList([CamembertLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states, attention_mask=None):
        for i, layer in enumerate(self.layers):
            hidden_states = layer(hidden_states, attention_mask)

            # Debug prints for each layer
            # print(f"Layer {i} Hidden States NaN: {torch.isnan(hidden_states).any()}")

        return hidden_states

class CamembertLMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, hidden_states):
        hidden_states = F.gelu(self.dense(hidden_states))
        hidden_states = self.layer_norm(hidden_states)
        logits = self.decoder(hidden_states)

        return logits

class CamembertModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embeddings = CamembertEmbeddings(config)
        self.encoder = CamembertEncoder(config)
        self.head = CamembertLMHead(config) if config.head_type == "MLM" else None

    def forward(self, input_ids, attention_mask=None):
        embedded_input = self.embeddings(input_ids)

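        if attention_mask is not None:
            # Additive mask: 0 for real tokens, a large negative value for padding.
            # A finite fill value is used because 0 * -inf evaluates to NaN.
            attention_mask = (1.0 - attention_mask[:, None, None, :]) * torch.finfo(embedded_input.dtype).min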

        encoder_output = self.encoder(embedded_input, attention_mask)
        return self.head(encoder_output)
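
An additive attention mask is usually 0 where a token is kept and a large negative number where it is padded, and it is added to the attention scores before the softmax. The plain-Python sketch below (function names are hypothetical, and `-1e9` is an arbitrary finite stand-in for minus infinity) illustrates the idea and shows why multiplying a kept position by negative infinity is unsafe.

```python
import math

NEG_INF = float("-inf")
LARGE_NEG = -1e9  # finite stand-in for "minus infinity" (an arbitrary choice)

def additive_mask(attention_mask, fill=LARGE_NEG):
    """1 -> 0.0 (keep), 0 -> fill (ignore); no multiplication by infinity."""
    return [0.0 if keep else fill for keep in attention_mask]

def masked_softmax(scores, attention_mask):
    """Softmax over scores after adding the additive mask."""
    shifted = [s + m for s, m in zip(scores, additive_mask(attention_mask))]
    peak = max(shifted)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# For a kept position, (1 - 1) * -inf is NaN rather than 0.
print(math.isnan((1.0 - 1.0) * NEG_INF))  # True

probs = masked_softmax([2.0, 1.0, 0.5], [1, 1, 0])
print(probs[2])  # 0.0
```

In a batched PyTorch model the same mask would need a shape that broadcasts against scores of shape `(batch, heads, seq, seq)`, for example `(batch, 1, 1, seq)`.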

In [16]:
class Camembert_base(nn.Module) : 
    def __init__(self, embeddings, encoder):
        super(Camembert_base , self).__init__()
        self.embeddings = embeddings
        self.encoder = encoder
        
    def forward(self, input_ids, attention_mask=None):
        embedded_input = self.embeddings(input_ids)

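        if attention_mask is not None:
            # Additive mask: 0 for real tokens, a large negative value for padding.
            # A finite fill value is used because 0 * -inf evaluates to NaN.
            attention_mask = (1.0 - attention_mask[:, None, None, :]) * torch.finfo(embedded_input.dtype).min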

        encoder_output = self.encoder(embedded_input, attention_mask)
        return encoder_output


Here, we attach a classification head so that the model can perform three-class classification (entailment, neutral, contradiction).

In [17]:
import torch.nn as nn

class NLIHead(nn.Module):
    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super(NLIHead, self).__init__()
        self.fc1 = nn.Linear(hidden_size, 256)  # Fully connected layer
        self.activation = nn.ReLU()  # ReLU activation
        self.dropout = nn.Dropout(p=dropout_prob)  # Dropout for regularization
        self.fc2 = nn.Linear(256, num_labels)  # Final layer producing one logit per label

    def forward(self, x):
        x = self.fc1(x)  # First linear projection
        x = self.activation(x)  # Activation
        x = self.dropout(x)  # Apply dropout
        x = self.fc2(x)  # Final projection to the classes
        return x
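
As a rough size check, the head adds very few parameters compared with the encoder. The sketch below counts the weights and biases of the two linear layers, assuming `hidden_size = 768` as in the config above; `linear_params` is an illustrative helper.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Number of parameters in a linear layer with bias."""
    return in_features * out_features + out_features

hidden_size, inner_size, num_labels = 768, 256, 3

# fc1: hidden_size -> 256, fc2: 256 -> num_labels (ReLU and dropout have no parameters)
head_params = linear_params(hidden_size, inner_size) + linear_params(inner_size, num_labels)
print(head_params)  # 197635
```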

In [18]:
config = CamembertConfig()
loaded_model = CamembertModel(config)

model_path = "notebooks/trainings/models/Pretraining/model_checkpoints/checkpoint_epoch_9.pth"
checkpoint = torch.load(model_path)
# Extract only the model's state_dict
model_state_dict = checkpoint['model_state_dict']
loaded_model.load_state_dict(model_state_dict) 

<All keys matched successfully>

In [None]:
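num_labels = 3  # NLI classes: entailment, neutral, contradiction
hidden_size = config.hidden_size  # Output size of the encoder

camembert = Camembert_base(loaded_model.embeddings, loaded_model.encoder)
# Note: assigning the head as an attribute does not change forward();
# Camembert_base is redefined in a later cell so that the head is applied to the encoder output.
camembert.CamembertLMHead = NLIHead(hidden_size, num_labels)
camembert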

In [21]:

# Generating example data
batch_size = 8
seq_len = 128
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
attention_mask = torch.ones(batch_size, seq_len)

# Forward pass
logits = loaded_model(input_ids, attention_mask)
print("Logits shape:", logits.shape)

Logits shape: torch.Size([8, 128, 32005])


In [None]:
class Camembert_base(nn.Module):
    def __init__(self, embeddings, encoder, head):
        super(Camembert_base, self).__init__()
        self.embeddings = embeddings
        self.encoder = encoder
        self.head = head

    def forward(self, input_ids, attention_mask=None, labels=None):
        # Embedding layer
        embedded_input = self.embeddings(input_ids)

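        # Build an additive attention mask if one is provided:
        # 0 for real tokens, a large negative value for padding.
        # A finite fill value is used because 0 * -inf evaluates to NaN.
        if attention_mask is not None:
            attention_mask = (1.0 - attention_mask[:, None, None, :]) * torch.finfo(embedded_input.dtype).min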

        # Encoder layer
        encoder_output = self.encoder(embedded_input, attention_mask)

        # Head for classification
        logits = self.head(encoder_output[:, 0, :])  # Use only the first token (<s>, the [CLS] equivalent) for classification

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)

        return {"logits": logits, "loss": loss}
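
For reading the loss values during training, it helps to know the chance-level baseline. The plain-Python sketch below computes the cross-entropy of a single example from its logits (the per-example quantity that `nn.CrossEntropyLoss` averages over a batch by default); `cross_entropy` is an illustrative helper, not the library function.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

# With three classes and uninformative (equal) logits, the loss is ln(3).
chance_loss = cross_entropy([0.0, 0.0, 0.0], label=0)
print(round(chance_loss, 4))  # 1.0986
```

A training loss that stays near 1.10 therefore suggests the classifier has not moved beyond uniform guessing.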

## Training

In [22]:
class NLI(pl.LightningModule):
    def __init__(self, model, lr=5e-5):
        """
        NLI model for training with PyTorch Lightning.
        :param model: Instance of the fine-tuning model.
        :param lr: Learning rate.
        """
        super(NLI, self).__init__()
        self.model = model
        self.lr = lr

        # Accuracy metrics for training and validation
        self.train_accuracy = Accuracy(task="multiclass", num_classes=3)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=3)

        # Metrics tracked per step
        self.train_losses_step = []
        self.train_accuracies_step = []
        self.val_losses_step = []
        self.val_accuracies_step = []

        # Metrics tracked per epoch
        self.train_losses_epoch = []
        self.train_accuracies_epoch = []
        self.val_losses_epoch = []
        self.val_accuracies_epoch = []

    def forward(self, batch):
        """
        Forward pass for inference.
        :param batch: Dict-like batch containing `input_ids` and `attention_mask`.
        :return: Model logits.
        """
        # Batches are dicts, so fields are read by key (unpacking would yield the keys)
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        outputs = self.model(input_ids, attention_mask)
        return outputs["logits"]

    def training_step(self, batch, batch_index):
        """
        Performs a single training step.
        :param batch: Input batch containing input IDs, attention masks, and labels.
        :param batch_index: Index of the batch.
        :return: Training loss.
        """
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        outputs = self.model(input_ids, attention_mask, labels)
        loss = outputs["loss"]

        # Compute accuracy
        preds = torch.argmax(outputs["logits"], dim=1)
        acc = self.train_accuracy(preds, labels)

        # Store step metrics
        self.train_losses_step.append(loss.item())
        self.train_accuracies_step.append(acc.item())

        # Log metrics for progress bar
        self.log("train_loss", loss, prog_bar=True, on_step=True, on_epoch=False)
        self.log("train_acc", acc, prog_bar=True, on_step=True, on_epoch=False)

        return loss

    def on_train_epoch_end(self):
        """
        Computes and stores epoch-level training metrics at the end of each epoch.
        """
        avg_loss = torch.tensor(self.train_losses_step).mean().item()
        avg_acc = torch.tensor(self.train_accuracies_step).mean().item()

        self.train_losses_epoch.append(avg_loss)
        self.train_accuracies_epoch.append(avg_acc)

        # Display epoch results
        print(f"[Epoch {self.current_epoch}] Train Loss: {avg_loss:.4f}, Train Accuracy: {avg_acc:.4f}")

        # Clear step metrics to prepare for the next epoch
        self.train_losses_step.clear()
        self.train_accuracies_step.clear()

    def validation_step(self, batch, batch_index):
        """
        Performs a single validation step.
        :param batch: Input batch containing input IDs, attention masks, and labels.
        :param batch_index: Index of the batch.
        :return: Validation loss.
        """
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        outputs = self.model(input_ids, attention_mask, labels)
        loss = outputs["loss"]

        # Compute accuracy
        preds = torch.argmax(outputs["logits"], dim=1)
        acc = self.val_accuracy(preds, labels)

        # Store step metrics
        self.val_losses_step.append(loss.item())
        self.val_accuracies_step.append(acc.item())

        # Log metrics for progress bar
        self.log("val_loss", loss, prog_bar=True, on_step=False, on_epoch=True)
        self.log("val_acc", acc, prog_bar=True, on_step=False, on_epoch=True)

        return loss

    def on_validation_epoch_end(self):
        """
        Computes and stores epoch-level validation metrics at the end of each epoch.
        """
        avg_loss = torch.tensor(self.val_losses_step).mean().item()
        avg_acc = torch.tensor(self.val_accuracies_step).mean().item()

        self.val_losses_epoch.append(avg_loss)
        self.val_accuracies_epoch.append(avg_acc)

        # Display epoch results
        print(f"[Epoch {self.current_epoch}] Val Loss: {avg_loss:.4f}, Val Accuracy: {avg_acc:.4f}")

        # Clear step metrics to prepare for the next epoch
        self.val_losses_step.clear()
        self.val_accuracies_step.clear()

    def configure_optimizers(self):
        """
        Configures the optimizer and learning rate scheduler.
        :return: Dictionary containing the optimizer and scheduler configurations.
        """
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)

        # Total number of optimizer steps, as estimated by the Trainer
        # (batches per epoch * max_epochs for this setup)
        total_steps = self.trainer.estimated_stepping_batches

        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=self.lr,
            total_steps=total_steps,
            pct_start=0.1,
            anneal_strategy="linear",
        )
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}
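
To make the schedule concrete, the sketch below reproduces the shape of a linear one-cycle schedule in plain Python: a warm-up over the first 10% of steps followed by a linear decay. It assumes PyTorch's documented defaults for `OneCycleLR` (`div_factor=25`, `final_div_factor=1e4`, two phases) and is an approximation for illustration, not the library implementation; `one_cycle_lr` is a hypothetical helper.

```python
def one_cycle_lr(step, total_steps, max_lr, pct_start=0.1,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a linear two-phase one-cycle schedule."""
    initial_lr = max_lr / div_factor          # assumed default: max_lr / 25
    min_lr = initial_lr / final_div_factor    # assumed default: initial_lr / 1e4
    warmup_end = pct_start * total_steps - 1  # last warm-up step
    last_step = total_steps - 1
    if step <= warmup_end:
        pct = step / warmup_end
        return initial_lr + (max_lr - initial_lr) * pct
    pct = (step - warmup_end) / (last_step - warmup_end)
    return max_lr + (min_lr - max_lr) * pct

# 1534 batches per epoch and 5 epochs, as in this notebook.
total_steps = 1534 * 5
for step in (0, 766, total_steps - 1):
    print(step, one_cycle_lr(step, total_steps, max_lr=5e-5))
```

Under these assumptions the rate starts at 2e-6, peaks at 5e-5 around step 766, and ends close to zero.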

In [23]:
data_path = "data/xnli"

xnli_train_dataset = XNLIDataset(split="train", language="fr", cache_directory=data_path, max_length=64)
xnli_val_dataset = XNLIDataset(split="validation", language="fr", cache_directory=data_path, max_length=64)

train_loader = DataLoader(xnli_train_dataset, batch_size=256, shuffle=True)
val_loader = DataLoader(xnli_val_dataset, batch_size=256, shuffle=False)


steps_per_epoch = len(train_loader)
steps_per_epoch

1534
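
The value 1534 follows from the dataset size and the batch size: with the `DataLoader` default `drop_last=False`, the last partial batch is kept, so the number of batches is the ceiling of the division. The sketch below assumes 392,702 training examples, a figure that is not printed in this notebook but is consistent with the result above.

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

# 392_702 is an assumed training-set size (not shown in this notebook).
print(steps_per_epoch(392_702, 256))  # 1534
```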

In [28]:
num_labels = 3  # NLI classes: entailment, neutral, contradiction
hidden_size = config.hidden_size  # Output size of the encoder

camembert = Camembert_base(loaded_model.embeddings, loaded_model.encoder, NLIHead(hidden_size, num_labels))

In [29]:
pl_camembert = NLI(model=camembert)

trainer = pl.Trainer(
    max_epochs=5,
    accelerator="gpu",  # Use the GPU
    devices=1,  # Use a single GPU
    callbacks=[
        ModelCheckpoint(
            monitor="val_loss",
            dirpath="../checkpoints/",
            filename="nli-{epoch:02d}-{val_loss:.2f}",
            save_top_k=2,
            mode="min",
        )
    ],
    logger=TensorBoardLogger("logs/", name="nli_experiment"),
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
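
With `monitor="val_loss"`, `mode="min"` and `save_top_k=2`, only the two checkpoints with the lowest validation loss are kept. The sketch below mimics that selection in plain Python; the loss values are hypothetical and purely illustrative, `keep_top_k` is not a Lightning function, and the file names follow what is, to my understanding, Lightning's default of inserting the metric name into the formatted template.

```python
def keep_top_k(val_losses, k=2):
    """Return the file names of the k epochs with the lowest validation loss."""
    ranked = sorted(val_losses.items(), key=lambda item: item[1])
    return [
        f"nli-epoch={epoch:02d}-val_loss={loss:.2f}.ckpt"
        for epoch, loss in ranked[:k]
    ]

# Hypothetical per-epoch validation losses (not results from this notebook).
hypothetical_losses = {0: 1.02, 1: 0.91, 2: 0.88, 3: 0.90, 4: 0.93}
print(keep_top_k(hypothetical_losses))
# ['nli-epoch=02-val_loss=0.88.ckpt', 'nli-epoch=03-val_loss=0.90.ckpt']
```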


In [30]:
trainer.fit(pl_camembert, train_loader, val_loader) 

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type               | Params | Mode 
--------------------------------------------------------------
0 | model          | Camembert_base     | 103 M  | train
1 | train_accuracy | MulticlassAccuracy | 0      | train
2 | val_accuracy   | MulticlassAccuracy | 0      | train
--------------------------------------------------------------
103 M     Trainable params
0         Non-trainable params
103 M     Total params
412.568   Total estimated model params size (MB)
8         Modules in train mode
152       Modules in eval mode


`Trainer.fit` stopped: `max_epochs=5` reached.
