Daily Challenge: Preprocess & fine-tune transformer-based models

1. Understanding BERT and XLM-RoBERTa

Objective: Learn how transformer models work and their role in NLP tasks.

In [1]:
from transformers import BertTokenizer, XLMRobertaTokenizer

In [2]:
from transformers import XLMRobertaForSequenceClassification

In [3]:
import torch

In [4]:
MODEL_TYPE = 'xlm-roberta-base'
model = XLMRobertaForSequenceClassification.from_pretrained(
    MODEL_TYPE,
    num_labels=3,  # Number of output labels (3 here; use 2 for binary classification)
)
# Define and initialize input tensors with sample data.
# Replace these with your actual data.
# In the XLM-RoBERTa vocabulary, 0 = <s>, 2 = </s> and 1 = <pad>.
b_input_ids = torch.tensor([[0, 35378, 2685, 5, 2, 1, 1, 1, 1, 1]])
b_input_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
b_labels = torch.tensor([0])  # Replace with your actual label
outputs = model(input_ids=b_input_ids,
                attention_mask=b_input_mask,
                labels=b_labels)
# These are the model inputs:
#   input_ids (type: torch tensor)
#   attention_mask (type: torch tensor)
#   labels (type: torch tensor)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
outputs = model(input_ids=b_input_ids,
                 attention_mask=b_input_mask,
                 labels=b_labels)
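
When `labels` is passed, the returned object carries both `outputs.loss` and `outputs.logits`. For single-label classification the loss is a cross-entropy over the softmax of the logits. The sketch below reproduces that computation for one example in plain Python. The logit values are made up for illustration, and `softmax` and `cross_entropy` are hypothetical helper names, not library functions.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

# Made-up logits for one example with three classes (num_labels = 3)
example_logits = [2.0, 0.5, -1.0]
example_label = 0

probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(example_logits, example_label)

print("probabilities:", [round(p, 3) for p in probs])
print("predicted class:", predicted_class)
print("loss:", round(loss, 4))
```

With a freshly initialized classification head, as in the warning above, the logits are close to random, so the loss only becomes meaningful once the model has been fine-tuned.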

2. Tokenizing Text

Use the BertTokenizer and XLMRobertaTokenizer to convert sentences into tokenized input.

In [6]:
# Load the pretrained tokenizers
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
xlm_roberta_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

In [7]:
# Example sentence
text = "il fait très beau aujourd'hui et toute la semaine."

In [8]:
# Tokenization with BERT
bert_tokens = bert_tokenizer.encode_plus(
    text,
    padding="max_length",  # Pad with [PAD] (id 0 for BERT) so all sequences have the same length
    truncation=True,
    max_length=20  # Fixed maximum length
)

In [9]:
print("BERT Tokens:", bert_tokens)

BERT Tokens: {'input_ids': [101, 6335, 26208, 2102, 24403, 17935, 8740, 23099, 4103, 1005, 17504, 3802, 2000, 10421, 2474, 7367, 24238, 2063, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [10]:
# Tokenization with XLM-RoBERTa
xlm_roberta_tokens = xlm_roberta_tokenizer.encode_plus(text, padding="max_length", truncation=True, max_length=20)
print(" XLM-RoBERTa Tokens:", xlm_roberta_tokens)

 XLM-RoBERTa Tokens: {'input_ids': [0, 211, 3193, 4099, 44551, 30639, 25, 12522, 82, 13725, 21, 36438, 5, 2, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
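
The XLM-RoBERTa output above shows what `padding="max_length"` does: the 14 real token ids are followed by six pad ids (1 for this tokenizer), and the attention mask marks real tokens with 1 and padding with 0. A minimal plain-Python sketch of that step follows. `pad_and_mask` is a hypothetical helper written for illustration, not the tokenizer's implementation, and it does not handle truncation.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Right-pad token_ids to max_length and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence longer than max_length; truncation is not handled here")
    n_pad = max_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

# The 14 non-padding ids from the XLM-RoBERTa output above
xlmr_ids = [0, 211, 3193, 4099, 44551, 30639, 25, 12522, 82, 13725, 21, 36438, 5, 2]

input_ids, attention_mask = pad_and_mask(xlmr_ids, max_length=20, pad_id=1)
print(input_ids)
print(attention_mask)
```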


In [12]:
# Example with two sentences
text1 = "il fait très beau aujourd'hui et toute la semaine."
text2 = "demain, la pluie revient avec beaucoupe de vent et d'oarges"

In [13]:
bert_tokenizer(
    text1,
    text2,
    padding="max_length",  # Ensures all sequences have the same length
    truncation=True,  # Cuts the input if it is too long
    max_length=20,  # Fixed maximum length
    return_tensors="pt"  # Returns PyTorch tensors
)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


{'input_ids': tensor([[  101,  6335, 26208,  2102, 24403, 17935,  8740, 23099,  4103,  1005,
           102, 17183,  8113,  1010,  2474, 20228, 10179,  2063,  7065,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [14]:
xlm_roberta_tokens = xlm_roberta_tokenizer.encode_plus(
    text1,
    text2,
    padding="max_length",
    truncation=True,
    max_length=20,
    return_tensors="pt"
)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


In [15]:
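# Note: bert_tokens still holds the single-sentence encoding from In [8],
# because the pair encoding in In [13] was not assigned to a variable.
print("BERT Tokens:", bert_tokens)
print(" XLM-RoBERTa Tokens:", xlm_roberta_tokens)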

BERT Tokens: {'input_ids': [101, 6335, 26208, 2102, 24403, 17935, 8740, 23099, 4103, 1005, 17504, 3802, 2000, 10421, 2474, 7367, 24238, 2063, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
 XLM-RoBERTa Tokens: {'input_ids': tensor([[     0,    211,   3193,   4099,  44551,  30639,     25,  12522,     82,
              2,      2,      8,  24931,      4,     21, 183212, 207647,   1609,
          13318,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
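
The two pair encodings above follow different layouts: BERT produces `[CLS] A [SEP] B [SEP]` together with `token_type_ids`, while XLM-RoBERTa produces `<s> A </s></s> B </s>` without them. Both were cut to 20 tokens with the `longest_first` strategy named in the warning, which removes tokens one at a time from the longer sequence. The sketch below imitates this in plain Python. All function names are hypothetical, the token ids are placeholders, and removing from the second sequence on ties is an assumption of this sketch.

```python
def truncate_longest_first(ids_a, ids_b, budget):
    """Drop tokens one by one from the end of the longer sequence until both fit."""
    a, b = list(ids_a), list(ids_b)
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:  # assumption: ties remove from the second sequence
            b.pop()
    return a, b

def bert_pair(ids_a, ids_b, max_length, cls_id=101, sep_id=102):
    """Layout: [CLS] A [SEP] B [SEP], i.e. three special tokens."""
    a, b = truncate_longest_first(ids_a, ids_b, max_length - 3)
    input_ids = [cls_id] + a + [sep_id] + b + [sep_id]
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return input_ids, token_type_ids

def xlmr_pair(ids_a, ids_b, max_length, bos_id=0, eos_id=2):
    """Layout: <s> A </s></s> B </s>, i.e. four special tokens."""
    a, b = truncate_longest_first(ids_a, ids_b, max_length - 4)
    return [bos_id] + a + [eos_id, eos_id] + b + [eos_id]

# Placeholder ids standing in for two tokenized sentences (18 and 12 tokens)
sentence_a = list(range(1000, 1018))
sentence_b = list(range(2000, 2012))

bert_ids, bert_types = bert_pair(sentence_a, sentence_b, max_length=20)
xlmr_pair_ids = xlmr_pair(sentence_a, sentence_b, max_length=20)
print(bert_types)
print(xlmr_pair_ids)
```

The segment lengths this produces (9 and 8 tokens for BERT, 8 and 8 for XLM-RoBERTa) match the positions of the separator ids in the outputs above.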


3. Preparing Input Data for the Model

Objective: Format input data correctly for transformer models.

tokenizer.encode_plus(): converts a text into a sequence of token ids. It also adds useful information: attention_mask (1 for real tokens, 0 for padding) and token_type_ids (distinguishes the input sentences: 0 = first sentence, 1 = second sentence).

- tokenizer.special_tokens_map: returns a dictionary of the special tokens used by the model. These tokens are needed to structure the text before it is sent to the model.

In [None]:
# BERT special tokens (bert_tokenizer.special_tokens_map):
#  {
#  'unk_token': '[UNK]',  # Unknown token
#  'sep_token': '[SEP]',  # Separator between sentences; also closes the sequence
#  'pad_token': '[PAD]',  # Padding token
#  'cls_token': '[CLS]',  # Classification token, placed at the start of the sequence
#  'mask_token': '[MASK]' # Token used for masking
# }

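tokenizer.vocab_size: the size of the model's vocabulary (30,522 for bert-base-uncased and 250,002 for xlm-roberta-base). The larger the vocabulary, the more words can be represented as whole tokens instead of being split into sub-word pieces.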
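
To make the link between vocabulary and tokenization concrete, here is a small sketch of greedy longest-match-first sub-word splitting in the style of WordPiece, the scheme used by bert-base-uncased (XLM-RoBERTa uses a SentencePiece model instead). The vocabulary is a toy one invented for the example and `wordpiece` is a hypothetical helper, not the library's tokenizer.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example
toy_vocab = {"[UNK]", "play", "##ing", "##ed", "un", "##play", "##able"}

for word in ["play", "playing", "unplayable", "xyz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

A word that is in the vocabulary stays whole, a related word is split into several pieces, and a word with no matching piece falls back to the unknown token. This is one reason the French sentence above needs more tokens with the English BERT vocabulary than with the multilingual XLM-RoBERTa one.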

4. Loading and Exploring the Dataset

In [16]:
import pandas as pd

In [17]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train.csv


In [18]:
df_train = pd.read_csv('train.csv')

In [19]:
df_train.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these comments were considered in formulat...,The rules developed in the interim were put to...,en,English,0
1,5b72532a0b,These are issues that we wrestle with in pract...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Des petites choses comme celles-là font une di...,J'essayais d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,you know they can't really defend themselves l...,They can't defend themselves because of their ...,en,English,0
4,86aaa48b45,ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...,เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร,th,Thai,1


In [20]:
# Show the dimensions of the dataset (number of rows and columns)
print( df_train.shape)

(12120, 6)


In [21]:
# Show the column names to confirm which columns hold the text and the label
df_train.columns

Index(['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'label'], dtype='object')

In [22]:
print(df_train.isnull().sum())

id            0
premise       0
hypothesis    0
lang_abv      0
language      0
label         0
dtype: int64


If the data is clean, we can move on to cross-validation (StratifiedKFold).
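
Since stratification keeps the label proportions equal across folds, it can help to look at the label distribution first. On the full DataFrame this would be `df_train["label"].value_counts()`. The sketch below does the same count in plain Python on only the five rows displayed by `df_train.head()`, so the numbers say nothing about the full dataset.

```python
from collections import Counter

# (lang_abv, label) for the five rows displayed by df_train.head() above
sample_rows = [("en", 0), ("en", 2), ("fr", 0), ("en", 0), ("th", 1)]

label_counts = Counter(label for _, label in sample_rows)
lang_counts = Counter(lang for lang, _ in sample_rows)

total = len(sample_rows)
for label, count in sorted(label_counts.items()):
    print(f"label {label}: {count} rows ({count / total:.0%})")
for lang, count in lang_counts.most_common():
    print(f"language {lang}: {count} rows")
```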

5. Creating Cross-Validation Folds

Objective: Implement k-fold cross-validation for training.

Goal of the code: prepare a multilingual dataset for training an XLM-RoBERTa model to classify text.
The process includes loading the data, tokenizing it, and creating folds for cross-validation.

In [49]:
import torch
import pandas as pd
from transformers import XLMRobertaTokenizer
from sklearn.model_selection import StratifiedKFold  # Splits the data into several class-balanced subsets for cross-validation

In [23]:
# Load the XLM-RoBERTa tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

In [24]:
# Concatenate the "premise" and "hypothesis" columns
df_train["text"] = df_train["premise"] + " </s> " + df_train["hypothesis"]
# XLM-RoBERTa uses </s> as its separator token, whereas BERT uses [SEP].
# (When given a sentence pair directly, the XLM-RoBERTa tokenizer inserts </s></s> between the two segments.)

In [26]:
# Define X (the sentences) and y (the labels)
X = df_train["text"].values  # The combined premise/hypothesis text
y = df_train["label"].values  # The class labels
# Sentences and labels are kept in separate arrays to prepare model training.

In [28]:
from sklearn.model_selection import StratifiedKFold

In [29]:
# Create the cross-validation splitter (StratifiedKFold)
n_splits = 5  # Number of folds
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
folds = list(skf.split(X, y))
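
To illustrate what "stratified" means here, the following plain-Python sketch groups example indices by label and deals each group round-robin across the folds, so every validation fold receives a similar share of each class. It is a simplified illustration and not scikit-learn's implementation. `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed=42):
    """Return n_splits lists of validation indices with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    val_folds = [[] for _ in range(n_splits)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds
        for position, idx in enumerate(indices):
            val_folds[position % n_splits].append(idx)
    return val_folds

# Toy labels: 10 examples of class 0, 10 of class 1, 5 of class 2
toy_labels = [0] * 10 + [1] * 10 + [2] * 5
toy_folds = stratified_folds(toy_labels, n_splits=5)

for i, val_idx in enumerate(toy_folds):
    held_out = set(val_idx)
    train_idx = [j for j in range(len(toy_labels)) if j not in held_out]
    val_labels = sorted(toy_labels[j] for j in val_idx)
    print(f"fold {i + 1}: {len(train_idx)} train, {len(val_idx)} val, val labels {val_labels}")
```

Each example lands in exactly one validation fold, and the remaining four folds form the training set for that split.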

In [30]:
# Store the tokenized data for each fold
folds_data = []

In [31]:
# Tokenization and storage happen inside the loop, so that every fold is
# prepared (outside the loop, only the last fold would be processed).
for i, (train_idx, val_idx) in enumerate(folds):  # Iterate over each cross-validation fold
    print(f"Preparing fold {i+1}...")

    # Split the data into training and validation sets for this fold
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Tokenize the sentences with XLM-RoBERTa
    train_encodings = tokenizer(
        list(X_train),
        padding="max_length",
        truncation=True,  # Cut sequences that are too long
        max_length=20,
        return_tensors="pt"  # Return the results as PyTorch tensors
    )

    val_encodings = tokenizer(
        list(X_val),
        padding="max_length",
        truncation=True,
        max_length=20,
        return_tensors="pt"
    )

    # Convert the labels to PyTorch tensors
    y_train_tensor = torch.tensor(y_train)
    y_val_tensor = torch.tensor(y_val)

    # Store the prepared data for this fold
    folds_data.append({
        "train": {"input_ids": train_encodings["input_ids"], "attention_mask": train_encodings["attention_mask"], "labels": y_train_tensor},
        "val": {"input_ids": val_encodings["input_ids"], "attention_mask": val_encodings["attention_mask"], "labels": y_val_tensor},
    })

Preparing fold 1...
Preparing fold 2...
Preparing fold 3...
Preparing fold 4...
Preparing fold 5...


In [36]:
print("✅ Data prepared for all folds with XLM-RoBERTa.")

✅ Data prepared for all folds with XLM-RoBERTa.
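
Each entry of `folds_data` holds whole-split tensors, which would normally be fed to the model in mini-batches. The sketch below slices a split into consecutive batches. It relies only on `len()` and slicing, so it should work the same way on PyTorch tensors and on plain lists, and the toy split here uses lists. `iter_batches` is a hypothetical helper. For training, a shuffling loader such as `torch.utils.data.DataLoader` would be the more usual choice.

```python
def iter_batches(split, batch_size):
    """Yield consecutive mini-batches from a dict of equally long sequences."""
    n = len(split["labels"])
    for start in range(0, n, batch_size):
        yield {name: values[start:start + batch_size] for name, values in split.items()}

# Toy split with the same keys as folds_data[i]["train"]; the values are made up
toy_split = {
    "input_ids": [[0, 5, 2], [0, 6, 2], [0, 7, 2], [0, 8, 2], [0, 9, 2]],
    "attention_mask": [[1, 1, 1]] * 5,
    "labels": [0, 2, 1, 0, 1],
}

batches = list(iter_batches(toy_split, batch_size=2))
for batch in batches:
    print(len(batch["labels"]), "examples, labels:", batch["labels"])
```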
