# Notebook 3: Fine-tuning RoBERTa for Question Answering

**Course:** M2 Datascale - Data Mining (Fouille de Données)  

## Objectives
- Fine-tune the RoBERTa-base model on SQuAD v1.1
- Evaluate with the F1 score and Exact Match metrics
- Measure inference time
- Compare with DistilBERT

## Model
- **Architecture:** RoBERTa (Liu et al., 2019)
- **Parameters:** 125M
- **Characteristics:** A BERT optimization trained on more data with an improved pre-training procedure

## References
- Liu et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach"
- Rajpurkar et al. (2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text"

## 1. Checking the GPU Environment

In [1]:
import torch

print("Checking the GPU environment...")
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    # total_memory is the card's total capacity, not the memory currently free
    device_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU detected: {device_name}")
    print(f"Total memory: {device_memory:.2f} GB")
    device = "cuda"
else:
    print("No GPU detected. Using the CPU.")
    print("Note: training on CPU will be significantly slower.")
    device = "cpu"

Checking the GPU environment...
GPU detected: NVIDIA GeForce RTX 5080
Total memory: 16.60 GB


## 2. Installing Dependencies

In [2]:
!pip install -q transformers datasets evaluate accelerate torch

## 3. Imports

In [3]:
import os
import random
import numpy as np
import time
import json
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    DefaultDataCollator,
    pipeline
)
import evaluate
from collections import defaultdict



## 4. Hyperparameter Configuration

In [4]:
# Seed for reproducibility
SEED = 42

# Model configuration
MODEL_NAME = "roberta-base"
OUTPUT_DIR = "../models/roberta_squad_finetuned"

# Tokenization hyperparameters
MAX_LENGTH = 384
DOC_STRIDE = 128

# Training hyperparameters
LEARNING_RATE = 3e-5
NUM_EPOCHS = 3
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 64
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.1

# Dataset size
USE_FULL_DATASET = True  # Set to False for a quick test
MAX_TRAIN_SAMPLES = 87599 if USE_FULL_DATASET else 5000
MAX_EVAL_SAMPLES = 10570 if USE_FULL_DATASET else 1000

print("="*80)
print("Hyperparameter configuration - RoBERTa")
print("="*80)
print(f"Model: {MODEL_NAME}")
print(f"Maximum length: {MAX_LENGTH} tokens")
print(f"Stride: {DOC_STRIDE} tokens")
print(f"Learning rate: {LEARNING_RATE}")
print(f"Number of epochs: {NUM_EPOCHS}")
print(f"Batch size: {TRAIN_BATCH_SIZE}")
print(f"Training samples: {MAX_TRAIN_SAMPLES}")
print(f"Validation samples: {MAX_EVAL_SAMPLES}")

Hyperparameter configuration - RoBERTa
Model: roberta-base
Maximum length: 384 tokens
Stride: 128 tokens
Learning rate: 3e-05
Number of epochs: 3
Batch size: 32
Training samples: 87599
Validation samples: 10570


## 5. Setting the Seed

In [5]:
def set_seed(seed):
    """
    Set the seed to make runs reproducible.

    Args:
        seed (int): Seed value
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(SEED)
print(f"Seed set to {SEED} for reproducibility.")

Seed set to 42 for reproducibility.


## 6. Loading the Data

In [6]:
print("="*80)
print("Loading the SQuAD v1.1 dataset")
print("="*80)

squad = load_dataset("squad")

# Select the subsets
train_dataset = squad["train"].shuffle(seed=SEED).select(
    range(min(MAX_TRAIN_SAMPLES, len(squad["train"])))
)
eval_dataset = squad["validation"].shuffle(seed=SEED).select(
    range(min(MAX_EVAL_SAMPLES, len(squad["validation"])))
)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")

Loading the SQuAD v1.1 dataset
Training samples: 87599
Validation samples: 10570


## 7. Loading the Tokenizer

In [7]:
print("="*80)
print("Loading the RoBERTa tokenizer")
print("="*80)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
print(f"Tokenizer loaded: {MODEL_NAME}")
print(f"Vocabulary size: {tokenizer.vocab_size}")

Loading the RoBERTa tokenizer
Tokenizer loaded: roberta-base
Vocabulary size: 50265


## 8. Data Preparation (Tokenization)

In [8]:
def prepare_train_features(examples):
    """
    Tokenize the training examples and align the answer positions.

    Args:
        examples (dict): Batch of examples containing questions, contexts and answers

    Returns:
        dict: Tokenized features with start/end positions
    """
    # Strip leading whitespace from the questions (important for RoBERTa)
    examples["question"] = [q.lstrip() for q in examples["question"]]

    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        if len(answers["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
            continue

        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

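        # First token of the context (sequence id 1) in this window
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1

        # Last token of the context in this window
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1

        # If the answer is not fully inside this window, label it with the CLS index
        if not (offsets[token_start_index][0] <= start_char and
                offsets[token_end_index][1] >= end_char):
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            # Move forward to the last token starting at or before the answer start
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                token_start_index += 1
            start_positions.append(token_start_index - 1)

            # Move backward to the first token ending at or after the answer end
            while offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            end_positions.append(token_end_index + 1)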

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    return tokenized


def prepare_validation_features(examples):
    """
    Tokenize the validation examples while keeping their metadata.

    Args:
        examples (dict): Batch of validation examples

    Returns:
        dict: Tokenized features with example IDs and offset mapping
    """
    # Strip leading whitespace from the questions (important for RoBERTa)
    examples["question"] = [q.lstrip() for q in examples["question"]]

    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    tokenized["example_id"] = []

    for i in range(len(tokenized["input_ids"])):
        sequence_ids = tokenized.sequence_ids(i)
        context_index = 1

        sample_index = sample_mapping[i]
        tokenized["example_id"].append(examples["id"][sample_index])

        tokenized["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized["offset_mapping"][i])
        ]

    return tokenized
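
The character-to-token alignment performed in `prepare_train_features` can be illustrated without a real tokenizer. The sketch below is a simplified illustration, not the notebook's code: `char_span_to_token_span` is a hypothetical helper, and whitespace-separated words stand in for RoBERTa's subword tokens. It assumes offsets are `(start, end)` character pairs, as produced with `return_offsets_mapping=True`.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices (hypothetical helper).

    Returns None when the window does not fully contain the span; the
    notebook labels that case with the CLS index instead.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    start_tok = 0
    # Advance past every token that starts at or before the answer start.
    while start_tok < len(offsets) and offsets[start_tok][0] <= start_char:
        start_tok += 1
    end_tok = len(offsets) - 1
    # Walk back past every token that ends at or after the answer end.
    while end_tok >= 0 and offsets[end_tok][1] >= end_char:
        end_tok -= 1
    return start_tok - 1, end_tok + 1

context = "The basin is 7,000,000 square kilometres."
answer = "7,000,000 square kilometres"
# Whitespace "tokens" stand in for real subword tokens (assumption).
offsets = [m.span() for m in re.finditer(r"\S+", context)]
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
# A window holding only the first three tokens does not contain the answer.
truncated = char_span_to_token_span(offsets[:3], start_char, end_char)
print(span, truncated)
```

Because whitespace tokens are coarser than subword tokens, the last token here also covers the trailing period; with a subword tokenizer the recovered span is typically tighter.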

In [9]:
print("="*80)
print("Tokenizing the data")
print("="*80)

print("Tokenizing the training set...")
tokenized_train = train_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=train_dataset.column_names,
)

print("Tokenizing the validation set...")
tokenized_validation = eval_dataset.map(
    prepare_validation_features,
    batched=True,
    remove_columns=eval_dataset.column_names,
)

# Validation set for eval_loss (must contain start_positions / end_positions)
tokenized_validation_for_loss = eval_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=eval_dataset.column_names,
)

print(f"\nTraining features: {len(tokenized_train)}")
print(f"Validation features: {len(tokenized_validation)}")

Tokenizing the data
Tokenizing the training set...
Tokenizing the validation set...

Training features: 88568
Validation features: 10790
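
The feature counts exceed the example counts (88,568 versus 87,599 for training) because contexts longer than the window are split into overlapping chunks. The sketch below shows that splitting under a simplified model: the stride is taken to be the number of tokens shared by consecutive windows, and the question and special tokens are ignored. `split_with_stride` is a hypothetical helper, not a `transformers` function.

```python
def split_with_stride(tokens, window, stride):
    """Split tokens into windows of at most `window` items that overlap
    by `stride` items (hypothetical helper, simplified model)."""
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return chunks

# Toy sizes so the overlap is easy to read; the notebook uses 384 and 128.
chunks = split_with_stride(list(range(10)), window=4, stride=2)
for chunk in chunks:
    print(chunk)

# A context that fits in one window yields a single feature.
short = split_with_stride(list(range(3)), window=4, stride=2)
```

Each example therefore produces at least one feature, and only the long contexts produce more, which matches the small gap between the two counts.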


## 9. Loading the Model

In [10]:
print("="*80)
print("Initializing the RoBERTa model")
print("="*80)

# Free cached GPU memory if a GPU is available
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory cache cleared")

model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
model = model.to(device)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model: {MODEL_NAME}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Device: {device}")

# Show the GPU memory in use
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    print(f"GPU memory allocated: {allocated:.2f} GB")
    print(f"GPU memory reserved: {reserved:.2f} GB")

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Initializing the RoBERTa model
GPU memory cache cleared
Model: roberta-base
Total parameters: 124,056,578
Trainable parameters: 124,056,578
Device: cuda
GPU memory allocated: 0.50 GB
GPU memory reserved: 0.53 GB


## 9b. Memory Optimization (optional)

If you run into GPU out-of-memory errors, run this cell before continuing.

In [11]:
# Enable gradient checkpointing to reduce memory usage.
# Note: in the recorded run this check was False (no confirmation was printed),
# so gradient checkpointing stayed disabled.
if hasattr(model.config, 'gradient_checkpointing'):
    model.gradient_checkpointing_enable()
    print("✓ Gradient checkpointing enabled")

# Reduce the batch sizes if necessary
if torch.cuda.is_available():
    # Total capacity minus memory allocated by this process: an approximation of free memory
    available_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    allocated_memory = torch.cuda.memory_allocated(0) / 1e9
    free_memory = available_memory - allocated_memory

    print(f"Free GPU memory: {free_memory:.2f} GB")

    # If less than 6 GB is available, reduce the batch sizes
    if free_memory < 6:
        TRAIN_BATCH_SIZE = 8
        EVAL_BATCH_SIZE = 16
        print(f"⚠️  Batch sizes reduced: train={TRAIN_BATCH_SIZE}, eval={EVAL_BATCH_SIZE}")

    # If less than 4 GB, reduce them further
    if free_memory < 4:
        TRAIN_BATCH_SIZE = 4
        EVAL_BATCH_SIZE = 8
        print(f"⚠️  Batch sizes further reduced: train={TRAIN_BATCH_SIZE}, eval={EVAL_BATCH_SIZE}")

Free GPU memory: 16.10 GB


## 10. Training Configuration

In [12]:
training_args = TrainingArguments(
    output_dir="../models/results_roberta_squad",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_ratio=WARMUP_RATIO,
    weight_decay=WEIGHT_DECAY,
    logging_steps=100,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation_for_loss,
    processing_class=tokenizer,
    data_collator=DefaultDataCollator(),
)

print("RoBERTa Trainer configured successfully.")

RoBERTa Trainer configured successfully.
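
To make the effect of `WARMUP_RATIO` concrete, the sketch below works out the number of optimizer steps implied by the settings above and evaluates a linear warm-up followed by a linear decay. It is an illustration under stated assumptions, not the `Trainer` internals: it assumes a single device, no gradient accumulation, a linear schedule, and that the warm-up ratio is rounded up to a whole number of steps. `linear_schedule_lr` is a hypothetical helper.

```python
import math

NUM_FEATURES = 88568      # training features reported after tokenization
TRAIN_BATCH_SIZE = 32
NUM_EPOCHS = 3
LEARNING_RATE = 3e-5
WARMUP_RATIO = 0.1

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(NUM_FEATURES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
# Rounding up is an assumption about how the ratio becomes a step count.
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def linear_schedule_lr(step):
    """Learning rate under linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return LEARNING_RATE * (step / warmup_steps)
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * (remaining / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

Under these assumptions the learning rate climbs for roughly the first tenth of training and peaks at `LEARNING_RATE` before decaying.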


## 11. Training

In [13]:
print("="*80)
print("Starting training - RoBERTa")
print("="*80)
print("⏱️  Estimated time: ~70 minutes on RTX 5080")

start_time = time.time()
train_result = trainer.train()
training_time = time.time() - start_time

print("\n" + "="*80)
print("Training complete")
print("="*80)
print(f"Total duration: {training_time/60:.2f} minutes")
# training_loss is the loss averaged over all training steps, not the last logged value
print(f"Average training loss: {train_result.training_loss:.4f}")

# Save (with load_best_model_at_end=True, the checkpoint with the lowest eval_loss is reloaded first)
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"\nModel saved to: {OUTPUT_DIR}")

Starting training - RoBERTa
⏱️  Estimated time: ~70 minutes on RTX 5080

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.8801        | 0.855378        |
| 2     | 0.6991        | 0.831307        |
| 3     | 0.5313        | 0.880576        |

Training complete
Total duration: 101.91 minutes
Average training loss: 0.8666

Model saved to: ../models/roberta_squad_finetuned
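
The loss table is worth reading alongside `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`: the checkpoint kept for saving should be the one with the lowest validation loss, which is not necessarily the last epoch. The short sketch below, using only the values logged above, picks that epoch and flags where the validation loss rose while the training loss kept falling.

```python
# Losses logged during the run above, keyed by epoch.
training_loss = {1: 0.8801, 2: 0.6991, 3: 0.5313}
validation_loss = {1: 0.855378, 2: 0.831307, 3: 0.880576}

# Lowest validation loss wins (greater_is_better=False in the notebook).
best_epoch = min(validation_loss, key=validation_loss.get)
print(f"Best epoch by validation loss: {best_epoch}")

# A falling training loss with a rising validation loss suggests overfitting.
overfit_epochs = [
    epoch
    for epoch in sorted(validation_loss)[1:]
    if validation_loss[epoch] > validation_loss[epoch - 1]
    and training_loss[epoch] < training_loss[epoch - 1]
]
print(f"Epochs with rising validation loss: {overfit_epochs}")
```

On these numbers the evaluated model presumably corresponds to the epoch 2 checkpoint, and a third epoch brings no benefit in validation loss.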


## 12. Evaluation

In [14]:
def postprocess_qa_predictions(examples, features, raw_predictions,
                                n_best=20, max_answer_length=30):
    """
    Post-process the raw predictions to extract the answer text.

    Args:
        examples: Original examples
        features: Tokenized features
        raw_predictions: Start and end logits
        n_best: Number of candidates to consider
        max_answer_length: Maximum answer length (in tokens)

    Returns:
        dict: Mapping from example ID to answer text
    """
    all_start_logits, all_end_logits = raw_predictions

    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    predictions = {}

    for example_index, example in enumerate(examples):
        feature_indices = features_per_example[example_index]
        context = example["context"]

        best_answer = {"text": "", "score": -float("inf")}

        for feature_index in feature_indices:
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            offset_mapping = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logits)[-n_best:][::-1]
            end_indexes = np.argsort(end_logits)[-n_best:][::-1]

            for start_index in start_indexes:
                for end_index in end_indexes:
                    if (start_index >= len(offset_mapping) or
                        end_index >= len(offset_mapping) or
                        offset_mapping[start_index] is None or
                        offset_mapping[end_index] is None):
                        continue

                    if (end_index < start_index or
                        end_index - start_index + 1 > max_answer_length):
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    text = context[start_char:end_char]
                    score = start_logits[start_index] + end_logits[end_index]

                    if score > best_answer["score"]:
                        best_answer = {"text": text, "score": float(score)}

        predictions[example["id"]] = best_answer["text"]

    return predictions
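
The core of the post-processing is a search over pairs of start and end positions. The sketch below isolates that search on plain Python lists, leaving out the offset mapping and the multiple features per example. `best_span` is a hypothetical helper written for illustration; the defaults mirror `n_best=20` and `max_answer_length=30` from the function above.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (start, end, score) for the highest-scoring valid span,
    or None if no candidate is valid (hypothetical helper)."""
    def top_indices(logits):
        order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
        return order[:n_best]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.0, 0.5, 3.0, 0.1]

unconstrained = best_span(start_logits, end_logits)
single_token = best_span(start_logits, end_logits, max_answer_length=1)
print(unconstrained, single_token)
```

Restricting the search to the top `n_best` candidates on each side keeps the cost at most `n_best**2` pairs per feature instead of the square of the sequence length.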

In [15]:
print("="*80)
print("Evaluating the RoBERTa model")
print("="*80)

print("Generating predictions...")
raw_predictions = trainer.predict(tokenized_validation)

print("Post-processing predictions...")
final_predictions = postprocess_qa_predictions(
    eval_dataset,
    tokenized_validation,
    raw_predictions.predictions
)

# Compute the SQuAD metrics
metric = evaluate.load("squad")

formatted_predictions = [
    {"id": k, "prediction_text": v}
    for k, v in final_predictions.items()
]
references = [
    {"id": ex["id"], "answers": ex["answers"]}
    for ex in eval_dataset
]

results = metric.compute(predictions=formatted_predictions, references=references)

print("\n" + "="*80)
print("Evaluation results - RoBERTa")
print("="*80)
print(f"F1 Score: {results['f1']:.2f}%")
print(f"Exact Match: {results['exact_match']:.2f}%")

Evaluating the RoBERTa model
Generating predictions...


Post-processing predictions...

Evaluation results - RoBERTa
F1 Score: 91.96%
Exact Match: 85.65%
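
For readers unfamiliar with the two metrics, the sketch below shows how Exact Match and token-level F1 are conventionally computed for SQuAD v1.1: answers are lowercased, stripped of punctuation and English articles, and then compared. It is a minimal re-implementation for illustration only; the scores reported above come from `evaluate.load("squad")`, which remains the reference. The function names are hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return normalize_answer(prediction) == normalize_answer(truth)

def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Multiset intersection counts each shared token at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_references(metric_fn, prediction, truths):
    """An example may have several reference answers; keep the best score."""
    return max(metric_fn(prediction, truth) for truth in truths)

em = best_over_references(exact_match, "The Amazonia", ["Amazonia"])
partial_f1 = f1_score("7,000,000 square kilometres", "7,000,000")
print(em, round(partial_f1, 2))
```

A prediction that contains the reference plus extra words gets no Exact Match credit but a partial F1, which is why F1 is always at least as high as Exact Match.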


## 13. Inference Test

In [16]:
print("="*80)
print("Inference test - RoBERTa")
print("="*80)

qa_pipeline = pipeline(
    "question-answering",
    model=OUTPUT_DIR,
    tokenizer=OUTPUT_DIR,
    device=0 if torch.cuda.is_available() else -1
)

test_context = """
The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical
rainforest in the Amazon biome that covers most of the Amazon basin of South America.
The basin is 7,000,000 square kilometres. The rainforest represents over half of the
planet's remaining rainforests and comprises the largest and most biodiverse tract
of tropical rainforest in the world.
"""

test_questions = [
    "How large is the Amazon basin?",
    "What is another name for the Amazon rainforest?",
]

inference_times = []

# Note: the first call includes one-off warm-up costs, so it is slower than later calls
for question in test_questions:
    start = time.perf_counter()
    result = qa_pipeline(question=question, context=test_context)
    inference_time = (time.perf_counter() - start) * 1000
    inference_times.append(inference_time)

    print(f"\nQuestion: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.4f}")
    print(f"Time: {inference_time:.2f} ms")

avg_inference_time = np.mean(inference_times)
print(f"\nAverage inference time: {avg_inference_time:.2f} ms")

Device set to use cuda:0


Inference test - RoBERTa

Question: How large is the Amazon basin?
Answer: 7,000,000 square kilometres
Confidence: 0.9721
Time: 10.92 ms

Question: What is another name for the Amazon rainforest?
Answer: Amazonia
Confidence: 0.9851
Time: 4.17 ms

Average inference time: 7.54 ms
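
The two timings above differ by more than a factor of two, which suggests the first call carries warm-up overhead, and an average over two calls is a noisy estimate. A more robust measurement can be sketched as follows, assuming a generic callable; `benchmark` is a hypothetical helper, and the workload shown is a stand-in so the example runs without a model.

```python
import statistics
import time

def benchmark(fn, n_warmup=3, n_runs=20):
    """Time fn() in milliseconds, discarding warm-up calls (hypothetical helper)."""
    for _ in range(n_warmup):
        fn()  # warm-up: early calls often pay one-off initialization costs
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "runs": n_runs,
    }

# Stand-in workload. In the notebook this would be something like:
#   lambda: qa_pipeline(question=question, context=test_context)
calls = []
stats = benchmark(lambda: calls.append(sum(range(10_000))), n_warmup=2, n_runs=5)
print(stats)
```

Reporting the median alongside the mean makes the figure less sensitive to occasional slow calls.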


## 14. Saving the Results

In [17]:
results_summary = {
    "model_name": MODEL_NAME,
    "model_type": "roberta",
    "finetuned": True,
    "f1": results["f1"],
    "exact_match": results["exact_match"],
    "training_time_minutes": training_time / 60,
    "avg_inference_time_ms": avg_inference_time,
    "total_parameters": total_params,
    "trainable_parameters": trainable_params,
    "num_train_samples": len(train_dataset),
    "num_eval_samples": len(eval_dataset),
    "num_epochs": NUM_EPOCHS,
    "batch_size": TRAIN_BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    "max_length": MAX_LENGTH,
    "doc_stride": DOC_STRIDE,
}

output_path = f"{OUTPUT_DIR}/results.json"
with open(output_path, "w") as f:
    json.dump(results_summary, f, indent=2)

print(f"Results saved to: {output_path}")

print("\n" + "="*80)
print("Results summary - RoBERTa")
print("="*80)
for key, value in results_summary.items():
    print(f"{key}: {value}")

Results saved to: ../models/roberta_squad_finetuned/results.json

Results summary - RoBERTa
model_name: roberta-base
model_type: roberta
finetuned: True
f1: 91.95973256997178
exact_match: 85.6480605487228
training_time_minutes: 101.91337714592616
avg_inference_time_ms: 7.544159889221191
total_parameters: 124056578
trainable_parameters: 124056578
num_train_samples: 87599
num_eval_samples: 10570
num_epochs: 3
batch_size: 32
learning_rate: 3e-05
max_length: 384
doc_stride: 128


## 15. Comparison with the Baseline

Comparison of performance before and after fine-tuning.

In [None]:
# Load the baseline results
baseline_path = "../models/roberta_baseline/results.json"
if os.path.exists(baseline_path):
    with open(baseline_path, 'r') as f:
        baseline_results = json.load(f)

    print("\n" + "="*80)
    print("COMPARISON: Baseline vs Fine-tuned (RoBERTa)")
    print("="*80)
    print(f"{'Metric':<25} {'Baseline':>15} {'Fine-tuned':>15} {'Gain':>15}")
    print("-"*80)
    print(f"{'F1 Score (%)':<25} {baseline_results['f1']:>15.2f} {results['f1']:>15.2f} {results['f1']-baseline_results['f1']:>+15.2f}")
    print(f"{'Exact Match (%)':<25} {baseline_results['exact_match']:>15.2f} {results['exact_match']:>15.2f} {results['exact_match']-baseline_results['exact_match']:>+15.2f}")
    print("="*80)
    # The :+ format prints the sign itself, so a negative gain is not shown as "+-"
    print(f"\n🎯 F1 improvement: {results['f1']-baseline_results['f1']:+.2f} points from fine-tuning!")
else:
    print(f"\n⚠️  Baseline file not found: {baseline_path}")


COMPARISON: Baseline vs Fine-tuned (RoBERTa)
Metric                           Baseline      Fine-tuned            Gain
--------------------------------------------------------------------------------
F1 Score (%)                         6.50           91.96          +85.46
Exact Match (%)                      1.25           85.65          +84.40

🎯 F1 improvement: +85.46 points from fine-tuning!

