# Notebook 2: Fine-tuning DistilBERT for Question Answering

**Course:** M2 Datascale - Data Mining  

## Objectives
- Fine-tune the DistilBERT-base-uncased model on SQuAD v1.1
- Evaluate with the F1 score and Exact Match metrics
- Measure inference time
- Save the fine-tuned model

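The two evaluation metrics can be illustrated on a single prediction. The sketch below is a simplified, self-contained version of SQuAD-style scoring, assuming the usual answer normalization (lowercasing, removing punctuation and the articles a/an/the). The function names are illustrative; this is not the code of the `evaluate` library used later in the notebook.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Amazon basin", "amazon basin"))
print(f1_score("7,000,000 square kilometres",
               "7,000,000 square kilometres of basin"))
```

In the second call the prediction covers three of the five reference tokens, so precision is 1.0, recall is 0.6, and F1 comes out at roughly 0.75, while Exact Match for that pair would be 0. On SQuAD, each prediction is scored against every reference answer and the maximum is kept.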
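## Model
- **Architecture:** DistilBERT (Sanh et al., 2019)
- **Parameters:** 66M
- **Characteristics:** Distilled version of BERT; Sanh et al. report that it is 40% smaller and 60% faster than BERT-base while retaining about 97% of its language-understanding performance

## References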
- Sanh et al. (2019). "DistilBERT, a distilled version of BERT"
- Rajpurkar et al. (2016). "SQuAD: 100,000+ Questions for Machine Comprehension"

## 1. Checking the GPU Environment

In [1]:
import torch

print("Checking the GPU environment...")
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    # total_memory is the card's total capacity in bytes, not the memory currently free
    device_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU detected: {device_name}")
    print(f"Total memory: {device_memory:.2f} GB")
    device = "cuda"
else:
    print("No GPU detected. Falling back to CPU.")
    print("Note: training on CPU will be significantly slower.")
    device = "cpu"

Checking the GPU environment...
GPU detected: NVIDIA GeForce RTX 5080
Total memory: 16.60 GB


## 2. Installing Dependencies

In [2]:
!pip install -q transformers datasets evaluate accelerate torch

## 3. Imports

In [3]:
import os
import random
import numpy as np
import time
import json
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    DefaultDataCollator,
    pipeline
)
import evaluate
from collections import defaultdict


## 4. Hyperparameter Configuration

In [4]:
# Seed for reproducibility
SEED = 42

# Model configuration
MODEL_NAME = "distilbert-base-uncased"
OUTPUT_DIR = "../models/distilbert_squad_finetuned"

# Tokenization hyperparameters
MAX_LENGTH = 384
DOC_STRIDE = 128

# Training hyperparameters
LEARNING_RATE = 3e-5
NUM_EPOCHS = 3
TRAIN_BATCH_SIZE = 64
EVAL_BATCH_SIZE = 64
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.1

# Dataset size
USE_FULL_DATASET = True  # Set to False for a quick test
MAX_TRAIN_SAMPLES = 87599 if USE_FULL_DATASET else 5000
MAX_EVAL_SAMPLES = 10570 if USE_FULL_DATASET else 1000

print("="*80)
print("Hyperparameter configuration")
print("="*80)
print(f"Model: {MODEL_NAME}")
print(f"Maximum length: {MAX_LENGTH} tokens")
print(f"Stride: {DOC_STRIDE} tokens")
print(f"Learning rate: {LEARNING_RATE}")
print(f"Number of epochs: {NUM_EPOCHS}")
print(f"Batch size: {TRAIN_BATCH_SIZE}")
print(f"Training samples: {MAX_TRAIN_SAMPLES}")
print(f"Validation samples: {MAX_EVAL_SAMPLES}")

Hyperparameter configuration
Model: distilbert-base-uncased
Maximum length: 384 tokens
Stride: 128 tokens
Learning rate: 3e-05
Number of epochs: 3
Batch size: 64
Training samples: 87599
Validation samples: 10570


## 5. Setting the Seed

In [5]:
def set_seed(seed):
    """
    Set the random seed for reproducibility.

    Args:
        seed (int): Seed value
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(SEED)
print(f"Seed set to {SEED} for reproducibility.")

Seed set to 42 for reproducibility.


## 6. Loading the Data

In [6]:
print("="*80)
print("Loading the SQuAD v1.1 dataset")
print("="*80)

squad = load_dataset("squad")

# Select the subsets (shuffled with a fixed seed, then capped at the configured size)
train_dataset = squad["train"].shuffle(seed=SEED).select(
    range(min(MAX_TRAIN_SAMPLES, len(squad["train"])))
)
eval_dataset = squad["validation"].shuffle(seed=SEED).select(
    range(min(MAX_EVAL_SAMPLES, len(squad["validation"])))
)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")

Loading the SQuAD v1.1 dataset
Training samples: 87599
Validation samples: 10570


## 7. Loading the Tokenizer

In [7]:
print("="*80)
print("Loading the tokenizer")
print("="*80)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
print(f"Tokenizer loaded: {MODEL_NAME}")
print(f"Vocabulary size: {tokenizer.vocab_size}")

Loading the tokenizer
Tokenizer loaded: distilbert-base-uncased
Vocabulary size: 30522


## 8. Data Preparation (Tokenization)

In [8]:
def prepare_train_features(examples):
    """
    Tokenize the training examples and align the answer positions.

    Args:
        examples (dict): Batch of examples containing questions, contexts and answers

    Returns:
        dict: Tokenized features with start/end positions
    """
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions = []
    end_positions = []

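    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        # Features whose window does not contain the answer are labeled with the [CLS] index
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # sequence_ids: None for special tokens, 0 for the question, 1 for the context
        sequence_ids = tokenized.sequence_ids(i)
        # One example can yield several features; recover the example this one came from
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        if len(answers["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
            continue

        # Character span of the first reference answer within the context
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        # First context token in this feature
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1

        # Last context token in this feature (skips padding and the final [SEP])
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1

        # Answer not fully inside this window: label the feature with [CLS]
        if not (offsets[token_start_index][0] <= start_char and
                offsets[token_end_index][1] >= end_char):
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            # Advance to just past the last token starting at or before start_char, then step back
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                token_start_index += 1
            start_positions.append(token_start_index - 1)

            # Walk back to just before the first token ending at or after end_char, then step forward
            while offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            end_positions.append(token_end_index + 1)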

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    return tokenized


def prepare_validation_features(examples):
    """
    Tokenize the validation examples while keeping the metadata needed for post-processing.

    Args:
        examples (dict): Batch of validation examples

    Returns:
        dict: Tokenized features with example IDs and offset mapping
    """
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    tokenized["example_id"] = []

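    for i in range(len(tokenized["input_ids"])):
        sequence_ids = tokenized.sequence_ids(i)
        context_index = 1  # sequence id of the context (the question is 0)

        # Remember which example each feature belongs to
        sample_index = sample_mapping[i]
        tokenized["example_id"].append(examples["id"][sample_index])

        # Keep offsets only for context tokens, so that post-processing can
        # discard spans that start or end in the question or in special tokens
        tokenized["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized["offset_mapping"][i])
        ]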

    return tokenized


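The alignment step in `prepare_train_features` converts a character span into token indices by scanning the offset mapping. The sketch below isolates that idea on hand-written offsets (an illustrative whitespace-style split, not real tokenizer output); `char_span_to_token_span` is a hypothetical helper name, and the comprehension-based search is a compact stand-in for the two `while` loops above.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (start_token, end_token), inclusive, or None if the span is not covered.

    offsets: list of (char_start, char_end) pairs, one per token, end exclusive.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token that starts at or before the first character of the answer.
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the end of the answer.
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_token, end_token


context = "The basin is 7,000,000 square kilometres."
# Hand-written offsets, one per "token"; assumed for illustration only.
offsets = [(0, 3), (4, 9), (10, 12), (13, 22), (23, 29), (30, 40), (40, 41)]
answer = "7,000,000 square kilometres"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
# A truncated window that ends before the answer does yields None,
# which corresponds to the [CLS] label in the notebook.
print(char_span_to_token_span(offsets[:4], start_char, end_char))
```

Slicing the context with the offsets of the returned tokens recovers the answer text, which is a quick sanity check worth running on a few real features as well.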


In [9]:
print("="*80)
print("Tokenizing the data")
print("="*80)

print("Tokenizing the training set...")
tokenized_train = train_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=train_dataset.column_names,
)

print("Tokenizing the validation set...")
tokenized_validation = eval_dataset.map(
    prepare_validation_features,
    batched=True,
    remove_columns=eval_dataset.column_names,
)

# Validation set used for eval_loss (must contain start_positions / end_positions)
tokenized_validation_for_loss = eval_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=eval_dataset.column_names,
)

print(f"\nTraining features: {len(tokenized_train)}")
print(f"Validation features: {len(tokenized_validation)}")

Tokenizing the data
Tokenizing the training set...
Tokenizing the validation set...

Training features: 88524
Validation features: 10784

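There are more features than examples (88,524 against 87,599 for training) because contexts that do not fit in `MAX_LENGTH` tokens are split into overlapping windows. The sketch below estimates the number of windows for one question/context pair. It assumes three special tokens per feature (`[CLS]` and two `[SEP]`) and that `DOC_STRIDE` is the number of context tokens shared by consecutive windows; `count_windows` is a hypothetical helper, not a tokenizer API.

```python
import math


def count_windows(n_question_tokens, n_context_tokens,
                  max_length=384, doc_stride=128, n_special_tokens=3):
    """Estimate how many features one example produces under a sliding window."""
    # Context tokens that fit in one feature once the question and specials are placed.
    budget = max_length - n_question_tokens - n_special_tokens
    if budget <= doc_stride:
        raise ValueError("question too long for this max_length / doc_stride")
    if n_context_tokens <= budget:
        return 1
    # Each extra window repeats doc_stride tokens and adds (budget - doc_stride) new ones.
    step = budget - doc_stride
    return 1 + math.ceil((n_context_tokens - budget) / step)


for n_context in (150, 370, 371, 700):
    print(n_context, count_windows(n_question_tokens=11, n_context_tokens=n_context))
```

With an 11-token question the per-window budget is 370 context tokens, so only contexts longer than that are split. Most SQuAD contexts are shorter, which is consistent with the feature count exceeding the example count by only about 1%.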

## 9. Loading the Model

In [10]:
print("="*80)
print("Initializing the model")
print("="*80)

model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
model = model.to(device)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model: {MODEL_NAME}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Device: {device}")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Initializing the model
Model: distilbert-base-uncased
Total parameters: 66,364,418
Trainable parameters: 66,364,418
Device: cuda


## 10. Training Configuration

In [11]:
training_args = TrainingArguments(
    output_dir="../models/results_distilbert_squad",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    # Reuse the constants from section 4 instead of repeating literals
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_ratio=WARMUP_RATIO,
    # WEIGHT_DECAY is defined in section 4 but was not passed in the recorded run,
    # so the Trainer default (0.0) applied. Uncomment to enable it:
    # weight_decay=WEIGHT_DECAY,
    seed=SEED,
    logging_steps=100,
)

data_collator = DefaultDataCollator()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation_for_loss,  # has start/end positions, required for eval_loss
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument
    data_collator=data_collator,
)

print("Trainer configured successfully.")

Trainer configured successfully.

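With `warmup_ratio=0.1`, the learning rate ramps up over the first 10% of optimizer steps and then decays. The sketch below works out the step counts from the 88,524 training features and traces the schedule, assuming the Trainer's default linear scheduler, a single GPU, and no gradient accumulation. `linear_schedule_lr` is an illustrative re-implementation, not a call into `transformers`.

```python
import math

NUM_FEATURES = 88524   # training features reported after tokenization
BATCH_SIZE = 64
NUM_EPOCHS = 3
PEAK_LR = 3e-5
WARMUP_RATIO = 0.1

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(NUM_FEATURES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_schedule_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


print(f"steps/epoch={steps_per_epoch}, total={total_steps}, warmup={warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")
```

Under these assumptions the run has 1,384 steps per epoch and 4,152 in total, with the peak rate of 3e-5 reached after 416 steps, that is, roughly a third of the way through the first epoch.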

## 11. Training

In [12]:
print("="*80)
print("Starting training")
print("="*80)

start_time = time.time()
train_result = trainer.train()
training_time = time.time() - start_time

print("\n" + "="*80)
print("Training complete")
print("="*80)
print(f"Total duration: {training_time/60:.2f} minutes")
# training_loss is the loss averaged over all training steps, not the loss of the last step
print(f"Average training loss: {train_result.training_loss:.4f}")

# Save the model and tokenizer
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"\nModel saved to: {OUTPUT_DIR}")

Starting training


Epoch,Training Loss,Validation Loss
1,1.3069,1.243359
2,1.04,1.150378
3,0.8415,1.155398



Training complete
Total duration: 55.62 minutes
Average training loss: 1.3449

Model saved to: ../models/distilbert_squad_finetuned

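The validation loss is lowest at epoch 2 and rises slightly at epoch 3 while the training loss keeps falling, an early sign of overfitting. Since the Trainer was configured with `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`, the checkpoint reloaded before saving should be the epoch-2 one. The short sketch below re-derives this from the logged values, which are copied from the table above.

```python
# Values copied from the training log.
history = [
    {"epoch": 1, "train_loss": 1.3069, "eval_loss": 1.243359},
    {"epoch": 2, "train_loss": 1.04, "eval_loss": 1.150378},
    {"epoch": 3, "train_loss": 0.8415, "eval_loss": 1.155398},
]

# Lower eval_loss is better (greater_is_better=False in the training arguments).
best = min(history, key=lambda row: row["eval_loss"])

for row in history:
    gap = row["eval_loss"] - row["train_loss"]
    marker = "  <- lowest eval_loss" if row is best else ""
    print(f"epoch {row['epoch']}: eval - train = {gap:+.4f}{marker}")
```

The growing gap between validation and training loss suggests that a third epoch brings little benefit with these settings. Whether two epochs would also give the best F1 is a separate question, since the selection here is made on loss rather than on the SQuAD metrics.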

## 12. Evaluation

In [13]:
def postprocess_qa_predictions(examples, features, raw_predictions,
                                n_best=20, max_answer_length=30):
    """
    Post-process the raw predictions to extract the answer text.

    Args:
        examples: Original examples
        features: Tokenized features
        raw_predictions: Start and end logits
        n_best: Number of top start/end candidates to consider
        max_answer_length: Maximum answer length, in tokens

    Returns:
        dict: Mapping from example ID to answer text
    """
    all_start_logits, all_end_logits = raw_predictions

    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    predictions = {}

    for example_index, example in enumerate(examples):
        feature_indices = features_per_example[example_index]
        context = example["context"]

        best_answer = {"text": "", "score": -float("inf")}

        for feature_index in feature_indices:
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            offset_mapping = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logits)[-n_best:][::-1]
            end_indexes = np.argsort(end_logits)[-n_best:][::-1]

            for start_index in start_indexes:
                for end_index in end_indexes:
                    if (start_index >= len(offset_mapping) or
                        end_index >= len(offset_mapping) or
                        offset_mapping[start_index] is None or
                        offset_mapping[end_index] is None):
                        continue

                    if (end_index < start_index or
                        end_index - start_index + 1 > max_answer_length):
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    text = context[start_char:end_char]
                    score = start_logits[start_index] + end_logits[end_index]

                    if score > best_answer["score"]:
                        best_answer = {"text": text, "score": float(score)}

        predictions[example["id"]] = best_answer["text"]

    return predictions

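The core of the post-processing is a search over pairs of candidate start and end tokens. The toy sketch below reproduces that search on hand-made logits for a single five-token feature; it uses plain lists instead of NumPy arrays and omits the offset-mapping checks, and `best_span` is an illustrative name.

```python
def best_span(start_logits, end_logits, n_best=2, max_answer_length=3):
    """Return (start, end, score) of the highest-scoring valid span, or None."""

    def top_k(scores):
        # Indices of the n_best largest scores, best first.
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_best]

    best = None
    for s in top_k(start_logits):
        for e in top_k(end_logits):
            # Reject spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


start_logits = [0.1, 4.0, 0.3, 2.5, 0.2]
end_logits = [3.0, 0.2, 3.5, 0.1, 1.0]
print(best_span(start_logits, end_logits))
```

Here the top start candidates are tokens 1 and 3 and the top end candidates are tokens 2 and 0. Only the pair (1, 2) is a valid span, so it is selected with a score of 7.5; the notebook's function applies the same rule across all features of an example and keeps the best span overall.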
In [14]:
print("="*80)
print("Evaluating the model")
print("="*80)

print("Generating predictions...")
raw_predictions = trainer.predict(tokenized_validation)

print("Post-processing predictions...")
final_predictions = postprocess_qa_predictions(
    eval_dataset,
    tokenized_validation,
    raw_predictions.predictions
)

# Compute the SQuAD metrics
metric = evaluate.load("squad")

formatted_predictions = [
    {"id": k, "prediction_text": v}
    for k, v in final_predictions.items()
]
references = [
    {"id": ex["id"], "answers": ex["answers"]}
    for ex in eval_dataset
]

results = metric.compute(predictions=formatted_predictions, references=references)

print("\n" + "="*80)
print("Evaluation results")
print("="*80)
print(f"F1 Score: {results['f1']:.2f}%")
print(f"Exact Match: {results['exact_match']:.2f}%")

Evaluating the model
Generating predictions...


Post-processing predictions...

Evaluation results
F1 Score: 84.41%
Exact Match: 75.81%


## 13. Inference Test

In [15]:
print("="*80)
print("Inference test")
print("="*80)

qa_pipeline = pipeline(
    "question-answering",
    model=OUTPUT_DIR,
    tokenizer=OUTPUT_DIR,
    device=0 if torch.cuda.is_available() else -1
)

test_context = """
The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical
rainforest in the Amazon biome that covers most of the Amazon basin of South America.
The basin is 7,000,000 square kilometres. The rainforest represents over half of the
planet's remaining rainforests and comprises the largest and most biodiverse tract
of tropical rainforest in the world.
"""

test_questions = [
    "How large is the Amazon basin?",
    "What is another name for the Amazon rainforest?",
]

inference_times = []

# Only two single-shot timings are taken, so the mean is indicative rather than a benchmark
for question in test_questions:
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = qa_pipeline(question=question, context=test_context)
    inference_time = (time.perf_counter() - start) * 1000
    inference_times.append(inference_time)

    print(f"\nQuestion: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.4f}")
    print(f"Time: {inference_time:.2f} ms")

avg_inference_time = np.mean(inference_times)
print(f"\nAverage inference time: {avg_inference_time:.2f} ms")

Device set to use cuda:0


Inference test

Question: How large is the Amazon basin?
Answer: 7,000,000 square kilometres
Confidence: 0.9223
Time: 3.56 ms

Question: What is another name for the Amazon rainforest?
Answer: Amazonia
Confidence: 0.9450
Time: 2.87 ms

Average inference time: 3.21 ms

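An average over two calls is sensitive to one-off costs such as kernel loading or caching. A more robust measurement discards a few warm-up calls and summarizes many runs; the sketch below shows one way to do this with the standard library only. `benchmark` is a hypothetical helper, and the lambda is a stand-in workload so that the block runs without a model.

```python
import statistics
import time


def benchmark(fn, n_warmup=3, n_runs=20):
    """Time fn() in milliseconds after a few untimed warm-up calls."""
    for _ in range(n_warmup):
        fn()
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
        "runs": n_runs,
    }


# Stand-in workload; in the notebook this could be something like
# lambda: qa_pipeline(question=test_questions[0], context=test_context)
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting the median alongside the mean makes outliers visible. The pipeline call returns Python objects, so the measured time includes tokenization, the forward pass, and post-processing.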

## 14. Saving the Results

In [16]:
results_summary = {
    "model_name": MODEL_NAME,
    "model_type": "distilbert",
    "finetuned": True,
    "f1": results["f1"],
    "exact_match": results["exact_match"],
    "training_time_minutes": training_time / 60,
    "avg_inference_time_ms": avg_inference_time,
    "total_parameters": total_params,
    "trainable_parameters": trainable_params,
    "num_train_samples": len(train_dataset),
    "num_eval_samples": len(eval_dataset),
    "num_epochs": NUM_EPOCHS,
    "batch_size": TRAIN_BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    "max_length": MAX_LENGTH,
    "doc_stride": DOC_STRIDE,
}

output_path = f"{OUTPUT_DIR}/results.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(results_summary, f, indent=2)

print(f"Results saved to: {output_path}")

print("\n" + "="*80)
print("Results summary - DistilBERT")
print("="*80)

for key, value in results_summary.items():
    print(f"{key}: {value}")

Results saved to: ../models/distilbert_squad_finetuned/results.json

Results summary - DistilBERT
model_name: distilbert-base-uncased
model_type: distilbert
finetuned: True
f1: 84.41070434105153
exact_match: 75.80889309366131
training_time_minutes: 55.61639648278554
avg_inference_time_ms: 3.21042537689209
total_parameters: 66364418
trainable_parameters: 66364418
num_train_samples: 87599
num_eval_samples: 10570
num_epochs: 3
batch_size: 64
learning_rate: 3e-05
max_length: 384
doc_stride: 128


## 15. Download (Google Colab)

In [None]:
try:
    from google.colab import files
    files.download(output_path)
    print(f"File downloaded: {output_path}")
except ImportError:
    print("Running in a local environment.")
    print(f"Results available at: {output_path}")

Running in a local environment.
Results available at: ../models/distilbert_squad_finetuned/results.json

