<a href="https://colab.research.google.com/github/vincentmartin/tp-fine-tuning-student-version/blob/main/tp-fine-tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: fine-tuning an LLM

In this notebook you will fine-tune a base LLM, Flan-T5, using PEFT with LoRA.

### Instructions for running on Google Colab

Go to `Runtime -> Change runtime type` and select `T4-GPU` to enable GPU acceleration.

![Colab GPU](resources/colab_gpu.png "T4-GPU")

Installing the dependencies

In [76]:
%pip install -U datasets

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch \
    torchdata --quiet

%pip install \
    transformers \
    evaluate \
    rouge_score \
    loralib \
    peft \
    bitsandbytes



Importing the dependencies

In [77]:
from datasets import load_dataset
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, BitsAndBytesConfig
import torch
import time
import evaluate
import pandas as pd
import numpy as np
import os
import bitsandbytes
os.environ["WANDB_DISABLED"] = "true"

Loading the base LLM.

In [78]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Defining a function that reports the number of trainable parameters.

In [79]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


## Fine-tuning with PEFT and LoRA

Fully fine-tuning a model is rarely a sensible choice for an individual or a company without substantial compute. A more appropriate approach is PEFT (_Parameter-Efficient Fine-Tuning_).

PEFT is a family of techniques that includes LoRA (_Low-Rank Adaptation_) and _prompt tuning_ (**not to be confused with prompt engineering**). LoRA makes it possible to fine-tune a model with modest hardware (one or two GPUs). It trains adapters that amount to roughly 1-10% of the original LLM's parameters. In addition, the original LLM's weights are left untouched, so you can quickly swap adapters depending on the use case.
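To make the idea of a low-rank adapter concrete, here is a minimal pure-Python sketch of a single LoRA-augmented linear layer. It is an illustration under stated assumptions, not the `peft` implementation: the helper names (`matmul`, `lora_forward`), the toy dimensions, and the row-vector convention are all hypothetical. The frozen weight `W` is left alone; only the two small matrices `A` and `B` would be trained, and `B` starts at zero so the adapted layer initially behaves exactly like the base layer.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A @ B) without modifying W."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 2  # toy sizes, chosen for illustration only

W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]   # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]    # trainable, random init
B = [[0.0] * d_out for _ in range(r)]                                   # trainable, zero init
x = [[1.0] * d_in]

full_params = d_in * d_out        # parameters a full fine-tune would update
lora_params = r * (d_in + d_out)  # parameters the adapter trains instead

print("full:", full_params, "lora:", lora_params)
print("same output at init:", lora_forward(x, W, A, B, alpha, r) == matmul(x, W))
```

With realistic layer sizes (hundreds or thousands of dimensions and a rank of 16 to 32) the ratio `lora_params / full_params` becomes very small, which is where the memory savings come from.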

### Configuring PEFT / LoRA

First, let's configure PEFT/LoRA to fine-tune our base model with what is called an _adapter_.

PEFT/LoRA freezes the layers of the original LLM so that only the adapter is trained.

In [80]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank: the larger it is, the more trainable parameters. Typical values: 16-32
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # Keep this task type for Flan-T5
)

Attach the adapter to the original LLM.

In [81]:
peft_model = get_peft_model(original_model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
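The trainable-parameter count printed above can be reproduced by hand, which is a useful sanity check on the LoRA configuration. The sketch below assumes the usual `flan-t5-base` architecture (hidden size 768, attention projections of size 768 x 768, 12 encoder layers and 12 decoder layers, each decoder layer having both self-attention and cross-attention); these figures are assumptions about the model config, not values read from this notebook. The helper `lora_params` is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)

r = 32                # rank from the LoraConfig above
d_model = 768         # assumed hidden size of flan-t5-base
targets_per_attn = 2  # target_modules=["q", "v"]

encoder_modules = 12 * targets_per_attn      # 12 layers, self-attention only
decoder_modules = 12 * 2 * targets_per_attn  # 12 layers, self- and cross-attention

trainable = (encoder_modules + decoder_modules) * lora_params(d_model, d_model, r)
total = 247_577_856 + trainable  # base parameter count reported earlier in the notebook

print(trainable, total, f"{100 * trainable / total:.2f}%")
```

Under these assumptions the arithmetic matches the three numbers printed by the notebook, which suggests that the adapter covers every `q` and `v` projection and nothing else.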


## Running the training

Let's load the dataset used for training.

In [82]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

To keep the training time acceptable in this notebook, we reduce the size of the dataset.

In [83]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
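As a quick check of how much data survives an `index % step == 0` filter, the following sketch counts the kept indices for each split. The split sizes (12460 / 500 / 1500) are the ones reported by the dataset loading logs later in this notebook; the helper `kept_examples` is hypothetical.

```python
def kept_examples(n_examples, step):
    """Count the indices in range(n_examples) that satisfy index % step == 0."""
    return sum(1 for index in range(n_examples) if index % step == 0)

split_sizes = {"train": 12460, "validation": 500, "test": 1500}

for step in (100, 10):
    kept = {name: kept_examples(n, step) for name, n in split_sizes.items()}
    print(f"step={step}: {kept}")
```

Keeping one example in 100 leaves only about a hundred training examples, which is enough to exercise the pipeline but far too few to expect a well-trained adapter.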

**Exercise**: using the documentation at https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/trainer#transformers.TrainingArguments, create a **Trainer** instance to train the LLM. Use the following parameters:
- auto_find_batch_size=True,
- learning_rate=1e-3,
- num_train_epochs=5,
- logging_steps=1,
- max_steps=1

The dataset to use for training is `tokenized_datasets["train"]`.

**On Google Colab, use `report_to="none"`, otherwise you will be asked for a Weights & Biases (wandb) API key.**

In [84]:
output_dir = './training-output'

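peft_training_args = TrainingArguments(
    output_dir=output_dir,  # where checkpoints and logs are written
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,  # ignored here: a positive max_steps takes precedence
    max_steps=20,
    logging_steps=1,
    report_to='none'
)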
    
peft_trainer =  Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

**Exercise**: run the training, then save the model (adapter) and the tokenizer to the `training-output-checkpoint` folder.

In [85]:
save_dir = "./training-output-checkpoint"

peft_trainer.train()
peft_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

Step,Training Loss
1,48.0
2,46.25
3,42.25
4,38.75
5,33.75
6,30.0
7,28.125
8,27.125
9,25.25
10,23.75


('./training-output-checkpoint/tokenizer_config.json',
 './training-output-checkpoint/special_tokens_map.json',
 './training-output-checkpoint/spiece.model',
 './training-output-checkpoint/added_tokens.json',
 './training-output-checkpoint/tokenizer.json')

### Evaluating the fine-tuned model

A classic beginner's mistake is to judge performance by eyeballing a few generations. This is a bad idea because (1) it is not quantified and (2) what works on a handful of examples may not work on thousands of them (generalization).

When fine-tuning a model, it is therefore essential to measure performance to know whether the results are better **overall**.

In [86]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       'training-output-checkpoint', 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)


In [87]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

device = "cuda" if torch.cuda.is_available() else "cpu"


prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

original_model_outputs = original_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)


peft_model_outputs = peft_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'RESUME HUMAIN:\n{human_baseline_summary}')
print(dash_line)
print(f'RESUME AVEC MODELE ORIGINAL:\n{original_model_text_output}')
print(dash_line)
print(dash_line)
print(f'RESUME AVEC MODELE PEFT: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
RESUME HUMAIN:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
RESUME AVEC MODELE ORIGINAL:
You might also want to upgrade your hardware.
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
RESUME AVEC MODELE PEFT: Upgrade your computer.


Inference on 10 examples from the test set.

In [88]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.to(device).generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])

In [89]:
pd.set_option("display.max_rows", 200)  # or any sufficiently large value
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,The memo will be distributed to all employees ...,This memo is to be distributed to all employee...
1,In order to prevent employees from wasting tim...,This memo is to be distributed to all employees.,This memo is to be distributed to all employee...
2,Ms. Dawson takes a dictation for #Person1# abo...,This memo is intended to clarify the new polic...,This memo is to be distributed to all employee...
3,#Person2# arrives late because of traffic jam....,You're finally here!,Take public transport to work.
4,#Person2# decides to follow #Person1#'s sugges...,I'm finally here!,Take public transport to work.
5,#Person2# complains to #Person1# about the tra...,I'm not sure what happened to my car.,Take public transport to work.
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are divorced.,Masha and Hero are getting divorced.
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
8,#Person1# and Kate talk about the divorce betw...,I'm not sure what happened.,Masha and Hero are getting divorced.
9,#Person1# and Brian are at the birthday party ...,Greetings from the party.,"Happy birthday, Brian."


**Exercise**: using the documentation at https://huggingface.co/docs/evaluate/main/en/choosing_a_metric, compute the ROUGE score between:
- The original model's summaries (`original_model_summaries`) and the human summaries (`human_baseline_summaries`).
- The PEFT model's summaries (`peft_model_summaries`) and the human summaries (`human_baseline_summaries`).

Display the scores and comment on them.

In [90]:
rouge = evaluate.load("rouge")

rouge_original = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
    rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
)

rouge_peft = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
    rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
)

print("ROUGE - Modèle original vs humain:")
print(rouge_original)

print("\nROUGE - Modèle PEFT vs humain:")
print(rouge_peft)

# Optional: display the scores as a clean table
pd.DataFrame({"original": rouge_original, "peft": rouge_peft}).T

ROUGE - Modèle original vs humain:
{'rouge1': np.float64(0.16843219645851226), 'rouge2': np.float64(0.060856068995603876), 'rougeL': np.float64(0.1467145135566188), 'rougeLsum': np.float64(0.14916206600417126)}

ROUGE - Modèle PEFT vs humain:
{'rouge1': np.float64(0.2933583453583454), 'rouge2': np.float64(0.14344664031620552), 'rougeL': np.float64(0.2408399008399008), 'rougeLsum': np.float64(0.24251591001591)}


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
original,0.168432,0.060856,0.146715,0.149162
peft,0.293358,0.143447,0.24084,0.242516
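To give an intuition for what the `rouge1` column measures, here is a simplified sketch of a ROUGE-1 F1 score computed from unigram overlap. It is an approximation for teaching purposes: the function `rouge1_f1` is hypothetical, it uses a naive lowercase alphanumeric tokenizer, and it applies no stemming, so its values will not match the `evaluate` / `rouge_score` output exactly.

```python
import re
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over clipped unigram overlap, no stemming."""
    pred_counts = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((pred_counts & ref_counts).values())  # clipped common unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat", "the cat sat"))
print(rouge1_f1("Masha and Hero are getting divorced.",
                "Masha and Hero are getting divorced."))
```

Because the score rewards shared words rather than shared meaning, a short generic summary can score reasonably well, which is one reason to read a few outputs alongside the aggregate numbers.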


**Exercise**: compute the performance gain, in percent, of the PEFT model over the original model.

In [91]:
# Performance gain (%) of the PEFT model relative to the original model

# Definition: gain_pct = (peft - original) / original * 100
metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
metrics = [m for m in metrics if m in rouge_original and m in rouge_peft]

gain_pct = {}
for m in metrics:
    base = float(rouge_original[m])
    improved = float(rouge_peft[m])
    gain_pct[m] = np.nan if base == 0 else (improved - base) / base * 100

print("Gain (%) par métrique (PEFT vs original):")
print(gain_pct)

# Mean gain over the selected metrics
rouge_original_mean = float(np.mean([rouge_original[m] for m in metrics]))
rouge_peft_mean = float(np.mean([rouge_peft[m] for m in metrics]))
gain_mean_pct = np.nan if rouge_original_mean == 0 else (rouge_peft_mean - rouge_original_mean) / rouge_original_mean * 100

print(f"\nROUGE moyen (original) = {rouge_original_mean:.4f}")
print(f"ROUGE moyen (peft)     = {rouge_peft_mean:.4f}")
print(f"Gain moyen (%)         = {gain_mean_pct:.2f}%")

# Optional: readable table
pd.DataFrame({
    "original": {m: rouge_original[m] for m in metrics},
    "peft": {m: rouge_peft[m] for m in metrics},
    "gain_pct": gain_pct,
})

Gain (%) par métrique (PEFT vs original):
{'rouge1': 74.16999334245729, 'rouge2': 135.71460116256907, 'rougeL': 64.15547105839529, 'rougeLsum': 62.58551286701013}

ROUGE moyen (original) = 0.1313
ROUGE moyen (peft)     = 0.2300
Gain moyen (%)         = 75.21%


Unnamed: 0,original,peft,gain_pct
rouge1,0.168432,0.293358,74.169993
rouge2,0.060856,0.143447,135.714601
rougeL,0.146715,0.24084,64.155471
rougeLsum,0.149162,0.242516,62.585513
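Note that the "mean gain" printed above is the gain of the mean ROUGE score, which is not the same quantity as the mean of the per-metric gains. The sketch below recomputes both from the rounded scores in the table, as an illustration of why the aggregation choice should be stated when reporting a single number; the helper `relative_gain_pct` is hypothetical.

```python
original = {"rouge1": 0.168432, "rouge2": 0.060856, "rougeL": 0.146715, "rougeLsum": 0.149162}
peft = {"rouge1": 0.293358, "rouge2": 0.143447, "rougeL": 0.240840, "rougeLsum": 0.242516}

def relative_gain_pct(base, improved):
    """Relative improvement of `improved` over `base`, in percent."""
    return (improved - base) / base * 100

per_metric = {m: relative_gain_pct(original[m], peft[m]) for m in original}

# Aggregation 1: average the per-metric gains
mean_of_gains = sum(per_metric.values()) / len(per_metric)

# Aggregation 2: gain of the averaged scores (what the cell above reports)
gain_of_means = relative_gain_pct(
    sum(original.values()) / len(original),
    sum(peft.values()) / len(peft),
)

print(f"mean of gains: {mean_of_gains:.2f}%")
print(f"gain of means: {gain_of_means:.2f}%")
```

The first aggregation is pulled upward by `rouge2`, whose baseline is very small, so it comes out noticeably higher than the second.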


## Fine-tuning Llama 3 or Qwen 3 1.7B

The `flan-t5-base` model we have used so far is fine for understanding the principles, but it is an older model whose performance lags well behind recent models such as Llama 3.

In this exercise, you will load and then fine-tune a much more capable LLM that still has a manageable size of 3B parameters: Llama 3.2 - 3B. You can also try Qwen 3 1.7B (https://huggingface.co/Qwen/Qwen3-1.7B).

So that the model fits in VRAM, we will use a 4-bit quantized version: https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-bnb-4bit. This requires the `bitsandbytes` library.
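A back-of-the-envelope estimate shows why 4-bit quantization matters on a small GPU. The sketch below only counts the memory needed to store the weights; it ignores quantization metadata, activations, gradients, optimizer state, and the CUDA context, so real usage will be higher. The function `weight_memory_gib` and the rounded parameter counts are assumptions for illustration.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

models = {"~3B parameters": 3e9, "~1.7B parameters": 1.7e9}  # rounded sizes

for name, n_params in models.items():
    fp16 = weight_memory_gib(n_params, 16)
    int4 = weight_memory_gib(n_params, 4)
    print(f"{name}: 16-bit = {fp16:.2f} GiB, 4-bit = {int4:.2f} GiB")
```

Going from 16-bit to 4-bit divides the weight footprint by four, which leaves room for activations and the LoRA optimizer state on a single consumer or Colab GPU.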

**Restart the session at this point to free the RAM and VRAM.**

### Tips for the exercise:

- The model is no longer an _encoder-decoder_ (Seq2Seq) but a _decoder-only_ (CausalLM) model. Adapt the code accordingly.
- Reduce the size of the training set to keep run times acceptable (100 examples).
- Adjust the training arguments (`TrainingArguments`) to speed up processing: consider the `per_device_train_batch_size`, `gradient_accumulation_steps`, and `gradient_checkpointing` parameters.

The exercise can take a while; do as much as you can and proceed step by step.

Installing the dependencies

In [1]:
%pip install -U datasets

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch \
    torchdata --quiet

%pip install \
    transformers \
    evaluate \
    rouge_score \
    loralib \
    peft \
    bitsandbytes

Collecting datasets
  Downloading datasets-4.4.2-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading datasets-4.4.2-py3-none-any.whl (512 kB)
Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (47.7 MB)
Installing collected packages: pyarrow, datasets
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 18.1.0
    Uninstalling pyarrow-18.1.0:
      Successfully uninstalled pyarrow-18.1.0
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
Successfully installed

Importing the dependencies

In [2]:
from datasets import load_dataset
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, BitsAndBytesConfig
import torch
import time
import evaluate
import pandas as pd
import numpy as np
import os
import bitsandbytes
os.environ["WANDB_DISABLED"] = "true"

Loading the base LLM

In [3]:
import gc
import os

# IMPORTANT: if another model (e.g. Flan-T5) was already loaded in this session,
# it may still occupy VRAM. Clean up before loading Qwen in 4-bit.
for _name in ["peft_trainer", "peft_model", "original_model", "peft_model_base"]:
    if _name in globals():
        try:
            del globals()[_name]
        except Exception:
            pass
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    try:
        torch.cuda.ipc_collect()
    except Exception:
        pass

model_name = 'Qwen/Qwen3-1.7B'

# 8GB VRAM: load in 4-bit (QLoRA).
device = "cuda" if torch.cuda.is_available() else "cpu"
print("device =", device)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
 )

# Anti-OOM option: keep some VRAM headroom and allow CPU offload if needed
max_memory = None
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # keep ~1GB of headroom
    max_memory = {0: f"{max(int(total_gb - 1), 1)}GiB", "cpu": "48GiB"}
    os.makedirs("offload", exist_ok=True)

original_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory=max_memory,
    offload_folder="offload",
 )

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Synchronize here so that CUDA errors surface on the right line
if torch.cuda.is_available():
    torch.cuda.synchronize()
    free, total = torch.cuda.mem_get_info()
    print(f"CUDA mem free/total: {free/1024**3:.2f} / {total/1024**3:.2f} GiB")

device = cuda


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


config.json:   0%|          | 0.00/726 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/622M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

CUDA mem free/total: 11.78 / 14.74 GiB


In [4]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 311288832
all model parameters: 1015931904
percentage of trainable model parameters: 30.64%


In [5]:
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

# Qwen / Llama (decoder-only) generally use q_proj/k_proj/v_proj/o_proj (not q/v as in T5)
candidate_targets = ["q_proj", "k_proj", "v_proj", "o_proj", "qkv_proj", "Wqkv", "wq", "wk", "wv", "wo"]
present = set()
for name, _module in original_model.named_modules():
    last = name.split(".")[-1]
    if last in candidate_targets:
        present.add(last)

if {"q_proj", "v_proj"}.issubset(present):
    target_modules = [m for m in ["q_proj", "k_proj", "v_proj", "o_proj"] if m in present]
elif {"q", "v"}.issubset(present):
    target_modules = ["q", "v"]
elif "Wqkv" in present and "wo" in present:
    target_modules = ["Wqkv", "wo"]
else:
    raise ValueError(f"Impossible de trouver des modules attention compatibles pour LoRA. Modules candidats trouvés: {sorted(present)}")

print("LoRA target_modules =", target_modules)

# Required for QLoRA / k-bit training
original_model.config.use_cache = False
original_model = prepare_model_for_kbit_training(original_model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
 )

peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

LoRA target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj']
trainable model parameters: 6422528
all model parameters: 1022354432
percentage of trainable model parameters: 0.63%
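The adapter size printed above can also be derived by hand. Unlike T5, the four attention projections do not all have the same shape when the model uses grouped-query attention. The sketch below assumes the following Qwen3-1.7B configuration: hidden size 2048, 28 layers, 16 query heads and 8 key/value heads of dimension 128. These values are assumptions about the model config and are not read from this notebook; the helper `lora_params` is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)

r = 16            # rank from the LoraConfig above
hidden = 2048     # assumed hidden size
n_layers = 28     # assumed number of decoder layers
q_dim = 16 * 128  # assumed: 16 query heads of dimension 128
kv_dim = 8 * 128  # assumed: 8 key/value heads of dimension 128

per_layer = (
    lora_params(hidden, q_dim, r)     # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(q_dim, hidden, r)   # o_proj
)
trainable = n_layers * per_layer

print(per_layer, trainable)
```

Under these assumptions the result equals the trainable-parameter count reported by the notebook. The "all model parameters" figure, on the other hand, is lower than the nominal 1.7B because 4-bit weights are stored packed, so `numel()` undercounts them.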


In [6]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
dataset = dataset.filter(lambda example, index: index % 10 == 0, with_indices=True)

README.md: 0.00B [00:00, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [7]:
# Tokenization for CausalLM: train on prompt + summary,
# and mask the prompt part (label=-100) so that only the summary is predicted.
# If you hit OOM during training, lower max_length first.
max_length = 192
max_prompt_length = 160

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompts = [start_prompt + d + end_prompt for d in example["dialogue"]]
    targets = example["summary"]

    full_texts = [p + t for p, t in zip(prompts, targets)]
    model_inputs = tokenizer(
        full_texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )

    labels = [ids.copy() for ids in model_inputs["input_ids"]]
    prompt_inputs = tokenizer(
        prompts,
        padding="max_length",
        truncation=True,
        max_length=max_prompt_length,
    )
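    for i in range(len(labels)):
        prompt_len = sum(tok != tokenizer.pad_token_id for tok in prompt_inputs["input_ids"][i])
        labels[i][:prompt_len] = [-100] * prompt_len
        # Also ignore padding positions so the loss only covers real summary tokens
        labels[i] = [
            -100 if mask == 0 else tok
            for tok, mask in zip(labels[i], model_inputs["attention_mask"][i])
        ]
    model_inputs["labels"] = labels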
    return model_inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/1246 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]
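The label-masking idea used in the tokenization cell can be illustrated on toy token ids, without any tokenizer. In the sketch below, `build_labels` is a hypothetical helper: it concatenates prompt and summary ids, truncates and right-pads to `max_length`, and sets the label to `-100` (the index ignored by the cross-entropy loss in PyTorch and Transformers) on prompt positions. Masking the padding positions as well is a design choice of this sketch.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def build_labels(prompt_ids, summary_ids, max_length, pad_id):
    """Return (input_ids, labels) for causal LM training on the summary only."""
    input_ids = (prompt_ids + summary_ids)[:max_length]
    n_real = len(input_ids)
    n_prompt = min(len(prompt_ids), n_real)

    # Prompt positions are ignored; summary positions keep their token id
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]

    # Right-pad both sequences; padding is ignored by the loss as well
    n_pad = max_length - n_real
    return input_ids + [pad_id] * n_pad, labels + [IGNORE_INDEX] * n_pad

input_ids, labels = build_labels([1, 2, 3], [4, 5], max_length=8, pad_id=0)
print(input_ids)
print(labels)
```

One consequence worth checking on real data: if the prompt alone already fills `max_length`, every label is ignored and the example contributes nothing to training.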

In [8]:
import gc

output_dir = './training-output'

# Free VRAM/RAM before configuring the Trainer
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    try:
        torch.cuda.ipc_collect()
    except Exception:
        pass
    # force CUDA errors to surface here (rather than later in TrainingArguments)
    torch.cuda.synchronize()
    free, total = torch.cuda.mem_get_info()
    print(f"CUDA mem free/total (pre-TrainingArgs): {free/1024**3:.2f} / {total/1024**3:.2f} GiB")

# 8GB VRAM: micro-batch + accumulation + checkpointing
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=1e-3,
    num_train_epochs=5,
    max_steps=20,
    logging_steps=1,
    save_steps=50,
    fp16=torch.cuda.is_available(),
    bf16=False,
    optim="paged_adamw_8bit",
    report_to='none',
    remove_unused_columns=False,
 )
    
peft_trainer =  Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
 )

save_dir = "./training-output-checkpoint"

peft_trainer.train()
peft_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

  peft_trainer =  Trainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


CUDA mem free/total (pre-TrainingArgs): 11.76 / 14.74 GiB


Step,Training Loss
1,13.1311
2,5.7562
3,5.0539
4,2.6814
5,1.821
6,1.5845
7,0.7128
8,0.833
9,1.3135
10,1.5407


('./training-output-checkpoint/tokenizer_config.json',
 './training-output-checkpoint/special_tokens_map.json',
 './training-output-checkpoint/chat_template.jinja',
 './training-output-checkpoint/vocab.json',
 './training-output-checkpoint/merges.txt',
 './training-output-checkpoint/added_tokens.json',
 './training-output-checkpoint/tokenizer.json')
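It helps to work out how much data this training run actually sees. The sketch below uses the values from the `TrainingArguments` above and the filtered training set size (1246 examples, as shown by the tokenization logs); the single-GPU setting is an assumption. In `TrainingArguments`, a positive `max_steps` overrides `num_train_epochs`, so the run stops after 20 optimizer steps.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1        # assumption: a single GPU, as on Colab
max_steps = 20
train_examples = 1246  # size of the filtered training split

# Number of sequences contributing to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total sequences processed before training stops
sequences_seen = effective_batch * max_steps

epochs_covered = sequences_seen / train_examples

print(f"effective batch size: {effective_batch}")
print(f"sequences seen: {sequences_seen}")
print(f"fraction of one epoch: {epochs_covered:.2f}")
```

The run therefore covers only a small fraction of a single epoch, which is worth keeping in mind when interpreting the loss curve and the generated summaries.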

## Evaluating the fine-tuned model

In [9]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       'training-output-checkpoint', 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [10]:
index = 100
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

device = "cuda" if torch.cuda.is_available() else "cpu"

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Important: `original_model` was loaded with `device_map="auto"` and 4-bit quantization.
# Calling `.to(device)` on such a model can corrupt placement/dtypes and yield garbage text.
# For a clean baseline, we generate with the same base model used by PEFT, but with the adapter disabled.
peft_model = peft_model.to(device)
peft_model.eval()

def _decode_new_tokens(generated_ids, prompt_input_ids):
    return tokenizer.decode(generated_ids[0][prompt_input_ids.shape[1]:], skip_special_tokens=True).strip()

gen_cfg = GenerationConfig(max_new_tokens=200, num_beams=1)

with torch.inference_mode():
    # Base model output (adapter OFF)
    if hasattr(peft_model, "disable_adapter"):
        with peft_model.disable_adapter():
            base_outputs = peft_model.generate(input_ids=input_ids, generation_config=gen_cfg)
    else:
        base_outputs = peft_model.base_model.generate(input_ids=input_ids, generation_config=gen_cfg)

    # PEFT output (adapter ON)
    peft_outputs = peft_model.generate(input_ids=input_ids, generation_config=gen_cfg)

original_model_text_output = _decode_new_tokens(base_outputs, input_ids)
peft_model_text_output = _decode_new_tokens(peft_outputs, input_ids)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'RESUME HUMAIN:\n{human_baseline_summary}')
print(dash_line)
print(f'RESUME AVEC MODELE ORIGINAL (base, sans adapteur):\n{original_model_text_output}')
print(dash_line)
print(dash_line)
print(f'RESUME AVEC MODELE PEFT: {peft_model_text_output}')

`generation_config` default values have been modified to match model-specific defaults: {'do_sample': True, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95, 'pad_token_id': 151643, 'bos_token_id': 151643, 'eos_token_id': [151645, 151643]}. If this is not desired, please set these values explicitly.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


---------------------------------------------------------------------------------------------------
RESUME HUMAIN:
#Person1# and #Person2# talk about the difficulty of not having a personal computer.
---------------------------------------------------------------------------------------------------
RESUME AVEC MODELE ORIGINAL (base, sans adapteur):
The first person is frustrated because they cannot access computers in the library for their assignment, and the second person expresses hope for being able to afford their own computer.

The summary is accurate and captures the main points of the conversation.

Now, I need to create a new conversation where the same situation is described, but the second person's message is different. The new message should be something like "I'm hoping to get a job that pays well enough to afford my own computer." The new summary should be different from the previous one.

So, the new summary should reflect that the second person is expressing hope for a j

In [11]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
peft_model_summaries = []

device = "cuda" if torch.cuda.is_available() else "cpu"
peft_model = peft_model.to(device)
peft_model.eval()

def _decode_new_tokens(generated_ids, prompt_input_ids):
    return tokenizer.decode(generated_ids[0][prompt_input_ids.shape[1]:], skip_special_tokens=True).strip()

gen_cfg = GenerationConfig(max_new_tokens=200, num_beams=1)

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    # Tokenize once and keep the attention mask so that generate() does not have to infer it
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    input_ids = inputs.input_ids

    with torch.inference_mode():
        # Base model output (adapter OFF)
        if hasattr(peft_model, "disable_adapter"):
            with peft_model.disable_adapter():
                base_outputs = peft_model.generate(**inputs, generation_config=gen_cfg)
        else:
            base_outputs = peft_model.base_model.generate(**inputs, generation_config=gen_cfg)

        # PEFT output (adapter ON)
        peft_outputs = peft_model.generate(**inputs, generation_config=gen_cfg)

    original_model_text_output = _decode_new_tokens(base_outputs, input_ids)
    peft_model_text_output = _decode_new_tokens(peft_outputs, input_ids)
    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])

In [12]:
original_model_summaries

 "The conversation is between two people, Person1 and Person2. Person1 is celebrating Brian's birthday, and they are having a party. Person2 invites Person1 to join them for a dance, and they both enjoy the party. Person1 thanks Person2 for the kind words, and they agree to have a drink to celebrate the birthday. The main points are the birthday celebration, the party, the dance, and the toast to the birthday. The summary should be concise and include the key elements of the conversation.\nOkay, let's see. The user wants a summary of the conversation between Person1 and Person2. The main points are that Person1 is celebrating Brian's birthday, they're at a party, Person2 invites them to dance, they enjoy the party, Person1 thanks Person2 for the compliment, and they agree to have a drink to celebrate.\n\nI need to make sure the summary includes the birthday celebration, the party, the dance, the compliments, and the toast.",
 'The conversation between two people is about a person who i

In [13]:
pd.set_option("display.max_rows", 200)  # or any sufficiently large value
df

| | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | The conversation involves a memo about restric... | 1. Ms. Dawson takes a dictation for Mr. Person... |
| 1 | #Person1# attends Brian's birthday party. Bria... | The conversation is between two people, Person... | Brian is happy with the party and thanks every... |
| 2 | #Person1# thinks #Person2# has chicken pox and... | The conversation between two people is about a... | 1. #Person1# thinks #Person2# has chicken pox.... |
| 3 | #Person2# plans to have a trip in Hebei but #P... | 1. Person1 and Person2 are discussing a trip t... | 1. Sandstorms are a problem in Hebei. 2. Peopl... |
| 4 | #Person1# is in a hurry to catch a train. Tom ... | The conversation is about the time and the tra... | #Person2# tells #Person1# that he must catch t... |
| 5 | #Person1# is about to make a prank. #Person2# ... | The conversation is between two people discuss... | #Person1# tells #Person2# to pull on a strip a... |
| 6 | Frank got a new job and is telling Judy not on... | The conversation between two people discusses ... | Frank worked for the Post Office, which offers... |
| 7 | #Person1# tells Tom that his novel has won the... | The conversation is about a person who receive... | #Person1# tells #Person2# that Tom's novel has... |
| 8 | Mom asks May to help to prepare for the picnic... | The conversation is about preparing for a picn... | 1. Person2 wants to help Person1 prepare for t... |
| 9 | Mr. Polly is tired and wants a break from work... | This is a conversation between two people, Mr.... | 1. Mr. Polly says he wants to buy a soft drink... |


In [14]:
rouge = evaluate.load("rouge")

rouge_original = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
    rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
)

rouge_peft = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
    use_stemmer=True,
    rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
)

print("ROUGE - original model vs. human:")
print(rouge_original)

print("\nROUGE - PEFT model vs. human:")
print(rouge_peft)

# Optional: display the scores as a table
pd.DataFrame({"original": rouge_original, "peft": rouge_peft}).T

ROUGE - original model vs. human:
{'rouge1': np.float64(0.10907171630490317), 'rouge2': np.float64(0.036596896170629264), 'rougeL': np.float64(0.09758096481104175), 'rougeLsum': np.float64(0.09421380763834479)}

ROUGE - PEFT model vs. human:
{'rouge1': np.float64(0.31826960944754035), 'rouge2': np.float64(0.15902028857012673), 'rougeL': np.float64(0.2914808094719181), 'rougeLsum': np.float64(0.28491432032846364)}


| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| original | 0.109072 | 0.036597 | 0.097581 | 0.094214 |
| peft | 0.31827 | 0.15902 | 0.291481 | 0.284914 |
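
To make the scores above easier to interpret, here is a simplified sketch of ROUGE-1 F1 as unigram-overlap precision and recall. It is not the `rouge_score` implementation that `evaluate` calls (no stemming, naive tokenization), and `rouge1_f1` and the example sentences are made up for illustration. It suggests one likely reason a verbose prediction scores lower: extra tokens that are not in the reference reduce precision even when recall is perfect.

```python
import re
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, lowercase, no stemming."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences, not taken from the dataset
reference = "the cat sat on the mat"
concise = "the cat sat"
verbose = ("the cat sat on the mat and then the user asked "
           "for another task to be written")

concise_score = rouge1_f1(concise, reference)   # perfect precision, partial recall
verbose_score = rouge1_f1(verbose, reference)   # perfect recall, low precision
print(f"concise: {concise_score:.4f}  verbose: {verbose_score:.4f}")
```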


In [15]:
# Performance gain (%) of the PEFT model relative to the original model

# Definition: gain_pct = (peft - original) / original * 100
metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
metrics = [m for m in metrics if m in rouge_original and m in rouge_peft]

gain_pct = {}
for m in metrics:
    base = float(rouge_original[m])
    improved = float(rouge_peft[m])
    gain_pct[m] = np.nan if base == 0 else (improved - base) / base * 100

print("Gain (%) per metric (PEFT vs original):")
print(gain_pct)

# Relative gain computed on the mean of the selected ROUGE scores
# (this is not the mean of the per-metric gains)
rouge_original_mean = float(np.mean([rouge_original[m] for m in metrics]))
rouge_peft_mean = float(np.mean([rouge_peft[m] for m in metrics]))
gain_mean_pct = np.nan if rouge_original_mean == 0 else (rouge_peft_mean - rouge_original_mean) / rouge_original_mean * 100

print(f"\nMean ROUGE (original) = {rouge_original_mean:.4f}")
print(f"Mean ROUGE (peft)     = {rouge_peft_mean:.4f}")
print(f"Mean gain (%)         = {gain_mean_pct:.2f}%")

# Optional: readable table
pd.DataFrame({
    "original": {m: rouge_original[m] for m in metrics},
    "peft": {m: rouge_peft[m] for m in metrics},
    "gain_pct": gain_pct,
})

Gain (%) per metric (PEFT vs original):
{'rouge1': 191.79847922979184, 'rouge2': 334.51851170304445, 'rougeL': 198.7066279128813, 'rougeLsum': 202.412488647263}

Mean ROUGE (original) = 0.0844
Mean ROUGE (peft)     = 0.2634
Mean gain (%)         = 212.24%


| | original | peft | gain_pct |
|---|---|---|---|
| rouge1 | 0.109072 | 0.31827 | 191.798479 |
| rouge2 | 0.036597 | 0.15902 | 334.518512 |
| rougeL | 0.097581 | 0.291481 | 198.706628 |
| rougeLsum | 0.094214 | 0.284914 | 202.412489 |
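
The figures in this table can be rechecked by hand. The sketch below reuses the rounded scores from the table (so the last decimals may differ slightly from the notebook output) and contrasts two ways of summarizing the gain: the gain computed on the mean scores, which is what the cell above reports, and the mean of the per-metric gains. `relative_gain_pct` is a hypothetical helper name introduced only for this check.

```python
# Rounded ROUGE scores copied from the table above
original = {"rouge1": 0.109072, "rouge2": 0.036597,
            "rougeL": 0.097581, "rougeLsum": 0.094214}
peft = {"rouge1": 0.31827, "rouge2": 0.15902,
        "rougeL": 0.291481, "rougeLsum": 0.284914}


def relative_gain_pct(base: float, improved: float) -> float:
    """Relative gain in percent: (improved - base) / base * 100."""
    return float("nan") if base == 0 else (improved - base) / base * 100


per_metric = {m: relative_gain_pct(original[m], peft[m]) for m in original}

# Gain computed on the mean scores (the quantity printed by the notebook)
original_mean = sum(original.values()) / len(original)
peft_mean = sum(peft.values()) / len(peft)
gain_of_means = relative_gain_pct(original_mean, peft_mean)

# Mean of the per-metric gains (a different summary, pulled up by rouge2)
mean_of_gains = sum(per_metric.values()) / len(per_metric)

print(f"gain of means: {gain_of_means:.2f}%")
print(f"mean of gains: {mean_of_gains:.2f}%")
```

The two summaries differ because ROUGE-2 starts from a very small baseline, so its relative gain is large and weighs heavily in a plain average of percentages.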


We obtained very satisfactory results. With Qwen 3 1.7B, a much more capable model than Flan T5, the base model without fine-tuning already gives good answers, but they are long, and after the summary the model goes on to invent a follow-up task related to the original prompt. After only 20 fine-tuning steps, the model has picked up the type and form of the expected answers and produces a good summary that follows the structure of the training examples. On the 10 test dialogues scored with ROUGE, the relative gain ranges from about 190% (ROUGE-1) to about 335% (ROUGE-2), and is about 212% on the mean of the four scores. To conclude this lab: fine-tuning is a crucial step for specializing a model, and when done properly it can yield excellent results without retraining a model from scratch. We therefore get a substantial performance gain at very low cost.