# Post-Fine-Tuning Evaluation

This notebook evaluates the performance of the fine-tuned models on the StereoSet dataset. The goal is to assess the reduction of social bias using intrasentence and intersentence tasks, and to compare different fine-tuning strategies.


# 📊 Post-Fine-Tuning Evaluation (Intrasentence)
This notebook reuses the protocol from `01_dataset_exploration.ipynb`, but applies it to the **fine-tuned DistilBERT** model to re-evaluate social bias on the **intrasentence** subset of StereoSet.
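
The protocol depends on every intrasentence example offering one candidate sentence per gold label. A minimal sketch of a sanity check for that, assuming the `sentences` / `gold_label` layout read by the cells in this notebook. `check_example` and `EXPECTED_LABELS` are hypothetical names, and the record is invented for illustration, not taken from StereoSet.

```python
# Gold labels this notebook expects to find on each intrasentence example.
EXPECTED_LABELS = {"stereotype", "anti-stereotype", "unrelated"}


def check_example(example):
    """Return True if the example has exactly one sentence per gold label."""
    labels = [s["gold_label"] for s in example.get("sentences", [])]
    return len(labels) == len(EXPECTED_LABELS) and set(labels) == EXPECTED_LABELS


# Invented record, for illustration only.
toy_example = {
    "target": "TARGET",
    "bias_type": "profession",
    "sentences": [
        {"sentence": "Candidate A.", "gold_label": "stereotype"},
        {"sentence": "Candidate B.", "gold_label": "anti-stereotype"},
        {"sentence": "Candidate C.", "gold_label": "unrelated"},
    ],
}

print(check_example(toy_example))
```

Running such a check over the loaded examples before scoring would surface malformed records early instead of skewing the label counts.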


In [None]:
import json
from pathlib import Path
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the fine-tuned model
model_path = "finetuned_distilbert_stereo"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path)
model.eval()

In [None]:
path = Path("dev.json")
with path.open("r", encoding="utf-8") as f:
    full_data = json.load(f)

intrasentence_examples = full_data["data"]["intrasentence"]

In [None]:
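def score_sentence(sentence):
    # DistilBERT is a masked LM: the logits at position i predict token i, not
    # token i + 1, so causal "shift by one" scoring does not apply. This version
    # computes a pseudo-log-likelihood: mask each token in turn and sum the
    # log-probability the model assigns to the original token.
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    positions = torch.arange(1, input_ids.shape[0] - 1)  # skip [CLS] and [SEP]
    if len(positions) == 0:
        return 0.0
    rows = torch.arange(len(positions))
    batch = input_ids.repeat(len(positions), 1)  # one copy of the sentence per masked position
    batch[rows, positions] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(input_ids=batch).logits
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs[rows, positions, input_ids[positions]].sum().item()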

In [None]:
from tqdm import tqdm

results_intra_post = []

for ex in tqdm(intrasentence_examples):
    target = ex["target"]
    bias = ex["bias_type"]
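    # Use the example's own context field when the file provides one;
    # otherwise fall back to the first candidate sentence, as before.
    context = ex.get("context", ex["sentences"][0]["sentence"])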

    scored = []
    for s in ex["sentences"]:
        sent = s["sentence"]
        label = s["gold_label"]
        score = score_sentence(sent)
        scored.append((label, score, sent))

    scored.sort(key=lambda x: x[1], reverse=True)
    top_label = scored[0][0]

    results_intra_post.append({
        "bias_type": bias,
        "target": target,
        "top_label": top_label,
        "all_scores": scored,
        "context": context
    })

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

count = Counter(r["top_label"] for r in results_intra_post)
print("Sentence ranking results (after fine-tuning):")
for label, n in count.items():
    print(f" - {label}: {n}")

# Bar chart with a fixed label order, so each colour always maps to the same label
color_by_label = {"stereotype": "red", "anti-stereotype": "green", "unrelated": "gray"}
labels = [label for label in color_by_label if label in count]
counts = [count[label] for label in labels]

plt.figure(figsize=(7, 5))
plt.bar(labels, counts, color=[color_by_label[label] for label in labels])
plt.title("Preferred sentence type after fine-tuning (intrasentence)")
plt.ylabel("Number of examples")
plt.xlabel("Preferred sentence type")
plt.grid(axis="y")
plt.show()

In [None]:
from collections import defaultdict

by_bias_type = defaultdict(int)
total_by_type = defaultdict(int)

for r in results_intra_post:
    bias = r["bias_type"]
    total_by_type[bias] += 1
    if r["top_label"] == "stereotype":
        by_bias_type[bias] += 1

bias_types = sorted(total_by_type.keys())
rates = [100 * by_bias_type[b] / total_by_type[b] for b in bias_types]

plt.figure(figsize=(8, 5))
# Bias types are unordered categories, so use bars rather than a connected line
plt.bar(bias_types, rates, color="red")
plt.title("Rate of preferred stereotypes after fine-tuning (by bias type)")
plt.ylabel("Percentage (%)")
plt.xlabel("Bias Type")
plt.grid(True)
plt.show()

## ✅ Conclusion
The fine-tuned model's preferences have shifted: it selects stereotypical sentences slightly less often, particularly for some bias types. This supports fine-tuning on counter-stereotypical pairs as a way to mitigate the biases present in pretrained language models.
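
To put a number on that shift, the per-bias-type stereotype rates can be compared before and after fine-tuning. The sketch below assumes the baseline run from `01_dataset_exploration.ipynb` was saved in the same record format as `results_intra_post`. The helper names and the toy inputs are hypothetical and invented for illustration; they are not results from this project.

```python
from collections import defaultdict


def stereotype_rates(results):
    """Percentage of examples per bias type whose top-ranked sentence is the stereotype."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["bias_type"]] += 1
        hits[r["bias_type"]] += r["top_label"] == "stereotype"
    return {b: 100 * hits[b] / totals[b] for b in totals}


def rate_deltas(pre_results, post_results):
    """Post minus pre stereotype rate, for bias types present in both runs."""
    pre = stereotype_rates(pre_results)
    post = stereotype_rates(post_results)
    return {b: post[b] - pre[b] for b in sorted(pre.keys() & post.keys())}


# Invented toy inputs, for illustration only.
toy_pre = [
    {"bias_type": "gender", "top_label": "stereotype"},
    {"bias_type": "gender", "top_label": "stereotype"},
]
toy_post = [
    {"bias_type": "gender", "top_label": "stereotype"},
    {"bias_type": "gender", "top_label": "anti-stereotype"},
]

print(rate_deltas(toy_pre, toy_post))  # {'gender': -50.0}
```

A negative delta means the stereotype was ranked first less often after fine-tuning. With few examples per bias type, small deltas may not be meaningful.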

## Summary

This evaluation measures the effectiveness of different debiasing strategies. By analyzing stereotype preference scores across bias types, we can identify which training methods led to improved fairness and reduced stereotypical behavior in the model.
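
Beyond the top-label counts, StereoSet-style summary scores can be derived from the stored `all_scores` tuples. The sketch below follows the stereotype score (ss), language modeling score (lms), and ICAT definitions from the StereoSet benchmark as they are commonly described. The official evaluation script may differ in its exact counting, so treat this as an approximation. `stereoset_scores` is a hypothetical helper and the toy scores are invented.

```python
REQUIRED = {"stereotype", "anti-stereotype", "unrelated"}


def stereoset_scores(results):
    """Compute ss, lms and ICAT (all on a 0-100 scale) from per-example scores."""
    n = pro_over_anti = related_over_unrelated = 0
    for r in results:
        by_label = {label: score for label, score, _ in r["all_scores"]}
        if not REQUIRED <= by_label.keys():
            continue  # skip records missing one of the three candidates
        n += 1
        pro_over_anti += by_label["stereotype"] > by_label["anti-stereotype"]
        related_over_unrelated += by_label["stereotype"] > by_label["unrelated"]
        related_over_unrelated += by_label["anti-stereotype"] > by_label["unrelated"]
    if n == 0:
        raise ValueError("no complete examples to score")
    ss = 100 * pro_over_anti / n                 # ideal value: 50
    lms = 100 * related_over_unrelated / (2 * n)  # ideal value: 100
    icat = lms * min(ss, 100 - ss) / 50           # ideal value: 100
    return {"ss": ss, "lms": lms, "icat": icat}


# Invented toy scores, for illustration only.
toy_results = [
    {"all_scores": [("stereotype", -1.0, "A"), ("anti-stereotype", -2.0, "B"), ("unrelated", -5.0, "C")]},
    {"all_scores": [("anti-stereotype", -1.0, "B"), ("stereotype", -3.0, "A"), ("unrelated", -4.0, "C")]},
]

print(stereoset_scores(toy_results))  # {'ss': 50.0, 'lms': 100.0, 'icat': 100.0}
```

Reporting ss together with lms guards against a model that looks unbiased only because it ranks sentences close to randomly.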
