# Balanced Augmented Dataset Generation

This notebook generates a balanced dataset by ensuring a 1:1 ratio between stereotypical and antistereotypical examples. The balanced dataset is designed to help mitigate bias during fine-tuning by preventing overrepresentation of stereotypes.


In [2]:
from datasets import load_from_disk, Dataset
import random

# Path to the previously prepared dataset
dataset_path = "C:/Users/sarah/Documents/ERASMUS/NLP/augmented_dataset"
dataset = load_from_disk(dataset_path)

# Merge train + test into a single pool
examples = list(dataset["train"]) + list(dataset["test"])


In [3]:
stereotypes = [ex["text"] for ex in examples if ex["label"] == 0]
antistereotypes = [ex["text"] for ex in examples if ex["label"] == 1]

print(f"Number of stereotypes     : {len(stereotypes)}")
print(f"Number of antistereotypes : {len(antistereotypes)}")


Number of stereotypes     : 2106
Number of antistereotypes : 6318
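
The counts above can be turned into an explicit imbalance figure before any balancing is attempted. The following is a minimal sketch using only the standard library; `class_balance` is a hypothetical helper that is not part of the notebook, and the toy label list merely reproduces the two printed counts (0 = stereotype, 1 = antistereotype, as in the cell above).

```python
from collections import Counter

def class_balance(labels):
    """Return per-label counts and the majority/minority size ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# Toy labels reproducing the counts printed above (hypothetical stand-in for the real pool)
toy_labels = [0] * 2106 + [1] * 6318
counts, ratio = class_balance(toy_labels)
print(dict(counts), ratio)
```

For these counts the ratio is 3.0, i.e. three antistereotypical sentences for every stereotypical one in the pooled data.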


In [4]:
from transformers import pipeline

# Use t5-base
paraphraser = pipeline("text2text-generation", model="t5-base", max_new_tokens=64)


In [5]:
paraphrased_antis = []

# Choose how many sentences to paraphrase for a quick test (e.g. 100)
subset = antistereotypes[:100]

for text in subset:
    try:
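        # Note: "paraphrase:" is not one of the task prefixes the original t5-base
        # checkpoint was trained on, so outputs may be near-copies or noisy;
        # a paraphrase-tuned checkpoint may work better.
        result = paraphraser(f"paraphrase: {text}", num_return_sequences=3, do_sample=True, top_k=120)
        for r in result:
            paraphrased_antis.append(r["generated_text"])
    except Exception as e:
        # Report the failing sentence together with the error itself
        print(f"Error on: {text} ({e})")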
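
Sampled generations can come back empty, duplicated, or identical to the input sentence, so a light filter before appending to `paraphrased_antis` may be useful. The sketch below is one possible filter, not the notebook's method; `clean_paraphrases` and the toy sentences are hypothetical.

```python
def clean_paraphrases(source, candidates):
    """Drop empty outputs, copies of the source, and duplicates (case-insensitive)."""
    seen = {source.strip().lower()}  # treat the source itself as already seen
    kept = []
    for cand in candidates:
        key = cand.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(cand.strip())
    return kept

# Toy example: one verbatim copy, one valid paraphrase, one case variant, one blank
cleaned = clean_paraphrases(
    "The cat sat on the mat.",
    ["The cat sat on the mat.", "A cat was sitting on the mat.",
     "a cat was sitting on the mat.", "   "],
)
print(cleaned)
```

In the loop above, this would be applied to `[r["generated_text"] for r in result]` with `text` as the source, which means fewer than three paraphrases may survive per sentence.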


In [6]:
# Keep one copy each of len(subset) randomly sampled stereotypes
final_examples = [{"text": text} for text in random.sample(stereotypes, k=len(subset))]

# Add the paraphrased antistereotypes (up to 3 per source sentence, see num_return_sequences)
final_examples += [{"text": text} for text in paraphrased_antis]

# Shuffle
random.shuffle(final_examples)
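
The cell above keeps only the `text` field and can add up to three paraphrases per sampled antistereotype. If a strict 1:1 mix with explicit labels is wanted, one option is to cap both classes at the size of the smaller one. This is a minimal sketch under that assumption; `build_balanced_records` and the seed are hypothetical, and the label convention (0 = stereotype, 1 = antistereotype) follows the filtering cell earlier in the notebook.

```python
import random

def build_balanced_records(stereo_texts, anti_texts, seed=0):
    """Sample equal numbers from both classes and keep an explicit label per record."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    n = min(len(stereo_texts), len(anti_texts))  # size of the smaller class
    records = [{"text": t, "label": 0} for t in rng.sample(stereo_texts, n)]
    records += [{"text": t, "label": 1} for t in rng.sample(anti_texts, n)]
    rng.shuffle(records)
    return records

# Toy stand-ins for `stereotypes` and `paraphrased_antis`
toy_records = build_balanced_records(["s1", "s2"], ["p1", "p2", "p3", "p4", "p5", "p6"])
print(len(toy_records), sum(r["label"] for r in toy_records))
```

Keeping the label in each record also makes it possible to check the class ratio again after any later split.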


In [7]:
# Convert to a Hugging Face Dataset
final_dataset = Dataset.from_list(final_examples)

# Split into train/test
final_dataset = final_dataset.train_test_split(test_size=0.1)

# Save for fine-tuning
final_dataset.save_to_disk("balanced_augmented_dataset")
print("Balanced dataset saved to 'balanced_augmented_dataset'")


Saving the dataset (0/1 shards):   0%|          | 0/360 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/40 [00:00<?, ? examples/s]

Balanced dataset saved to 'balanced_augmented_dataset'
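
A purely random 90/10 split does not guarantee that both classes appear in the same proportion in the train and test parts. The sketch below shows a stratified split in plain Python, assuming each record carries a `label` field (which the records saved above do not); `stratified_split` is a hypothetical helper, not a `datasets` API.

```python
import random
from collections import defaultdict

def stratified_split(records, test_size=0.1, seed=0):
    """Split records so each label contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]  # copy so the input order is not modified
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy records: 20 per label
toy = [{"text": f"t{i}", "label": i % 2} for i in range(40)]
train, test = stratified_split(toy)
print(len(train), len(test))
```

With 20 toy records per label and `test_size=0.1`, each label contributes 2 records to the test part, giving 36 training and 4 test records. The `datasets` library's `train_test_split` also accepts a `stratify_by_column` argument for `ClassLabel` columns, which may be preferable when the data is already a `Dataset`.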
