# Data Preparation

This notebook details the data preprocessing steps applied to the StereoSet dataset. These steps include extraction, filtering, reformatting, and preparation of a dataset suitable for training and evaluation purposes.


FINE-TUNING

CREATE THE DATASET

In [3]:
from datasets import Dataset
import random
import json
from pathlib import Path

# Load the data
path = Path("C:/Users/sarah/Documents/ERASMUS/NLP/StereoSet/data/dev.json")
with open(path, "r", encoding="utf-8") as f:
    full_data = json.load(f)

# Extract the intrasentence examples
intrasentence_examples = full_data["data"]["intrasentence"]

examples_ft = []

for ex in intrasentence_examples:
    # Find the stereotype and anti-stereotype sentences of this example
    stereotype = None
    antistereotype = None
    for s in ex["sentences"]:
        if s["gold_label"] == "stereotype":
            stereotype = s["sentence"]
        elif s["gold_label"] == "anti-stereotype":
            antistereotype = s["sentence"]

    # Keep only complete pairs: label 0 = stereotype, label 1 = anti-stereotype
    if stereotype and antistereotype:
        examples_ft.append({"text": stereotype, "label": 0})
        examples_ft.append({"text": antistereotype, "label": 1})

# Shuffle
random.shuffle(examples_ft)

# Convert to a Hugging Face Dataset and hold out 10% as a test split
dataset = Dataset.from_list(examples_ft)
dataset = dataset.train_test_split(test_size=0.1)
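The pairing logic above can be checked in isolation on a tiny in-memory sample. The sketch below is only an illustration: `build_pairs` and `sample_examples` are hypothetical names, the sentences are placeholders rather than StereoSet content, and the seeded shuffle is an addition for reproducibility that the cell above does not use.

```python
import random

# Hypothetical miniature records mimicking the intrasentence layout used above;
# the sentences are placeholders, not taken from the dataset.
sample_examples = [
    {
        "sentences": [
            {"sentence": "placeholder stereotype sentence", "gold_label": "stereotype"},
            {"sentence": "placeholder anti-stereotype sentence", "gold_label": "anti-stereotype"},
            {"sentence": "placeholder unrelated sentence", "gold_label": "unrelated"},
        ]
    },
    {
        # Incomplete record: no anti-stereotype sentence, so it is skipped.
        "sentences": [
            {"sentence": "another stereotype sentence", "gold_label": "stereotype"},
        ]
    },
]


def build_pairs(examples):
    """Return labelled rows (0 = stereotype, 1 = anti-stereotype) for complete pairs."""
    rows = []
    for ex in examples:
        # Map each gold label to its sentence (the last one wins, as in the loop above).
        by_label = {s["gold_label"]: s["sentence"] for s in ex["sentences"]}
        stereo = by_label.get("stereotype")
        anti = by_label.get("anti-stereotype")
        if stereo and anti:
            rows.append({"text": stereo, "label": 0})
            rows.append({"text": anti, "label": 1})
    return rows


rows = build_pairs(sample_examples)
random.Random(42).shuffle(rows)  # seeded shuffle (assumption: added for reproducibility)
print(len(rows), sorted(r["label"] for r in rows))
```

Because every kept example contributes exactly one row per label, the resulting dataset is balanced between the two classes before the train/test split.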


Load the DistilBERT tokenizer

In [4]:
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [7]:
# This function converts each example into token IDs the model can consume.
# - padding="max_length": pads every sequence to the model's maximum length
# - truncation=True: cuts sequences that exceed that length
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)
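To make the effect of these two options concrete, here is a minimal pure-Python sketch of fixed-length padding and truncation. `pad_or_truncate` is a hypothetical helper, not part of `transformers`, and it simplifies the real behavior: the actual tokenizer keeps the special tokens in place when truncating, whereas this sketch simply slices the list. The pad ID of 0 is assumed here.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id.

    Returns the fixed-length IDs and an attention mask
    (1 = real token, 0 = padding).
    """
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # real tokens are attended to
    n_pad = max_length - len(ids)         # padding needed to reach max_length
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence is padded up to the target length.
ids, mask = pad_or_truncate([101, 4083, 2055, 102], max_length=6)
print(ids)
print(mask)

# A long sequence is cut down to the target length.
short_ids, short_mask = pad_or_truncate([101, 4083, 2055, 102], max_length=3)
print(short_ids)
print(short_mask)
```

Padding every example to the maximum length keeps all rows the same size, at the cost of storing many padding tokens for short sentences.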


In [8]:
# Apply the tokenization function to the whole dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/3790 [00:00<?, ? examples/s]

Map:   0%|          | 0/422 [00:00<?, ? examples/s]

In [9]:
tokenized_dataset["train"][0]


{'text': 'Learning about muslim beliefs, you will find out that it is a evil religion.',
 'label': 0,
 'input_ids': [101, 4083, 2055, 5152, 9029, 1010, 2017, 2097, 2424, 2041, 2008, 2009, 2003, 1037, 4763, 4676, 1012, 102,
  0, 0, 0, ...  (padding zeros continue; output truncated)

In [10]:
# Save the tokenized dataset so it can be reloaded later
tokenized_dataset.save_to_disk("tokenized_dataset_intrasentence")


Saving the dataset (0/1 shards):   0%|          | 0/3790 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/422 [00:00<?, ? examples/s]

## Summary

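The intrasentence examples have been extracted, labeled (0 = stereotype, 1 = anti-stereotype), split into 3,790 training and 422 test examples, and tokenized with the DistilBERT tokenizer. The tokenized dataset has been saved with `save_to_disk` to the `tokenized_dataset_intrasentence` directory in the Hugging Face Arrow format, and can be reloaded with `datasets.load_from_disk` for model training or bias analysis.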
