# Fine-Tuning on Augmented Dataset

This notebook fine-tunes `distilbert-base-uncased` as a two-label sequence classifier on an augmented version of the StereoSet dataset. The aim is to improve the model's ability to produce unbiased predictions by incorporating more antistereotypical examples.


In [1]:
from datasets import load_from_disk

dataset_path = "C:/Users/sarah/Documents/ERASMUS/NLP/augmented_dataset"
dataset = load_from_disk(dataset_path)
dataset["train"] = dataset["train"].select(range(100))  # small subset for a quick run (or even 50)
dataset["test"] = dataset["test"].select(range(20))



In [2]:
print(dataset)
print(dataset["train"][0])

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 100
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 20
    })
})
{'text': 'The engineer is nerdy.', 'label': 0}
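
Before training, it can help to check how the two labels are balanced in the subset, since taking the first 100 rows may not preserve the balance of the full dataset. The sketch below uses a hypothetical helper, `label_distribution`, and a made-up label list; with the real data one would pass `dataset["train"]["label"]` instead. What each label value means is not stated in this notebook.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Illustrative labels only; with the notebook's data, use dataset["train"]["label"].
example_labels = [0, 1, 0, 0, 1]
print(label_distribution(example_labels))
```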


In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

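def tokenize_function(example):
    # padding="max_length" pads every example to the tokenizer's maximum length;
    # truncation=True cuts anything longer than that limit.
    return tokenizer(example["text"], padding="max_length", truncation=True)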

tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_train = tokenized_dataset["train"]
tokenized_eval = tokenized_dataset["test"]
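
With `padding="max_length"` and no explicit `max_length`, each example is padded to the tokenizer's maximum length, which is 512 tokens for `distilbert-base-uncased`. For short sentences like the one printed above, most positions are then padding. The following sketch estimates that overhead; `padding_overhead` and the token lengths are hypothetical, chosen only to illustrate the arithmetic.

```python
def padding_overhead(token_lengths, max_length=512):
    """Fraction of positions that are padding when every example is padded to max_length."""
    real = sum(min(n, max_length) for n in token_lengths)
    total = max_length * len(token_lengths)
    return 1 - real / total

# Hypothetical token counts for short sentences (including special tokens).
lengths = [8, 10, 7, 12]
print(f"padded to 512:       {padding_overhead(lengths):.1%} padding")
print(f"padded to batch max: {padding_overhead(lengths, max(lengths)):.1%} padding")
```

Dropping `padding="max_length"` from `tokenize_function` and letting the `DataCollatorWithPadding` used later in the notebook pad each batch to its longest example is one way to get the second behavior.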



In [None]:
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    DataCollatorWithPadding,
    Trainer
)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results_augmented",
    evaluation_strategy="epoch",
    save_strategy="no",  # avoid writing a checkpoint every epoch
    per_device_train_batch_size=32,     # larger batches speed things up
    per_device_eval_batch_size=32,
    num_train_epochs=1,                 # a single epoch only
    learning_rate=3e-5,
    weight_decay=0.01,
    logging_dir="./logs_augmented",
    logging_steps=20,
    report_to="none"
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
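
The batch size and epoch count above determine how many optimizer steps the run takes. As a quick check, assuming a single device and no gradient accumulation (the `TrainingArguments` defaults), the arithmetic for the 100-example training subset looks like this; `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs=1):
    """Optimizer updates for one device, no gradient accumulation, last partial batch kept."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs

print(optimizer_steps(100, 32))  # 4, since 100 / 32 rounds up to 4 batches
```

Because `logging_steps=20` is larger than this step count, a run of this size would be expected to finish before any periodic training-loss log is emitted.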


## Training the Model

Fine-tuning is performed on the augmented dataset using the Hugging Face `Trainer` API.

In [5]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6377898454666138, 'eval_runtime': 11.2704, 'eval_samples_per_second': 1.775, 'eval_steps_per_second': 0.089, 'epoch': 1.0}
{'train_runtime': 1337.2365, 'train_samples_per_second': 0.075, 'train_steps_per_second': 0.003, 'train_loss': 0.6181116104125977, 'epoch': 1.0}


TrainOutput(global_step=4, training_loss=0.6181116104125977, metrics={'train_runtime': 1337.2365, 'train_samples_per_second': 0.075, 'train_steps_per_second': 0.003, 'train_loss': 0.6181116104125977, 'epoch': 1.0})
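
To put these losses in context: a two-class classifier that always predicts 50/50 has a cross-entropy of ln 2 ≈ 0.693. The sketch below computes that baseline and compares it with the reported evaluation loss. It assumes the loss is the standard cross-entropy in nats, which is what a two-label classification head with integer labels uses by default; `chance_cross_entropy` is a hypothetical helper.

```python
import math

def chance_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes labels."""
    return math.log(num_classes)

baseline = chance_cross_entropy(2)
eval_loss = 0.6377898454666138  # value reported in the output above
print(f"chance baseline: {baseline:.4f}")
print(f"eval loss:       {eval_loss:.4f}")
print(f"improvement:     {baseline - eval_loss:.4f}")
```

The evaluation loss sits only slightly below this baseline, which is not surprising after four optimizer steps on 100 examples. If the labels are imbalanced, a baseline that predicts the class frequencies would be the fairer comparison.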

In [6]:
trainer.save_model("finetuned_distilbert_augmented")
tokenizer.save_pretrained("finetuned_distilbert_augmented")


('finetuned_distilbert_augmented\\tokenizer_config.json',
 'finetuned_distilbert_augmented\\special_tokens_map.json',
 'finetuned_distilbert_augmented\\vocab.txt',
 'finetuned_distilbert_augmented\\added_tokens.json',
 'finetuned_distilbert_augmented\\tokenizer.json')

## Summary

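The model was fine-tuned for one epoch on a 100-example subset of the augmented dataset, which includes additional antistereotypical examples. It reached a training loss of 0.618 and an evaluation loss of 0.638 on 20 test examples. The model saved in `finetuned_distilbert_augmented` will be evaluated to assess improvements in bias mitigation.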
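
The training run above reports only loss. One way the planned evaluation could also report accuracy is to pass a `compute_metrics` function to `Trainer`, which calls it with the model's logits and the reference labels. The function below is a minimal sketch written in plain Python so it runs without extra dependencies; the logits and labels in the usage line are made up.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); accepts nested lists or NumPy arrays."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}

# Made-up logits and labels, only to show the input and output shape.
print(compute_metrics(([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]], [0, 1, 1])))
```

It would be wired in as `Trainer(..., compute_metrics=compute_metrics)`. Accuracy alone does not measure bias, so this would complement a bias-specific metric, not replace it.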
