# Fine-Tuning on StereoSet (Google Colab Version)

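This notebook replicates, in Google Colab, the fine-tuning of a pretrained masked language model (DistilBERT) on the StereoSet dataset. The model is fine-tuned as a binary sequence classifier that labels each intrasentence example as stereotype (0) or anti-stereotype (1). It is intended for a GPU runtime: the first cell pins `transformers` and `accelerate`, and the remaining packages (`torch`, `datasets`) are assumed to be preinstalled.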


In [None]:
!pip install transformers==4.36.2 accelerate==0.24.1 -U


Collecting transformers==4.36.2
  Using cached transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Collecting accelerate==0.24.1
  Using cached accelerate-0.24.1-py3-none-any.whl.metadata (18 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.36.2)
  Using cached tokenizers-0.15.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.10.0->accelerate==0.24.1)
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.10.0->accelerate==0.24.1)
  Using cached nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.10.0->accelerate==0.24.1)
  Using cached nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.10.0->accelerate==0

In [None]:
!pip uninstall -y peft


Found existing installation: peft 0.15.2
Uninstalling peft-0.15.2:
  Successfully uninstalled peft-0.15.2


In [None]:
import transformers
print(transformers.__version__)  # should print 4.36.2


4.36.2
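
The version check above covers only `transformers`. A minimal sketch of checking both pins at once with the standard library follows; the helper name `check_pins` is hypothetical. It reads installed package metadata rather than the imported module, so after a reinstall it can disagree with `transformers.__version__` until the runtime is restarted.

```python
from importlib import metadata

# Pins taken from the install cell at the top of the notebook.
EXPECTED = {"transformers": "4.36.2", "accelerate": "0.24.1"}

def check_pins(expected):
    """Return {package: (expected, found)} for every package whose installed
    version differs from the pin; found is None when it is not installed."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches

for name, (wanted, found) in check_pins(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means both pins match the installed versions.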




In [None]:
from google.colab import files
uploaded = files.upload()


Saving dev.json to dev.json


In [None]:
import json
import random
from pathlib import Path
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments
)
import torch




In [None]:
# Load the uploaded dev.json file
path = Path("dev.json")
with open(path, "r") as f:
    full_data = json.load(f)

# Extract the intrasentence examples
intrasentence_examples = full_data["data"]["intrasentence"]
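
The loading cell assumes that the uploaded file has a `data` → `intrasentence` list whose entries carry `sentences` with `sentence` and `gold_label` fields. A hedged sketch of a loader that fails early with a readable message when that layout is missing is shown here; `load_intrasentence` is a hypothetical helper, and it checks only the fields this notebook actually reads.

```python
import json
from pathlib import Path

def load_intrasentence(path):
    """Load a StereoSet-style JSON file and return its intrasentence examples.

    Assumed layout (only the fields this notebook uses):
    {"data": {"intrasentence": [{"sentences": [{"sentence": ..., "gold_label": ...}]}]}}
    """
    with Path(path).open("r", encoding="utf-8") as f:
        payload = json.load(f)
    try:
        examples = payload["data"]["intrasentence"]
    except (KeyError, TypeError) as err:
        raise ValueError(f"{path}: missing data.intrasentence") from err
    for i, ex in enumerate(examples):
        for s in ex.get("sentences", []):
            if "sentence" not in s or "gold_label" not in s:
                raise ValueError(f"{path}: example {i} has a malformed sentence entry")
    return examples
```

Calling `load_intrasentence("dev.json")` would return the same list as the cell above when the file has the expected shape.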


In [None]:
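examples_ft = []

# Keep one stereotype and one anti-stereotype sentence per example;
# sentences with any other gold label are ignored.
for ex in intrasentence_examples:
    stereotype = None
    antistereotype = None
    for s in ex["sentences"]:
        if s["gold_label"] == "stereotype":
            stereotype = s["sentence"]
        elif s["gold_label"] == "anti-stereotype":
            antistereotype = s["sentence"]
    # Skip examples that lack either sentence so the two classes stay balanced.
    if stereotype and antistereotype:
        examples_ft.append({"text": stereotype, "label": 0})      # 0 = stereotype
        examples_ft.append({"text": antistereotype, "label": 1})  # 1 = anti-stereotype

# Shuffle, then hold out 10% of the rows for evaluation.
# Neither step is seeded, so the split differs from run to run.
random.shuffle(examples_ft)
dataset = Dataset.from_list(examples_ft)
dataset = dataset.train_test_split(test_size=0.1)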
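
Because the shuffle and the split are unseeded, each run trains and evaluates on different rows. The following is a sketch of the same pairing logic with a fixed seed, not the code that produced the results in this notebook; `build_pairs` and the seed value 42 are illustrative assumptions.

```python
import random

# Same label encoding as the cell above.
LABELS = {"stereotype": 0, "anti-stereotype": 1}

def build_pairs(examples, seed=42):
    """Turn intrasentence examples into shuffled {"text", "label"} rows.

    Examples lacking either a stereotype or an anti-stereotype sentence are
    skipped, mirroring the original loop. `seed` is an illustrative default.
    """
    rows = []
    for ex in examples:
        by_label = {s["gold_label"]: s["sentence"] for s in ex["sentences"]}
        if all(by_label.get(name) for name in LABELS):
            rows.extend({"text": by_label[name], "label": idx} for name, idx in LABELS.items())
    # A private Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(rows)
    return rows
```

The rows could then be split reproducibly with `Dataset.from_list(rows).train_test_split(test_size=0.1, seed=42)`, since `train_test_split` accepts a `seed` argument.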


In [None]:
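model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(example):
    # padding="max_length" pads every sentence to the tokenizer's maximum length,
    # so the DataCollatorWithPadding used later has nothing left to pad.
    return tokenizer(example["text"], padding="max_length", truncation=True)

# batched=True passes lists of texts to the tokenizer in one call.
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_train = tokenized_dataset["train"]
tokenized_test = tokenized_dataset["test"]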


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/3790 [00:00<?, ? examples/s]

Map:   0%|          | 0/422 [00:00<?, ? examples/s]
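
The tokenization cell pads every sentence with `padding="max_length"`, which for `distilbert-base-uncased` means 512 positions per sentence. The toy calculation here sketches how much of that is padding compared with padding each batch to its own longest sentence, which is what `DataCollatorWithPadding` does when the inputs are not already padded. The token counts are hypothetical and `padded_tokens` is an illustrative helper.

```python
def padded_tokens(lengths, batch_size, max_length=None):
    """Count token positions fed to the model after padding.

    With max_length set, every sequence is padded to that length (static
    padding). Otherwise each batch is padded to its own longest sequence
    (dynamic padding).
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = max_length if max_length is not None else max(batch)
        total += width * len(batch)
    return total

# Hypothetical token counts for ten short sentences; real values depend on the data.
lengths = [9, 12, 15, 11, 10, 14, 13, 8, 16, 12]
static = padded_tokens(lengths, batch_size=8, max_length=512)
dynamic = padded_tokens(lengths, batch_size=8)
print(static, dynamic)  # 5120 152
```

Under these assumptions, tokenizing with only `truncation=True` and leaving the padding to the collator would likely shorten training time without changing what the model sees beyond the padding itself.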

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1,
    load_best_model_at_end=True,
    report_to="none"  # disable wandb, TensorBoard, and other logging integrations
)


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
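
With 3,790 training rows (the first `Map` line in the tokenization output), a batch size of 8, and 3 epochs, the number of optimizer steps can be estimated ahead of time. This sketch assumes a single device and no gradient accumulation, which is the `TrainingArguments` default; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Estimate the number of optimizer steps for a run on one device
    with no gradient accumulation and no max_steps override."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

print(total_optimizer_steps(3790, batch_size=8, epochs=3))  # 1422
```

The result agrees with the `global_step=1422` that `trainer.train()` reports in the training section.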


## Model Training

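We train the DistilBERT classifier with the Hugging Face `Trainer` API for 3 epochs, using a batch size of 8 and a learning rate of 2e-5. Evaluation and checkpointing run at the end of every epoch.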
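
The `Trainer` configured earlier reports only losses. A metric function along these lines could be passed as `compute_metrics=compute_metrics` when the `Trainer` is constructed, so that each evaluation also logs accuracy. This is a sketch in plain Python, chosen so it runs without extra dependencies, and it was not part of the original run.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), the pair Trainer passes to compute_metrics."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        row = list(row)
        predicted = row.index(max(row))  # argmax over the class dimension
        correct += int(predicted == label)
        total += 1
    return {"accuracy": correct / total if total else 0.0}
```

`Trainer` prefixes evaluation metrics with `eval_`, so the value would appear as `eval_accuracy` in the logs.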

In [None]:

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.5594        | 0.60641         |
| 2     | 0.4139        | 0.621288        |
| 3     | 0.3377        | 0.663019        |


TrainOutput(global_step=1422, training_loss=0.5110369436180877, metrics={'train_runtime': 576.1623, 'train_samples_per_second': 19.734, 'train_steps_per_second': 2.468, 'total_flos': 1506154322718720.0, 'train_loss': 0.5110369436180877, 'epoch': 3.0})
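
In the log above, validation loss rises every epoch while training loss falls, a pattern usually read as overfitting. Because `load_best_model_at_end=True` is set and no `metric_for_best_model` is given, `Trainer` falls back to evaluation loss to choose the checkpoint it restores. The sketch here picks that epoch from the logged values; `best_epoch` is an illustrative helper.

```python
# Per-epoch losses copied from the training log above.
LOG = [
    {"epoch": 1, "train_loss": 0.5594, "eval_loss": 0.60641},
    {"epoch": 2, "train_loss": 0.4139, "eval_loss": 0.621288},
    {"epoch": 3, "train_loss": 0.3377, "eval_loss": 0.663019},
]

def best_epoch(log, key="eval_loss"):
    """Return the log entry with the lowest value for `key`."""
    return min(log, key=lambda row: row[key])

best = best_epoch(LOG)
print(best["epoch"], best["eval_loss"])  # 1 0.60641
```

On that reading, the model saved in the next cell would correspond to the end of epoch 1 rather than epoch 3.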

In [None]:
trainer.save_model("finetuned_distilbert_stereo")
tokenizer.save_pretrained("finetuned_distilbert_stereo")

('finetuned_distilbert_stereo/tokenizer_config.json',
 'finetuned_distilbert_stereo/special_tokens_map.json',
 'finetuned_distilbert_stereo/vocab.txt',
 'finetuned_distilbert_stereo/added_tokens.json',
 'finetuned_distilbert_stereo/tokenizer.json')
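
Once saved, the directory can be reloaded for inference. The following is a hedged sketch rather than part of the original notebook: `softmax`, `describe`, and `predict` are hypothetical helpers, and the label names follow the 0/1 encoding used when the dataset was built. `predict` imports `torch` and `transformers` lazily, so the post-processing functions can be used on their own.

```python
import math

# Encoding used when the training rows were built: 0 = stereotype, 1 = anti-stereotype.
ID2LABEL = {0: "stereotype", 1: "anti-stereotype"}

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe(logits):
    """Map one row of logits to (label name, probability)."""
    probs = softmax(logits)
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]

def predict(texts, model_dir="finetuned_distilbert_stereo"):
    """Classify sentences with the saved model (needs torch and transformers)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [describe(row) for row in logits.tolist()]
```

`predict` takes a list of sentences and returns one `(label, probability)` pair per sentence.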

In [None]:
from google.colab import files
!zip -r model.zip finetuned_distilbert_stereo
files.download("model.zip")


  adding: finetuned_distilbert_stereo/ (stored 0%)
  adding: finetuned_distilbert_stereo/tokenizer_config.json (deflated 76%)
  adding: finetuned_distilbert_stereo/config.json (deflated 46%)
  adding: finetuned_distilbert_stereo/training_args.bin (deflated 51%)
  adding: finetuned_distilbert_stereo/vocab.txt (deflated 53%)
  adding: finetuned_distilbert_stereo/tokenizer.json (deflated 71%)
  adding: finetuned_distilbert_stereo/special_tokens_map.json (deflated 42%)
  adding: finetuned_distilbert_stereo/model.safetensors (deflated 8%)
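
The `!zip` shell command works in Colab. For runtimes without the `zip` binary, a standard-library alternative might look like the sketch here; `zip_directory` is a hypothetical helper built on `shutil.make_archive`.

```python
import shutil
from pathlib import Path

def zip_directory(directory, archive_stem="model"):
    """Zip `directory` (kept as the top-level folder) into <archive_stem>.zip
    and return the path of the archive."""
    directory = Path(directory).resolve()
    if not directory.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory")
    return shutil.make_archive(
        str(archive_stem), "zip", root_dir=directory.parent, base_dir=directory.name
    )
```

Calling `zip_directory("finetuned_distilbert_stereo")` would write `model.zip` to the working directory, which can then be passed to `files.download` as in the cell above.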
