# Fine-Tuning on Balanced Augmented Dataset (MLM)

This notebook fine-tunes a masked language model (MLM), `distilbert-base-uncased`, on a **balanced** version of the augmented StereoSet dataset. The dataset contains equal numbers of stereotypical and anti-stereotypical examples, with the aim of encouraging fairer model behavior.
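The notebook loads a dataset that is already balanced and does not show how the balancing was done. As a rough illustration only, one common way to get equal class counts is to downsample every class to the size of the smallest one. The sketch below assumes each record carries a hypothetical `label` field; the field name and the toy records are placeholders, not taken from the actual dataset.

```python
import random
from collections import Counter


def balance_by_label(records, label_key="label", seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[label_key], []).append(record)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced


# Toy records: the texts and the label names are placeholders.
toy = [{"text": f"s{i}", "label": "stereotype"} for i in range(5)]
toy += [{"text": f"a{i}", "label": "anti-stereotype"} for i in range(3)]

balanced = balance_by_label(toy)
counts = Counter(record["label"] for record in balanced)
print(counts)
```

Downsampling discards examples from the larger class; oversampling or further augmentation of the smaller class would be alternatives if data is scarce.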


In [1]:
from datasets import load_from_disk

dataset_path = "C:/Users/sarah/Documents/ERASMUS/NLP/balanced_augmented_dataset"
dataset = load_from_disk(dataset_path)


In [2]:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

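def tokenize_function(examples):
    # With batched=True, examples["text"] is a list of strings.
    # Every sequence is padded or truncated to exactly 128 tokens.
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize every split and drop the raw text column, which the model does not consume.
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Selects 15% of the tokens in each batch at random as MLM prediction targets.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)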



Map:   0%|          | 0/360 [00:00<?, ? examples/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]
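To make the role of `mlm_probability=0.15` more concrete, here is a simplified, library-free sketch of the masking that `DataCollatorForLanguageModeling` applies at batch time: about 15% of non-special tokens become prediction targets, and of those roughly 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. Positions that are not targets get the label `-100`, which the loss ignores. The function name `mask_tokens` is hypothetical, and the token ids in the example are assumed values for a BERT-style uncased vocabulary rather than values read from the tokenizer above.

```python
import random

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=None):
    """Return (masked_inputs, labels) for a single token sequence."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(token_ids):
        # Special tokens ([CLS], [SEP], [PAD]) are never prediction targets.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # about 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # about 10%: random token
        # remaining ~10%: keep the original token in the input
    return inputs, labels


# Assumed ids: 101=[CLS], 102=[SEP], 103=[MASK], 0=[PAD]; vocabulary of 30522.
ids = [101] + list(range(2000, 2100)) + [102]
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30522,
                             special_ids={0, 101, 102}, seed=1)
n_targets = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_targets} of {len(ids)} tokens selected as targets")
```

Because the selection is random, each pass over the data sees different masked positions unless the random state is fixed.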

In [3]:
from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

training_args = TrainingArguments(
    output_dir="./results_balanced_mlm",
    evaluation_strategy="epoch",  # evaluate after every epoch; newer transformers releases name this `eval_strategy`
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    learning_rate=5e-5,
    logging_dir="./logs_balanced_mlm",
    save_total_limit=1,
    report_to="none"
)




## Train the MLM on Balanced Data

Fine-tuning runs for three epochs on the dataset with a 1:1 ratio of stereotype to anti-stereotype examples, and the model is evaluated on the test split after each epoch.
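The number of optimizer steps follows from the split sizes reported by the `Map` output (360 training and 40 test examples) and the batch size of 16. The short calculation below assumes a single device and no gradient accumulation, which are the defaults when neither is configured.

```python
import math

train_examples = 360  # from the Map output for the train split
eval_examples = 40    # from the Map output for the test split
batch_size = 16       # per_device_train_batch_size and per_device_eval_batch_size
epochs = 3            # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(eval_examples / batch_size)

print(steps_per_epoch, total_steps, eval_steps)  # 23 69 3
```

The total of 69 matches the `global_step=69` reported when training finishes.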

In [4]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()



{'eval_loss': 2.8497729301452637, 'eval_runtime': 8.5846, 'eval_samples_per_second': 4.66, 'eval_steps_per_second': 0.349, 'epoch': 1.0}





{'eval_loss': 3.1684842109680176, 'eval_runtime': 8.7567, 'eval_samples_per_second': 4.568, 'eval_steps_per_second': 0.343, 'epoch': 2.0}





{'eval_loss': 2.012118101119995, 'eval_runtime': 8.7022, 'eval_samples_per_second': 4.597, 'eval_steps_per_second': 0.345, 'epoch': 3.0}
{'train_runtime': 7363.5297, 'train_samples_per_second': 0.147, 'train_steps_per_second': 0.009, 'train_loss': 2.616793425186821, 'epoch': 3.0}


TrainOutput(global_step=69, training_loss=2.616793425186821, metrics={'train_runtime': 7363.5297, 'train_samples_per_second': 0.147, 'train_steps_per_second': 0.009, 'train_loss': 2.616793425186821, 'epoch': 3.0})
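The evaluation losses above are mean cross-entropy values over masked tokens, so exponentiating them gives a perplexity-style number that is easier to compare across epochs. The sketch below does this for the three logged values and picks the epoch with the lowest loss. Two caveats: the masks are drawn at random at evaluation time and the test split has only 40 examples, so the rise at epoch 2 may partly reflect sampling noise rather than a real regression.

```python
import math

# eval_loss values copied from the per-epoch logs above.
eval_losses = {
    1: 2.8497729301452637,
    2: 3.1684842109680176,
    3: 2.012118101119995,
}

# Perplexity over the masked positions: exp(mean cross-entropy).
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}
best_epoch = min(eval_losses, key=eval_losses.get)

for epoch, loss in eval_losses.items():
    print(f"epoch {epoch}: loss={loss:.3f}  perplexity={perplexities[epoch]:.2f}")
print("lowest eval loss at epoch", best_epoch)
```

In this run the lowest loss falls on the final epoch, which is also the state that `trainer.save_model` writes out, since `load_best_model_at_end` is not set in the training arguments.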

In [5]:
trainer.save_model("finetuned_distilbert_balanced_mlm")
tokenizer.save_pretrained("finetuned_distilbert_balanced_mlm")


('finetuned_distilbert_balanced_mlm\\tokenizer_config.json',
 'finetuned_distilbert_balanced_mlm\\special_tokens_map.json',
 'finetuned_distilbert_balanced_mlm\\vocab.txt',
 'finetuned_distilbert_balanced_mlm\\added_tokens.json',
 'finetuned_distilbert_balanced_mlm\\tokenizer.json')
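Once the model and tokenizer are saved, a quick sanity check is to reload them from the output directory and look at the top predictions for a masked sentence. The sketch below uses the `fill-mask` pipeline from `transformers`; the helper name `top_predictions` and the probe sentence are hypothetical and only serve as an illustration, not as a bias evaluation.

```python
from pathlib import Path

MODEL_DIR = "finetuned_distilbert_balanced_mlm"  # directory written by the cell above


def top_predictions(sentence, model_dir=MODEL_DIR, k=5):
    """Return the top-k (token, score) fills for the single [MASK] in `sentence`."""
    # Imported lazily so the helper can be defined without loading the model.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_dir)
    return [(p["token_str"], p["score"]) for p in fill(sentence, top_k=k)]


# Only run the probe if the saved model is actually present.
if Path(MODEL_DIR).is_dir():
    # Hypothetical probe sentence, chosen only for illustration.
    for token, score in top_predictions("The nurse said that [MASK] would be late."):
        print(f"{token:>10}  {score:.3f}")
```

Running the same probe against the original `distilbert-base-uncased` checkpoint gives a side-by-side view of how fine-tuning shifted the predictions.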