# Fine-Tuning with MLM Objective (Augmented Dataset)

This notebook fine-tunes a masked language model (`distilbert-base-uncased`) on an augmented version of the StereoSet dataset. The MLM objective trains the model to predict masked tokens from their surrounding context, which is intended to improve its handling of contextual completions, especially antistereotypical ones.


In [1]:
from datasets import load_from_disk

dataset_path = "C:/Users/sarah/Documents/ERASMUS/NLP/augmented_dataset"
dataset = load_from_disk(dataset_path)

# Hold out 10% of the train split for evaluation (the result has "train" and "test" keys)
full_dataset = dataset["train"].train_test_split(test_size=0.1)


In [None]:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

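def tokenize_function(example):
    # Truncate or pad every text to exactly 128 tokens so batches have a fixed shape
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

# Tokenize both splits and drop the raw columns the model does not consume
tokenized_dataset = full_dataset.map(tokenize_function, batched=True, remove_columns=["text", "label"])

# Randomly select about 15% of the non-special tokens in each batch as MLM targets
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)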



Map:   0%|          | 0/6822 [00:00<?, ? examples/s]

Map:   0%|          | 0/759 [00:00<?, ? examples/s]
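To make the role of `mlm_probability=0.15` more concrete, here is a minimal pure-Python sketch of the BERT-style masking rule that `DataCollatorForLanguageModeling` applies by default: selected tokens become prediction targets, and of those roughly 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The function name `mask_tokens` and its parameters are illustrative, not part of the library, and the token ids in the usage note (`[PAD]`=0, `[CLS]`=101, `[SEP]`=102, `[MASK]`=103, vocabulary size 30522) are assumed to match the `distilbert-base-uncased` vocabulary.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Illustrative sketch of BERT-style masking for one sequence.

    Returns (masked_ids, labels). Labels hold the original token id at
    positions chosen as prediction targets and IGNORE_INDEX elsewhere.
    """
    rng = rng or random.Random()
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)

    for i, token_id in enumerate(input_ids):
        # Special tokens ([CLS], [SEP], [PAD]) are never chosen as targets.
        if token_id in special_ids:
            continue
        # Each remaining token is chosen with probability mlm_probability.
        if rng.random() >= mlm_probability:
            continue

        labels[i] = token_id
        roll = rng.random()
        if roll < 0.8:
            masked_ids[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked_ids[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input

    return masked_ids, labels
```

A call such as `mask_tokens(ids, {0, 101, 102}, 103, 30522, rng=random.Random(0))` returns a masked copy of `ids` along with its labels. Because the collator draws a fresh mask each time a batch is built, the same sentence is masked differently across epochs.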

In [3]:
from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

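training_args = TrainingArguments(
    output_dir="./results_mlm_augmented",
    evaluation_strategy="epoch",  # evaluate on the held-out split after every epoch
    save_strategy="epoch",        # write a checkpoint after every epoch
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    learning_rate=5e-5,
    logging_dir="./logs_mlm_augmented",
    save_total_limit=1,           # keep only the most recent checkpoint on disk
    report_to="none"              # disable external experiment trackers
)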




## Train the MLM

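Training is launched with the `Trainer` API. The data collator defined earlier applies the MLM masking to each batch, and the held-out split is evaluated at the end of every epoch.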
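As a quick sanity check, the step counts that appear in the training output can be reproduced from the dataset sizes and batch size. The sketch below takes the 6822 training and 759 evaluation examples shown by the `Map` progress output, and assumes a single device, no gradient accumulation, and the `Trainer` default of logging every 500 steps.

```python
import math

num_train_examples = 6822  # from the Map progress output for the train split
num_eval_examples = 759    # from the Map progress output for the test split
batch_size = 16            # per_device_train_batch_size and per_device_eval_batch_size
num_epochs = 3             # num_train_epochs

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_train_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(num_eval_examples / batch_size)

# Fractional epoch at which the default logging steps (500 and 1000) occur.
logged_epochs = [round(step / steps_per_epoch, 2) for step in (500, 1000)]

print(steps_per_epoch, total_train_steps, eval_steps, logged_epochs)
```

Under these assumptions the totals agree with the progress bars (1281 training steps, 48 evaluation steps) and with the `'epoch': 1.17` and `'epoch': 2.34` entries in the loss log.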

In [4]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()


  0%|          | 0/1281 [00:00<?, ?it/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/48 [00:00<?, ?it/s]

{'eval_loss': 2.151728391647339, 'eval_runtime': 26840.45, 'eval_samples_per_second': 0.028, 'eval_steps_per_second': 0.002, 'epoch': 1.0}




{'loss': 2.2547, 'learning_rate': 3.0483996877439503e-05, 'epoch': 1.17}


  0%|          | 0/48 [00:00<?, ?it/s]

{'eval_loss': 1.8515901565551758, 'eval_runtime': 172.8808, 'eval_samples_per_second': 4.39, 'eval_steps_per_second': 0.278, 'epoch': 2.0}




{'loss': 1.7885, 'learning_rate': 1.0967993754879002e-05, 'epoch': 2.34}


  0%|          | 0/48 [00:00<?, ?it/s]

{'eval_loss': 1.5770235061645508, 'eval_runtime': 147.1516, 'eval_samples_per_second': 5.158, 'eval_steps_per_second': 0.326, 'epoch': 3.0}
{'train_runtime': 60653.7984, 'train_samples_per_second': 0.337, 'train_steps_per_second': 0.021, 'train_loss': 1.9383394358960284, 'epoch': 3.0}


TrainOutput(global_step=1281, training_loss=1.9383394358960284, metrics={'train_runtime': 60653.7984, 'train_samples_per_second': 0.337, 'train_steps_per_second': 0.021, 'train_loss': 1.9383394358960284, 'epoch': 3.0})
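The `eval_loss` values above are mean cross-entropy losses over the masked positions, so exponentiating them gives a perplexity-style figure that can be easier to compare across epochs. The sketch below simply copies the three logged values; it assumes the loss is measured in nats, as is the case for PyTorch's cross-entropy.

```python
import math

# eval_loss values copied from the per-epoch evaluation logs above
eval_losses = {
    1: 2.151728391647339,
    2: 1.8515901565551758,
    3: 1.5770235061645508,
}

# Perplexity over masked tokens is exp(mean cross-entropy)
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: eval_loss={eval_losses[epoch]:.4f}  perplexity={ppl:.2f}")
```

The perplexity falls from roughly 8.6 after the first epoch to roughly 4.8 after the third. Since the evaluation set is re-masked at random on every pass, small differences between runs should be read as noise rather than as real changes.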

In [5]:
trainer.save_model("finetuned_distilbert_augmented_mlm")
tokenizer.save_pretrained("finetuned_distilbert_augmented_mlm")


('finetuned_distilbert_augmented_mlm\\tokenizer_config.json',
 'finetuned_distilbert_augmented_mlm\\special_tokens_map.json',
 'finetuned_distilbert_augmented_mlm\\vocab.txt',
 'finetuned_distilbert_augmented_mlm\\added_tokens.json',
 'finetuned_distilbert_augmented_mlm\\tokenizer.json')
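One way to check the saved model by hand is to reload it through the `fill-mask` pipeline from `transformers`. The helper below is a minimal sketch: the name `top_completions` is hypothetical, and it assumes the directory written above is reachable from the working directory.

```python
def top_completions(model_dir, text, top_k=5):
    """Return (token, score) pairs for the single [MASK] position in `text`."""
    # Imported lazily so that defining the helper does not load the library.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_dir, tokenizer=model_dir)
    predictions = fill_mask(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Example call (the sentence is an illustrative placeholder, not from the dataset):
# top_completions("finetuned_distilbert_augmented_mlm", "The capital of France is [MASK].")
```

Running the same prompts through the original `distilbert-base-uncased` checkpoint and the fine-tuned directory gives a simple side-by-side view of how the predicted completions changed.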