# MLM Training With Trainer

In this notebook we'll cover the training process for masked-language modeling (MLM) using the Hugging Face `Trainer` class.

In [1]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We'll be using *Meditations* by *Marcus Aurelius* as our training data. The file below has already been cleaned, so no further processing is required beyond splitting it into lines with `split`.

In [2]:
with open('clean.txt', 'r') as fp:
    text = fp.read().split('\n')

In [3]:
text[:5]

['From my grandfather Verus I learned good morals and the government of my temper.',
 'From the reputation and remembrance of my father, modesty and a manly character.',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.',
 'From my great-grandfather, not to have frequented public schools, and to have had good teachers at home, and to know that on such things a man should spend liberally.',
 "From my governor, to be neither of the green nor of the blue party at the games in the Circus, nor a partizan either of the Parmularius or the Scutarius at the gladiators' fights; from him too I learned endurance of labour, and to want little, and to work with my own hands, and not to meddle with other people's affairs, and not to be ready to listen to slander."]

First, we'll tokenize our text.

In [4]:
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
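
To make the effect of `max_length`, `truncation=True`, and `padding='max_length'` concrete, here is a minimal pure-Python sketch. It is a toy illustration and not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, the ordinary token IDs are arbitrary, and only the special-token IDs (101, 102, 0) match `bert-base-uncased`.

```python
# Toy illustration only: mimics the shape of the tokenizer output, not its real code.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Hypothetical helper: add [CLS]/[SEP], truncate, then right-pad with [PAD]."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks a real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad             # 0 marks padding
    return input_ids, attention_mask


# arbitrary token IDs standing in for a short and a long sentence
short_ids, short_mask = pad_or_truncate([11, 12], max_length=6)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15], max_length=6)

print(short_ids, short_mask)  # [101, 11, 12, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 11, 12, 13, 14, 102] [1, 1, 1, 1, 1, 1]
```

Every row ends up with the same length, which is what allows the tokenizer to return a single rectangular tensor with `return_tensors='pt'`.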

Then we create our *labels* tensor by cloning the *input_ids* tensor.

In [5]:
inputs['labels'] = inputs.input_ids.detach().clone()
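
The clone matters because the masking step later modifies *input_ids* in place. A small sketch on a toy tensor (the token IDs 7 and 8 are arbitrary) shows the difference between a clone and a plain reference:

```python
import torch

input_ids = torch.tensor([[101, 7, 8, 102]])
labels = input_ids.detach().clone()  # independent copy with its own memory
alias = input_ids                    # just another name for the same tensor

input_ids[0, 1] = 103                # mask one token in place

print(labels[0, 1].item())  # 7: the clone still holds the original token
print(alias[0, 1].item())   # 103: the alias sees the in-place change
```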

Now we mask tokens in the *input_ids* tensor. Each token is selected with a probability of 15%, as before, provided that it is **not** a *CLS*, *SEP*, or *PAD* token (IDs 101, 102, and 0 respectively).

In [6]:
# create random array of floats with equal dimensions to input_ids tensor
rand = torch.rand(inputs.input_ids.shape)
# create mask array: True for ~15% of positions, never for [CLS] (101), [SEP] (102), or [PAD] (0)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * \
           (inputs.input_ids != 102) * (inputs.input_ids != 0)
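
As a sanity check on the condition above, the following sketch applies it to a toy batch (the ordinary token IDs are arbitrary). It also writes the same condition with the logical operators `&` and `~`, which some readers may find clearer than multiplying boolean tensors; the two forms should select identical positions.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# toy batch laid out as [CLS] tokens... [SEP] [PAD]...
input_ids = torch.tensor([
    [101, 11, 12, 13, 14, 102, 0, 0],
    [101, 21, 22, 102, 0, 0, 0, 0],
])

rand = torch.rand(input_ids.shape)

# multiplicative form, as used in the cell above
mask_mul = (rand < 0.15) * (input_ids != 101) * \
           (input_ids != 102) * (input_ids != 0)

# equivalent form with explicit logical operators
special = (input_ids == 101) | (input_ids == 102) | (input_ids == 0)
mask_arr = (rand < 0.15) & ~special

print(torch.equal(mask_arr, mask_mul))       # True: both forms agree
print((mask_arr & special).any().item())     # False: special tokens are never selected
```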

And now we take the indices of each `True` value within each individual row.

In [7]:
selection = []

for i in range(inputs.input_ids.shape[0]):
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )

Then we apply these indices to each respective row in *input_ids*, setting the values at these indices to *103*, the ID of the *MASK* token.

In [8]:
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103
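
The two loops above can also be collapsed into a single boolean-mask assignment. This is an alternative sketch rather than the notebook's method; it uses a toy tensor and a hand-written mask standing in for `mask_arr`, and checks that both approaches give the same result.

```python
import torch

input_ids = torch.tensor([
    [101, 11, 12, 13, 102, 0],
    [101, 21, 22, 102, 0, 0],
])
# hand-written mask standing in for mask_arr
mask_arr = torch.tensor([
    [False, True, False, True, False, False],
    [False, False, True, False, False, False],
])

# loop version, following the cells above
looped = input_ids.clone()
for i in range(looped.shape[0]):
    selection = torch.flatten(mask_arr[i].nonzero()).tolist()
    looped[i, selection] = 103

# vectorized version: assign to every position where the mask is True
vectorized = input_ids.clone()
vectorized[mask_arr] = 103

print(torch.equal(looped, vectorized))  # True
print(vectorized.tolist())
# [[101, 103, 12, 103, 102, 0], [101, 21, 103, 102, 0, 0]]
```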

The `Trainer` expects a `Dataset` object, so we need to initialize this.

In [9]:
class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
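        # values are already tensors, so copy them instead of re-wrapping with torch.tensor (avoids a UserWarning)
        return {key: val[idx].detach().clone() for key, val in self.encodings.items()}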
    def __len__(self):
        return len(self.encodings.input_ids)
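
To see what such a dataset hands back, here is a small sketch using a hypothetical `ToyDataset` that takes a plain dict of tensors (so it looks up `'input_ids'` by key rather than by attribute) and a standard PyTorch `DataLoader`. The tensors are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for MeditationsDataset, backed by a plain dict."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # one sample = the idx-th row of every tensor in the dict
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])


encodings = {
    'input_ids': torch.tensor([[101, 103, 102], [101, 22, 102], [101, 103, 102]]),
    'attention_mask': torch.ones(3, 3, dtype=torch.long),
    'labels': torch.tensor([[101, 11, 102], [101, 22, 102], [101, 33, 102]]),
}
toy = ToyDataset(encodings)

# the default collate function stacks the per-sample dicts into batched tensors
batch = next(iter(DataLoader(toy, batch_size=2)))

print(len(toy))  # 3
print({key: tuple(val.shape) for key, val in batch.items()})
# {'input_ids': (2, 3), 'attention_mask': (2, 3), 'labels': (2, 3)}
```

Each batch is a dict whose keys match the keyword arguments of the model's forward pass, which is why the dataset keeps the tokenizer's key names.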

Initialize our data using the `MeditationsDataset` class.

In [10]:
dataset = MeditationsDataset(inputs)

We'll define our training arguments in a `TrainingArguments` object, `args`, and pass it to the `Trainer`.

In [11]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='out',
    per_device_train_batch_size=8,
    num_train_epochs=2
)

Now we'll import and initialize our `Trainer`.

In [12]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset
)

And train.

In [13]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33ms-kant-malviya[0m. Use [1m`wandb login --relogin`[0m to force relogin


TrainOutput(global_step=128, training_loss=1.0204026699066162, metrics={'train_runtime': 74.5313, 'train_samples_per_second': 13.632, 'train_steps_per_second': 1.717, 'total_flos': 267416089804800.0, 'train_loss': 1.0204026699066162, 'epoch': 2.0})
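
The `global_step=128` reported above can be cross-checked against the batch size and epoch count. The sketch below rests on several assumptions: a single device, no gradient accumulation, and a throughput figure that counts every sample seen across all epochs. The dataset size is not stated in the notebook; it is inferred here from the reported runtime and throughput.

```python
import math

# training arguments used in this notebook
per_device_train_batch_size = 8
num_train_epochs = 2

# values copied from the TrainOutput above
train_runtime = 74.5313
train_samples_per_second = 13.632
global_step = 128

# assumption: throughput counts every sample seen over all epochs
samples_seen = train_samples_per_second * train_runtime
num_samples = round(samples_seen / num_train_epochs)  # inferred, not stated

steps_per_epoch = math.ceil(num_samples / per_device_train_batch_size)
expected_steps = steps_per_epoch * num_train_epochs

print(num_samples, steps_per_epoch, expected_steps)  # 508 64 128
```

Under these assumptions the inferred 508 training lines give 64 steps per epoch, which matches the reported 128 steps over two epochs.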