# MLM Training

In this notebook we'll cover the training process for masked-language modeling (MLM). First we import and initialize everything required.

In [None]:
%%bash
pip3 install transformers
pip3 install torch torchvision torchaudio

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.0.17 pyyaml-5.4.1 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.10.2
Collecting torchaudio
  Downloading torchaudio-0.9.1-cp37-cp37m-manylinux1_x86_64.whl (1.9 MB)
  Downloading torchau

In [None]:
!gdown --id 1Xb9J6UeRFqfeBJOUc-SyvsU3r4_tZ3Lm

Downloading...
From: https://drive.google.com/uc?id=1Xb9J6UeRFqfeBJOUc-SyvsU3r4_tZ3Lm
To: /content/train_set.json
119MB [00:00, 153MB/s]


In [None]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-cased')
model = BertForMaskedLM.from_pretrained('dbmdz/bert-base-italian-xxl-cased')

Some weights of the model checkpoint at dbmdz/bert-base-italian-xxl-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We'll be using a corpus of Italian sentences, stored in `train_set.json`, as our training data. The file below has already been cleaned, so no further processing is required beyond loading the JSON list.

In [None]:
import json
with open('train_set.json') as f:
    text = json.load(f)

In [None]:
text[:5]

['Lunedì, 5 ottobre 1987',
 'Vi saluto con profonda gioia.',
 '- passi di chi non leva mai la mano per ferire, ma per sostenere, per confortare, per beneficare.',
 'È quanto vi chiede la vostra Beata, ripetendo anche oggi il suo saluto: “Stì bé!”.',
 'Beata Antonia, beata Pierina, pregate per noi!']

In [None]:
from random import shuffle

train_dataset_size = 20000
eval_dataset_size = 500
shuffle(text)
#train_text = text[:-eval_dataset_size]
train_text = text[:train_dataset_size]
eval_text = text[-eval_dataset_size:]
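Because `shuffle` is unseeded, the split above changes on every run, and the two slices only stay disjoint if the corpus holds at least 20,500 sentences. A minimal sketch of a reproducible, explicitly disjoint split, assuming a fixed seed is acceptable; `split_corpus` and the seed value are hypothetical names chosen for illustration:

```python
from random import Random


def split_corpus(sentences, train_size, eval_size, seed=42):
    """Shuffle a copy of `sentences` and return disjoint (train, eval) lists."""
    if train_size + eval_size > len(sentences):
        raise ValueError("corpus too small for the requested split")
    shuffled = list(sentences)      # copy, so the caller's list is untouched
    Random(seed).shuffle(shuffled)  # private RNG: same seed, same order
    train = shuffled[:train_size]
    evaluation = shuffled[train_size:train_size + eval_size]
    return train, evaluation


# Toy corpus standing in for the real `text` list.
toy = [f"sentence {i}" for i in range(10)]
train_part, eval_part = split_corpus(toy, train_size=6, eval_size=2)
print(len(train_part), len(eval_part))
```

Using a private `Random` instance keeps the split stable without touching the global random state.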

First, we'll tokenize our text.

In [None]:
inputs = tokenizer(train_text, return_tensors='pt', max_length=128, truncation=True, padding='max_length')
inputs_eval = tokenizer(eval_text, return_tensors='pt', max_length=128, truncation=True, padding='max_length')

In [None]:
inputs

{'input_ids': tensor([[  102,   207,  1040,  ...,     0,     0,     0],
        [  102,  1360,   121,  ...,     0,     0,     0],
        [  102,  6022,   993,  ...,     0,     0,     0],
        ...,
        [  102,  3702,   146,  ...,     0,     0,     0],
        [  102,  2911,   126,  ...,     0,     0,     0],
        [  102,   126, 14748,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

In [None]:
inputs_eval

{'input_ids': tensor([[  102,   199,   240,  ...,     0,     0,     0],
        [  102,   678,   136,  ...,     0,     0,     0],
        [  102, 26818,  2820,  ...,     0,     0,     0],
        ...,
        [  102,   461,   198,  ...,     0,     0,     0],
        [  102,   572,  1682,  ...,     0,     0,     0],
        [  102,  2660,   120,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
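The trailing `0`s in `input_ids` and `attention_mask` come from `padding='max_length'`, while `truncation=True` cuts longer sequences down to `max_length`. A simplified pure-Python sketch of that behaviour; `pad_or_truncate` is a hypothetical helper, and unlike the real tokenizer it does not preserve the closing special token when truncating:

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: fix a sequence to `max_length` and build its mask."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([102, 207, 1040], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)  # [102, 207, 1040, 0, 0, 0] [1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```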

Then we create our *labels* tensor by cloning the *input_ids* tensor.

In [None]:
inputs['labels'] = inputs.input_ids.detach().clone()
# the eval labels must be cloned from the eval inputs, not the training inputs
inputs_eval['labels'] = inputs_eval.input_ids.detach().clone()

In [None]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
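The `clone()` matters here: the labels must keep the original token IDs after `input_ids` is overwritten with mask tokens. A toy sketch of the difference between aliasing and cloning (the tensor values are invented for illustration):

```python
import torch

input_ids = torch.tensor([[1, 11, 12, 2]])

alias = input_ids                    # same underlying storage
labels = input_ids.detach().clone()  # independent copy

input_ids[0, 1] = 99  # simulate masking one position in place

print(alias.tolist())   # [[1, 99, 12, 2]]
print(labels.tolist())  # [[1, 11, 12, 2]]
```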

Now we mask tokens in the *input_ids* tensor. As before, each token is masked with 15% probability, provided it is **not** a *CLS* or *SEP* token. This time, because the sequences are padded, we also need to exclude *PAD* tokens (*0* input IDs).

In [None]:
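# create random array of floats with equal dimensions to input_ids tensor
rand = torch.rand(inputs.input_ids.shape)
# special-token IDs are read from the tokenizer because they differ between
# vocabularies (here sequences start with 102, not bert-base-uncased's 101)
cls_id, sep_id, pad_id = tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id
# create mask array: roughly 15% of positions, never CLS, SEP or PAD
mask_arr = (rand < 0.15) * (inputs.input_ids != cls_id) * \
           (inputs.input_ids != sep_id) * (inputs.input_ids != pad_id)

# Do the same for the eval set
rand = torch.rand(inputs_eval.input_ids.shape)
mask_arr_eval = (rand < 0.15) * (inputs_eval.input_ids != cls_id) * \
           (inputs_eval.input_ids != sep_id) * (inputs_eval.input_ids != pad_id)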
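To see what the mask looks like on a small scale, here is a toy sketch. The token IDs (`CLS_ID`, `SEP_ID`, `PAD_ID`) and the tensor contents are made-up values for illustration, not the real vocabulary:

```python
import torch

# Made-up special-token IDs for this toy example only.
CLS_ID, SEP_ID, PAD_ID = 1, 2, 0

toy_ids = torch.tensor([
    [1, 11, 12, 13, 2, 0, 0, 0],
    [1, 21, 22, 23, 24, 25, 2, 0],
])

torch.manual_seed(0)
rand = torch.rand(toy_ids.shape)

# A position is a masking candidate only if it holds an ordinary token.
is_special = (toy_ids == CLS_ID) | (toy_ids == SEP_ID) | (toy_ids == PAD_ID)
toy_mask = (rand < 0.15) & ~is_special

print(toy_mask)
```

Whatever the random draw, no special-token position can end up `True`, because the exclusion is applied after the probability threshold.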

In [None]:
mask_arr

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False,  True,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]])

In [None]:
mask_arr_eval

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]])

And now we take the indices of each `True` value within each individual row.

In [None]:
selection = []

for i in range(inputs.input_ids.shape[0]):
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )

selection_eval = []

for i in range(inputs_eval.input_ids.shape[0]):
    selection_eval.append(
        torch.flatten(mask_arr_eval[i].nonzero()).tolist()
    )

In [None]:
selection[:5]

[[13], [5, 8, 13, 17], [2, 5], [2, 4, 6, 10, 14, 18], [2, 9, 12, 14, 22]]

Then apply these indices to each respective row in *input_ids*, replacing the token at each index with the *MASK* token ID.

In [None]:
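# look up the [MASK] ID from the tokenizer instead of hardcoding 103,
# since special-token IDs vary between vocabularies
mask_id = tokenizer.mask_token_id

for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = mask_id

for i in range(inputs_eval.input_ids.shape[0]):
    inputs_eval.input_ids[i, selection_eval[i]] = mask_id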
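The index lists are not strictly necessary: PyTorch supports assignment through a boolean mask, so the same result can be obtained in one step. A small sketch on toy data (the IDs and the `MASK_ID` value are invented for illustration) showing that both approaches agree:

```python
import torch

MASK_ID = 99  # made-up mask token ID for this toy example

ids = torch.tensor([[1, 11, 12, 13, 2], [1, 21, 22, 2, 0]])
mask = torch.tensor([[False, True, False, True, False],
                     [False, False, True, False, False]])

# Approach 1: per-row index lists, as in the cells above.
looped = ids.clone()
for i in range(looped.shape[0]):
    positions = torch.flatten(mask[i].nonzero()).tolist()
    looped[i, positions] = MASK_ID

# Approach 2: assign through the boolean mask directly.
vectorized = ids.clone()
vectorized[mask] = MASK_ID

print(torch.equal(looped, vectorized))
```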

In [None]:
inputs.input_ids

tensor([[  102,   207,  1040,  ...,     0,     0,     0],
        [  102,  1360,   121,  ...,     0,     0,     0],
        [  102,  6022,   103,  ...,     0,     0,     0],
        ...,
        [  102,  3702,   146,  ...,     0,     0,     0],
        [  102,  2911,   126,  ...,     0,     0,     0],
        [  102,   126, 14748,  ...,     0,     0,     0]])

We can see the values *103* have been assigned in the same positions as we found *True* values in the `mask_arr` tensor.
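One optional refinement, not applied in this notebook: as written, `labels` is a full copy of the original IDs, so the loss is computed over every position, including padding. The Hugging Face masked-LM heads ignore positions whose label is `-100`, so a common variant restricts the loss to the masked positions only. A toy sketch of that idea, using `torch.nn.functional.cross_entropy` directly (all tensor values and `vocab_size` are invented for illustration):

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([[1, 11, 12, 13, 2, 0]])
mask = torch.tensor([[False, True, False, True, False, False]])

# Keep the true ID only where a token was masked; ignore everything else.
mlm_labels = labels.clone()
mlm_labels[~mask] = -100

# cross_entropy skips targets equal to ignore_index (default -100).
vocab_size = 30
torch.manual_seed(0)
logits = torch.randn(1, 6, vocab_size)
loss_all = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
loss_masked = F.cross_entropy(logits.view(-1, vocab_size), mlm_labels.view(-1))

print(mlm_labels.tolist())
print(loss_all.item(), loss_masked.item())
```

In the notebook this would amount to setting `inputs['labels'][~mask_arr] = -100` (and likewise for the eval set), at the cost of loss values that are no longer comparable with the ones recorded here.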

The `inputs` tensors are now ready, and we can begin setting them up to be fed into our model during training. We create a PyTorch dataset from our data.

In [None]:
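class VaticanDataset(torch.utils.data.Dataset):
    """Wraps the tokenizer output so each item is a dict of 1-D tensors."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        # the values are already tensors, so copy them rather than calling
        # torch.tensor() on a tensor (which raises a UserWarning)
        return {key: val[idx].detach().clone() for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)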

Initialize our data using the `VaticanDataset` class.

In [None]:
dataset = VaticanDataset(inputs)
eval_dataset = VaticanDataset(inputs_eval)
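To check what the model will actually receive, a dataset like this can be wrapped in a `DataLoader` and one batch inspected. A minimal sketch on toy tensors; `ToyEncodingsDataset` is a hypothetical stand-in that indexes a plain dict the same way `VaticanDataset` indexes the tokenizer output:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyEncodingsDataset(Dataset):
    """Hypothetical stand-in: serves one row of every tensor per index."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["input_ids"])


toy_encodings = {
    "input_ids": torch.arange(40).reshape(10, 4),
    "attention_mask": torch.ones(10, 4, dtype=torch.long),
    "labels": torch.arange(40).reshape(10, 4),
}
loader = DataLoader(ToyEncodingsDataset(toy_encodings), batch_size=4)
first_batch = next(iter(loader))
print({key: tuple(val.shape) for key, val in first_batch.items()})
```

The default collate function stacks the per-item dicts into one dict of batched tensors, which is the shape of input the `Trainer` passes to the model.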

Define the arguments for the trainer.

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='out',
    per_device_train_batch_size=16,
    num_train_epochs=2,
    logging_steps=100,
    evaluation_strategy="steps",
    learning_rate=5e-7,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"
)
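These settings determine how many optimizer steps the run takes and how often it evaluates. A quick back-of-the-envelope check, assuming a single device and no gradient accumulation (assumptions about the runtime, not set explicitly above), and relying on the documented behaviour that the evaluation interval falls back to `logging_steps` when `eval_steps` is not given:

```python
import math

num_examples = 20_000   # train_dataset_size
batch_size = 16         # per_device_train_batch_size
num_epochs = 2          # num_train_epochs
logging_steps = 100     # also the evaluation interval when eval_steps is unset

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // logging_steps

print(steps_per_epoch, total_steps, num_evaluations)  # 1250 2500 25
```

The total of 2500 agrees with the `Total optimization steps` line in the training log.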

Define the trainer.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    eval_dataset=eval_dataset
)

In [None]:
trainer.evaluate(eval_dataset=dataset)

***** Running Evaluation *****
  Num examples = 20000
  Batch size = 8


{'eval_loss': 15.196087837219238,
 'eval_runtime': 395.2545,
 'eval_samples_per_second': 50.6,
 'eval_steps_per_second': 6.325}

In [None]:
trainer.evaluate(eval_dataset=eval_dataset)

***** Running Evaluation *****
  Num examples = 500
  Batch size = 8


{'eval_loss': 17.883710861206055,
 'eval_runtime': 9.8689,
 'eval_samples_per_second': 50.664,
 'eval_steps_per_second': 6.384}

In [None]:
trainer.train()

***** Running training *****
  Num examples = 20000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2500


Step,Training Loss,Validation Loss
100,No log,11.297961
200,No log,3.490658
300,No log,3.394033
400,No log,3.393038
500,3.715900,3.428187
600,3.715900,3.471198
700,3.715900,3.517477
800,3.715900,3.57653
900,3.715900,3.613635
1000,0.182700,3.652072
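In the portion of the log shown, the validation loss bottoms out early and then drifts upward while the training loss keeps falling, which is the usual signature of overfitting. A small sketch that picks the best evaluation step from the logged values (the list is copied from the table above; `best_step` and `best_loss` are just illustrative names):

```python
# (step, validation_loss) pairs copied from the training log above.
validation_log = [
    (100, 11.297961), (200, 3.490658), (300, 3.394033), (400, 3.393038),
    (500, 3.428187), (600, 3.471198), (700, 3.517477), (800, 3.57653),
    (900, 3.613635), (1000, 3.652072),
]

best_step, best_loss = min(validation_log, key=lambda pair: pair[1])
print(best_step, best_loss)  # 400 3.393038
```

Note that the log shows checkpoints being written at steps 500 and 1000, so the step-400 weights themselves do not appear to have been saved.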


***** Running Evaluation *****
  Num examples = 500
  Batch size = 8


Saving model checkpoint to out/checkpoint-500
Configuration saved in out/checkpoint-500/config.json
Model weights saved in out/checkpoint-500/pytorch_model.bin
Saving model checkpoint to out/checkpoint-1000
Configuration saved in out/checkpoint-1000/config.json
Model weights saved in out/checkpoint-1000/pytorch_model.bin

TrainOutput(global_step=2500, training_loss=0.8411328750610352, metrics={'train_runtime': 2613.6572, 'train_samples_per_second': 15.304, 'train_steps_per_second': 0.957, 'total_flos': 2632096665600000.0, 'train_loss': 0.8411328750610352, 'epoch': 2.0})

In [None]:
model.save_pretrained("./final_model")

Configuration saved in ./final_model/config.json
Model weights saved in ./final_model/pytorch_model.bin


In [None]:
%%bash
cd final_model
ls -lh

total 423M
-rw-r--r-- 1 root root  680 Sep 21 00:08 config.json
-rw-r--r-- 1 root root 423M Sep 21 00:08 pytorch_model.bin
