<a href="https://colab.research.google.com/github/rajni-arora/Pretraining-Bert-Transformers-models/blob/main/The_MLM_Logic_step2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Train Using MLM**

This notebook walks through the logic behind fine-tuning a model using masked-language modelling (MLM).

In [None]:
!pip install transformers

In [5]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces [MASK] Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
outputs.keys()

odict_keys(['logits'])

Because we passed no labels, the model returns only the MLM output logits.

In [9]:
outputs.logits.shape

torch.Size([1, 62, 30522])

To identify the positions of the [MASK] tokens, we can check the input_ids tensor for values equal to 103 (the ID of the [MASK] token).

In [10]:
mask_pos = torch.flatten((inputs.input_ids[0] == 103).nonzero()).tolist()
mask_pos

[9, 43]
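Once the mask positions are known, the prediction at each one can be read from the logits by taking the highest-scoring vocabulary entry. The sketch below illustrates the indexing only: the `logits` tensor is random stand-in data, not real model output, and the short `input_ids` sequence is made up for the example.

```python
import torch

MASK_ID = 103        # [MASK] ID in bert-base-uncased, as used above
VOCAB_SIZE = 30522   # vocabulary size of bert-base-uncased

# made-up sequence of 6 token IDs with [MASK] at positions 2 and 4
input_ids = torch.tensor([[101, 2044, MASK_ID, 5367, MASK_ID, 102]])

# stand-in for outputs.logits (random values, for illustration only)
torch.manual_seed(0)
logits = torch.randn(1, 6, VOCAB_SIZE)

# positions of the [MASK] tokens in the first (and only) sequence
mask_pos = (input_ids[0] == MASK_ID).nonzero().flatten()

# probability distribution over the vocabulary at each masked position
probs = torch.softmax(logits[0, mask_pos], dim=-1)

# most likely token ID at each masked position
predicted_ids = probs.argmax(dim=-1)

print(mask_pos.tolist())    # [2, 4]
print(predicted_ids.shape)  # torch.Size([2])
```

With the real model output, `tokenizer.convert_ids_to_tokens(predicted_ids.tolist())` would turn these IDs back into tokens.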

These two positions are where we calculate the loss when training the model. How does that work? We compare the true token IDs at those positions with the model's predicted outputs at the same positions, the former converted to a one-hot encoding and the latter to a probability distribution.

To convert a token ID to a one-hot encoding, we need the vocabulary size: the encoding is a vector of that length, with a 1 at the index given by the token ID and 0 everywhere else.

In [11]:
vocab_size = len(tokenizer.get_vocab())
vocab_size

30522

This matches the final dimension of outputs.logits:

In [12]:
outputs.logits.shape

torch.Size([1, 62, 30522])
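The comparison between a one-hot target and a predicted probability distribution is a cross-entropy loss. As a minimal sketch, assuming random stand-in logits rather than real model output, the manual one-hot calculation can be checked against PyTorch's built-in `F.cross_entropy`. The two target IDs are those of "election" and "attacked" in the tokenized text shown later in this notebook.

```python
import torch
import torch.nn.functional as F

vocab_size = 30522

# stand-in for outputs.logits[0, mask_pos]: one row of scores per masked position
torch.manual_seed(0)
logits = torch.randn(2, vocab_size)

# true token IDs at the two masked positions ("election", "attacked")
true_ids = torch.tensor([2602, 4457])

# one-hot targets: a 1 at the index given by each token ID
one_hot = F.one_hot(true_ids, num_classes=vocab_size).float()

# log of the predicted probability distribution
log_probs = F.log_softmax(logits, dim=-1)

# cross-entropy written out by hand, averaged over the two positions
manual_loss = -(one_hot * log_probs).sum(dim=-1).mean()

# the same quantity from the built-in, which takes class indices directly
builtin_loss = F.cross_entropy(logits, true_ids)

print(torch.allclose(manual_loss, builtin_loss))
```

In practice the one-hot tensor is never materialised. The built-in loss works from the integer token IDs, which is why the model accepts labels as a plain tensor of IDs.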

In practice, we mask tokens randomly. The original (unmasked) input_ids are passed to the model as labels, and the masked tensor takes their place as input_ids.

To test this, let's unmask our text first.

In [13]:
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces attacked Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

Convert the text using the tokenizer:

In [14]:
inputs = tokenizer(text, return_tensors='pt')

In [15]:
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Now we want to mask a number of tokens in the input_ids tensor. The original BERT model was pretrained using a masking probability of 15%, alongside a few additional rules. We will create a similar but simpler implementation that keeps only the core 15% masking step.

In [16]:
# create random array of floats in equal dimension to input_ids
rand = torch.rand(inputs.input_ids.shape)
rand

tensor([[0.6141, 0.7556, 0.1710, 0.9498, 0.8127, 0.1863, 0.1553, 0.3483, 0.0249,
         0.6523, 0.1473, 0.2882, 0.8349, 0.6266, 0.3006, 0.5001, 0.5579, 0.2589,
         0.3163, 0.2815, 0.0736, 0.5379, 0.7003, 0.7847, 0.7621, 0.8342, 0.1835,
         0.1404, 0.8514, 0.9276, 0.9933, 0.7567, 0.1171, 0.2811, 0.6283, 0.2032,
         0.1014, 0.3431, 0.2576, 0.4328, 0.5828, 0.9488, 0.1615, 0.9806, 0.1054,
         0.3970, 0.2622, 0.0421, 0.6582, 0.5936, 0.6381, 0.6185, 0.8627, 0.5244,
         0.3405, 0.1910, 0.0965, 0.6378, 0.0401, 0.2303, 0.3690, 0.7065]])

In [17]:
# where the random array is less than 0.15, we set true, these will be our MASK tokens
mask_arr = rand < 0.15
mask_arr

tensor([[False, False, False, False, False, False, False, False,  True, False,
          True, False, False, False, False, False, False, False, False, False,
          True, False, False, False, False, False, False,  True, False, False,
         False, False,  True, False, False, False,  True, False, False, False,
         False, False, False, False,  True, False, False,  True, False, False,
         False, False, False, False, False, False,  True, False,  True, False,
         False, False]])

In [18]:
# extract the True value index positions
selection = torch.flatten((mask_arr[0]).nonzero()).tolist()
selection

[8, 10, 20, 27, 32, 36, 44, 47, 56, 58]

We then set these True positions in the input_ids tensor to 103. First, though, we copy the original tensor into a new one called labels:

In [19]:
inputs['labels'] = inputs.input_ids.detach().clone()
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([

Now we use selection (the index positions derived from mask_arr) to mask roughly 15% of our tokens:

In [20]:
inputs.input_ids[0, selection] = 103
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,   103,  2602,
           103,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
           103,  2163,  4161,  2037, 22965,  2013,  1996,   103,  2000,  2433,
          1996, 18179,   103,  2162,  3631,  2041,   103,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,   103,  7680,  3334,   103,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,   103,  1005,   103, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([

There is a potential problem here: nothing in our logic stops a special token from being replaced by a MASK token. In this particular run the CLS and SEP tokens happened to survive (positions 0 and 61 are still 101 and 102), but a different random draw could mask either of them, and we never want to mask special tokens. In this case we have only two special tokens, CLS and SEP, at the beginning and end of our sequence respectively, so the solution is easy: we add another condition when creating our Boolean mask tensor mask_arr. This condition checks the original inputs.input_ids and is True only where the value is not equal to 101 or 102 (CLS and SEP respectively):

In [21]:
# reinitialize inputs.input_ids first (to remove mask)
inputs = tokenizer(text, return_tensors='pt')
# and copy to labels tensor
inputs['labels'] = inputs.input_ids.detach().clone()
# now test the new conditional logic
(inputs.input_ids != 101) * (inputs.input_ids != 102)

tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True, False]])

And add this logic to our mask_arr:

In [22]:
# where the random array is less than 0.15, we set true, these will be our MASK tokens
# BUT - we must also add our new conditions too
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102)
mask_arr

tensor([[False, False, False, False, False, False, False, False,  True, False,
          True, False, False, False, False, False, False, False, False, False,
          True, False, False, False, False, False, False,  True, False, False,
         False, False,  True, False, False, False,  True, False, False, False,
         False, False, False, False,  True, False, False,  True, False, False,
         False, False, False, False, False, False,  True, False,  True, False,
         False, False]])
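Chaining one `!=` condition per special token becomes unwieldy once more special tokens are involved, for example padding in a batch of sequences. One possible generalisation, sketched below, uses `torch.isin` to test against a list of IDs in a single step. The padded batch is an assumption made for illustration, since this notebook only uses a single unpadded sequence.

```python
import torch

# special-token IDs in bert-base-uncased
PAD_ID, CLS_ID, SEP_ID, MASK_ID = 0, 101, 102, 103

# made-up batch of two sequences, the first padded to the same length
input_ids = torch.tensor([
    [CLS_ID, 2044, 8181, 5367, SEP_ID, PAD_ID, PAD_ID],
    [CLS_ID, 2162, 3631, 2041, 1999, 2258, SEP_ID],
])

# every ID that must never be replaced by [MASK]
special_ids = torch.tensor([PAD_ID, CLS_ID, SEP_ID])

torch.manual_seed(0)
rand = torch.rand(input_ids.shape)

# select ~15% of positions, excluding any position holding a special token
is_special = torch.isin(input_ids, special_ids)
mask_arr = (rand < 0.15) & ~is_special

# Boolean indexing applies the mask to the whole batch at once
masked_ids = input_ids.clone()
masked_ids[mask_arr] = MASK_ID

print(mask_arr)
print(masked_ids)
```

Boolean indexing also removes the need to build a per-sequence `selection` list, which matters once there is more than one row in the batch.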

In [23]:
inputs.input_ids

tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]])

And run through the same steps again.

In [24]:
selection = torch.flatten((mask_arr[0]).nonzero()).tolist()  # create selection from mask_arr
inputs.input_ids[0, selection] = 103  # apply selection index to inputs.input_ids, adding MASK tokens
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,   103,  2602,
           103,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
           103,  2163,  4161,  2037, 22965,  2013,  1996,   103,  2000,  2433,
          1996, 18179,   103,  2162,  3631,  2041,   103,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,   103,  7680,  3334,   103,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,   103,  1005,   103, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([
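The "additional rules" mentioned earlier for the original BERT pretraining concern what happens to the selected 15% of tokens: 80% of them are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The function below is a rough sketch of that scheme, not the reference implementation. The name `bert_style_mask` and its parameters are hypothetical, chosen for this example.

```python
import torch

def bert_style_mask(input_ids, special_ids, vocab_size,
                    mask_id=103, p=0.15, generator=None):
    """Hypothetical helper: select ~p of the non-special tokens, then apply
    the 80% [MASK] / 10% random token / 10% unchanged split to them."""
    labels = input_ids.clone()

    # first draw: which positions are selected for prediction
    rand = torch.rand(input_ids.shape, generator=generator)
    selected = (rand < p) & ~torch.isin(input_ids, special_ids)

    # second draw: what to do with each selected position
    action = torch.rand(input_ids.shape, generator=generator)
    masked = input_ids.clone()

    # 80% of selected positions become [MASK]
    masked[selected & (action < 0.8)] = mask_id

    # 10% of selected positions become a random vocabulary ID
    random_pos = selected & (action >= 0.8) & (action < 0.9)
    random_ids = torch.randint(vocab_size, input_ids.shape, generator=generator)
    masked[random_pos] = random_ids[random_pos]

    # the remaining 10% of selected positions keep their original token
    return masked, labels, selected


g = torch.Generator().manual_seed(0)
ids = torch.tensor([[101, 2044, 8181, 5367, 2180, 1996,
                     2281, 7313, 4883, 2602, 102]])
masked, labels, selected = bert_style_mask(
    ids, torch.tensor([101, 102]), vocab_size=30522, generator=g)
print(masked)
print(selected)
```

The simpler approach used in this notebook corresponds to replacing every selected position with [MASK].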

Now we're ready. We pass these inputs into our model.

In [25]:
outputs = model(**inputs)

And we'll find that we now have an additional output tensor, the loss tensor.

In [26]:
outputs.keys()

odict_keys(['loss', 'logits'])

In [27]:
outputs.loss

tensor(0.9645, grad_fn=<NllLossBackward0>)

This is the loss we minimize during training.
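One detail worth noting: because labels here is a full copy of the original input_ids, the loss is averaged over every token in the sequence, not only the masked ones. Hugging Face models ignore any label set to -100, so a common variation is to set the labels at all unmasked positions to -100, restricting the loss to the masked positions. The sketch below shows the effect using PyTorch's cross-entropy directly, with random stand-in logits and a made-up short sequence.

```python
import torch
import torch.nn.functional as F

vocab_size = 30522

# made-up original token IDs and a Boolean mask marking the masked positions
original_ids = torch.tensor([[101, 2044, 8181, 5367, 2180, 102]])
mask_arr = torch.tensor([[False, False, True, False, True, False]])

# keep true IDs only where a token was masked; ignore everything else
labels = original_ids.clone()
labels[~mask_arr] = -100

# stand-in for outputs.logits (random values, for illustration only)
torch.manual_seed(0)
logits = torch.randn(1, 6, vocab_size)

# cross_entropy skips targets equal to ignore_index, which defaults to -100
loss_masked_only = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))

# equivalent: compute the loss on the masked positions explicitly
loss_manual = F.cross_entropy(logits[mask_arr], original_ids[mask_arr])

print(labels)
print(torch.allclose(loss_masked_only, loss_manual))
```

In this notebook the equivalent change would be `inputs['labels'][~mask_arr] = -100` before calling `model(**inputs)`.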