<a href="https://colab.research.google.com/github/rajni-arora/Pretraining-Bert-Transformers-models/blob/main/07_mlm_and_nsp_logic_step3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training With MLM and NSP

BERT was originally pretrained using both MLM *and* NSP. So far, we've looked at how to use each of these individually, but not together. In this notebook, we'll combine them.

First, we initialize everything we need. This time, rather than using the `BertForMaskedLM` or `BertForNextSentencePrediction` class, we use the `BertForPreTraining` class, which includes both an MLM head and an NSP head.

In [None]:
from transformers import BertTokenizer, BertForPreTraining
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces [MASK] Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We tokenize just as before.

In [None]:
inputs = tokenizer(text, text2, return_tensors='pt')
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,   103,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,   103,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

And process those `inputs` as we did before too.

In [None]:
outputs = model(**inputs)

In [None]:
outputs.keys()

odict_keys(['prediction_logits', 'seq_relationship_logits'])

The model now returns two output tensors:

* *prediction_logits* for our predicted output tokens - which we will use for calculating MLM loss.

* *seq_relationship_logits* for our predicted `IsNextSentence` or `NotNextSentence` classifications, which we can use for calculating NSP loss.
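
As a rough sketch of how these two tensors might be read, the helper below (`read_pretraining_outputs`, a hypothetical name that is not part of the notebook) takes the top-scoring vocabulary id at each *\[MASK\]* position and the top-scoring NSP class. It assumes `103` is the *\[MASK\]* id, as in the `bert-base-uncased` tokenization above, and it runs on small made-up tensors rather than the real model outputs.

```python
import torch

MASK_TOKEN_ID = 103  # [MASK] id for bert-base-uncased (assumed, as seen in input_ids)


def read_pretraining_outputs(input_ids, prediction_logits, seq_relationship_logits):
    """Return (predicted token ids at [MASK] positions, predicted NSP class)."""
    # Boolean tensor of shape (batch, seq_len) marking the [MASK] positions.
    is_masked = input_ids == MASK_TOKEN_ID
    # Highest-scoring vocabulary id at every position, keeping only masked ones.
    predicted_ids = prediction_logits.argmax(dim=-1)[is_masked]
    # Class 0 = IsNextSentence, class 1 = NotNextSentence.
    nsp_class = seq_relationship_logits.argmax(dim=-1)
    return predicted_ids, nsp_class


# Toy stand-ins for the model outputs: batch of 1, 4 tokens, vocabulary of 110.
toy_ids = torch.tensor([[101, 103, 7, 102]])
toy_mlm_logits = torch.zeros(1, 4, 110)
toy_mlm_logits[0, 1, 3] = 10.0  # make vocabulary id 3 the winner at the masked slot
toy_nsp_logits = torch.tensor([[2.0, -1.0]])  # made-up values favouring class 0

token_ids, nsp = read_pretraining_outputs(toy_ids, toy_mlm_logits, toy_nsp_logits)
print(token_ids, nsp)
```

With the real model, the same function could be called on `inputs.input_ids`, `outputs.prediction_logits` and `outputs.seq_relationship_logits`, and the resulting ids turned back into tokens with `tokenizer.convert_ids_to_tokens(token_ids.tolist())`.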

In [None]:
outputs.prediction_logits

tensor([[[ -8.2231,  -8.0797,  -8.1794,  ...,  -7.3005,  -7.2905,  -5.0078],
         [-12.6802, -12.6025, -12.7861,  ..., -12.2099, -11.7521,  -8.9502],
         [ -6.1791,  -6.3236,  -5.8000,  ...,  -6.0132,  -6.1775,  -4.6381],
         ...,
         [ -1.7799,  -1.5980,  -1.7029,  ...,  -1.2014,  -1.2108,  -6.9480],
         [-14.3935, -14.4690, -14.3760,  ..., -11.7672, -11.9389, -10.8475],
         [-13.9112, -14.0683, -13.9222,  ..., -11.3252, -11.7293, -10.5214]]],
       grad_fn=<AddBackward0>)

In [None]:
outputs.seq_relationship_logits

tensor([[ 6.0843, -5.6813]], grad_fn=<AddmmBackward>)
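
To make these two logits easier to interpret, a softmax turns them into class probabilities. The short sketch below copies the values printed above into a fresh tensor (so it does not need the model) and assumes the usual Hugging Face convention that index `0` means `IsNextSentence` and index `1` means `NotNextSentence`.

```python
import torch

# Logits copied from the seq_relationship_logits output above.
nsp_logits = torch.tensor([[6.0843, -5.6813]])

# Softmax over the last dimension turns the two logits into probabilities.
nsp_probs = torch.softmax(nsp_logits, dim=-1)

# Index 0 = IsNextSentence, index 1 = NotNextSentence.
is_next_prob = nsp_probs[0, 0].item()
print(f"P(IsNextSentence) = {is_next_prob:.6f}")
```

The gap between the two logits is large, so the model is very confident that the second sentence follows the first.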

But how do we return the *loss* tensor? We need to add labels for both the MLM head and the NSP head.

The model accepts two additional label inputs:

* *labels* for our MLM.
* *next_sentence_label* for NSP.

All we need to do is fill both of these. First, let's replace the *\[MASK\]* tokens in our text with the original words.

In [None]:
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces attacked Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

And tokenize our text again.

In [None]:
inputs = tokenizer(text, text2, return_tensors='pt')

This time, we must `clone` our *input_ids* tensor to create a new *labels* tensor before masking anything, so that *labels* keeps the original token ids.

In [None]:
inputs['labels'] = inputs.input_ids.detach().clone()

Now we can go ahead and mask *'election'* and *'attacked'* in the *input_ids* tensor. These tokens sit at positions 9 and 44, and `103` is the id of the *\[MASK\]* token.

In [None]:
inputs.input_ids

tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]])

In [None]:
inputs.input_ids[0, [9, 44]] = 103
inputs.input_ids

tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,   103,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,   103,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]])
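
One detail worth knowing: because *labels* is a full copy of *input_ids*, the MLM loss is computed over every position, not only the two masked ones. The Hugging Face BERT heads ignore label positions set to `-100` (the default `ignore_index` of PyTorch's cross-entropy loss), so a common variation is to keep labels only where the input was masked. The sketch below shows that variation on a shortened toy tensor; the constant names and the toy sequence are illustrative and not from the notebook.

```python
import torch

MASK_TOKEN_ID = 103   # [MASK] id for bert-base-uncased
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

# Shortened toy sequence standing in for the notebook's input_ids.
input_ids = torch.tensor([[101, 2044, 2602, 2006, 4457, 102]])

# Copy the original ids first, so the labels keep the true tokens.
labels = input_ids.detach().clone()

# Mask two positions (here 2 and 4; the notebook masks positions 9 and 44).
input_ids[0, [2, 4]] = MASK_TOKEN_ID

# Keep labels only at masked positions; everything else is ignored by the loss.
labels[input_ids != MASK_TOKEN_ID] = IGNORE_INDEX

print(input_ids)
print(labels)
```

Whether to score all positions or only the masked ones is a design choice; the notebook scores all of them.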

And now we just need to add our *next_sentence_label* tensor, a single-value `LongTensor` like in the previous NSP sections. Here `0` means that `text2` really does follow `text` (`IsNextSentence`).

In [None]:
inputs['next_sentence_label'] = torch.LongTensor([0])

Now our `inputs` are ready for processing.

In [None]:
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,   103,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,   103,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labe

In [None]:
outputs = model(**inputs)
outputs.keys()

odict_keys(['loss', 'prediction_logits', 'seq_relationship_logits'])

And now we can see that the model has returned another tensor, our *loss*!

In [None]:
outputs.loss

tensor(1.0767, grad_fn=<AddBackward0>)
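
For intuition about where this number comes from: in the Hugging Face implementation, `BertForPreTraining` adds a cross-entropy loss over the MLM logits to a cross-entropy loss over the NSP logits. The sketch below reproduces that sum with random stand-in logits and labels (not the real model outputs), so the value it prints will not match the one above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 1, 6, 50  # toy sizes, chosen only for illustration

# Random stand-ins for the two heads' outputs and for the labels.
prediction_logits = torch.randn(batch, seq_len, vocab)
seq_relationship_logits = torch.randn(batch, 2)
labels = torch.randint(0, vocab, (batch, seq_len))
next_sentence_label = torch.LongTensor([0])

# MLM loss: cross-entropy over every token position (flattened to 2D).
mlm_loss = F.cross_entropy(prediction_logits.view(-1, vocab), labels.view(-1))
# NSP loss: cross-entropy over the two sentence-relationship classes.
nsp_loss = F.cross_entropy(
    seq_relationship_logits.view(-1, 2), next_sentence_label.view(-1)
)

# The combined pretraining loss is the plain sum of the two.
total_loss = mlm_loss + nsp_loss
print(mlm_loss.item(), nsp_loss.item(), total_loss.item())
```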

We can then use this single combined loss when fine-tuning our models with both MLM and NSP.
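
A minimal sketch of one such training step is shown below. To keep it fast and download-free, it builds a tiny, randomly initialized `BertForPreTraining` from a `BertConfig` with made-up sizes and feeds it a toy batch. The optimizer choice (`AdamW`), the learning rate and the batch contents are all assumptions for illustration, not settings from the notebook.

```python
import torch
from transformers import BertConfig, BertForPreTraining

torch.manual_seed(0)

# Tiny random BERT (hypothetical sizes) so the sketch runs without downloads.
# For real training, load the model with from_pretrained(...) as above.
config = BertConfig(
    vocab_size=200,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=32,
)
model = BertForPreTraining(config)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Toy batch in the same layout as the notebook's `inputs` dictionary.
input_ids = torch.tensor([[101, 5, 103, 7, 102, 9, 103, 102]])
batch = {
    "input_ids": input_ids,
    "token_type_ids": torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]]),
    "attention_mask": torch.ones_like(input_ids),
    "labels": torch.tensor([[101, 5, 6, 7, 102, 9, 10, 102]]),
    "next_sentence_label": torch.LongTensor([0]),
}

optimizer.zero_grad()
loss = model(**batch).loss  # combined MLM + NSP loss
loss.backward()             # gradients flow back through both heads
optimizer.step()            # one parameter update

print(loss.item())
```

In practice this step would sit inside a loop over batches from a `DataLoader`, with the real tokenizer output in place of the toy batch.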