## After tokenization, we need three tensors:
- input_ids.
- attention_mask.
- labels (for now, the labels tensor is just a copy of input_ids).

- Next, input_ids needs to be passed through a masked language modelling (MLM) step, which masks around 15% of the tokens within that tensor.
- During training, the model tries to guess what those masked tokens are. The model is then optimized using the loss between its guesses (computed from the masked input_ids) and the real values (i.e. our labels).

##### MLM
MLM is one of the two core pretraining approaches for BERT. The other is next sentence prediction (NSP).
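- Masking here means replacing selected tokens in the input with a special [MASK] token, so that the model has to predict the original token at each masked position. (This is different from the attention_mask, which tells the model which positions are padding and should be ignored.)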
- We will mask around 15% of the tokens.
- So we’re actually inputting an incomplete sentence and asking BERT to complete it for us.
- This is like humans solving fill-in-the-blank exercises. As humans, we use a mix of general world knowledge and linguistic understanding to work out the missing word. For BERT, the guess comes from reading a lot of text and learning linguistic patterns incredibly well.

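- Argmax is an operation that finds the argument that gives the maximum value of a target function. Here, that means picking the vocabulary index with the highest logit at a masked position, i.e. the model's guess for that token.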
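
To make the fill-in-the-blank idea and argmax concrete, here is a toy sketch in plain Python. The token strings are written out by hand and the scores are invented for illustration; neither comes from BERT, and `mask_tokens` and `argmax` are hypothetical helper names.

```python
# Toy illustration only: tokens are hand-written and scores are invented,
# not produced by a tokenizer or by BERT.
tokens = ["[CLS]", "after", "my", "milk", ",", "i", "will", "eat", "the", "banana", "[SEP]"]


def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Return a copy of `tokens` with the given positions replaced by the mask token."""
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


def argmax(scores):
    """Return the key (the 'argument') whose score is the largest."""
    return max(scores, key=scores.get)


# The model would see this incomplete sentence.
masked = mask_tokens(tokens, positions={2, 3})

# Hypothetical scores a model might assign to candidate tokens at position 3.
scores_at_3 = {"milk": 4.1, "coffee": 3.2, "banana": 0.7}
prediction = argmax(scores_at_3)

print(" ".join(masked))
print(prediction)
```

With real logits the same step would typically be `torch.argmax(logits, dim=-1)`, which returns the highest-scoring vocabulary id at every position.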

labels == input_ids

input_ids -> MLM

In [2]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

Now we initialize our tokenizer and model:

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
text = ("After my milk, I will eat the banana")

return_tensors (str or TensorType, optional): if set, will return tensors instead of a list of Python integers. Acceptable values are:

- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.

In [5]:
# then we tokenize the text:

inputs = tokenizer(text, return_tensors='pt')
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [6]:
inputs.input_ids

tensor([[  101,  2044,  2026,  6501,  1010,  1045,  2097,  4521,  1996, 15212,
           102]])

101 is the [CLS] (classifier) token, 102 is the [SEP] (separator) token, and 103 is the [MASK] token.

Now we can go ahead and create our target labels. The target labels need to be contained in a tensor called labels.

In [7]:
# labels just needs to be an independent copy of the input_ids tensor
inputs['labels'] = inputs.input_ids.detach().clone()
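
The copy matters because the input ids are about to be modified in place. The sketch below, which rebuilds the same id tensor by hand rather than calling the tokenizer, shows that a plain reference follows an in-place change while a `detach().clone()` copy does not.

```python
import torch

# Same ids as the tokenizer output shown above, written out by hand.
input_ids = torch.tensor([[101, 2044, 2026, 6501, 1010, 1045, 2097, 4521, 1996, 15212, 102]])

alias = input_ids                    # same underlying storage, not a copy
labels = input_ids.detach().clone()  # independent copy, detached from autograd

input_ids[0, 2] = 103                # mask one token in place

print(alias[0, 2].item())   # 103: the alias follows the in-place change
print(labels[0, 2].item())  # 2026: the clone still holds the original id
```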

In [8]:
inputs

{'input_ids': tensor([[  101,  2044,  2026,  6501,  1010,  1045,  2097,  4521,  1996, 15212,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[  101,  2044,  2026,  6501,  1010,  1045,  2097,  4521,  1996, 15212,
           102]])}

Next, we want to mask a random selection of tokens within the input_ids tensor, but not within the labels tensor.

In [11]:
# create a tensor of random floats in [0, 1) with the same shape as the input_ids tensor
rand = torch.rand(inputs.input_ids.shape)
rand.shape

torch.Size([1, 11])

In [12]:
rand

tensor([[0.1942, 0.8076, 0.0601, 0.0249, 0.3020, 0.9449, 0.3641, 0.6106, 0.7794,
         0.3370, 0.3545]])

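Now we select a random 15% of those positions. Each value in rand is uniform in [0, 1), so rand < 0.15 is True for roughly 15% of them. The mask also excludes the [CLS] (101) and [SEP] (102) tokens: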

In [17]:
mask_arr = (rand<0.15)*(inputs.input_ids != 101) * (inputs.input_ids != 102)
mask_arr

tensor([[False, False,  True,  True, False, False, False, False, False, False,
         False]])

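With rand < 0.15 alone, the classifier token [CLS] or the separator token [SEP] could also end up being masked, and we don't want that. That is why the cell above already includes a little extra logic that filters out ids 101 and 102.

On its own, the filter looks like this: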

In [14]:
(inputs.input_ids != 101) * (inputs.input_ids != 102)

tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         False]])
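
The selection logic can be wrapped in a small helper. This is only a sketch: `build_mask`, `mask_prob`, and `special_ids` are hypothetical names, and it uses `&` (logical AND) in place of `*`, which gives the same result on boolean tensors. The random values are the ones printed earlier, so the outcome can be checked by hand.

```python
import torch


def build_mask(input_ids, rand, mask_prob=0.15, special_ids=(101, 102)):
    """Boolean mask: True where a token is selected for masking.

    `build_mask` and its parameters are hypothetical names for this sketch.
    """
    mask = rand < mask_prob
    for special_id in special_ids:
        mask = mask & (input_ids != special_id)  # never select special tokens
    return mask


input_ids = torch.tensor([[101, 2044, 2026, 6501, 1010, 1045, 2097, 4521, 1996, 15212, 102]])
# Same random values as printed earlier, so the result is reproducible.
rand = torch.tensor([[0.1942, 0.8076, 0.0601, 0.0249, 0.3020, 0.9449,
                      0.3641, 0.6106, 0.7794, 0.3370, 0.3545]])

mask_arr = build_mask(input_ids, rand)
print(mask_arr)
```

For padded batches, the padding id (0 for bert-base-uncased) could be added to `special_ids` so that padding is never masked either.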

Now we want the index positions of the True values in mask_arr.

In [24]:
# nonzero() returns the indices of the True values; flatten them and convert to a list
selection = torch.flatten(mask_arr[0].nonzero()).tolist()
selection

[2, 3]

torch.flatten flattens input by reshaping it into a one-dimensional tensor. If start_dim or end_dim are passed, only dimensions starting with start_dim and ending with end_dim are flattened. The order of elements in input is unchanged.

t = torch.tensor([[[1, 2],
                   [3, 4]],
                  [[5, 6],
                   [7, 8]]])

torch.flatten(t)

torch.flatten(t, start_dim=1)
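
A short sketch of why flatten is needed in the selection cell: `nonzero()` on a 1-D boolean tensor returns one row per True value, so the result is two-dimensional until it is flattened. The `mask_row` tensor here is a made-up example.

```python
import torch

# Made-up boolean row standing in for mask_arr[0].
mask_row = torch.tensor([False, False, True, True, False])

idx = mask_row.nonzero()    # one row per True value
flat = torch.flatten(idx)   # collapse to a 1-D tensor of indices

print(idx.shape)      # torch.Size([2, 1])
print(flat.tolist())  # [2, 3]

# The documentation example, with its results made explicit.
t = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(torch.flatten(t).tolist())               # [1, 2, 3, 4, 5, 6, 7, 8]
print(torch.flatten(t, start_dim=1).tolist())  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```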

Now we use selection to pick those indices within our input_ids tensor and replace them with the [MASK] token id (103):

In [25]:
inputs.input_ids[0, selection] = 103
inputs.input_ids

tensor([[  101,  2044,   103,   103,  1010,  1045,  2097,  4521,  1996, 15212,
           102]])
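
The index list is one way to apply the mask. As a sketch of two alternatives, the boolean mask can also be used directly, either with boolean indexing or with `masked_fill`. The tensors are rebuilt by hand here, and `MASK_ID` is a name introduced only for this example.

```python
import torch

MASK_ID = 103  # [MASK] id, as noted earlier

input_ids = torch.tensor([[101, 2044, 2026, 6501, 1010, 1045, 2097, 4521, 1996, 15212, 102]])
mask_arr = torch.zeros_like(input_ids, dtype=torch.bool)
mask_arr[0, [2, 3]] = True  # same positions as `selection` above

# Route 1: list of indices, as in the cell above.
via_selection = input_ids.clone()
selection = torch.flatten(mask_arr[0].nonzero()).tolist()
via_selection[0, selection] = MASK_ID

# Route 2: boolean indexing, no index list needed.
via_bool = input_ids.clone()
via_bool[mask_arr] = MASK_ID

# Route 3: masked_fill returns a new tensor and leaves the original untouched.
via_fill = input_ids.masked_fill(mask_arr, MASK_ID)

print(via_fill)
```

The boolean routes also work unchanged on batches with more than one row, whereas the index list above is built for row 0 only.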

Now we can pass all of this into our model. Because inputs includes labels, the model computes the loss from its logits and our labels:

In [26]:
outputs = model(**inputs)

In [30]:
outputs.keys()

odict_keys(['loss', 'logits'])

In [31]:
outputs.loss

tensor(4.8697, grad_fn=<NllLossBackward0>)

With this loss, we can optimize our model by backpropagating it and taking an optimizer step.
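
A minimal sketch of what that optimization loop could look like. To keep it runnable without downloading weights, a tiny stand-in network replaces BertForMaskedLM; `toy_model`, the embedding size, the learning rate, and the number of steps are all assumptions made for illustration. With the real model, the loss would come from `outputs.loss` instead of being computed by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB_SIZE = 30522  # vocabulary size of bert-base-uncased

# Stand-in for BertForMaskedLM: maps token ids to per-token vocabulary logits.
toy_model = nn.Sequential(nn.Embedding(VOCAB_SIZE, 8), nn.Linear(8, VOCAB_SIZE))

# Masked inputs and unmasked labels, copied from the cells above.
input_ids = torch.tensor([[101, 2044, 103, 103, 1010, 1045, 2097, 4521, 1996, 15212, 102]])
labels = torch.tensor([[101, 2044, 2026, 6501, 1010, 1045, 2097, 4521, 1996, 15212, 102]])

optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)

losses = []
for step in range(20):
    optimizer.zero_grad()                  # clear gradients from the previous step
    logits = toy_model(input_ids)          # shape (1, 11, VOCAB_SIZE)
    # Cross-entropy between the predicted distribution and the true token ids.
    loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), labels.view(-1))
    loss.backward()                        # backpropagate
    optimizer.step()                       # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

As in the cells above, the loss here covers every position. A common variation is to set the labels at unmasked positions to -100, which the cross-entropy loss ignores by default, so that only the masked tokens contribute.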