
The following example demonstrates masking for masked language modeling (MLM) with the Hugging Face Transformers library:

In [2]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Example sentence
sentence = "The cat sat on the mat."
inputs = tokenizer(sentence, return_tensors='pt')

# Mask a token. Index 0 is the special [CLS] token, so index 1 is "the"
# (index 2 would be "cat").
# inputs['input_ids'][0, 1]: row 0 is the first (and only) sentence in the batch;
# column 1 is the second token in that sentence (indexing starts at 0).
inputs['input_ids'][0, 1] = tokenizer.mask_token_id

# Run the model
with torch.no_grad():
    outputs = model(**inputs)

# Get the predictions for the masked token
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted token: {predicted_token}")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Predicted token: the


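In this example, the token at index 1 of "The cat sat on the mat." is masked. Because index 0 holds the special [CLS] token, index 1 is the first word, "The", and the model recovers it from the context by predicting "the". To mask "cat" instead, use index 2.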
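The prediction step can also be traced by hand without downloading a model. The sketch below is only an illustration: the five-word vocabulary, the token IDs, the logits, and the helper `predict_masked` are all invented here and are not BERT values or Transformers functions. It performs the same two operations as the cell above: find the mask position, then take the argmax over the vocabulary at that position.

```python
# Toy illustration of the "find the mask, take the argmax" step.
# The vocabulary, token IDs, and logits are invented for this sketch.
vocab = ["[CLS]", "[SEP]", "[MASK]", "the", "cat"]
mask_token_id = vocab.index("[MASK]")

# "[CLS] [MASK] cat [SEP]" as token IDs
input_ids = [0, 2, 4, 1]

# One row of made-up scores per position, one column per vocabulary entry
logits = [
    [5.0, 0.1, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 3.5, 1.2],
    [0.0, 0.1, 0.0, 0.3, 4.0],
    [0.2, 6.0, 0.0, 0.1, 0.1],
]

def predict_masked(input_ids, logits, mask_token_id):
    """Return {position: predicted token ID} for every masked position."""
    predictions = {}
    for pos, tok in enumerate(input_ids):
        if tok == mask_token_id:
            row = logits[pos]
            # argmax over the vocabulary dimension
            predictions[pos] = max(range(len(row)), key=row.__getitem__)
    return predictions

predictions = predict_masked(input_ids, logits, mask_token_id)
predicted_tokens = {pos: vocab[i] for pos, i in predictions.items()}
print(predicted_tokens)  # {1: 'the'}
```

Only the row of logits at the masked position is used; the scores at every other position are ignored when reading off the prediction.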

**How the attention mask works**

In [3]:
from transformers import BertTokenizer, BertModel
import torch

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentences
sentences = ["The cat sat.", "The cat sat on the mat."]

# Tokenize sentences
encoding = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

# Extract input IDs and attention mask
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

print("Input IDs:\n", input_ids)
print("Attention Mask:\n", attention_mask)

# Run the model (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
last_hidden_states = outputs.last_hidden_state

print("Last Hidden States Shape:\n", last_hidden_states.shape)


Input IDs:
 tensor([[  101,  1996,  4937,  2938,  1012,   102,     0,     0,     0],
        [  101,  1996,  4937,  2938,  2006,  1996, 13523,  1012,   102]])
Attention Mask:
 tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])
Last Hidden States Shape:
 torch.Size([2, 9, 768])


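Input IDs: The token IDs of the tokenized sentences. The shorter sentence is padded with 0, the ID of the [PAD] token, to match the length of the longer one.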

Attention Mask: Indicates which tokens are real (1) and which are padding (0).

Last Hidden States: The output tensor from the BERT model, with dimensions [batch_size, sequence_length, hidden_size]; here that is 2 sentences, 9 tokens each after padding, and a hidden size of 768.
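To make the relationship between padding and the mask concrete, here is a minimal pure-Python sketch of what `padding=True` produces for this batch. The helper `pad_batch` is hypothetical, written for illustration only, and is not part of Transformers; the token IDs are copied from the output above.

```python
# Sketch of padding a batch and building the matching attention mask.
# pad_batch is a hypothetical helper, not a Transformers function.
PAD_ID = 0  # the padding ID visible in the Input IDs output above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and mark real tokens with 1."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 1996, 4937, 2938, 1012, 102],                     # "The cat sat."
    [101, 1996, 4937, 2938, 2006, 1996, 13523, 1012, 102],  # "The cat sat on the mat."
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])       # [101, 1996, 4937, 2938, 1012, 102, 0, 0, 0]
print(attention_mask[0])  # [1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The mask has a 1 for each original token and a 0 for each padding slot, so it can always be rebuilt from the unpadded lengths.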

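In this example, the attention mask tells the model to ignore the padding tokens (0 values in the mask) when computing attention, so each real token (1 values) attends only to other real tokens. The output still contains hidden states at the padded positions, which is why the shape is [2, 9, 768]; those vectors are normally discarded downstream.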
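One common way to picture this mechanism is a masked softmax: the attention scores at padded positions are pushed to negative infinity before the softmax, so their weights come out as exactly zero. The sketch below is an assumption-laden illustration of that idea for a single query, not the Transformers implementation; `masked_softmax` and the scores are invented for the example.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0.

    Assumes at least one position has mask == 1.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up attention scores from one query to four key positions;
# the last two positions are padding.
scores = [2.0, 1.0, 0.5, 3.0]
mask = [1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])  # [0.731, 0.269, 0.0, 0.0]
```

Note that the last position has the highest raw score, yet it receives no weight because it is masked; the remaining weights still sum to 1.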