<a href="https://colab.research.google.com/github/simulate111/Deep-Learning-in-Human-Language-Technology/blob/main/Exercise%20task%209%20bert_model_output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic BERT operations


In [1]:
!pip3 -q install datasets transformers

In [2]:
import transformers
import datasets
import torch

In [3]:
tokenizer=transformers.AutoTokenizer.from_pretrained("bert-base-cased") #you can also use the trusty "TurkuNLP/bert-base-finnish-cased-v1"



In [4]:
texts='As a Lord of the Rings fan, I was eagerly awaiting the origin stories of Middle-earth. Of course, I have high [MASK] after Lord of the Rings, which is close to perfection in terms of time and fiction. Because they have a considerable budget and opportunities, that\'s why I gave my points by watching the first episode right away.'

In [5]:
# We will be running the model directly, so let's use return_tensors="pt" to get torch tensors rather than Python lists
#texts=["Dogs like to [MASK] cats. They taste good.","Bad joke!"]
t=tokenizer(texts,padding=True, truncation=True, return_tensors="pt")
print("Input ids",t["input_ids"])
print("Token type ids",t["token_type_ids"])
print("Attention mask",t["attention_mask"])

Input ids tensor([[  101,  1249,   170,  2188,  1104,  1103, 22518,  5442,   117,   146,
          1108, 19379, 16794,  1103,  4247,  2801,  1104,  3089,   118,  4033,
           119,  2096,  1736,   117,   146,  1138,  1344,   103,  1170,  2188,
          1104,  1103, 22518,   117,  1134,  1110,  1601,  1106, 17900,  1107,
          2538,  1104,  1159,  1105,  4211,   119,  2279,  1152,  1138,   170,
          5602,  4788,  1105,  6305,   117,  1115,   112,   188,  1725,   146,
          1522,  1139,  1827,  1118,  2903,  1103,  1148,  2004,  1268,  1283,
           119,   102]])
Token type ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Attention mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [6]:
# This is what the first sequence looks like
tokenizer.decode(t["input_ids"][0])

"[CLS] As a Lord of the Rings fan, I was eagerly awaiting the origin stories of Middle - earth. Of course, I have high [MASK] after Lord of the Rings, which is close to perfection in terms of time and fiction. Because they have a considerable budget and opportunities, that's why I gave my points by watching the first episode right away. [SEP]"

# BERT: bare model
* How to use the bare model
* What does it give us?

In [7]:
bert=transformers.AutoModel.from_pretrained("bert-base-cased") #"TurkuNLP/bert-base-finnish-cased-v1" if you run this in Finnish


* In torch, a model's `forward()` method is invoked through `__call__()`, i.e. it runs when you call the model as if it were a function
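To make the call convention concrete, here is a minimal plain-Python sketch. `ToyModule` is a hypothetical stand-in and not `torch.nn.Module` itself (the real `__call__()` also runs hooks around `forward()`). It also shows how `**` unpacks a dict into keyword arguments, which is what makes a call such as `bert(**t)` possible.

```python
class ToyModule:
    """Toy stand-in for a torch module (hypothetical, not torch.nn.Module)."""

    def forward(self, input_ids):
        # Pretend "model": just report how many token ids it received
        return len(input_ids)

    def __call__(self, *args, **kwargs):
        # Calling the object like a function delegates to forward()
        return self.forward(*args, **kwargs)


toy = ToyModule()
inputs = {"input_ids": [101, 1249, 170, 102]}

direct = toy.forward(input_ids=inputs["input_ids"])  # explicit call
called = toy(**inputs)  # ** unpacks the dict into keyword arguments
print(direct, called)
```

Both calls go through the same `forward()` and therefore return the same value.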


In [8]:
bert_out=bert(
    input_ids=t["input_ids"],
    attention_mask=t["attention_mask"],
    token_type_ids=t["token_type_ids"])
#an easy way to say the above would be bert(**t)


That's it, this is how you call BERT. Now let's see what it gave us (the output behaves like a dictionary).

In [9]:
bert_out.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

* last_hidden_state: the output of the last encoder layer, one vector per input token
* pooler_output: a dense layer with `tanh` activation applied to the last-layer vector of `[CLS]`
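As a rough illustration of how the two outputs relate, the sketch below applies a dense layer followed by `tanh` to the first (`[CLS]`) vector of a tiny, made-up `last_hidden_state`. The sizes, weights and the `pooler` helper are assumptions for illustration only. The real model uses a hidden size of 768 and learned weights.

```python
import math


def pooler(cls_vector, weight, bias):
    """Dense layer followed by tanh, applied to the [CLS] vector."""
    out = []
    for row, b in zip(weight, bias):
        z = sum(w * x for w, x in zip(row, cls_vector)) + b
        out.append(math.tanh(z))
    return out


# Made-up last_hidden_state for one sequence: 3 tokens, hidden size 2
last_hidden_state = [
    [0.5, -1.0],  # [CLS]
    [0.1, 0.2],
    [0.3, 0.4],
]
weight = [[1.0, 0.0], [0.0, 1.0]]  # made-up identity weights
bias = [0.0, 0.0]

pooled = pooler(last_hidden_state[0], weight, bias)
# One vector per token goes in, a single vector per sequence comes out
print(len(last_hidden_state), len(last_hidden_state[0]), len(pooled))
```

This is why `last_hidden_state` has one more dimension (the token positions) than `pooler_output`.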

In [10]:
# Before you run this, stop to think:
# What will the shape be? How many dimensions? 1? 2? 3? more? And their approximate sizes?
# make a guess, see if it matches
bert_out.last_hidden_state.shape

torch.Size([1, 72, 768])

In [11]:
# And here? What will the shape be?
bert_out.pooler_output.shape

torch.Size([1, 768])

# BERT: masked language modelling output

* Not much we can do with the above
* But BERT is trained to predict masked words, let's try!

In [12]:
# Have a look at HuggingFace automodels documentation to see what types of automodels there are
bert=transformers.AutoModelForPreTraining.from_pretrained("bert-base-cased")

In [13]:
# Tell the model it is not really being trained (this disables dropout, for example)
# Probably not needed, since the docs say the model is put in eval mode upon load, but playing it safe: https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained.config
bert=bert.eval()

Now we can again run the model, and we will see the output is quite different!

In [14]:
bert_out=bert(**t)
bert_out.keys()

odict_keys(['prediction_logits', 'seq_relationship_logits'])

In [15]:
# What are these? https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#transformers.BertForPreTraining
#What do you think these shapes will be?
print("Logits",bert_out["prediction_logits"].shape)
print("Seq relationship logits",bert_out["seq_relationship_logits"].shape)

Logits torch.Size([1, 72, 28996])
Seq relationship logits torch.Size([1, 2])


In [16]:
#cross-check
tokenizer.vocab_size

28996

...now let's see how well this works for the masked word prediction...
* we need to find the most likely predicted words
* which can be achieved by arg-sorting the predictions and picking top N words
* this is easy and we have done this kind of stuff before
* now let's try straight in torch without a roundtrip to numpy
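Before doing this in torch, the arg-sort-and-pick-top-N step can be sketched in plain Python on a tiny hand-made "tensor" shaped `[batch][position][vocab]`. The scores and the `top_n_last_axis` helper are made up for illustration. The point is only that sorting happens along the last (vocabulary) axis, separately for every position.

```python
def top_n_last_axis(logits, n):
    """For a nested list shaped [batch][position][vocab], return the indices
    of the n largest scores at every position, best first."""
    return [
        [
            sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
            for scores in sequence
        ]
        for sequence in logits
    ]


# Made-up scores: 1 sequence, 2 positions, vocabulary of 4 "words"
logits = [[[0.1, 2.0, -1.0, 0.5],
           [3.0, 0.0, 1.0, 2.5]]]

top2 = top_n_last_axis(logits, 2)
print(top2)  # [[[1, 3], [0, 3]]]
```

In torch the same idea is an argsort along the last dimension followed by slicing. `torch.topk(predictions, k=20, dim=2).indices` should give the same indices without sorting the whole vocabulary.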

In [17]:
predictions = bert_out["prediction_logits"]
print(predictions.shape)
top20=torch.argsort(predictions,dim=2,descending=True)[:,:,:20] #why dim=2? what does [:,:,:20] do?
print(top20)

torch.Size([1, 72, 28996])
tensor([[[  119,   117,  1103,  ...,  1123,   113,   146],
         [ 1249,   119,  1112,  ...,  1108,  1104,  1116],
         [  170,  1126,   138,  ...,  1108,  1821,   188],
         ...,
         [ 1283, 11343,  1175,  ...,  1303,  1378,  1313],
         [  119,   106,   132,  ...,  1272,  1139,  1362],
         [  119,   132,  1232,  ...,  1570,  1111,   188]]])


In [18]:
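print(texts) # texts is a single string, so texts[0] would print only its first character

# NOTE: position 4 holds the token "of" (id 1104); the [MASK] (id 103) is at position 27 of the input ids
print("Guesses:",tokenizer.decode(top20[0,4]))

As a Lord of the Rings fan, I was eagerly awaiting the origin stories of Middle-earth. Of course, I have high [MASK] after Lord of the Rings, which is close to perfection in terms of time and fiction. Because they have a considerable budget and opportunities, that's why I gave my points by watching the first episode right away.
Guesses: of to and Of the'from in with for. but bya at -, timer year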
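The cell above reads the guesses at a fixed position, 4, which in this text is the token "of". The `[MASK]` sits elsewhere. A minimal sketch of locating it from the input ids instead of hard-coding an index is shown below. The ids are the first 30 values printed by In [5], and `MASK_ID = 103` is assumed to equal `tokenizer.mask_token_id`.

```python
MASK_ID = 103  # assumed to equal tokenizer.mask_token_id for this model

# First 30 input ids of the example, copied from the output of In [5]
input_ids = [101, 1249, 170, 2188, 1104, 1103, 22518, 5442, 117, 146,
             1108, 19379, 16794, 1103, 4247, 2801, 1104, 3089, 118, 4033,
             119, 2096, 1736, 117, 146, 1138, 1344, 103, 1170, 2188]

# Collect every position whose id is the mask id
mask_positions = [i for i, token_id in enumerate(input_ids) if token_id == MASK_ID]
print(mask_positions)  # [27]
```

With the tensors from the notebook, the guesses for the masked word would then be read at `top20[0, mask_positions[0]]` rather than at `top20[0,4]`.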


# ...in one block...

In [19]:
#texts=["Dogs like to [MASK] cats. They are cute."]
texts='As a Lord of the Rings fan, I was eagerly awaiting the origin stories of Middle-earth. Of course, I have high [MASK] after Lord of the Rings, which is close to perfection in terms of time and fiction. Because they have a considerable budget and opportunities, that\'s why I gave my points by watching the first episode right away.'
t=tokenizer(texts,padding=True, truncation=True, return_tensors="pt")
bert_out=bert(**t)
top20=torch.argsort(bert_out["prediction_logits"],dim=2,descending=True)[:,:,:20]
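# NOTE: index 4 is the token "of", not the [MASK]; the [MASK] (id 103) is at position 27 in this text
print("Guesses:",tokenizer.decode(top20[0,4]))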

Guesses: of to and Of the'from in with for. but bya at -, timer year


In [20]:
print(t)
print(tokenizer.mask_token_id)

{'input_ids': tensor([[  101,  1249,   170,  2188,  1104,  1103, 22518,  5442,   117,   146,
          1108, 19379, 16794,  1103,  4247,  2801,  1104,  3089,   118,  4033,
           119,  2096,  1736,   117,   146,  1138,  1344,   103,  1170,  2188,
          1104,  1103, 22518,   117,  1134,  1110,  1601,  1106, 17900,  1107,
          2538,  1104,  1159,  1105,  4211,   119,  2279,  1152,  1138,   170,
          5602,  4788,  1105,  6305,   117,  1115,   112,   188,  1725,   146,
          1522,  1139,  1827,  1118,  2903,  1103,  1148,  2004,  1268,  1283,
           119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

# TASKS

As an exercise, you can try to solve the following:

1. How good is BERT at the masked language modelling (MLM) task? Feed it random texts, e.g. from the IMDB dataset, mask one random token at a time, and check: did BERT predict it correctly?
2. If you did (1), can you tell whether the correct token was among BERT's top-5 predictions?
3. See if you can do better. Write a program which picks random texts from one of the datasets we used in this course and produces two files: one with text segments containing one [MASK] each, and one with the correct answers. Then try to guess the words without looking at the latter file, and compare your answers with the correct ones. How well did you do?
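For tasks 1 and 2, the scoring itself can be sketched independently of the model. The helper `top_k_accuracy` and the ranked guesses below are made up for illustration. In a real run each list would hold the decoded top predictions at one `[MASK]` position, best first.

```python
def top_k_accuracy(ranked_guesses, answers, k):
    """Fraction of examples whose correct answer is among the first k guesses."""
    hits = sum(
        1
        for guesses, answer in zip(ranked_guesses, answers)
        if answer in guesses[:k]
    )
    return hits / len(answers)


# Made-up ranked guesses (best first) for three masked positions
ranked_guesses = [
    ["of", "to", "and", "in", "for"],
    ["bad", "good", "great", "poor", "fine"],
    ["show", "express", "feel", "display", "hide"],
]
answers = ["of", "great", "convey"]

top1 = top_k_accuracy(ranked_guesses, answers, k=1)
top5 = top_k_accuracy(ranked_guesses, answers, k=5)
print(top1, top5)
```

Top-1 accuracy answers task 1 and top-5 accuracy answers task 2. Top-5 can never be lower than top-1.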


As a Lord of the Rings fan, I was eagerly awaiting the origin stories of Middle-earth. Of course, I have high **expectations** after Lord of the Rings, which is close to perfection in terms of time and fiction. Because they have a considerable budget and opportunities, that's why I gave my points by watching the first episode right away.

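Comments: I masked 'expectations'. Note that the guesses printed above are read at position 4, which is the token 'of', while the [MASK] is at position 27 of the input ids. They therefore show that BERT recovers 'of' as its top guess, and say nothing yet about whether it can predict 'expectations'.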

In [21]:
import random
from datasets import load_dataset
dataset = load_dataset("imdb")

# Choose a subset of comments (for simplicity, we will take 10 random samples from the test set)
num_samples = 10
comments = dataset['test']['text']
random_comments = random.sample(comments, num_samples)

# Prepare to display masked comments and answers
masked_comments = []
correct_answers = []

for comment in random_comments:
    # Split the comment into whitespace-separated words (these are not BERT wordpieces)
    words = comment.split()

    # Randomly choose an index to mask (avoid masking the first and last word for better context)
    if len(words) > 2:  # Ensure there's a token to mask
        mask_index = random.randint(1, len(words) - 2)  # Avoid first and last word
        correct_answer = words[mask_index]  # Store the correct answer

        # Mask the selected word
        words[mask_index] = "[MASK]"
        masked_comment = " ".join(words)

        # Store results
        masked_comments.append(masked_comment)
        correct_answers.append(correct_answer)

# Print masked comments and answers
print("Masked Comments:")
for masked in masked_comments:
    print(masked)

print("\nCorrect Answers:")
for answer in correct_answers:
    print(answer)


Masked Comments:
So.. what can I tell you about this movie. If you cheated a lot in high school, you do recognize some cheattips...<br /><br />This is the best thing i can tell you about this film!<br /><br />If you like American-teen movies, maybe you also like it!<br /><br />But i don't see this kind of movies as something funny.. sorry to say but if you are older then 10 years, i shouldn't advise you to watch this.<br /><br />Because there is one shot with a [MASK] of beautiful women (girls.. in this movie) i'll give it a rate of: 2!<br /><br />so.. deal for yourself! good luck
Remember when Rick Mercer was funny? 22 Minutes was a great show when Rick Mercer was on it and Made In Canada was a great show once too. Talking To Americans was such a funny special too. But like my friend said "Rick Mercer woke up one day and wasn't funny any more" I think that day was when Rick Mercer Report went on the air. What is the point of this show? Rick Mercer reads wacky fake headlines, shows pic

In [22]:
import random

# Reuses `dataset` loaded in the previous cell
comments = dataset['test']['text']

random_comment = random.choice(comments)
words = random_comment.split()
mask_index = random.randint(1, len(words) - 2)
correct_answer = words[mask_index]

words[mask_index] = "[MASK]"
masked_comment = " ".join(words)
print("Masked:")
print(masked_comment)
print("Correct:")
print(correct_answer)

Masked:
The acting in this film was of the old school: corny and stiff. Irene Dunne is luminous, and comes off the best even though she has some very unnatural lines to say. Still, her ability to [MASK] emotion comes through.<br /><br />Old movie buffs will find at least some redeeming qualities in this film through observation of cinematic technique of the 1930s. Otherwise, it is not really that worthwhile.
Correct:
convey
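Task 3 asks for two files, one with the masked texts and one with the answers. A minimal sketch of that step follows, using a few stand-in sentences instead of the IMDB data. The helper `mask_one_word`, the file names and the fixed seed are assumptions for illustration.

```python
import os
import random
import tempfile


def mask_one_word(text, rng):
    """Replace one randomly chosen inner word with [MASK].
    Returns (masked_text, answer), or None for texts shorter than three words."""
    words = text.split()
    if len(words) < 3:
        return None
    index = rng.randint(1, len(words) - 2)  # never the first or last word
    answer = words[index]
    words[index] = "[MASK]"
    return " ".join(words), answer


# Stand-in texts; in the notebook these would come from dataset['test']['text']
texts = [
    "Dogs like to chase cats.",
    "Bad joke!",
    "This film was not really that worthwhile.",
]
rng = random.Random(0)  # fixed seed so the run is repeatable

pairs = [p for p in (mask_one_word(text, rng) for text in texts) if p is not None]

out_dir = tempfile.mkdtemp()
masked_path = os.path.join(out_dir, "masked.txt")    # hypothetical file name
answers_path = os.path.join(out_dir, "answers.txt")  # hypothetical file name
with open(masked_path, "w", encoding="utf-8") as masked_file, \
        open(answers_path, "w", encoding="utf-8") as answers_file:
    for masked, answer in pairs:
        masked_file.write(masked + "\n")
        answers_file.write(answer + "\n")

print(len(pairs), "masked texts written to", out_dir)
```

Line i of the answers file corresponds to line i of the masked file, so you can guess from the first file and compare against the second afterwards. Because the split is on whitespace, an answer may carry attached punctuation.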
