# Biweekly Report 3

# Visualizing BERT Attention

## Jake Watts



In this section of the report, I visualize the attention from the trained BERT model used in the BERT_Tuning notebook on the MRPC data. I first visualize the attention between two sentences that are not semantically equivalent. I then visualize the attention between two sentences that are semantically equivalent. My goal in visualizing these pairs is to get a better understanding of BERT attention and to see if there is a difference in attention between equivalent and non-equivalent sentences.

Note: Since the visualizations are interactive, they can't be viewed on GitHub and have to be run in a Jupyter Notebook.

Sources:

https://github.com/jessevig/bertviz#setting-default-layer-head-s

https://arxiv.org/abs/1810.04805

In [None]:
!pip install bertviz

Collecting bertviz
  Downloading bertviz-1.3.0-py3-none-any.whl (155 kB)
Collecting transformers>=2.0
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting boto3
  Downloading boto3-1.21.4-py3-none-any.whl (132 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting sacremoses


Here I load the pre-trained BERT model. Sentence A and Sentence B are sentences from the MRPC dataset that are not semantically equivalent. They are converted into tokens and passed through the model to obtain the attention weights.

In [None]:
# Load model and retrieve attention weights

from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

model_version = 'bert-base-uncased'
# output_attentions=True makes the model return its attention weights
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version)
sentence_a = "The identical rovers will act as robotic geologists , searching for evidence of past water ."
sentence_b = "The rovers act as robotic geologists , moving on six wheels ."
# Encode the pair as a single input: [CLS] sentence A [SEP] sentence B [SEP]
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids'] # 0 for sentence A tokens, 1 for sentence B tokens
# The last element of the model output holds the attention weights
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
# Position of the first sentence B token
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
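
To make the `sentence_b_start` step more concrete, here is a minimal sketch on a hand-written sentence pair. The token list and segment ids below are assumptions written out by hand for illustration, not actual tokenizer output; they only follow the `[CLS] A [SEP] B [SEP]` layout that BERT uses for sentence pairs.

```python
# Hand-written stand-in for an encoded sentence pair (illustrative only;
# not produced by the real tokenizer).
tokens = ["[CLS]", "the", "rovers", "[SEP]", "six", "wheels", "[SEP]"]
# Segment ids: 0 covers [CLS], sentence A and the first [SEP];
# 1 covers sentence B and the final [SEP].
token_type_ids = [0, 0, 0, 0, 1, 1, 1]

# The first position with segment id 1 is where sentence B begins.
sentence_b_start = token_type_ids.index(1)

sentence_a_tokens = tokens[:sentence_b_start]
sentence_b_tokens = tokens[sentence_b_start:]

print("sentence_b_start:", sentence_b_start)
print("Sentence A tokens:", sentence_a_tokens)
print("Sentence B tokens:", sentence_b_tokens)
```

This index is what lets the visualization split the token list into the two sentences when a filter such as "Sentence A -> Sentence B" is selected.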

In [None]:
print("Sentence A:", sentence_a)
print("Sentence B:", sentence_b)

Sentence A: The identical rovers will act as robotic geologists , searching for evidence of past water .
Sentence B: The rovers act as robotic geologists , moving on six wheels .


Below is the attention visualization, displayed in dark mode to match Colab's dark theme. The grid is 12x12, with one cell for each of the 12 attention heads in each of the 12 layers. To view only the attention between sentences A and B, I chose "Sentence A -> Sentence B" in the attention dropdown at the top. To see more detail for a specific layer and attention head, click on a cell in the grid to view the words and the attention between them.

From the visualization, we can see that the attention connections become stronger as the layer number increases. In a sizeable number of cells, the words in sentence A are connected mainly to the [SEP] token. Some cells display a different pattern, with parallel lines in the top half of the cell. Clicking on those cells reveals that they connect the words that both sentences share, which mostly occur in the first half of the sentences.
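
The "Sentence A -> Sentence B" filter and the [SEP] pattern can also be expressed numerically. The sketch below is a rough illustration under stated assumptions: the 5x5 matrix is made-up toy data standing in for a single head's attention, and the helper names `a_to_b_block` and `mean_attention_to_column` are hypothetical, not part of bertviz or transformers.

```python
def a_to_b_block(attn, sentence_b_start):
    """Keep only rows for sentence A tokens and columns for sentence B tokens."""
    return [row[sentence_b_start:] for row in attn[:sentence_b_start]]


def mean_attention_to_column(attn, rows, col):
    """Average attention that the given source rows pay to one target column."""
    return sum(attn[r][col] for r in rows) / len(rows)


# Toy data (made up for illustration; not real BERT output).
# attn[i][j] is the attention that token i pays to token j; each row sums to 1.
tokens = ["[CLS]", "rovers", "[SEP]", "wheels", "[SEP]"]
sentence_b_start = 3
attn = [
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.1, 0.1, 0.1, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.2, 0.5],
    [0.25, 0.25, 0.25, 0.125, 0.125],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]

# Attention from sentence A tokens (rows) to sentence B tokens (columns).
block = a_to_b_block(attn, sentence_b_start)

# Average attention from sentence A tokens to the final [SEP] token.
final_sep = len(tokens) - 1
sep_share = mean_attention_to_column(attn, range(sentence_b_start), final_sep)

for tok, row in zip(tokens[:sentence_b_start], block):
    print(tok, row)
print("mean attention from sentence A to final [SEP]:", round(sep_share, 3))
```

Applied to a real head, a high value for the [SEP] share would correspond to the cells where most lines from sentence A converge on the [SEP] token.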


In [None]:
model_view(attention, tokens, sentence_b_start, display_mode="dark")

Now let's look at sentences that are semantically equivalent. Sentence A and B are printed below.

In [None]:
model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version)
sentence_a = "The 2002 second quarter results don 't include figures from our friends at Compaq ."
sentence_b = "The year-ago numbers do not include figures from Compaq Computer ."
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list) 

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
print("Sentence A:", sentence_a)
print("Sentence B:", sentence_b)

Sentence A: The 2002 second quarter results don 't include figures from our friends at Compaq .
Sentence B: The year-ago numbers do not include figures from Compaq Computer .


Looking at Sentence A -> Sentence B attention shows a similar pattern: some heads show connections to the [SEP] token, and some show attention between words with equivalent meaning in the two sentences. For example, layer 4, head 3 shows an attention mapping between equivalent words across the two sentences. Since these sentences are semantically equivalent, there are more of these connections in this visualization than in the previous one.
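
One rough way to quantify this kind of word-to-word alignment is to score a head by how much attention flows between matching tokens in the two sentences. The sketch below is only an illustration: it treats identical tokens as a simple proxy for "equivalent" words, the two toy heads are constructed by hand rather than taken from BERT, and the helper names (`shared_token_pairs`, `matching_token_score`, `peaked_row`) are hypothetical.

```python
def shared_token_pairs(tokens, sentence_b_start, ignore=("[CLS]", "[SEP]")):
    """Return (i, j) index pairs where token i in sentence A equals token j in sentence B."""
    pairs = []
    for i, tok_a in enumerate(tokens[:sentence_b_start]):
        if tok_a in ignore:
            continue
        for j in range(sentence_b_start, len(tokens)):
            if tokens[j] == tok_a:
                pairs.append((i, j))
    return pairs


def matching_token_score(attn, pairs):
    """Mean attention weight over the given (source, target) pairs."""
    if not pairs:
        return 0.0
    return sum(attn[i][j] for i, j in pairs) / len(pairs)


def peaked_row(n, target, peak=0.7):
    """Toy attention row: `peak` on one target, the remainder spread evenly."""
    rest = (1.0 - peak) / (n - 1)
    return [peak if k == target else rest for k in range(n)]


# Toy sentence pair (hand-written, not tokenizer output).
tokens = ["[CLS]", "include", "figures", "[SEP]", "include", "figures", "[SEP]"]
sentence_b_start = 4
n = len(tokens)

# Toy head 1: each word attends mostly to its counterpart in the other sentence.
align_targets = [0, 4, 5, 3, 1, 2, 6]
align_head = [peaked_row(n, t) for t in align_targets]

# Toy head 2: every token attends mostly to the final [SEP].
sep_head = [peaked_row(n, n - 1) for _ in range(n)]

pairs = shared_token_pairs(tokens, sentence_b_start)
score_align = matching_token_score(align_head, pairs)
score_sep = matching_token_score(sep_head, pairs)

print("shared token pairs:", pairs)
print("alignment-style head score:", round(score_align, 3))
print("[SEP]-style head score:", round(score_sep, 3))
```

Identical-token matching would miss paraphrases such as "don 't" and "do not", so a score like this would understate the alignment seen in the visualization; it is meant only as a starting point.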

In [None]:
model_view(attention, tokens, sentence_b_start, display_mode="dark")

## Summary

Visualizing attention is helpful for exploring and understanding the BERT model. From the visualizations, it is also easy to see some differences between sentence pairs that are semantically equivalent and those that are not. The pre-trained model already has a grasp of the English language, so it makes sense that it can be quickly fine-tuned to evaluate sentence equivalence.