
How to interpret the visualization results? #15

Closed
mertyyanik opened this issue Aug 19, 2020 · 2 comments

Comments

@mertyyanik

Can you explain the visualization results? What is the meaning of each of the head plots?

@tatp22
Owner

tatp22 commented Aug 19, 2020

TL;DR: This shows the relationship between input tokens x and y. A brighter value means a stronger relationship between token x and token y, and a darker value means a weaker one. Every head is independent, meaning that the attention mechanism can learn a diverse set of representations that allow it to better model the problem.

Hi @mertyyanik!

I'll do my best here to explain what the visualization results mean.

What I am plotting are the attention maps, for every head at every depth. To explain what an attention head is, let's look at the attention equation for a single head (I am going to use the full attention equation for now and switch to the Linformer later):

attn(Q,K,V) = softmax(QK^T/sqrt(d_m))V

Basically, what I am plotting is softmax(QK^T/sqrt(d_m)). What is this result exactly? Let's take a look at this through just the encoder module.
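As a minimal sketch in PyTorch (toy shapes and made-up projection weights, not the repo's actual code), the quantity plotted for a single head is:

```python
import torch

# Toy sizes, purely for illustration: n tokens, embedding dim d_m, head dim d.
n, d_m, d = 3, 8, 4
x = torch.randn(n, d_m)      # embedded input sequence

# Hypothetical per-head projections (the real module has its own learned weights).
W_q = torch.randn(d_m, d)
W_k = torch.randn(d_m, d)

Q = x @ W_q                  # (n, d)
K = x @ W_k                  # (n, d)

# This (n, n) matrix is what each head's plot shows; every row sums to 1.
attn_weights = torch.softmax(Q @ K.T / d_m ** 0.5, dim=-1)
print(attn_weights)
```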

Let's say that you just wanted to encode the sentence "The dog ran." by passing it through, for example, the LinformerLM class. For simplicity, assume that only 3 tokens were created: one for "The", one for "dog", and one for "ran". You would then tokenize the words (for example, "The" -> 4212, "dog" -> 123, "ran" -> 443) and vectorize them ([4212, 123, 443]). These tokens go through an embedding, and then your input is of size (3, d_m), where d_m is the dimension of the embedding. Finally, this goes through the Linformer.
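As a small sketch of those shapes (the vocabulary size here is made up, and LinformerLM handles the embedding internally):

```python
import torch
import torch.nn as nn

d_m = 64                                   # assumed embedding dimension
vocab_size = 5000                          # hypothetical vocabulary size
token_ids = torch.tensor([4212, 123, 443]) # "The", "dog", "ran"

embedding = nn.Embedding(vocab_size, d_m)
x = embedding(token_ids)
print(x.shape)                             # torch.Size([3, 64]) -> (3, d_m)
```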

Now, when it goes through the encoder, this input of size (3, d_m) gets transformed into Q, K, V matrices. Each of these Q, K, V matrices is a different representation of the input tokens, and they are of size (3, d). We multiply Q and K inside the softmax to get softmax(QK^T/sqrt(d_m)), which is a 3x3 matrix. One can think of both of its dimensions as representing the input words; there are 3 words along dimension 1 and 3 along dimension 2, and both dimensions correspond to the same sentence, "The dog ran". The standard interpretation, then, is that this matrix shows how much information each token routes through the others. As an example, let's say that the output of softmax(QK^T/sqrt(d_m)) is, for this head:

0.5 0.5 0.0
0.0 1.0 0.0
0.3 0.3 0.4

Here, the word "The" routes 0.5 of its information through "The" and 0.5 through "dog"; "dog" routes all of its information through "dog"; and "ran" routes its information through a mix of all 3.
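To make "routes" a bit more concrete, here is a small sketch of how that example matrix would mix value vectors; the V values below are made up purely for illustration:

```python
import torch

# The example attention weights from above; rows are "The", "dog", "ran".
A = torch.tensor([[0.5, 0.5, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.3, 0.3, 0.4]])

# Hypothetical value vectors, one per token, with a tiny dimension for clarity.
V = torch.tensor([[1.0, 0.0],    # value vector for "The"
                  [0.0, 1.0],    # value vector for "dog"
                  [2.0, 2.0]])   # value vector for "ran"

out = A @ V
print(out)
# Row 0 is 0.5*V_The + 0.5*V_dog, row 1 is exactly V_dog,
# and row 2 is a mix of all three value vectors.
```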

Finally, this is done for every head at every depth, and what one gets is a visualization of the internal representations. What exactly do these internal representations mean? Well, it depends on your task, but generally, they show how important the relationship between two tokens is.

For some great articles and papers on how this works, with more visualizations and better plots, here are some links:

https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1
https://arxiv.org/pdf/2002.12327.pdf
https://arxiv.org/pdf/1906.04341.pdf

Now, what I described is how this works for standard attention. As for the Linformer attention... to be honest, it's hard to say. Recall that it is of the form

attn(Q,K,V) = softmax(Q(EK)^T/sqrt(d_m))FV

where E, F are of size (k, n) if they are linear layers, or they are convolutions with a kernel size of (n/k) and a stride of (n/k).
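As a rough sketch of the linear-layer case (toy shapes only, not the library's actual implementation), following the formula above:

```python
import torch

n, k, d = 8, 4, 16           # toy sequence length, projection dim, head dim
Q = torch.randn(n, d)
K = torch.randn(n, d)
V = torch.randn(n, d)

# E and F are the (k, n) projections applied to K and V.
E = torch.randn(k, n)
F = torch.randn(k, n)

# The plotted matrix is now (n, k) instead of (n, n).
attn_weights = torch.softmax(Q @ (E @ K).T / d ** 0.5, dim=-1)
out = attn_weights @ (F @ V)            # (n, d)
print(attn_weights.shape, out.shape)    # torch.Size([8, 4]) torch.Size([8, 16])
```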

I imagine that when using the convolutional downsampling method, it shows how a token x relates to a neighborhood of tokens y, in a similar fashion to the above. Just imagine that instead of a 3x3 matrix, one has a 2048x512 matrix (if n=2048 and k=512), and that a token x can look at a set of tokens y and see how important each y is to route information through. With the convolutional method this stays interpretable, because each y corresponds to a local group of 4 tokens in this example.

To give another small example for the convolutional method, imagine that you have 4 tokens, a, b, c, d, and k is 2. Then the attention matrix would be of size (4, 2), and each of the 4 tokens would effectively be deciding "How much information should I route from myself to the group a, b, and how much to the group c, d?", as sketched below.
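A minimal sketch of that convolutional downsampling, with the same toy numbers (4 tokens, k=2, so a kernel size and stride of n/k = 2); the layer here is just an illustration, not the exact one used in the repo:

```python
import torch
import torch.nn as nn

n, k, d = 4, 2, 16                       # 4 tokens a, b, c, d; k = 2
K_full = torch.randn(1, n, d)            # keys for the tokens, batch of 1

# Kernel size and stride of n/k = 2: column 0 summarizes (a, b), column 1 (c, d).
downsample = nn.Conv1d(d, d, kernel_size=n // k, stride=n // k)
K_down = downsample(K_full.transpose(1, 2)).transpose(1, 2)   # (1, k, d)

Q = torch.randn(1, n, d)
attn_weights = torch.softmax(Q @ K_down.transpose(1, 2) / d ** 0.5, dim=-1)
print(attn_weights.shape)                # torch.Size([1, 4, 2]) -> (n, k)
```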

However, if one were to replace it with a linear layer or a fixed layer, I think that all interpretability goes down the drain, because the tokens can be totally dispersed throughout the whole head. That said, this may actually be an open research question, and if you want to investigate further, feel free!

I hope this answered your question!

@mertyyanik
Author

Thanks for the perfect explanation!
