
How to interpret the visualization results? #15

Closed
mertyyanik opened this issue Aug 19, 2020 · 2 comments

Comments

@mertyyanik

Can you explain the visualization results? What is the meaning of each of the head plots?

@tatp22
Owner

tatp22 commented Aug 19, 2020

TL;DR: This shows the relationship between input tokens x and y. A brighter value means a stronger relationship between token x and token y, and a darker value means a weaker one. Every head is independent, meaning that the attention mechanism can learn a diverse set of representations that allow it to better model the problem.

Hi @mertyyanik!

I'll do my best here to explain what the visualization results mean.

What I am plotting are the attention maps, for every head at every depth. To explain what an attention head is, let's look at the attention equation for a single head (I am going to use the full attention equation for now and switch to the Linformer later):

attn(Q,K,V) = softmax(QK^T/sqrt(d_m))V

Basically, what I am plotting is softmax(QK^T/sqrt(d_m)). What is this result exactly? Let's take a look at this through just the encoder module.
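As a minimal sketch in PyTorch (toy shapes and made-up projection weights, not the repo's actual code), the quantity plotted for a single head is:

```python
import torch

# Toy sizes, purely for illustration: n tokens, embedding dim d_m, head dim d.
n, d_m, d = 3, 8, 4
x = torch.randn(n, d_m)      # embedded input sequence

# Hypothetical per-head projections (the real module has its own learned weights).
W_q = torch.randn(d_m, d)
W_k = torch.randn(d_m, d)

Q = x @ W_q                  # (n, d)
K = x @ W_k                  # (n, d)

# This (n, n) matrix is what each head's plot shows; every row sums to 1.
attn_weights = torch.softmax(Q @ K.T / d_m ** 0.5, dim=-1)
print(attn_weights)
```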

Let's say that you just wanted to encode the sentence "The dog ran." by passing it through, for example, the LinformerLM class. For simplicity, assume that only 3 tokens were created: one for "The", one for "dog", and one for "ran". You would then tokenize the words (for example, "The" -> 4212, "dog" -> 123, "ran" -> 443) and vectorize them ([4212, 123, 443]). These tokens go through an embedding, and then your input is of size (3, d_m), where d_m is the dimension of the embedding. Finally, this goes through the Linformer.
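As a small sketch of those shapes (the vocabulary size here is made up, and LinformerLM handles the embedding internally):

```python
import torch
import torch.nn as nn

d_m = 64                                   # assumed embedding dimension
vocab_size = 5000                          # hypothetical vocabulary size
token_ids = torch.tensor([4212, 123, 443]) # "The", "dog", "ran"

embedding = nn.Embedding(vocab_size, d_m)
x = embedding(token_ids)
print(x.shape)                             # torch.Size([3, 64]) -> (3, d_m)
```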

Now, when it goes through the encoder, this input of size (3, d_m) gets transformed into Q, K, V matrices. Each of these Q, K, V matrices is a different representation of the input tokens, and they are of size (3, d). We multiply Q and K inside the softmax to get softmax(QK^T/sqrt(d_m)), which is a 3x3 matrix. One can think of both of its dimensions as representing the input words; there are 3 words along dimension 1 and 3 along dimension 2, and both dimensions correspond to the same sentence, "The dog ran". The standard interpretation, then, is that this matrix shows how much information each token routes through the others. As an example, let's say that the output of softmax(QK^T/sqrt(d_m)) is, for this head:

0.5 0.5 0.0
0.0 1.0 0.0
0.3 0.3 0.4

Here, the word "The" routes 0.5 of its information through "The" and 0.5 through "dog"; "dog" routes all of its information through "dog"; and "ran" routes its information through a mix of all 3.
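To make "routes" a bit more concrete, here is a small sketch of how that example matrix would mix value vectors; the V values below are made up purely for illustration:

```python
import torch

# The example attention weights from above; rows are "The", "dog", "ran".
A = torch.tensor([[0.5, 0.5, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.3, 0.3, 0.4]])

# Hypothetical value vectors, one per token, with a tiny dimension for clarity.
V = torch.tensor([[1.0, 0.0],    # value vector for "The"
                  [0.0, 1.0],    # value vector for "dog"
                  [2.0, 2.0]])   # value vector for "ran"

out = A @ V
print(out)
# Row 0 is 0.5*V_The + 0.5*V_dog, row 1 is exactly V_dog,
# and row 2 is a mix of all three value vectors.
```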

Finally, this is done for every head at every depth, and what one gets is a visualization of the internal representations. What exactly do these internal representations mean? Well, it depends on your task, but generally, they show how important the relationship between two tokens is.

For some great articles and papers on how this works, with more visualizations and better plots, here are some links:

https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1
https://arxiv.org/pdf/2002.12327.pdf
https://arxiv.org/pdf/1906.04341.pdf

Now, what I described is how this works for standard attention. As for the Linformer attention... to be honest, it's hard to say. Recall that it is of the form

attn(Q,K,V) = softmax(Q(EK)^T/sqrt(d_m))FV

where E, F are of size (k, n) if they are linear layers, or they are convolutions with a kernel size of (n/k) and a stride of (n/k).
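As a rough sketch of the linear-layer case (toy shapes only, not the library's actual implementation), following the formula above:

```python
import torch

n, k, d = 8, 4, 16           # toy sequence length, projection dim, head dim
Q = torch.randn(n, d)
K = torch.randn(n, d)
V = torch.randn(n, d)

# E and F are the (k, n) projections applied to K and V.
E = torch.randn(k, n)
F = torch.randn(k, n)

# The plotted matrix is now (n, k) instead of (n, n).
attn_weights = torch.softmax(Q @ (E @ K).T / d ** 0.5, dim=-1)
out = attn_weights @ (F @ V)            # (n, d)
print(attn_weights.shape, out.shape)    # torch.Size([8, 4]) torch.Size([8, 16])
```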

I imagine that when using the convolutional downsampling method, it shows how a token x relates to a neighborhood of tokens y, in a similar fashion to the above. Just imagine that instead of a 3x3 matrix, one has a 2048x512 matrix (if n=2048 and k=512), and that a token x can look at a set of tokens y and see how important each y is to route information through. With the convolutional method this stays interpretable, because each y corresponds to a local group of 4 tokens in this example.

To give another small example for the convolutional method, imagine that you have 4 tokens, a, b, c, d, and k is 2. Then the attention matrix would be of size (4, 2), and each of the 4 tokens would effectively be deciding "How much information should I route from myself to the group a, b, and how much to the group c, d?", as sketched below.
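A minimal sketch of that convolutional downsampling, with the same toy numbers (4 tokens, k=2, so a kernel size and stride of n/k = 2); the layer here is just an illustration, not the exact one used in the repo:

```python
import torch
import torch.nn as nn

n, k, d = 4, 2, 16                       # 4 tokens a, b, c, d; k = 2
K_full = torch.randn(1, n, d)            # keys for the tokens, batch of 1

# Kernel size and stride of n/k = 2: column 0 summarizes (a, b), column 1 (c, d).
downsample = nn.Conv1d(d, d, kernel_size=n // k, stride=n // k)
K_down = downsample(K_full.transpose(1, 2)).transpose(1, 2)   # (1, k, d)

Q = torch.randn(1, n, d)
attn_weights = torch.softmax(Q @ K_down.transpose(1, 2) / d ** 0.5, dim=-1)
print(attn_weights.shape)                # torch.Size([1, 4, 2]) -> (n, k)
```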

However, if one were to replace it with a linear layer or a fixed layer, I think that all interpretability goes down the drain, because the tokens can be totally dispersed throughout the whole head. That said, this may actually be an open research question, and if you want to investigate further, feel free!

I hope this answered your question!

@mertyyanik
Author

Thanks for the perfect explanation!
