# License and Attribution

This notebook was developed by Emilio Serrano, Full Professor at the Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), for educational purposes in UPM courses. Personal website: https://emilioserrano.faculty.bio/

📘 License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

You are free to: (1) Share – copy and redistribute the material in any medium or format; (2) Adapt – remix, transform, and build upon the material.

Under the following terms: (1) Attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made; (2) NonCommercial – You may not use the material for commercial purposes; (3) ShareAlike – If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

🔗 License details: https://creativecommons.org/licenses/by-nc-sa/4.0/

# Visualizing Transformer Attention Mechanisms

This notebook provides a visual and interactive introduction to the attention mechanisms that power modern Large Language Models (LLMs). We will explore three fundamental types of attention:

* **Self-Attention**: The mechanism that allows a model to weigh the importance of different words in a single sequence to understand context. We will use BERT for this demonstration.

* **Masked Self-Attention**: A variation of self-attention used in generative models like GPT to prevent a token from "seeing" future tokens during training and generation.

* **Cross-Attention**: The mechanism used in encoder-decoder models (like T5) that allows the decoder to focus on relevant parts of the input sequence while generating the output sequence.

We will use the `bertviz` library for visualization and `transformers` by Hugging Face to load the models.  

## Setup Environment
First, let's install the necessary libraries. We need `transformers` to load our models (BERT, GPT-2, T5) and `bertviz` for the visualizations.


In [None]:
!pip install -q transformers bertviz


# What is Attention? A Brief Introduction

Before jumping into the code, let's unpack the core idea. An **attention mechanism** lets a model decide *which* parts of the input are most relevant for the task at hand.

Think of translating the sentence:  
> *The black cat sat on the mat.*  

When you're about to write the Spanish word **"gato"**, your brain naturally zeroes in on the English word **"cat"**, filtering out less relevant words.

In neural networks, this selective focus is implemented by creating three vectors for each input token:  

- **Query (Q)** – Captures what the current token is looking for in context: *"What information do I need?"*  
- **Key (K)** – Encodes what information the token offers: *"Here's what I represent."*  
- **Value (V)** – Contains the actual detailed content or meaning of the token, which will be *retrieved and passed along* when a Key matches a Query.

The process works like this:  

1. **Scoring relevance** – Compare the Query of the current token with the Key of every token using a dot product.  
2. **Scaling and normalizing** – Divide by the square root of the Key's dimension \\( \sqrt{d_k} \\) to avoid overly large values, then apply a softmax to turn scores into probabilities.  
3. **Building context** – Multiply each Value vector by its corresponding attention weight and sum the results, producing a **context-enriched representation**.  

The whole operation is expressed as:  

\\[\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right) V\\]

This formula is the mathematical heart of the attention mechanism: it is how each token "looks at" every other token and gathers the context it needs.
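
The three steps can be traced by hand on a tiny example. The following is a minimal sketch in plain Python, not the implementation used by any library: the helper names `softmax` and `attention` and the toy Q, K, V vectors are assumptions chosen for illustration and do not come from a trained model.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one vector per token)."""
    d_k = len(K[0])
    weights, outputs = [], []
    for q in Q:
        # Steps 1 and 2: dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        row = softmax(scores)
        # Step 3: weighted sum of the value vectors
        context = [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        weights.append(row)
        outputs.append(context)
    return outputs, weights

# Hand-picked toy vectors for three tokens (illustrative only)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

context, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row is one token's attention distribution
```

Each printed row sums to 1, and the third token, whose query matches the third key best, puts most of its weight on that key.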

# Self-Attention with BERT
**Self-Attention** is the mechanism used by models like BERT (Bidirectional Encoder Representations from Transformers, 2018). It is called "self" because attention is computed among the tokens of a single input sequence. Because BERT is bidirectional, a token can attend to tokens that come both before and after it.

BERT's architecture depends on the variant; we will use `bert-base-multilingual-cased`, which has:
* 12 layers (also called Transformer encoder blocks)
* 12 attention heads per layer
* ~178 million parameters (its large multilingual vocabulary makes it bigger than the ~110 million of the English BERT-base)
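
As a rough cross-check of the parameter count, the total can be derived from the architecture alone. The sketch below is an assumption-laden estimate, not an official formula: it assumes the standard BERT-base sizes (hidden size 768, feed-forward size 3072, 512 positions, 2 segment types), counts the encoder plus pooler without any task head, and takes the vocabulary sizes 30,522 (English uncased) and 119,547 (multilingual cased) from the published model configurations. The function name `bert_encoder_params` is hypothetical.

```python
def bert_encoder_params(vocab_size, hidden=768, layers=12, intermediate=3072,
                        max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder with pooler (no task head)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (expand, then project back), followed by a LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(f"English BERT-base (30,522 tokens): {bert_encoder_params(30522):,}")
print(f"Multilingual BERT (119,547 tokens): {bert_encoder_params(119547):,}")
# English BERT-base (30,522 tokens): 109,482,240
# Multilingual BERT (119,547 tokens): 177,853,440
```

The twelve Transformer blocks are identical in both variants; the difference comes entirely from the token-embedding table, which is why the multilingual model is noticeably larger than the ~110 million usually quoted for BERT-base.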

Here, we are more interested in the interactive attention visualization than in the code itself. The code loads a multilingual BERT model (`bert-base-multilingual-cased`) along with its tokenizer and defines a helper function to visualize self-attention using BertViz.




In [None]:
# --- 1. Import necessary libraries ---
# torch is the deep learning framework.
# The `...Model` and `...Tokenizer` classes from `transformers` allow us to load pretrained models.
# `head_view` is the specific visualization function from bertviz.
import torch
from transformers import BertTokenizer, BertModel
from bertviz import head_view

# --- 2. Setup model and tokenizer ---
# We use 'bert-base-multilingual-cased' because it's trained on many languages,
# making it suitable for both our English and Spanish examples.
# We also instantiate the model that corresponds to this tokenizer.
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# --- 3. Define a function for visualization to avoid code repetition ---
def show_self_attention(sentence, model, tokenizer):
    """
    Prepares a sentence and visualizes its self-attention using bertviz.
    """
    # --- a. Prepare the input ---
    # The tokenizer converts the sentence string into a list of token IDs.
    # It also adds special tokens like [CLS] at the beginning and [SEP] at the end.
    # `return_tensors='pt'` ensures the output is a PyTorch tensor.
    inputs = tokenizer.encode(sentence, return_tensors='pt')

    # --- b. Get model outputs ---
    # We pass the tokenized input through the model.
    # `output_attentions=True` is CRUCIAL. It tells the model to return the attention weights.
    outputs = model(inputs, output_attentions=True)

    # --- c. Extract attention weights and tokens ---
    # The attention weights are in `outputs.attentions`. This is a tuple where each
    # element corresponds to a layer in the model.
    # The shape of each attention tensor is: (batch_size, num_heads, sequence_length, sequence_length)
    attention = outputs.attentions

    # We convert the token IDs back to human-readable strings for the visualization.
    tokens = tokenizer.convert_ids_to_tokens(inputs[0])

    # --- d. Visualize ---
    # `head_view` takes the attention tensors and the list of tokens to display.
    # It renders an interactive visualization itself (and returns None by default),
    # so we call it directly. You can select different layers and heads.
    print(f"Visualizing self-attention for: '{sentence}'")
    head_view(attention, tokens)

# Some Examples  

Example 1 – Pronoun Ambiguity (Simple)  
Sentence: *Alice went to the park. She saw a dog.*  
This is a straightforward case of pronoun resolution.  
- Focus on how **"She"** attends to **"Alice"** across different layers and heads.  
- Early layers may capture local grammatical links, while deeper layers integrate broader context.  


Example 2 – Polysemy Disambiguation (Simple)  
Sentence: *She will play the piano.*  
This example tests the model's ability to select the correct meaning of a polysemous word.  
- **"Play"** here means "to perform music," not "to engage in a game" or "to act in a role."  
- Check how **"play"** attends to **"piano"** to clarify its intended meaning.  
- Deeper layers often strengthen semantic connections like this.  


Example 3 – Pronoun Ambiguity (Complex)  
Sentence: *The cat chased the mouse because it was hungry.*  
This is a more challenging case of pronoun resolution.  
- **"It"** could refer to either **"cat"** or **"mouse"**.  
- Examine which token **"it"** attends to most strongly in different heads and layers.  
- Some layers may focus on grammatical structure, while others attempt semantic interpretation.  
- Attention patterns here may be less clear, illustrating the difficulty of resolving ambiguous pronouns.



# How to Use the BertViz Interface

- **Left vs. right tokens**: In the visualization, the tokens on the left are the queries (Q), the words whose attention patterns you are inspecting. The tokens on the right are the keys (K), the words they are attending to. The attention scores are computed as interactions between queries and keys, and then applied to values (V) to produce the final contextualized representations.

- **Layer dropdown**: Use this to select which Transformer layer's attention you want to visualize. You can focus on the last layer (often the most semantically meaningful), or step through the layers one by one to observe how attention patterns evolve.

- **Head selection**: Each attention head captures different types of relationships (e.g., subject–verb, modifier–noun, or long-range dependencies). Switching between heads helps reveal these complementary perspectives. When all heads are selected, you can also distinguish on the right which heads are paying the most attention to a given token, based on their color coding.

- **Token highlighting**: Hover over a token to see where its attention is directed most strongly. Keep in mind that attention is often distributed and gradual: all tokens tend to attend to all others to some extent, though certain connections stand out more clearly.

Compare how patterns differ across the three examples to see BERT's handling of different linguistic challenges.
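
What the hover interaction shows can also be read off numerically: one row of an attention matrix is a probability distribution over the key tokens. The sketch below ranks the keys for a single query. The helper `top_attended` is hypothetical, and the token list and weights are invented numbers for illustration only, not values produced by BERT.

```python
def top_attended(tokens, weights_row, k=3):
    """Return the k (token, weight) pairs that receive the most attention."""
    ranked = sorted(zip(tokens, weights_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up attention weights for the query token "She" (they sum to 1)
tokens = ["[CLS]", "Alice", "went", "She", "saw", "[SEP]"]
weights = [0.30, 0.35, 0.05, 0.10, 0.05, 0.15]

for token, weight in top_attended(tokens, weights, k=2):
    print(f"{token}: {weight:.2f}")
# Alice: 0.35
# [CLS]: 0.30
```

With real model outputs, the same ranking could be applied to one row of `outputs.attentions[layer][0, head]` after converting it to a Python list.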

In [None]:
# Example 1: Pronoun ambiguity (simple)
sentence1 = "Alice went to the park. She saw a dog."
print("\nExample 1: Pronoun Ambiguity (simple)")
print("Focus on how 'She' attends to 'Alice' across different heads/layers.")
show_self_attention(sentence1, model, tokenizer)


# Example 2: Polysemy disambiguation (simple)
sentence2 = "She will play the piano."
print("\nExample 2: Polysemy Disambiguation (simple)")
print("Focus on how 'play' attends to 'piano' to clarify its meaning.")
show_self_attention(sentence2, model, tokenizer)

# Example 3: Pronoun ambiguity (complex)
sentence3 = "The cat chased the mouse because it was hungry."
print("\nExample 3: Pronoun Ambiguity (complex)")
print("Focus on how 'it' attends to 'cat' vs 'mouse' across different heads/layers.")
show_self_attention(sentence3, model, tokenizer)

# Masked Self-Attention with GPT-2
**Masked Self-Attention** is central to how autoregressive models like GPT-2 (2019) operate. These models generate text one token at a time, and at any point they should only have access to the tokens already produced, not the ones that come next. To enforce this causal constraint, a mask is applied to the attention scores before the softmax. The mask replaces the scores of all connections to future tokens with a very large negative value (effectively negative infinity), so their attention weights become zero after the softmax and each token can attend only to itself and the tokens preceding it.
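
The effect of the causal mask is easy to see on a toy score matrix. The following is a minimal sketch in plain Python under simplifying assumptions: the function name `causal_attention_weights` is hypothetical, all raw scores are set to the same value so that only the mask shapes the result, and real models apply the mask to batched tensors rather than nested lists.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i are future tokens: replace their scores with -inf
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # four tokens, all-equal toy scores
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The zeros above the diagonal are the masked connections: the first token can only attend to itself, while the last token spreads its attention over the whole sequence.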

GPT-2 Architecture Overview:

- Layers (Transformer decoder blocks): 12  
- Attention heads per layer: 12  
- Parameters: ~124 million (smallest GPT-2 version, originally reported as 117 million)  



In [None]:

# --- 1. Import necessary libraries ---
# We now use the GPT-2 specific classes. The process is very similar to BERT.
from transformers import GPT2Tokenizer, GPT2Model

# --- 2. Setup model and tokenizer ---
# We'll use the standard 'gpt2' model. Note that this model is primarily trained on English.
# A different model would be needed for high-quality Spanish generation, but 'gpt2'
# is perfect for demonstrating the *masking* mechanism.
model_name_gpt = 'gpt2'
tokenizer_gpt = GPT2Tokenizer.from_pretrained(model_name_gpt)
model_gpt = GPT2Model.from_pretrained(model_name_gpt, output_attentions=True)


# --- 3. Define a function for visualization ---
# We can reuse our function structure, just adapting it for GPT-2.
def show_masked_self_attention(sentence, model, tokenizer):
    """
    Prepares a sentence and visualizes its masked self-attention.
    """
    # --- a. Prepare the input ---
    # GPT-2's tokenizer uses byte-level BPE and, unlike BERT's, adds no [CLS]/[SEP] tokens by default.
    inputs = tokenizer.encode(sentence, return_tensors='pt')

    # --- b. Get model outputs ---
    # Again, `output_attentions=True` is the key.
    outputs = model(inputs, output_attentions=True)

    # --- c. Extract attention weights and tokens ---
    attention = outputs.attentions
    # Note: GPT-2 tokens that follow a space are displayed with a leading 'Ġ' marker.
    tokens = tokenizer.convert_ids_to_tokens(inputs[0])

    # --- d. Visualize ---
    # The same `head_view` function works here; it renders the widget itself.
    # LOOK for the missing lines: no token is connected to any token that comes after it.
    print(f"Visualizing masked self-attention for: '{sentence}'")
    head_view(attention, tokens)



Let us run some examples. In the visualization, you will see this clearly: all the attention lines will either point backwards or to the current token.  

In [None]:
# Example 1: Pronoun ambiguity (simple)
sentence1 = "Alice went to the park. She saw a dog."
print("\nExample 1: Pronoun Ambiguity (simple)")
print("Focus on how 'She' attends to 'Alice' across different heads/layers.")
show_masked_self_attention(sentence1, model_gpt, tokenizer_gpt)


# Example 2: Polysemy disambiguation (simple)
sentence2 = "She will play the piano."
print("\nExample 2: Polysemy Disambiguation (simple)")
print("Focus on how 'play' attends to 'piano' to clarify its meaning.")
show_masked_self_attention(sentence2, model_gpt, tokenizer_gpt)

# Example 3: Pronoun ambiguity (complex)
sentence3 = "The cat chased the mouse because it was hungry."
print("\nExample 3: Pronoun Ambiguity (complex)")
print("Focus on how 'it' attends to 'cat' vs 'mouse' across different heads/layers.")
show_masked_self_attention(sentence3, model_gpt, tokenizer_gpt)

# Cross-Attention with T5

**Cross-Attention** is a core mechanism in encoder-decoder models, enabling sequence-to-sequence tasks such as translation, summarization, and question answering.  

**T5** (2019), or Text-to-Text Transfer Transformer, is a series of large language models developed by Google AI that treat every natural language processing (NLP) task as a text-to-text problem. This means both the input and the output are always text, regardless of the task type, such as translation, summarization, or question answering. T5 uses an encoder-decoder Transformer architecture, where the encoder processes the input text, and the decoder generates the output text.

T5-Base Architecture Overview
- Layers: 12 encoder blocks and 12 decoder blocks  
- Attention heads per layer: 12  
- Parameters: ~220 million  
- Type: Transformer-based encoder-decoder  

For our example, we will use T5 to **translate a sentence from English to Spanish**.



In [None]:
# --- 1. Import necessary libraries ---
# T5 is a versatile encoder-decoder model. We'll use it for a translation task.
# We need `T5ForConditionalGeneration` which includes both encoder and decoder.
# We import bertviz `model_view`, which can handle self-attention, cross-attention, and more.
from transformers import T5Tokenizer, T5ForConditionalGeneration
from bertviz import model_view

# --- 2. Setup model and tokenizer ---
# 't5-base' is a great, robust model for sequence-to-sequence tasks.
model_name_t5 = 't5-base'
tokenizer_t5 = T5Tokenizer.from_pretrained(model_name_t5)
model_t5 = T5ForConditionalGeneration.from_pretrained(model_name_t5)

# --- 3. Define a function for visualization ---
def show_cross_attention(input_text, output_text, model, tokenizer):
    """
    Visualizes the cross-attention from an encoder to a decoder sequence.
    """
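    # --- a. Prepare the inputs for BOTH encoder and decoder ---
    # T5 shares one vocabulary between source and target text, so the same tokenizer
    # call works for both sides (the deprecated `as_target_tokenizer()` is not needed).
    encoder_inputs = tokenizer(input_text, return_tensors='pt')
    decoder_inputs = tokenizer(output_text, return_tensors='pt')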

    # --- b. Get model outputs ---
    outputs = model(
        input_ids=encoder_inputs['input_ids'],
        decoder_input_ids=decoder_inputs['input_ids'],
        output_attentions=True,
        return_dict=True
    )

    # --- c. Extract attention weights and tokens ---
    encoder_tokens = tokenizer.convert_ids_to_tokens(encoder_inputs['input_ids'][0])
    decoder_tokens = tokenizer.convert_ids_to_tokens(decoder_inputs['input_ids'][0])

    print("Encoder tokens (English):", encoder_tokens)
    print("Decoder tokens (Spanish):", decoder_tokens)

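    # --- d. Visualize ---
    # `model_view` renders the widget itself (it returns None by default),
    # so we call it directly instead of wrapping it in `display(...)`.
    model_view(
        encoder_tokens=encoder_tokens,
        decoder_tokens=decoder_tokens,
        encoder_attention=outputs.encoder_attentions,
        decoder_attention=outputs.decoder_attentions,
        cross_attention=outputs.cross_attentions
    )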

# One Example and How Cross-Attention Works
In this example, we explicitly tell T5 what task to perform by adding a **task prefix**:  

* **Input:** `"translate English to Spanish: A black cat sat on the mat."`  
The prefix `translate English to Spanish:` tells the model to perform a translation task from English to Spanish. Without the task prefix, T5 might not realize it needs to translate and could produce unrelated output, since it was trained on many different tasks.


* **Expected Output:** `"Un gato negro se sentó en la alfombra."`  
This is the reference translation. We feed it to the decoder as the target sequence so that we can visualize the cross-attention patterns between each Spanish token and the English input.  


## How Cross-Attention Works
1. **Encoding the source:**  
   The encoder processes the entire input sequence (e.g., an English sentence) using self-attention, producing a rich contextual representation for each token.  

2. **Decoding with context:**  
   The decoder generates the output sequence (e.g., a Spanish sentence) one token at a time.  
   
3. **Bridging the two with cross-attention:**  
   At every decoding step, cross-attention allows the decoder to attend to the encoder's outputs.  
   - The *queries* come from the decoder's current hidden state.  
   - The *keys* and *values* come from the encoder's final representations.  

4. **Example:**  
   When generating the word *gato*, the decoder's cross-attention heads can focus strongly on the encoder token *"cat"*, ensuring semantic alignment between source and target.  

This "cross" flow of information enables the decoder to ground each generated token in the full context of the input sequence.
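
The only structural difference from self-attention is where the three inputs come from, which a small sketch makes concrete. The code below is an illustration in plain Python, not T5's implementation: the function name `cross_attention` is hypothetical, and the vectors are hand-constructed so that the query for *gato* lines up with the key for *cat*. A trained model learns such alignments; here they are simply assumed.

```python
import math

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Each decoder query attends over all encoder positions."""
    d_k = len(encoder_keys[0])
    weights, contexts = [], []
    for q in decoder_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in encoder_keys]
        highest = max(scores)
        exps = [math.exp(s - highest) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Context vector: attention-weighted sum of the encoder values
        contexts.append([sum(w * v[j] for w, v in zip(row, encoder_values))
                         for j in range(len(encoder_values[0]))])
    return contexts, weights

encoder_tokens = ["black", "cat", "mat"]  # source side (keys and values)
decoder_tokens = ["gato", "negro"]        # target side (queries)

# Hand-made toy vectors (assumed for illustration, not taken from T5)
encoder_keys = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
encoder_values = encoder_keys
decoder_queries = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]

contexts, weights = cross_attention(decoder_queries, encoder_keys, encoder_values)
for token, row in zip(decoder_tokens, weights):
    best = encoder_tokens[row.index(max(row))]
    print(f"{token} -> {best}")
# gato -> cat
# negro -> black
```

Note that the weight matrix is rectangular, with one row per decoder token and one column per encoder token, whereas in self-attention it is always square.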




# BertViz Interface for Cross-Attention
For T5 we use `model_view` from bertviz rather than the `head_view` used earlier with BERT and GPT-2. `model_view` lays out all layers and heads in a single grid, which gives a compact overview of the encoder, decoder, and cross-attention in an encoder-decoder model.

With `model_view`, BertViz displays the attention matrices for each layer and each head. It is usually most informative to inspect the last layer (starting with the leftmost heads), since these capture higher-level semantic relationships.

The interface also provides a dropdown menu to select which type of attention to display: encoder self-attention, decoder self-attention, or cross-attention. In our case, we are mainly interested in cross-attention, since it shows how the decoder attends to the encoder's representations when generating the output sequence.

In the cross-attention blocks, the visualization shows the queries (Q) coming from the decoder on the left, and the keys (K) and values (V) coming from the encoder on the right. This reflects how the decoder attends to encoder representations when generating the output sequence.

In [None]:
# Note: T5 was pre-trained on many tasks, including translation.
# We prepend the task prefix to guide the model.
source_sentence = "translate English to Spanish: A black cat sat on the mat."
target_sentence = "Un gato negro se sentó en la alfombra." # The ground truth translation

show_cross_attention(source_sentence, target_sentence, model_t5, tokenizer_t5)

**Note:** When you set the widget to the **Cross-Attention** view, you might notice unusually strong attention from tokens like "gato" to the word "Spanish" in the encoder input. This happens because the model uses the task prefix `"translate English to Spanish:"` as an important signal to guide the translation. As a result, the decoder pays extra attention to these prefix tokens, since they define the target language and task, sometimes more than to the actual source words like "cat". This is expected behavior and reflects how T5 leverages the prefix to control output generation.

# Conclusions


This notebook provided a visual and interactive introduction to the attention mechanisms that power modern Large Language Models (LLMs). Through practical examples and visualizations, we explored three fundamental types of attention: self-attention, masked self-attention, and cross-attention.

Using the `bertviz` library and Hugging Face's `transformers`, we gained insight into how attention heads and layers behave across different architectures and linguistic phenomena, such as pronoun resolution and polysemy.

However, attention visualizations can be complex and sometimes counterintuitive; they offer valuable hints but do not provide definitive explanations of model reasoning. Overall, this exploration deepens understanding of how attention mechanisms function inside LLMs and highlights the challenges of interpreting their inner workings, paving the way for more advanced interpretability methods.





