In [12]:
# @title Install the Dependencies and Set Things Up {"display-mode":"form"}
!pip install bertviz -q
import warnings
from transformers import utils
warnings.filterwarnings("ignore", message="`encoder_attention_mask` is deprecated and will be removed in version 4.55.0")
utils.logging.set_verbosity_error() # Suppress verbose output

print("Dependencies are installed and we're ready to rock and roll. 🎸")

Dependencies are installed and we're ready to rock and roll. 🎸


# Understanding the Attention Mechanism: How Models Focus

**Previously**: You've learned how text becomes tokens and how an **attention mask** tells the model which tokens to ignore (`0`) and which to focus on (`1`).

But how does the model *use* that information to _focus_? The secret is the **attention mechanism**, the engine that allows models to weigh the importance of different words in a sentence.

Think of it like this: when you read the sentence, "The cat sat on the mat," your brain instantly knows that "sat" is related to "cat" (the one doing the sitting) and "mat" (where it sat). The attention mechanism gives models this same ability to understand relationships between words, no matter how far apart they are.

# From Word Embeddings to Contextual Understanding

Before we dive into attention, let's quickly review a key concept or two from our last discussion.

**Tokenization**: We split text into tokens and map them to IDs.

> `"The cat sat"` &rarr; `[101, 1996, 4937, 3323, 102]`

Once we have our tokens, the next step is to map each token ID to an **embedding**: a dense vector of numbers that represents the token's "meaning" in a multi-dimensional space.

> Initially, the embedding for "bank" is the same in "river bank" and "money bank." This is a problem.
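To make the problem concrete, here is a minimal sketch of a context-free embedding lookup. The table and its values are made up for illustration (a real model learns vectors with hundreds of dimensions), and `embedding_table` and `embed` are hypothetical names.

```python
# Toy, hand-made embedding table (hypothetical values, not from a real model).
embedding_table = {
    "river": [0.9, 0.1, 0.0],
    "money": [0.0, 0.2, 0.9],
    "bank": [0.4, 0.5, 0.4],
}


def embed(sentence):
    """Look up one fixed vector per token, ignoring the surrounding words."""
    return [embedding_table[token] for token in sentence.lower().split()]


river_bank = embed("river bank")
money_bank = embed("money bank")

# The vector for "bank" is identical in both phrases: the lookup has no context.
print(river_bank[1] == money_bank[1])  # True
```

A plain lookup like this cannot tell the two senses of "bank" apart, because nothing about the neighboring words ever reaches the vector.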

This is where **attention** comes in. It takes these initial, context-free embeddings and enriches them with information from all other words in the sentence.

# The Core Idea: Queries, Keys, and Values

The attention mechanism works by deriving three vectors from each token's embedding, each through its own learned projection:

1.  **Query (Q)**: What the current word is looking for in the other words.
2.  **Key (K)**: What this word has to offer; a label that queries are compared against.
3.  **Value (V)**: The actual information this word provides when it is a good match.
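As a rough sketch of where these three vectors come from: in a standard Transformer, each one is the token's embedding multiplied by its own learned weight matrix. The example below uses random matrices as stand-ins for learned weights, omits bias terms, and uses toy sizes; `matvec`, `random_matrix`, `W_q`, `W_k`, and `W_v` are illustrative names.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in a real model)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
d_model, d_head = 4, 3   # toy sizes, chosen only for illustration

# One projection matrix per role.
W_q = random_matrix(d_head, d_model, rng)
W_k = random_matrix(d_head, d_model, rng)
W_v = random_matrix(d_head, d_model, rng)

embedding = [0.2, -0.1, 0.7, 0.4]  # toy embedding for a single token

query = matvec(W_q, embedding)  # what this token is looking for
key = matvec(W_k, embedding)    # what this token offers to others
value = matvec(W_v, embedding)  # the information it passes along

print(len(query), len(key), len(value))  # 3 3 3
```

The same embedding yields three different vectors because each projection matrix is different, which lets a token play a different role when it asks a question than when it answers one.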

## An Analogy: A Library

Imagine you're researching "artificial intelligence" in a library. (**Fair warning**: I am about to show my age.)

  * Your **Query (Q)** is your research topic: "artificial intelligence."
  * You go to the card catalog. Each book's title is a **Key (K)**. You compare your query to each key to see how well they match.
  * A book titled "A History of AI" is a strong match. A book on "Gardening" is a weak match.
  * The contents of the books are the **Values (V)**.
  * You pull out the books that are the best matches (highest attention scores) and blend their information together to get a rich, contextual understanding.

The attention mechanism does this for every single word in the input, allowing the model to learn which other words are most important for understanding its meaning in that specific context.
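The library lookup can be sketched in a few lines of plain Python. This is a minimal, single-query version of scaled dot-product attention, not the optimized batched implementation a real model uses. The vectors are made up, `attend` and `softmax` are hypothetical helper names, and the sketch assumes at least one position is left unmasked.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, mask=None):
    """Scaled dot-product attention for one query.

    mask is an optional list of 1/0 flags, like an attention mask:
    positions marked 0 receive zero weight.
    """
    scale = math.sqrt(len(query))
    # Compare the query to every key (the "card catalog" step).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    if mask is not None:
        # Masked positions get -inf, which softmax turns into a weight of 0.
        scores = [s if m == 1 else -math.inf for s, m in zip(scores, mask)]
    weights = softmax(scores)
    # Blend the values, weighted by how well each key matched.
    blended = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended


# Toy data: the first key matches the query well, the second poorly,
# and the third stands in for a padding position that is masked out.
query = [2.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, blended = attend(query, keys, values, mask=[1, 1, 0])
print(weights)
print(blended)
```

The masked position ends up with a weight of exactly zero, which is one way to picture how the 1s and 0s of an attention mask take effect inside the mechanism.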

# Visualizing Attention

Let's use a simple example to see what this looks like. We'll need a model and tokenizer, and then we can visualize the attention scores.

(**Tasting note**: We're going to use the [bertviz](https://github.com/jessevig/bertviz) library to help visualize attention.)

In [13]:
from transformers import AutoTokenizer, AutoModel, utils
from bertviz import head_view # A great library for visualizing attention

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

sentence_a = "The cat sat on the mat" #@param {type:"string"}
sentence_b = "The dog played in the park" #@param {type:"string"}

# Tokenize and get model outputs
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
outputs = model(**inputs)
attention = outputs.attentions  # Tuple with one attention tensor per layer

# Visualize attention between all tokens, for every layer and head
head_view(attention, tokens=tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))


Looking at the visualization, we can (hopefully) see strong connections between `cat` and `sat`/`mat`, and between `dog` and `played`/`park`. The exact pattern varies by layer and head, but in many of them you'd also see attention staying mostly within each sentence, with the `[SEP]` token marking the boundary between the two.
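If you'd rather read the numbers than the picture, a small helper can rank which tokens a given token attends to most. The sketch below runs on a made-up row of weights so it works on its own; `top_attended` is a hypothetical helper, not part of bertviz or transformers.

```python
def top_attended(weights_row, tokens, k=3):
    """Return the k tokens that receive the most attention from one query token."""
    ranked = sorted(zip(tokens, weights_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


tokens = ["[CLS]", "the", "cat", "sat", "[SEP]"]
# Made-up attention weights from "cat" to every token (they sum to 1).
cat_row = [0.05, 0.10, 0.25, 0.50, 0.10]

print(top_attended(cat_row, tokens, k=2))
# [('sat', 0.5), ('cat', 0.25)]
```

With the real model output from the cell above, each entry of `attention` should be a tensor of shape `(batch, heads, seq_len, seq_len)`, so a row for token `i` could be pulled out with something like `attention[layer][0, head, i].tolist()` and passed to the helper.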

# In Summary

We started with high-level **Pipelines** to see what models can do, then dove deep into **Tokenization** to understand how we prepare text for a model to read. We're _killing_ it.

Coming into this lesson, you knew what an `attention_mask` is: a crucial list of 1s and 0s. You knew *what* it does (it tells the model which tokens to pay attention to) but not *how* or *why*. Now you've seen the mechanism that puts it to use.

Understanding attention is the key to truly understanding how models like BERT and GPT work. It's how they find relationships between words and generate meaningful text.

AI models are not magical; they are complex, elegant systems built on clever principles that you can understand.