## BERT Attention Visualization (BERT base uncased)

This notebook runs a single sentence through a Transformer encoder (BERT), inspects the shapes of the tensors the model returns, and visualizes its attention heads with BertViz.

We will:
- Load `bert-base-uncased` with `output_hidden_states=True` and `output_attentions=True`.
- Tokenize the sentence: "Transformers are amazing!".
- Run a forward pass through the model.
- Print tensor shapes for `input_ids`, `last_hidden_state`, `hidden_states`, and `attentions`.
- Visualize attention heads with BertViz `head_view`.


In [2]:
# If running on Colab, uncomment the following line to install deps
# !pip -q install transformers==4.41.2 torch bertviz==1.4.0

import torch
from transformers import AutoTokenizer, AutoModel

print(torch.__version__)



2.8.0+cpu


In [3]:
# Load tokenizer and model with hidden states and attentions
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    output_hidden_states=True,
    output_attentions=True,
)
model.eval()

print("Model loaded:", model_name)



Model loaded: bert-base-uncased


In [4]:
# Tokenize input sentence
sentence = "Transformers are amazing!"
enc = tokenizer(sentence, return_tensors="pt")
input_ids = enc["input_ids"]
attention_mask = enc["attention_mask"]

print("Sentence:", sentence)
print("Tokens:", tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
print("input_ids shape:", tuple(input_ids.shape))
print("attention_mask shape:", tuple(attention_mask.shape))



Sentence: Transformers are amazing!
Tokens: ['[CLS]', 'transformers', 'are', 'amazing', '!', '[SEP]']
input_ids shape: (1, 6)
attention_mask shape: (1, 6)
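The six tokens above come from three steps the uncased BERT tokenizer applies: lowercasing, splitting punctuation off words, and wrapping the sequence in `[CLS]` and `[SEP]`. The sketch below imitates only those steps. `toy_bert_tokens` is a hypothetical helper, not part of `transformers`, and it assumes every word is already in the vocabulary, so it does no WordPiece subword splitting.

```python
import re


def toy_bert_tokens(text):
    """Rough approximation of uncased BERT tokenization.

    Assumption: every word is in the vocabulary, so WordPiece subword
    splitting (pieces prefixed with '##') is not modeled here.
    """
    lowered = text.lower()  # "uncased" checkpoints lowercase the input
    # Runs of word characters, or single punctuation marks.
    pieces = re.findall(r"\w+|[^\w\s]", lowered)
    # BERT wraps every single-sentence input in [CLS] ... [SEP].
    return ["[CLS]"] + pieces + ["[SEP]"]


tokens = toy_bert_tokens("Transformers are amazing!")
print(tokens)
print("sequence length:", len(tokens))
```

For this sentence the toy version should agree with the real tokenizer, which is why `input_ids` has shape `(1, 6)`: four content tokens plus the two special tokens. For rarer words the real tokenizer may return more pieces than this sketch does.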


In [5]:
# Forward pass with outputs for hidden states and attentions
with torch.no_grad():
    outputs = model(**enc)

last_hidden_state = outputs.last_hidden_state
hidden_states = outputs.hidden_states  # tuple: embedding + 12 layer outputs (for BERT base)
attentions = outputs.attentions        # tuple: 12 items, each (batch, heads, seq_len, seq_len)

print("last_hidden_state shape:", tuple(last_hidden_state.shape))
print("# hidden_states tensors:", len(hidden_states))
if len(hidden_states) > 0:
    print("hidden_states[0] (embeddings) shape:", tuple(hidden_states[0].shape))
    print("hidden_states[-1] (last layer) shape:", tuple(hidden_states[-1].shape))

print("# attentions tensors:", len(attentions))
if len(attentions) > 0:
    print("attentions[0] shape:", tuple(attentions[0].shape))
    print("attentions[-1] shape:", tuple(attentions[-1].shape))



last_hidden_state shape: (1, 6, 768)
# hidden_states tensors: 13
hidden_states[0] (embeddings) shape: (1, 6, 768)
hidden_states[-1] (last layer) shape: (1, 6, 768)
# attentions tensors: 12
attentions[0] shape: (1, 12, 6, 6)
attentions[-1] shape: (1, 12, 6, 6)
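The shapes printed above follow from the model configuration and the input length. As a rough cross-check, the sketch below recomputes them in plain Python. `expected_shapes` is a hypothetical helper, and its defaults (12 layers, 12 heads, hidden size 768) are the BERT base values; other checkpoints would need other numbers.

```python
def expected_shapes(batch, seq_len, num_layers=12, num_heads=12, hidden_size=768):
    """Return the output shapes expected for a BERT-style encoder.

    The defaults are the BERT base configuration (an assumption to
    change for other checkpoints).
    """
    hidden = (batch, seq_len, hidden_size)
    attention = (batch, num_heads, seq_len, seq_len)
    return {
        "last_hidden_state": hidden,
        # Embedding output plus one tensor per encoder layer.
        "num_hidden_states": num_layers + 1,
        "hidden_state": hidden,
        # One attention tensor per encoder layer.
        "num_attentions": num_layers,
        "attention": attention,
        # Each head works on an equal slice of the hidden dimension.
        "head_dim": hidden_size // num_heads,
    }


shapes = expected_shapes(batch=1, seq_len=6)
for name, value in shapes.items():
    print(f"{name}: {value}")
```

The extra entry in `hidden_states` (13 rather than 12) is the embedding output that precedes the first encoder layer, and each attention tensor is square in its last two dimensions because every token attends to every token in the sequence.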


In [6]:
# Visualize attention with BertViz head_view
from bertviz import head_view

# Convert ids to tokens for visualization
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
# head_view expects: attention (list[torch.Tensor] per layer), tokens (list[str])
# It also supports passing sentence pairs; here we pass a single sequence.

head_view(attentions, tokens)



<IPython.core.display.Javascript object>
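Each line drawn in the head view corresponds to one entry of a `(seq_len, seq_len)` attention matrix for a given layer and head. Row `i` of that matrix is a softmax over the scores between query token `i` and every key token, so each row sums to 1. The sketch below illustrates this property in plain Python. The scores are made-up values for a 3-token sequence, not BERT outputs, and the scaling and masking steps of real attention are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores (made-up values, not model output).
# scores[i][j] is the score of query token i against key token j.
scores = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.3],
    [0.0, 0.0, 3.0],
]

attention = [softmax(row) for row in scores]

for i, row in enumerate(attention):
    # Index of the key token this query attends to most strongly.
    top_key = max(range(len(row)), key=row.__getitem__)
    rounded = [round(w, 3) for w in row]
    print(f"query {i}: weights={rounded} sum={sum(row):.3f} top_key={top_key}")
```

The same per-row check can be applied to any slice `attentions[layer][0, head]` from the forward pass above, for example to confirm that the rows sum to 1 or to find the most-attended token for each query.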