---
layout: post
title: "How Transformers Really Work: A Deep Dive with Code and Visuals"
date: 2025-06-25
tags:
  - Deep Learning
  - NLP
  - Computer Vision
  - Transformers
---

## Introduction

Transformers have fundamentally reshaped how machines process sequences. Whether the task is translating between languages, generating coherent text, or classifying images, a transformer is often the model doing the work. This post walks through how transformers operate, step by step, with code, diagrams, and a little math.

## Part 1: One-Hot Encoding and Dot Products

To understand transformers, we begin with something simple: representing words as vectors.
By applying integer encoding, we can assign each word a unique integer. For example, let’s say we have a vocabulary of the following words:

```python
vocab = {'files': 0, 'find': 1, 'my': 2}
```

Note that the integer assignment is arbitrary: the numbers carry no inherent meaning. As we will see later, this is a problem, because integer IDs tell us nothing about how similar two words are, nor do they capture a word's meaning in context.

### One-Hot Encoding

Each word is represented as a one-hot vector:


```python
import numpy as np

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1
    return vec

print(one_hot('find', vocab))  # [0, 1, 0]
print(one_hot('files', vocab))  # [1, 0, 0]
print(one_hot('my', vocab))  # [0, 0, 1]
```

Output:

```
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]
```



Why one-hot? Because it gives each word a unique identity without implying any relation between them.

### Dot Product: Measuring Similarity

```python
a = one_hot('find', vocab)
b = np.array([0.2, 0.7, 0.8])
print(np.dot(a, b))  # 0.7
```

Output:

```
0.7
```


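With a one-hot vector, the dot product acts as a lookup: it returns the entry of the other vector at the position of the 1. More generally, the dot product measures how strongly two vectors align, and matrix multiplication, the bedrock of neural networks, is just many dot products computed at once.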
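
To make the lookup and similarity readings concrete, here is a minimal dependency-free sketch. It uses plain Python lists instead of NumPy, and the helper names `one_hot_list` and `dot` are illustrative rather than part of the post's code.

```python
vocab = {'files': 0, 'find': 1, 'my': 2}

def one_hot_list(word, vocab):
    """Return a one-hot list for `word` (plain-Python stand-in for the NumPy version)."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

find = one_hot_list('find', vocab)
files = one_hot_list('files', vocab)

same = dot(find, find)               # a word compared with itself
different = dot(find, files)         # two distinct one-hot vectors
lookup = dot(find, [0.2, 0.7, 0.8])  # picks the entry at index 1

print(same, different, lookup)  # 1.0 0.0 0.7
```

Distinct one-hot vectors always score 0 against each other, which is exactly why they cannot express that two words are related.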

### Visualizing Dot Products

```mermaid
flowchart LR
  A["Find (0,1,0)"] -- Dot --> B["Vector (0.2, 0.7, 0.8)"] --> C["Output: 0.7"]
```


## Part 2: Sequence Modeling with Markov Chains

Let’s consider a user input like: "show me my files"

A first-order Markov model assumes the next word only depends on the current word:
```python
transition_probs = {
    'my': {'files': 0.3, 'photos': 0.5, 'directories': 0.2}
}
```
We can store these probabilities as one row of a transition matrix (a full matrix has one row per current word):
```python
matrix = np.array([
    # files, photos, directories
    [0.3, 0.5, 0.2]
])
```
However, this model can’t capture longer-range dependencies, such as knowing that “it” refers to “the dog” from earlier in the sentence.
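
The first-order model can be sketched in a few lines. This is an illustrative sketch, not a full language model: the helper names `most_likely_next` and `sample_next` are made up here, and only the row for `'my'` from the example is defined.

```python
import random

# Transition probabilities from the example; only the row for 'my' is defined.
transition_probs = {
    'my': {'files': 0.3, 'photos': 0.5, 'directories': 0.2}
}

def most_likely_next(word, probs):
    """Return the highest-probability successor of `word`."""
    candidates = probs[word]
    return max(candidates, key=candidates.get)

def sample_next(word, probs, rng):
    """Draw a successor of `word` at random, weighted by its transition probability."""
    candidates = probs[word]
    words = list(candidates)
    weights = [candidates[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so runs are repeatable
best = most_likely_next('my', transition_probs)
samples = [sample_next('my', transition_probs, rng) for _ in range(5)]

print(best)     # photos
print(samples)  # five draws from files / photos / directories
```

Both helpers look only at the current word, so the earlier words "show me" have no influence on the prediction.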


## Part 3: Embeddings — Compressing Semantics

One-hot vectors are sparse, and every pair of distinct words is equally dissimilar. To give words meaning, we embed them into a dense space:
```python
import torch
embedding = torch.nn.Embedding(num_embeddings=50000, embedding_dim=512)
word_ids = torch.LongTensor([1, 2, 0])
embedded = embedding(word_ids)
```
This maps each word ID to a 512-dimensional vector. The vectors start out random; training moves semantically related words closer together.

### Why Embeddings Work
- After training, related words such as “Paris” and “London” end up close together.
- The embedding matrix W is learned during training.
- W has shape [vocab_size x embedding_dim].

```mermaid
graph TD
  A["One-hot word"] -->|W| B["Embedded word vector"]
```
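
The arrow labelled W in the diagram is a matrix multiplication, and with a one-hot input it reduces to picking one row of W. The sketch below checks this on a tiny example; the 3 x 2 matrix values are made up for illustration (real embeddings are learned), and the helper names are hypothetical.

```python
# Hypothetical 3-word vocabulary and a tiny 3 x 2 embedding matrix W.
vocab = {'files': 0, 'find': 1, 'my': 2}
W = [
    [0.1, 0.9],  # files
    [0.8, 0.2],  # find
    [0.4, 0.5],  # my
]

def one_hot_list(word):
    """One-hot list for `word` over the toy vocabulary."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec

def vec_matmul(vec, matrix):
    """Multiply a row vector by a matrix: result[j] = sum_i vec[i] * matrix[i][j]."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(cols)]

via_matmul = vec_matmul(one_hot_list('find'), W)  # one-hot times W
via_lookup = W[vocab['find']]                     # direct row lookup

print(via_matmul, via_lookup)  # [0.8, 0.2] [0.8, 0.2]
```

This equivalence is why embedding layers are typically implemented as a table lookup by index rather than as an explicit multiplication with a huge sparse vector.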



## Part 4: Positional Encoding — Giving Order to Words

Unlike RNNs, transformers have no built-in sense of order: attention treats its input as a set. So we add position information explicitly:
```python
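import torch

def positional_encoding(seq_len, d_model):
    # Column of positions (seq_len, 1) and row of dimension indices (1, d_model)
    pos = torch.arange(seq_len).unsqueeze(1)
    i = torch.arange(d_model).unsqueeze(0)
    # Each pair of dimensions (2k, 2k+1) shares the frequency 1 / 10000^(2k / d_model)
    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model)
    angle_rads = pos * angle_rates

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle_rads[:, 0::2])  # even dimensions: sine
    pe[:, 1::2] = torch.cos(angle_rads[:, 1::2])  # odd dimensions: cosine
    return pe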
```
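Because the encoding at position pos + k is a fixed linear function of the encoding at position pos, these sinusoidal patterns make it easier for the model to attend by relative position, which matters for grammatical structure.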
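
To see what individual rows look like, here is a plain-Python sketch of the same formula for a single position. The function name `positional_encoding_row` is illustrative, and `d_model = 4` is chosen only to keep the output short.

```python
import math

def positional_encoding_row(pos, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd ones."""
    row = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return row

pe0 = positional_encoding_row(0, 4)
pe1 = positional_encoding_row(1, 4)

print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)  # first pair oscillates quickly, second pair slowly
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. Later dimensions use lower frequencies, so they change more slowly from one position to the next.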

### Positional Encoding Visualization

Each row (position) contains a unique combination of sines and cosines:

```mermaid
graph LR
  A[Position 0] -->|sin/cos| B[Encoding Vector]
  C[Position 1] -->|sin/cos| D[Encoding Vector]
```


## Part 5: Scaled Dot-Product Attention

The magic of transformers is the attention mechanism. It lets the model focus on relevant parts of the input:

Formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Code:
```python
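import torch

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k**0.5
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.nn.functional.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, V)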
```
Where:
- Q (query): what we’re looking for
- K (key): what each position offers for matching
- V (value): what we return, weighted by how well the query matches each key
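
A small worked example can make the formula less abstract. The sketch below computes attention for a single query by hand in plain Python; the toy values for `q`, `K`, and `V` are made up, and `attention_single` is an illustrative name, not part of the post's code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_single(q, K, V):
    """Scaled dot-product attention for one query vector q against rows of K and V."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
    weights = softmax(scores)
    # The output is the weighted average of the value rows.
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
    return weights, out

# Toy inputs: the query lines up with the first key.
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

weights, out = attention_single(q, K, V)
print(weights)  # roughly [0.67, 0.33]
print(out)      # roughly [6.7, 3.3]
```

The query matches the first key best, so the first value row dominates the output. The second row still contributes, because softmax never assigns exactly zero weight.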


## Part 6: Multi-Head Attention

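Instead of a single attention computation, we run several attention heads in parallel, each working on its own d_model // heads slice of the representation:
```python
import math

import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model):
        super().__init__()
        assert d_model % heads == 0, "d_model must be divisible by heads"
        self.heads = heads
        self.d_k = d_model // heads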

        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        bs = q.size(0)

        # Linear projections
        Q = self.q_linear(q).view(bs, -1, self.heads, self.d_k).transpose(1,2)
        K = self.k_linear(k).view(bs, -1, self.heads, self.d_k).transpose(1,2)
        V = self.v_linear(v).view(bs, -1, self.heads, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        weights = torch.nn.functional.softmax(scores, dim=-1)
        output = torch.matmul(weights, V)

        # Concatenate heads
        concat = output.transpose(1,2).contiguous().view(bs, -1, self.heads * self.d_k)
        return self.out(concat)
```

### Diagram

```mermaid
flowchart TD
  A[Query] --> B(Multi-head Attention)
  B --> C[Context Vectors]
```
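
The `view(...)` and `transpose(...)` calls in the class amount to cutting each token vector into equal chunks, one per head, and gluing them back together afterwards. Here is a plain-Python sketch of that bookkeeping for a single token; `split_heads` and `merge_heads` are hypothetical helper names.

```python
def split_heads(vec, heads):
    """Split one d_model-sized token vector into `heads` chunks of size d_k."""
    d_model = len(vec)
    assert d_model % heads == 0, "d_model must be divisible by heads"
    d_k = d_model // heads
    return [vec[h * d_k:(h + 1) * d_k] for h in range(heads)]

def merge_heads(chunks):
    """Concatenate per-head chunks back into a single d_model-sized vector."""
    return [x for chunk in chunks for x in chunk]

token = list(range(8))  # stand-in for a projected token with d_model = 8
chunks = split_heads(token, heads=2)

print(chunks)                        # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(merge_heads(chunks) == token)  # True
```

With d_model = 512 and 8 heads, each head works on 64 dimensions, so the total cost stays close to that of one full-width attention.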


## Part 7: Feed Forward + Residual Connections

Each token’s representation is passed through a feedforward network:

```python
from torch import nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))
```
Inside the encoder layer's forward method, the feed-forward output is added back to its input (a residual connection) and then normalized:

```python
x = x + self.dropout(self.ff(x))
x = self.norm(x)
```
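
The two lines above can be unpacked in plain Python. This is a simplified sketch under stated assumptions: dropout is omitted, the layer norm has no learned scale or shift, and `toy_ff` is a hypothetical stand-in for the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Residual step followed by normalization: norm(x + sublayer(x))."""
    added = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(added)

def toy_ff(x):
    """Hypothetical stand-in for the feed-forward network: doubles every entry."""
    return [2.0 * v for v in x]

y = residual_block([1.0, 2.0, 3.0, 4.0], toy_ff)
print(y)  # zero mean, roughly unit variance
```

The residual path lets the input pass through unchanged alongside the sublayer's output, and the normalization keeps activations on a consistent scale from layer to layer.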


## Part 8: Full Encoder-Decoder Architecture

Translation pipeline:

```mermaid
graph LR
  A["Input: I am good"] --> B[Encoder]
  B --> C[Context Vectors]
  C --> D[Decoder]
  D --> E["Output: Je vais bien"]
```

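During training:
- The decoder receives the entire target sentence at once (teacher forcing), with a mask that stops each position from attending to later words.

During inference:
- It sees only its own previous outputs and generates one token at a time.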
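
One common way to reconcile the two regimes is a causal (look-ahead) mask: the decoder is given the whole target sequence during training, but position i may only attend to positions up to i. A minimal sketch follows; the function name `causal_mask` is illustrative, and the 1/0 convention is chosen to match the `mask == 0` check in the attention function from Part 5.

```python
def causal_mask(seq_len):
    """Lower-triangular mask: entry [i][j] is 1 if position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(3)
for row in mask:
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```

Converted to a tensor, such a mask could be passed as the `mask` argument of `attention`, so that the zero entries receive a large negative score before the softmax.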


## Part 9: Vision Transformers (ViT)

Images are split into fixed-size patches, and each flattened patch becomes a token:
```python
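import torch

def image_to_patches(img, patch_size=16):
    # img: (B, C, H, W); H and W must be divisible by patch_size
    B, C, H, W = img.shape
    img = img.reshape(B, C, H // patch_size, patch_size, W // patch_size, patch_size)
    # Reorder to (B, H/p, W/p, p, p, C), then flatten each patch into one vector
    return img.permute(0, 2, 4, 3, 5, 1).reshape(B, -1, patch_size * patch_size * C)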
```
The patch tokens are then linearly projected, given position embeddings, and fed through the same transformer encoder architecture.
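
It helps to work out the resulting sequence length by hand. The sketch below does the arithmetic for an assumed 224 x 224 RGB image; the image size and the helper name `patch_grid` are illustrative choices, not taken from the post.

```python
def patch_grid(height, width, channels, patch_size=16):
    """Return (number of patches, flattened patch length) for a single image."""
    assert height % patch_size == 0 and width % patch_size == 0, \
        "image sides must be divisible by patch_size"
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim

# Assumed example size: a 224 x 224 RGB image.
num_patches, patch_dim = patch_grid(224, 224, 3)
print(num_patches, patch_dim)  # 196 768
```

So the encoder sees a sequence of 196 tokens, each a 768-dimensional vector before projection, which is a sequence length comparable to a paragraph of text.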

### ViT Diagram

```mermaid
graph LR
  A[Image] --> B[Patchify + Embed]
  B --> C[Transformer Encoder]
  C --> D[Classification Head]
```


## Part 10: Fine-Tuning a Transformer (Hugging Face)

Here’s how to fine-tune a transformer (like BERT) for sentiment analysis:
```python
from transformers import Trainer, TrainingArguments, BertForSequenceClassification, BertTokenizerFast
from datasets import load_dataset

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
dataset = load_dataset("imdb")

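def tokenize(batch):
    # Pad to a fixed length so every example has the same shape when batched
    return tokenizer(batch['text'], padding='max_length', truncation=True)

dataset = dataset.map(tokenize, batched=True)
# Note: recent transformers releases rename this argument to `eval_strategy`
training_args = TrainingArguments("./bert-finetuned", evaluation_strategy="epoch")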

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'].shuffle(seed=42).select(range(5000)),
    eval_dataset=dataset['test'].shuffle(seed=42).select(range(1000))
)

trainer.train()
```

## Part 11: Visualizing Transformers with Gradio

Let’s use Gradio to wrap a sentiment classifier in a small interactive demo.
```python
import gradio as gr
from transformers import pipeline

pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

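def classify(text):
    # The pipeline returns a list like [{'label': ..., 'score': ...}];
    # Gradio's "label" output expects a {label: confidence} dict
    return {r["label"]: r["score"] for r in pipe(text)}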

demo = gr.Interface(fn=classify, inputs="text", outputs="label")
demo.launch()
```
You can extend this with heatmaps to visualize token-level attention weights.


## Conclusion

Transformers model sequences, whether text or image patches, by combining:
- Embeddings
- Positional encoding
- Attention
- Multi-head parallelism
- Feedforward networks

They’ve largely replaced RNNs in NLP and now rival CNNs in vision. The next frontier? Efficient, universal models across all modalities.


## Transformer Fine-Tuning + Gradio Demo
This notebook fine-tunes a BERT model on the IMDb dataset and builds a simple Gradio demo.

```python
# Install required packages
!pip install transformers datasets gradio --quiet
```

```python
# Load the dataset and tokenizer
from datasets import load_dataset
from transformers import BertTokenizerFast

dataset = load_dataset("imdb")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Pad to a fixed length so the DataLoader can stack examples into one tensor
    return tokenizer(batch["text"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize, batched=True)
```

```python
import torch

class BertIMDBDataset(torch.utils.data.Dataset):
    def __init__(self, hf_dataset):
        self.dataset = hf_dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Fetch the row once and keep only the fields needed for BERT
        row = self.dataset[idx]
        return {
            'input_ids': torch.tensor(row['input_ids']),
            'attention_mask': torch.tensor(row['attention_mask']),
            'labels': torch.tensor(row['label']),
        }

train_dataset = BertIMDBDataset(dataset['train'])
eval_dataset = BertIMDBDataset(dataset['test'])
```

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizer
from torch.optim import AdamW

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# train_dataset and eval_dataset are the PyTorch Dataset objects built in the previous cell
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
eval_loader = DataLoader(eval_dataset, batch_size=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs = 1

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        # Each batch is a dict with 'input_ids', 'attention_mask', 'labels'
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1} completed.")

    # Evaluation loop (optional)
    model.eval()
    total, correct = 0, 0
    with torch.no_grad():
        for batch in eval_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = torch.argmax(outputs.logits, dim=-1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    print(f"Validation accuracy: {correct/total:.4f}")
```

Output:

```
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```


## Save and Reload the Fine-Tuned Model

```python
# The manual training loop above does not create a Trainer, so save the model directly
model.save_pretrained("./bert-finetuned")
tokenizer.save_pretrained("./bert-finetuned")
```

## Build Gradio Demo

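```python
import gradio as gr
from transformers import pipeline

pipe = pipeline("text-classification", model="./bert-finetuned")

def classify(text):
    # Convert the pipeline's list of {'label', 'score'} dicts into the
    # {label: confidence} mapping that Gradio's "label" output expects
    return {r["label"]: r["score"] for r in pipe(text)}

demo = gr.Interface(fn=classify, inputs="text", outputs="label")
demo.launch(share=True)
```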

## Attention Visualization with BERTViz

```python
# Install bertviz
!pip install bertviz --quiet
```

```python
from bertviz import head_view
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "The dog chased the cat because it was fast."
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)

# Launch attention head visualization
head_view(outputs.attentions, tokens=tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
```