# Workshop: Simple Transformer Encoder with Next-Word Prediction
This workshop builds a simple Transformer encoder from scratch, trains it on a toy dataset, and predicts the next word given an input sequence.

We cover positional encoding, multi-head attention, feed-forward layers, residual connections, and a training loop for next-word prediction.

## Toy Dataset Preparation
We create a small vocabulary and toy sentences for training. Each sentence is tokenized into word indices.
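Input sequences exclude the last word and target sequences exclude the first word, so the target at each position is the word that follows the corresponding input word (the shifted-target setup used with teacher forcing).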
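
The shift between inputs and targets can be sketched in plain Python. The helper name `make_pair` and the example sentence are assumptions made for illustration; they are not part of the workshop code.

```python
# Illustrative sketch: build an (input, target) pair from one tokenized sentence.
# `make_pair` is a hypothetical helper name used only for this example.
def make_pair(token_ids):
    inputs = token_ids[:-1]   # drop the last token
    targets = token_ids[1:]   # drop the first token
    return inputs, targets

sentence = [0, 1, 2]  # e.g. ['i', 'love', 'machine'] under the vocabulary defined in Step 7
inputs, targets = make_pair(sentence)

for position, (x, y) in enumerate(zip(inputs, targets)):
    print(f"position {position}: input token {x} -> target token {y}")
```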

## Model Architecture
Our `SimpleTransformerModel` includes:
- Embedding layer to convert word indices into vectors
- Positional encoding to add position information
- Transformer encoder layer (self-attention + feed-forward)
- Linear output layer to predict token scores

## Training Loop
We train the model using cross-entropy loss.
The optimizer updates the model weights to minimize prediction error over epochs.
Batches of sequences are fed into the model, and the per-batch losses are averaged so that the mean loss can be reported every 10 epochs.
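
To make the loss concrete, here is a minimal plain-Python sketch of cross-entropy for a single prediction. The logits are made-up values over a hypothetical 3-word vocabulary; `nn.CrossEntropyLoss` applies the same formula to every position and, by default, averages the results.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy for one prediction: -log(softmax(logits)[target_index])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    prob_target = exps[target_index] / sum(exps)
    return -math.log(prob_target)

# Hypothetical logits over a 3-word vocabulary
confident = cross_entropy([5.0, 0.0, 0.0], target_index=0)
uniform = cross_entropy([0.0, 0.0, 0.0], target_index=0)

print(f"confident and correct: {confident:.4f}")
print(f"uniform guess:         {uniform:.4f}")  # equals log(3)
```

A confident, correct prediction yields a loss near zero, while a uniform guess over `V` words yields `log(V)`.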

## Prediction
Given an input sequence of words, the model outputs one vector of logits per position; the logits at the last position score the candidates for the next word.
We select the highest-scoring word as the predicted next word (greedy decoding).
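
The selection step can be sketched without PyTorch. The word list and logits below are made-up values for illustration. Because softmax preserves ordering, taking the argmax of the logits gives the same word as taking the argmax of the probabilities.

```python
import math

def softmax(logits):
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and last-position logits, for illustration only
words = ['i', 'love', 'machine']
logits = [0.2, -1.0, 3.1]

probs = softmax(logits)
best_by_logit = max(range(len(logits)), key=lambda i: logits[i])
best_by_prob = max(range(len(probs)), key=lambda i: probs[i])

print(words[best_by_logit], words[best_by_prob])
```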

## Summary
- We implemented key Transformer components and trained a simple next-word predictor.
- This foundational exercise demonstrates how Transformers learn sequence dependencies.
- Extend this by increasing dataset size, stacking layers, or adding decoders.
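
As one possible way to stack layers, PyTorch ships `nn.TransformerEncoder`, which repeats an `nn.TransformerEncoderLayer` a given number of times. The sketch below uses these built-in modules rather than the classes written in this workshop, and the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

d_model, num_heads, num_layers = 32, 4, 3  # arbitrary sizes for illustration

# Built-in PyTorch modules (not the workshop's own classes).
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads, dim_feedforward=64)
stacked_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

# The default layout is (seq_len, batch_size, d_model), matching this workshop.
x = torch.rand(5, 2, d_model)
out = stacked_encoder(x)
print(out.shape)
```

Stacking keeps the tensor shape unchanged, so the output layer that maps `d_model` to vocabulary scores can stay the same.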

# Workshop: Build a Simple Transformer Encoder from Scratch
This workshop demonstrates the core concepts of the Transformer architecture by implementing a simplified Transformer encoder using PyTorch.

## Key Components of Transformer Encoder
- Positional Encoding
- Multi-Head Self-Attention
- Feed-Forward Neural Network
- Layer Normalization and Residual Connections

This implementation focuses on the encoder part only.

## Step 1: Import Libraries and Setup

In [1]:

import torch
import torch.nn as nn
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')


Using device: cpu


## Step 2: Positional Encoding
Positional encoding injects information about each token's position into the embeddings. The Transformer has no recurrence, so without it self-attention would have no notion of token order.
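
For reference, the sinusoidal scheme implemented in the cell below follows the standard formulation, where $pos$ is the token position and $i$ indexes pairs of embedding dimensions:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

The code computes the denominator in log space: `div_term` equals $\exp\!\left(-2i \cdot \ln(10000) / d_{\text{model}}\right) = 10000^{-2i/d_{\text{model}}}$.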

In [2]:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(1)  # Shape: (max_len, 1, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (seq_len, batch_size, d_model)
        x = x + self.pe[:x.size(0), :]
        return x
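
A quick way to sanity-check the encoding is to compute a few rows by hand. The following plain-Python sketch (the function name `sinusoidal_encoding` is an assumption for this example) mirrors the tensor code above for a single position:

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):  # i runs over the even dimension indices
        angle = position / (10000.0 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding[:d_model]

print(sinusoidal_encoding(0, 4))  # position 0: sin(0)=0 and cos(0)=1 alternate
print(sinusoidal_encoding(1, 4))
```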


## Step 3: Multi-Head Self-Attention
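The `attention` method in the cell below computes scaled dot-product attention, and `forward` runs it once per head. In the standard formulation, with $h$ heads and $d_k = d_{\text{model}} / h$:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the head dimension, which would otherwise push the softmax into regions with very small gradients.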

In [3]:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.linear_q = nn.Linear(d_model, d_model)
        self.linear_k = nn.Linear(d_model, d_model)
        self.linear_v = nn.Linear(d_model, d_model)
        self.linear_out = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(0.1)

    def attention(self, query, key, value, mask=None):
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        p_attn = torch.softmax(scores, dim=-1)
        p_attn = self.dropout(p_attn)
        return torch.matmul(p_attn, value), p_attn

    def forward(self, query, key, value, mask=None):
        # Inputs are (seq_len, batch_size, d_model)
        seq_len, batch_size, _ = query.size()

        def split_heads(x):
            # (len, batch, d_model) -> (batch, num_heads, len, d_k)
            return x.view(x.size(0), batch_size, self.num_heads, self.d_k).permute(1, 2, 0, 3)

        # Linear projections, then split into heads
        query = split_heads(self.linear_q(query))
        key = split_heads(self.linear_k(key))
        value = split_heads(self.linear_v(value))

        # Apply attention across sequence positions (not across the batch);
        # x has shape (batch, num_heads, seq_len, d_k)
        x, attn = self.attention(query, key, value, mask=mask)

        # Concat heads and restore the (seq_len, batch, d_model) layout
        x = x.permute(2, 0, 1, 3).contiguous().view(seq_len, batch_size, self.d_model)

        # Final linear layer
        return self.linear_out(x)
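
The `mask` argument is never used later in this workshop, but it matters for next-word prediction: without a mask, a position can attend to the word that follows it, which is exactly the word it is asked to predict. The sketch below is an optional extension, not part of the original notebook. It builds a causal (lower-triangular) mask with the same `mask == 0` convention as the `attention` method and applies it to stand-alone random tensors, assuming scores laid out as `(batch, heads, seq_len, seq_len)`.

```python
import math
import torch

seq_len, d_k = 4, 8
torch.manual_seed(0)
query = torch.rand(1, 1, seq_len, d_k)  # assumed layout: (batch, heads, seq_len, d_k)
key = torch.rand(1, 1, seq_len, d_k)

# Lower-triangular mask: entry [i, j] is 1 when position i may attend to position j.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(causal_mask == 0, float('-inf'))  # same convention as `attention`
weights = torch.softmax(scores, dim=-1)

print(weights[0, 0])  # entries above the diagonal are zero: no attention to later positions
```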


## Step 4: Feed-Forward Network

In [4]:

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x
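
Written as a formula, the block above applies the same two-layer network to every position independently (dropout, which is active only during training, is omitted):

```latex
\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2,
\qquad W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}},\;
       W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}
```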


## Step 5: Transformer Encoder Layer

In [5]:

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src, src_mask=None):
        # Self-attention + add & norm
        src2 = self.self_attn(src, src, src, mask=src_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        # Feed-forward + add & norm
        src2 = self.feed_forward(src)
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
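
The two "add & norm" steps in `forward` can be summarized as follows. This is the post-norm arrangement, where layer normalization is applied after the residual sum:

```latex
x' = \mathrm{LayerNorm}\big(x + \mathrm{Dropout}(\mathrm{SelfAttn}(x))\big)

\mathrm{out} = \mathrm{LayerNorm}\big(x' + \mathrm{Dropout}(\mathrm{FFN}(x'))\big)
```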


## Step 6: Instantiate and Run Encoder on Sample Input

In [6]:

# Parameters
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

encoder_layer = TransformerEncoderLayer(d_model, num_heads).to(device)
pos_encoder = PositionalEncoding(d_model).to(device)

# Random input embeddings (seq_len, batch_size, d_model)
input_embeddings = torch.rand(seq_len, batch_size, d_model).to(device)

# Add positional encoding
input_pos_encoded = pos_encoder(input_embeddings)

# Forward pass through encoder
output = encoder_layer(input_pos_encoded)

print("Output shape:", output.shape)


Output shape: torch.Size([10, 2, 512])
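
As a rough check on model size, the number of trainable parameters in one encoder layer can be worked out by hand from the layer definitions above. The helper `linear_params` is a hypothetical name used only in this sketch.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

d_model, d_ff = 512, 2048

attention_params = 4 * linear_params(d_model, d_model)  # q, k, v, and output projections
feed_forward_params = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
layer_norm_params = 2 * (2 * d_model)  # two LayerNorms, each with a scale and a shift vector

total = attention_params + feed_forward_params + layer_norm_params
print(total)  # 3152384
```

The result can be compared against `sum(p.numel() for p in encoder_layer.parameters())` in the notebook.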



```
       Encoded Output
             ^
             |
        Add & Norm 2
          ^     ^
          |     | (Residual)
          |     |
Feed-Forward Network
          ^
          |
       Add & Norm 1
          ^     ^
          |     | (Residual)
          |     |
Multi-Head Self-Attention
          ^
          |
Positional Encoding
          ^
          |
  Input Embeddings
```

## Step 7: Train Transformer for Next-Word Prediction

In [7]:

import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

# Toy dataset: simple tokenized sentences (word indices)
vocab = ['i', 'love', 'machine', 'learning', 'and', 'deep', 'neural', 'networks', '<pad>']
word2idx = {w: idx for idx, w in enumerate(vocab)}
print(word2idx)

idx2word = {idx: w for w, idx in word2idx.items()}
print(idx2word)


{'i': 0, 'love': 1, 'machine': 2, 'learning': 3, 'and': 4, 'deep': 5, 'neural': 6, 'networks': 7, '<pad>': 8}
{0: 'i', 1: 'love', 2: 'machine', 3: 'learning', 4: 'and', 5: 'deep', 6: 'neural', 7: 'networks', 8: '<pad>'}


In [9]:

sentences = [
    ['i', 'love', 'machine'],
    ['machine', 'learning', 'and'],
    ['deep', 'neural', 'networks'],
    ['i', 'love', 'deep'],
    ['neural', 'networks', 'and'],
]

max_len = 3

# Convert to indices
def encode(sent):
    return [word2idx[w] for w in sent]

encoded_sents = [encode(sent) for sent in sentences]
print(encoded_sents)

[[0, 1, 2], [2, 3, 4], [5, 6, 7], [0, 1, 5], [6, 7, 4]]
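
All toy sentences here have the same length, so the `<pad>` token is never needed. If sentences of different lengths were added, they could be right-padded to a common length. The following is a minimal sketch under that assumption; `pad_to_length` is a hypothetical helper, and `pad_idx = 8` matches `'<pad>'` in the vocabulary above.

```python
# Illustrative sketch: right-pad token-id lists to a common length.
def pad_to_length(token_ids, length, pad_idx):
    if len(token_ids) > length:
        raise ValueError("sequence is longer than the target length")
    return token_ids + [pad_idx] * (length - len(token_ids))

pad_idx = 8
batch = [[0, 1], [2, 3, 4]]  # sentences of different lengths
padded = [pad_to_length(ids, 3, pad_idx) for ids in batch]
print(padded)  # [[0, 1, 8], [2, 3, 4]]
```

When padding is used, passing `ignore_index=pad_idx` to `nn.CrossEntropyLoss` keeps padded target positions from contributing to the loss.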


In [11]:

class ToyDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        x = torch.tensor(self.data[idx][:-1])
        y = torch.tensor(self.data[idx][1:])
        return x, y

dataset = ToyDataset(encoded_sents)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Wrap the encoder layer with an embedding, positional encoding, and an output projection
class SimpleTransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, d_ff=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model)
        self.encoder_layer = TransformerEncoderLayer(d_model, num_heads, d_ff)
        self.fc_out = nn.Linear(d_model, vocab_size)
        
    def forward(self, src):
        # src shape: (seq_len, batch_size)
        embedded = self.embedding(src)  # (seq_len, batch_size, d_model)
        pos_encoded = self.pos_encoder(embedded)
        encoded = self.encoder_layer(pos_encoded)
        output = self.fc_out(encoded)
        return output

# Initialize model
vocab_size = len(vocab)
model = SimpleTransformerModel(vocab_size, d_model=32, num_heads=4, d_ff=64).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
model.train()
for epoch in range(50):
    total_loss = 0
    for x_batch, y_batch in dataloader:
        x_batch = x_batch.transpose(0,1).to(device)  # (seq_len, batch)
        y_batch = y_batch.transpose(0,1).to(device)  # (seq_len, batch)
        optimizer.zero_grad()
        output = model(x_batch)  # (seq_len, batch, vocab_size)
        loss = criterion(output.view(-1, vocab_size), y_batch.reshape(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")

# Prediction: Given an input sequence, predict next word
def predict_next_word(model, input_words):
    model.eval()
    with torch.no_grad():
        input_ids = torch.tensor([word2idx[w] for w in input_words]).unsqueeze(1).to(device)  # (seq_len, batch=1)
        output = model(input_ids)  # (seq_len, batch, vocab_size)
        last_token_logits = output[-1, 0]
        predicted_idx = torch.argmax(last_token_logits).item()
        return idx2word[predicted_idx]



Epoch 10, Loss: 0.1375
Epoch 20, Loss: 0.1704
Epoch 30, Loss: 0.1275
Epoch 40, Loss: 0.1327
Epoch 50, Loss: 0.1151


In [12]:
# Example usage
input_sequence = ['i', 'love']
predicted_word = predict_next_word(model, input_sequence)
print(f"Given input words {input_sequence}, predicted next word: {predicted_word}")

Given input words ['i', 'love'], predicted next word: machine


In [13]:
# Example usage
input_sequence = ['i', 'love', 'machine']
predicted_word = predict_next_word(model, input_sequence)
print(f"Given input words {input_sequence}, predicted next word: {predicted_word}")

Given input words ['i', 'love', 'machine'], predicted next word: learning
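
The two examples above suggest a simple loop: append the predicted word to the input and predict again. Below is a minimal sketch of such greedy generation. To keep it runnable on its own, it uses a hypothetical stand-in predictor (a fixed lookup table with made-up entries) instead of the trained model.

```python
# Hypothetical stand-in for `predict_next_word(model, words)`: a fixed lookup on the last word.
# It exists only so this sketch runs on its own; its entries are made up.
toy_next_word = {'i': 'love', 'love': 'machine', 'machine': 'learning', 'learning': 'and'}

def stand_in_predictor(words):
    return toy_next_word.get(words[-1], '<pad>')

def generate(predict_fn, prompt, num_new_words):
    """Greedy decoding: append the predicted word and feed the longer sequence back in."""
    words = list(prompt)
    for _ in range(num_new_words):
        words.append(predict_fn(words))
    return words

print(generate(stand_in_predictor, ['i', 'love'], num_new_words=2))
# ['i', 'love', 'machine', 'learning']
```

With the trained model, the same loop could be called as `generate(lambda ws: predict_next_word(model, ws), ['i', 'love'], 2)`.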
