# Lab 1: Pre-training Encoder and Decoder Models from Scratch

## Learning Objectives
In this lab, you will:
1. Understand the difference between encoder and decoder architectures
2. Implement Masked Language Modeling (MLM) for encoder pre-training
3. Implement Causal Language Modeling (CLM) for decoder pre-training
4. Train both models on a toy dataset
5. Compare their strengths and limitations

## Background
From the lecture, you learned that:
- **Encoders** (like BERT) use bidirectional context and are pre-trained with Masked Language Modeling
- **Decoders** (like GPT) use unidirectional context and are pre-trained with Causal Language Modeling
- Pre-training teaches models general language understanding before fine-tuning on specific tasks
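
To make the bidirectional/unidirectional distinction concrete, here is a minimal pure-Python sketch of which positions each token may look at. It is only an illustration, not how PyTorch or `transformers` build attention masks, and the helper name `attention_mask` is an assumption made for this example.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; entry [i][j] is 1 if position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["The", "lion", "is", "the", "king"]
bidirectional = attention_mask(len(tokens), causal=False)  # encoder-style: every token sees every token
causal = attention_mask(len(tokens), causal=True)          # decoder-style: only current and earlier tokens

for name, mask in [("Encoder (bidirectional)", bidirectional), ("Decoder (causal)", causal)]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>5}: {row}")
```

The encoder matrix is all ones, while the decoder matrix is lower-triangular: the first token sees only itself and the last token sees the whole prefix.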

## Step 1: Install Required Libraries

We'll use the Hugging Face `transformers` library and PyTorch.

In [None]:
# Install required packages
!pip install transformers datasets torch tokenizers accelerate matplotlib -q

## Step 2: Import Libraries and Set Up

In [None]:
import torch
import torch.nn as nn
from transformers import (
    BertConfig, BertForMaskedLM,
    GPT2Config, GPT2LMHeadModel,
    PreTrainedTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer, TrainingArguments
)
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from datasets import Dataset
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## Step 3: Create a Toy Dataset

We'll create a small corpus about African animals and culture. This toy dataset will be used to pre-train both models.

In [None]:
# Toy corpus - sentences about African themes
toy_corpus = [
    "The lion is the king of the savanna.",
    "Elephants are the largest land animals in Africa.",
    "Giraffes have very long necks to reach tall trees.",
    "Zebras have black and white stripes.",
    "The cheetah is the fastest land animal.",
    "Hippos spend most of their time in water.",
    "Rhinos have thick skin and large horns.",
    "Lions live in groups called prides.",
    "Elephants use their trunks to drink water.",
    "The leopard is a skilled climber.",
    "African cultures are rich and diverse.",
    "Many languages are spoken across the continent.",
    "Traditional music uses drums and dancing.",
    "The baobab tree is called the tree of life.",
    "The Sahara is the largest hot desert.",
    "Victoria Falls is one of the largest waterfalls.",
    "Mount Kilimanjaro is the highest mountain in Africa.",
    "The Nile is the longest river in the world.",
    "Coral reefs exist along the coast.",
    "Rainforests are home to many species.",
    "The sun shines brightly in the sky.",
    "Birds fly from tree to tree.",
    "Fish swim in the rivers and lakes.",
    "People farm crops like maize and cassava.",
    "Children play games in the village.",
    "Markets sell fruits and vegetables.",
    "Storytelling is an important tradition.",
    "Elders share wisdom with the young.",
    "Music and dance celebrate life.",
    "Artists create beautiful sculptures and paintings.",
]

# Repeat the corpus so training runs for more steps (this adds no new information)
toy_corpus = toy_corpus * 20  # 30 unique sentences -> 600 sentences in total

print(f"Total sentences in corpus: {len(toy_corpus)}")
print("\nFirst 5 sentences:")
for i, sent in enumerate(toy_corpus[:5]):
    print(f"{i+1}. {sent}")

## Step 4: Train a Tokenizer

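Both models need a tokenizer to convert text into tokens. We'll train a simple WordPiece tokenizer from scratch and share it between the encoder and the decoder. Because no post-processor is configured, this tokenizer does not add `[CLS]` or `[SEP]` to the sentences automatically.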
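
Once a WordPiece vocabulary exists, a word is split by repeatedly taking the longest piece that matches. The sketch below shows that greedy step in plain Python under stated assumptions: `wordpiece_encode` and `toy_vocab` are hypothetical names and values for illustration, and this is not the `tokenizers` library's implementation (which also learns the vocabulary during training).

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen by hand for this example
toy_vocab = {"lion", "king", "sav", "##s", "##anna", "##an", "##na"}

for word in ["lions", "savanna", "zebra"]:
    print(word, "->", wordpiece_encode(word, toy_vocab))
```

With a 1,000-entry budget and a corpus this small, most words will likely end up as single tokens, so splits like these should be rare in the lab itself.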

In [None]:
# Create a WordPiece tokenizer
tokenizer_model = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer_model.pre_tokenizer = pre_tokenizers.Whitespace()

# Train the tokenizer
trainer = trainers.WordPieceTrainer(
    vocab_size=1000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

# Save corpus to a temporary file for training
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    for text in toy_corpus:
        f.write(text + "\n")

tokenizer_model.train(files=["toy_corpus.txt"], trainer=trainer)

# Save and load as PreTrainedTokenizerFast
tokenizer_model.save("toy_tokenizer.json")
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="toy_tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

print(f"Vocabulary size: {len(tokenizer)}")
print(f"\nExample tokenization:")
example = "The lion is the king of the savanna."
tokens = tokenizer.tokenize(example)
print(f"Text: {example}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {tokenizer.convert_tokens_to_ids(tokens)}")

## Step 5: Prepare Dataset for Training

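We'll tokenize the corpus, pad or truncate every sentence to 64 tokens, and wrap the result in a Hugging Face `Dataset`. Keep in mind that the corpus is the same 30 sentences repeated, so the validation split contains only sentences that also appear in the training split. The validation loss therefore measures memorization, not generalization.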

In [None]:
# Tokenize the corpus
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=64, padding="max_length")

# Create dataset
dataset_dict = {"text": toy_corpus}
dataset = Dataset.from_dict(dataset_dict)

# Tokenize
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

# Split into train and validation
train_size = int(0.9 * len(tokenized_dataset))
eval_size = len(tokenized_dataset) - train_size

train_dataset = tokenized_dataset.select(range(train_size))
eval_dataset = tokenized_dataset.select(range(train_size, len(tokenized_dataset)))

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")
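
A quick way to see what the sequential 90/10 split does to a repeated corpus is to count how many validation sentences never occur in training. This is a small stand-alone sketch: the placeholder sentences and the `demo_` variable names are assumptions, chosen so the cell does not overwrite the lab's variables.

```python
# Stand-in for the lab corpus: 30 unique sentences repeated 20 times, in the same order.
demo_unique = [f"sentence {i}" for i in range(30)]
demo_corpus = demo_unique * 20

# Same sequential 90/10 split as in the lab.
demo_train_size = int(0.9 * len(demo_corpus))
demo_train = demo_corpus[:demo_train_size]
demo_eval = demo_corpus[demo_train_size:]

# Sentences in the validation split that training never saw
demo_unseen = set(demo_eval) - set(demo_train)

print(f"Train: {len(demo_train)}, eval: {len(demo_eval)}")
print(f"Unique eval sentences: {len(set(demo_eval))}")
print(f"Eval sentences never seen in training: {len(demo_unseen)}")
```

For a more honest validation signal, one option would be to hold out a few of the 30 unique sentences before repeating the rest.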

## Step 6: Pre-train an Encoder Model (BERT-style)

### 6.1 Understanding Masked Language Modeling (MLM)

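MLM works by:
1. Randomly selecting 15% of the tokens in each sentence. In BERT, 80% of the selected tokens are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged
2. Training the model to predict the original token at each selected position
3. Using bidirectional context (looking at words before AND after the mask)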

Example:
- Original: "The lion is the king of the savanna"
- Masked: "The lion is the [MASK] of the savanna"
- Model predicts: "king"
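
The masking procedure can be sketched in a few lines of plain Python. This is an illustrative approximation of BERT's scheme (select 15% of tokens, then replace 80% of those with `[MASK]`, 10% with a random token, and keep 10% unchanged), not the code that `transformers` runs. The function name `mlm_mask` and the use of `None` for "no prediction needed" are assumptions made for this example.

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", mlm_probability=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels are None where no prediction is required."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mlm_probability:
            corrupted.append(tok)   # not selected: left alone, contributes no loss
            labels.append(None)
            continue
        labels.append(tok)          # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels

tokens = "The lion is the king of the savanna".split()
vocab = sorted(set(tokens))
corrupted, labels = mlm_mask(tokens, vocab, rng=random.Random(0))
print("Input: ", corrupted)
print("Labels:", labels)
```

Only positions with a label contribute to the loss, which is why an MLM model learns from roughly 15% of the tokens in each pass.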

In [None]:
# Create a small BERT model configuration

encoder_config = BertConfig(
    vocab_size=len(tokenizer),
    hidden_size=128,           # Small for our toy dataset
    num_hidden_layers=2,       # Only 2 layers
    num_attention_heads=2,     # 2 attention heads
    intermediate_size=512,
    max_position_embeddings=64,
    pad_token_id=tokenizer.pad_token_id,
)

# Initialize the model
encoder_model = BertForMaskedLM(encoder_config)

print(f"Encoder model parameters: {encoder_model.num_parameters():,}")
print(f"\nModel architecture:")
print(encoder_model)

### 6.2 Set Up MLM Data Collator

The data collator applies the masking on the fly each time a batch is built, so the same sentence is masked differently from one epoch to the next.

In [None]:
# Data collator for MLM (automatically masks tokens)
mlm_data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # Enable Masked Language Modeling
    mlm_probability=0.15   # Mask 15% of tokens (as in BERT paper)
)

# Let's see an example of how masking works
print("Example of MLM masking:")
print("-" * 50)
example_text = "The lion is the king of the savanna."
inputs = tokenizer(example_text, return_tensors="pt", padding="max_length", max_length=64)

# Apply data collator to see masking
# Note: data collator expects a list of dictionaries, not tensors
batch = [{key: val[0] for key, val in inputs.items()}]
masked_batch = mlm_data_collator(batch)

print(f"Original text: {example_text}")
print(f"\nOriginal tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist()[:15])}")
print(f"\nMasked tokens: {tokenizer.convert_ids_to_tokens(masked_batch['input_ids'][0].tolist()[:15])}")
print("\n'[MASK]' marks tokens the model must predict. With only a few tokens and a 15% rate, some runs mask nothing; re-run the cell to resample.")

### 6.3 Train the Encoder Model

In [None]:
# Training arguments for encoder
encoder_training_args = TrainingArguments(
    output_dir="bert_toy",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    learning_rate=5e-4,
    weight_decay=0.01,
    warmup_steps=50,
    load_best_model_at_end=True,
    report_to="none",  # Disable wandb/tensorboard
)

# Create trainer
encoder_trainer = Trainer(
    model=encoder_model,
    args=encoder_training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=mlm_data_collator,
)

# Train the model
print("Starting encoder pre-training with Masked Language Modeling...")
print("This will take a few minutes.\n")
encoder_results = encoder_trainer.train()

print("\n" + "="*50)
print("Encoder Pre-training Complete!")
print("="*50)
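
The `learning_rate` and `warmup_steps` arguments interact through the learning-rate schedule. The sketch below assumes the `Trainer`'s default linear schedule (linear warmup, then linear decay to zero), a single device, and no gradient accumulation. The helper `linear_schedule_lr` is a hypothetical re-implementation for illustration, not a library function.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values from this lab: 540 training samples, batch size 8, 10 epochs.
steps_per_epoch = math.ceil(540 / 8)
total_steps = steps_per_epoch * 10

for step in [0, 25, 50, total_steps // 2, total_steps]:
    lr = linear_schedule_lr(step, base_lr=5e-4, warmup_steps=50, total_steps=total_steps)
    print(f"step {step:>3}: lr = {lr:.6f}")
```

Under these assumptions the run lasts 680 optimizer steps, so the 50 warmup steps cover a little less than the first epoch.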

### 6.4 Test the Encoder Model

Let's test if the model learned to predict masked words.

In [None]:
from transformers import pipeline

# Create a fill-mask pipeline
fill_mask = pipeline(
    "fill-mask",
    model=encoder_model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test sentences with [MASK]
test_sentences = [
    "The lion is the [MASK] of the savanna.",
    "Elephants are the [MASK] land animals.",
    "The cheetah is the [MASK] land animal.",
]

print("Testing Encoder Model (MLM):")
print("="*70)

for sent in test_sentences:
    print(f"\nSentence: {sent}")
    results = fill_mask(sent, top_k=3)
    print("Predictions:")
    for i, result in enumerate(results, 1):
        print(f"  {i}. {result['token_str']:>10} (score: {result['score']:.4f})")

## Step 7: Pre-train a Decoder Model (GPT-style)

### 7.1 Understanding Causal Language Modeling (CLM)

CLM works by:
1. Predicting the next word based ONLY on previous words
2. Using unidirectional (left-to-right) context
3. Training the model to continue text naturally

Example:
- Input: "The lion is the"
- Model predicts: "king"
- Then: "The lion is the king"
- Model predicts: "of"
- And so on...
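
The predict-append-repeat loop in the example can be sketched with a toy stand-in for the decoder. The "model" here is just a table of next-word counts built from three corpus sentences, so it conditions on the last word only, whereas a real decoder conditions on the whole prefix. The names `next_word_counts` and `generate` are assumptions for this example.

```python
from collections import Counter, defaultdict

sentences = [
    "The lion is the king of the savanna .",
    "The cheetah is the fastest land animal .",
    "The leopard is a skilled climber .",
]

# Count which word follows which: a crude stand-in for a trained decoder.
next_word_counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_word_counts[prev][nxt] += 1

def generate(prompt, max_new_tokens=5):
    """Greedy autoregressive decoding: append the most likely next word, then repeat."""
    words = prompt.split()
    for _ in range(max_new_tokens):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break  # nothing ever followed this word in the toy data
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(generate("The lion is the", max_new_tokens=2))
```

Each generated word is fed back in as context for the next prediction, which is the same loop the GPT-style model runs at generation time.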

In [None]:
# Create a small GPT-2 model configuration
decoder_config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=64,
    n_embd=128,
    n_layer=2,
    n_head=2,
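    # [CLS]/[SEP] double as BOS/EOS IDs here, but the tokenizer never inserts them,
    # so the model sees no end-of-sequence examples and generation usually runs to max_new_tokens.
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,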
    pad_token_id=tokenizer.pad_token_id,
)

# Initialize the model
decoder_model = GPT2LMHeadModel(decoder_config)

print(f"Decoder model parameters: {decoder_model.num_parameters():,}")
print(f"\nModel architecture:")
print(decoder_model)

### 7.2 Set Up CLM Data Collator

In [None]:
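# Data collator for CLM: no masking. Labels are a copy of input_ids (padding set to -100);
# the model shifts them by one position internally when computing the loss.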
clm_data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Disable MLM for causal language modeling
)

print("CLM Training Setup:")
print("-" * 50)
print("In CLM, the model learns to predict the next token.")
print("\nExample:")
print("Input:  The lion is the king")
print("Target: lion is the king of")
print("\nThe model learns: given 'The', predict 'lion'")
print("                   given 'The lion', predict 'is'")
print("                   given 'The lion is', predict 'the'")
print("                   ... and so on.")
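
How the inputs and targets line up can be sketched in plain Python. The assumptions here: labels start as a copy of the input IDs, padding positions are set to `-100` so the loss ignores them, and the one-position shift happens when the loss is computed. The token IDs and the helper names `clm_labels` and `shifted_pairs` are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def clm_labels(input_ids, pad_token_id):
    """Copy the inputs as labels, hiding padding from the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def shifted_pairs(input_ids, labels):
    """Pair each position's input with the label one step to its right."""
    return [
        (inp, target)
        for inp, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]

# Hypothetical token IDs for "The lion is the king" followed by two [PAD] tokens (ID 0).
input_ids = [7, 12, 9, 8, 15, 0, 0]
labels = clm_labels(input_ids, pad_token_id=0)

print("Labels:", labels)
print("(input, target) pairs:", shifted_pairs(input_ids, labels))
```

Five real tokens yield four training pairs, and the padding positions contribute nothing to the loss.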

### 7.3 Train the Decoder Model

In [None]:
# Training arguments for decoder
decoder_training_args = TrainingArguments(
    output_dir="gpt_toy",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    learning_rate=5e-4,
    weight_decay=0.01,
    warmup_steps=50,
    load_best_model_at_end=True,
    report_to="none",
)

# Create trainer
decoder_trainer = Trainer(
    model=decoder_model,
    args=decoder_training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=clm_data_collator,
)

# Train the model
print("Starting decoder pre-training with Causal Language Modeling...")
print("This will take a few minutes.\n")
decoder_results = decoder_trainer.train()

print("\n" + "="*50)
print("Decoder Pre-training Complete!")
print("="*50)

### 7.4 Test the Decoder Model

Let's test if the model learned to generate text.

In [None]:
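# Sanity check: the embedding matrix needs one row per tokenizer entry,
# otherwise some token IDs would index past the end of the embedding table.
print("Tokenizer vocab size:", len(tokenizer))
print("Model embed size:", decoder_model.get_input_embeddings().weight.size(0))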


In [None]:
from transformers import pipeline

# Create a text generation pipeline
text_generator = pipeline(
    "text-generation",
    model=decoder_model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test prompts
test_prompts = [
    "The lion is",
    "Elephants are",
    "The cheetah is the",
]

print("Testing Decoder Model (Text Generation):")
print("="*70)

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    outputs = text_generator(
        prompt,
        max_new_tokens=30,          # cap on generated tokens, independent of prompt length
        truncation=True,            # truncate over-long prompts
        num_return_sequences=2,
        temperature=0.8,
        do_sample=True,
        # [PAD] has ID 0 here, so `pad_token_id or eos_token_id` would treat it as missing.
        pad_token_id=tokenizer.pad_token_id,
    )
    print("Generated texts:")
    for i, output in enumerate(outputs, 1):
        print(f"  {i}. {output['generated_text']}")
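
The `temperature` argument rescales the model's scores before sampling. Here is a minimal sketch of that idea in plain Python. The scores are made-up values and `softmax_with_temperature` is a hypothetical helper, so treat this as an illustration of the math, not the pipeline's internals.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores after the prompt "The lion is the"
scores = {"king": 2.0, "fastest": 1.0, "largest": 0.5}

for temperature in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(list(scores.values()), temperature)
    summary = ", ".join(f"{tok}={p:.2f}" for tok, p in zip(scores, probs))
    print(f"T={temperature}: {summary}")

# Sampling (do_sample=True) draws from these probabilities instead of always taking the top token.
rng = random.Random(0)
weights = softmax_with_temperature(list(scores.values()), 0.8)
samples = rng.choices(list(scores), weights=weights, k=5)
print("Samples at T=0.8:", samples)
```

Temperatures below 1 concentrate probability on the top-scoring token, and temperatures above 1 spread it out, which makes the generated text more varied and less predictable.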

## Step 8: Compare Encoder vs Decoder

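Let's plot the training and validation losses of both models. The two curves are not directly comparable: the MLM loss is averaged only over the selected positions (about 15% of tokens), while the CLM loss is averaged over every non-padding token. Compare the shape of each curve, not the absolute values.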

In [None]:
# Extract training histories
encoder_history = encoder_trainer.state.log_history
decoder_history = decoder_trainer.state.log_history

# Get training losses
encoder_train_loss = [log['loss'] for log in encoder_history if 'loss' in log]
decoder_train_loss = [log['loss'] for log in decoder_history if 'loss' in log]

# Get eval losses
encoder_eval_loss = [log['eval_loss'] for log in encoder_history if 'eval_loss' in log]
decoder_eval_loss = [log['eval_loss'] for log in decoder_history if 'eval_loss' in log]

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training loss
axes[0].plot(encoder_train_loss, label='Encoder (BERT)', marker='o')
axes[0].plot(decoder_train_loss, label='Decoder (GPT)', marker='s')
axes[0].set_xlabel('Logging step (every 10 training steps)')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Validation loss
axes[1].plot(encoder_eval_loss, label='Encoder (BERT)', marker='o')
axes[1].plot(decoder_eval_loss, label='Decoder (GPT)', marker='s')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].set_title('Validation Loss Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nFinal Results:")
print("="*50)
print(f"Encoder final train loss: {encoder_train_loss[-1]:.4f}")
print(f"Encoder final eval loss:  {encoder_eval_loss[-1]:.4f}")
print(f"\nDecoder final train loss: {decoder_train_loss[-1]:.4f}")
print(f"Decoder final eval loss:  {decoder_eval_loss[-1]:.4f}")
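
Language-model losses are often reported as perplexity, which is the exponential of the mean cross-entropy loss (measured in nats, as PyTorch does). The sketch below uses made-up loss values, not results from this lab, and the helper name `perplexity` is an assumption.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

# Hypothetical loss values, for illustration only
for loss in (0.1, 1.0, 3.0):
    print(f"loss = {loss:.1f} -> perplexity = {perplexity(loss):.2f}")
```

A perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens. For the encoder this is computed over masked positions only, so it is not the same quantity as the decoder's perplexity.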

## Step 9: Summary and Key Takeaways

### What We Learned

1. **Encoder Models (BERT-style)**
   - Use **Masked Language Modeling (MLM)** for pre-training
   - Look at **bidirectional context** (words before AND after)
   - Good for: understanding tasks, classification, NER, QA
   - Limitation: Not trained to produce text left to right, so open-ended generation is awkward

2. **Decoder Models (GPT-style)**
   - Use **Causal Language Modeling (CLM)** for pre-training
   - Look at **unidirectional context** (only previous words)
   - Good for: text generation, completion, creative writing
   - Limitation: Cannot see future context

3. **Pre-training Importance**
   - Teaches models general language patterns
   - Transfer learning: "pre-train once, fine-tune many times"
   - Real models use much larger datasets (billions of words)

### Analogy from Lecture
Remember: Pre-training is like going to school (learning general knowledge), and fine-tuning is like medical school (specializing)!

## Exercise Questions

1. **Understanding MLM**: Why do we mask 15% of tokens instead of 50% or 5%? What would happen with extreme values?

2. **Bidirectional vs Unidirectional**: Complete this sentence using both models: "The elephant uses its ___". Which model gives better predictions and why?

3. **Generation Quality**: Try generating text starting with "The" using the decoder. Why might the quality be limited?

4. **Dataset Size**: We used only 30 unique sentences, repeated 20 times to make 600. How would results change with 1 million distinct sentences?

5. **Architecture Choice**: For each task below, would you use an encoder or decoder?
   - Sentiment analysis
   - Story completion
   - Named entity recognition
   - Chatbot
   - Text classification

## Optional: Save Your Models

You can save the pre-trained models for later use.

In [None]:
# Save encoder model
encoder_model.save_pretrained("pretrained_models/toy_bert")
tokenizer.save_pretrained("pretrained_models/toy_bert")

# Save decoder model
decoder_model.save_pretrained("pretrained_models/toy_gpt")
tokenizer.save_pretrained("pretrained_models/toy_gpt")

print("Models saved successfully!")
print("You can load them later using:")
print("  BertForMaskedLM.from_pretrained('pretrained_models/toy_bert')")
print("  GPT2LMHeadModel.from_pretrained('pretrained_models/toy_gpt')")