# Lab 3: Practice Exercises - Pre-training and Fine-tuning

## Overview
This notebook contains practice exercises to reinforce your understanding of pre-training strategies and fine-tuning. Work through these exercises at your own pace. Solutions are provided at the end.

## Learning Goals
- Apply pre-training concepts to new scenarios
- Experiment with different model architectures
- Fine-tune models for various NLP tasks
- Analyze and improve model performance
- Compare encoder vs decoder models

## Time Required
Approximately 2-3 hours

## Setup

In [None]:
!pip install transformers datasets torch accelerate scikit-learn seaborn -q

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    BertConfig,
    BertForMaskedLM,
    GPT2Config,
    GPT2LMHeadModel,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    DataCollatorForTokenClassification,
    pipeline
)
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

torch.manual_seed(42)
np.random.seed(42)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

---
# Exercise 1: Understanding Masked Language Modeling (MLM)

## Objective
Understand how MLM masking works and its effect on model training.

## Task
1. Create a simple dataset with 5 sentences about African wildlife
2. Manually implement a masking function that masks 15% of tokens
3. Visualize which tokens are masked
4. Discuss why we don't mask 100% or 5% of tokens
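
As a warm-up for step 2, here is a minimal sketch of random masking over whitespace-separated words rather than BERT subword tokens. This is a deliberate simplification: the name `mask_words`, the `seed` argument, and the example sentence are illustrative assumptions and are not part of the exercise template. Your own `mask_tokens` should work on the tokenizer's output instead.

```python
import random

def mask_words(text, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Replace each whitespace-separated word with mask_token with probability mask_prob."""
    rng = random.Random(seed)  # local RNG so results are reproducible for a given seed
    words = text.split()
    masked = [mask_token if rng.random() < mask_prob else word for word in words]
    return " ".join(masked), text

# Illustrative sentence; swap in your own dataset
masked, original = mask_words(
    "Elephants roam the savanna in large family herds", mask_prob=0.15, seed=0
)
print(f"Original: {original}")
print(f"Masked:   {masked}")
```

Because each word is masked independently, a short sentence may end up with no masked words at all, which is worth keeping in mind when you visualize the results.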

In [None]:
# TODO: Exercise 1 - Your code here

# Step 1: Create your dataset
sentences = [
    # Add 5 sentences about African wildlife
]

# Step 2: Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Step 3: Write a function to mask tokens
def mask_tokens(text, mask_prob=0.15):
    """
    Mask random tokens in the text.
    
    Args:
        text: Input text string
        mask_prob: Probability of masking each token
    
    Returns:
        masked_text: Text with [MASK] tokens
        original_text: Original text
    """
    # YOUR CODE HERE
    pass

# Step 4: Test your function
for sent in sentences:
    masked, original = mask_tokens(sent)
    print(f"Original: {original}")
    print(f"Masked:   {masked}")
    print("-" * 80)

### Questions for Exercise 1

Answer these questions in the markdown cell below:

1. What happens if you mask 100% of tokens? Why would this be problematic?
2. What happens if you mask only 5% of tokens?
3. Why is 15% a good balance?
4. In BERT, why do we sometimes replace masked tokens with random tokens instead of [MASK]?
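
For question 4, it may help to see the corruption rule described in the BERT paper: of the positions selected for prediction, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The sketch below applies that rule to a plain list of tokens. The function name, the toy vocabulary argument, and the seed are assumptions made for illustration; it is not the Hugging Face implementation.

```python
import random

def bert_corrupt(tokens, vocab, select_prob=0.15, mask_token="[MASK]", seed=None):
    """Apply an 80/10/10 corruption rule to a list of tokens.

    Returns (corrupted_tokens, targets); targets[i] is the original token at
    selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() < select_prob:
            targets.append(token)                    # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            targets.append(None)                     # position not used in the loss
            corrupted.append(token)
    return corrupted, targets

tokens = "the lion sleeps under the acacia tree".split()
print(bert_corrupt(tokens, vocab=tokens, select_prob=0.5, seed=1))
```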

**Your Answers:**
1. 
2. 
3. 
4. 

---
# Exercise 2: Encoder vs Decoder - Practical Comparison

## Objective
Understand when to use encoders vs decoders through practical examples.

## Task
For each task below, decide whether an encoder or decoder is more appropriate and explain why:

1. Classifying news articles into categories
2. Generating creative stories
3. Extracting named entities from text
4. Completing code snippets
5. Determining if two sentences are paraphrases
6. Writing product descriptions
7. Question answering (extractive)
8. Chatbot responses
9. Text summarization
10. Sentiment analysis

### Your Answers for Exercise 2

Fill in the table below:

| Task | Model Type (Encoder/Decoder) | Reason |
|------|------------------------------|--------|
| 1. News classification | | |
| 2. Story generation | | |
| 3. Named entity recognition | | |
| 4. Code completion | | |
| 5. Paraphrase detection | | |
| 6. Product descriptions | | |
| 7. Question answering | | |
| 8. Chatbot | | |
| 9. Summarization | | |
| 10. Sentiment analysis | | |
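
One concrete way to reason about the table: encoders such as BERT let every token attend to every position, while decoders such as GPT-2 restrict each token to itself and earlier positions (a causal mask). The sketch below builds both attention masks as nested lists, where 1 means "may attend". The helper names are illustrative assumptions.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position can attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i can attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

print("Bidirectional:")
for row in bidirectional_mask(4):
    print(row)

print("Causal:")
for row in causal_mask(4):
    print(row)
```

Asking whether a task needs the full context on both sides or must produce text left to right is often a useful first test when filling in the "Reason" column.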

---
# Exercise 3: Fine-tuning for Named Entity Recognition (NER)

## Objective
Fine-tune a model for token classification (NER) on the MasakhaNER dataset.

## Background
MasakhaNER is a dataset for named entity recognition in African languages. It identifies:
- PER (Person names)
- ORG (Organizations)
- LOC (Locations)
- DATE (Dates)

## Task
1. Load MasakhaNER for Hausa or Yoruba
2. Fine-tune AfroXLMR for NER
3. Evaluate on the test set
4. Test on custom examples
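
The label-alignment step is usually the trickiest part of this exercise, so here is a minimal sketch of the idea. Subword tokenizers split one word into several tokens, but the dataset has one label per word. A common convention is to give the word's label to its first subword and assign -100 (the index ignored by PyTorch's cross-entropy loss) to special tokens and remaining subwords. The function name and the example label ids are assumptions for illustration; `word_ids` is written by hand here in the shape that a fast tokenizer's `encoding.word_ids()` returns.

```python
def align_labels(word_ids, word_labels, label_all_subwords=False, ignore_index=-100):
    """Map word-level labels onto subword tokens."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)       # special tokens such as [CLS], [SEP], padding
        elif wid != previous:
            aligned.append(word_labels[wid])   # first subword carries the word's label
        else:
            # later subwords of the same word: ignore them, or repeat the label
            aligned.append(word_labels[wid] if label_all_subwords else ignore_index)
        previous = wid
    return aligned

# Three words split into 2, 1, and 3 subwords, wrapped in two special tokens
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [1, 0, 5]  # hypothetical label ids, one per word
print(align_labels(word_ids, word_labels))
# [-100, 1, -100, 0, 5, -100, -100, -100]
```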

In [None]:
# TODO: Exercise 3 - Your code here

# Choose your language
NER_LANGUAGE = "hau"  # Options: hau (Hausa), yor (Yoruba), swa (Swahili), etc.

# Step 1: Load the dataset
print(f"Loading MasakhaNER dataset for {NER_LANGUAGE}...")
# HINT: Use load_dataset("masakhaner", NER_LANGUAGE)

# Step 2: Explore the dataset
# - Print dataset structure
# - Show example with tokens and NER tags
# - Count label distribution

# Step 3: Load model and tokenizer
# HINT: Use AutoModelForTokenClassification
# HINT: You need to specify num_labels based on the dataset

# Step 4: Tokenize the dataset
# HINT: For NER, you need to align labels with tokens

# Step 5: Set up training
# - Define training arguments
# - Create data collator (DataCollatorForTokenClassification)
# - Define compute_metrics function

# Step 6: Train the model

# Step 7: Evaluate on test set

# Step 8: Test on custom examples
test_sentences = [
    # Add test sentences in your chosen language
]

# YOUR CODE HERE

### Questions for Exercise 3

1. How does NER differ from sentiment analysis in terms of the prediction task?
2. Why is token alignment important in NER?
3. Which entity type was hardest for your model to identify? Why?
4. How could you improve NER performance?
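
To answer question 3 you need a score per entity type, not just overall accuracy. The sketch below computes entity-level F1 per type from BIO tag sequences for a single sentence, counting a prediction as correct only when type and boundaries both match. It is a simplified stand-in for a library such as `seqeval`: the function names are assumptions, and stray `I-` tags without a preceding `B-` are simply ignored.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from a list of BIO tags."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((etype, start, i))      # close the span that was open
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]         # open a new span
    return spans

def per_type_f1(true_tags, pred_tags):
    """Exact-match F1 for each entity type found in either sequence."""
    gold, pred = extract_spans(true_tags), extract_spans(pred_tags)
    scores = {}
    for etype in sorted({span[0] for span in gold | pred}):
        g = {s for s in gold if s[0] == etype}
        p = {s for s in pred if s[0] == etype}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[etype] = round(f1, 3)
    return scores

true_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]
print(per_type_f1(true_tags, pred_tags))
# {'LOC': 0.0, 'PER': 1.0}
```

For a full test set you would accumulate the true-positive, predicted, and gold counts over all sentences before computing F1.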

**Your Answers:**
1. 
2. 
3. 
4. 

---
# Exercise 4: Comparing Pre-training Objectives

## Objective
Experiment with different masking probabilities in MLM.

## Task
1. Pre-train three small BERT models with different mask probabilities:
   - Model A: 5% masking
   - Model B: 15% masking (standard)
   - Model C: 30% masking
2. Use the same toy dataset and the same number of training steps for all three models
3. Compare their performance on a fill-mask task
4. Plot training loss curves
5. Analyze which performs best and why
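
Before training, it can be useful to quantify what each masking probability means per sequence: how many positions provide a training signal and how much context is left to predict them from. The sketch below is simple arithmetic; the sequence length of 128 is an assumed value, and `masking_budget` is an illustrative name.

```python
def masking_budget(seq_len, mask_prob):
    """Expected number of prediction targets and of untouched context tokens."""
    targets = seq_len * mask_prob
    return {"expected_targets": targets, "expected_context": seq_len - targets}

SEQ_LEN = 128  # assumption: replace with the max_length you actually use
for mask_prob in (0.05, 0.15, 0.30):
    budget = masking_budget(SEQ_LEN, mask_prob)
    print(
        f"mask_prob={mask_prob:.2f}: "
        f"~{budget['expected_targets']:.1f} targets, "
        f"~{budget['expected_context']:.1f} context tokens"
    )
```

Keep in mind that the loss is averaged over masked positions only, so raw loss values from models trained at different masking rates are not directly comparable; the fill-mask test in step 3 is the fairer comparison.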

In [None]:
# TODO: Exercise 4 - Your code here

# Step 1: Create a toy corpus (can reuse from Lab 1 or create new one)
toy_corpus = [
    # Add sentences
]

# Step 2: Train tokenizer

# Step 3: Function to train model with specific mask probability
def train_model_with_mask_prob(corpus, mask_prob, num_epochs=5):
    """
    Train a BERT model with specified masking probability.
    
    Args:
        corpus: List of text strings
        mask_prob: Masking probability (0.0 to 1.0)
        num_epochs: Number of training epochs
    
    Returns:
        model: Trained model
        training_loss: List of training losses
    """
    # YOUR CODE HERE
    pass

# Step 4: Train three models
print("Training Model A (5% masking)...")
# model_a, loss_a = train_model_with_mask_prob(toy_corpus, 0.05)

print("Training Model B (15% masking)...")
# model_b, loss_b = train_model_with_mask_prob(toy_corpus, 0.15)

print("Training Model C (30% masking)...")
# model_c, loss_c = train_model_with_mask_prob(toy_corpus, 0.30)

# Step 5: Compare performance
# - Test on fill-mask tasks
# - Plot loss curves
# - Analyze results

# YOUR CODE HERE

### Analysis for Exercise 4

Fill in your observations:

1. **Training Loss Comparison:**
   - Which model had the lowest final training loss?
   - How did masking probability affect convergence speed?

2. **Fill-mask Performance:**
   - Which model made the best predictions?
   - Did higher masking probability help or hurt?

3. **Why 15% is Standard:**
   - Based on your results, why might 15% be the sweet spot?

**Your Analysis:**
1. 
2. 
3. 

---
# Exercise 5: Multi-language Fine-tuning Comparison

## Objective
Compare AfroXLMR performance across different African languages.

## Task
1. Fine-tune AfroXLMR on AfriSenti for THREE different languages
2. Keep all hyperparameters identical
3. Compare test set performance
4. Analyze why some languages perform better than others
5. Create visualization comparing results
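
When comparing languages, label distributions may differ between datasets, so accuracy alone can be misleading. Macro-averaged F1 weights every class equally. The sketch below computes it in plain Python so the definition is visible; in ordinary cases it should agree with scikit-learn's `f1_score(..., average="macro")`, which you would normally use inside `compute_metrics`. The function name is an illustrative assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(f1_scores) / len(f1_scores)

# Toy labels for illustration only
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
```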

In [None]:
# TODO: Exercise 5 - Your code here

# Languages to compare
LANGUAGES = ["ha", "ig", "yo"]  # Hausa, Igbo, Yoruba

# Step 1: Function to fine-tune and evaluate for a language
def finetune_and_evaluate(language):
    """
    Fine-tune AfroXLMR for sentiment analysis in a specific language.
    
    Args:
        language: Language code
    
    Returns:
        results: Dictionary with test metrics
    """
    # YOUR CODE HERE
    pass

# Step 2: Fine-tune for each language
results = {}
for lang in LANGUAGES:
    print(f"\nFine-tuning for {lang}...")
    results[lang] = finetune_and_evaluate(lang)

# Step 3: Create comparison visualization
# - Bar chart comparing accuracy, F1, precision, recall
# - Table showing all metrics

# Step 4: Analysis
# - Which language performed best?
# - Why might there be differences?
# - What factors affect cross-lingual performance?

# YOUR CODE HERE

### Analysis for Exercise 5

1. **Performance Ranking:**
   - Rank the languages by F1-score
   - What was the performance difference between best and worst?

2. **Possible Explanations:**
   - Dataset size differences?
   - Language similarity to pre-training data?
   - Inherent language complexity?
   - Data quality?

3. **Improvement Strategies:**
   - How would you improve performance for the lowest-scoring language?

**Your Analysis:**
1. 
2. 
3. 

---
# Exercise 6: Hyperparameter Tuning

## Objective
Experiment with different hyperparameters and understand their effects.

## Task
1. Choose one language from AfriSenti
2. Try THREE different learning rates: 1e-5, 2e-5, 5e-5
3. Try THREE different batch sizes: 8, 16, 32
4. Keep epochs constant at 3
5. Document which combination works best
6. Explain why you think that combination performed best
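
One thing to keep in mind when interpreting the grid: with epochs fixed at 3, a larger batch size means fewer optimizer updates, so learning rate and batch size interact. The sketch below counts updates per configuration, assuming a single device and no gradient accumulation. `N_TRAIN` is a placeholder value and `optimizer_steps` is an illustrative name.

```python
import math
from itertools import product

def optimizer_steps(n_train, batch_size, epochs=3):
    """Number of parameter updates: batches per epoch (rounded up) times epochs."""
    return math.ceil(n_train / batch_size) * epochs

N_TRAIN = 10_000  # placeholder: replace with len(train_dataset)
for lr, batch_size in product([1e-5, 2e-5, 5e-5], [8, 16, 32]):
    steps = optimizer_steps(N_TRAIN, batch_size)
    print(f"lr={lr:g}  batch_size={batch_size:>2}  updates={steps}")
```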

In [None]:
# TODO: Exercise 6 - Your code here

# Hyperparameters to test
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
BATCH_SIZES = [8, 16, 32]
LANGUAGE = "ha"  # Choose your language

# Step 1: Load dataset once
# YOUR CODE HERE

# Step 2: Grid search function
def run_experiment(lr, batch_size):
    """
    Train model with specific hyperparameters.
    
    Args:
        lr: Learning rate
        batch_size: Batch size
    
    Returns:
        results: Dictionary with test metrics
    """
    # YOUR CODE HERE
    pass

# Step 3: Run experiments
experiment_results = []
for lr in LEARNING_RATES:
    for batch_size in BATCH_SIZES:
        print(f"\nTesting LR={lr}, Batch Size={batch_size}")
        result = run_experiment(lr, batch_size)
        experiment_results.append({
            'learning_rate': lr,
            'batch_size': batch_size,
            **result
        })

# Step 4: Visualize results
# - Heatmap showing F1-scores for different combinations
# - Table with all results

# Step 5: Identify best configuration

# YOUR CODE HERE

### Analysis for Exercise 6

1. **Best Configuration:**
   - What learning rate worked best?
   - What batch size worked best?
   - What was the final F1-score?

2. **Observations:**
   - How did increasing learning rate affect results?
   - How did increasing batch size affect results?
   - Were there any surprising findings?

3. **Trade-offs:**
   - Larger batch size: faster training but...
   - Higher learning rate: faster convergence but...

**Your Analysis:**
1. 
2. 
3. 

---
# Exercise 7: Error Analysis and Improvement

## Objective
Analyze model errors and develop strategies for improvement.

## Task
1. Fine-tune AfroXLMR on one language
2. Find 10 examples the model got wrong
3. Categorize the types of errors
4. Propose strategies to fix each error type
5. Implement one improvement strategy and measure impact
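
A quick way to start categorizing errors (step 3) is to count which true/predicted label pairs occur most often among the mistakes. The sketch below does this with `collections.Counter`. The label strings and example lists are made up for illustration, and `confusion_pairs` is an assumed helper name.

```python
from collections import Counter

def confusion_pairs(true_labels, predictions):
    """Count (true, predicted) pairs for misclassified examples only."""
    return Counter((t, p) for t, p in zip(true_labels, predictions) if t != p)

# Toy labels for illustration only
true_labels = ["neutral", "negative", "positive", "neutral"]
predictions = ["negative", "negative", "neutral", "negative"]

for (true, pred), count in confusion_pairs(true_labels, predictions).most_common():
    print(f"true={true:<8} predicted={pred:<8} count={count}")
```

The most frequent pairs tell you where to look first; reading the underlying texts is still needed to decide whether the cause is sarcasm, ambiguity, or something language-specific.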

In [None]:
# TODO: Exercise 7 - Your code here

# Step 1: Train model and get predictions
LANGUAGE = "ha"  # Choose language

# YOUR CODE HERE

# Step 2: Find misclassified examples
def find_errors(predictions, true_labels, texts, n=10):
    """
    Find misclassified examples.
    
    Args:
        predictions: Model predictions
        true_labels: Ground truth labels
        texts: Original texts
        n: Number of errors to find
    
    Returns:
        errors: List of error examples
    """
    # YOUR CODE HERE
    pass

# Step 3: Categorize errors
# Examples of error categories:
# - Sarcasm/irony not detected
# - Ambiguous sentiment
# - Wrong neutral vs negative
# - Wrong neutral vs positive
# - Language-specific expressions

# Step 4: Analyze each error
print("Error Analysis:")
print("="*80)
# For each error:
# - Print the text
# - Show true label vs predicted label
# - Categorize the error
# - Explain why the model might have failed

# YOUR CODE HERE

# Step 5: Improvement strategy
# Choose one strategy:
# - Data augmentation
# - Class weights for imbalanced data
# - Longer training
# - Different model architecture

# Implement and measure impact

# YOUR CODE HERE

### Error Analysis Report

Fill in your findings:

**Error Categories Found:**
1. Category 1: [Description] - [Count] examples
2. Category 2: [Description] - [Count] examples
3. Category 3: [Description] - [Count] examples

**Most Common Error Type:**
- Description:
- Why it happens:
- How to fix:

**Improvement Strategy Tested:**
- Strategy:
- Implementation:
- Results:
- Before F1:
- After F1:
- Improvement:

**Your Report:**
...
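
If you choose class weights as your improvement strategy, one common heuristic is inverse-frequency weighting, `n_samples / (n_classes * count)`, which is the same formula scikit-learn uses for its "balanced" option. The sketch below computes these weights from a list of training labels. The function name is an assumption, and you would still need to pass the weights to your loss yourself, for example via the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Toy label distribution for illustration only
train_labels = [0] * 6 + [1] * 3 + [2] * 1
for label, weight in balanced_class_weights(train_labels).items():
    print(f"class {label}: weight {weight:.3f}")
```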

---
# Challenge Exercise 8: Build a Sentiment Analysis Pipeline

## Objective
Create a production-ready sentiment analysis pipeline.

## Task
Build a complete pipeline that:
1. Accepts text input in any of 3 African languages
2. Detects the language automatically
3. Uses the appropriate fine-tuned model
4. Returns sentiment with confidence score
5. Handles edge cases (empty text, very long text, mixed languages)
6. Provides explanations for predictions

## Bonus
- Add batch processing capability
- Create a simple web interface (optional)
- Save predictions to a file
- Add visualization of confidence scores
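
Two building blocks are likely to be useful here: turning raw model logits into a label with a confidence score, and rejecting or trimming problematic input. The sketch below shows both in plain Python. Several things are assumptions: the helper names, the label order (it must match your model's `id2label`), and the character limit, which is only a crude guard; real length handling should rely on the tokenizer's truncation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, labels=("negative", "neutral", "positive")):
    """Pick the highest-probability label and report its probability as confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"sentiment": labels[best], "confidence": probs[best]}

def clean_input(text, max_chars=2000):
    """Reject empty input and trim very long input before tokenization."""
    if text is None or not text.strip():
        raise ValueError("Input text is empty")
    return text.strip()[:max_chars]

result = to_prediction([0.2, -1.0, 2.5])  # made-up logits
print(f"{result['sentiment']} (Confidence: {result['confidence']:.2%})")
```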

In [None]:
# TODO: Challenge Exercise 8 - Your code here

class MultilingualSentimentPipeline:
    """Production-ready multilingual sentiment analysis pipeline."""
    
    def __init__(self, languages, model_paths):
        """
        Initialize the pipeline.
        
        Args:
            languages: List of supported language codes
            model_paths: Dictionary mapping language codes to model paths
        """
        # YOUR CODE HERE
        pass
    
    def detect_language(self, text):
        """
        Detect the language of input text.
        
        Args:
            text: Input text
        
        Returns:
            language: Detected language code
        """
        # YOUR CODE HERE
        pass
    
    def predict(self, text, language=None):
        """
        Predict sentiment for input text.
        
        Args:
            text: Input text
            language: Language code (optional, will auto-detect if None)
        
        Returns:
            result: Dictionary with prediction, confidence, and explanation
        """
        # YOUR CODE HERE
        pass
    
    def batch_predict(self, texts, languages=None):
        """
        Predict sentiments for multiple texts.
        
        Args:
            texts: List of input texts
            languages: List of language codes (optional)
        
        Returns:
            results: List of prediction dictionaries
        """
        # YOUR CODE HERE
        pass
    
    def explain_prediction(self, text, prediction):
        """
        Provide explanation for prediction.
        
        Args:
            text: Input text
            prediction: Prediction dictionary
        
        Returns:
            explanation: String explaining the prediction
        """
        # YOUR CODE HERE
        pass

# Example usage
# Named sentiment_pipeline so it does not shadow transformers.pipeline imported in Setup
sentiment_pipeline = MultilingualSentimentPipeline(
    languages=["ha", "ig", "yo"],
    model_paths={
        "ha": "/path/to/hausa/model",
        "ig": "/path/to/igbo/model",
        "yo": "/path/to/yoruba/model"
    }
)

# Test the pipeline
test_texts = [
    # Add test texts in different languages
]

for text in test_texts:
    result = sentiment_pipeline.predict(text)
    print(f"Text: {text}")
    print(f"Language: {result['language']}")
    print(f"Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.2%})")
    print(f"Explanation: {result['explanation']}")
    print("-" * 80)

---
# Summary and Reflection

## What You've Learned

Through these exercises, you should now understand:

1. **Pre-training Strategies**
   - How MLM works and why masking probability matters
   - Differences between encoder and decoder pre-training
   - When to use each architecture

2. **Fine-tuning Process**
   - How to adapt pre-trained models to specific tasks
   - Importance of hyperparameter tuning
   - Cross-lingual transfer learning

3. **Model Evaluation**
   - Multiple metrics for comprehensive evaluation
   - Error analysis techniques
   - Strategies for improvement

4. **Practical Skills**
   - Working with Hugging Face transformers
   - Training and evaluation workflows
   - Building production pipelines

## Reflection Questions

Answer these questions to solidify your learning:

1. **Conceptual Understanding:**
   - In your own words, explain transfer learning and why it's important
   - What's the key difference between pre-training and fine-tuning?

2. **Practical Application:**
   - For your final project, would you use an encoder or decoder model? Why?
   - What hyperparameters would you tune first? Why?

3. **Real-world Scenarios:**
   - How would you deploy a sentiment analysis model for production use?
   - What challenges might you face with low-resource African languages?

**Your Reflections:**
1. 
2. 
3. 

---
# Additional Resources

## Datasets for Practice
- **AfriSenti**: Sentiment analysis in 14 African languages
- **MasakhaNER**: Named entity recognition in 10 African languages (MasakhaNER 2.0 extends coverage to 20)
- **MasakhaPOS**: Part-of-speech tagging in 20 African languages
- **MAFAND-MT**: Machine translation in the news domain (for news topic classification, see MasakhaNEWS)

## Pre-trained Models
- **AfroXLMR**: Multilingual encoder for African languages
- **AfriBERTa**: BERT-based model for African languages
- **mBERT**: Multilingual BERT (includes some African languages)
- **XLM-RoBERTa**: Cross-lingual model covering 100+ languages

## Further Reading
- Hugging Face Transformers documentation
- BERT paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- GPT paper: "Improving Language Understanding by Generative Pre-Training"
- "The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)" blog post by Jay Alammar

## Community
- Masakhane NLP community: https://www.masakhane.io/
- Hugging Face forums: https://discuss.huggingface.co/
- AfricaNLP workshop at major conferences

---
# Next Steps

Ready to go further? Try these advanced challenges:

1. **Multi-task Learning**
   - Train one model for both sentiment analysis and NER
   - Compare to task-specific models

2. **Few-shot Learning**
   - Fine-tune with only 10 examples per class
   - Use data augmentation techniques

3. **Model Compression**
   - Use knowledge distillation to create smaller models
   - Compare size vs performance trade-offs

4. **Zero-shot Cross-lingual Transfer**
   - Train on one language, test on another
   - Analyze which language pairs transfer best

5. **Custom Pre-training**
   - Collect domain-specific data (e.g., medical texts)
   - Continue pre-training AfroXLMR on this data
   - Fine-tune for domain-specific tasks
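
For the few-shot challenge (item 2), the first step is building a small balanced training subset. The sketch below samples up to `k` examples per class from a list of `(text, label)` pairs. The function name, the pair format, and the fixed seed are assumptions for illustration; adapt it to however your dataset stores examples.

```python
import random
from collections import defaultdict

def sample_k_per_class(examples, k=10, seed=42):
    """Return up to k randomly chosen (text, label) pairs for each label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset

# Synthetic placeholder data: 30 examples spread over 3 labels
examples = [(f"text {i}", i % 3) for i in range(30)]
few_shot = sample_k_per_class(examples, k=4)
print(len(few_shot), "examples selected")
```

Repeating the experiment with several seeds is advisable, since results with so few examples tend to vary considerably from one sample to the next.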

---

You've completed the practice exercises for pre-training strategies and fine-tuning. You now have hands-on experience with:

- ✅ Masked Language Modeling (MLM)
- ✅ Encoder vs decoder model selection (MLM vs CLM)
- ✅ Fine-tuning for multiple tasks
- ✅ Hyperparameter optimization
- ✅ Error analysis and improvement
- ✅ Multi-lingual NLP for African languages

Keep practicing and exploring! The field of NLP for African languages is growing rapidly, and your skills can make a real impact.