### 🧪 Day 2: Fine-Tuning Language Models (Phi-1.5)

Welcome to the second session of the Generative AI workshop!

Today we'll dive into how to **fine-tune a pretrained language model** using your own dataset. We'll walk through the training pipeline using Microsoft's **Phi-1.5** model and Hugging Face tools.

🎯 **Objectives**
- Understand the difference between pretraining and fine-tuning
- Prepare a dataset for supervised fine-tuning
- Set up a training configuration using Hugging Face’s `transformers` and `trl`
- Train and evaluate a fine-tuned LLM
- Experiment with custom prompts to test your results


In [None]:
# Install the necessary libraries
!pip install torch
!pip install -q -U accelerate peft bitsandbytes transformers trl einops
!pip install fsspec==2025.3.2


In [None]:
# ⚙️ Configure environment to avoid memory fragmentation and disable Weights & Biases
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


In [None]:
# 📦 Import libraries
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig
)
from peft import (
    LoraConfig,
    prepare_model_for_kbit_training
)
from trl import SFTTrainer


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
# 🔗 Load Phi-1.5 Tokenizer
# TODO: Set the base model name for Phi-1.5
base_model = "____"  # Hint: What's the model name for microsoft/phi-1_5?

# TODO: Load the tokenizer from the pretrained model
tokenizer = AutoTokenizer.____(____, trust_remote_code=____)  # Hint: What method loads pretrained tokenizers and should we trust remote code?

In [None]:
# ⚙️ Quantization Configuration for 4-bit Loading
# TODO: Create a 4-bit quantization configuration to save memory
bnb_config = BitsAndBytesConfig(
   load_in_4bit=____,  # Hint: Do we want to load the model in 4-bit precision?
   bnb_4bit_quant_type="____",  # Hint: What quantization type? (nf4 or fp4)
   bnb_4bit_compute_dtype=torch.____,  # Hint: What data type for computation? (float16 or bfloat16)
   bnb_4bit_use_double_quant=____,  # Hint: Should we use double quantization for better compression?
)
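
To see why 4-bit loading is worth the extra configuration, a back-of-the-envelope estimate of weight memory at different precisions can help. The sketch below assumes roughly 1.3 billion parameters for Phi-1.5 and ignores quantization constants, activations, and optimizer state, so treat the numbers as rough lower bounds. `weight_memory_gib` is a hypothetical helper written for this illustration, not part of any library.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Assumption: Phi-1.5 has roughly 1.3 billion parameters.
PHI_15_PARAMS = 1_300_000_000

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(PHI_15_PARAMS, bits):.2f} GiB")
```

The 4-bit row should come out at about a quarter of the float16 row, which is the headroom that lets the model, LoRA adapters, and activations share a small GPU.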

In [None]:
# 🧠 Load Model with Quantization and Offload as Needed
# TODO: Load the model with our quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    ____,  # Hint: What variable contains our model name?
    quantization_config=____,  # Hint: What variable contains our quantization config?
    trust_remote_code=____,  # Hint: Should we trust remote code for this model?
    low_cpu_mem_usage=____,  # Hint: Should we optimize for low CPU memory usage?
    device_map={"": ____}  # Hint: What device number should we use? (0 for first GPU)
)

In [None]:
# 🔧 Prep Model for LoRA + Training Efficiency
# TODO: Disable caching to save memory during training
model.config.use_cache = ____  # Hint: Should we use cache during training? (True/False)

# TODO: Prepare the model for efficient training with gradient checkpointing
model = prepare_model_for_kbit_training(____, use_gradient_checkpointing=____)  # Hint: What model to prepare and should we use gradient checkpointing?

In [None]:
from datasets import load_dataset

# 👀 Preview the Raw Structure (before mapping)
# Load a small sample of SQuAD v2 dataset for training
# Note: We're using SQuAD v2 for this exercise, but feel free to experiment with your own dataset!

# TODO: Load the SQuAD v2 dataset with a small sample
dataset = load_dataset("____", split="____")  # Hint: Use "squad_v2" and "train[:1%]" for a 1% sample

# TODO: Display the dataset structure to understand what we're working with
print("📊 Dataset Structure:")
print(____)  # Hint: What variable contains our dataset?

print("\n🔍 First Example:")
print(____[0])  # Hint: How do we access the first example in the dataset?

print("\n📋 Available Keys:")
print("Features:", ____[0].keys())  # Hint: How do we see the keys of the first example?

# 💡 UNDERSTANDING TIP:
# Look at the structure! You'll see 'question', 'answers', 'context', etc.
# We need to transform this into a simple "Question: ... Answer: ..." format for training

In [None]:
# 🔄 Format the Data to Plain QA Format
# Now that we understand the structure, let's transform it for training

def format_qa(example):
    # TODO: Extract the question from the example
    question = example["____"].strip()  # Hint: What key contains the question text?

    # TODO: Get the answer text, with fallback for unanswerable questions
    # SQuAD v2 structure: answers['text'] is a list, we want the first answer
    answer = example["____"]["____"][0] if example["____"]["____"] else "____"  # Hint: Navigate through answers->text array, fallback to "I don't know."

    # TODO: Create formatted text for training in Question/Answer format
    return {"text": f"Question: {____}\nAnswer: {____}"}  # Hint: Use the question and answer variables you created above

# TODO: Apply the formatting function to transform our dataset
dataset = dataset.map(____)  # Hint: What function should we apply to each example in the dataset?

print("✅ Dataset formatted successfully!")
print("\n🔍 Example of formatted data:")
print(____[0]["____"])  # Hint: What dataset and what key contains our formatted text?

# 💡 ADVANCED TIP: Want to use your own dataset?
# Replace "squad_v2" with your dataset name or create a custom dataset with this format:
# {"text": "Question: Your question here\nAnswer: Your answer here"}
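
If you want to check your `format_qa` against a reference, the sketch below applies the same idea to two hand-written records, with no download needed. The records are hypothetical and only mimic the SQuAD v2 layout (`answers["text"]` is a list that is empty for unanswerable questions). `to_qa_text` is an illustrative name chosen here.

```python
def to_qa_text(example: dict) -> dict:
    """Turn one SQuAD-v2-shaped record into a single training string."""
    question = example["question"].strip()
    answer_texts = example["answers"]["text"]
    # Unanswerable questions have an empty list, so fall back to a fixed reply.
    answer = answer_texts[0] if answer_texts else "I don't know."
    return {"text": f"Question: {question}\nAnswer: {answer}"}


# Hypothetical records that mimic the SQuAD v2 layout.
answerable = {
    "question": "  What colour is the sky on a clear day? ",
    "answers": {"text": ["blue"], "answer_start": [0]},
}
unanswerable = {
    "question": "Who painted the ceiling?",
    "answers": {"text": [], "answer_start": []},
}

print(to_qa_text(answerable)["text"])
print(to_qa_text(unanswerable)["text"])
```

Returning a dict with a single `"text"` key matters because `dataset.map` merges the returned keys into each record, and the trainer later reads that field by name.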

In [None]:
from trl import SFTConfig

# ⚙️ Training Configuration - Fine-tune Your Training Parameters
# This configuration controls how our model will be trained

training_args = SFTConfig(
   # 📁 Output and Logging Configuration
   # TODO: Set the directory where training results will be saved
   output_dir="____",  # Hint: Create a descriptive folder name like "./model-results" to store all training outputs

   # TODO: Set the directory for logging training metrics
   logging_dir="____",  # Hint: TensorBoard reads from a "runs" subfolder of your output directory

   # TODO: Choose what to report training metrics to
   report_to="____",  # Hint: Popular options include "tensorboard", "wandb", or "none" for tracking progress

   # 🏃‍♂️ Training Schedule Configuration
   # TODO: Set how many complete passes through the dataset
   num_train_epochs=____,  # Hint: Range 1-5. More epochs = longer training but potentially better results. Start small for testing!

   # TODO: Set batch size per GPU device (affects memory usage)
   per_device_train_batch_size=____,  # Hint: Range 1-8. Higher = faster training but more memory. Start low if you get memory errors

   # TODO: Set gradient accumulation steps (simulates larger batch size)
   gradient_accumulation_steps=____,  # Hint: Range 2-8. Higher = more stable gradients but slower. Multiply with batch_size for effective batch size

   # 🎯 Learning Rate Configuration
   # TODO: Set the learning rate (how fast the model learns)
   learning_rate=____,  # Hint: Range 1e-5 to 5e-4. Too high = unstable training, too low = slow learning. Scientific notation like 2e-4

   # TODO: Choose learning rate scheduler type
   lr_scheduler_type="____",  # Hint: Options: "cosine" (gradual decrease), "linear" (steady decrease), "constant" (no change)

   # TODO: Set warmup ratio (gradually increase learning rate at start)
   warmup_ratio=____,  # Hint: Range 0.01-0.1. Prevents early training instability. Start around 3% of training

   # 📊 Logging and Saving Configuration
   # TODO: Set how often to log training metrics (in steps)
   logging_steps=____,  # Hint: Range 1-20. Lower = more frequent updates, higher = less storage. Balance monitoring vs performance

   # TODO: Set when to save model checkpoints
   save_strategy="____",  # Hint: Options: "epoch" (after each full pass), "steps" (every N steps), "no" (never save)

   # TODO: Set maximum number of checkpoints to keep
   save_total_limit=____,  # Hint: Range 1-5. More = keep more versions but use more disk space. Consider your storage limits

   # 🚀 Optimization Configuration
   # TODO: Choose the optimizer algorithm
   optim="____",  # Hint: "paged_adamw_32bit" is memory-efficient, "adamw_torch" is standard. Choose based on memory constraints

   # TODO: Set maximum gradient norm for clipping
   max_grad_norm=____,  # Hint: Range 0.1-1.0. Prevents gradient explosion. Lower = more conservative, higher = allows bigger updates

   # TODO: Enable gradient checkpointing to save memory
   gradient_checkpointing=____,  # Hint: True/False. True saves memory but slightly slower. Essential for large models on limited GPU

   # 🔧 Memory and Precision Configuration
   # TODO: Enable 16-bit floating point training
   fp16=____,  # Hint: True/False. True reduces memory and speeds up training on modern GPUs. Disable if you get numerical issues

   # TODO: Enable grouping sequences by length for efficiency
   group_by_length=____,  # Hint: True/False. True improves efficiency by batching similar-length sequences together

   # 🎲 Reproducibility Configuration
   # TODO: Set random seed for reproducible results
   seed=____,  # Hint: Any integer. Use the same number to get identical results across runs. Popular choices: 42, 123, 2024

   # 📝 Text Processing Configuration
   # TODO: Specify which field contains the training text
   dataset_text_field="____",  # Hint: Must match the field name from your formatted dataset. Check what you named it in the mapping step

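   # TODO: Set maximum sequence length for training
   # Note: newer TRL releases call this argument `max_length`; rename it if SFTConfig rejects `max_seq_length`
   max_seq_length=____  # Hint: Range 256-2048. Longer = more context but more memory. Common choices: 512, 1024. Match your data needs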
)

# 💡 CONFIGURATION STRATEGY:
# 1. Start with conservative settings (small epochs, batch size, learning rate)
# 2. Monitor memory usage and training progress
# 3. Gradually increase parameters if needed
# 4. If you get "CUDA out of memory": reduce batch_size, max_seq_length, or enable gradient_checkpointing
# 5. Use TensorBoard to visualize training: `tensorboard --logdir ./your-output-dir/runs`
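
Before committing to a configuration, it can help to work out how the batch size, gradient accumulation, epochs, and warmup ratio combine. The sketch below is a rough planning aid under stated assumptions: the dataset size of 1,000 examples is made up, `training_schedule` is a hypothetical helper, and the trainer's own rounding may differ by a step.

```python
import math


def training_schedule(num_examples, per_device_batch_size,
                      gradient_accumulation_steps, num_epochs,
                      warmup_ratio, num_devices=1):
    """Estimate optimizer steps and warmup length for one training run."""
    effective_batch_size = (
        per_device_batch_size * gradient_accumulation_steps * num_devices
    )
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch_size": effective_batch_size,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Hypothetical settings: 1,000 examples, batch size 2, 4 accumulation steps.
plan = training_schedule(
    num_examples=1000,
    per_device_batch_size=2,
    gradient_accumulation_steps=4,
    num_epochs=1,
    warmup_ratio=0.03,
)
print(plan)
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size the same, which is a common way to recover from out-of-memory errors without changing the optimization behaviour much.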

In [None]:
# 🎯 LoRA Configuration for Low-Rank Adaptation
# LoRA allows efficient fine-tuning by only training small additional parameters

peft_config = LoraConfig(
   # TODO: Set the rank parameter (controls model capacity vs efficiency trade-off)
   r=____,  # Hint: Range 8-128. Higher = more parameters to train (better learning but slower). Start with powers of 2: 16, 32, 64

   # TODO: Set the alpha parameter (controls adaptation strength)
   lora_alpha=____,  # Hint: Range 8-64. Often set equal to or double the rank. Controls how much LoRA affects the original model

   # TODO: Set dropout rate for regularization
   lora_dropout=____,  # Hint: Range 0.0-0.3. Higher = more regularization but may hurt learning. 0.05-0.1 is common for preventing overfitting

   # Static configurations (advanced settings - leave these as-is)
   bias="none",  # No bias training for efficiency
   task_type="CAUSAL_LM",  # Language modeling task
   target_modules=["q_proj", "k_proj", "v_proj", "dense"]  # Which model parts to adapt (Phi names its attention output projection "dense", not "o_proj")
)

# 💡 LoRA PARAMETER TIPS:
# - r (rank): Start with 16-32. Higher if you have complex data or need better performance
# - lora_alpha: Often set to r or 2*r. Higher values make LoRA adaptations more influential
# - lora_dropout: Start around 0.05. Increase if you see overfitting, decrease if underfitting
# - Total trainable parameters ≈ r * (d_in + d_out) for each adapted layer, summed over all adapted layers
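
To get a feel for how the rank `r` drives the number of trainable parameters, the count can be worked out by hand. The sketch below assumes a hidden size of 2048 and 24 transformer blocks for Phi-1.5, and that all four listed attention projections are square and actually matched by `target_modules`. These are assumptions for illustration, so compare the result with what `model.print_trainable_parameters()` reports after training setup.

```python
def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumptions for illustration only.
HIDDEN_SIZE = 2048        # assumed Phi-1.5 hidden size
NUM_BLOCKS = 24           # assumed number of transformer blocks
ADAPTED_PER_BLOCK = 4     # one adapter per entry in target_modules


def total_lora_params(r: int) -> int:
    per_layer = lora_params_per_layer(HIDDEN_SIZE, HIDDEN_SIZE, r)
    return NUM_BLOCKS * ADAPTED_PER_BLOCK * per_layer


for r in (8, 16, 32, 64):
    print(f"r={r:>2}: {total_lora_params(r):,} trainable LoRA parameters")
```

The count grows linearly with `r`, so doubling the rank doubles the adapter size while the frozen base model stays unchanged.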

In [None]:
# 🚀 Setup Trainer with SFT (Supervised Fine-Tuning)
# This combines our model, dataset, and configurations into a training pipeline

# TODO: Create the SFT trainer with all our configurations
trainer = SFTTrainer(
   model=____,  # Hint: What variable contains our loaded and prepared model?
   train_dataset=____,  # Hint: What variable contains our formatted dataset?
   peft_config=____,  # Hint: What variable contains our LoRA configuration?
   args=____  # Hint: What variable contains our training arguments/configuration?
)

# 💡 TRAINER SETUP EXPLANATION:
# - model: The quantized Phi model we prepared earlier
# - train_dataset: Our formatted SQuAD dataset with "Question: ... Answer: ..." format
# - peft_config: LoRA settings that determine which parts of the model to fine-tune
# - args: All the training hyperparameters we configured (learning rate, batch size, etc.)

In [None]:
# 🔥 Start Training
# This will begin the fine-tuning process - monitor the output for progress!

# TODO: Start the training process
____.____()  # Hint: What object do we call the train method on?

# 💡 TRAINING TIPS:
# - Training will show loss decreasing over time (good sign!)
# - Watch for "CUDA out of memory" errors - reduce batch_size if this happens
# - Training time depends on your settings: epochs, dataset size, and hardware
# - You can interrupt training at any time (Stop/Interrupt in a notebook, Ctrl+C in a terminal); the model keeps the weights learned so far, so you can still save them
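
The logged training loss is easier to interpret when converted to perplexity, which is the exponential of the average per-token cross-entropy loss. The loss values below are invented for illustration and are not results from this notebook.

```python
import math


def perplexity(mean_cross_entropy_loss: float) -> float:
    """Convert an average per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_loss)


# Hypothetical loss values, as they might appear in a training log.
for step, loss in [(10, 2.5), (50, 1.8), (100, 1.4)]:
    print(f"step {step:>3}: loss={loss:.2f}  perplexity={perplexity(loss):.2f}")
```

A perplexity of 1.0 would mean the model predicts every token with certainty, so lower is better, but a very low training perplexity on a small dataset can also signal overfitting.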

In [None]:
# 💾 Save the Fine-tuned Model Components
# After training, we need to save our work for future use

# TODO: Save the LoRA adapter weights (lightweight and efficient)
____.model.save_pretrained("____")  # Hint: Use the trainer object to access the model and give it a descriptive name like "phi15-lora-adapter"

# TODO: Save the tokenizer for future inference
____.save_pretrained("____")  # Hint: What tokenizer object should we save and what folder name should match our model?

# 💡 SAVING EXPLANATION:
# - LoRA adapter: Only saves the small additional weights we trained (few MB vs GB)
# - Tokenizer: Essential for processing text inputs during inference
# - These files can be loaded later with the base model for predictions
# - Much more storage-efficient than saving the entire model

In [None]:
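# 📈 Visualize training metrics in TensorBoard
# Point --logdir at the "runs" folder inside the output_dir you set in SFTConfig (phi15-results is an example name)
%load_ext tensorboard
%tensorboard --logdir phi15-results/runs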

In [None]:
# 🧪 Test the Fine-tuned Model
# Let's see how our fine-tuned model performs on a new question

# TODO: Create a test prompt in the same format as training
prompt = "____"  # Hint: Use the same "Question: ... \nAnswer: " format we trained on

# TODO: Tokenize the prompt for the model
inputs = tokenizer(____, return_tensors="____").to(model.device)  # Hint: What prompt variable and tensor format?

# Set model to evaluation mode and disable cache to avoid warnings
model.eval()
model.config.use_cache = False

# TODO: Generate a response with controlled parameters
with torch.no_grad():  # Saves memory during inference
    outputs = model.generate(
        ____,  # Hint: What input variable contains our tokenized prompt?
        max_new_tokens=____,  # Hint: Range 20-100. How many new tokens should the model generate?
        temperature=____,  # Hint: Range 0.1-1.0. Lower = more focused, higher = more creative
        do_sample=____,  # Hint: True/False. True for creative responses, False for deterministic
        pad_token_id=tokenizer.eos_token_id,
        use_cache=False  # Prevents cache warnings
    )

# TODO: Decode and print the generated response
response = tokenizer.decode(____[0], skip_special_tokens=____)  # Hint: What outputs to decode and should we skip special tokens?
print("🤖 Model Response:")
print(____)  # Hint: What variable contains our decoded response?

# 💡 TESTING TIPS:
# - Try different questions to see how well your model learned
# - Compare responses to the base model (before fine-tuning)
# - Lower temperature = more consistent answers (temperature only has an effect when do_sample=True)
# - Higher temperature = more creative but potentially less accurate answers
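
The effect of temperature can be seen without a model: sampling divides the logits by the temperature before applying softmax. The sketch below uses three made-up logits for three candidate tokens. `softmax_with_temperature` is a hypothetical helper that mirrors the idea, not the exact code path inside `generate`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize them into probabilities."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At low temperature almost all probability mass lands on the top token, which is why answers become more repeatable. At high temperature the distribution flattens and less likely tokens get sampled more often.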

## 🔍 Let's Compare: Base vs Fine-Tuned Phi-1.5

Now that we've trained our model, let's see the difference in how it responds to a prompt compared to the original (pretrained) version.

We’ll use the same input for both models and observe their generated responses.


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 🔍 Let's Compare: Base vs Fine-Tuned Phi-1.5
# We'll test both models with the same prompt to see the difference

# TODO: Load the tokenizer for comparison testing
tokenizer = AutoTokenizer.from_pretrained("____", trust_remote_code=____)  # Hint: What's the Phi-1.5 model name and should we trust remote code?

# TODO: Define a test prompt (notice we removed "Answer:" to let models complete naturally)
prompt = '''
   Question: ____
'''  # Hint: Create an interesting question to test both models - try something practical or educational

# TODO: Tokenize the prompt for both models
inputs = tokenizer(____, return_tensors="____").to("____")  # Hint: What prompt variable, tensor format, and device?

# 📦 TEST BASE MODEL
print("🔄 Loading base model...")
# TODO: Load the original Phi-1.5 model
base_model = AutoModelForCausalLM.from_pretrained(
   "____",  # Hint: Same model name as tokenizer
   trust_remote_code=____,  # Hint: Should we trust remote code?
   device_map="____",  # Hint: What device should we use?
   low_cpu_mem_usage=____  # Hint: Should we optimize CPU memory usage?
)
base_model.eval()

# TODO: Generate response from base model with proper EOS handling
with torch.no_grad():
   base_output = base_model.generate(
       **____,
       max_new_tokens=____,                     # Hint: How many new tokens? (30-100)
       eos_token_id=tokenizer.____,             # Hint: What token ID signals end of sequence?
       pad_token_id=tokenizer.____,             # Hint: What token for padding consistency?
       temperature=____,                        # Hint: Range 0.1-1.0 for creativity balance
       do_sample=____                          # Hint: True/False for natural responses
   )

print("📦 Base model output:")
print(tokenizer.decode(____[0], skip_special_tokens=____))  # Hint: What output to decode and skip special tokens?

# Clear memory before loading fine-tuned model
del base_model
torch.cuda.empty_cache()

# 🧠 TEST FINE-TUNED MODEL
print("\n🔄 Loading fine-tuned model...")
# TODO: Load your fine-tuned model
finetuned_model = AutoModelForCausalLM.from_pretrained(
   "____",  # Hint: What folder did you save your LoRA adapter to?
   trust_remote_code=____,  # Hint: Should we trust remote code?
   device_map="____",  # Hint: What device should we use?
   low_cpu_mem_usage=____  # Hint: Should we optimize CPU memory usage?
)
finetuned_model.eval()

# TODO: Generate response from fine-tuned model with proper EOS handling
with torch.no_grad():
   tuned_output = finetuned_model.generate(
       **____,
       max_new_tokens=____,                     # Hint: How many new tokens? (30-100)
       eos_token_id=tokenizer.____,             # Hint: What token ID signals end of sequence?
       pad_token_id=tokenizer.____,             # Hint: What token for padding consistency?
       temperature=____,                        # Hint: Range 0.1-1.0 for creativity balance
       do_sample=____                          # Hint: True/False for natural responses
   )

print("🧠 Fine-tuned model output:")
print(tokenizer.decode(____[0], skip_special_tokens=____))  # Hint: What output to decode and skip special tokens?

# 🤔 COMPARISON QUESTIONS:
print("\n" + "="*50)
print("🤔 REFLECTION QUESTIONS:")
print("1. Which response was more helpful and detailed?")
print("2. Did the fine-tuned model follow the Q&A format better?")
print("3. Which model gave more accurate information?")
print("4. How did the training on SQuAD data affect the responses?")
print("5. Try different questions - what patterns do you notice?")
print("\n💡 TIP: Try different questions to test various aspects:")
print("   - Factual questions: 'What is the largest ocean?'")
print("   - Reasoning questions: 'Why do seasons change?'")
print("   - Practical questions: 'How do you change a tire?'")

# 💡 PROMPT STRATEGY TIP:
# Notice we removed "Answer:" from the prompt to let both models complete the response naturally.
# This shows how each model approaches question-answering differently:
# - Base model: May provide direct answers or continue with more questions
# - Fine-tuned model: Should provide structured Q&A format responses

# 🎉 Congratulations! Core Fine-tuning Complete

You've successfully fine-tuned your first language model! You now understand:
- How to prepare data for training
- LoRA configurations and memory optimization  
- Training pipelines and model evaluation
- The difference between base and fine-tuned model performance

## 🚀 Ready for More? Advanced Experiments Below!

For students who finish early or want to dive deeper: pick any of the experiments below that interest you and document your findings!

In [None]:
# 🤖 EXPERIMENT: Different Model Architectures
# Test fine-tuning on different model families

models_to_try = [
    "microsoft/DialoGPT-small",     # Conversational model
    "distilgpt2",                   # Smaller, faster GPT-2
    "microsoft/phi-2",              # Larger Phi model
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Tiny but capable
]

# TODO: Pick a model and compare fine-tuning results
# Which model learns your task best? Which is fastest to train?
# Document: training time, memory usage, output quality

In [None]:
# 📚 EXPERIMENT: Try Different Datasets
# See how the same model learns different types of tasks

datasets_to_try = [
   "imdb",                    # Movie reviews (sentiment)
   "ag_news",                 # News classification
   "eli5",                    # Simple explanations (may no longer be downloadable from the Hub)
   "math_qa"                  # Math word problems
]

# TODO: Pick a dataset and compare fine-tuning results
# Which dataset is easiest to learn? Which gives best responses?
# Document: training progress, final quality, task difficulty

In [None]:
import gc
import torch

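# 🧹 Free GPU memory: delete any models that are still loaded
# (NameError just means the variable was never defined or was already deleted)
try:
    del base_model
except NameError:
    pass

try:
    del finetuned_model
except NameError:
    pass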

gc.collect()
torch.cuda.empty_cache()
