## This notebook requires a GPU

This lab must be run in Google Colab in order to use GPU acceleration for model training. Click the button below to open this notebook in Colab, then set your runtime to GPU:

**Runtime > Change Runtime Type > T4 GPU**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scott2b/coursera-msds-public/blob/main/notebooks/5_advanced_fine_tuning_unsloth.ipynb)

# 🎯 Advanced Fine-tuning with Unsloth and PEFT

This notebook provides comprehensive coverage of modern fine-tuning techniques using Unsloth and Parameter-Efficient Fine-Tuning (PEFT) methods.

## 🎯 Learning Objectives

By the end of this notebook, you will:
1. Master Parameter-Efficient Fine-Tuning (PEFT) techniques
2. Understand and implement LoRA, QLoRA, and other efficient methods
3. Use Unsloth for ultra-fast fine-tuning
4. Implement advanced training strategies and optimizations
5. Handle memory constraints with gradient checkpointing
6. Fine-tune for classification, generation, and instruction following
7. Evaluate and compare different fine-tuning approaches
8. Deploy fine-tuned models for production use

## 🔧 Prerequisites

- Completed Notebooks 1 & 2 (LLM Fundamentals and vLLM Inference)
- Understanding of transformers and attention mechanisms
- Basic knowledge of PyTorch and training loops
- Familiarity with classification tasks and datasets

In [None]:
# Install packages not pre-installed in Colab
!pip install transformers datasets accelerate peft trl
!pip install unsloth bitsandbytes
!pip install evaluate

In [None]:
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
    BitsAndBytesConfig, pipeline
)
from peft import (
    LoraConfig, get_peft_model, prepare_model_for_kbit_training,
    TaskType, PeftModel, PeftConfig
)
from trl import SFTTrainer
import evaluate
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("🎯 Advanced Fine-tuning Environment Ready!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.1f} GB")

## 🎨 Understanding PEFT Methods

Learn about Parameter-Efficient Fine-Tuning techniques and their advantages.

In [None]:
# PEFT method comparison
peft_methods = {
    "LoRA (Low-Rank Adaptation)": {
        "description": "Learns low-rank updates to weight matrices",
        "parameters": "~0.5-2% of original",
        "memory": "Minimal increase",
        "training_speed": "Comparable to or faster than full fine-tuning",
        "inference": "Negligible overhead",
        "best_for": "Fine-tuning large models, memory-constrained environments"
    },
    "QLoRA (Quantized LoRA)": {
        "description": "Combines quantization with LoRA for extreme efficiency",
        "parameters": "~0.2-1% of original",
        "memory": "~70% reduction vs full fine-tuning",
        "training_speed": "Typically slower than LoRA (dequantization overhead)",
        "inference": "Requires quantization-aware serving",
        "best_for": "Very large models, extreme memory constraints"
    },
    "AdaLoRA": {
        "description": "Dynamically allocates parameters based on importance",
        "parameters": "Adaptive (0.5-5%)",
        "memory": "Adaptive reduction",
        "training_speed": "Similar to LoRA",
        "inference": "Minimal overhead",
        "best_for": "Dynamic parameter allocation, task-specific optimization"
    },
    "Prompt Tuning": {
        "description": "Learns soft prompts instead of model weights",
        "parameters": "~0.01% of original",
        "memory": "Negligible",
        "training_speed": "Very fast",
        "inference": "Minimal overhead",
        "best_for": "Few-shot learning, rapid adaptation"
    },
    "P-Tuning": {
        "description": "Learns prompt embeddings with continuous optimization",
        "parameters": "~0.1% of original",
        "memory": "Very low",
        "training_speed": "Fast",
        "inference": "Minimal overhead",
        "best_for": "Natural language understanding tasks"
    }
}

# Display comparison table
comparison_data = []
for method, details in peft_methods.items():
    comparison_data.append({
        'Method': method,
        'Parameters': details['parameters'],
        'Memory': details['memory'],
        'Training Speed': details['training_speed'],
        'Best For': details['best_for']
    })

df_comparison = pd.DataFrame(comparison_data)
print("🔬 PEFT Methods Comparison:")
print(df_comparison.to_string(index=False))
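
The "low-rank updates" in the LoRA row can be made concrete with a little arithmetic. The sketch below counts the trainable parameters LoRA adds to a single weight matrix; the 4096 x 4096 layer size is a hypothetical value chosen only for illustration, and the whole-model percentage will differ because it depends on which modules are adapted.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two factors: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * d_in + d_out * r


# Hypothetical layer size, chosen only for illustration
d_in, d_out, r = 4096, 4096, 16
full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
fraction = lora / full

print(f"Full matrix: {full:,} parameters")
print(f"LoRA (r={r}): {lora:,} parameters ({100 * fraction:.2f}% of the matrix)")
```

Doubling the rank doubles the adapter size, which is why rank is the main knob for trading capacity against memory.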

## 🚀 Unsloth Setup and Configuration

Learn how to set up and configure Unsloth for ultra-fast fine-tuning.

In [None]:
# Unsloth setup
def setup_unsloth_model(model_name: str, max_seq_length: int = 2048):
    """Load a 4-bit model with Unsloth; raises if Unsloth is missing or unusable"""
    from unsloth import FastLanguageModel

    print(f"🚀 Setting up Unsloth with {model_name}...")

    # Load model with Unsloth optimizations
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=None,  # Auto-detect
        load_in_4bit=True,  # Use 4-bit quantization
    )

    print("✅ Unsloth model loaded successfully!")
    print(f"📊 Model: {model_name}")
    print(f"🎯 Max sequence length: {max_seq_length}")
    print("🧠 Quantization: 4-bit")

    return model, tokenizer


def setup_standard_model(model_name: str):
    """Fallback: load an 8-bit model with standard transformers + PEFT"""
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # GPT-2 style tokenizers define no pad token; reuse EOS for padding
        tokenizer.pad_token = tokenizer.eos_token

    # Prepare for PEFT
    model = prepare_model_for_kbit_training(model)

    return model, tokenizer


# Setup model: try Unsloth first, then fall back to a smaller model
model_name = "unsloth/mistral-7b-bnb-4bit"
try:
    model, tokenizer = setup_unsloth_model(model_name)
    using_unsloth = True
except Exception as e:
    print(f"❌ Unsloth not available ({e}). Falling back to standard PEFT...")
    model_name = "microsoft/DialoGPT-medium"
    model, tokenizer = setup_standard_model(model_name)
    using_unsloth = False

print(f"\n🔧 Using {'Unsloth' if using_unsloth else 'Standard PEFT'}")
print(f"📊 Model parameters: {model.num_parameters():,}")
print(f"🎯 Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
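
To see why the model is loaded with `load_in_4bit=True`, it helps to estimate how much memory the weights alone occupy at different precisions. This is a rough sketch: the 7e9 parameter count is an assumed round figure for a "7B" model, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB."""
    return n_params * bits_per_param / 8 / (1024 ** 3)


n_params = 7e9  # assumed round figure for a "7B" model
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```

Under these assumptions, the FP16 weights alone approach the capacity of a 16 GB GPU, while the 4-bit weights leave room for activations and adapter training.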

In [None]:
# Configure LoRA
def setup_lora_config(model, r: int = 16, alpha: int = 16, dropout: float = 0.0):
    """Setup LoRA configuration for the model"""

    if using_unsloth:
        from unsloth import FastLanguageModel

        model = FastLanguageModel.get_peft_model(
            model,
            r=r,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],
            lora_alpha=alpha,
            lora_dropout=dropout,
            bias="none",
            use_gradient_checkpointing=True,
            random_state=3407,
            use_rslora=False,
            loftq_config=None,
        )
    else:
        # Standard PEFT LoRA
        lora_config = LoraConfig(
            r=r,
            lora_alpha=alpha,
            target_modules=["c_attn", "c_proj"],  # GPT-2 style
            lora_dropout=dropout,
            bias="none",
            task_type="CAUSAL_LM"
        )
        model = get_peft_model(model, lora_config)

    # Print trainable parameters info
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())

    print("🎯 LoRA Configuration:")
    print(f"   Rank (r): {r}")
    print(f"   Alpha: {alpha}")
    print(f"   Dropout: {dropout}")
    print(f"   Trainable params: {trainable_params:,}")
    print(f"   Total params: {total_params:,}")
    print(f"   Percentage trainable: {100 * trainable_params / total_params:.2f}%")

    return model

# Setup LoRA
model = setup_lora_config(model, r=16, alpha=16, dropout=0.05)
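
The `r` and `alpha` arguments interact through a scaling factor applied to the low-rank update before it is added to the frozen weight. A minimal sketch of that relationship follows; `lora_scaling` is a hypothetical helper written for this illustration, not a PEFT or Unsloth function. Standard LoRA scales by `alpha / r`, while rank-stabilized LoRA (the `use_rslora` option) scales by `alpha / sqrt(r)`.

```python
import math


def lora_scaling(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update B @ A before adding it to W."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


# With alpha fixed at 16, a higher rank shrinks the standard scaling factor
for r in (8, 16, 32):
    standard = lora_scaling(16, r)
    rank_stabilized = lora_scaling(16, r, use_rslora=True)
    print(f"r={r:>2}: alpha/r = {standard:.2f}, alpha/sqrt(r) = {rank_stabilized:.2f}")
```

This is why rank and alpha are usually tuned together: changing `r` alone also changes the effective strength of the adapter.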

## 📚 Data Preparation for Fine-tuning

Learn how to prepare and format data for effective fine-tuning.

In [None]:
# Load and prepare dataset
def load_classification_dataset():
    """Load and prepare a classification dataset for fine-tuning"""

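    # Load IMDB dataset. The train split is ordered by label, so shuffle before
    # taking a 10% subset; a plain "train[:10%]" slice would contain one class only.
    dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(2500))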

    print(f"📊 Dataset loaded: {len(dataset)} examples")
    print(f"📝 Sample: {dataset[0]}")

    # Convert to instruction format
    def format_instruction(example):
        instruction = "Classify the sentiment of this movie review as positive or negative."

        formatted_text = f"""### Instruction:
{instruction}

### Input:
{example['text']}

### Response:
{'positive' if example['label'] == 1 else 'negative'}
"""

        return {"text": formatted_text, "label": example["label"]}

    # Format dataset
    formatted_dataset = dataset.map(format_instruction)

    print("\n🔄 Dataset formatted for instruction tuning")
    print(f"📝 Formatted sample: {formatted_dataset[0]['text'][:200]}...")

    return formatted_dataset

# Load dataset
dataset = load_classification_dataset()

# Split dataset
train_val_split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_val_split["train"]
val_dataset = train_val_split["test"]

print(f"\n📊 Training set: {len(train_dataset)} examples")
print(f"📊 Validation set: {len(val_dataset)} examples")
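
Training text and inference prompts should share one template, with the answer present only during training. The sketch below shows one way to keep them in sync; `build_prompt` is a hypothetical helper (not part of the notebook's code) that reproduces the instruction format used above.

```python
from typing import Optional

INSTRUCTION = "Classify the sentiment of this movie review as positive or negative."


def build_prompt(review: str, label: Optional[int] = None) -> str:
    """Return the training text when a label is given, else the inference prompt."""
    prompt = (
        f"### Instruction:\n{INSTRUCTION}\n\n"
        f"### Input:\n{review}\n\n"
        f"### Response:\n"
    )
    if label is None:
        return prompt  # the model is expected to generate the answer
    return prompt + ("positive" if label == 1 else "negative") + "\n"


review = "A moving, beautifully shot film."
train_text = build_prompt(review, label=1)
infer_prompt = build_prompt(review)

print(train_text)
print(repr(infer_prompt[-15:]))
```

Because the inference prompt is a strict prefix of the training text, the model sees at evaluation time exactly the context it was trained to complete.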

## 🎯 Fine-tuning Implementation

Learn how to implement and execute fine-tuning with modern techniques.

In [None]:
# Training configuration
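# Prefer bf16 where the GPU supports it; a T4 does not, so it falls back to fp16
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,  # Ignored while max_steps is set
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=50,  # Very short for demo; takes precedence over num_train_epochs
    learning_rate=2e-4,
    fp16=torch.cuda.is_available() and not use_bf16,
    bf16=use_bf16,
    logging_steps=10,
    save_strategy="steps",
    save_steps=25,
    eval_steps=25,
    eval_strategy="steps",  # Must match save_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_total_limit=2,
    report_to="none",  # Disable wandb for demo
    remove_unused_columns=False,  # Keep the "text" column for the custom collator
)

# Custom data collator for instruction tuning
def data_collator(features):
    """Tokenize a batch of formatted examples and attach causal-LM labels"""
    batch = tokenizer(
        [f["text"] for f in features],
        truncation=True,
        padding=True,
        max_length=min(2048, tokenizer.model_max_length),
        return_tensors="pt"
    )
    # The Trainer needs labels to compute a loss; -100 excludes padding positions
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100
    batch["labels"] = labels
    return batch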

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("🚀 Starting fine-tuning...")
print(f"📊 Training for {training_args.max_steps} steps")
print(f"📊 Batch size: {training_args.per_device_train_batch_size}")
print(f"📊 Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"📊 Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")

# Note: Actual training commented out for demo
# trainer.train()
print("\n✅ Fine-tuning setup complete! (Training commented out for demo)")
print("💡 To run actual training, uncomment trainer.train() above")
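
The effective batch size printed above also determines how much of the dataset a 50-step run actually covers. A small sketch of that arithmetic follows; `training_budget` is a hypothetical helper, and the 2,000-example figure assumes the 10% IMDB subset (2,500 reviews) after the 80/20 split.

```python
def training_budget(n_examples: int, per_device_batch: int, grad_accum: int,
                    max_steps: int, n_devices: int = 1) -> dict:
    """Effective batch size and dataset coverage for a step-limited run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / n_examples,
    }


# Values mirror the TrainingArguments above; n_examples is an assumption
budget = training_budget(n_examples=2000, per_device_batch=2, grad_accum=4, max_steps=50)
print(budget)
```

Under these assumptions the demo run sees only a fraction of one epoch, so treat its metrics as a smoke test rather than a measure of final quality.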

## 📊 Training Monitoring and Evaluation

Learn how to monitor training progress and evaluate model performance.

In [None]:
# Training metrics and evaluation
def evaluate_fine_tuned_model(model, tokenizer, test_dataset):
    """Evaluate the fine-tuned model on test data"""

    print("🔍 Evaluating fine-tuned model...")

    # Load evaluation metrics
    accuracy_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")

    predictions = []
    references = []

    model.eval()

    # Evaluate on a small subset
    eval_subset = test_dataset.select(range(min(20, len(test_dataset))))

    for example in eval_subset:
        # Reuse the training format but drop the gold answer after "### Response:"
        # so the label is not leaked; cap the text length so the response cue
        # survives tokenizer truncation
        prompt_body = example['text'].split("### Response:")[0].rstrip()
        prompt = prompt_body[:1500] + "\n\n### Response:\n"

        # Generate prediction on the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        inputs = inputs.to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=10,
                do_sample=False,  # Greedy decoding; temperature has no effect here
                pad_token_id=tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens
        prompt_length = inputs["input_ids"].shape[1]
        generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

        # Extract sentiment (simple keyword matching)
        pred_sentiment = 1 if "positive" in generated_text.lower() else 0
        true_sentiment = example['label']

        predictions.append(pred_sentiment)
        references.append(true_sentiment)

        print(f"📝 True: {'positive' if true_sentiment else 'negative'}, Pred: {'positive' if pred_sentiment else 'negative'}")

    # Calculate metrics
    accuracy = accuracy_metric.compute(predictions=predictions, references=references)
    f1 = f1_metric.compute(predictions=predictions, references=references)

    print("\n📊 Evaluation Results:")
    print(f"   Accuracy: {accuracy['accuracy']:.3f}")
    print(f"   F1-Score: {f1['f1']:.3f}")

    return {
        "accuracy": accuracy['accuracy'],
        "f1": f1['f1'],
        "predictions": predictions,
        "references": references
    }

# Note: Evaluation commented out since training was skipped
print("📊 Evaluation setup complete (would run after training)")
# eval_results = evaluate_fine_tuned_model(model, tokenizer, val_dataset)
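
Keyword matching treats any output that lacks the word "positive" as a negative prediction, which hides cases where the model produced neither label. The sketch below is one possible stricter variant; `parse_sentiment` and `accuracy_and_f1` are hypothetical helpers written in plain Python for illustration, and the sample outputs are made up.

```python
from typing import List, Optional, Tuple


def parse_sentiment(generated: str) -> Optional[int]:
    """Map generated text to 1 (positive), 0 (negative), or None if ambiguous."""
    text = generated.lower()
    has_pos, has_neg = "positive" in text, "negative" in text
    if has_pos == has_neg:  # both labels or neither label present
        return None
    return 1 if has_pos else 0


def accuracy_and_f1(preds: List[Optional[int]], refs: List[int]) -> Tuple[float, float]:
    """Accuracy and binary F1 (positive class = 1); None counts as wrong."""
    pairs = list(zip(preds, refs))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p != 1 and r == 1 for p, r in pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return correct / len(refs), f1


# Made-up generations for illustration
outputs = ["positive", " Negative.", "I am not sure", "negative, not positive"]
refs = [1, 0, 1, 0]
preds = [parse_sentiment(o) for o in outputs]
acc, f1 = accuracy_and_f1(preds, refs)
print(preds, f"accuracy={acc:.2f}", f"f1={f1:.2f}")
```

Counting the `None` cases separately is a cheap way to tell a model that answers wrongly from one that does not follow the output format at all.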

## 🔧 Advanced Training Techniques

Explore advanced training strategies and optimizations.

In [None]:
# Advanced training techniques
def demonstrate_advanced_training():
    """Demonstrate advanced training techniques"""

    techniques = {
        "Gradient Checkpointing": {
            "purpose": "Reduce memory usage during training",
            "trade_off": "~20% slower training, ~50% less memory",
            "when_to_use": "Large models, limited GPU memory",
            "implementation": "model.gradient_checkpointing_enable()"
        },
        "Mixed Precision Training": {
            "purpose": "Faster training with FP16",
            "trade_off": "Potential numerical instability",
            "when_to_use": "Modern GPUs with Tensor Cores",
            "implementation": "TrainingArguments(fp16=True)"
        },
        "Gradient Accumulation": {
            "purpose": "Simulate larger batch sizes",
            "trade_off": "Slower training per step",
            "when_to_use": "Limited GPU memory, want stable training",
            "implementation": "gradient_accumulation_steps=N"
        },
        "Learning Rate Scheduling": {
            "purpose": "Dynamic learning rate adjustment",
            "trade_off": "Requires hyperparameter tuning",
            "when_to_use": "Long training runs, stable convergence",
            "implementation": "lr_scheduler_type='cosine'"
        },
        "Early Stopping": {
            "purpose": "Prevent overfitting",
            "trade_off": "May stop training too early",
            "when_to_use": "Limited compute, validation data available",
            "implementation": "EarlyStoppingCallback()"
        },
        "Weight Decay": {
            "purpose": "Regularization to prevent overfitting",
            "trade_off": "May slow convergence",
            "when_to_use": "Small datasets, prone to overfitting",
            "implementation": "weight_decay=0.01"
        }
    }

    print("🛠️  Advanced Training Techniques:")
    print("=" * 50)

    for technique, details in techniques.items():
        print(f"\n🎯 {technique}:")
        print(f"   📋 Purpose: {details['purpose']}")
        print(f"   ⚖️  Trade-off: {details['trade_off']}")
        print(f"   🎯 When to use: {details['when_to_use']}")
        print(f"   💻 Implementation: {details['implementation']}")

demonstrate_advanced_training()
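
As an illustration of learning rate scheduling, the sketch below computes a linear warmup followed by a half-cosine decay in plain Python. It assumes the commonly used formula for this schedule; `lr_at_step` is a hypothetical helper, and the step counts and peak rate simply mirror the `warmup_steps=5`, `max_steps=50`, and `learning_rate=2e-4` values used earlier in this notebook.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s, total_steps=50, warmup_steps=5, peak_lr=2e-4) for s in range(51)]
for step in (0, 5, 25, 50):
    print(f"step {step:>2}: lr = {schedule[step]:.2e}")
```

The rate rises during warmup, peaks at the end of warmup, and then decays smoothly, which tends to matter more on long runs than on a 50-step demo.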

## 💾 Model Saving and Merging

Learn how to save, merge, and deploy fine-tuned models.

In [None]:
# Model saving and merging
def save_and_merge_model(model, tokenizer, output_dir: str = "./fine_tuned_model"):
    """Save and optionally merge the fine-tuned model"""

    print(f"💾 Saving model to {output_dir}...")

    # Save the PEFT model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    print("✅ PEFT adapters saved!")

    # Optionally merge the LoRA weights into the base model
    try:
        print("🔄 Merging LoRA weights with base model...")

        # merge_and_unload() folds the adapters into the base weights that are
        # already in memory, so the base model is not loaded a second time.
        # Merging into 4-bit or 8-bit weights can introduce small rounding errors.
        merged_model = model.merge_and_unload()

        # Save merged model
        merged_model.save_pretrained(f"{output_dir}_merged")
        tokenizer.save_pretrained(f"{output_dir}_merged")

        print("✅ Merged model saved!")

        return merged_model

    except Exception as e:
        print(f"⚠️  Could not merge model: {e}")
        print("💡 PEFT model still usable for inference")
        return model

# Save model
merged_model = save_and_merge_model(model, tokenizer, "./fine_tuned_sentiment_model")
print("\n🎉 Model saving complete!")
print("📁 PEFT model: ./fine_tuned_sentiment_model")
print("📁 Merged model: ./fine_tuned_sentiment_model_merged")
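
Conceptually, merging folds the scaled low-rank update into the base weight, so the merged layer gives the same output as the base layer plus the adapter. The toy sketch below checks this with tiny hand-written matrices in plain Python; `matmul` and `merge_lora` are hypothetical helpers for illustration and do not reflect how PEFT implements merging internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: d_out=2, d_in=3, r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight (d_out x d_in)
A = [[1.0, 2.0, 3.0]]                   # LoRA factor A (r x d_in)
B = [[0.5], [-1.0]]                     # LoRA factor B (d_out x r)
alpha, r = 2, 1
x = [[1.0], [1.0], [1.0]]               # one input column vector

merged = merge_lora(W, A, B, alpha, r)
y_merged = matmul(merged, x)

# Adapter path: base output plus the scaled low-rank output
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

print(merged)
print(y_merged, y_adapter)
```

Since both paths produce the same output, a merged model needs no adapter code at inference time, at the cost of storing a full-size checkpoint.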

## 🧪 Comparing Fine-tuning Methods

Compare different fine-tuning approaches and their effectiveness.

In [None]:
# Fine-tuning method comparison
def compare_fine_tuning_methods():
    """Compare different fine-tuning approaches"""

    methods = {
        "Full Fine-tuning": {
            "parameters": "100% of model",
            "memory": "High (full model)",
            "training_time": "Long",
            "performance": "Best (potentially)",
            "flexibility": "High",
            "use_case": "Small models, unlimited compute"
        },
        "LoRA": {
            "parameters": "0.5-2% of model",
            "memory": "Low",
            "training_time": "Medium",
            "performance": "Very good",
            "flexibility": "High",
            "use_case": "Large models, limited compute"
        },
        "QLoRA": {
            "parameters": "0.2-1% of model",
            "memory": "Very low",
            "training_time": "Medium (dequantization overhead)",
            "performance": "Good",
            "flexibility": "Medium",
            "use_case": "Very large models, extreme constraints"
        },
        "Prompt Tuning": {
            "parameters": "<0.01% of model",
            "memory": "Minimal",
            "training_time": "Very fast",
            "performance": "Fair",
            "flexibility": "Low",
            "use_case": "Few-shot learning, rapid prototyping"
        },
        "Instruction Tuning": {
            "parameters": "Varies (0.1-5%)",
            "memory": "Low to medium",
            "training_time": "Medium",
            "performance": "Excellent",
            "flexibility": "High",
            "use_case": "Chatbots, instruction-following tasks"
        }
    }

    # Create comparison DataFrame
    df = pd.DataFrame.from_dict(methods, orient='index')
    df.index.name = 'Method'

    print("🔬 Fine-tuning Methods Comparison:")
    print(df.to_string())

    # Performance vs Efficiency plot
    methods_short = ['Full FT', 'LoRA', 'QLoRA', 'Prompt', 'Instruction']
    performance_scores = [9, 8, 7, 5, 8.5]  # Subjective performance scores
    efficiency_scores = [1, 7, 9, 10, 8]    # Efficiency scores (higher = more efficient)

    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 6))
    plt.scatter(efficiency_scores, performance_scores, s=100)

    for i, method in enumerate(methods_short):
        plt.annotate(method, (efficiency_scores[i], performance_scores[i]),
                    xytext=(5, 5), textcoords='offset points')

    plt.xlabel('Efficiency Score')
    plt.ylabel('Performance Score')
    plt.title('Fine-tuning Methods: Performance vs Efficiency')
    plt.grid(True, alpha=0.3)
    plt.show()

    return df

comparison_df = compare_fine_tuning_methods()

## 🎯 Production Deployment Considerations

Learn how to deploy fine-tuned models for production use.

In [None]:
# Production deployment considerations
def production_deployment_guide():
    """Guide for deploying fine-tuned models in production"""

    considerations = {
        "Model Format": {
            "PEFT": "Small, fast loading, requires base model",
            "Merged": "Larger, self-contained, slower loading",
            "Quantized": "Smallest footprint, possible accuracy trade-off",
            "Recommendation": "PEFT for development, Merged for production"
        },
        "Inference Optimization": {
            "Batch Processing": "Use vLLM or similar for batched inference",
            "Caching": "Cache frequent prompts and responses",
            "Model Parallelism": "Distribute large models across GPUs",
            "Quantization": "Use 4-bit or 8-bit to reduce memory; speed gains depend on kernel support"
        },
        "Monitoring": {
            "Performance Metrics": "Latency, throughput, memory usage",
            "Model Metrics": "Accuracy, F1-score, drift detection",
            "System Metrics": "CPU/GPU utilization, error rates",
            "Business Metrics": "User satisfaction, task completion"
        },
        "Scaling": {
            "Horizontal Scaling": "Multiple model instances",
            "Vertical Scaling": "Larger GPUs, more memory",
            "Load Balancing": "Distribute requests across instances",
            "Auto-scaling": "Scale based on demand patterns"
        }
    }

    print("🚀 Production Deployment Guide:")
    print("=" * 50)

    for category, details in considerations.items():
        print(f"\n📋 {category}:")
        for key, value in details.items():
            print(f"   • {key}: {value}")

production_deployment_guide()
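
The latency and throughput metrics listed under Monitoring can be computed from per-request timings. A minimal sketch follows, using a nearest-rank percentile; `percentile` and `summarize_latency` are hypothetical helpers, and the latency values are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_latency(latencies_s, window_s):
    """Median and tail latency plus throughput over an observation window."""
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "throughput_rps": len(latencies_s) / window_s,
    }


# Made-up request latencies (seconds) observed over a 5-second window
latencies = [0.12, 0.15, 0.11, 0.40, 0.13, 0.14, 0.95, 0.12, 0.16, 0.13]
summary = summarize_latency(latencies, window_s=5.0)
print(summary)
```

Tracking a tail percentile alongside the median matters because a few slow generations can dominate user experience even when the typical request is fast.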

## 📚 Key Takeaways

1. **PEFT Methods**: LoRA, QLoRA, and other parameter-efficient techniques dramatically reduce training costs
2. **Unsloth**: Ultra-fast fine-tuning with automatic optimizations
3. **Memory Management**: Gradient checkpointing and quantization enable training of large models
4. **Data Formatting**: Proper instruction tuning format is crucial for good performance
5. **Evaluation**: Track task metrics such as accuracy and F1, not just training loss
6. **Model Merging**: Combining LoRA adapters with base models for deployment
7. **Production Ready**: Monitoring, scaling, and optimization considerations

## 🚀 Next Steps

Now that you understand advanced fine-tuning, proceed to:
- **Notebook 4**: Production Deployment and Scaling
- **Notebook 5**: Evaluation, Benchmarking, and Ethics

## 🔗 Additional Resources

- [LoRA Paper](https://arxiv.org/abs/2106.09685) - Original LoRA research
- [QLoRA Paper](https://arxiv.org/abs/2305.14314) - Quantized LoRA
- [Unsloth Documentation](https://github.com/unslothai/unsloth)
- [PEFT Documentation](https://huggingface.co/docs/peft/index)

## 🎯 Hands-on Exercises

1. **LoRA Configuration**: Experiment with different LoRA ranks (8, 16, 32) and alpha values
2. **Quantization Comparison**: Compare FP16, 8-bit, and 4-bit training performance
3. **Dataset Formatting**: Try different instruction formats and compare results
4. **Hyperparameter Tuning**: Use Optuna to optimize learning rate and batch size
5. **Model Merging**: Practice merging and deploying LoRA adapters
6. **Memory Optimization**: Train a model with gradient checkpointing on limited GPU memory

## 🎉 Conclusion

You've now mastered advanced fine-tuning techniques! Key achievements:
- ✅ Understanding PEFT methods and their trade-offs
- ✅ Implementing Unsloth for ultra-fast training
- ✅ Configuring LoRA and QLoRA adapters
- ✅ Preparing data for instruction tuning
- ✅ Advanced training techniques and optimizations
- ✅ Model evaluation and comparison
- ✅ Production deployment considerations

Ready to move on to production deployment! 🚀