[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/05_fine_tuning_trainer.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/05_fine_tuning_trainer.ipynb)

# 05 - Fine-Tuning with Trainer API

## Learning Objectives
By the end of this notebook, you will know how to:
- Fine-tune pre-trained models using the Trainer API
- Set up training arguments and configurations
- Implement custom evaluation metrics
- Monitor training progress with callbacks
- Save and load fine-tuned models
- Apply best practices for fine-tuning

The Trainer API is Hugging Face's high-level interface for training and fine-tuning transformer models. It handles most of the complexity while providing flexibility for customization.

In [None]:
# Import necessary libraries
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    TrainerCallback
)
from datasets import load_dataset, Dataset, DatasetDict
import torch
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from evaluate import load as load_metric
import os
import json
import warnings
warnings.filterwarnings('ignore')

# Load environment variables from .env.local for local development
try:
    from dotenv import load_dotenv
    load_dotenv('.env.local', override=True)
    print("Environment variables loaded from .env.local")
except ImportError:
    print("python-dotenv not installed, skipping .env.local loading")

# Credential management function
def get_api_key(key_name: str) -> str:
    """Get API key from environment or Colab secrets."""
    try:
        # Try to import Colab userdata (only available in Colab)
        from google.colab import userdata
        return userdata.get(key_name)
    except (ImportError, Exception):
        # Fall back to local environment variable
        api_key = os.getenv(key_name)
        if not api_key:
            print(f"Warning: {key_name} not found. Some features may be limited.")
            print(f"   For Colab: Add {key_name} to Colab secrets")
            print(f"   For local use: Add {key_name} to .env.local")
            return None
        return api_key

# Device detection function
def get_device():
    """Get the best available device for training/inference."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        return torch.device("mps") 
    else:
        return torch.device("cpu")

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Setup authentication and device
hf_token = get_api_key('HF_TOKEN')
if hf_token:
    os.environ['HF_TOKEN'] = hf_token
    print("Hugging Face token configured")

device = get_device()
print("\n=== Setup Information ===")
print(f"Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif device.type == 'mps':
    print("Apple Silicon GPU (MPS) detected")

## Part 1: Data Preparation for Fine-tuning

Let's prepare a dataset for fine-tuning. We'll use a sentiment analysis task as an example.

In [None]:
# Load and prepare dataset
print("Loading dataset...")
dataset = load_dataset("imdb")

# Use smaller subset for demonstration
train_size = 2500
eval_size = 500

small_train_dataset = dataset['train'].shuffle(seed=42).select(range(train_size))
small_eval_dataset = dataset['test'].shuffle(seed=42).select(range(eval_size))

print(f"Training examples: {len(small_train_dataset)}")
print(f"Evaluation examples: {len(small_eval_dataset)}")

# Examine the data
example = small_train_dataset[0]
print(f"\nExample:")
print(f"Text: {example['text'][:200]}...")
print(f"Label: {example['label']} ({'positive' if example['label'] == 1 else 'negative'})")

# Check label distribution
from collections import Counter
train_labels = [ex['label'] for ex in small_train_dataset]
eval_labels = [ex['label'] for ex in small_eval_dataset]

print(f"\nTraining label distribution: {Counter(train_labels)}")
print(f"Evaluation label distribution: {Counter(eval_labels)}")

## Part 2: Model and Tokenizer Setup

In [None]:
# Choose model for fine-tuning
model_name = "distilbert-base-uncased"
num_labels = 2  # Binary classification

print(f"Loading model and tokenizer: {model_name}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model for classification with error handling
try:
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        id2label={0: "NEGATIVE", 1: "POSITIVE"},
        label2id={"NEGATIVE": 0, "POSITIVE": 1}
    )
    
    # Move model to the appropriate device
    model = model.to(device)
    
    print(f"✓ Model loaded successfully with {model.num_parameters():,} parameters")
    print(f"✓ Model moved to device: {device}")
    print(f"\nModel configuration:")
    print(f"  Hidden size: {model.config.hidden_size}")
    print(f"  Number of attention heads: {model.config.num_attention_heads}")
    print(f"  Number of layers: {model.config.num_hidden_layers}")
    print(f"  Max position embeddings: {model.config.max_position_embeddings}")
    
except Exception as e:
    print(f"✗ Error loading model: {e}")
    raise

# Check if we need to add a pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("✓ Added pad token")

## Part 3: Text Preprocessing and Tokenization
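
The tokenization step in this part leaves sequences unpadded so that padding can happen per batch ("dynamic padding"). The following is a minimal, dependency-free sketch of that idea, not the `DataCollatorWithPadding` implementation itself. The helper name `pad_batch` and the token ids are made up for illustration, and a pad id of 0 is an assumption.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest length found in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two short reviews
short_batch = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]
padded = pad_batch(short_batch)
print(padded["input_ids"])
print(padded["attention_mask"])

# Compare the number of token slots against padding everything to max_length=256
max_length = 256
dynamic_tokens = sum(len(row) for row in padded["input_ids"])
fixed_tokens = max_length * len(short_batch)
print(f"Dynamic padding: {dynamic_tokens} slots, fixed padding: {fixed_tokens} slots")
```

Batches of short texts therefore cost far fewer token slots than padding every example to the same fixed length.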

In [None]:
# Define tokenization function
def preprocess_function(examples):
    """Tokenize and prepare text for training"""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=False,  # We'll use dynamic padding
        max_length=256  # Reduced for faster training
    )

# Apply preprocessing
print("Tokenizing datasets...")
tokenized_train = small_train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=['text']  # Remove original text to save memory
)

tokenized_eval = small_eval_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=['text']
)

# Rename label column to match trainer expectations
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_eval = tokenized_eval.rename_column("label", "labels")

print(f"Tokenized training features: {tokenized_train.features}")
print(f"Example tokenized input: {tokenized_train[0]}")

# Analyze token length distribution
token_lengths = [len(ex['input_ids']) for ex in tokenized_train]
print(f"\nToken length statistics:")
print(f"  Mean: {np.mean(token_lengths):.1f}")
print(f"  Median: {np.median(token_lengths):.1f}")
print(f"  95th percentile: {np.percentile(token_lengths, 95):.1f}")
print(f"  Max: {max(token_lengths)}")

## Part 4: Evaluation Metrics Setup
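
The metrics in this part use a support-weighted average of per-class scores. As a rough illustration of what "weighted" F1 means, here is a small pure-Python sketch. It assumes every class of interest appears in the true labels, and the function name `weighted_f1` and the toy logits are hypothetical.

```python
from collections import Counter
from typing import List


def weighted_f1(labels: List[int], preds: List[int]) -> float:
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score


# Made-up logits for four examples and their true labels
toy_logits = [[0.3, 0.7], [0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
toy_labels = [1, 0, 1, 1]

# Arg-max over the class dimension turns logits into predicted classes
toy_preds = [row.index(max(row)) for row in toy_logits]
toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
toy_f1 = weighted_f1(toy_labels, toy_preds)

print(f"Predictions: {toy_preds}")
print(f"Accuracy: {toy_accuracy:.4f}, weighted F1: {toy_f1:.4f}")
```

On a balanced dataset such as the IMDB subset used here, weighted and macro averages are usually close; they diverge when one class dominates.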

In [None]:
# Define evaluation metrics
def compute_metrics(eval_pred):
    """Compute metrics for evaluation"""
    predictions, labels = eval_pred
    
    # Get predicted classes
    predictions = np.argmax(predictions, axis=1)
    
    # Calculate metrics
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Test the metrics function
print("Testing metrics computation...")
dummy_predictions = np.array([[0.3, 0.7], [0.8, 0.2], [0.1, 0.9]])
dummy_labels = np.array([1, 0, 1])
test_metrics = compute_metrics((dummy_predictions, dummy_labels))
print(f"Test metrics: {test_metrics}")

## Part 5: Training Configuration
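
Before choosing step-based settings, it helps to work out how many optimizer steps one epoch takes. The sketch below does that arithmetic for the values used in this notebook (2,500 training examples, batch size 16, `max_steps=1000`, `warmup_steps=500`, learning rate 2e-5). It assumes a single device, no gradient accumulation, and a linear warmup followed by linear decay; the helper `linear_schedule` is a hypothetical stand-in, not the library's scheduler.

```python
import math

train_examples = 2500
batch_size = 16
max_steps = 1000
warmup_steps = 500
peak_lr = 2e-5

# Assumption: one device and no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
steps_for_three_epochs = 3 * steps_per_epoch
epochs_covered = max_steps / steps_per_epoch

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Steps for 3 epochs: {steps_for_three_epochs}")
print(f"Epochs covered by max_steps={max_steps}: {epochs_covered:.2f}")


def linear_schedule(step: int, peak: float, warmup: int, total: int) -> float:
    """Linear warmup from 0 to peak, then linear decay back to 0."""
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (total - step) / max(1, total - warmup))


for step in (0, 250, 500, 750, 1000):
    lr = linear_schedule(step, peak_lr, warmup_steps, max_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Under these assumptions, 1,000 steps cover more than the three configured epochs, and half of the run is spent in warmup, so it may be worth lowering `warmup_steps` when adapting this configuration.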

In [None]:
# Set up training arguments with device-aware settings
training_args = TrainingArguments(
    output_dir="./results",
    
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    
    # Optimization
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    
    # Evaluation and logging
    evaluation_strategy="steps",  # Renamed to eval_strategy in newer transformers releases
    eval_steps=200,
    logging_dir="./logs",
    logging_steps=100,
    
    # Saving
    save_strategy="steps",
    save_steps=400,  # Must be a multiple of eval_steps when load_best_model_at_end=True
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    
    # Device-aware settings
    seed=42,
    # fp16 and bf16 are mutually exclusive: prefer bf16 where supported, else fp16 on CUDA
    bf16=(device.type == 'cuda' and torch.cuda.is_bf16_supported()),
    fp16=(device.type == 'cuda' and not torch.cuda.is_bf16_supported()),
    report_to="none",  # Disable wandb/tensorboard for this demo
    
    # max_steps overrides num_train_epochs; remove it to train for the epochs set above
    max_steps=1000,  # Fixed step budget for the demo
)

print("\n=== Training Configuration ===")
print(f"  Device: {device}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size (train): {training_args.per_device_train_batch_size}")
print(f"  Batch size (eval): {training_args.per_device_eval_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Weight decay: {training_args.weight_decay}")
print(f"  Mixed precision (fp16): {training_args.fp16}")
print(f"  BFloat16: {training_args.bf16}")
print(f"  Max steps: {training_args.max_steps}")

## Part 6: Custom Callbacks and Monitoring
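
Early stopping with a patience value can be pictured as a counter of consecutive evaluations that fail to beat the best score so far. The following is a simplified pure-Python sketch of that logic, not the `EarlyStoppingCallback` source; the function name `stopping_index` and the F1 history are invented for illustration.

```python
from typing import List, Optional


def stopping_index(scores: List[float], patience: int = 3, threshold: float = 0.0) -> Optional[int]:
    """Return the evaluation index at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, score in enumerate(scores):
        if best is None or score > best + threshold:
            # New best score: remember it and reset the counter
            best = score
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Hypothetical eval_f1 values, one per evaluation
f1_history = [0.80, 0.85, 0.84, 0.85, 0.83, 0.86]

stop_at = stopping_index(f1_history, patience=3)
never_stops = stopping_index(f1_history, patience=4)
print(f"patience=3 stops at evaluation index: {stop_at}")
print(f"patience=4 stops at evaluation index: {never_stops}")
```

With a patience of 3, this toy run stops before reaching its later, better score, which illustrates the trade-off between saving compute and stopping too early.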

In [None]:
# Create custom callback for monitoring
class CustomCallback(TrainerCallback):
    """Custom callback to monitor training progress"""
    
    def __init__(self):
        self.training_losses = []
        self.eval_losses = []
        self.eval_accuracies = []
        self.learning_rates = []
    
    def on_log(self, args, state, control, logs=None, **kwargs):
        """Called when logging occurs"""
        if logs is not None:
            if 'loss' in logs:
                self.training_losses.append(logs['loss'])
            if 'eval_loss' in logs:
                self.eval_losses.append(logs['eval_loss'])
            if 'eval_accuracy' in logs:
                self.eval_accuracies.append(logs['eval_accuracy'])
            if 'learning_rate' in logs:
                self.learning_rates.append(logs['learning_rate'])
    
    def on_evaluate(self, args, state, control, model=None, eval_dataloader=None, **kwargs):
        """Called after evaluation"""
        print(f"Evaluation completed at step {state.global_step}")
    
    def get_training_history(self):
        """Return training history for visualization"""
        return {
            'training_losses': self.training_losses,
            'eval_losses': self.eval_losses,
            'eval_accuracies': self.eval_accuracies,
            'learning_rates': self.learning_rates
        }

# Initialize custom callback
custom_callback = CustomCallback()

# Set up data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

print("Custom callback and data collator initialized")

## Part 7: Initialize Trainer and Start Training

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,  # Newer transformers releases call this argument processing_class
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[
        custom_callback,
        EarlyStoppingCallback(early_stopping_patience=3)
    ],
)

print("Trainer initialized successfully")
print(f"Device: {trainer.args.device}")

# Run initial evaluation
print("\nRunning initial evaluation...")
initial_eval = trainer.evaluate()
print(f"Initial evaluation results:")
for key, value in initial_eval.items():
    print(f"  {key}: {value:.4f}")

In [None]:
# Start training
print("\nStarting training...")
print("=" * 50)

try:
    train_result = trainer.train()
    
    # Print training results
    print("\nTraining completed!")
    print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")
    print(f"Training loss: {train_result.metrics['train_loss']:.4f}")
    print(f"Training steps: {train_result.global_step}")
    print(f"Training samples per second: {train_result.metrics['train_samples_per_second']:.2f}")
    
except Exception as e:
    print(f"Training error: {e}")
    print("This might be due to memory constraints or missing dependencies.")
    raise  # Later cells assume training succeeded, so do not continue silently

## Part 8: Post-Training Evaluation

In [None]:
# Final evaluation
print("Running final evaluation...")
final_eval = trainer.evaluate()

print(f"\nFinal evaluation results:")
for key, value in final_eval.items():
    print(f"  {key}: {value:.4f}")

# Compare initial vs final performance
print(f"\nPerformance comparison:")
print(f"  Accuracy: {initial_eval['eval_accuracy']:.4f} → {final_eval['eval_accuracy']:.4f} (Δ{final_eval['eval_accuracy'] - initial_eval['eval_accuracy']:+.4f})")
print(f"  F1 Score: {initial_eval['eval_f1']:.4f} → {final_eval['eval_f1']:.4f} (Δ{final_eval['eval_f1'] - initial_eval['eval_f1']:+.4f})")
print(f"  Loss: {initial_eval['eval_loss']:.4f} → {final_eval['eval_loss']:.4f} (Δ{final_eval['eval_loss'] - initial_eval['eval_loss']:+.4f})")

# Get predictions for detailed analysis
predictions = trainer.predict(tokenized_eval)
y_pred = np.argmax(predictions.predictions, axis=1)
y_true = predictions.label_ids

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Fine-tuning Results', fontsize=16)

# Training history (if available)
history = custom_callback.get_training_history()
if history['training_losses']:
    axes[0, 0].plot(history['training_losses'], label='Training Loss')
    if history['eval_losses']:
        # Align eval losses with training steps
        eval_steps = np.linspace(0, len(history['training_losses']), len(history['eval_losses']))
        axes[0, 0].plot(eval_steps, history['eval_losses'], label='Validation Loss')
    axes[0, 0].set_title('Training Loss')
    axes[0, 0].set_xlabel('Steps')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

# Accuracy over time
if history['eval_accuracies']:
    axes[0, 1].plot(history['eval_accuracies'], 'g-o')
    axes[0, 1].set_title('Validation Accuracy')
    axes[0, 1].set_xlabel('Evaluation Steps')
    axes[0, 1].set_ylabel('Accuracy')
    axes[0, 1].grid(True, alpha=0.3)

# Confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, 0])
axes[1, 0].set_title('Confusion Matrix')
axes[1, 0].set_xlabel('Predicted')
axes[1, 0].set_ylabel('Actual')

# Learning rate schedule
if history['learning_rates']:
    axes[1, 1].plot(history['learning_rates'], 'r-')
    axes[1, 1].set_title('Learning Rate Schedule')
    axes[1, 1].set_xlabel('Steps')
    axes[1, 1].set_ylabel('Learning Rate')
    axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Classification report
from sklearn.metrics import classification_report
print(f"\nDetailed Classification Report:")
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

## Part 9: Model Inference and Testing
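
At inference time the model returns raw logits, which are turned into probabilities with a softmax; the "confidence" reported in this part is the probability of the predicted class. Here is a small pure-Python sketch of that conversion. The logit values are made up, and the label order (index 0 negative, index 1 positive) follows the mapping used in this notebook.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one review: [negative, positive]
example_logits = [-1.0, 1.0]
probs = softmax(example_logits)

predicted = probs.index(max(probs))
confidence = probs[predicted]
label = "Positive" if predicted == 1 else "Negative"

print(f"Probabilities: {[round(p, 4) for p in probs]}")
print(f"Prediction: {label} (confidence: {confidence:.3f})")
```

A high softmax probability only reflects how far apart the logits are; it is not a calibrated estimate of correctness.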

In [None]:
# Test the fine-tuned model on new examples
def predict_sentiment(text, model, tokenizer, device):
    """Predict sentiment for a given text"""
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    
    # Move inputs to whichever device the model is on (CUDA, MPS, or CPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Convert to probabilities
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1)
    
    # Get confidence
    confidence = probabilities[0][predicted_class].item()
    
    # Convert to labels
    label = "Positive" if predicted_class.item() == 1 else "Negative"
    
    return {
        'text': text,
        'label': label,
        'confidence': confidence,
        'probabilities': probabilities[0].cpu().numpy()
    }

# Test examples
test_examples = [
    "I absolutely loved this movie! Best film I've seen all year.",
    "This was terrible. Waste of time and money.",
    "It was okay, nothing special but not bad either.",
    "Amazing performance by the actors. Highly recommend!",
    "Boring and predictable. Fell asleep halfway through.",
    "A masterpiece of cinema. Perfect in every way."
]

print("Testing fine-tuned model on new examples:")
print("=" * 50)

results = []
for example in test_examples:
    result = predict_sentiment(example, trainer.model, tokenizer, device)
    results.append(result)
    
    print(f"\nText: {example}")
    print(f"Prediction: {result['label']} (confidence: {result['confidence']:.3f})")
    print(f"Probabilities: Negative={result['probabilities'][0]:.3f}, Positive={result['probabilities'][1]:.3f}")

# Visualize predictions
fig, ax = plt.subplots(figsize=(12, 6))

labels = [r['label'] for r in results]
confidences = [r['confidence'] for r in results]
colors = ['red' if label == 'Negative' else 'green' for label in labels]

bars = ax.bar(range(len(results)), confidences, color=colors, alpha=0.7)
ax.set_xlabel('Example')
ax.set_ylabel('Confidence')
ax.set_title('Model Predictions on Test Examples')
ax.set_xticks(range(len(results)))
ax.set_xticklabels([f'Ex {i+1}' for i in range(len(results))])
ax.set_ylim(0, 1)

# Add value labels on bars
for i, (bar, label, conf) in enumerate(zip(bars, labels, confidences)):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
            f'{label}\n{conf:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

## Part 10: Model Saving and Loading

In [None]:
# Save the fine-tuned model
save_path = "./fine-tuned-sentiment-model"

print(f"Saving model to {save_path}...")
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

# Save training arguments and metrics
with open(f"{save_path}/training_args.json", "w") as f:
    json.dump(training_args.to_dict(), f, indent=2)

with open(f"{save_path}/final_metrics.json", "w") as f:
    json.dump(final_eval, f, indent=2)

print("Model saved successfully!")

# List saved files (os is already imported at the top of the notebook)
saved_files = os.listdir(save_path)
print(f"Saved files: {saved_files}")

# Test loading the saved model
print("\nTesting model loading...")
try:
    loaded_model = AutoModelForSequenceClassification.from_pretrained(save_path)
    loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)
    
    # Test loaded model
    test_text = "This is a great product!"
    result_original = predict_sentiment(test_text, trainer.model, tokenizer, device)
    result_loaded = predict_sentiment(test_text, loaded_model, loaded_tokenizer, device)
    
    print(f"Original model: {result_original['label']} ({result_original['confidence']:.3f})")
    print(f"Loaded model: {result_loaded['label']} ({result_loaded['confidence']:.3f})")
    print(f"Results match: {np.allclose(result_original['probabilities'], result_loaded['probabilities'], rtol=1e-5)}")
    
except Exception as e:
    print(f"Error loading model: {e}")

## Part 11: Fine-tuning Best Practices Summary

In [None]:
# Create a comprehensive summary
def create_fine_tuning_summary():
    """Create a summary of fine-tuning best practices and results"""
    
    print("FINE-TUNING BEST PRACTICES SUMMARY")
    print("=" * 45)
    
    print("\n🎯 KEY LEARNINGS:")
    print("  • Start with a pre-trained model close to your domain")
    print("  • Use appropriate learning rates (2e-5 to 5e-5 for most cases)")
    print("  • Implement early stopping to prevent overfitting")
    print("  • Monitor both training and validation metrics")
    print("  • Use mixed precision training for efficiency")
    print("  • Save the best model based on validation metrics")
    
    print("\n📊 TRAINING CONFIGURATION:")
    print(f"  • Model: {model_name}")
    print(f"  • Training examples: {len(tokenized_train):,}")
    print(f"  • Validation examples: {len(tokenized_eval):,}")
    print(f"  • Batch size: {training_args.per_device_train_batch_size}")
    print(f"  • Learning rate: {training_args.learning_rate}")
    print(f"  • Max steps: {training_args.max_steps}")
    
    print("\n📈 PERFORMANCE RESULTS:")
    print(f"  • Initial accuracy: {initial_eval['eval_accuracy']:.4f}")
    print(f"  • Final accuracy: {final_eval['eval_accuracy']:.4f}")
    print(f"  • Improvement: {final_eval['eval_accuracy'] - initial_eval['eval_accuracy']:+.4f}")
    print(f"  • Final F1 score: {final_eval['eval_f1']:.4f}")
    
    print("\n⚡ OPTIMIZATION TIPS:")
    print("  • Use dynamic padding with DataCollatorWithPadding")
    print("  • Implement gradient accumulation for larger effective batch sizes")
    print("  • Use learning rate scheduling")
    print("  • Consider freezing early layers for domain adaptation")
    print("  • Monitor GPU memory usage and adjust batch size accordingly")
    
    print("\n🚀 PRODUCTION CONSIDERATIONS:")
    print("  • Validate on held-out test set")
    print("  • Test model robustness on edge cases")
    print("  • Consider model compression techniques")
    print("  • Implement proper error handling")
    print("  • Document training configuration and metrics")
    
    print("\n✅ CHECKLIST FOR FINE-TUNING:")
    checklist = [
        "Data preprocessing and tokenization",
        "Train/validation/test split",
        "Appropriate model architecture",
        "Hyperparameter tuning",
        "Evaluation metrics definition",
        "Training monitoring and callbacks",
        "Model saving and versioning",
        "Performance validation",
        "Error analysis",
        "Documentation"
    ]
    
    for i, item in enumerate(checklist, 1):
        print(f"  {i:2d}. {item}")

create_fine_tuning_summary()

print("\n" + "=" * 60)
print("Fine-tuning tutorial completed successfully!")
print("Your fine-tuned model is saved and ready for further evaluation.")
print("Next: Try Notebook 06 for fine-tuning from scratch!")
print("=" * 60)

## Summary

In this notebook, we covered the complete fine-tuning workflow using the Trainer API:

### 🎯 **What We Accomplished**
1. **Data Preparation**: Loaded and preprocessed text data for classification
2. **Model Setup**: Configured pre-trained model for fine-tuning
3. **Training Configuration**: Set up comprehensive training arguments
4. **Custom Metrics**: Implemented evaluation metrics and callbacks
5. **Model Training**: Fine-tuned model with monitoring
6. **Performance Analysis**: Evaluated and visualized results
7. **Model Persistence**: Saved, reloaded, and tested the fine-tuned model

### **Key Concepts Mastered**
- **Trainer API**: High-level interface for model training
- **Training Arguments**: Comprehensive configuration options
- **Callbacks**: Custom monitoring and early stopping
- **Evaluation Metrics**: Custom metric computation
- **Model Saving**: Proper model persistence and loading

### 📈 **Best Practices Learned**
- Start with appropriate learning rates (2e-5 to 5e-5)
- Use validation data for model selection
- Implement early stopping to prevent overfitting
- Monitor training progress with callbacks
- Use mixed precision training for efficiency
- Save models with proper versioning

### 🚀 **Next Steps**
- **Notebook 06**: Fine-tuning from scratch with custom training loops
- **Notebook 07**: Specialized applications (summarization)
- **Notebook 08**: Question answering systems
- **Notebook 09**: Advanced techniques (LoRA, QLoRA)

The Trainer API balances ease of use with flexibility, which makes it a sensible default for most fine-tuning tasks. You now have the foundation to fine-tune models for other text classification tasks.