[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic3.6/HF-full-training-demo.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic3.6/HF-full-training-demo.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic3.6/HF-full-training-demo.ipynb)

# HF Full Training Demo - Fast Fine-tuning for Cloud Environments

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How to fine-tune DistilBERT efficiently for sentiment classification
- How to optimize training for AWS SageMaker Studio and other cloud environments
- How to use the GLUE SST-2 dataset for rapid convergence
- Best practices for fast, cost-effective fine-tuning
- The benefits of knowledge distillation in production scenarios

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals

## 🚀 What We'll Cover
1. **Environment Setup**: Optimized imports and device detection
2. **Dataset Loading**: GLUE SST-2 for binary sentiment classification
3. **Model Setup**: DistilBERT - 40% smaller than BERT, retaining about 97% of its performance
4. **Training Configuration**: Fast convergence settings
5. **Fine-tuning Execution**: 15-30 minute training cycle
6. **Evaluation & Testing**: Performance validation
7. **Model Deployment**: Saving for production use

## 💡 Why This Combination?

| Component | Recommendation | Efficiency Reason |
| :--- | :--- | :--- |
| **Model** | `distilbert-base-uncased` | 40% fewer parameters than BERT, faster training |
| **Task** | Binary Sentiment Classification | Simplest downstream task, maximum speed |
| **Dataset** | GLUE SST-2 | ~67K samples, rapid convergence in 3-4 epochs |

**Expected Training Time**: 15-30 minutes on AWS ml.g4dn.xlarge or similar GPU instances.
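
To put the time estimate in context, the sketch below works out how many optimizer steps the run involves. It assumes the SST-2 training split size of 67,349 examples together with the batch size (32) and epoch count (3) used in the training configuration later in this notebook, on a single GPU without gradient accumulation. `steps_for_run` is an illustrative helper, not part of any library.

```python
import math

def steps_for_run(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

TRAIN_EXAMPLES = 67_349  # size of the GLUE SST-2 training split
total_steps = steps_for_run(TRAIN_EXAMPLES, batch_size=32, epochs=3)
print(f"Total optimizer steps: {total_steps:,}")

# Average throughput needed to finish inside the 15-30 minute window
for minutes in (15, 30):
    print(f"{minutes} min -> {total_steps / (minutes * 60):.1f} steps/s")
```

The required throughput is only a rough guide: periodic evaluation and checkpointing add time on top of the training steps themselves.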

## Part 1: Environment Setup and Imports

First, let's set up our environment with the necessary libraries optimized for cloud training.

In [None]:
# Essential imports for fast fine-tuning
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
from datasets import load_dataset
import torch
import numpy as np
import time
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

print("🚀 Environment setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {__import__('transformers').__version__}")

## Part 2: Device Detection - AWS SageMaker Optimized

Automatically detect the best available device, with special consideration for AWS SageMaker Studio environments.

In [None]:
def get_optimal_device():
    """
    Get the best available device for training, optimized for AWS environments.
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"🔥 Using CUDA GPU: {gpu_name}")
        print(f"💾 GPU Memory: {gpu_memory:.1f} GB")
        
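        # Heuristic instance hint: get_device_name() returns the GPU model
        # (e.g. "Tesla T4"), not the SageMaker instance type.
        # ml.g4dn instances carry T4 GPUs; ml.g5 instances carry A10G GPUs.
        if 'T4' in gpu_name or 'A10G' in gpu_name:
            print("☁️  T4/A10G GPU detected (as found on AWS ml.g4dn / ml.g5 instances) - well suited for training!")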
            
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS for Apple Silicon optimization")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU - consider GPU for better performance")
    
    return device

# Get optimal device
device = get_optimal_device()
print(f"\n✅ Selected device: {device}")

## Part 3: GLUE SST-2 Dataset Loading

Load GLUE SST-2, the binary-label (positive/negative) version of the Stanford Sentiment Treebank. Its short movie-review sentences make it well suited to fast sentiment classification training.

In [None]:
# Load GLUE SST-2 dataset for binary sentiment classification
print("📥 Loading GLUE SST-2 dataset...")

# Load the complete dataset
dataset = load_dataset("glue", "sst2")

print(f"✅ Dataset loaded successfully!")
print(f"📊 Training examples: {len(dataset['train']):,}")
print(f"🔬 Validation examples: {len(dataset['validation']):,}")

# Examine the dataset structure
print(f"\n📋 Dataset features: {dataset['train'].features}")

# Show example data
train_example = dataset['train'][0]
print(f"\n📝 Example training sample:")
print(f"   Sentence: '{train_example['sentence']}'")
print(f"   Label: {train_example['label']} ({'Positive' if train_example['label'] == 1 else 'Negative'})")

# Dataset statistics
from collections import Counter
# Column access reads only the label column instead of decoding every row
train_labels = dataset['train']['label']
val_labels = dataset['validation']['label']

print(f"\n📈 Label distribution:")
print(f"   Training: {Counter(train_labels)}")
print(f"   Validation: {Counter(val_labels)}")
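
The label counts printed above also give a useful reference point: always predicting the most common class yields an accuracy equal to that class's share of the data, so a fine-tuned model should clearly beat it. The sketch below shows the calculation on made-up labels; `majority_baseline` is a hypothetical helper, and the toy list is not SST-2 data.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

toy_labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # made-up labels, not SST-2 data
print(f"Majority-class baseline: {majority_baseline(toy_labels):.2f}")
```

The same function could be applied to `train_labels` or `val_labels` from the cell above.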

## Part 4: DistilBERT Model Setup

Load DistilBERT - a distilled version of BERT that's 40% smaller while maintaining 97% of BERT's performance.

In [None]:
# Model configuration for fast training
model_name = "distilbert-base-uncased"
num_labels = 2  # Binary classification

print(f"🔄 Loading {model_name}...")
print(f"💡 DistilBERT info: 40% fewer parameters than BERT, 97% performance retained")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model for sequence classification
try:
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        id2label={0: "NEGATIVE", 1: "POSITIVE"},
        label2id={"NEGATIVE": 0, "POSITIVE": 1}
    )
    
    # Move model to optimal device
    model = model.to(device)
    
    print(f"✅ Model loaded successfully!")
    print(f"📊 Parameters: {model.num_parameters():,} (vs ~110M for BERT-base)")
    print(f"🎯 Model moved to: {device}")
    
    # Model architecture info
    print(f"\n🏗️  Model Architecture:")
    print(f"   Hidden size: {model.config.hidden_size}")
    print(f"   Attention heads: {model.config.num_attention_heads}")
    print(f"   Hidden layers: {model.config.num_hidden_layers} (vs 12 for BERT)")
    print(f"   Max position embeddings: {model.config.max_position_embeddings}")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    raise

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("✅ Pad token configured")

print(f"\n⚡ Knowledge Distillation Benefits:")
print(f"   • 40% fewer parameters = faster training")
print(f"   • Lower memory usage = larger batch sizes possible")
print(f"   • Faster inference = better production performance")
print(f"   • 97% of BERT performance retained")
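
The "40% fewer parameters" figure can be checked with simple arithmetic. The sketch below uses the rounded, commonly quoted parameter counts for the two base models (about 66M for DistilBERT and about 110M for BERT-base); treat these as approximations. The count printed by `model.num_parameters()` above may differ slightly, since it includes the classification head.

```python
# Rounded, commonly quoted parameter counts in millions (approximate)
BERT_BASE_PARAMS_M = 110
DISTILBERT_PARAMS_M = 66

# Layer counts as printed in the architecture summary above
BERT_LAYERS = 12
DISTILBERT_LAYERS = 6

reduction = 1 - DISTILBERT_PARAMS_M / BERT_BASE_PARAMS_M
layer_ratio = DISTILBERT_LAYERS / BERT_LAYERS

print(f"Parameter reduction: {reduction:.0%}")
print(f"Transformer layers kept: {layer_ratio:.0%}")
```

Halving the layers does not halve the parameters, because the embedding matrix is shared by both models regardless of depth.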

## Part 5: Training Configuration and Fine-tuning

Configure training for optimal performance on cloud GPU instances and start fine-tuning.

In [None]:
# Tokenization function optimized for speed
def tokenize_function(examples):
    return tokenizer(
        examples['sentence'],
        truncation=True,
        padding=False,  # Dynamic padding during training
        max_length=128  # Shorter sequences = faster training
    )

# Tokenize datasets
print("🔄 Tokenizing datasets...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['sentence', 'idx']  # Remove unnecessary columns
)
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

# Evaluation metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Training configuration for AWS SageMaker
training_args = TrainingArguments(
    output_dir="./distilbert-sst2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
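    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    eval_steps=200,
    logging_steps=100,
    save_steps=400,  # must be a round multiple of eval_steps when load_best_model_at_end=True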
    load_best_model_at_end=True,
    metric_for_best_model="eval_accuracy",
    greater_is_better=True,
    fp16=(device.type == 'cuda'),
    dataloader_pin_memory=True,
    seed=42,
    report_to="none"
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

print("🎯 Training configuration:")
print(f"   📅 Epochs: {training_args.num_train_epochs}")
print(f"   📦 Batch size: {training_args.per_device_train_batch_size}")
print(f"   📈 Learning rate: {training_args.learning_rate}")
print(f"   ⚡ Mixed precision: {training_args.fp16}")
print(f"   🎯 Expected time: 15-30 minutes on GPU")

# Start training
print("\n🚀 Starting fine-tuning...")
start_time = time.time()

train_result = trainer.train()

training_time = time.time() - start_time
print(f"\n✅ Training completed in {training_time:.2f}s ({training_time/60:.1f} min)")
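
When `load_best_model_at_end=True` is combined with step-based evaluation, `TrainingArguments` expects `save_steps` to be a round multiple of `eval_steps`, so that every checkpoint has a matching evaluation score. The sketch below is a hypothetical pre-flight check for that rule plus a helper that lists when evaluations would run. Both function names are illustrative, and the 6,315-step total assumes 67,349 training examples, batch size 32, and 3 epochs on one GPU.

```python
from typing import List

def check_checkpoint_alignment(eval_steps: int, save_steps: int) -> bool:
    """Return True when every checkpoint coincides with an evaluation step."""
    return save_steps % eval_steps == 0

def evaluation_points(total_steps: int, eval_steps: int) -> List[int]:
    """Global steps at which a step-based evaluation would run."""
    return list(range(eval_steps, total_steps + 1, eval_steps))

for candidate in (400, 500, 600):
    aligned = check_checkpoint_alignment(eval_steps=200, save_steps=candidate)
    print(f"save_steps={candidate}: {'aligned' if aligned else 'NOT aligned'}")

points = evaluation_points(total_steps=6315, eval_steps=200)
print(f"Evaluations in a full run: {len(points)} (last at step {points[-1]})")
```

With an early-stopping patience of 3 and evaluation every 200 steps, training would halt after 600 consecutive steps without an improvement in the tracked metric.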

## Part 6: Evaluation and Model Testing

Evaluate the fine-tuned model and test it with real examples.

In [None]:
# Final evaluation
print("🔬 Running final evaluation...")
eval_results = trainer.evaluate()

print("\n📊 Final Results:")
for metric, value in eval_results.items():
    if metric.startswith('eval_'):
        metric_name = metric.replace('eval_', '').title()
        if 'loss' in metric:
            print(f"   {metric_name}: {value:.4f}")
        else:
            print(f"   {metric_name}: {value:.4f} ({value*100:.2f}%)")

# Test with examples
test_examples = [
    "This movie is absolutely fantastic!",
    "What a terrible waste of time.",
    "The acting was mediocre but engaging.",
    "Outstanding performance by the actors!",
    "Worst movie I've ever seen.",
    "A masterpiece of cinema."
]

def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()
    return 'POSITIVE' if predicted_class == 1 else 'NEGATIVE', confidence

print("\n🧪 Testing model with examples:")
for i, example in enumerate(test_examples, 1):
    prediction, confidence = predict_sentiment(example)
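    # Only append an ellipsis when the text is actually truncated
    preview = example if len(example) <= 50 else example[:50] + "..."
    print(f"{i}. '{preview}'")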
    print(f"   → {prediction} (confidence: {confidence:.3f})")
    print()
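
The confidence value reported by `predict_sentiment` is the softmax probability of the predicted class. The sketch below reproduces that step in plain Python so the arithmetic is visible. The logits are made-up values for illustration, not real model output, and `softmax` here is a hand-written stand-in for `torch.nn.functional.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # same mapping as the model config

example_logits = [-1.2, 2.3]  # made-up logits, not model output
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax

print(f"{ID2LABEL[predicted]} (confidence: {probs[predicted]:.3f})")
```

Because softmax always sums to 1, a confidence near 0.5 in this binary setting signals that the model is undecided between the two classes.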

## Part 7: Model Saving and Deployment

Save the fine-tuned model for production deployment.

In [None]:
# Save the model
model_save_path = "./distilbert-sst2-finetuned"
print(f"💾 Saving model to: {model_save_path}")

trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

# Create model card
accuracy = eval_results.get('eval_accuracy', 0)
model_card = f"""
# DistilBERT Fine-tuned for Sentiment Classification

## Model Description
Fine-tuned DistilBERT on GLUE SST-2 for binary sentiment classification.

## Performance
- Validation Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)
- Training Time: {training_time/60:.1f} minutes
- Parameters: {model.num_parameters():,} (40% smaller than BERT)

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("{model_save_path}")
model = AutoModelForSequenceClassification.from_pretrained("{model_save_path}")

text = "This movie is great!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```

Optimized for AWS SageMaker Studio deployment.
"""

with open(f"{model_save_path}/README.md", "w") as f:
    f.write(model_card)

print("✅ Model saved successfully!")
print(f"📊 Final accuracy: {accuracy*100:.2f}%")
print(f"⏱️  Total training time: {training_time/60:.1f} minutes")
print(f"🚀 Ready for deployment on AWS SageMaker!")

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **Knowledge Distillation**: DistilBERT provides 97% of BERT performance with 40% fewer parameters
- **Efficient Fine-tuning**: GLUE SST-2 enables rapid convergence in 15-30 minutes
- **Cloud Optimization**: AWS SageMaker Studio specific configurations for maximum efficiency
- **Production Pipeline**: Complete workflow from data loading to model deployment

### 📈 Best Practices Learned
- **Model Selection**: Choose distilled models for speed with only a small loss in performance
- **Dataset Strategy**: Use benchmark datasets like GLUE for reliable, fast convergence
- **Training Optimization**: Mixed precision, dynamic padding, and early stopping
- **Evaluation Strategy**: Comprehensive metrics and real-world testing examples
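
To illustrate why dynamic padding helps, the sketch below counts pad tokens under two strategies: padding every sequence to a fixed `max_length` of 128, versus padding each batch only to its own longest sequence, which is the behavior `DataCollatorWithPadding` provides. The token lengths are hypothetical, not measured from SST-2, and both helper functions are illustrative.

```python
def padded_tokens_fixed(lengths, max_length):
    """Pad tokens added when every sequence is padded to max_length."""
    return sum(max_length - n for n in lengths)

def padded_tokens_dynamic(lengths, batch_size):
    """Pad tokens added when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Hypothetical token lengths for eight short sentences (all <= 128 after truncation)
lengths = [12, 25, 9, 31, 18, 7, 22, 15]

fixed = padded_tokens_fixed(lengths, max_length=128)
dynamic = padded_tokens_dynamic(lengths, batch_size=4)

print(f"Fixed padding to 128: {fixed} pad tokens")
print(f"Dynamic padding:      {dynamic} pad tokens")
```

Fewer pad tokens means less wasted computation per step, which matters most for datasets of short sentences such as SST-2.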

### 🚀 Next Steps
- **Advanced Techniques**: Explore PEFT techniques like LoRA for even more efficiency
- **Model Serving**: Deploy on AWS SageMaker endpoints
- **Domain Adaptation**: Apply to domain-specific datasets
- **Performance Monitoring**: Implement MLOps practices

### 💡 AWS SageMaker Key Takeaways
- **Instance Recommendation**: ml.g4dn.xlarge or ml.g5.xlarge optimal for this workflow
- **Cost Efficiency**: 15-30 minute training keeps costs low while achieving high performance
- **Scalability**: Approach scales well for larger datasets and production workloads
- **Integration**: Works seamlessly with SageMaker Hugging Face containers

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*