In [None]:
%pip install -U sagemaker

In [None]:
!huggingface-cli login --token XXXXX

## SageMaker Distributed Training using HuggingFace Managed Container Image

### 🚀 SageMaker Warm Pools Overview

SageMaker warm pools help reduce training job startup time by keeping instances "warm" and ready for reuse.

### ⚠️ **IMPORTANT: Warm Pool Availability**

**The warm pool is ONLY available AFTER your current job completes (succeeds or fails).**

- ❌ **First job**: Always experiences a cold start
- ❌ **Concurrent jobs**: Cannot share a warm pool - each gets a cold start
- ✅ **Sequential jobs**: Second job onwards can use warm pool

> **Don't create multiple jobs expecting immediate warm pool benefits!** The warm instance becomes available only after the current job finishes.

### ⚡ How Warm Pools Work

**First Training Job (Cold Start):**
- Instance provisioning: ~3-10 minutes
- Container image download: ~2-3 minutes  
- Environment setup: ~1-2 minutes
- **Total overhead: 6-15 minutes**

**Subsequent Jobs (Warm Start):**
- Instance reuse: ~5 seconds
- Cached container: ~0 seconds
- Environment setup: ~15 seconds
- **Total overhead: ~20 seconds** ⚡
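
To make the difference concrete, the figures above can be turned into a rough estimate for a sequence of jobs. This is a back-of-the-envelope sketch: `startup_overhead_minutes` is a hypothetical helper, the numbers are the approximate ranges quoted above rather than measurements, and it assumes each job starts while the warm pool is still alive.

```python
# Approximate startup overheads quoted above, in minutes.
COLD_START_MIN = (6.0, 15.0)   # (best case, worst case)
WARM_START_MIN = 20 / 60       # ~20 seconds


def startup_overhead_minutes(num_jobs, use_warm_pool):
    """Return (low, high) total startup overhead for sequential jobs."""
    if num_jobs <= 0:
        return (0.0, 0.0)
    if not use_warm_pool:
        return (num_jobs * COLD_START_MIN[0], num_jobs * COLD_START_MIN[1])
    # Only the first job pays the cold start; the rest reuse the instance.
    warm_total = (num_jobs - 1) * WARM_START_MIN
    return (COLD_START_MIN[0] + warm_total, COLD_START_MIN[1] + warm_total)


cold = startup_overhead_minutes(4, use_warm_pool=False)
warm = startup_overhead_minutes(4, use_warm_pool=True)
print(f"4 jobs, no warm pool: {cold[0]:.0f}-{cold[1]:.0f} min")
print(f"4 jobs, warm pool:    {warm[0]:.0f}-{warm[1]:.0f} min")
```

For four sequential jobs this works out to roughly 7-16 minutes of startup overhead with a warm pool, versus 24-60 minutes without one.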

### 🎯 Warm Pool Requirements

For a job to reuse a warm instance, its configuration must **exactly match** that of the previous job, including:
- ✅ **Instance type** (e.g., `ml.g5.12xlarge`)
- ✅ **Instance count** (e.g., `1`)
- ✅ **Volume size** (e.g., `100 GB`)
- ✅ **Network configuration** (VPC, subnets, security groups)
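
One way to catch mismatches before submitting a job is to compare the relevant settings explicitly. The sketch below is illustrative only: `warm_pool_mismatches` and the dictionary keys are hypothetical names, not part of the SageMaker SDK, and it covers just the four settings listed above.

```python
# Settings from the list above that must match for warm pool reuse.
MATCH_KEYS = ("instance_type", "instance_count", "volume_size", "network")


def warm_pool_mismatches(previous_job, next_job):
    """Return the names of settings that differ between two job configs."""
    return [
        key for key in MATCH_KEYS
        if previous_job.get(key) != next_job.get(key)
    ]


job_a = {
    "instance_type": "ml.g5.12xlarge",
    "instance_count": 1,
    "volume_size": 100,
    "network": None,  # no VPC configuration
}
job_b = dict(job_a, volume_size=200)  # same job with a larger volume

print(warm_pool_mismatches(job_a, job_a))  # empty list: eligible for reuse
print(warm_pool_mismatches(job_a, job_b))  # lists the differing setting
```

An empty result means the two configurations agree on every checked setting; any returned name points at a change that would force a cold start.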

### 🏗️ Container Image Caching

**SageMaker Managed Images** (like HuggingFace):
- ✅ **Cached automatically** - no re-download needed
- ✅ **Faster startup** - image already available locally

**Custom Docker Images**:
- ❌ **Not cached** - must be pulled each time
- ⏱️ **Slower startup** - full image download required

### 💡 Best Practices

1. **Configure warm pools** with `keep_alive_period_in_seconds`
2. **Use consistent configurations** across related training jobs
3. **Leverage SageMaker managed images** when possible
4. **Plan job sequences** to maximize warm pool utilization

### 🔄 Typical Workflow for Warm Pool Benefits

```python
# Job 1: Cold start (6-15 min overhead)
# fit(wait=True) blocks until the job finishes and returns None,
# so no separate .wait() call is needed.
huggingface_estimator.fit(wait=True)  # First job - always cold start

# Job 2: Warm start (~20 sec overhead) ⚡
huggingface_estimator.fit(wait=True)  # Now uses warm pool!

# Job 3: Warm start (~20 sec overhead) ⚡
huggingface_estimator.fit(wait=True)  # Continues using warm pool
```

In [11]:
import sagemaker
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder
import time
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner, CategoricalParameter

# Define Training Job Name with timestamp for uniqueness
job_name = f'qwen-dpo-training-{int(time.time())}'
role = sagemaker.get_execution_role()

# Training metrics for SageMaker SDK to track
metric_definitions = [

    # Average training loss (logged at end)
    {
        "Name": "train:loss",
        "Regex": r"'train_loss':\s*([0-9\.]+)"  # As printed out to stdout
    }
]

# Create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='dpo_train.py',              # train script
    source_dir='code-hpo',                   # directory with all training files
    instance_type='ml.g5.2xlarge',          # GPU instance
    instance_count=1,                        # instance count
    base_job_name=job_name,                  # base name for training job
    role=role,                               # IAM role for AWS resources
    volume_size=30,                         # EBS volume size in GB
    transformers_version='4.49.0',           # transformers version
    pytorch_version='2.5.1',                 # pytorch version
    py_version='py311',                      # python version
    disable_output_compression=True,         # faster training completion
    output_path="s3://BUCKET NAME/PREFIX/",      # S3 location where training output is stored

    # Environment variables
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # Hugging Face Hub cache location
        "HF_TOKEN": HfFolder.get_token(),        # Hugging Face access token
        "TOKENIZERS_PARALLELISM": "false",        # Prevent tokenizer warnings
        "PIP_CACHE_DIR": "/opt/ml/sagemaker/warmpoolcache/pip"  # Persistent dir across warm pool runs, used to cache pip installs
    },

    # Hyperparameters - these override values in dpo_config.yaml
    hyperparameters={
        # Config file name
        "config_path": "dpo_config.yaml",  # As saved in the `source_dir` parameter above
        
        # LoRA parameters
        "lora_r": "32",  
        "lora_alpha": 64,  
        "lora_dropout": 0.05, 

        # Training parameters
        "learning_rate": 1e-4,  
        "per_device_train_batch_size": 4,  
        "num_train_epochs": 3,  
        "gradient_accumulation_steps": 2,  

        # DPO parameters
        "dpo_beta": 0.2,

        # Data parameters
        "num_samples": 1000, 
        "data_processor_type": "anthropic_hh_rlhf",
    },

    # Add metric definitions to capture to sagemaker default metrics chart
    metric_definitions=metric_definitions,
    
    distribution={
        "torch_distributed": {  # this uses torchrun command on your script with the following arguments
            "enabled": True,
        }
    },
    keep_alive_period_in_seconds=1800,       # Keep the warm pool for 30 minutes (maximum is 60 minutes)
)

# No need to call the .fit method to start the training job since we are using SageMaker automatic model tuning below
# huggingface_estimator.fit()
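
Before launching jobs, it can help to confirm that the `train:loss` regex in `metric_definitions` actually matches what the training script prints. The check below is a sketch: the sample log line is an assumption modeled on the summary dictionary that the Hugging Face `Trainer` typically prints at the end of training, so substitute a real line from your own logs.

```python
import re

# Same pattern as the "train:loss" entry in metric_definitions.
TRAIN_LOSS_REGEX = r"'train_loss':\s*([0-9\.]+)"

# Assumed sample of the end-of-training summary line (not real output).
sample_log_line = "{'train_runtime': 512.3, 'train_loss': 0.6931, 'epoch': 3.0}"

match = re.search(TRAIN_LOSS_REGEX, sample_log_line)
train_loss = float(match.group(1)) if match else None
print(train_loss)

# Caveat: the character class has no 'e' or '-', so a value printed in
# scientific notation is only partially captured.
partial = re.search(TRAIN_LOSS_REGEX, "'train_loss': 1e-05").group(1)
print(partial)
```

If very small loss values are possible in your runs, widening the character class to cover exponent notation may be worth considering.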

## 🎯 Hyperparameter Optimization & Model Fine-tuning

This section leverages **Amazon SageMaker's powerful ML infrastructure** to perform automated hyperparameter optimization (HPO) for fine-tuning our OSS models.

**SageMaker Automatic Model Tuning (AMT)** eliminates manual hyperparameter guesswork:
- 🔬 **Optimization strategies**: choose from `Bayesian`, `Grid`, `Random`, and `Hyperband` to explore the hyperparameter space
- 📊 **Multi-metric tracking** monitors training progress in real time
- ⚡ **Parallel experimentation** runs multiple configurations simultaneously
- 💰 **Cost-effective**: can stop underperforming jobs early
- **Framework agnostic**: works with any ML framework and with custom images

### 🔍 Automated Hyperparameter Search

**SageMaker AMT** explores the parameter space automatically, helping us find good hyperparameters efficiently while maintaining reproducibility and cost control for the fine-tuning job.

In [12]:
hyperparameter_ranges_qwen = {
    "lora_r": CategoricalParameter(["32", "64"]),
    "lora_dropout": ContinuousParameter(0, 0.1),
    "learning_rate": ContinuousParameter(0.000005, 0.0002),
    "per_device_train_batch_size": IntegerParameter(1,4)
}

tuner = HyperparameterTuner(
        base_tuning_job_name="dpo-qwen-sft",
        estimator=huggingface_estimator,
        objective_metric_name="train:loss",  # Name of a metric defined in the Estimator class above to track for hpo
        hyperparameter_ranges=hyperparameter_ranges_qwen,
        metric_definitions=metric_definitions,
        max_jobs=4,  # Total number of jobs (candidates)
        max_parallel_jobs=2,  # Number of jobs (candidates) to run in parallel
        objective_type="Minimize",  # Either `Maximize` or `Minimize` the objective metric
        strategy="Bayesian"  # Other supported strategies: `Random`, `Grid`, `Hyperband`
    )

# wait=False returns immediately instead of blocking until the tuning job finishes (default is True)
tuner.fit(wait=False)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating hyperparameter tuning job with name: dpo-qwen-sft-250909-2058
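
To build intuition for what a tuning run explores, the sketch below draws candidate configurations from the same ranges with plain Python. It mimics a simple random search only; it is not how AMT's Bayesian strategy chooses candidates, and `sample_candidate` is a hypothetical helper. Sampling the learning rate on a log scale is an assumption made here because the range spans more than an order of magnitude.

```python
import math
import random


def sample_candidate(rng):
    """Draw one configuration from the ranges used in the tuner above."""
    # Log-uniform draw between 5e-6 and 2e-4 (assumption, see lead-in).
    log_lr = rng.uniform(math.log(5e-6), math.log(2e-4))
    return {
        "lora_r": rng.choice(["32", "64"]),
        "lora_dropout": rng.uniform(0.0, 0.1),
        "learning_rate": math.exp(log_lr),
        "per_device_train_batch_size": rng.randint(1, 4),  # inclusive bounds
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
candidates = [sample_candidate(rng) for _ in range(4)]  # max_jobs=4
for candidate in candidates:
    print(candidate)
```

In the SDK, `ContinuousParameter` accepts a `scaling_type` argument (for example `"Logarithmic"`) if you want AMT itself to search a range on a log scale.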


### You can check the status of the tuning job from the SageMaker console

<img src="hpo-dpo.PNG" width="1000"/>
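
The same status can also be read programmatically. The sketch below assumes a boto3 SageMaker client and uses its `describe_hyper_parameter_tuning_job` call; `summarize_tuning_job` is a hypothetical helper, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def summarize_tuning_job(sagemaker_client, tuning_job_name):
    """Return the status, job counters, and best job of a tuning job."""
    response = sagemaker_client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name
    )
    best = response.get("BestTrainingJob", {})  # absent until a job completes
    metric = best.get("FinalHyperParameterTuningJobObjectiveMetric", {})
    return {
        "status": response["HyperParameterTuningJobStatus"],
        "counters": response.get("TrainingJobStatusCounters", {}),
        "best_job": best.get("TrainingJobName"),
        "best_value": metric.get("Value"),
    }


# Usage in the notebook (requires AWS credentials):
#   import boto3
#   client = boto3.client("sagemaker")
#   name = tuner.latest_tuning_job.name
#   print(summarize_tuning_job(client, name))
```

Because `tuner.fit(wait=False)` returns immediately, calling a helper like this periodically is one way to follow progress without leaving the notebook.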