In [None]:
# Read the token from the HF_TOKEN environment variable; never hard-code credentials in a notebook
!huggingface-cli login --token $HF_TOKEN

## SageMaker Distributed Training using HuggingFace Managed Container Image

### 🚀 SageMaker Warm Pools Overview

SageMaker warm pools help reduce training job startup time by keeping instances "warm" and ready for reuse.

### ⚠️ **IMPORTANT: Warm Pool Availability**

**The warm pool is ONLY available AFTER your current job completes (succeeds or fails).**

- ❌ **First job**: Always experiences a cold start
- ❌ **Concurrent jobs**: Cannot share a warm pool, so each one gets a cold start
- ✅ **Sequential jobs**: The second job onwards can use the warm pool

> **Don't create multiple jobs expecting immediate warm pool benefits!** The warm instance becomes available only after the current job finishes.
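
One way to confirm that a pool has become usable is to inspect the `WarmPoolStatus` field that `DescribeTrainingJob` returns for a job created with a keep-alive period. The sketch below works on a response dictionary; the helper name `summarize_warm_pool` is hypothetical, and the sample response is invented for illustration rather than taken from a real job.

```python
def summarize_warm_pool(describe_response: dict) -> str:
    """Summarize warm pool state from a DescribeTrainingJob response dict."""
    job_status = describe_response.get("TrainingJobStatus", "Unknown")
    warm_pool = describe_response.get("WarmPoolStatus")
    if warm_pool is None:
        # No warm pool information in the response
        return f"job={job_status}, warm pool: not reported"
    status = warm_pool.get("Status", "Unknown")
    # Only an "Available" pool can be picked up by the next matching job
    reusable = status == "Available"
    return f"job={job_status}, warm pool: {status}, reusable_now={reusable}"


# Illustrative response (values invented for this example)
sample = {
    "TrainingJobStatus": "Completed",
    "WarmPoolStatus": {"Status": "Available"},
}
print(summarize_warm_pool(sample))

# Against a real job (requires AWS credentials), the response would come from:
# import boto3
# response = boto3.client("sagemaker").describe_training_job(
#     TrainingJobName="<your-training-job-name>"
# )
```

Checking for `Available` before submitting the next job is a simple guard against launching a second job while the first one still holds the instances.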

### ⚡ How Warm Pools Work

**First Training Job (Cold Start):**
- Instance provisioning: ~3-10 minutes
- Container image download: ~2-3 minutes  
- Environment setup: ~1-2 minutes
- **Total overhead: 6-15 minutes**

**Subsequent Jobs (Warm Start):**
- Instance reuse: ~5 seconds
- Cached container: ~0 seconds
- Environment setup: ~15 seconds
- **Total overhead: ~20 seconds** ⚡
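
To see what these figures mean over a series of runs, the following sketch adds up startup overhead for a batch of sequential jobs. It assumes every job after the first lands inside the keep-alive window and matches the pool; the function name `total_overhead_seconds` and the choice of five jobs are illustrative, and the timings are the approximate ranges quoted above rather than measurements.

```python
COLD_START_RANGE_SEC = (6 * 60, 15 * 60)  # 6-15 minutes, from the cold start breakdown
WARM_START_SEC = 20                       # ~20 seconds, from the warm start breakdown


def total_overhead_seconds(num_jobs, cold_start_sec,
                           warm_start_sec=WARM_START_SEC, use_warm_pool=True):
    """Total startup overhead for num_jobs jobs run one after another."""
    if num_jobs <= 0:
        return 0
    if not use_warm_pool:
        # Every job pays the full cold start
        return num_jobs * cold_start_sec
    # The first job is always cold; the rest reuse the warm pool
    return cold_start_sec + (num_jobs - 1) * warm_start_sec


for cold in COLD_START_RANGE_SEC:
    without_pool = total_overhead_seconds(5, cold, use_warm_pool=False)
    with_pool = total_overhead_seconds(5, cold)
    print(f"cold start {cold}s: {without_pool}s without warm pool, {with_pool}s with warm pool")
```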

### 🎯 Warm Pool Requirements

For an instance to be reused, the following must **exactly match**:
- ✅ **Instance type** (e.g., `ml.g5.12xlarge`)
- ✅ **Instance count** (e.g., `1`)
- ✅ **Volume size** (e.g., `100 GB`)
- ✅ **Network configuration** (VPC, subnets, security groups)
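
A quick pre-flight check can catch a mismatch before a job is submitted. The sketch below only mirrors the four items listed above; SageMaker performs the actual matching on its side, and the `PoolConfig` class and `mismatched_fields` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PoolConfig:
    """The settings from the list above that must match for reuse."""
    instance_type: str
    instance_count: int
    volume_size_gb: int
    subnets: frozenset = frozenset()
    security_group_ids: frozenset = frozenset()


def mismatched_fields(previous: PoolConfig, candidate: PoolConfig) -> list:
    """Return the names of settings that differ between two job configs."""
    return [
        f.name
        for f in fields(previous)
        if getattr(previous, f.name) != getattr(candidate, f.name)
    ]


first_job = PoolConfig("ml.g5.12xlarge", 1, 100)
same_job = PoolConfig("ml.g5.12xlarge", 1, 100)
smaller_volume = PoolConfig("ml.g5.12xlarge", 1, 30)

print(mismatched_fields(first_job, same_job))        # empty list: eligible for reuse
print(mismatched_fields(first_job, smaller_volume))  # the volume size differs
```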

### 🏗️ Container Image Caching

**SageMaker Managed Images** (like HuggingFace):
- ✅ **Cached automatically** - no re-download needed
- ✅ **Faster startup** - image already available locally

**Custom Docker Images**:
- ❌ **Not cached** - must be pulled each time
- ⏱️ **Slower startup** - full image download required

### 🔄 Typical Workflow for Warm Pool Benefits

```python
# Job 1: cold start (6-15 min overhead); the first job never has a warm pool
# fit() returns None and, with wait=True, blocks until the job finishes
huggingface_estimator.fit(wait=True)

# Job 2: warm start (~20 sec overhead) ⚡
huggingface_estimator.fit(wait=True)  # reuses the warm pool left by job 1

# Job 3: warm start (~20 sec overhead) ⚡
huggingface_estimator.fit(wait=True)  # continues using the warm pool
```

In [105]:
import sagemaker
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder
import time

# Base job name with a timestamp for uniqueness (the SDK also appends its own timestamp suffix)
job_name = f'qwen-dpo-training-{int(time.time())}'
role = sagemaker.get_execution_role()

# Create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='dpo_train.py',              # train script
    source_dir='code',                       # directory with all training files
    instance_type='ml.g5.12xlarge',          # GPU instance
    instance_count=1,                        # instance count
    base_job_name=job_name,                  # base name for training job
    role=role,                               # IAM role for AWS resources
    volume_size=30,                          # EBS volume size in GB
    transformers_version='4.49.0',           # transformers version
    pytorch_version='2.5.1',                 # pytorch version
    py_version='py311',                      # python version
    disable_output_compression=True,         # skip compressing the output artifacts so the job finishes faster
    output_path="s3://mm-fsi-fix/dpo/",      # S3 location where training output is stored

    # Environment variables
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # Hugging Face Hub cache location
        "HF_TOKEN": HfFolder.get_token(),        # Hugging Face access token saved by the login step
        "TOKENIZERS_PARALLELISM": "false",       # prevent tokenizer parallelism warnings
        "PIP_CACHE_DIR": "/opt/ml/sagemaker/warmpoolcache/pip"  # persistent directory across warm pool runs, used to cache pip installs
    },
    distribution={
        "torch_distributed": {  # launch the training script with torchrun
            "enabled": True,
        }
    },
    keep_alive_period_in_seconds=1800,       # keep the warm pool for 30 minutes (maximum is 3600 seconds, i.e. 60 minutes)
)

# Start training
huggingface_estimator.fit()

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: qwen-dpo-training-1757440457-2025-09-09-17-54-17-220


2025-09-09 17:54:20 Starting - Found matching resource for reuse..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
CUDA compat package should be installed for NVIDIA driver smaller than 550.163.01
Current installed NVIDIA driver version is 570.172.08
Skipping CUDA compat setup as newer NVIDIA driver is installed
2025-09-09 17:54:30,946 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-09-09 17:54:30,984 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-09-09 17:54:30,993 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-09-09 17:54:30,995 sagemaker_pytorch_container.training INFO     Invoking TorchDistributed...
2025-09-09 17:54:30,995 sagemaker_pytorch_container.training INFO     Invoking user training scri