## ML Training with SageMaker Training Jobs (with W&B)

This example notebook invokes the SageMaker Training Job service to train an MNIST classifier.

The notebook walks through creating a SageMaker execution role, storing the W&B API key as a secret, creating an S3 bucket for the job output, and launching a SageMaker Training Job.




### Setup

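Enable automatic reloading of imported local scripts, so that edits to them take effect without restarting the kernel.

In [None]:
# reload edited modules automatically before each cell runs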
%load_ext autoreload
%autoreload 2

Load environment variables from the `.env` file.

In [None]:
import boto3
import json
import os

from botocore.exceptions import ClientError

from dotenv import load_dotenv
load_dotenv("../../.env")
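
`os.environ.get` returns `None` when a variable is missing, so a missing `WANDB_API_KEY` would only surface later, when the secret is created. A minimal sketch of an early check follows; `require_env` is a hypothetical helper and not part of `scripts.utils`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to the .env file or export it first.")
    return value


# Example: api_key = require_env("WANDB_API_KEY")
```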

Create the SageMaker execution role if it doesn't exist. The IAM role needs permission to:
* pull the training image from ECR
* read the W&B secret
* add tags to the SageMaker training job that record the W&B project and checkpoint, which supports training resiliency
* write CloudWatch logs and metrics
* read from and write to the S3 output path

In [None]:
from scripts.utils import create_sagemaker_execution_role, create_wandb_secret, create_s3_bucket

iam_role = create_sagemaker_execution_role("sagemaker-execution-role")
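
The helper `create_sagemaker_execution_role` lives in `scripts/utils.py` and is not shown here. As a rough sketch of what such a role involves, the snippet below builds a trust policy that lets SageMaker assume the role, plus one example permission statement for reading the W&B secret. The function names, the `Sid`, and the ARN scope are assumptions for illustration, not the helper's actual implementation.

```python
import json


def build_trust_policy() -> dict:
    """Trust policy allowing the SageMaker service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_secret_read_statement(secret_name: str) -> dict:
    """Permission statement for reading one Secrets Manager secret (hypothetical scope)."""
    return {
        "Sid": "AllowReadWandbSecret",
        "Effect": "Allow",
        "Action": ["secretsmanager:GetSecretValue"],
        # Secrets Manager appends a random suffix to secret ARNs, hence the trailing wildcard.
        "Resource": f"arn:aws:secretsmanager:*:*:secret:{secret_name}-*",
    }


print(json.dumps(build_trust_policy(), indent=2))
```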

Create the W&B API key secret in AWS Secrets Manager. The Training Job uses it to integrate with W&B for experiment tracking and checkpoint storage.

In [None]:
# creating api key secret
wandb_secret_name = "wandb-secret"
create_wandb_secret(wandb_secret_name, os.environ.get("WANDB_API_KEY"))
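
Inside the training job, `train.py` needs to turn `WANDB_SECRET_NAME` back into an API key. The training script is not shown in this notebook, so the following is only a sketch of one way to do it with the Secrets Manager `get_secret_value` call. The helper name `read_wandb_api_key` and the assumption that the secret is stored either as a plain string or as JSON with an `api_key` field are illustrative.

```python
import json


def read_wandb_api_key(client, secret_name: str) -> str:
    """Fetch the W&B API key from Secrets Manager.

    `client` is a boto3 Secrets Manager client, e.g. boto3.client("secretsmanager").
    """
    response = client.get_secret_value(SecretId=secret_name)
    secret = response["SecretString"]
    try:
        # Assumption: a JSON secret keeps the key under "api_key".
        parsed = json.loads(secret)
    except json.JSONDecodeError:
        return secret  # plain-string secret
    if isinstance(parsed, dict):
        return parsed["api_key"]
    return secret


# Usage in train.py (requires boto3, wandb, and AWS credentials):
#   client = boto3.client("secretsmanager")
#   wandb.login(key=read_wandb_api_key(client, os.environ["WANDB_SECRET_NAME"]))
```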

Create an S3 bucket for the SageMaker Training Job output.
* Make sure the bucket name matches the S3 permissions of the IAM role created by `create_sagemaker_execution_role`.
* For reference, the policy statement below only allows resources whose ARN contains the keyword `sagemaker`:

```
{
    "Sid": "AllowS3ObjectActions",
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload"
    ],
    "Resource": [
        "arn:aws:s3:::*SageMaker*",
        "arn:aws:s3:::*Sagemaker*",
        "arn:aws:s3:::*sagemaker*"
    ]
}
```
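
New S3 bucket names must be lowercase, so only the `*sagemaker*` pattern can match on the bucket name itself; the other two variants can only match through the object key. A quick way to check a candidate name before creating the bucket is sketched below. IAM wildcards are approximated with `fnmatch`, and `bucket_matches_policy` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Resource patterns copied from the policy statement above.
RESOURCE_PATTERNS = [
    "arn:aws:s3:::*SageMaker*",
    "arn:aws:s3:::*Sagemaker*",
    "arn:aws:s3:::*sagemaker*",
]


def bucket_matches_policy(bucket_name: str) -> bool:
    """Return True if the bucket name alone satisfies one of the resource patterns.

    If the ARN prefix up to the bucket matches, every object key in the bucket
    matches too, because each pattern ends with a wildcard.
    """
    bucket_prefix = f"arn:aws:s3:::{bucket_name}/"
    return any(fnmatchcase(bucket_prefix, pattern) for pattern in RESOURCE_PATTERNS)


print(bucket_matches_policy("sagemaker-wandb-samples"))
```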

In [None]:
# Without a region parameter, the bucket is created in us-east-1 by default.
bucket_name = "sagemaker-wandb-samples"
create_s3_bucket(bucket_name)
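
The S3 `CreateBucket` API treats `us-east-1` as the default location and expects a `LocationConstraint` for other regions. `create_s3_bucket` is defined in `scripts/utils.py` and may already handle this; the sketch below only illustrates how the arguments for boto3's `create_bucket` could be assembled, with `create_bucket_kwargs` as a hypothetical helper.

```python
def create_bucket_kwargs(bucket_name: str, region: str = "us-east-1") -> dict:
    """Build keyword arguments for boto3's S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Outside us-east-1 the target region must be stated explicitly.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage (requires boto3 and AWS credentials):
#   s3 = boto3.client("s3", region_name="eu-west-1")
#   s3.create_bucket(**create_bucket_kwargs("sagemaker-wandb-samples", "eu-west-1"))
```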

### Set hyperparameters

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
region = session.boto_region_name

instance_type = 'ml.g6.xlarge'
training_job_output = f"s3://{bucket_name}/training-jobs/"

# image URI when using Bring Your Own Container (define AWS_ACCOUNT_ID before uncommenting)
# image_uri = f"{AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/mnist-training:latest"

estimator = PyTorch(
    # image_uri=image_uri,
    framework_version="2.7",
    py_version="py312",
    entry_point="train.py",
    source_dir="./src",
    role=iam_role,
    instance_type=instance_type, 
    instance_count=1,
    volume_size=50,
    output_path=training_job_output,
    hyperparameters={
        "epochs": 5
    }, 
    environment={
        "WANDB_SECRET_NAME": wandb_secret_name,
        "WANDB_PROJECT": "MNIST",
        "AWS_DEFAULT_REGION": region,  # lets the training script reach regional resources such as the secret
        # "WANDB_CHECKPOINT_NAME": 
        # "WANDB_CHECKPOINT_TAG": "latest"
    },
)
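
In script mode, the SageMaker framework containers pass each entry in `hyperparameters` to the entry point as a command-line argument (here `--epochs 5`), while the `environment` values are available through `os.environ`. `train.py` itself is not shown, so this is only a sketch of how it might read those inputs; `parse_args` and the default values are assumptions.

```python
import argparse
import os


def parse_args(argv=None) -> argparse.Namespace:
    """Parse hyperparameters passed by SageMaker as command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    # SM_MODEL_DIR is set by SageMaker; files saved there are uploaded under output_path.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    # Ignore any extra arguments the container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "5"])
project = os.environ.get("WANDB_PROJECT", "MNIST")
print(args.epochs, project)
```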

In [None]:
estimator.fit(wait=False)
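
With `wait=False`, `fit` returns as soon as the job is submitted, so the notebook does not stream logs or report the final status. One way to check on the job later is to poll `DescribeTrainingJob`; the sketch below assumes a boto3 SageMaker client and uses a hypothetical helper name, `wait_for_training_job`.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name: str, poll_seconds: float = 30.0, max_polls: int = 120) -> str:
    """Poll describe_training_job until the job reaches a terminal status.

    `client` is a boto3 SageMaker client, e.g. boto3.client("sagemaker").
    Returns the last observed status, which may be non-terminal if max_polls is hit.
    """
    status = "Unknown"
    for _ in range(max_polls):
        status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    return status


# Usage (requires boto3 and AWS credentials):
#   job_name = estimator.latest_training_job.name
#   print(wait_for_training_job(boto3.client("sagemaker"), job_name))
```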