# [Draft] STG406 - ML Notebook

This is an evolving notebook for collaboratively developing the ML part of the STG406 workshop.

## (Optional) Create synthetic dataset and store it on S3

> ⚠️ Note: this step will likely NOT be part of the final notebook. It is provided only for convenience during development, and sets up a minimal benchmark environment for testing without leaving this notebook.

#### Clone the SageMaker Bencher repo to bootstrap S3

We will only use the capability of SageMaker Bencher to automatically create synthetic datasets and upload them to S3. No other actions will be performed by SageMaker Bencher.

In [8]:
!git clone https://github.com/aws-samples/sagemaker-bencher.git

fatal: destination path 'sagemaker-bencher' already exists and is not an empty directory.
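
Re-running the clone cell fails with the `fatal: destination path ... already exists` message once the repository has been checked out. A minimal sketch of an idempotent variant, assuming `git` is available on the `PATH`; the helper name `clone_once` is hypothetical and not part of SageMaker Bencher:

```python
import os
import subprocess

REPO_URL = "https://github.com/aws-samples/sagemaker-bencher.git"


def clone_once(url: str, dest: str) -> bool:
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was performed, False if it was skipped.
    """
    if os.path.isdir(dest):
        # Directory already present (e.g. from a previous run): nothing to do
        return False
    # check=True raises CalledProcessError if git exits with an error
    subprocess.run(["git", "clone", url, dest], check=True)
    return True
```

In the notebook, `clone_once(REPO_URL, "sagemaker-bencher")` could stand in for the `!git clone` line so the cell can be re-run safely.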


#### Install all Python modules to run SageMaker Bencher locally

In [None]:
!pip install -q -r sagemaker-bencher/requirements.txt

#### Staging synthetic data in S3 bucket
> ⚠️ Note: before continuing, make sure to set the desired AWS region for the S3 staging bucket in the YAML file below.

In [15]:
%%writefile sagemaker-bencher/experiments/stg406-bootstrap-s3.yml

# --- MAIN EXPERIMENT SETTINGS
name: dummy
role: dummy
region: us-west-2
output_prefix: dummy
bucket: workshop-us-west-2-277453393386

parallelism: 0

# --- DATASET DEFINITIONS
datasets:
    Synth-jpg-10GB-a-100KB:
        type: synthetic
        format: jpg
        bucket: workshop-us-west-2-277453393386
        prefix: 'datasets/synthetic'
        dimension: [290, 290, 3]
        num_files: 10000
        num_copies: 10
        num_classes: 4

    Synth-jpg-5GB-a-100KB:
        type: synthetic
        format: jpg
        bucket: workshop-us-west-2-277453393386
        prefix: 'datasets/synthetic'
        dimension: [290, 290, 3]
        num_files: 10000
        num_copies: 5
        num_classes: 4

    Synth-jpg-100MB-a-100KB:
        type: synthetic
        format: jpg
        bucket: workshop-us-west-2-277453393386
        prefix: 'datasets/synthetic'
        dimension: [290, 290, 3]
        num_files: 1000
        num_copies: 1
        num_classes: 4
    
    Synth-tar-jpg-10GB-a-100MB:
        type: synthetic
        format: tar/jpg
        bucket: workshop-us-west-2-277453393386
        prefix: 'datasets/synthetic'
        dimension: [290, 290, 3]
        num_records: 1000
        num_files: 10
        num_copies: 10
        num_classes: 4

    Synth-tar-jpg-5GB-a-100MB:
        type: synthetic
        format: tar/jpg
        bucket: workshop-us-west-2-277453393386
        prefix: 'datasets/synthetic'
        dimension: [290, 290, 3]
        num_records: 1000
        num_files: 10
        num_copies: 5
        num_classes: 4

    100k-samples-small-files:
        type: synthetic
        format: jpg
        bucket: workshop-us-west-2-277453393386
        prefix: 'datasets/synthetic'
        dimension: [290, 290, 3]
        num_files: 10000
        num_copies: 10
        num_classes: 4


# --- TRIAL DEFINITIONS
# Default values for trials (if not overridden by a specific trial)
base_trial:
  script: benchmark-tensorflow.py


# Trial definitions
trials:

  - inputs:
      train:
        dataset: Synth-jpg-10GB-a-100KB

  - inputs:
      train:
        dataset: Synth-jpg-100MB-a-100KB

  - inputs:
      train:
        dataset: 100k-samples-small-files

Overwriting sagemaker-bencher/experiments/stg406-bootstrap-s3.yml
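
The dataset names encode the intended total size and per-object size: `Synth-jpg-10GB-a-100KB`, for example, is meant to hold about 10 GB of roughly 100 KB images. A rough sketch of that arithmetic, assuming each synthetic JPEG is about 100 KB (taken from the names, not measured) and that `num_copies` multiplies the generated files; `estimate_size_bytes` is a hypothetical helper, not part of SageMaker Bencher:

```python
IMAGE_BYTES = 100_000  # assumed ~100 KB per synthetic JPEG (from the dataset names)


def estimate_size_bytes(num_files, num_copies, num_records=1):
    """Estimate the total dataset size in bytes.

    For plain jpg datasets each file is one image (num_records=1).
    For tar/jpg datasets each tar file packs `num_records` images.
    """
    return num_files * num_copies * num_records * IMAGE_BYTES


# Values copied from the YAML dataset definitions
jpg_10gb = estimate_size_bytes(num_files=10000, num_copies=10)
jpg_100mb = estimate_size_bytes(num_files=1000, num_copies=1)
tar_10gb = estimate_size_bytes(num_files=10, num_copies=10, num_records=1000)

print(jpg_10gb / 1e9, "GB")    # 10.0 GB
print(jpg_100mb / 1e6, "MB")   # 100.0 MB
print(tar_10gb / 1e9, "GB")    # 10.0 GB
```

Under these assumptions the tar datasets hold the same number of images as their jpg counterparts, packed into 100 MB shards instead of 100 KB objects.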


In [16]:
!python sagemaker-bencher/bencher.py -f sagemaker-bencher/experiments/stg406-bootstrap-s3.yml --bootstrap

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/jupyter-admin/.config/sagemaker/config.yaml
2024-10-29 15:49:58.539257: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1730216998.556987   55985 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1730216998.562400   55985 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-29 15:49:58.580433: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: A

## Benchmark training job setup

### Create the remotely runnable Python training script

In [None]:
%%writefile scripts/benchmark_dev.py

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

import os
import glob
import json
import argparse
import time
import random

import numpy as np
from PIL import Image

import webdataset as wds
#import s3torchconnector as s3pt

import torch
import torch.nn as nn
import torchdata
from torchvision.transforms import v2 as tvt
#from transformers import ViTForImageClassification

import ray.train.torch


################## BENCHMARK PARAMETERS DEFINITION ###################

def parse_args():
    
    def none_or_int(value):
        if str(value).upper() == 'NONE':
            return None
        return int(value)
    
    def none_or_str(value):
        if str(value).upper() == 'NONE':
            return None
        return str(value)
    
    def str_bool(value):
        if str(value).upper() == 'TRUE':
            return True
        elif str(value).upper() == 'FALSE':
            return False
        else:
            raise TypeError("Must be True or False.")
    
    parser = argparse.ArgumentParser()

    ### Parameters that define dataloader part 
    parser.add_argument('--batch_size', type=int, default=64)
    parser.add_argument('--num_workers', type=int, default=0)
    # DataLoader requires prefetch_factor=None when num_workers=0, so default to None
    parser.add_argument('--prefetch_size', type=none_or_int, default=None)
    parser.add_argument('--input_dim', type=int, default=224)
    parser.add_argument('--pin_memory', type=str_bool, default=True)
    parser.add_argument('--batch_drop_last', type=str_bool, default=False)
    
    ### Parameters that define computation part 
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--compute_time', type=none_or_int, default=None) # in milliseconds
    parser.add_argument('--num_nodes', type=int, default=2)
   
    ### Parameters that define some storage and dataset details for benchmarking 
    #parser.add_argument('--bucket_mount_path', type=str)
    #parser.add_argument('--bucket_dataset_prefix', type=str)
    parser.add_argument('--dataset_path', type=str)
    parser.add_argument('--dataset_format', type=str)
    parser.add_argument('--dataset_num_samples', type=int, default=100000)

    return parser.parse_known_args()


################## MODEL IMPLEMENTATION ###################

class ModelMock(torch.nn.Module):
    '''Model mock to emulate a computation of a training step'''
    def __init__(self, config):
        super().__init__()
        self.dummy_module = torch.nn.Linear(10, 10)
        self.config = config
    
    def forward(self, data, target, batch_idx):
        # Emulate one training step by sleeping for `compute_time` milliseconds
        if self.config.compute_time > 0:
            time.sleep(self.config.compute_time / 1000)
        return None


################## DATASET IMPLEMENTATIONS ###################

def identity(x):
    return x

class MapDataset(torch.utils.data.Dataset):
    def __init__(self, files, transform):
        self._files = np.array(files)
        self._transform = transform
   
    @staticmethod
    def _get_label(file):
        return file.split(os.path.sep)[-2]
    
    @staticmethod
    def _read(file):
        return Image.open(file).convert('RGB')
    
    def __len__(self):
        return len(self._files)
    
    def __getitem__(self, idx):
        file = self._files[idx]
        sample = self._transform(self._read(file))
        label = int(self._get_label(file))    # Labels in [0, MAX) range
        return sample, label

def _make_pt_dataset(config, transform):
    files = glob.glob(config.dataset_path + '/**/*.jpg')
    dataset = MapDataset(files, transform)
    return dataset

def _make_wds_dataset(config, transform):
    files = glob.glob(config.dataset_path + '/*.tar')        
    dataset = wds.WebDataset(files, shardshuffle=True, resampled=True, nodesplitter=wds.split_by_node)
    dataset = dataset.decode('pil')
    dataset = dataset.to_tuple('jpg', 'cls')
    dataset = dataset.map_tuple(transform, identity)
    #dataset = dataset.with_length(config.dataset_num_samples)
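    # Samples per dataloader worker; max(1, ...) avoids dividing by zero when num_workers=0
    num_loaders = config.num_nodes * max(1, config.num_workers)
    dataset = dataset.with_epoch(config.dataset_num_samples // num_loaders)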
    return dataset


################## BENCHMARK IMPLEMENTATIONS #################

def build_dataloader(config):

    transform = tvt.Compose([
        tvt.ToImage(),
        tvt.ToDtype(torch.uint8, scale=True),
        tvt.RandomResizedCrop(size=(config.input_dim, config.input_dim), antialias=False), #antialias=True
        tvt.ToDtype(torch.float32, scale=True),
        tvt.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    # Build dataset
    if config.dataset_format == 'jpg':
        dataset = _make_pt_dataset(config, transform)
    elif config.dataset_format == 'tar':
        dataset = _make_wds_dataset(config, transform)
    else:
        raise NotImplementedError("Unknown dataset format '%s'.." % config.dataset_format)


    return torch.utils.data.DataLoader(
        dataset,
        num_workers=config.num_workers,
        batch_size=config.batch_size,
        prefetch_factor=config.prefetch_size,
        pin_memory=config.pin_memory,
        drop_last=config.batch_drop_last    # honor the --batch_drop_last flag
    )


def build_model(config):
    if config.compute_time is not None:
        model = ModelMock(config)
    else:
        raise NotImplementedError("Compute time parameter must be set explicitly..")
    return model


def train_model(model, dataloader, config):
    metrics = {}
    img_tot_list, ep_times, ckpt_times = [], [], []
    t_train_start = t_epoch_start = time.perf_counter()

    for epoch in range(config.epochs):

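        # Reshuffle per epoch; only a DistributedSampler (map-style dataset) has set_epoch
        if config.num_nodes > 1 and hasattr(dataloader.sampler, 'set_epoch'):
            dataloader.sampler.set_epoch(epoch)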

        img_tot = 0
        for iteration, (images, labels) in enumerate(dataloader, 1):
                
            # do training step
            batch_size = len(images)
            img_tot += batch_size

            result = model(images, labels, iteration)

            if result:
                ckpt_times.append(result)

            if iteration % 50 == 0:
                #print(iteration, '-->', images.shape, labels.shape, labels)
                print("Epoch =", epoch, "/ Iteration =", iteration)

        # log metrics
        img_tot_list.append(img_tot)
        ep_times.append(time.perf_counter() - t_epoch_start)
        t_epoch_start = time.perf_counter()

    # log metrics
    t_train_tot = time.perf_counter() - t_train_start
    metrics['t_training_exact'] = t_train_tot
    metrics['img_sec_ave_tot'] = sum(img_tot_list) / t_train_tot
    metrics['img_tot'] = sum(img_tot_list)
    metrics.update({f't_epoch_{i}': t for i, t in enumerate(ep_times, 1)})
    metrics.update({f't_ckpt_{i}': t for i, t in enumerate(ckpt_times, 1)})
    return metrics
    

def main(config):

    # Debug
    files = glob.glob(config.dataset_path + '/**/*.jpg')
    
    print("Dataset files:")
    for i, f in enumerate(files):
        print(' -', f)
        if i > 20: break
    

    # Log params
    print("Benchmarking params:\n" + json.dumps(vars(config), indent=2))
    
    # Build dataloader
    dataloader = build_dataloader(config)
    dataloader = ray.train.torch.prepare_data_loader(dataloader)

    # Build model
    model = build_model(config)
    model = ray.train.torch.prepare_model(model)

    # Do training run
    metrics = train_model(model, dataloader, config)
    
    # Finish
    print("All logged metrics:\n" + json.dumps(metrics, indent=2))
    time.sleep(3)

    return
        

if __name__ == '__main__':

    # Step 1: Parse the command-line parameters passed in the job entrypoint
    train_config, unknown = parse_args()

    scaling_config = ray.train.ScalingConfig(num_workers=train_config.num_nodes, use_gpu=False)
    
    trainer = ray.train.torch.TorchTrainer(
        main,
        scaling_config=scaling_config,
        train_loop_config=train_config)

    result = trainer.fit()
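
For tar (WebDataset) inputs, the script caps each dataloader worker's epoch with `with_epoch`, dividing `dataset_num_samples` across all workers on all nodes. A small sketch of that arithmetic, using the default `--dataset_num_samples=100000` together with `--num_nodes=2` and `--num_workers=12`; the helper name `samples_per_worker` is hypothetical:

```python
def samples_per_worker(dataset_num_samples, num_nodes, num_workers):
    """Samples each dataloader worker yields per epoch (integer division)."""
    # max(1, ...) avoids a division by zero when num_workers=0 (in-process loading)
    total_workers = num_nodes * max(1, num_workers)
    return dataset_num_samples // total_workers


per_worker = samples_per_worker(100000, num_nodes=2, num_workers=12)
seen_per_epoch = per_worker * 2 * 12

print(per_worker)       # 4166
print(seen_per_epoch)   # 99984, i.e. 16 samples short of 100000
```

Because of the integer division, an epoch can cover slightly fewer samples than `dataset_num_samples`. With `resampled=True` the shards are drawn with replacement anyway, so the epoch length here acts as a benchmark budget rather than an exact pass over the data.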

In [None]:
import time
from ray.job_submission import JobSubmissionClient, JobStatus

In [None]:
ray_head_dns = "ws-ray-head-nlb-76930d7bd7033359.elb.eu-central-1.amazonaws.com"
ray_head_port = 8265
ray_address = f"http://{ray_head_dns}:{ray_head_port}"

In [None]:
client = JobSubmissionClient(ray_address)

In [None]:
# local_dataset_dir = '/mnt/default'
# local_dataset_dir = '/mnt/metadata-cache'
# local_dataset_dir = '/mnt/full-cache'
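
The commented-out paths above are the candidate mount points for the dataset. One way to switch between them without editing the entrypoint string by hand is to build the command from a parameter dictionary. A minimal sketch, assuming the dataset sits under `datasets/synthetic/<name>` inside the mount (matching the YAML `prefix`); `build_entrypoint` is a hypothetical helper:

```python
import shlex


def build_entrypoint(script, params):
    """Build a shell-safe `python <script> --key=value ...` command string."""
    args = [f"--{key}={value}" for key, value in params.items()]
    return shlex.join(["python", script, *args])


local_dataset_dir = "/mnt/full-cache"   # or '/mnt/default', '/mnt/metadata-cache'
dataset_name = "Synth-jpg-5GB-a-100KB"

cmd = build_entrypoint("benchmark_dev.py", {
    "epochs": 3,
    "num_workers": 12,
    "num_nodes": 2,
    "prefetch_size": 2,
    "compute_time": 0,
    "dataset_path": f"{local_dataset_dir}/datasets/synthetic/{dataset_name}",
    "dataset_format": "jpg",
})
print(cmd)
```

Changing `local_dataset_dir` is then enough to rerun the same benchmark against a different mount configuration.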

In [None]:
entrypoint_cmd = "python benchmark_dev.py" \
                 " --epochs=3" \
                 " --num_workers=12" \
                 " --num_nodes=2" \
                 " --prefetch_size=2" \
                 " --compute_time=0" \
                 " --dataset_path=/mnt/full-cache/datasets/synthetic/Synth-jpg-5GB-a-100KB" \
                 " --dataset_format=jpg"
                 

job_id = client.submit_job(
    entrypoint=entrypoint_cmd,
    runtime_env={
        "working_dir": "./scripts",
        "pip": ['torch', 'torchvision', 'torchdata', 'webdataset'],
        "env_vars": {'RAY_DEDUP_LOGS': '0'}}
)

print(job_id)
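
`submit_job` returns as soon as the job is queued, so the notebook still has to wait for the result. A minimal polling sketch, assuming the client exposes `get_job_status(job_id)` as Ray's `JobSubmissionClient` does; `wait_for_job` is a hypothetical helper, and statuses are compared by name so the sketch can be exercised without a live cluster:

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, poll_interval_s=5.0, timeout_s=3600.0):
    """Poll `client` until `job_id` reaches a terminal state or the timeout expires.

    Returns the final status name as a string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(job_id)
        # Ray returns a JobStatus enum; normalise it to its plain string value
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still {name} after {timeout_s} s")
        time.sleep(poll_interval_s)
```

With the `client` and `job_id` from the cells above, `wait_for_job(client, job_id)` followed by `print(client.get_job_logs(job_id))` would show the metrics that the script prints at the end of the run.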
