# Workshop: Onboarding & optimizing AI/ML workloads on AWS with Amazon S3 and EKS

---

<font size='5' color='blue'>**Bonus Notebook: Using Amazon S3 Connector for PyTorch**</font>

---

Throughout this workshop, you used Mountpoint for Amazon S3 as the primary file interface for data I/O in your ML workloads. This notebook introduces the [**Amazon S3 Connector for PyTorch**](https://github.com/awslabs/s3-connector-for-pytorch), an open-source alternative designed for efficient data I/O to and from Amazon S3 and tailored specifically to PyTorch training workloads.

While this notebook complements the broader discussion on efficient data I/O from the main workshop notebook, it is designed to stand alone. Its primary focus is to provide a practical demonstration of the efficient **dataloading** and **checkpointing** capabilities offered by the **S3 Connector for PyTorch**.

---

> _⚠️ **NOTE:** Before using this notebook, you must have run sections 1, 3 and 5 from `1_main_notebook.ipynb`_

---

## Contents

1. Set everything up
2. Introducing **Amazon S3 Connector for PyTorch**
3. Benchmarking dataloading with **S3 Connector for PyTorch**
4. Checkpointing with **S3 Connector for PyTorch**
5. Summary of this notebook

---

# 1. Set everything up

<a id='sec-1'></a>

---

<font size='4' color='gray'>**_A few instructions before you start_**</font>

<font size='3' color='green'>**_(1) Run each of the following code cells in turn with Shift+Enter._**</font>

_It may take a few seconds to a few minutes for a code cell to run. You can determine whether a cell is running by examining the `[]:` indicator in the left margin next to each cell: a cell will show `[*]:` when running, and `[<a number>]:` when complete. **Please read on while you wait**._

_Please feel free to review the code, but it is not essential for you to understand it as the important elements will be explained._

<font size='3' color='red'>**_(2) If any cell output is in red, this indicates an error._**</font>

_Check that the cells have been run in order, and seek help from a workshop assistant if needed._

---

## 1.1 Install and import the required libraries

In [None]:
# Standard library imports
import sys
import os
import json
import time
import random
from datetime import datetime

# Third-party imports
import boto3                    # AWS SDK for Python
import ray                      # Ray SDK for Python
import ray.job_submission       # Ray Jobs interface module of Ray SDK for Python

# Local imports
from utilities import utils     # Collection of helper functions

# Version check for Python compatibility
MIN_PYTHON_VERSION = (3, 7)
assert sys.version_info >= MIN_PYTHON_VERSION, f"Python version must be {MIN_PYTHON_VERSION[0]}.{MIN_PYTHON_VERSION[1]} or higher."

# Print SDK versions
print(f"Python version: {sys.version.split()[0]}")
print(f"Boto3 SDK version: {boto3.__version__}")
print(f"Ray SDK version: {ray.__version__}")


## 1.2 Initial setup for clients and global variables

In [None]:
# AWS region and S3 bucket configuration
aws_region = os.getenv("AWS_REGION")
s3_bucket_name = os.getenv("WORKSHOP_BUCKET")
s3_bucket_prefix = "dataset"

# S3 bucket mountpoints
local_mountpoint_dir = "/s3_data"       # S3 bucket mountpoint path on this Jupyter instance
eks_mountpoint_dir = "/mnt/s3_data"     # S3 bucket mountpoint path on EKS cluster, as seen by Ray workers

# Ray client configuration
ray_head_dns = os.getenv("RAY_HEAD_NLB_DNS")
ray_head_port = 8265
ray_address = f"http://{ray_head_dns}:{ray_head_port}"

# Initialize Ray client
ray_client = ray.job_submission.JobSubmissionClient(ray_address)

# Print configurations
print(f"AWS Region: {aws_region}")
print(f"S3 Bucket: {s3_bucket_name}")
print(f"Ray Head DNS: {ray_client.get_address()}")
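
The configuration above reads several values with `os.getenv`, which returns `None` when a variable is missing, so an unset `RAY_HEAD_NLB_DNS` would quietly produce the address `http://None:8265`. The following is a minimal sketch of a fail-fast check. The helper name `require_env` is hypothetical and is not part of the workshop's `utilities` module.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or fail with a clear error.

    `require_env` is a hypothetical helper. `env` defaults to `os.environ`
    and can be replaced with a plain dict, which makes the check easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable '{name}' is not set")
    return value


# Example with an in-memory mapping instead of the real environment
example_env = {"AWS_REGION": "us-west-2"}
print(require_env("AWS_REGION", example_env))
```

With a helper like this, a call such as `os.getenv("RAY_HEAD_NLB_DNS")` could be swapped for `require_env("RAY_HEAD_NLB_DNS")`, so that a missing variable surfaces immediately rather than later as a connection error.
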

# 2. Introducing **Amazon S3 Connector for PyTorch**

The [**Amazon S3 Connector for PyTorch**](https://github.com/awslabs/s3-connector-for-pytorch) is an open-source toolset designed to provide high throughput for PyTorch training jobs that read from or write to Amazon S3. It streamlines the process by automatically optimizing performance when downloading training data or saving checkpoints, removing the need for lower-level custom code to handle tasks like listing S3 buckets or managing concurrent requests.

The S3 Connector for PyTorch includes implementations of PyTorch's dataset primitives, enabling seamless loading of training data from Amazon S3. It supports both map-style datasets, suited for random data access patterns, and iterable-style datasets, ideal for streaming sequential data. Additionally, the connector offers a checkpointing interface that allows checkpoints to be saved and loaded directly from Amazon S3, bypassing the need for intermediate local storage.

## 2.1 When to use **S3 Connector for PyTorch**

Both **Mountpoint for S3** and the **S3 Connector for PyTorch** rely on the [**AWS Common Runtime**](https://aws.amazon.com/blogs/storage/improving-amazon-s3-throughput-for-the-aws-cli-and-boto3-with-the-aws-common-runtime/), which handles tasks such as automatic request parallelization, managing timeouts and retries, and reusing connections to optimize network performance and prevent overload. However, the two tools differ in their scope and use cases. Mountpoint for S3 serves as a general-purpose file interface to S3, supporting a wide range of workloads beyond ML training. In contrast, the S3 Connector for PyTorch provides a set of PyTorch-specific primitives designed to tightly integrate S3 with PyTorch training pipelines.

One key distinction is in how each tool is used. The S3 Connector for PyTorch does not require Mountpoint for S3 to be installed on your training nodes, which makes it a good fit for training directly in notebooks or in other environments where you can’t install system-level software. The trade-off is that it requires minor modifications to your training code. These changes are explained in the following sections.

# 3. Benchmarking dataloading with **S3 Connector for PyTorch**
First we will benchmark the dataloading performance of the S3 Connector for PyTorch. We will submit the same `benchmark.py` training script used in the previous notebook (`1_main_notebook.ipynb`) as a remote Ray job. The benchmark parameters will remain identical to the earlier setup, but this time we will add a flag to use the S3 Connector for PyTorch for dataloading. Additionally, we will now need to provide the full S3 URI of the dataset in the `dataset_path` variable.

> 💡 _**TIP:** While the job runs, read on to learn more. It should take <font color='red'>**around 2 minutes**</font> to complete._


## 3.1 Submit the job to Ray

In [None]:
### --------
### STEP #1: Compose the entrypoint command for Ray workers
### -------

# Set dataset name, dataset format, and benchmark name
dataset_name = '100k-samples-large-files'
dataset_format = 'tar'
dataloader_use_s3pt = True
dataset_path = os.path.join('s3://' + s3_bucket_name, s3_bucket_prefix, dataset_name)
benchmark_name = f'benchmark-dataloading-{dataset_name}-{dataset_format}-{datetime.now().strftime("%Y%m%d%H%M%S-%f")}'

# Compose entrypoint command string
entrypoint_command = "python benchmark.py" \
                     "  --epochs=3" \
                     "  --batch_size=64" \
                     "  --prefetch_size=2" \
                     "  --input_dim=224" \
                     "  --dataloader_workers=16" \
                    f"  --dataloader_use_s3pt={dataloader_use_s3pt}" \
                    f"  --dataset_path={dataset_path}" \
                    f"  --dataset_format={dataset_format}" \
                     "  --model_compute_time=0" \
                     "  --ray_workers=2" \
                     "  --ray_cpus_per_worker=8" \
                    f"  --benchmark_name={benchmark_name}"


### --------
### STEP #2: Define the runtime environment parameters for Ray workers
### -------

runtime_environment = {
    "working_dir": "./scripts",        # <--- the working dir is copied over to each Ray worker
    "pip": [                           # <--- PYPI packages to be installed on each Ray worker before executing entrypoint command
        'torch',
        'torchvision',
        'torchdata==0.9',
        'webdataset',
        's3torchconnector'],
    "env_vars": {                      # <--- any custom env vars to be set in Ray worker runtime
        'EKS_MOUNTPOINT_DIR': eks_mountpoint_dir,
        'AWS_REGION': aws_region
    }
}

### --------
### STEP #3: Submit job to Ray cluster
### -------

job_id = ray_client.submit_job(entrypoint=entrypoint_command, runtime_env=runtime_environment)


### Print out the Ray Job ID and other details
print(f"Submitted a new Ray job with ID '{job_id}' and the following entrypoint command: \n")
for line in entrypoint_command.split('  '):
    print(line)
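
Once a job has been submitted, its progress can be followed by polling the job status until it reaches a terminal state. The sketch below illustrates that idea in isolation. It is not the implementation of the workshop's `utils.wait_for_job_to_finish` helper, and the function name, polling interval, and timeout are all assumptions. The status names match Ray's `JobStatus` values (`SUCCEEDED`, `FAILED`, `STOPPED`).

```python
import time

# Terminal job states, named after Ray's JobStatus enum values
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_terminal_status(get_status, poll_seconds=5.0,
                             timeout_seconds=600.0, sleep=time.sleep):
    """Poll `get_status()` until it reports a terminal state, then return its name.

    `get_status` is any zero-argument callable returning a status string or an
    enum member with a `.value` attribute. `sleep` is injectable for testing.
    """
    waited = 0.0
    while True:
        status = get_status()
        # Normalize enum members and plain strings to a string name
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state '{name}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example with a scripted sequence of statuses instead of a live cluster
scripted = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_terminal_status(lambda: next(scripted), sleep=lambda s: None))
```

Against a live cluster, the callable could be `lambda: ray_client.get_job_status(job_id)`, which is part of the Ray Jobs SDK.
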

## 3.2 Dataloading with **S3 Connector for PyTorch**

The S3 Connector for PyTorch provides two abstractions for directly accessing data from S3 in your ML training scripts. For streaming scenarios, it includes a PyTorch **iterable-style dataset** that streams all objects from a specified prefix in your S3 bucket. This abstraction handles object listing and efficient loading internally, ensuring data is read only when necessary. With just a single line of code, you can construct a streaming dataset for the entire prefix of an S3 bucket. For random access needs, the connector offers a PyTorch **map-style dataset**. This dataset automatically lists and maps objects in a bucket prefix, allowing for direct access to specific objects.

<img src="assets/pic_s3pt_dataloading.png" width="1000"/>

Both data loaders are designed to be simple to use, **requiring only one line of code**, and they integrate seamlessly with PyTorch’s existing interfaces for data loading:

- **Map-style datasets**:

```python
from s3torchconnector import S3MapDataset

dataset = S3MapDataset.from_prefix("s3://reinvent-demo-bucket/dataset", region="us-west-2")
```

- **Iterable-style datasets**:

```python
from s3torchconnector import S3IterableDataset

dataset = S3IterableDataset.from_prefix("s3://reinvent-demo-bucket/dataset", region="us-west-2")
```
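
By default, each item produced by these datasets is a file-like reader for one S3 object, and the `transform` argument of `from_prefix` lets you convert that reader into whatever your training loop needs. The sketch below shows one possible transform for tar shards such as the ones used in this benchmark. It is not taken from the workshop's `benchmark.py`, and the names `list_shard_members` and `build_dataset` are hypothetical.

```python
import io
import tarfile


def list_shard_members(s3_object):
    """Return (member_name, size_in_bytes) for each regular file in a tar shard.

    `s3_object` can be any binary file-like object with a `read()` method.
    The tar is opened in streaming mode ("r|*"), so it is read sequentially
    and never needs to seek backwards.
    """
    with tarfile.open(fileobj=s3_object, mode="r|*") as archive:
        return [(member.name, member.size) for member in archive if member.isfile()]


def build_dataset(dataset_uri, region):
    """Hypothetical wiring: apply the transform to every object under a prefix."""
    # Imported lazily so the transform above can be tested without the connector
    from s3torchconnector import S3IterableDataset

    return S3IterableDataset.from_prefix(
        dataset_uri, region=region, transform=list_shard_members
    )


# Local demonstration with an in-memory tar archive standing in for an S3 object
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name, payload in [("000.jpg", b"abc"), ("000.cls", b"7")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)
print(list_shard_members(buffer))
```

A real pipeline would decode the samples instead of listing them, but the shape is the same: the transform receives one object at a time and returns the value that the dataset yields.
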


## 3.3 Analyze and plot results (dataloading with S3 Connector for PyTorch)
You can monitor the job in your Ray Dashboard and track its performance in the Grafana Dashboard:

In [None]:
print(f"Grafana Dashboard: http://{os.getenv('GRAFANA_NLB_DNS')}")
print(f"Ray Dashboard: http://{os.getenv('RAY_DASHBOARD_NLB_DNS')}")

Let's now plot the results of the benchmark that we have just performed:

In [None]:
# Wait for the job to finish, so that we can plot the results
utils.wait_for_job_to_finish(job_id, ray_client)

# Load the log file and plot the results
logfile_path = os.path.join(local_mountpoint_dir, 'logs', benchmark_name + '.json')
print(f"Plotting results from '{logfile_path}'..")
utils.plot_dataloading_results(logfile_path, plot_for='s3pt')

### Explanation

The plot above illustrates the results of the data loading benchmark conducted using the S3 Connector for PyTorch with a sharded dataset. As demonstrated, streaming a sharded dataset directly from S3 achieves performance on par with our earlier benchmarks that used Mountpoint for S3. This approach enables efficient, on-demand dataset streaming directly from S3 into the PyTorch training script, eliminating the need for intermediate local storage.

# 4. Checkpointing with **S3 Connector for PyTorch**

The S3 Connector for PyTorch also includes a checkpointing interface to save and load checkpoints directly to Amazon S3, without first saving to local storage. Similarly to Mountpoint for S3, it leverages [**AWS Common Runtime**](https://aws.amazon.com/blogs/storage/improving-amazon-s3-throughput-for-the-aws-cli-and-boto3-with-the-aws-common-runtime/) to distribute large file writes elastically across the S3 fleet, resulting in significantly faster model checkpointing performance than saving model snapshots to local EBS volumes or NVMe instance storage.

The S3 Connector for PyTorch also makes it easy to save your PyTorch model checkpoints directly to S3. With just a single extra line of code, you can tell PyTorch to write a checkpoint directly to S3:

```python
import torch
from s3torchconnector import S3Checkpoint

ckpt = S3Checkpoint(region="us-west-2")

with ckpt.writer("s3://reinvent-demo-bucket/checkpoints/epoch-0.ckpt") as writer:
    torch.save(model.state_dict(), writer)
```
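
Loading a checkpoint back follows the same pattern, using `S3Checkpoint.reader` in place of `writer`. The sketch below wraps both directions in small helpers. The function names and the `epoch-N.ckpt` naming scheme are assumptions made for illustration, and `torch` and `s3torchconnector` are imported lazily so that the URI helper can be run on its own.

```python
def checkpoint_uri(bucket, prefix, epoch):
    """Build the S3 URI for one epoch's checkpoint (hypothetical naming scheme)."""
    return f"s3://{bucket}/{prefix.strip('/')}/epoch-{epoch}.ckpt"


def save_checkpoint(model, uri, region):
    """Write the model's state dict directly to S3."""
    import torch
    from s3torchconnector import S3Checkpoint

    with S3Checkpoint(region=region).writer(uri) as writer:
        torch.save(model.state_dict(), writer)


def load_checkpoint(model, uri, region):
    """Read a state dict directly from S3 and load it into `model`."""
    import torch
    from s3torchconnector import S3Checkpoint

    with S3Checkpoint(region=region).reader(uri) as reader:
        model.load_state_dict(torch.load(reader))
    return model


# Example: the URI used in the snippet above
print(checkpoint_uri("reinvent-demo-bucket", "checkpoints", 0))
```

Keeping URI construction separate from the I/O calls makes it straightforward to resume from the latest epoch by computing the URI first and then calling `load_checkpoint`.
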


## 4.1 Benchmarking checkpointing with **S3 Connector for PyTorch**

To quantitatively evaluate the checkpointing performance directly to S3 using **S3 Connector for PyTorch**, let's now run a comparative benchmark against PyTorch's native checkpointing capability to the local storage of Ray workers (which, in this case, are the attached EBS gp3 volumes). The benchmarks that we are about to run will utilize our previous benchmarking script, with a few additional configuration parameters to control checkpointing behavior:

- `ckpt_steps` - the number of steps between checkpoints (set to `100` in this benchmark; setting it to `0` disables checkpointing);
- `ckpt_mode` - the checkpointing backend, either `disk` (for checkpointing to a local path) or `s3pt` (to use S3 Connector for PyTorch);
- `ckpt_path` - storage path (S3 URI or local filesystem path);
- `model_num_parameters` - model size in millions of parameters, which effectively defines the model snapshot size.

> ⚠️ _**A FEW NOTES:**_
> - In this benchmark we are effectively using S3 Connector for PyTorch for **both** dataloading and checkpointing.
> - As we set `epochs=1` and `ckpt_steps=100` below, we will checkpoint exactly **7 times** during our benchmark job and report the **average checkpointing time** (this is because each Ray worker processes ~780 batches per epoch, assuming 100k sample dataset, `batch_size=64` and `ray_workers=2`).
> - As we set `model_num_parameters=1000`, the resulting model snapshots will be approx. **4 GB** in size. This is because we are saving 1000M weights in `fp32` format (i.e. allocating 4 bytes per model weight).
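
The numbers in the notes above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes the samples are split evenly across Ray workers and that a checkpoint is written after every `ckpt_steps` completed steps. The function name `checkpoint_plan` is hypothetical, and the actual sharding logic in `benchmark.py` may differ slightly.

```python
def checkpoint_plan(num_samples, batch_size, ray_workers, ckpt_steps,
                    model_num_parameters_m, bytes_per_weight=4):
    """Estimate batches per worker, checkpoints per epoch, and snapshot size.

    Assumes an even split of samples across workers (an assumption, not
    something confirmed by the benchmark script).
    """
    batches_per_worker = num_samples // (batch_size * ray_workers)
    num_checkpoints = batches_per_worker // ckpt_steps if ckpt_steps > 0 else 0
    # model_num_parameters is given in millions; fp32 uses 4 bytes per weight
    snapshot_bytes = model_num_parameters_m * 1_000_000 * bytes_per_weight
    return batches_per_worker, num_checkpoints, snapshot_bytes


batches, checkpoints, size_bytes = checkpoint_plan(
    num_samples=100_000, batch_size=64, ray_workers=2,
    ckpt_steps=100, model_num_parameters_m=1000,
)
print(batches, checkpoints, size_bytes / 1e9)  # prints: 781 7 4.0
```

Under these assumptions each worker processes a little under 800 batches, which gives 7 checkpoints per epoch of about 4 GB each.
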

In [None]:
ckpt_benchmarks = {}

for ckpt_destination in ('s3_connector', 'local_disk'):

    ### --------
    ### STEP #1: Compose the entrypoint command for Ray workers
    ### -------

    # Set dataset name, dataset format, and benchmark name
    dataset_name = '100k-samples-large-files'
    dataset_format = 'tar'
    dataloader_use_s3pt = True
    dataset_path = os.path.join('s3://' + s3_bucket_name, s3_bucket_prefix, dataset_name)
    benchmark_name = f'benchmark-checkpointing-{ckpt_destination}-{datetime.now().strftime("%Y%m%d%H%M%S-%f")}'

    # Set checkpoint path and mode
    if ckpt_destination == 's3_connector':
        ckpt_mode = 's3pt'
        ckpt_path = os.path.join('s3://' + s3_bucket_name, 'checkpoints')
    else:
        ckpt_mode = 'disk'
        ckpt_path = 'checkpoints'
    
    # Compose entrypoint command string
    entrypoint_command = "python benchmark.py" \
                         "  --epochs=1" \
                         "  --batch_size=64" \
                         "  --prefetch_size=2" \
                         "  --input_dim=224" \
                         "  --dataloader_workers=4" \
                        f"  --dataloader_use_s3pt={dataloader_use_s3pt}" \
                        f"  --dataset_path={dataset_path}" \
                        f"  --dataset_format={dataset_format}" \
                         "  --model_compute_time=0" \
                         "  --ray_workers=2" \
                         "  --ray_cpus_per_worker=8" \
                        f"  --benchmark_name={benchmark_name}" \
                         "  --model_num_parameters=1000" \
                         "  --ckpt_steps=100" \
                        f"  --ckpt_mode={ckpt_mode}" \
                        f"  --ckpt_path={ckpt_path}"
    
    
    ### --------
    ### STEP #2: Define the runtime environment parameters for Ray workers
    ### -------
    
    # Nothing to do! We use the same runtime environment definition as for the first benchmark.
    
    ### --------
    ### STEP #3: Submit job to Ray cluster
    ### -------
    
    job_id = ray_client.submit_job(entrypoint=entrypoint_command, runtime_env=runtime_environment)
    
    
    ### Print out the Ray Job ID and other details
    print(f"Submitted a new Ray job with ID '{job_id}' and the following entrypoint command: \n")
    for line in entrypoint_command.split('  '):
        print(line)
    print('^'*40, '\n')
    time.sleep(5)

    # Keep track of our benchmarks
    ckpt_benchmarks[job_id] = {'name': benchmark_name, 'tag': ckpt_destination}

### You can go to the Ray and Grafana dashboards to observe the TWO jobs running:

In [None]:
print(f"Grafana Dashboard: http://{os.getenv('GRAFANA_NLB_DNS')}")
print(f"Ray Dashboard: http://{os.getenv('RAY_DASHBOARD_NLB_DNS')}")

## 4.2 Analyze and plot results (checkpointing with S3 Connector for PyTorch)

Now plot the results of your checkpointing benchmarks.

> ⚠️ _**NOTE:** The **two** Ray jobs that we have just submitted will run in parallel, and the longest job will take <font color='red'>**around 4 minutes**</font> to complete. The cell below will automatically wait for both jobs to complete, and then plot the benchmark results._

In [None]:
# Wait for both jobs to finish, so that we can plot the results
for job_id in ckpt_benchmarks:
    utils.wait_for_job_to_finish(job_id, ray_client)

# Load the log files and plot the results
benchmark_files = {
    benchmark['tag'].replace('_', ' ').title(): os.path.join(local_mountpoint_dir, 'logs', benchmark['name'] + '.json')
    for benchmark in ckpt_benchmarks.values()
}

print(f"Plotting results for '{json.dumps(benchmark_files, indent=2)}'..")
utils.plot_checkpointing_results(benchmark_files)

### Explanation

The benchmark results above compare the time required to save periodic model checkpoints either directly to S3 using S3 Connector for PyTorch (red) or to a local EBS volume of the Ray workers using native PyTorch model checkpointing capability (blue). Similar to the benchmarks for Mountpoint for S3 in the previous notebook, the results here with S3 Connector for PyTorch demonstrate a clear advantage for saving checkpoints directly to S3.

This high throughput, approaching the maximum network bandwidth of the EC2 instance, delivers substantial time and cost savings during training. This is particularly beneficial for long-running distributed training jobs with frequent checkpointing requirements, as it reduces overall training overhead and enables more efficient resource utilization.
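
To relate a checkpoint time from the plot to network bandwidth, divide the snapshot size by the save duration. The sketch below shows the conversion only. The 10-second duration in the example is a made-up placeholder and not a measured result from this benchmark, and the function name is hypothetical.

```python
def throughput_gbit_per_s(size_bytes, seconds):
    """Convert a transfer of `size_bytes` in `seconds` to gigabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_bytes * 8 / seconds / 1e9


# Placeholder values: a 4 GB snapshot and a hypothetical 10-second save time
snapshot_bytes = 4_000_000_000
hypothetical_seconds = 10.0
print(f"{throughput_gbit_per_s(snapshot_bytes, hypothetical_seconds):.1f} Gbit/s")
```

Substituting the average checkpointing time reported by your own run gives a figure that can be compared with the published network bandwidth of your EC2 instance type.
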

>💡 **_TIP:_** _In this workshop, we used EBS storage. If you use EC2 instances with instance storage, you'll also have access to one or more physically attached ephemeral volumes. Instance storage is ideal for temporary data such as buffers, caches, scratch data, or other transient content._

<br>

# 5. Summary of this notebook

This notebook introduced the **Amazon S3 Connector for PyTorch**, showcasing its capability for high-throughput dataset streaming directly from S3 and efficient model checkpointing back to S3. By offering PyTorch-specific primitives, the S3 Connector for PyTorch simplifies building efficient data I/O pipelines with seamless integration to Amazon S3. Benchmarks illustrated its potential to enhance ML workflows with minimal code changes, improving scalability and efficiency for real-world workloads.

<br>

- If you wish to share this workshop with colleagues, and/or run it in your own time in your own AWS Account, make a note of this link: https://s12d.com/stg406

- If you *are* running this in your own AWS Account, in order to prevent ongoing charges you must clean up your deployed resources. Instructions for doing this are in the **Summary and clean-up section** of the workshop instructions.

- If you are at an AWS event, **please fill out the session survey provided by AWS staff**. Your feedback helps us improve, and justifies our efforts in creating content such as this. Thank you.
