## Scaling Workloads on SageMaker using Ray

In this notebook, we scale a data processing workload to a multi-node cluster using [Ray](https://www.ray.io/) on Amazon SageMaker. We use the SageMaker Python SDK to create a training job that uses Ray to distribute the workload across multiple instances. Ray is a distributed computing framework that makes it easy to scale applications from a single machine to a large cluster. Furthermore, the [modin library](https://github.com/modin-project/modin) provides a pandas-compatible API that can use Ray as its execution engine. This means that we can use the familiar pandas API to process large datasets in parallel across multiple instances. The awswrangler library can use either pandas or modin as the underlying engine for processing dataframes.

<div style="border: 1px solid black; padding: 10px; background-color: #ffffcc; color: black;">
<strong>Note:</strong> Make sure to fully run the first notebook to ingest the data into Athena before running this notebook.
</div>

In [None]:
import boto3
import json
import sagemaker
from pathlib import Path
from sagemaker.pytorch import PyTorch  # PyTorch Estimator for running our training job

role = sagemaker.get_execution_role()  # execution role used by the training job
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
bucket = sess.default_bucket()  # default bucket name
account_id = sess.account_id()

In [2]:
# load values from the first notebook

if not Path("lab_values.json").exists():
    raise FileNotFoundError("Please run the first notebook first.")
else:
    lab_values = json.loads(Path("lab_values.json").read_text())
    input_data_location = lab_values["s3_csv_folder"]

In [3]:
output_location = f"s3://{bucket}/ml_workshop/aggregation-job/output"

Even though we are not training an ML model, we'll use a training job to leverage the SageMaker infrastructure to run our Ray job. A training job has several advantages over a Processing job, namely:
- Wider range of instance types and sizes
- Ability to pass arguments to the script via a python dictionary rather than a list of strings (see the `hyperparameters` argument below)
- Ability to keep instances warm for a while after the job completes (`keep_alive_period_in_seconds` argument), so that sequential jobs can reuse them and start faster, which is convenient when iterating or debugging
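
To illustrate the `hyperparameters` point: the SageMaker framework containers forward each dictionary entry to the entry point as a `--key value` command-line pair, so the script can read them with `argparse`. The following is a minimal sketch, assuming the argument names match the keys used in this notebook. The function name `parse_job_args` is hypothetical, and the actual `compute_aggregations.py` may parse its arguments differently.

```python
import argparse


def parse_job_args(argv=None):
    """Parse hyperparameters that arrive as `--key value` command-line pairs.

    `parse_job_args` is a hypothetical helper name used for illustration only.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data_location", type=str, required=True)
    parser.add_argument("--output_data_location", type=str, required=True)
    # parse_known_args ignores any additional arguments instead of failing
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    # Placeholder S3 URIs, not real locations
    example = parse_job_args(
        [
            "--input_data_location", "s3://example-bucket/input/",
            "--output_data_location", "s3://example-bucket/output/",
        ]
    )
    print(example.input_data_location, example.output_data_location)
```

Passing `argv=None` makes `argparse` read from `sys.argv`, which is what the script would do when it runs inside the training container.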

The [compute_aggregations.py](./ray_script/compute_aggregations.py) script computes both simple and more complex aggregations on the data. You can set the `USE_RAY` variable to `False` to compare the performance of pandas against modin.
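
Because environment variables are always strings, the script has to convert `USE_RAY` back into a boolean before choosing a dataframe engine. The sketch below shows one way this could be done. The helper names are hypothetical, and the real script may implement the switch differently.

```python
import importlib
import os


def use_ray_enabled(environ=None):
    """Return True when the USE_RAY environment variable is set to "True" (case-insensitive)."""
    environ = os.environ if environ is None else environ
    return environ.get("USE_RAY", "False").strip().lower() == "true"


def dataframe_module_name(use_ray):
    """Pick the module that provides the pandas API: modin when Ray is enabled, pandas otherwise."""
    return "modin.pandas" if use_ray else "pandas"


def load_dataframe_module(use_ray):
    """Import the selected module lazily; this requires the chosen library to be installed."""
    return importlib.import_module(dataframe_module_name(use_ray))


if __name__ == "__main__":
    print(dataframe_module_name(use_ray_enabled()))
```

In a script, `pd = load_dataframe_module(use_ray_enabled())` would then let the remaining code use `pd.DataFrame`, `pd.read_csv`, and so on without caring which engine is active.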

In [4]:
USE_RAY = True

job = PyTorch(
    source_dir="ray_script",
    entry_point="compute_aggregations.py",
    framework_version="2.2",
    py_version="py310",
    role=role,
    environment={"USE_RAY": str(USE_RAY)},           # we can pass environment variables to the training job
    hyperparameters={                                # hyperparameters are passed as command line arguments to the training script
        "input_data_location": input_data_location,
        "output_data_location": output_location,
    },
    instance_type="ml.m5.xlarge",
    instance_count=3 if USE_RAY else 1,              # use 3 instances if Ray is enabled otherwise use 1 instance
    max_run=1000,                                    # maximum allowed runtime in seconds
    keep_alive_period_in_seconds=300                 # instances will be kept alive for 300 seconds after the job finishes for use in future jobs
)

In [None]:
# SageMaker training job is started by calling the fit method
job.fit()
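
When the job runs on several instances, every instance executes the same entry point, so the script needs to work out which node should act as the Ray head. SageMaker training containers expose the cluster members through the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables. The sketch below, with the hypothetical helper `cluster_layout`, shows one possible convention (the alphabetically first host becomes the head). It is an assumption for illustration and not necessarily what `compute_aggregations.py` does.

```python
import json
import os


def cluster_layout(environ=None):
    """Describe the cluster from SageMaker's environment variables.

    Assumed convention: the alphabetically first host acts as the Ray head node.
    """
    environ = os.environ if environ is None else environ
    # SM_HOSTS is a JSON list such as '["algo-1", "algo-2", "algo-3"]'
    hosts = sorted(json.loads(environ.get("SM_HOSTS", '["algo-1"]')))
    current = environ.get("SM_CURRENT_HOST", hosts[0])
    head = hosts[0]
    return {"hosts": hosts, "head": head, "current": current, "is_head": current == head}


if __name__ == "__main__":
    # Falls back to a single-node layout when run outside a training container
    print(cluster_layout())
```

With such a layout, the head node would typically start Ray with `ray start --head` while the other nodes join it using the head's address. Those steps are omitted here.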

In [None]:
!aws s3 ls $output_location/ --recursive --human-readable