# Use SageMaker Distributed Model Parallel with Amazon SageMaker to Launch an MNIST Training Job with Model Parallelization

SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SageMaker Distributed Model Parallel automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.

Use this notebook to configure SageMaker Distributed Model Parallel to train a model using an example PyTorch training script, `utils/pt_mnist.py`, and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk).


### Additional Resources

If you are a new user of Amazon SageMaker, you may find the following resources helpful for learning more about SMP and about using SageMaker with PyTorch.

* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

* To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).

* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

## Amazon SageMaker Initialization

Run the following cells to initialize the notebook instance and to get the SageMaker execution role used to run this notebook.

In [1]:
pip install sagemaker-experiments

Collecting sagemaker-experiments
  Downloading sagemaker_experiments-0.1.34-py3-none-any.whl (42 kB)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.34
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install sagemaker --upgrade

Collecting sagemaker
  Downloading sagemaker-2.52.0.tar.gz (435 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.52.0-py2.py3-none-any.whl size=613289 sha256=2f51f881112c6482b0d180aa12db21237eba345b7f2b7f50ee422164c0dfb858
  Stored in directory: /home/ec2-user/.cache/pip/wheels/1b/0b/05/f42f221810f419089bb19bcde0555c5e36f975f30423fadd99
Successfully built sagemaker
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.45.0
    Uninstalling sagemaker-2.45.0:
      Successfully uninstalled sagemaker-2.45.0
Successfully installed sagemaker-2.52.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.

In [3]:
%%time
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import boto3
from time import gmtime, strftime

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role:{role}")

session = boto3.session.Session()

SageMaker Execution Role:arn:aws:iam::804604702169:role/service-role/AmazonSageMaker-ExecutionRole-20210803T195128
CPU times: user 1.03 s, sys: 165 ms, total: 1.2 s
Wall time: 1.71 s


## Prepare your training script

Run the following cell to view the example training script you will use in this demo. This is a PyTorch 1.6 training script that uses the MNIST dataset.

You will see that the script contains SMP-specific operations and decorators, which configure model parallel training. See the training script comments to learn more about the SMP functions and types used in the script.

In [4]:
# Run this cell to see an example of a training script that you can use to configure
# SageMaker Distributed Model Parallel with PyTorch version 1.6
!cat utils/pt_mnist.py

# Future
from __future__ import print_function

import argparse
import math

# Standard Library
import os
import random
import time

# Third Party
import numpy as np

# First Party
import smdistributed.modelparallel.torch as smp
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.cuda.amp import autocast
from torch.optim.lr_scheduler import StepLR
from torchnet.dataset import SplitDataset
from torchvision import datasets, transforms

# SM Distributed: import scaler from smdistributed.modelparallel.torch.amp, instead of torch.cuda.amp

# Make cudnn deterministic in order to get the same losses across runs.
# The following two lines can be removed if they cause a performance impact.
# For more details, see:
# https://pytorch.org/docs/stable/notes/randomness.html#cudnn
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


def aws_s3_sync(source, destination):

    """aws

## Define SageMaker Training Job

Next, you will use the SageMaker Estimator API to define a SageMaker Training Job. You will use an [`Estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances.

You can update the following:
* `processes_per_host`
* `entry_point`
* `instance_count`
* `instance_type`
* `base_job_name`

In addition, you can supply and modify configuration parameters for the SageMaker Distributed Model Parallel library. These parameters are passed in through the `distribution` argument of the estimator.
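
To make the `microbatches` parameter more concrete, the following minimal sketch shows how a single batch could be divided into equal microbatches for pipelined execution. It is a conceptual illustration in plain Python only: the library performs this split internally, and the `split_into_microbatches` helper is hypothetical, not part of the SMP API. The sketch assumes that the batch size is divisible by the number of microbatches.

```python
def split_into_microbatches(batch, num_microbatches):
    """Split a batch (any sequence) into equally sized microbatches.

    Hypothetical helper for illustration only; this is not an SMP function.
    """
    if num_microbatches <= 0:
        raise ValueError("num_microbatches must be positive")
    if len(batch) == 0:
        raise ValueError("batch must not be empty")
    if len(batch) % num_microbatches != 0:
        raise ValueError("batch size must be divisible by num_microbatches")
    size = len(batch) // num_microbatches
    # Consecutive slices of `size` items each form one microbatch.
    return [batch[i : i + size] for i in range(0, len(batch), size)]


# A toy batch of 8 samples split with "microbatches": 4
batch = list(range(8))
print(split_into_microbatches(batch, 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With pipelining, each partition can start working on the next microbatch while a later partition is still processing the previous one, which is why the batch is split at all.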

### Update the Type and Number of EC2 Instances Used

Specify your `processes_per_host`. Note that it must be a multiple of the number of partitions, which is 2 by default.

The instance type and number of instances you specify in `instance_type` and `instance_count`, respectively, determine the number of GPUs Amazon SageMaker uses during training. Specifically, `instance_type` determines the number of GPUs on a single instance, and that number is multiplied by `instance_count`.

You must specify values for `instance_type` and `instance_count` so that the total number of GPUs available for training is equal to `partitions` in `config` of `smp.init` in your training script. 


To look up instance types, see [Amazon EC2 Instance Types](https://aws.amazon.com/sagemaker/pricing/).
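
The arithmetic described in this section can be sketched as a small pre-launch check. This is an illustrative helper, not part of the SageMaker SDK: the `check_launch_config` function and the `GPUS_PER_INSTANCE` table are hypothetical, the table lists only a few P3 instance types, and the rule that `processes_per_host` should not exceed the GPUs on one instance is an assumption (one process per GPU).

```python
# GPU counts for a few instance types (illustrative subset; verify against AWS documentation).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def check_launch_config(instance_type, instance_count, processes_per_host, partitions):
    """Validate the process/partition settings and report GPU totals.

    Hypothetical helper for illustration only.
    """
    gpus_per_instance = GPUS_PER_INSTANCE[instance_type]
    if processes_per_host % partitions != 0:
        raise ValueError("processes_per_host must be a multiple of partitions")
    # Assumption: each process drives one GPU, so a host cannot run more
    # processes than it has GPUs.
    if processes_per_host > gpus_per_instance:
        raise ValueError("processes_per_host exceeds the GPUs on one instance")
    return {
        "total_gpus": gpus_per_instance * instance_count,
        "total_processes": processes_per_host * instance_count,
    }


# Values used by the estimator in this notebook
print(check_launch_config("ml.p3.16xlarge", 1, 2, 2))
# {'total_gpus': 8, 'total_processes': 2}
```

Running a check like this before calling `fit` can catch a mismatched configuration without waiting for instances to launch.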


### Uploading Checkpoint During Training or Resuming Checkpoint from Previous Training
We also provide a custom way for users to upload checkpoints during training or to resume from checkpoints saved by a previous training job. This is integrated into the `pt_mnist.py` example script; see the functions `aws_s3_sync`, `sync_local_checkpoints_to_s3`, and `sync_s3_checkpoints_to_local`. In this example, we only upload a checkpoint during training, using `sync_local_checkpoints_to_s3`.
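
The bodies of these functions are not reproduced in this notebook, so the following is only a minimal sketch of how such a helper could wrap the AWS CLI `aws s3 sync` command. The names `build_s3_sync_command` and `sync_directory`, the bucket `example-bucket`, and the job name `example-job` are hypothetical; the local directory and the `checkpoints/algo-1/` layout mirror the paths that appear in this job's output. The sketch defaults to a dry run so that it can be tried without AWS credentials.

```python
import subprocess


def build_s3_sync_command(source, destination):
    """Return the AWS CLI command that syncs `source` to `destination`."""
    return ["aws", "s3", "sync", source, destination]


def sync_directory(source, destination, dry_run=True):
    """Sync a directory with S3 (hypothetical helper, not the script's code).

    With dry_run=True the command is only printed, not executed.
    """
    cmd = build_s3_sync_command(source, destination)
    if dry_run:
        print("Would run:", " ".join(cmd))
        return cmd
    # check=True raises CalledProcessError if the AWS CLI exits with an error.
    subprocess.run(cmd, check=True)
    return cmd


# Upload direction: local checkpoints to S3 (placeholder bucket and job name)
local_dir = "/opt/ml/local_checkpoints"
s3_uri = "s3://example-bucket/example-job/checkpoints/algo-1/"
sync_directory(local_dir, s3_uri)
```

Swapping the two arguments gives the opposite direction, which is the shape a resume-from-checkpoint helper would take.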


After you have updated `entry_point`, `instance_count`, `instance_type` and `base_job_name`, run the following to create an estimator. 

In [5]:
sagemaker_session = sagemaker.session.Session(boto_session=session)
mpioptions = "-verbose -x orte_base_help_aggregate=0 "

all_experiment_names = [exp.experiment_name for exp in Experiment.list()]

# choose an experiment name (only need to create it once)
experiment_name = "SM-MP-DEMO"

# Load the experiment if it exists, otherwise create
if experiment_name not in all_experiment_names:
    customer_churn_experiment = Experiment.create(
        experiment_name=experiment_name, sagemaker_boto_client=boto3.client("sagemaker")
    )
else:
    customer_churn_experiment = Experiment.load(
        experiment_name=experiment_name, sagemaker_boto_client=boto3.client("sagemaker")
    )

# Create a trial for the current run
trial = Trial.create(
    trial_name="SMD-MP-demo-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime())),
    experiment_name=customer_churn_experiment.experiment_name,
    sagemaker_boto_client=boto3.client("sagemaker"),
)


smd_mp_estimator = PyTorch(
    entry_point="pt_mnist.py",  # Pick your train script
    source_dir="utils",
    role=role,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    framework_version="1.6.0",
    py_version="py36",
    instance_count=1,
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "microbatches": 4,
                    "placement_strategy": "spread",
                    "pipeline": "interleaved",
                    "optimize": "speed",
                    "partitions": 2,
                    "ddp": True,
                },
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 2,  # Pick your processes_per_host
            "custom_mpi_options": mpioptions,
        },
    },
    base_job_name="SMD-MP-demo",
)

Finally, you will use the estimator to launch the SageMaker training job.

In [6]:
smd_mp_estimator.fit(
    experiment_config={
        "ExperimentName": customer_churn_experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    }
)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: SMD-MP-demo-2021-08-06-19-01-06-885


2021-08-06 19:01:07 Starting - Starting the training job...
2021-08-06 19:01:16 Starting - Launching requested ML instancesProfilerReport-1628276467: InProgress
.........
2021-08-06 19:02:57 Starting - Preparing the instances for training...ProfilerReport-1628276467: Error
......
2021-08-06 19:04:38 Downloading - Downloading input data
2021-08-06 19:04:38 Training - Downloading the training image..................
2021-08-06 19:07:30 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-06 19:07:30,778 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-08-06 19:07:30,856 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-08-06 19:07:30,866 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021

[1,0]<stdout>:Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
[1,0]<stdout>:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
[1,0]<stdout>:Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
[1,0]<stdout>:Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
[1,0]<stdout>:Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
[1,0]<stdout>:Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
[1,0]<stdout>:Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
[1,0]<stdout>:Processing...
[1,0]<stdout>:Done!
[1,1]<stdout>:[2021-08-06 19:12:32.326 algo-1:33 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None

[1,0]<stdout>:
[1,0]<stdout>:Test set: Average loss: 0.0534, Accuracy: 9818/10000 (98%)
[1,0]<stdout>:
[1,0]<stdout>:-INFO- PATH DO NOT EXIST
[1,0]<stdout>:Start syncing
[1,0]<stdout>:S3 Bucket: sagemaker-us-east-1-804604702169
[1,0]<stdout>:Syncing files from /opt/ml/local_checkpoints to s3://sagemaker-us-east-1-804604702169/SMD-MP-demo-2021-08-06-19-01-06-885/checkpoints/algo-1/
[1,0]<stdout>:Time Taken to Sync:  1.2393248081207275
[1,0]<stdout>:Finished syncing


2021-08-06 19:13:40 Uploading - Uploading generated training model
2021-08-06 19:13:40 Completed - Training job completed
Training seconds: 549
Billable seconds: 549
