# ***BONUS: PyTorch SageMaker Data Parallel Distributed Training with Amazon SageMaker***

In [1]:
# !pip install sagemaker --upgrade -q
# !pip install ipywidgets -q

**Step 1:** Import the essential packages, start a SageMaker session, and specify the bucket name you created in the prerequisites section of this workshop.

In [2]:
import os
import boto3
import time
import numpy as np
import sagemaker

sess = boto3.Session()
sm   = sess.client('sagemaker')
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket_name    = sagemaker_session.default_bucket()
jobs_folder    = 'jobs'
dataset_folder = 'datasets'

![](https://miro.medium.com/max/1000/0*GRfvsrvtfpRm400-)

#### Prepare the training data
The CIFAR-10 dataset is a subset of the 80 million tiny images dataset. It consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.
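To make the dataset's size concrete, the figures above can be checked with a little arithmetic. This is only a sketch: the 50,000/10,000 train/test split is the standard CIFAR-10 split, and the variable names are illustrative.

```python
# Sanity-check the CIFAR-10 figures quoted above.
num_classes = 10
images_per_class = 6_000
total_images = num_classes * images_per_class   # 60,000 images in total

# Standard CIFAR-10 split: 50,000 training images and 10,000 test images.
train_images, test_images = 50_000, 10_000
assert train_images + test_images == total_images

# Each image is 32 x 32 pixels with 3 colour channels, one byte per value.
bytes_per_image = 32 * 32 * 3                   # 3,072 bytes
train_set_mib = train_images * bytes_per_image / 2**20

print(total_images, bytes_per_image, round(train_set_mib, 1))
```

The raw training pixels come to roughly 146.5 MiB, small enough that uploading the dataset to S3 and downloading it to the training instance is quick.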

In [3]:
import torchvision
cifar10_dataset = torchvision.datasets.CIFAR10('cifar10-dataset', 
                                     train=True, 
                                     download=True)

Files already downloaded and verified


In [4]:
datasets = sagemaker_session.upload_data(path='cifar10-dataset', 
                                         key_prefix=f'{dataset_folder}/cifar10-dataset')
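`upload_data` returns the S3 URI of the uploaded data, which is what later gets passed to `fit`. As a rough sketch of what that string should look like for a directory upload, assuming a placeholder bucket name (the helper `expected_s3_uri` and the bucket below are hypothetical and not part of the SageMaker SDK):

```python
def expected_s3_uri(bucket: str, key_prefix: str) -> str:
    """Build the URI a directory upload under key_prefix is expected to produce."""
    return f"s3://{bucket}/{key_prefix.strip('/')}"


dataset_folder = "datasets"
# Hypothetical bucket name; default buckets follow sagemaker-<region>-<account-id>.
example_bucket = "sagemaker-us-east-1-123456789012"

uri = expected_s3_uri(example_bucket, f"{dataset_folder}/cifar10-dataset")
print(uri)
```

Printing `datasets` in the notebook and comparing it against this pattern is a quick way to confirm the upload landed under the intended prefix.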

**Step 2:** Specify the hyperparameters, the instance type, and the number of instances across which to distribute training.

In [5]:
job_name   = f'pytorch-smddp-dist-{time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())}'
output_path = f's3://{bucket_name}/{jobs_folder}'

hyperparameters = {'epochs'       : 15, 
                   'lr'           : 0.01,
                   'momentum'     : 0.9,
                   'batch-size'   : 256,
                   'model-type'   : 'resnet18',
                   'backend'      : 'smddp'}
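In script mode, SageMaker hands each hyperparameter to the entry-point script as a `--key value` command-line argument. The training script `cifar10-distributed-smddp-gpu.py` is not shown here, so the parser below is only a guess at how it might read these values; the function name `parse_hyperparameters` and the defaults are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Hypothetical parser mirroring the hyperparameters dict above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=15)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--batch-size", type=int, default=256)
    parser.add_argument("--model-type", type=str, default="resnet18")
    parser.add_argument("--backend", type=str, default="smddp")
    # parse_known_args ignores any extra arguments the launcher may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the command line built from the dict: --epochs 15 --lr 0.01 ...
hyperparameters = {"epochs": 15, "lr": 0.01, "momentum": 0.9,
                   "batch-size": 256, "model-type": "resnet18", "backend": "smddp"}
argv = []
for key, value in hyperparameters.items():
    argv += [f"--{key}", str(value)]

args = parse_hyperparameters(argv)
print(args.epochs, args.batch_size, args.model_type, args.backend)
```

Note that `argparse` converts the hyphenated keys to underscores, so `batch-size` is read as `args.batch_size`.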

In [6]:
distribution = { "smdistributed": { 
                    "dataparallel": { "enabled": True } 
                } 
               }
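Enabling `dataparallel` in the estimator only launches the job with the SageMaker data parallel library; the training script still has to initialise it. With recent PyTorch containers this is typically done by importing `smdistributed.dataparallel.torch.torch_smddp`, which registers an `smddp` backend for `torch.distributed`. The sketch below shows one way a script might do that. The helper names and the fallback to `gloo` outside SageMaker are assumptions, not something taken from the workshop's script.

```python
import importlib


def resolve_backend(requested: str = "smddp") -> str:
    """Return a usable torch.distributed backend name."""
    if requested == "smddp":
        try:
            # Importing this module registers "smddp" with torch.distributed.
            importlib.import_module("smdistributed.dataparallel.torch.torch_smddp")
        except ImportError:
            # Assumption: fall back to gloo when the library is not installed,
            # for example when running the script on a laptop.
            return "gloo"
    return requested


def init_distributed(requested: str = "smddp") -> str:
    """Initialise the process group; the launcher must supply rank/world-size env vars."""
    import torch.distributed as dist  # imported lazily so the helper above needs no torch

    backend = resolve_backend(requested)
    dist.init_process_group(backend=backend)
    return backend


print(resolve_backend("smddp"))
```

Inside the training container the first helper should return `smddp`; anywhere the library is missing it returns `gloo`, so the same script can be smoke-tested locally.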

In [7]:
from sagemaker.pytorch import PyTorch
estimator = PyTorch(entry_point          = 'cifar10-distributed-smddp-gpu.py', 
                    source_dir           = 'code',
                    output_path          = output_path + '/',
                    code_location        = output_path,
                    role                 = role,
                    instance_count       = 1,
                    instance_type        = 'ml.p4d.24xlarge', # 'ml.p3.16xlarge', 'ml.p3dn.24xlarge', 'ml.p4d.24xlarge',
                    framework_version    = '1.11.0', 
                    py_version           = 'py38',
                    distribution         = distribution,
                    hyperparameters      = hyperparameters)
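The instance type and count determine how many GPU workers take part in training. Each of the three instance types named in the comment above has 8 GPUs, so one `ml.p4d.24xlarge` gives 8 workers. The sketch below works out the resulting global batch size and steps per epoch. Whether `batch-size` is per GPU or global depends on the training script, which is not shown, so both readings are covered; the function `plan` is illustrative.

```python
import math

# GPUs per instance for the instance types listed in the estimator comment.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def plan(instance_type, instance_count, batch_size,
         train_images=50_000, per_gpu=True):
    """Estimate world size, global batch size and optimizer steps per epoch."""
    world_size = GPUS_PER_INSTANCE[instance_type] * instance_count
    # Assumption: per_gpu=True treats batch_size as the per-worker batch.
    global_batch = batch_size * world_size if per_gpu else batch_size
    return {
        "world_size": world_size,
        "global_batch": global_batch,
        "steps_per_epoch": math.ceil(train_images / global_batch),
    }


print(plan("ml.p4d.24xlarge", 1, 256, per_gpu=True))
print(plan("ml.p4d.24xlarge", 1, 256, per_gpu=False))
```

If 256 is a per-GPU batch, each step processes 2,048 images and an epoch over the 50,000 CIFAR-10 training images takes 25 steps; if it is the global batch, an epoch takes 196 steps. Raising `instance_count` changes these numbers, so the learning rate may need retuning.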

**Step 3:** Specify the dataset location in Amazon S3 and then call the `fit` method.

In [8]:
estimator.fit({'train': datasets}, 
              job_name=job_name, 
              wait=True)
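Because `wait=True`, the call to `fit` blocks and streams the job's logs until training finishes. Once it returns, the `sm` client created in Step 1 can be used to inspect the job. The helper below is a minimal sketch built on `describe_training_job`; the function name and the selection of fields are illustrative choices.

```python
def summarize_training_job(sm_client, job_name):
    """Return a few key fields from a SageMaker training job description."""
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return {
        "status": desc["TrainingJobStatus"],
        "secondary_status": desc.get("SecondaryStatus"),
        "billable_seconds": desc.get("BillableTimeInSeconds"),
        "model_artifacts": desc.get("ModelArtifacts", {}).get("S3ModelArtifacts"),
    }


# Usage in the notebook, after fit() has returned:
# print(summarize_training_job(sm, job_name))
```

Passing the client in as an argument keeps the helper easy to test with a stand-in object, without touching AWS.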

2022-06-21 04:58:58 Starting - Starting the training job...ProfilerReport-1655787537: InProgress
...
2022-06-21 04:59:46 Starting - Preparing the instances for training...........................
2022-06-21 05:04:24 Downloading - Downloading input data
2022-06-21 05:04:24 Training - Downloading the training image........................
2022-06-21 05:08:23 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
INFO:sagemaker-training-toolkit:No exception classes found in smdistributed.dataparallel
INFO:sagemaker-training-toolkit:Imported framework sagemaker_pytorch_container.training
INFO:sagemaker_pytorch_container.training:Block until all host DNS lookups succeed.
INFO:sagemaker_pytorch_container.training:Invoking SMDataParallel
INFO:sagemaker_pytorch_container.training:Invoking user training script.