## PT DDP Launcher Testing
This notebook tests the following combination:

* image: PyTorch (PT) training DLC with my changes
* distribution = pytorchddp, backend = nccl

In [1]:
#!pip uninstall -y sagemaker

Found existing installation: sagemaker 2.100.0
Uninstalling sagemaker-2.100.0:
  Successfully uninstalled sagemaker-2.100.0


In [9]:
# Need SageMaker Python SDK v2.103.0 or later
%pip install -U sagemaker

Collecting sagemaker
  Downloading sagemaker-2.104.0.tar.gz (566 kB)
     |████████████████████████████████| 566 kB 7.5 MB/s
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.104.0-py2.py3-none-any.whl size=782442 sha256=853b35b9f602defb51e173178f7727a422bf0e7cbd06849cbadae064dba26f03
  Stored in directory: /home/ec2-user/.cache/pip/wheels/68/29/60/75316863ecebc619cc6bb2cd9dcf3a6cab99f719f0911c5997
Successfully built sagemaker
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.101.8.dev0
    Uninstalling sagemaker-2.101.8.dev0:
      Successfully uninstalled sagemaker-2.101.8.dev0
Successfully installed sagemaker-2.104.0
Note: you may need to restart the kernel to use updated packages.
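The cells in this notebook switch between a release build and a `.dev0` wheel, so it can help to check the installed version before launching jobs. The helper below is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical names, the 2.103.0 threshold is taken from the comment in the install cell, and pre-release suffixes such as `.dev0` are simply ignored.

```python
import re


def parse_version(text):
    """Return the leading numeric release components as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad to three components so "2.103" compares equal to "2.103.0".
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed, minimum="2.103.0"):
    """True if the installed version is at least the minimum (suffixes ignored)."""
    return parse_version(installed) >= parse_version(minimum)


# Usage in the notebook (requires the sagemaker package):
#   import sagemaker
#   assert meets_minimum(sagemaker.__version__)
```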


In [1]:
%pip show sagemaker

Name: sagemaker
Version: 2.104.0
Summary: Open source library for training and deploying models on Amazon SageMaker.
Home-page: https://github.com/aws/sagemaker-python-sdk/
Author: Amazon Web Services
Author-email: 
License: Apache License 2.0
Location: /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages
Requires: attrs, boto3, google-pasta, importlib-metadata, numpy, packaging, pandas, pathos, protobuf, protobuf3-to-dict, smdebug-rulesconfig
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [2]:
import sagemaker
from sagemaker.local import LocalSession

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

#sess = LocalSession()
#sess.config = {"local": {"local_mode": True }}
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
#Add instructions for local environment later, if needed

sagemaker role arn: arn:aws:iam::570106654206:role/Dev
sagemaker bucket: sagemaker-us-west-2-570106654206
sagemaker session region: us-west-2


In [30]:
region = "us-west-2"
image = (
    "pt-ddp-custom"  # Example: pt-smdataparallel-efficientnet-sagemaker
)
tag = "1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker-2.6.0-numproc"  # Example: latest
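The `region`, `image`, and `tag` values above can be combined into the full ECR image URI instead of hardcoding it. This is a minimal sketch assuming the standard `<account>.dkr.ecr.<region>.amazonaws.com` registry host; `build_ecr_image_uri` is a hypothetical helper, and the account ID is the one used in the `docker login` cell.

```python
def build_ecr_image_uri(account_id, region, repository, tag):
    """Assemble an ECR image URI from its parts (standard AWS partition assumed)."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


example_uri = build_ecr_image_uri(
    account_id="570106654206",
    region="us-west-2",
    repository="pt-ddp-custom",
    tag="1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker-2.6.0-numproc",
)
print(example_uri)
```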


In [7]:
# Uncomment and run only when docker push fails with OOM errors
#! docker system prune -af

In [4]:
! aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 570106654206.dkr.ecr.{region}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [3]:
from sagemaker.pytorch import PyTorch

# Refer to https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers to get the right URIs based on region
# Using URI from 08/11 (custom image built from the PT training DLC)
image_uri = '570106654206.dkr.ecr.us-west-2.amazonaws.com/pt-ddp-custom:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker-2.6.0-numproc'

# Configuration for launching training with PyTorch DDP (pytorchddp).
# Setting this distribution is the only change required on the estimator.
# Variant with MPI enabled as well; not used by the tests below.
distribution_pt_mpi = {'pytorchddp':{ 'enabled': True },
               'mpi':{'enabled':True, 'num_of_processes_per_host':1}}
distribution = {'pytorchddp':{ 'enabled': True }}

#### Test non-SMDDP supported instance type (g5.16xlarge)


In [18]:
estimator_g5 = PyTorch(
    base_job_name="ptddp-mnist-test-g5",
    source_dir="../code",
    entry_point="train_ptddp_mnist.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=2,
    # g5 instance type, which is not supported by SMDDP
    instance_type="ml.g5.16xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)

In [19]:
estimator_g5.fit(wait=False)
estimator_g5.latest_training_job.name

'ptddp-mnist-test-g5-2022-08-12-21-01-47-371'
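Because the jobs are started with `fit(wait=False)`, the cell returns only the job name and the outcome has to be checked separately. The sketch below polls the job status through the SageMaker `describe_training_job` API; `wait_for_training_job` is a hypothetical helper, and the client is passed in so that any boto3 SageMaker client (for example `sess.sagemaker_client`) can be used.

```python
import time

# Statuses after which a training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Usage in the notebook (requires AWS credentials):
#   status = wait_for_training_job(
#       sess.sagemaker_client, estimator_g5.latest_training_job.name
#   )
```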

#### Test CPU runs with backend = gloo (on a GPU instance type, ml.p4d.24xlarge)

In [34]:
estimator_cpu = PyTorch(
    base_job_name="ptddp-mnist-test-gloo",
    source_dir="../code",
    entry_point="train_ptddp_mnist_gloo.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=2,
    # GPU instance type; the entry point for this test uses the gloo backend
    instance_type="ml.p4d.24xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)

In [35]:
estimator_cpu.fit(wait=False)
estimator_cpu.latest_training_job.name

'ptddp-mnist-test-gloo-2022-08-13-18-54-03-474'

#### Test bigger clusters

In [36]:
estimator_8node = PyTorch(
    base_job_name="ptddp-mnist-test-8node",
    source_dir="../code",
    entry_point="train_ptddp_mnist.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=8,
    # For training with p3dn instances use ml.p3dn.24xlarge, with p4d instances use ml.p4d.24xlarge
    instance_type="ml.p4d.24xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)

In [37]:
estimator_8node.fit(wait=False)
estimator_8node.latest_training_job.name

'ptddp-mnist-test-8node-2022-08-13-19-24-33-584'

In [38]:
estimator_16node = PyTorch(
    base_job_name="ptddp-mnist-test-16node",
    source_dir="../code",
    entry_point="train_ptddp_mnist.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=16,
    # For training with p3dn instances use ml.p3dn.24xlarge, with p4d instances use ml.p4d.24xlarge
    instance_type="ml.p4d.24xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)

In [39]:
estimator_16node.fit(wait=False)
estimator_16node.latest_training_job.name

'ptddp-mnist-test-16node-2022-08-13-19-41-02-914'
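When checking the logs of the larger clusters, it helps to know how many ranks to expect. The sketch below assumes the launcher starts one process per GPU, which this notebook does not state explicitly; `GPUS_PER_INSTANCE` and `expected_world_size` are hypothetical names, and the table only covers instance types used in these tests.

```python
# GPUs per instance for the instance types launched in this notebook.
GPUS_PER_INSTANCE = {
    "ml.p4d.24xlarge": 8,
    "ml.g5.16xlarge": 1,
    "ml.g4dn.xlarge": 1,
}


def expected_world_size(instance_type, instance_count):
    """World size under the assumption of one DDP process per GPU."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


for nodes in (8, 16):
    print(nodes, "nodes ->", expected_world_size("ml.p4d.24xlarge", nodes), "ranks")
```

Under that assumption, the 8-node and 16-node p4d jobs would report world sizes of 64 and 128.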

#### Regression Testing for SMDDP


In [43]:
estimator_smddp = PyTorch(
    base_job_name="ptddp-mnist-test-regression",
    source_dir="../code",
    entry_point="train_ptddp_mnist_smddp.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # For training with multinode distributed training, set this count. Example: 2
    instance_count=4,
    # For training with p3dn instances use ml.p3dn.24xlarge, with p4d instances use ml.p4d.24xlarge
    instance_type="ml.p4d.24xlarge",
    sagemaker_session=sess,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}, 'mpi':{'enabled':True, 'num_of_processes_per_host':1}},
    debugger_hook_config=False,
)
estimator_smddp.fit(wait=False)
estimator_smddp.latest_training_job.name

'ptddp-mnist-test-regression-2022-08-16-00-21-31-467'

#### Test on CPU instance

In [5]:
estimator_c5 = PyTorch(
    base_job_name="ptddp-mnist-test-c5",
    source_dir="../code",
    entry_point="train_ptddp_mnist_gloo_cpu.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=2,
    # CPU-only instance type
    instance_type="ml.c5.xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)
estimator_c5.fit(wait=False)
estimator_c5.latest_training_job.name

'ptddp-mnist-test-c5-2022-08-17-17-49-24-730'

#### Test different g* instances

In [49]:
estimator_g5g = PyTorch(
    base_job_name="ptddp-mnist-test-g5g",
    source_dir="../code",
    entry_point="train_ptddp_mnist.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=1,
    # g5g instance type (rejected by CreateTrainingJob, see the error below)
    instance_type="ml.g5g.xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)
estimator_g5g.fit(wait=False)
estimator_g5g.latest_training_job.name

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'ml.g5g.xlarge' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p4d.24xlarge, ml.g5.2xlarge, ml.c5n.xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.g5.4xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.g5.8xlarge, ml.c5.4xlarge, ml.c5n.18xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.c4.8xlarge, ml.g4dn.2xlarge, ml.c5.9xlarge, ml.g4dn.4xlarge, ml.c5.xlarge, ml.g4dn.16xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.g5.xlarge, ml.c5n.2xlarge, ml.g5.12xlarge, ml.g5.24xlarge, ml.c5n.4xlarge, ml.c5.18xlarge, ml.p3dn.24xlarge, ml.g5.48xlarge, ml.g5.16xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.c5n.9xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]
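The `ValidationException` message above lists every instance type that `CreateTrainingJob` accepts. A small parser can turn that message into a set for quick membership checks. This is only a sketch: `supported_instance_types` is a hypothetical helper, and it relies on the message format shown in this output, which is not a documented contract.

```python
import re


def supported_instance_types(error_message):
    """Extract the instance types listed after 'enum value set:' in the message."""
    match = re.search(r"enum value set: \[([^\]]*)\]", error_message)
    if match is None:
        return set()
    return {item.strip() for item in match.group(1).split(",") if item.strip()}


# Shortened sample in the same format as the error above.
sample = (
    "Value 'ml.g5g.xlarge' at 'resourceConfig.instanceType' failed to satisfy "
    "constraint: Member must satisfy enum value set: "
    "[ml.p4d.24xlarge, ml.g5.16xlarge, ml.g4dn.xlarge, ml.c5.xlarge]"
)
allowed = supported_instance_types(sample)
print("ml.g5g.xlarge" in allowed, "ml.g4dn.xlarge" in allowed)
```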

In [51]:
estimator_g4dn = PyTorch(
    base_job_name="ptddp-mnist-test-g4dn",
    source_dir="../code",
    entry_point="train_ptddp_mnist.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=1,
    # g4dn instance type
    instance_type="ml.g4dn.xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)
estimator_g4dn.fit(wait=False)
estimator_g4dn.latest_training_job.name

'ptddp-mnist-test-g4dn-2022-08-16-21-29-46-220'

#### Local mode test

In [53]:
from sagemaker.local import LocalSession
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

estimator_local = PyTorch(
    base_job_name="ptddp-mnist-test-local",
    source_dir="../code",
    entry_point="train_ptddp_mnist.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Local mode runs a single container on this machine
    instance_count=1,
    # Local mode with GPU; requires a loaded NVIDIA driver on the host
    instance_type="local_gpu",
    sagemaker_session=sagemaker_session,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)

estimator_local.fit(wait=False)
estimator_local.latest_training_job.name

Creating mdehmvb6o3-algo-1-7vo3t ... 
Creating mdehmvb6o3-algo-1-7vo3t ... error

ERROR: for mdehmvb6o3-algo-1-7vo3t  Cannot start service algo-1-7vo3t: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown

ERROR: for algo-1-7vo3t  Cannot start service algo-1-7vo3t: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Encountered errors while bringing up the project.


RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp27n26ri4/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
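The failure above comes from `nvidia-container-cli` reporting that the NVIDIA driver is not loaded on the host. A pre-flight check can catch this before the container is created. The sketch below treats a successful `nvidia-smi` run as a heuristic for a working driver; `local_instance_type` is a hypothetical helper, and the fallback value `"local"` selects CPU local mode, in which a script that uses the nccl backend would still not run.

```python
import shutil
import subprocess


def local_instance_type(which=shutil.which, run=subprocess.run):
    """Return "local_gpu" if nvidia-smi is present and exits cleanly, else "local"."""
    executable = which("nvidia-smi")
    if executable is None:
        return "local"
    result = run([executable], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return "local_gpu" if result.returncode == 0 else "local"


print(local_instance_type())
```

The `which` and `run` parameters exist only so the check can be exercised without a GPU.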

#### Test instance_type validation

In [54]:
%pip install sagemaker-2.101.8.dev0-py2.py3-none-any.whl

Processing ./sagemaker-2.101.8.dev0-py2.py3-none-any.whl
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.103.0
    Uninstalling sagemaker-2.103.0:
      Successfully uninstalled sagemaker-2.103.0
Successfully installed sagemaker-2.101.8.dev0
Note: you may need to restart the kernel to use updated packages.


In [1]:
%pip show sagemaker

Name: sagemaker
Version: 2.101.8.dev0
Summary: Open source library for training and deploying models on Amazon SageMaker.
Home-page: https://github.com/aws/sagemaker-python-sdk/
Author: Amazon Web Services
Author-email: 
License: Apache License 2.0
Location: /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages
Requires: attrs, boto3, google-pasta, importlib-metadata, numpy, packaging, pandas, pathos, protobuf, protobuf3-to-dict, smdebug-rulesconfig
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [7]:
estimator_c5 = PyTorch(
    base_job_name="ptddp-mnist-test-c5",
    source_dir="../code",
    entry_point="train_ptddp_mnist_gloo.py",
    role=role,
    py_version="py38",
    image_uri=image_uri,
    # Number of instances (nodes) in the training cluster
    instance_count=2,
    # CPU instance type, expected to be rejected by the pytorchddp validation
    instance_type="ml.c5.xlarge",
    sagemaker_session=sess,
    # Launch with PyTorch DDP (pytorchddp)
    distribution=distribution,
    debugger_hook_config=False,
)
estimator_c5.fit(wait=False)
estimator_c5.latest_training_job.name

ValueError: CPU training in not supported by pytorchddp. Please pick a GPU-based instance type from here: https://aws.amazon.com/ec2/instance-types/
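The error above shows the dev build rejecting a CPU instance type before any job is created. The sketch below illustrates one way such a check could work, by looking at the instance family prefix. It is not the SDK's actual implementation: `is_gpu_instance` is a hypothetical helper, and the rule that GPU families start with `p` or `g` is a heuristic.

```python
def is_gpu_instance(instance_type):
    """Heuristic: GPU instance families start with "p" or "g" (e.g. ml.p4d, ml.g5)."""
    if instance_type == "local_gpu":
        return True
    parts = instance_type.split(".")
    if len(parts) != 3 or parts[0] != "ml":
        return False
    return parts[1].startswith(("p", "g"))


for candidate in ("ml.p4d.24xlarge", "ml.g5.16xlarge", "ml.c5.xlarge", "local"):
    print(candidate, is_gpu_instance(candidate))
```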