<h1>Basic Custom Training Container</h1>

This notebook demonstrates how to build and use a basic custom Docker container for training with Amazon SageMaker. Reference documentation is available at https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html

We start by defining a few variables: the current execution role, the Amazon ECR repository to which we will push the custom Docker container, and the default Amazon S3 bucket used by Amazon SageMaker.

In [17]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'basic-training-container'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

716664005094
us-west-2
arn:aws:iam::716664005094:role/TeamRole
sagemaker-us-west-2-716664005094
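
The cell above takes the account ID from the fifth colon-separated field of the role ARN. As a minimal sketch of the same idea with some validation added, the hypothetical helper below (the name `account_id_from_arn` is an assumption, not part of the SageMaker SDK) checks the general ARN layout `arn:partition:service:region:account-id:resource` before returning the account ID.

```python
def account_id_from_arn(arn):
    """Return the 12-digit account ID embedded in an ARN string.

    Hypothetical helper for illustration; it only checks the general
    ARN layout arn:partition:service:region:account-id:resource.
    """
    # Split into at most 6 fields so colons inside the resource part are kept.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("not an ARN: {!r}".format(arn))
    account_id = parts[4]
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError("ARN has no 12-digit account ID: {!r}".format(arn))
    return account_id


# Example with the role ARN printed above.
print(account_id_from_arn("arn:aws:iam::716664005094:role/TeamRole"))
```

An alternative that does not depend on the ARN format is to ask AWS STS directly, for example with `boto3.client('sts').get_caller_identity()['Account']`.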


Let's take a look at the Dockerfile which defines the statements for building our custom SageMaker training container:

In [20]:
! pygmentize ../docker/Dockerfile

# Part of the implementation of this container is based on the Amazon SageMaker Apache MXNet container.
# https://github.com/aws/sagemaker-mxnet-container

FROM ubuntu:16.04

LABEL maintainer="Amazon AI"

# Defining some variables used at build time to install Python3
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3
ARG PYTHON_VERSION=3.6.6

# Install some handful libraries like curl, wget, git, build-essential, zlib
RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get update && apt-get install -y --no-install-recommends

At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from Ubuntu 16.04</li>
    <li>Define some variables to be used at build time to install Python 3</li>
    <li>Install some handy libraries with apt-get</li>
    <li>Install Python 3 and create a symbolic link</li>
    <li>Install some Python libraries such as NumPy, pandas, and scikit-learn</li>
    <li>Set a few environment variables, including PYTHONUNBUFFERED, which prevents Python from buffering standard output (useful for logging)</li>
    <li>Finally, copy all contents of <strong>code/</strong> (which is where our training code is) to the WORKDIR and define the ENTRYPOINT</li>
</ul>

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [21]:
! pygmentize ../scripts/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

docker build -f ../docker/Dockerfile -t $REPO_NAME ../docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest


<h3>--------------------------------------------------------------------------------------------------------------------</h3>

The script builds the Docker image, tags it with the ECR repository URI, logs in to Amazon ECR, creates the repository if it does not exist, and finally pushes the image to the ECR repository. The first build takes a few minutes; subsequent builds are faster because Docker caches the intermediate layers and reuses them.

In [19]:
%%capture
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name
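
The script shown earlier logs in with `aws ecr get-login`, which exists only in AWS CLI version 1; AWS CLI version 2 uses `aws ecr get-login-password` piped into `docker login` instead. As a rough sketch of the same five steps with the newer login command, the hypothetical helper below only assembles the shell commands as strings, so they can be reviewed before anything is run. The function name, its parameters, and the explicit `--region` flags are assumptions made for this illustration, not part of the original script.

```python
import shlex


def build_and_push_commands(account_id, region, repo_name, tag="latest",
                            dockerfile="../docker/Dockerfile",
                            context="../docker"):
    """Return the shell commands for build, tag, login, ensure-repo and push.

    Hypothetical helper: it only builds strings and executes nothing.
    """
    registry = "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region)
    image_uri = "{}/{}:{}".format(registry, repo_name, tag)
    q = shlex.quote  # quote arguments so unusual characters stay safe
    return [
        # 1. Build the image from the Dockerfile.
        "docker build -f {} -t {} {}".format(q(dockerfile), q(repo_name), q(context)),
        # 2. Tag the image with the full ECR URI.
        "docker tag {} {}".format(q(repo_name), q(image_uri)),
        # 3. Log in to ECR (AWS CLI v2 style).
        "aws ecr get-login-password --region {} | "
        "docker login --username AWS --password-stdin {}".format(q(region), q(registry)),
        # 4. Create the repository only if describing it fails.
        "aws ecr describe-repositories --region {r} --repository-names {n} || "
        "aws ecr create-repository --region {r} --repository-name {n}".format(
            r=q(region), n=q(repo_name)),
        # 5. Push the tagged image.
        "docker push {}".format(q(image_uri)),
    ]


for command in build_and_push_commands("123456789012", "us-west-2", "my-namespace/my-repo"):
    print(command)
```

Each string could then be passed to `subprocess.run(command, shell=True, check=True)`; keeping the planning step separate from execution makes it easy to inspect or log the commands first.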

<h3>Training with Amazon SageMaker</h3>

Once the container has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker. Starting a training job requires the ECR URI of the Docker image used for training as a parameter.

In [22]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

716664005094.dkr.ecr.us-west-2.amazonaws.com/sagemaker-training-containers/basic-training-container:latest


Because the purpose of this example is to explain how to build custom containers, we are not going to train a real model. The script that will be executed does not define any specific training logic; it just prints the configurations injected by SageMaker and implements a dummy training loop. The training data is dummy data as well. Let's analyze the code first:

In [23]:
! pygmentize ../docker/code/main.py

from __future__ import absolute_import

import sys
import time
import os

from utils import ExitSignalHandler
from utils import write_failure_file, print_json_object, load_json_object, save_model_artifacts, print_files_in_path

hyperparameters_file_path = "/opt/ml/input/config/hyperparameters.json"
inputdataconfig_file_path = "/opt/ml/input/config/inputdataconfig.json"
resource_file_path = "/opt/ml/input/config/resourceconfig.json"
data_files_path = "/opt/ml/input/data/"
failure_file_path = "/opt/ml
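
The training logs later in this notebook print the hyperparameters as `{'hp1': 'value1', 'hp2': '300', 'hp3': '0.001'}`: every value arrives in the container as a string, even when it was set as a number. The following sketch shows one way a training script could read `hyperparameters.json` and recover numeric values. The helper names and the best-effort JSON cast are assumptions for illustration; this is not necessarily how the `utils` module of this example does it.

```python
import json
import os


def parse_value(raw):
    """Best-effort cast: '300' -> 300, '0.001' -> 0.001, 'value1' stays a string."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        # Not valid JSON (or not a string at all): keep the value unchanged.
        return raw


def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Read hyperparameters.json from config_dir and cast each value."""
    path = os.path.join(config_dir, "hyperparameters.json")
    with open(path) as f:
        raw = json.load(f)
    return {name: parse_value(value) for name, value in raw.items()}
```

Inside a training container, `load_hyperparameters()` with the default directory would read the file injected by SageMaker; passing another directory makes the function easy to try locally. Note that the JSON cast also turns strings such as `'true'` into booleans, which may or may not be what a given script wants.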

We upload some dummy data to Amazon S3, in order to define our S3-based training channels.

In [24]:
#! echo "val1, val2, val3" > dummy.csv
#print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/train'))
#print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/val'))
#! rm dummy.csv

s3://sagemaker-us-west-2-716664005094/basic-training-container/train/dummy.csv
s3://sagemaker-us-west-2-716664005094/basic-training-container/val/dummy.csv
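
In File input mode, SageMaker copies the data of each channel into a subdirectory of the data directory referenced in main.py (`/opt/ml/input/data/`), named after the channel, so a channel called `train` ends up in `/opt/ml/input/data/train`. As a small sketch of how a script might inspect a channel, the hypothetical function below lists the files found under one channel directory; the function name and the `data_dir` parameter are assumptions for illustration.

```python
import os


def list_channel_files(channel, data_dir="/opt/ml/input/data"):
    """Return the sorted paths of all files under data_dir/<channel>."""
    channel_dir = os.path.join(data_dir, channel)
    found = []
    # os.walk yields nothing if the directory does not exist.
    for root, _dirs, files in os.walk(channel_dir):
        for name in files:
            found.append(os.path.join(root, name))
    return sorted(found)
```

Calling `list_channel_files('train')` inside the container would return the files downloaded for the `train` channel; an empty list means the channel directory is missing or empty.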


Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).

In [34]:
%%time

import sagemaker

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    #train_instance_type='local', # use local mode
                                    train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix)

est.set_hyperparameters(hp1='value1',
                        hp2=300,
                        hp3=0.001)

train_config = sagemaker.session.s3_input('s3://landsat-pds/L8/001/002/LC80010022016230LGN00/', content_type='image/jpeg')
val_config = sagemaker.session.s3_input('s3://landsat-pds/L8/001/002/LC80010022016246LGN00/', content_type='image/jpeg')

est.fit({'train': train_config, 'validation': val_config })

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
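
The warnings above are emitted because version 2 of the SageMaker Python SDK renamed several Estimator arguments and replaced `sagemaker.session.s3_input` with `sagemaker.inputs.TrainingInput`. The sketch below maps a few of the old argument names to the new ones; the helper `to_v2_kwargs` and the (deliberately incomplete) mapping table are hypothetical and only meant to illustrate the renames listed in the SDK's v2 migration guide.

```python
# A few of the argument renames from the SageMaker Python SDK v2 migration
# guide (not an exhaustive list).
V2_RENAMES = {
    "image_name": "image_uri",
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_max_run": "max_run",
    "train_volume_size": "volume_size",
}


def to_v2_kwargs(v1_kwargs):
    """Return a copy of v1_kwargs with v1 argument names replaced by v2 names."""
    return {V2_RENAMES.get(name, name): value for name, value in v1_kwargs.items()}


v2_kwargs = to_v2_kwargs({
    "train_instance_count": 1,
    "train_instance_type": "ml.m5.xlarge",
    "base_job_name": "basic-training-container",
})
print(v2_kwargs)

# With SDK v2 installed, the estimator and channels could then be created as:
#   from sagemaker.estimator import Estimator
#   from sagemaker.inputs import TrainingInput
#   est = Estimator(image_uri=container_image_uri, role=role, **v2_kwargs)
#   train_config = TrainingInput('s3://...', content_type='image/jpeg')
```

In practice it is simpler to write the new names directly; the mapping is shown only to make the correspondence with the cell above explicit.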


2021-02-01 17:40:10 Starting - Starting the training job...
2021-02-01 17:40:36 Starting - Launching requested ML instancesProfilerReport-1612201210: InProgress
.........
2021-02-01 17:41:57 Starting - Preparing the instances for training...
2021-02-01 17:42:37 Downloading - Downloading input data...
2021-02-01 17:42:58 Training - Downloading the training image...
2021-02-01 17:43:38 Training - Training image download completed. Training in progress..
Running training...

Hyperparameters configuration:
{'hp1': 'value1', 'hp2': '300', 'hp3': '0.001'}

Input data configuration:
{'train': {'ContentType': 'image/jpeg',
           'RecordWrapperType': 'None',
           'S3DistributionType': 'FullyReplicated',
           'TrainingInputMode': 'File'},
 'validation': {'ContentType': 'image/jpeg',
                'RecordWrapperType': 'None',
                'S3DistributionType': 'FullyReplicated',
                'TrainingInputMode': 'File'}}

L

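<h3>Training with data stored on Amazon EFS</h3>

Training channels can also be read from an Amazon EFS file system instead of Amazon S3. The following cells install the EFS mount helper on the notebook instance, show (commented out) how to mount the file system and copy the sample data to it, and then define the channels with FileSystemInput. The training job must run in a VPC that can reach the file system, so the Estimator is given subnets and security group IDs.

In [33]: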
!sudo yum install -y amazon-efs-utils

Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper,
              : versionlock
amzn-main                                                | 2.1 kB     00:00     
amzn-updates                                             | 3.8 kB     00:00     
copr:copr.fedorainfracloud.org:vbatts:shadow-utils-newxi | 3.0 kB     00:00     
libnvidia-container/x86_64/signature                     |  833 B     00:00     
libnvidia-container/x86_64/signature                     | 2.1 kB     00:00 !!! 
nvidia-container-runtime/x86_64/signature                |  833 B     00:00     
nvidia-container-runtime/x86_64/signature                | 2.1 kB     00:00 !!! 
nvidia-docker/x86_64/signature                           |  488 B     00:00     
nvidia-docker/x86_64/signature                           | 2.1 kB     00:00 !!! 
Package amazon-efs-utils-1.28.2-1.amzn1.noarch already installed and latest version
Nothing to do


In [None]:
#!mkdir efs

In [31]:
!echo $HOME

/home/ec2-user


In [None]:
#!sudo mount -t efs fs-df51afdb:/ $HOME/efs

In [None]:
#!aws s3 cp --recursive s3://landsat-pds/L8/001/002/ $HOME/efs

In [26]:
from sagemaker.inputs import FileSystemInput

# Specify the EFS file system ID.
file_system_id = 'fs-df51afdb'
print(f"EFS file-system-id: {file_system_id}")

# Specify the directory path for the input data on the file system.
# The path must be normalized and absolute.
file_system_directory_path = '/LC80010022016230LGN00'
print(f'EFS file-system data input path: {file_system_directory_path}')

# Specify the access mode used to mount the directory.
# Here the directory is mounted read-only ('ro').
file_system_access_mode = 'ro'

# Specify the file system type.
file_system_type = 'EFS'

# Training channel
train = FileSystemInput(file_system_id=file_system_id,
                        file_system_type=file_system_type,
                        directory_path=file_system_directory_path,
                        file_system_access_mode=file_system_access_mode)

# Validation channel: same file system, different directory
file_system_directory_path = '/LC80010022016246LGN00'

validation = FileSystemInput(file_system_id=file_system_id,
                             file_system_type=file_system_type,
                             directory_path=file_system_directory_path,
                             file_system_access_mode=file_system_access_mode)


EFS file-system-id: fs-df51afdb
EFS file-system data input path: /LC80010022016230LGN00
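
The comments in the cell above ask for a normalized, absolute directory path. As a minimal local pre-check, the hypothetical helper below treats a path as acceptable when it starts with a single `/` and is left unchanged by `posixpath.normpath` (no `..`, no `.`, no doubled or trailing slashes). This is one reading of "normalized"; the validation applied by the service itself may differ.

```python
import posixpath


def is_normalized_absolute(path):
    """Return True if path is absolute and already in normalized POSIX form."""
    # Must start with exactly one slash (normpath preserves a leading '//').
    if not path.startswith("/") or path.startswith("//"):
        return False
    # normpath removes '.', '..', duplicate and trailing slashes.
    return posixpath.normpath(path) == path


print(is_normalized_absolute("/LC80010022016230LGN00"))
print(is_normalized_absolute("/data/../LC80010022016230LGN00"))
```

Running such a check before creating the FileSystemInput objects catches typos in the path early, without waiting for a training job to start.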


In [28]:
%%time

import sagemaker

security_group_ids = ['sg-0c21b70b7f1480a4e']
subnets = ['subnet-0370146357224ed30']

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    #train_instance_type='local', # use local mode
                                    train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix,
                                    subnets=subnets,
                                    security_group_ids=security_group_ids)

est.set_hyperparameters(hp1='value1',
                        hp2=300,
                        hp3=0.001)

data_channels = {'train': train, 'validation': validation}

est.fit(inputs=data_channels)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-01-31 05:38:19 Starting - Starting the training job...
2021-01-31 05:38:43 Starting - Launching requested ML instancesProfilerReport-1612071499: InProgress
......
2021-01-31 05:39:46 Starting - Preparing the instances for training......
2021-01-31 05:40:44 Downloading - Downloading input data
2021-01-31 05:40:44 Training - Downloading the training image...
2021-01-31 05:41:19 Uploading - Uploading generated training model
Running training...

Hyperparameters configuration:
{'hp1': 'value1', 'hp2': '300', 'hp3': '0.001'}

Input data configuration:
{'train': {'RecordWrapperType': 'None',
           'S3DistributionType': 'FullyReplicated',
           'TrainingInputMode': 'File'},
 'validation': {'RecordWrapperType': 'None',
                'S3DistributionType': 'FullyReplicated',
                'TrainingInputMode': 'File'}}

List of files in validation channel: 
/opt/ml/input/data/validation/LC80010022016230LGN00_B5.TIF
