<h1>Basic Custom Training Container</h1>

This notebook demonstrates how to build and use a basic custom Docker container for training with Amazon SageMaker. Reference documentation is available at https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html

We start by defining a few variables: the current execution role, the ECR repository to which we will push the custom Docker image, and the default Amazon S3 bucket used by Amazon SageMaker.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'aws_batch_tesseract/'
prefix = 'basic'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

348831852500
us-east-1
arn:aws:iam::348831852500:role/service-role/AmazonSageMaker-ExecutionRole-20180906T165759
sagemaker-us-east-1-348831852500
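
The account ID printed above is the fifth colon-separated field of the execution role ARN. As a minimal sketch, the same parsing can be wrapped in a small helper that fails loudly on malformed input. The function name `account_id_from_role_arn` is hypothetical and not part of the SageMaker SDK.

```python
def account_id_from_role_arn(role_arn):
    """Return the account ID embedded in an IAM role ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = role_arn.split(':')
    if len(parts) < 6 or parts[0] != 'arn':
        raise ValueError('not a valid ARN: {!r}'.format(role_arn))
    return parts[4]


# The role ARN printed by the cell above.
example_arn = ('arn:aws:iam::348831852500:role/service-role/'
               'AmazonSageMaker-ExecutionRole-20180906T165759')
print(account_id_from_role_arn(example_arn))
```

An alternative that does not depend on the ARN format is `boto3.client('sts').get_caller_identity()['Account']`, which asks AWS for the account of the current credentials.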


Let's take a look at the Dockerfile which defines the statements for building our custom SageMaker training container:

In [2]:
! pygmentize ./TESSERACT-SAGEMAKER-CONTAINER/Dockerfile

FROM ubuntu:18.04

#
# Defining some variables used at build time to install Python3
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3
ARG PYTHON_VERSION=3.6.6

#Non Interactive front end to avoid docker build stucks
ENV DEBIAN_FRONTEND noninteractive
ENV DEBIAN_FRONTEND teletype
#Install Basic Packages 
#Install tesseract
RUN echo ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true |  debconf-set-selections

RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get update && apt-get install

At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from Ubuntu 18.04</li>
    <li>Define some variables to be used at build time to install Python 3</li>
    <li>Install a handful of basic libraries with apt-get</li>
    <li>Install tesseract and pdfsandwich with their related libraries, including the fonts used for training</li>
    <li>Install Python 3 and create a symbolic link</li>
    <li>Install some Python libraries such as numpy, pandas, scikit-learn, etc.</li>
    <li>Set a few environment variables, including PYTHONUNBUFFERED, which disables buffering of Python standard output (useful for logging)</li>
    <li>Finally, copy all contents of <strong>code/</strong> (which is where our training code is) to the WORKDIR</li>
</ul>

<h3>Build and push the container</h3>
We are now ready to build the container image and push it to Amazon ECR. This task is performed by a shell script stored in the container_build_script/ folder. Let's take a look at the script and then execute it.

In [7]:
! pygmentize ./TESSERACT-SAGEMAKER-CONTAINER/container_build_script/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

docker build -f ../Dockerfile -t $REPO_NAME ../.

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest



The script builds the Docker image, tags it with the ECR repository URI, logs in to ECR, creates the repository if it does not exist, and finally pushes the image to the repository. The first build takes a few minutes; later builds are faster because Docker reuses cached layers.
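
The "create the repository if it does not exist" step can also be done from Python instead of the AWS CLI. The sketch below assumes a boto3 ECR client is passed in; the helper name `ensure_ecr_repository` is hypothetical, while `describe_repositories`, `create_repository` and `RepositoryNotFoundException` are the boto3 ECR client names.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Create the ECR repository only if it does not already exist.

    Returns True when a new repository was created, False otherwise.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=repository_name)
        return True


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# ensure_ecr_repository(boto3.client('ecr'), 'aws_batch_tesseract/basic')
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, without touching a real AWS account.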

In [None]:
!  bash -x ./TESSERACT-SAGEMAKER-CONTAINER/container_build_script/build_and_push.sh $account_id $region $ecr_repository_name

+ ACCOUNT_ID=348831852500
+ REGION=us-east-1
+ REPO_NAME=aws_batch_tesseract/basic
++ dirname ./TESSERACT-SAGEMAKER-CONTAINER/container_build_script/build_and_push.sh
+ cd ./TESSERACT-SAGEMAKER-CONTAINER/container_build_script
+ docker build -f ../Dockerfile -t aws_batch_tesseract/basic ../.
Sending build context to Docker daemon  88.25MB
Step 1/26 : FROM ubuntu:18.04
18.04: Pulling from library/ubuntu

9e3a4d10: Pulling fs layer
19cdbe7a: Pulling fs layer
61ea6baf: Pulling fs layer
Digest: sha256:8d31dad0c58f552e890d68bbfb735588b6b820a46e459672d96e585871acc110
Status: Downloaded newer image for ubuntu:18.04
 ---> ccc6e87d482b
Step 2/26 : ARG PYTHON=python3
 ---> Running in 018d83f7bd41
Removing intermediate container 018d83f7bd41
 ---> 67a35e3d320b
Step 3/26 : ARG PYTHON_PIP=python3-pip
 ---> Running 

<h3>Training with Amazon SageMaker</h3>

Once the image has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker. A training job takes the ECR URI of the training image as a parameter.

In [2]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

348831852500.dkr.ecr.us-east-1.amazonaws.com/aws_batch_tesseract/basic:latest


Because the purpose of this example is to explain how to build custom containers, the entry-point script is deliberately thin. It prints the arguments and environment variables injected by SageMaker, then either lists the contents of /opt/ml (when SAGEMAKER_BATCH is set to true) or runs the sagemaker_train_tesseract.sh training script. Let's analyze the code first:

In [5]:
! cat ./code/main.py

import os
import sys

print ("Starting Processing")
print ("Arguments")
print (sys.argv[0:])

print ("Environment")
print(os.environ)
print ("End Environment")

if 'SAGEMAKER_BATCH' in os.environ:
    if (os.environ['SAGEMAKER_BATCH'].lower()=='true'): 
        print ("Start Image Processing")
        os.system('/usr/bin/find /opt/ml')
  
else:
    print ("Start Training")
    os.system('/bin/bash sagemaker_train_tesseract.sh')
    print ("Training Complete")

print ("End Processing training")
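
Besides environment variables, SageMaker passes the job configuration to a training container as JSON files under /opt/ml/input/config. The following is a minimal sketch of reading them, not the code used in this container; the helper names `load_json_config` and `parse_epoch` are hypothetical. Note that hyperparameter values arrive as strings, so they need explicit conversion.

```python
import json
import os

# Directory where SageMaker writes the training job configuration.
CONFIG_DIR = '/opt/ml/input/config'


def load_json_config(filename, config_dir=CONFIG_DIR):
    """Load one of the JSON configuration files of a training job.

    Returns an empty dict when the file is absent, for example when the
    script runs outside SageMaker.
    """
    path = os.path.join(config_dir, filename)
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return json.load(handle)


def parse_epoch(hyperparameters, default=1):
    """Convert the string-valued 'epoch' hyperparameter to an int."""
    return int(hyperparameters.get('epoch', default))


if __name__ == '__main__':
    hyperparameters = load_json_config('hyperparameters.json')
    input_channels = load_json_config('inputdataconfig.json')
    print('Training Params', hyperparameters)
    print('Input Params', input_channels)
    print('Epochs', parse_epoch(hyperparameters))
```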



Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).

In [47]:
import sagemaker

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    #train_instance_type='local', # use local mode
                                    train_instance_type='ml.c5.2xlarge',
                                    base_job_name=prefix
                                    )

est.set_hyperparameters(epoch=2000)

train_data = sagemaker.session.s3_input('s3://348831852500-sagemaker-us-east-1/Tesseract/training/')

est.fit({'train': train_data })

2020-01-24 22:34:32 Starting - Starting the training job...
2020-01-24 22:34:34 Starting - Launching requested ML instances......
2020-01-24 22:35:42 Starting - Preparing the instances for training...
2020-01-24 22:36:28 Downloading - Downloading input data...
2020-01-24 22:36:44 Training - Downloading the training image...
2020-01-24 22:37:25 Training - Training image download completed. Training in progress.
Training Params
{'epoch': '2000'}
Input Params
{'train': {'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated', 'RecordWrapperType': 'None'}}
Start Training
Addind training file ita.Leg_001_DA00_cat_002.pdf_page0000.exp0.box
Addind training file ita.Leg_001_DA00_cat_002.pdf_page0000.exp0.tif
Addind training file ita.Leg_024_IR01_cat_022.pdf_page0001.exp0.box
Addind training file ita.Leg_024_IR01_cat_022.pdf_page0001.exp0.tif
***** Prepare Training Data. 

=== Copy exist

In [62]:
! cat process.csv
#InputBucket,#InputPath,#InputFileName,#OutputBucket,#OutputPath,#OutputFileName,

348831852500-sagemaker-us-east-1,Tesseract/Input,Leg_001_DA00_cat_002.pdf,348831852500-sagemaker-us-east-1,Tesseract/Output,OUT_Leg_001_DA00_cat_002.pdf
348831852500-sagemaker-us-east-1,Tesseract/Input,Leg_027_IR01_cat_025.pdf,348831852500-sagemaker-us-east-1,Tesseract/Output,OUT_Leg_027_IR01_cat_025.pdf
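
A manifest like process.csv can be generated rather than written by hand. The sketch below assumes the six-column layout shown above and an `OUT_` prefix for output file names; the function name `build_manifest` is hypothetical, and the generated header omits the trailing comma and blank line of the hand-written file.

```python
import csv
import io

# Column names taken from the header line of process.csv.
HEADER = ['#InputBucket', '#InputPath', '#InputFileName',
          '#OutputBucket', '#OutputPath', '#OutputFileName']


def build_manifest(bucket, input_path, output_path, filenames, out_prefix='OUT_'):
    """Return the manifest as CSV text, one row per file to process."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(HEADER)
    for name in filenames:
        writer.writerow([bucket, input_path, name,
                         bucket, output_path, out_prefix + name])
    return buffer.getvalue()


manifest = build_manifest('348831852500-sagemaker-us-east-1',
                          'Tesseract/Input', 'Tesseract/Output',
                          ['Leg_001_DA00_cat_002.pdf', 'Leg_027_IR01_cat_025.pdf'])
print(manifest)
```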

In [73]:
!aws s3 cp process.csv s3://348831852500-sagemaker-us-east-1/Tesseract/process/process.csv

upload: ./process.csv to s3://348831852500-sagemaker-us-east-1/Tesseract/process/process.csv


In [78]:
batch_input = 's3://348831852500-sagemaker-us-east-1/Tesseract/process/process.csv'

# The location to store the results of the batch transform job
batch_output = 's3://348831852500-sagemaker-us-east-1/Tesseract/Output/'


batch = est.transformer(instance_count=1,
                        instance_type='ml.c5.9xlarge',
                        #instance_type='local', # use local mode
                        output_path=batch_output,
                        strategy='SingleRecord',
                        max_concurrent_transforms=1)


Using already existing model: basic-2020-01-24-22-34-31-989


In [81]:
batch.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')
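
With split_type='Line' and the SingleRecord strategy, each line of the CSV reaches the container as its own request. How this container handles a line is not shown in the notebook; the sketch below is one hypothetical way to turn a line into named fields, skipping the commented header and blank lines. The function name `parse_record` and the dictionary keys are assumptions.

```python
FIELD_NAMES = ('input_bucket', 'input_path', 'input_file',
               'output_bucket', 'output_path', 'output_file')


def parse_record(line):
    """Turn one manifest line into a dict, or None for header/blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        return None
    fields = [field.strip() for field in stripped.split(',')]
    if len(fields) < len(FIELD_NAMES):
        raise ValueError('expected {} fields, got {}: {!r}'.format(
            len(FIELD_NAMES), len(fields), line))
    # zip() pairs each name with its field and ignores any extra columns.
    return dict(zip(FIELD_NAMES, fields))


record = parse_record(
    '348831852500-sagemaker-us-east-1,Tesseract/Input,Leg_001_DA00_cat_002.pdf,'
    '348831852500-sagemaker-us-east-1,Tesseract/Output,OUT_Leg_001_DA00_cat_002.pdf')
print(record['input_file'], '->', record['output_file'])
```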

In [82]:
batch.wait()


...................Starting the inference server with 1 workers.
[2020-01-25 10:24:36 +0000] [11] [INFO] Starting gunicorn 20.0.4
[2020-01-25 10:24:36 +0000] [11] [INFO] Listening at: unix:/tmp/gunicorn.sock (11)
[2020-01-25 10:24:36 +0000] [11] [INFO] Using worker: gevent
[2020-01-25 10:24:36 +0000] [15] [INFO] Booting worker with pid: 15
Setup Model
169.254.255.130 - - [25/Jan/2020:10:24:39 +0000] "GET /ping HTTP/1.1" 200 1 "-" "Go-http-client/1.1"
169.254.255.130 - - [25/Jan/2020:10:24:39 +0000] "GET /execution-parameters HTTP/1.1" 40