<h1>Basic Custom Training Container</h1>

This notebook demonstrates how to build and use a basic custom Docker container for training with Amazon SageMaker. Reference documentation is available at https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html

We start by defining some variables like the current execution role, the ECR repository that we are going to use for pushing the custom Docker container and a default Amazon S3 bucket to be used by Amazon SageMaker.

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'aws_batch_tesseract/'
prefix = 'basic'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)
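
The account ID above is read from the fifth colon-separated field of the execution role ARN. The sketch below shows that parsing on a made-up ARN; the `parse_iam_role_arn` helper and the sample value are illustrative assumptions and are not part of the SageMaker SDK.

```python
def parse_iam_role_arn(arn):
    """Split an IAM role ARN into its account ID and role name.

    IAM role ARNs look like arn:<partition>:iam::<account-id>:role/<role-name>.
    The region field (index 3) is empty because IAM is a global service.
    """
    parts = arn.split(':')
    if len(parts) != 6 or parts[0] != 'arn' or parts[2] != 'iam':
        raise ValueError('not an IAM ARN: {!r}'.format(arn))
    account_id = parts[4]
    # The resource part is "role/<optional path>/<name>"; keep the last segment.
    resource_type, _, resource = parts[5].partition('/')
    if resource_type != 'role':
        raise ValueError('not a role ARN: {!r}'.format(arn))
    return account_id, resource.rsplit('/', 1)[-1]


# Hypothetical ARN used only for illustration.
sample_arn = 'arn:aws:iam::123456789012:role/service-role/MySageMakerRole'
account, role_name = parse_iam_role_arn(sample_arn)
print(account, role_name)
```

Validating the ARN shape before indexing gives a clearer error than an `IndexError` when the notebook runs outside SageMaker with an unexpected role value.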

Let's take a look at the Dockerfile which defines the statements for building our custom SageMaker training container:

In [None]:
! pygmentize ./TESSERACT-SAGEMAKER-CONTAINER/Dockerfile

At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from Ubuntu 18.04</li>
    <li>Define some variables to be used at build time to install Python 3</li>
    <li>Install a handful of libraries with apt-get</li>
    <li>Install tesseract and pdfsandwich related libraries, including the fonts used for training</li>
    <li>Install Python 3 and create a symbolic link</li>
    <li>Install some Python libraries such as NumPy, pandas, scikit-learn, etc.</li>
    <li>Set a few environment variables, including PYTHONUNBUFFERED, which disables buffering of Python standard output (useful for logging)</li>
    <li>Finally, copy all contents of <strong>code/</strong> (which is where our training code is) to the WORKDIR</li>
</ul>

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the container_build_script/ folder. Let's take a look at this script and then execute it.

In [None]:
! pygmentize ./TESSERACT-SAGEMAKER-CONTAINER/container_build_script/build_and_push.sh

<hr>

The script builds the Docker container, creates the ECR repository if it does not exist, and finally pushes the container to that repository. The first build takes a few minutes; Docker then caches the build outputs and reuses them in subsequent builds.

In [None]:
!  bash ./TESSERACT-SAGEMAKER-CONTAINER/container_build_script/build_and_push.sh $account_id $region $ecr_repository_name

<h3>Training with Amazon SageMaker</h3>

Once the container has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker. Starting a training job requires the ECR path of the Docker container used for training as a parameter.

In [None]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

There are two main scripts invoked by SageMaker:
    - train: invoked when running a training job
    - serve: invoked when serving an endpoint or when starting a batch transform job

To integrate with custom algorithms, SageMaker uses the following directory structure inside the container:
/opt/ml  
├── input  
│   ├── config  
│   │   ├── hyperparameters.json            <--- Hyperparameters passed to the script when invoking SageMaker  
│   │   ├── inputdataconfig.json            <--- Configuration of the input data channels (train / test / validation)  
│   │   └── resourceconfig.json             <--- Network layout used for distributed training  
│   └── data  
│       └── channel_name/                   <--- Where data are downloaded by SageMaker  
│  
├── model                                   <--- Output directory where the model shall be stored when training  
│                                                Also used to load the model when starting a prediction  
├── code                                    <--- Custom script files  
│  
└── output                                  <--- Output folder for predictions  
    └── failure                             <--- Store here error descriptions that will be reported to the user
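
As a rough illustration of how a `train` script can use this layout, the sketch below reads the hyperparameters, lists the files of one channel, writes a placeholder model, and records any error in `output/failure`. It is not the `train` script shipped in this container: the function names, the `model.json` file, and the throwaway demo tree are assumptions made so that the example runs outside SageMaker.

```python
import json
import os
import tempfile
import traceback


def run_training(prefix='/opt/ml', channel='train'):
    """Read SageMaker inputs under `prefix`, write a model, report failures."""
    hp_path = os.path.join(prefix, 'input', 'config', 'hyperparameters.json')
    data_dir = os.path.join(prefix, 'input', 'data', channel)
    model_dir = os.path.join(prefix, 'model')
    failure_path = os.path.join(prefix, 'output', 'failure')
    try:
        # SageMaker passes every hyperparameter value as a string.
        with open(hp_path) as f:
            hyperparameters = json.load(f)
        epochs = int(hyperparameters.get('epoch', '1'))

        training_files = sorted(os.listdir(data_dir))

        # Placeholder for the real training step (assumption): just record
        # what was seen so that something lands in the model directory.
        os.makedirs(model_dir, exist_ok=True)
        with open(os.path.join(model_dir, 'model.json'), 'w') as f:
            json.dump({'epochs': epochs, 'files': training_files}, f)
        return 0
    except Exception:
        # The text written here is reported back as the job's failure reason.
        os.makedirs(os.path.dirname(failure_path), exist_ok=True)
        with open(failure_path, 'w') as f:
            f.write(traceback.format_exc())
        return 1  # a non-zero exit code marks the training job as failed


def make_demo_tree(root):
    """Create a throwaway copy of the layout so the sketch runs anywhere."""
    os.makedirs(os.path.join(root, 'input', 'config'))
    os.makedirs(os.path.join(root, 'input', 'data', 'train'))
    with open(os.path.join(root, 'input', 'config', 'hyperparameters.json'), 'w') as f:
        json.dump({'epoch': '1'}, f)
    with open(os.path.join(root, 'input', 'data', 'train', 'sample.txt'), 'w') as f:
        f.write('demo')


demo_root = tempfile.mkdtemp()
make_demo_tree(demo_root)
exit_code = run_training(prefix=demo_root)
print(exit_code, os.listdir(os.path.join(demo_root, 'model')))
```

Inside a real container the script would call `run_training()` with the default `/opt/ml` prefix and pass the returned value to `sys.exit`.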


## Training

In [None]:
! cat ./TESSERACT-SAGEMAKER-CONTAINER/code/train

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html).

In [None]:
import sagemaker

# Configure the training job parameters and select the instance type used for training.
# Tesseract is not able to leverage machines bigger than ml.c5.2xlarge for training.

est = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    #train_instance_type='local', # use local mode
                                    train_instance_type='ml.c5.2xlarge',
                                    base_job_name=prefix+"EMPTYTrainedData"
                                    )

est.set_hyperparameters(epoch=1)

train_data = sagemaker.session.s3_input('s3://348831852500-sagemaker-us-east-1/Tesseract/empty_training/')

est.fit({'train': train_data })
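
To make the link between `fit()` and CreateTrainingJob more concrete, the sketch below builds a request dictionary with the main fields that API expects. It is only an approximation of what the SDK assembles internally: the helper name, the default volume size and runtime limit, and every sample value are assumptions for illustration, and nothing is sent to AWS.

```python
def build_create_training_job_request(job_name, image_uri, role_arn,
                                      instance_type, instance_count,
                                      hyperparameters, channels, output_path,
                                      volume_size_gb=30,
                                      max_runtime_seconds=86400):
    """Assemble a CreateTrainingJob request body (illustrative helper)."""
    return {
        'TrainingJobName': job_name,
        'AlgorithmSpecification': {
            'TrainingImage': image_uri,       # the custom container in ECR
            'TrainingInputMode': 'File',      # data is copied to /opt/ml/input/data
        },
        'RoleArn': role_arn,
        # The API only accepts string values for hyperparameters.
        'HyperParameters': {str(k): str(v) for k, v in hyperparameters.items()},
        'InputDataConfig': [
            {
                'ChannelName': name,
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': uri,
                        'S3DataDistributionType': 'FullyReplicated',
                    }
                },
            }
            for name, uri in sorted(channels.items())
        ],
        'OutputDataConfig': {'S3OutputPath': output_path},
        'ResourceConfig': {
            'InstanceType': instance_type,
            'InstanceCount': instance_count,
            'VolumeSizeInGB': volume_size_gb,
        },
        'StoppingCondition': {'MaxRuntimeInSeconds': max_runtime_seconds},
    }


# All values below are placeholders, not resources from this notebook.
request = build_create_training_job_request(
    job_name='basic-demo-job',
    image_uri='123456789012.dkr.ecr.us-east-1.amazonaws.com/aws_batch_tesseract/basic:latest',
    role_arn='arn:aws:iam::123456789012:role/MySageMakerRole',
    instance_type='ml.c5.2xlarge',
    instance_count=1,
    hyperparameters={'epoch': 1},
    channels={'train': 's3://example-bucket/Tesseract/empty_training/'},
    output_path='s3://example-bucket/output/',
)
print(request['HyperParameters'], request['InputDataConfig'][0]['ChannelName'])
```

A dictionary of this shape could be passed to `boto3.client('sagemaker').create_training_job(**request)`; the Estimator hides that call and also waits for the job and streams its logs.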

In [None]:
# Name of the most recent training job started by this estimator
print(est.latest_training_job.name)

In [None]:
empty_basic_model='basicEMPTYTrainedData-2020-01-28-09-33-44-466'
trained_model='basicFullTrainedData-2020-01-27-12-25-45-080'
#Full Training: 

## Prediction

The prediction process works in the same way for batch transform jobs and for online inference.
SageMaker passes information to the container through HTTP calls to the following endpoints:
- ping                   <--- to check whether the container is healthy
- invocations            <--- to send data for prediction and get back the answer
- execution-parameters   <--- used only for batch prediction, to define how the batch process works

The container is started by invoking the serve script. In the standard approach this script does not need editing, and it starts:
- an nginx server to expose an HTTP endpoint for the requests
- a gunicorn server to receive and process requests using Flask (Python), which calls predictor.py

Most of the customization lives in predictor.py.

In [None]:
! cat ./TESSERACT-SAGEMAKER-CONTAINER/code/predictor.py
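
The request routing described above can be sketched without any third-party dependency. The example below is a plain WSGI callable rather than the Flask application used by this container, and `handle_record` is a hypothetical stand-in for the real prediction logic; it only shows how `/ping` and `/invocations` are typically answered.

```python
import io
from wsgiref.util import setup_testing_defaults


def handle_record(payload):
    """Hypothetical stand-in for the real prediction logic."""
    return 'processed:' + payload.strip()


def app(environ, start_response):
    """Route the two endpoints SageMaker calls on a serving container."""
    method = environ['REQUEST_METHOD']
    path = environ.get('PATH_INFO', '')

    if method == 'GET' and path == '/ping':
        # Health check: an HTTP 200 tells SageMaker the container is ready.
        start_response('200 OK', [('Content-Type', 'application/json')])
        return [b'']

    if method == 'POST' and path == '/invocations':
        length = int(environ.get('CONTENT_LENGTH') or 0)
        payload = environ['wsgi.input'].read(length).decode('utf-8')
        body = handle_record(payload).encode('utf-8')
        start_response('200 OK', [('Content-Type', 'text/csv')])
        return [body]

    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return [b'not found']


def call(method, path, data=b''):
    """Invoke the WSGI app in-process, without starting a server."""
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        'REQUEST_METHOD': method,
        'PATH_INFO': path,
        'CONTENT_LENGTH': str(len(data)),
        'wsgi.input': io.BytesIO(data),
    })
    captured = {}

    def start_response(status, headers):
        captured['status'] = status

    body = b''.join(app(environ, start_response))
    return captured['status'], body


print(call('GET', '/ping'))
print(call('POST', '/invocations', b'bucket,path,file.pdf\n'))
```

Because `app` is a standard WSGI callable, a server such as gunicorn could host it directly; the optional `/execution-parameters` endpoint is left out to keep the sketch short.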

Prepare the data for a batch request and copy it to an input S3 bucket:

In [None]:
! cat ./sample-data/process.csv
#InputBucket,#InputPath,#InputFileName,#OutputBucket,#OutputPath,#OutputFileName,
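
Each row of `process.csv` follows the six-column layout shown in the comment above. The sketch below shows one way such rows could be parsed into named fields and turned into S3 URIs; the `ProcessRequest` type, the helper functions, and the sample bucket and file names are assumptions for illustration and may differ from what `predictor.py` actually does.

```python
import csv
from collections import namedtuple

# Field names mirror the column comment: InputBucket, InputPath, ...
ProcessRequest = namedtuple('ProcessRequest', [
    'input_bucket', 'input_path', 'input_file_name',
    'output_bucket', 'output_path', 'output_file_name',
])


def parse_process_rows(text):
    """Parse CSV text into ProcessRequest tuples, skipping '#' header lines."""
    requests = []
    for row in csv.reader(text.splitlines()):
        if not row or row[0].startswith('#'):
            continue  # blank line or commented header
        # Ignore anything after the sixth column (e.g. a trailing comma).
        fields = [field.strip() for field in row[:6]]
        if len(fields) != 6:
            raise ValueError('expected 6 columns, got {}: {!r}'.format(len(fields), row))
        requests.append(ProcessRequest(*fields))
    return requests


def s3_uri(bucket, path, file_name):
    """Join bucket, key prefix and file name into an s3:// URI."""
    return 's3://{}/{}/{}'.format(bucket, path.strip('/'), file_name)


# Hypothetical content used only for illustration.
sample = (
    '#InputBucket,#InputPath,#InputFileName,#OutputBucket,#OutputPath,#OutputFileName,\n'
    'example-bucket,Tesseract/Input,scan_001.pdf,example-bucket,Tesseract/Output,OUT_scan_001.pdf\n'
)
parsed = parse_process_rows(sample)
first = parsed[0]
print(s3_uri(first.input_bucket, first.input_path, first.input_file_name))
print(s3_uri(first.output_bucket, first.output_path, first.output_file_name))
```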

In [None]:
!aws s3 cp ./sample-data/process.csv s3://348831852500-sagemaker-us-east-1/Tesseract/process/process.csv

In [None]:
batch_input = 's3://348831852500-sagemaker-us-east-1/Tesseract/process/process.csv'

# The location to store the results of the batch transform job
batch_output = 's3://348831852500-sagemaker-us-east-1/Tesseract/Output/'


# Create a transformer from the estimator trained above
batch = est.transformer(
    instance_count=1,
    instance_type='ml.c5.4xlarge',
    # instance_type='local',  # use local mode
    output_path=batch_output,
    strategy='SingleRecord',
    max_concurrent_transforms=1
)

# Alternative: create the transformer from an existing model name
# batch = sagemaker.transformer.Transformer(
#     model_name=empty_basic_model,
#     instance_count=1,
#     instance_type='ml.c5.4xlarge',
#     # instance_type='local',  # use local mode
#     output_path=batch_output,
#     strategy='SingleRecord',
#     max_concurrent_transforms=1
# )

batch.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')

In [None]:
batch.wait()
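
With `split_type='Line'` and `strategy='SingleRecord'`, each line of the input file is sent to the container as its own `/invocations` request, and the responses are collected in an output object named after the input object with `.out` appended. The simulation below mimics that flow locally; `simulate_batch_transform` and the `invoke` callback are hypothetical names, and no SageMaker API is called.

```python
def simulate_batch_transform(input_key, file_text, invoke):
    """Mimic Line splitting with the SingleRecord strategy.

    `invoke` stands in for one POST to /invocations and receives one line.
    Returns the output object name and the concatenated responses.
    """
    # One request per newline-delimited record.
    outputs = [invoke(record) for record in file_text.splitlines()]
    # The result object is named after the input object, plus ".out".
    output_key = input_key.rsplit('/', 1)[-1] + '.out'
    return output_key, ''.join(outputs)


calls = []


def fake_invoke(record):
    """Hypothetical container response: echo the record in upper case."""
    calls.append(record)
    return record.upper() + '\n'


key, body = simulate_batch_transform(
    'Tesseract/process/process.csv',
    'first,row\nsecond,row\n',
    fake_invoke,
)
print(key, len(calls))
```

Sending one record per request keeps each invocation small, which suits a job where every row triggers the processing of a whole document.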


### SageMaker also supports online prediction
For this specific process, it may not be the most cost-effective solution.

In [None]:
endpoint = est.deploy(initial_instance_count=1,
                      instance_type='ml.c5.9xlarge')

In [None]:
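import json

# One CSV row: InputBucket,InputPath,InputFileName,OutputBucket,OutputPath,OutputFileName
input_csv = ["348831852500-sagemaker-us-east-1,Tesseract/Input,Leg_001_DA00_cat_002.pdf,348831852500-sagemaker-us-east-1,Tesseract/Output,OUT_Leg_001_DA00_cat_002.pdf"]

# Assumption: predictor.py accepts the rows as a JSON-encoded list
json_string = json.dumps(input_csv)

ret = endpoint.predict(json_string)

print(ret)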

In [None]:
endpoint.delete_endpoint()

In [None]:
print (ret)

In [None]:
batch_input = 's3://348831852500-sagemaker-us-east-1/Tesseract/process/process.csv'

# The location to store the results of the batch transform job
batch_output = 's3://348831852500-sagemaker-us-east-1/Tesseract/Output/'
# Run the transform again with the `batch` transformer created earlier
batch.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')


In [None]:
batch.wait()