<a id='building-your-own-algo-container'></a>

# Building your own Scikit-Learn algorithm container for SageMaker Training

With Amazon SageMaker, you can package your own algorithms that can then be trained and deployed in the SageMaker environment. This notebook will guide you through an example that shows you how to build a Docker container for SageMaker and use it for training.

By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies. 

_**Note:**_ SageMaker now includes a [pre-built Scikit-Learn container](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/Scikit-learn%20Estimator%20Example%20With%20Batch%20Transform.ipynb).  We recommend the pre-built container be used for almost all cases requiring a Scikit-Learn algorithm.  However, this example remains relevant as an outline for bringing in other libraries to SageMaker as your own container.

1. [Building your own algorithm container](#building-your-own-algo-container)
    1. [When should I build my own algorithm container?](#when-should-i)
    1. [Permissions](#permissions)
    1. [The Example](#the-example)
    1. [The Lab](#the-lab)
1. [Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker](#part1)
    1. [An overview of Docker](#docker-overview)
    1. [How Amazon SageMaker runs your Docker container](#how-sagemaker-runs-docker)
    1. [The parts of the sample container](#parts-of-container)
    1. [The Dockerfile](#the-dockerfile)
    1. [Building and registering the container](#building-the-container)
1. [Part 2: Using your Algorithm in Amazon SageMaker](#part2)
    1. [Set up the environment](#set-up-environment)
    1. [Create the session](#create-the-session)
    1. [Upload the data for training](#upload-data-for-training)
    1. [Create an estimator and fit the model](#create-and-fit-estimator)

<a id='when-should-i'></a>

## When should I build my own algorithm container?

You may not need to create a container to bring your own code to Amazon SageMaker. When you are using a framework (such as Apache MXNet or TensorFlow) that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework. This set of frameworks is continually expanding, so we recommend that you check the current list if your algorithm is written in a common machine learning environment.

Even if there is direct SDK support for your environment or framework, you may find it more effective to build your own container. If the code that implements your algorithm is quite complex on its own or you need special additions to the framework, building your own container may be the right choice.

If there isn't direct SDK support for your environment, don't worry. You'll see in this lab that building your own container is quite straightforward and can easily be automated.

<a id='permissions'></a>

## Permissions

### Modifying your SageMaker role

Running this notebook requires permissions in addition to the normal `AmazonSageMakerFullAccess` permissions. These additional permissions allow SageMaker to:

* Create new repositories in Amazon Elastic Container Registry (ECR), which is used to host Docker images in AWS
* Create and build a new project in AWS CodeBuild (i.e., to build our custom Docker image)

The easiest way to add these permissions is simply to add the managed policies `AmazonEC2ContainerRegistryFullAccess` and `AWSCodeBuildAdminAccess` to the Execution Role that is associated with your SageMaker Studio user. There's no need to restart your notebook instance when you do this - the new permissions will be available immediately.

To add these 2 required managed policies to your existing SageMaker Execution Role:

* Open the IAM service in a new instance of the AWS Console in a separate browser tab/window
* On the left-hand side, choose `Roles`
* Search for your execution role (usually something like AmazonSageMaker-ExecutionRole-...)
> You can run the next cell to find your execution role

In [6]:
from sagemaker import get_execution_role
role = get_execution_role()
role

'arn:aws:iam::079329190341:role/service-role/AmazonSageMaker-ExecutionRole-20190404T141667'

* Click the correct role in the list, and choose `Attach Policies`
* Search for the `AmazonEC2ContainerRegistryFullAccess` and `AWSCodeBuildAdminAccess` policies, placing a checkmark next to each
* Click `Attach Policy`
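The console steps above can also be scripted. The following is a minimal sketch, not part of the lab itself: it assumes a boto3 IAM client is passed in and that the caller is allowed to call `iam:AttachRolePolicy` (the SageMaker execution role usually is not, so this would typically be run by an administrator). The helper names `role_name_from_arn` and `attach_managed_policies` are hypothetical.

```python
# The two AWS managed policies named in the steps above.
MANAGED_POLICIES = [
    "AmazonEC2ContainerRegistryFullAccess",
    "AWSCodeBuildAdminAccess",
]


def role_name_from_arn(role_arn):
    """Return the bare role name: the part of the ARN after the last '/'."""
    return role_arn.rsplit("/", 1)[-1]


def attach_managed_policies(iam_client, role_arn, policy_names=MANAGED_POLICIES):
    """Attach each AWS managed policy to the role; return the policy ARNs used."""
    role_name = role_name_from_arn(role_arn)
    attached = []
    for name in policy_names:
        # AWS managed policies live under the special 'aws' account in the ARN.
        policy_arn = "arn:aws:iam::aws:policy/" + name
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
        attached.append(policy_arn)
    return attached


# Example call (requires IAM permissions, so it is left commented out):
# import boto3
# attach_managed_policies(boto3.client("iam"), role)
```

Passing the client in as an argument keeps the helper easy to exercise without touching a real account.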

### Creating a CodeBuild role

This lab uses AWS CodeBuild to automatically build our Docker container image, and then push the new image into ECR so that SageMaker can use the image for training. To use CodeBuild, you will first need to create a new service role in your account called `codebuild__sagemaker_workshop_role`. The role requires a policy with the following permissions:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudWatchLogsPolicy",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "S3GetPutPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:PutObject",
                "s3:GetBucketAcl",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "ECRPushPullPolicy",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:PutImage"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

To create the new service role for CodeBuild:

* Open the IAM service in a new instance of the AWS Console in a separate browser tab/window
* On the left-hand side, choose `Policies`
* Click `Create Policy`
* Click the `JSON` tab at the top of the page
* Copy the above JSON policy, and paste it into the JSON editor, overwriting the existing contents of the editor
* Click `Review Policy`
* In the `Name` field, enter `codebuild__sagemaker_workshop_policy` exactly as you see it here
* Click `Create Policy`
* On the left-hand side of the main IAM service screen, choose `Roles`
* Click `Create Role`
* At the top of the page, ensure that `AWS Service` is highlighted. Next, scroll to the bottom of the page and click on `CodeBuild`
* Click the `Next: Permissions` button
* Search for the `codebuild__sagemaker_workshop_policy` policy, and place a checkmark next to it. 
* Click `Next: Tags` then `Next: Review`
* In the `Role name` field, enter `codebuild__sagemaker_workshop_role` exactly as you see it here
* Click `Create role`
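When `CodeBuild` is selected as the trusted service in the steps above, the role receives a trust policy that lets CodeBuild assume it. As a rough sketch of the same setup in code (assuming a boto3 IAM client with permission to create roles, and assuming the policy from the earlier steps already exists), something like the following could be used. The function names are hypothetical.

```python
import json


def codebuild_trust_policy():
    """Trust policy that allows the CodeBuild service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "codebuild.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_codebuild_role(iam_client, role_name, policy_arn):
    """Create the role with the CodeBuild trust policy, then attach the policy."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(codebuild_trust_policy()),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)


# Example call (requires IAM permissions, so it is left commented out):
# import boto3
# create_codebuild_role(
#     boto3.client("iam"),
#     "codebuild__sagemaker_workshop_role",
#     "arn:aws:iam::<account-id>:policy/codebuild__sagemaker_workshop_policy",
# )
```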



### Allowing SageMaker to pass the CodeBuild service role to CodeBuild

Lastly, we need to adjust your SageMaker Execution Role so that SageMaker has permission to pass the appropriate CodeBuild role into CodeBuild when submitting CodeBuild jobs.

Begin by running the following Jupyter cell to generate an inline policy that is specific to your account:

In [7]:
from IPython.display import display, Markdown
import json
import sagemaker

sess = sagemaker.session.Session()
account = sess.boto_session.client('sts').get_caller_identity()['Account']

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:passRole",
            "Resource": "arn:aws:iam::" + account + ":role/codebuild__sagemaker_workshop_role"
        }
    ]
}

display(Markdown("#### Generated Inline Policy:"))
display(Markdown("```json\n" + json.dumps(policy, indent=4) + "\n```"))


#### Generated Inline Policy:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:passRole",
            "Resource": "arn:aws:iam::079329190341:role/codebuild__sagemaker_workshop_role"
        }
    ]
}
```

Next, return to the IAM service console, and add the above generated policy to your existing SageMaker Execution Role as an 'inline policy':

* Open the IAM service in a new instance of the AWS Console in a separate browser tab/window
* On the left-hand side, choose `Roles`
* Search for your execution role (usually something like AmazonSageMaker-ExecutionRole-...)
* Click `Add Inline Policy` on the right-hand side
* Click the `JSON` tab at the top of the screen
* Paste your copied inline policy (from above) into the JSON editor, replacing all existing contents
* Click `Review Policy`
* In the `Name` field, type `sagemaker_workshop__codebuild_passrole`
* Click `Create Policy`

<br/>
<br/>

<a id='the-example'></a>

## The Example

Here, we'll show how to package a simple Python training script which showcases the random forest algorithm from the widely used Scikit-Learn machine learning package. The example is intentionally quite trivial since the point is to show the surrounding structure that you'll want to add to your own code so you can train and host it in Amazon SageMaker.

<a id='the-lab'></a>

## The Lab

This Lab is divided into two parts: _Part 1: building the container_ and _Part 2: using the container_.

<a id='part1'></a>

# Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker

<a id='docker-overview'></a>

### An overview of Docker

For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. 

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variables, etc.

In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.

Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting (inference). The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

<a id='how-sagemaker-runs-docker'></a>

### How Amazon SageMaker runs your Docker container

Amazon SageMaker runs your container with the argument `train`. How your container processes this argument depends on the container:

* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time. In this example, we define it as an executable Python script, but it could be any program that we want to start in that environment.
* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train`. 
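To make the second case concrete, here is a minimal sketch of what an `ENTRYPOINT` program could look like if it were written in Python. It is not the approach used by this example's container (which has no `ENTRYPOINT`). The `main` function and the `handlers` mapping are hypothetical names chosen for illustration.

```python
import sys


def main(argv, handlers):
    """Dispatch on the first container argument (for example 'train').

    `handlers` maps a mode name to a zero-argument callable.
    Returns a process exit code: 0 on success, 2 for a missing or unknown mode.
    """
    if len(argv) < 2 or argv[1] not in handlers:
        print("missing or unsupported mode: {}".format(argv[1:2]), file=sys.stderr)
        return 2
    handlers[argv[1]]()
    return 0


# In a real entrypoint script this would end with something like:
# sys.exit(main(sys.argv, {"train": run_training}))
# where run_training is your own training function.
```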

#### Running your container during training

When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory on the SageMaker training instance:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
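Because every hyperparameter value arrives as a string, a training script usually converts the values it needs before use. The sketch below shows one way to do that. The helper names and the `types` mapping are assumptions for illustration, not the code in this lab's `train` script.

```python
import json

HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read the hyperparameter dictionary that SageMaker writes for the job."""
    with open(path) as f:
        return json.load(f)


def cast_hyperparameters(raw, types):
    """Convert string values using `types` (name -> callable).

    Names without an entry in `types` are kept as strings.
    """
    return {name: types.get(name, str)(value) for name, value in raw.items()}


# Example, using hyperparameter names like the ones passed later in this lab:
# params = cast_hyperparameters(
#     load_hyperparameters(),
#     {"n-estimators": int, "min-samples-leaf": int},
# )
```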

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
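One common pattern for the `failure` file is to wrap the training routine so that any exception is written out before the process exits with a non-zero code. The following is a sketch under that assumption. `run_with_failure_report` is a hypothetical helper, and the exit code 255 is an arbitrary non-zero choice.

```python
import os
import traceback


def run_with_failure_report(train_fn, output_dir="/opt/ml/output"):
    """Run train_fn; on error, write the traceback to <output_dir>/failure.

    Returns 0 on success and a non-zero code on failure, suitable for sys.exit().
    """
    try:
        train_fn()
        return 0
    except Exception:
        trace = traceback.format_exc()
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write("Exception during training:\n" + trace)
        # Also print the traceback so it shows up in the training job logs.
        print(trace)
        return 255


# In a train script this might be used as:
# sys.exit(run_with_failure_report(train))
```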


<a id='parts-of-container'></a>

### The parts of the sample container

Within the `container` directory of this Lab are all the components you need to package the sample algorithm for Amazon SageMaker:

    .
    |-- Dockerfile
    |-- buildspec.yml
    `-- random_forest
        `-- train


Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image. More details below.
* __`buildspec.yml`__ describes the build process that will be used by AWS CodeBuild in order to build our container image.
* __`random_forest`__ is the directory which contains the files that will be installed in the container.
* __`train`__ is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.

<a id='the-dockerfile'></a>

### The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations. 

For the Python science stack, we will start from a standard Ubuntu installation and run the normal tools to install the things needed by scikit-learn. Finally, we add the code that implements our specific algorithm to the container and set up the right environment to run under.

Along the way, we clean up extra space. This makes the container smaller and faster to start.

Let's look at the Dockerfile for the example:

In [8]:
!pwd

/root/MLAI/BYOC/scikit_bring_your_own


In [9]:
!cat container/Dockerfile

# Build an image that can do training and inference in SageMaker
# This is a Python 2 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:16.04

MAINTAINER Amazon AI <sage-learner@amazon.com>


RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py && \
    pip install numpy==1.16.2 scipy==1.2.1 scikit-learn==0.20.2 pandas flask gevent gunicorn && \
        (cd /usr/local/lib/python2.7/dist-packages/scipy/.libs; rm *; ln ../..

<a id='building-the-container'></a>

### Building and registering the container

There are multiple ways in which you can build a Docker container. If you have used Docker before, you might be familiar with the commands `docker build` and `docker push`, for example.

In this lab, we are going to automate the building of our container image using [AWS CodeBuild](https://aws.amazon.com/codebuild/):

    AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue. You can get started quickly by using prepackaged build environments, or you can create custom build environments that use your own build tools. With CodeBuild, you are charged by the minute for the compute resources you use.

In order to leverage CodeBuild to build our Docker container image, we will make use of the provided `buildspec.yml` file that defines the various steps that will be executed during the build process. Let's take a look at the buildspec.yml file:

In [10]:
!cat container/buildspec.yml

version: 0.2

phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...          
      - docker build -t $ALGORITHM_NAME .
      - docker tag $ALGORITHM_NAME $IMAGE_NAME      
  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker image...
      - docker push $IMAGE_NAME

<br/>

In the above file, you see that we have defined 3 build phases:

1. *pre_build*: this phase ensures that CodeBuild successfully logs into our ECR repository
2. *build*: this phase is used to run the `docker build` command, which will build our container image according to our Dockerfile (see [above](#the-dockerfile))
3. *post_build*: this phase is used to run the `docker push` command, which pushes our new image into our ECR repository

<br/>

Before we can run our CodeBuild job, we first need to make sure that we have an ECR repository in which we will store our Docker images.

The following code looks for an ECR repository in the account you're using and the current default region. If the repository doesn't exist, the script will create it.

In [11]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-random-forest

cd container

chmod +x random_forest/train

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

aws ecr describe-repositories --repository-names "${algorithm_name}"

{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:us-east-1:079329190341:repository/sagemaker-random-forest",
            "registryId": "079329190341",
            "repositoryName": "sagemaker-random-forest",
            "repositoryUri": "079329190341.dkr.ecr.us-east-1.amazonaws.com/sagemaker-random-forest",
            "createdAt": 1590296178.0,
            "imageTagMutability": "MUTABLE",
            "imageScanningConfiguration": {
                "scanOnPush": false
            }
        }
    ]
}
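
The same "create the repository if it is missing" logic can be written in Python instead of shell. This is a sketch only, assuming a boto3 ECR client is passed in. `ensure_repository` is a hypothetical helper name.

```python
def ensure_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository if it does not exist."""
    try:
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository is missing in this account/region, so create it.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]


# Example call (requires ECR permissions, so it is left commented out):
# import boto3
# print(ensure_repository(boto3.client("ecr"), "sagemaker-random-forest"))
```

Catching only the "not found" error, rather than checking for any non-zero exit status as the shell version does, avoids trying to create the repository when the lookup failed for another reason, such as missing permissions.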


Next, we need to package up our Dockerfile and Scikit-Learn training script into a ZIP file, and store it in S3.

CodeBuild will later download this ZIP file in order to build the container image that SageMaker will use for training.

In [12]:
import sagemaker

sm_session = sagemaker.session.Session()
s3_root = sm_session.default_bucket() + "/codebuild_tmp/"

# install zip, as it is sometimes missing
print("Attempting to install zip using apt-get:")
!apt-get install -y zip
print()

# zip the contents of the 'container' directory into "codebuild__random_forest.zip"
print("Zipping contents of ./container/ into codebuild__random_forest.zip:")
!cd container; zip -r ../codebuild__random_forest.zip *; cd ..; echo; ls -alh *.zip  
print()

# copy codebuild__random_forest.zip to a temporary S3 location which we'll need when submitting our CodeBuild job
print("Uploading zip file to S3:")
!aws s3 cp ./codebuild__random_forest.zip s3://{s3_root}

Attempting to install zip using apt-get:
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  unzip
The following NEW packages will be installed:
  unzip zip
0 upgraded, 2 newly installed, 0 to remove and 14 not upgraded.
Need to get 405 kB of archives.
After this operation, 1202 kB of additional disk space will be used.
Get:1 http://deb.debian.org/debian buster/main amd64 unzip amd64 6.0-23+deb10u1 [172 kB]
Get:2 http://deb.debian.org/debian buster/main amd64 zip amd64 3.0-11+b1 [234 kB]
Fetched 405 kB in 0s (7773 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package unzip.
(Reading database ... 16492 files and directories currently installed.)
Preparing to unpack .../unzip_6.0-23+deb10u1_amd64.deb ...
Unpacking unzip (6.0-23+deb10u1) ...
Selecting previously unselected package zip.
Preparing to unpack .../zip_3.0-11+b1_amd
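
If installing `zip` with `apt-get` is not an option in your environment, Python's standard `zipfile` module can build an equivalent archive. The sketch below is an alternative to the shell command above, not what the lab runs. `zip_directory` is a hypothetical helper. Unlike the shell glob `*`, it also includes hidden files at the top level of the directory.

```python
import os
import zipfile


def zip_directory(src_dir, zip_path):
    """Zip the contents of src_dir so that its files sit at the archive root.

    zip_path should be outside src_dir. Returns the sorted archive member names.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for filename in sorted(files):
                full_path = os.path.join(root, filename)
                # Store paths relative to src_dir, e.g. 'random_forest/train'.
                arcname = os.path.relpath(full_path, src_dir)
                zf.write(full_path, arcname)
                names.append(arcname)
    return sorted(names)


# Example:
# zip_directory("container", "codebuild__random_forest.zip")
```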

<br/>
<br/>

Now that we have uploaded our training script and Dockerfile to S3, the next step is to define a CodeBuild project. Remember, we'll be using a CodeBuild project to automate the Docker build process.

The following code block will generate a JSON file describing our new CodeBuild project:

In [13]:
import json
import boto3
boto3_session = boto3.session.Session()

algorithm_name = "sagemaker-random-forest"
project_name = algorithm_name + "_01"
region = boto3_session.region_name
account = boto3.client('sts').get_caller_identity().get('Account')
codebuild_role = "arn:aws:iam::" + account + ":role/codebuild__sagemaker_workshop_role"

# Below we define our JSON template that will be passed into the "aws codebuild create-project ..." CLI command.
# This template defines the overall CodeBuild Project, as well as our custom environment variables that control the build process
# NOTE: You do not need to change anything below.

json_template = {
  "name": project_name,
  "source": {
    "type": "S3",
    "location": s3_root + "codebuild__random_forest.zip"
  },
  "artifacts": {
    "type": "NO_ARTIFACTS"
  },
  "environment": {
    "type": "LINUX_CONTAINER",
    "image": "aws/codebuild/standard:4.0",
    "computeType": "BUILD_GENERAL1_SMALL",
    "environmentVariables": [
      {
        "name": "AWS_DEFAULT_REGION",
        "value": region
      },
      {
          "name": "ALGORITHM_NAME",
          "value": algorithm_name
      },
        {
            "name": "IMAGE_NAME",
            "value": account + ".dkr.ecr." + region + ".amazonaws.com/" + algorithm_name + ":latest"
        }
    ],
    "privilegedMode": True
  },
  "serviceRole": codebuild_role
}


# Output the JSON file to 'codebuild_create_project.json' in the notebook directory
with open("codebuild_create_project.json", "w") as f:
    f.write(json.dumps(json_template))

<br/>

Next, let's check the contents of our newly created JSON template file:

In [14]:
import json

with open("codebuild_create_project.json") as f:
    tmp = json.load(f)
    display(Markdown("```json\n" + json.dumps(tmp, indent=4) + "\n```"))

```json
{
    "name": "sagemaker-random-forest_01",
    "source": {
        "type": "S3",
        "location": "sagemaker-us-east-1-079329190341/codebuild_tmp/codebuild__random_forest.zip"
    },
    "artifacts": {
        "type": "NO_ARTIFACTS"
    },
    "environment": {
        "type": "LINUX_CONTAINER",
        "image": "aws/codebuild/standard:4.0",
        "computeType": "BUILD_GENERAL1_SMALL",
        "environmentVariables": [
            {
                "name": "AWS_DEFAULT_REGION",
                "value": "us-east-1"
            },
            {
                "name": "ALGORITHM_NAME",
                "value": "sagemaker-random-forest"
            },
            {
                "name": "IMAGE_NAME",
                "value": "079329190341.dkr.ecr.us-east-1.amazonaws.com/sagemaker-random-forest:latest"
            }
        ],
        "privilegedMode": true
    },
    "serviceRole": "arn:aws:iam::079329190341:role/codebuild__sagemaker_workshop_role"
}
```

<br/>

Next, we use the AWS CLI to create a new CodeBuild Project, referencing our JSON template file:

In [None]:
!aws codebuild create-project --cli-input-json file://codebuild_create_project.json

<br/>
<br/>

With our CodeBuild Project in place, we now use the AWS CLI to initiate a build job:

In [None]:
# This will build the Docker container image defined by our Dockerfile and push the image into ECR
!aws codebuild start-build --project-name=sagemaker-random-forest_01

<br/>

Now that you have started the CodeBuild build process, it is a good idea to monitor the progress of your build job.

When you run the following cell, it will create links for the CodeBuild and ECR consoles:

In [None]:
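# Build the CodeBuild console link from this account's ID rather than a hard-coded one
link1 = "* [CodeBuild](https://{region}.console.aws.amazon.com/codesuite/codebuild/{account}/projects/sagemaker-random-forest_01/)".format(region=region, account=account)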
link2 = "* [ECR](https://{region}.console.aws.amazon.com/ecr/repositories/sagemaker-random-forest/)".format(region=region)
links = link1 + "\n" + link2

display(Markdown("#### Generated Links:\n" + links))

Begin by clicking the `CodeBuild` link above. Within the CodeBuild console, you will see a description of your ongoing build job, along with the associated build status. Monitor the build job (try clicking the first link under 'Build run' to view the log files). When your container image has been created, the 'Latest Build Status' column will show `Succeeded`.

Once the CodeBuild job is complete, click the `ECR` link above. If your CodeBuild job was successful in pushing your new image into ECR, you will see a new image with the image tag `latest`. Under the `Pushed at` column for this image, you should see that this image was pushed very recently.

If for some reason you do not see your new image within ECR, please double-check that your CodeBuild job was successful, and repeat the [Building and registering the container](#building-the-container) section as required.
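
As an alternative to watching the console, the build status can be polled from code. The sketch below assumes a boto3 CodeBuild client and a build ID (for example, the `build.id` value in the `start-build` response). `wait_for_build`, its defaults, and the injectable `sleep` argument are illustrative choices, not part of the lab.

```python
import time

# Statuses after which a build no longer changes state.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "FAULT", "TIMED_OUT", "STOPPED"}


def wait_for_build(codebuild_client, build_id, poll_seconds=15, max_polls=120,
                   sleep=time.sleep):
    """Poll a CodeBuild build until it reaches a terminal status; return it."""
    for _ in range(max_polls):
        response = codebuild_client.batch_get_builds(ids=[build_id])
        status = response["builds"][0]["buildStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("build {} did not finish in time".format(build_id))


# Example call (requires CodeBuild permissions, so it is left commented out):
# import boto3
# client = boto3.client("codebuild")
# build_id = client.start_build(projectName="sagemaker-random-forest_01")["build"]["id"]
# print(wait_for_build(client, build_id))
```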

<br/>
<br/>

<a id='part2'></a>

# Part 2: Using your Algorithm in Amazon SageMaker

Once you have your container packaged, you can use it to train models and use the model for hosting or batch transforms.

Let's use our new container in a SageMaker training job.

<a id='set-up-environment'></a>

## Set up the environment

Here we specify the S3 prefix to use and the role that will be used for working with SageMaker.

In [15]:
# S3 prefix
prefix = 'DEMO-scikit-byoc-houseprices'

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

<a id='create-the-session'></a>

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [16]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

<a id='upload-data-for-training'></a>

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using the Boston Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older scikit-learn version.


We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket. 

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

In [18]:
# we use the Boston housing dataset 
data = load_boston()

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [20]:
trainX.head()

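| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.09103 | 0.0 | 2.46 | 0.0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3.0 | 193.0 | 17.8 | 394.12 | 4.82 | 37.9 |
| 1 | 3.53501 | 0.0 | 19.58 | 1.0 | 0.871 | 6.152 | 82.6 | 1.7455 | 5.0 | 403.0 | 14.7 | 88.01 | 15.02 | 15.6 |
| 2 | 0.03578 | 20.0 | 3.33 | 0.0 | 0.4429 | 7.82 | 64.5 | 4.6947 | 5.0 | 216.0 | 14.9 | 387.31 | 3.76 | 45.4 |
| 3 | 0.38735 | 0.0 | 25.65 | 0.0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2.0 | 188.0 | 19.1 | 359.29 | 27.26 | 15.7 |
| 4 | 0.06724 | 0.0 | 3.24 | 0.0 | 0.46 | 6.333 | 17.2 | 5.2146 | 4.0 | 430.0 | 16.9 | 375.21 | 7.34 | 22.6 |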


In [21]:
import os

# create the 'data' folder, if it doesn't exist
if not os.path.exists("data"):
    os.mkdir("data")

# save our train/test sets as CSV files within the 'data' folder    
trainX.to_csv('data/boston_train.csv')
testX.to_csv('data/boston_test.csv')

In [22]:
WORK_DIRECTORY = 'data'

# upload the train/test sets to S3
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.


<a id='create-and-fit-estimator'></a>

## Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __container name__. This is constructed as in the shell commands above.
* The __role__. As defined above.
* The __instance count__ which is the number of machines to use for training.
* The __instance type__ which is the type of machine to use for training.
* The __output path__ determines where the model artifact will be written.
* The __session__ is the SageMaker session object that we defined above.

Then we use fit() on the estimator to train against the data that we uploaded above.

In [23]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-random-forest:latest'.format(account, region)
hyperparameters = {'n-estimators': 100,
                       'min-samples-leaf': 3,
                       'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
                       'target': 'target'}

tree = sage.estimator.Estimator(image,
                       role, 1, 'ml.c5.xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       hyperparameters=hyperparameters)

tree.fit(data_location)



2020-06-29 04:19:58 Starting - Starting the training job...
2020-06-29 04:20:01 Starting - Launching requested ML instances......
2020-06-29 04:21:23 Starting - Preparing the instances for training...
2020-06-29 04:21:58 Downloading - Downloading input data
2020-06-29 04:21:58 Training - Downloading the training image...
2020-06-29 04:22:26 Uploading - Uploading generated training model
Starting the training.
reading data
building training and testing datasets
training model
validating model
AE-at-10th-percentile: 0.20342857142857299
AE-at-50th-percentile: 0.8844065934065917
AE-at-90th-percentile: 3.072488888888888
model persisted at /opt/ml/model/model.joblib

2020-06-29 04:22:32 Completed - Training job completed
Training seconds: 40
Billable seconds: 40


<br/>
<br/>

Lastly, let's see where our model artifacts were stored in S3:

In [None]:
tree.model_data
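
The value of `tree.model_data` is an S3 URI pointing at the compressed tar archive described earlier. As a rough sketch of how you might inspect it, the helpers below split such a URI into bucket and key and list the files inside a downloaded archive. Both function names are hypothetical, and the download step is shown only as a comment because it needs S3 access.

```python
import tarfile
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def list_model_archive(archive_path):
    """Return the member names of a gzip-compressed model tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Example (requires S3 access, so it is left commented out):
# import boto3
# bucket, key = split_s3_uri(tree.model_data)
# boto3.client("s3").download_file(bucket, key, "model.tar.gz")
# print(list_model_archive("model.tar.gz"))
```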