# Building an Iris classifier Docker Image
Now it's time to extend the abstract image we just created for scikit-learn algorithms and implement a concrete Docker image with our own algorithms and models.

![Docker Diagram](../../imgs/DockerScikit_B.jpg)

Here, we'll prepare a Docker image with two different algorithms, both trained on the public Iris dataset:
1. Logistic regression
2. Random forest

We'll use a SageMaker feature called "CustomAttributes" to build a dispatcher mechanism: the value sent in this attribute with each request selects which algorithm inside our container handles it.
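
As a rough sketch of the dispatching idea: SageMaker forwards the CustomAttributes value to the container in the `X-Amzn-SageMaker-Custom-Attributes` HTTP header. The function name `select_algorithm`, the list of supported names, and the assumption that the attribute carries nothing but the algorithm name are illustrative choices here, not necessarily what the base image does.

```python
# Header SageMaker uses to forward the CustomAttributes value to the container.
CUSTOM_ATTRIBUTES_HEADER = 'X-Amzn-SageMaker-Custom-Attributes'

# Assumption: the attribute value is simply the name of the algorithm to use.
SUPPORTED_ALGORITHMS = ('logistic', 'random_forest')


def select_algorithm(headers):
    """Return the algorithm name requested through the custom attributes header."""
    # HTTP header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    algo = normalized.get(CUSTOM_ATTRIBUTES_HEADER.lower(), '').strip()
    if algo not in SUPPORTED_ALGORITHMS:
        raise ValueError('Unknown or missing algorithm: %r' % algo)
    return algo


print(select_algorithm({'X-Amzn-SageMaker-Custom-Attributes': 'logistic'}))
```

The returned name could then be passed as the `algo` argument of a prediction function.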

## First, let's create a Dockerfile

In [None]:
%%writefile Dockerfile
FROM base-image:latest

COPY model.py /opt/program

## Then, let's write the code that uses scikit-learn as the ML library

In [None]:
%%writefile model.py
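import numpy as np
import json
import os
import pandas as pd
import re
import sys
import traceback

from sklearn import model_selection

try:
    # Recent scikit-learn releases no longer ship sklearn.externals.joblib
    import joblib
except ImportError:
    from sklearn.externals import joblib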

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


prefix = '/opt/ml'


input_path = os.path.join(prefix, 'input/data')
# If training fails, a 'failure' file with the error message is written here
output_path = os.path.join(prefix, 'output')
# Everything stored here is packed into a .tar.gz by SageMaker and uploaded to S3
model_path = os.path.join(prefix, 'model')
# Hyperparameters passed to the algorithms through the Estimator
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')

model_cache = {}

def train():
    print("Training mode")
    
    try:
        # This algorithm has a single channel of input data called 'training'. Since we run in
        # File mode, the input files are copied to the directory specified here.
        channel_name='training'
        training_path = os.path.join(input_path, channel_name)

        hyper_logistic = {}
        hyper_random_forest = {}
        # Read in any hyperparameters that the user passed with the training job
        with open(param_path, 'r') as tc:
            is_float = re.compile(r'^\d+(?:\.\d+)$')
            is_integer = re.compile(r'^\d+$')
            for key,value in json.load(tc).items():
                # workaround to convert numbers from string
                if is_float.match(value) is not None:
                    value = float(value)
                elif is_integer.match(value) is not None:
                    value = int(value)
                
                if key.startswith('logistic'):
                    key = key.replace('logistic_', '')
                    hyper_logistic[key] = value
                if key.startswith('random_forest'):
                    key = key.replace('random_forest_', '')
                    hyper_random_forest[key] = value

        # Take the set of files and read them all into a single pandas dataframe
        input_files = [ os.path.join(training_path, file) for file in os.listdir(training_path) ]
        if len(input_files) == 0:
            raise ValueError(('There are no files in {}.\n' +
                              'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                              'the data specification in S3 was incorrectly specified or the role specified\n' +
                              'does not have permission to access the data.').format(training_path, channel_name))
        raw_data = [ pd.read_csv(file, sep=',', header=None ) for file in input_files ]
        train_data = pd.concat(raw_data)
        
        # labels are in the first column
        Y = train_data.iloc[:, 0]
        X = train_data.iloc[:, 1:]
        
        X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.33, random_state=7)

        algo = "logistic"
        print("Training: %s" % algo)
        model = LogisticRegression()
        model.set_params(**hyper_logistic)
        model.fit(X_train, Y_train)
        print("{}: {}".format( algo, model.score(X_test, Y_test)) )
        joblib.dump(model, os.path.join(model_path, '%s_model.pkl' % algo))

        algo = "random_forest"
        print("Training: %s" % algo)
        model = RandomForestClassifier()
        model.set_params(**hyper_random_forest)
        model.fit(X_train, Y_train)
        print("{}: {}".format( algo, model.score(X_test, Y_test)) )
        joblib.dump(model, os.path.join(model_path, '%s_model.pkl' % algo))
    
    except Exception as e:
        # Write out an error file. This will be returned as the failureReason in the
        # DescribeTrainingJob result.
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
            
        # Printing this causes the exception to be in the training job logs, as well.
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        
        # A non-zero exit code causes the training job to be marked as Failed.
        sys.exit(255)

def predict(payload, algo):
    if algo is None or payload is None:
        raise ValueError("You must provide both the algorithm and the payload")

    if model_cache.get(algo) is None:
        model_filename = os.path.join(model_path, '%s_model.pkl' % algo)
        model_cache[algo] = joblib.load(model_filename)
    
    return {"iris_id": model_cache[algo].predict( payload ).tolist() }
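
`predict()` loads each model from disk the first time it is requested and keeps it in `model_cache` afterwards. The same lazy-loading pattern is sketched below in isolation. To keep the sketch dependency-free it uses `pickle` from the standard library in place of joblib, and a plain dict as a stand-in for a fitted model; the helper name `load_model` is hypothetical.

```python
import os
import pickle
import tempfile

model_cache = {}


def load_model(algo, model_dir):
    """Return the model for `algo`, reading it from disk only on first use."""
    if algo not in model_cache:
        path = os.path.join(model_dir, '%s_model.pkl' % algo)
        with open(path, 'rb') as f:
            model_cache[algo] = pickle.load(f)
    return model_cache[algo]


# Demo: store a stand-in "model" (a plain dict) and load it twice.
model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, 'logistic_model.pkl'), 'wb') as f:
    pickle.dump({'name': 'logistic'}, f)

first = load_model('logistic', model_dir)
second = load_model('logistic', model_dir)
print(first is second)  # True: the second call is served from the cache
```

Caching matters for serving because the container handles many requests, and deserializing the model on every call would add avoidable latency.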

## Finally, let's create the buildspec
CodeBuild will use this file to build our model image on top of the base image.

In [None]:
%%writefile buildspec.yml
version: 0.2

phases:
  install:
    runtime-versions:
      docker: 18

  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)
      - docker pull $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/scikit-base:latest
      - docker tag $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/scikit-base:latest scikit-base:latest
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
      - docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG

  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker image...
      - echo docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - echo $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG > image.url
      - echo Done
artifacts:
  files:
    - image.url
  name: image_url
  discard-paths: yes
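
The image URL that the buildspec tags, pushes, and writes to `image.url` is assembled from four environment variables. A small sketch of that composition follows; the function name `image_uri` and the sample account ID are illustrative only.

```python
def image_uri(account_id, region, repo_name, image_tag):
    """Compose the ECR image URI the same way the buildspec does."""
    return '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(
        account_id, region, repo_name, image_tag)


# Hypothetical values standing in for AWS_ACCOUNT_ID, AWS_DEFAULT_REGION,
# IMAGE_REPO_NAME and IMAGE_TAG.
uri = image_uri('123456789012', 'us-east-1', 'model', 'test')
print(uri)  # 123456789012.dkr.ecr.us-east-1.amazonaws.com/model:test
```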

### Building the image locally, first

In [None]:
!docker build -f Dockerfile -t model:1.0 .

# Let's do some tests, locally
## First, let's define some hyperparameters for both algorithms

In [None]:
hyperparameters = {
    "logistic_max_iter": 100,
    "logistic_solver": "lbfgs",

    "random_forest_max_depth": 10,
    "random_forest_n_jobs": 5,
    "random_forest_verbose": 1
}

In [None]:
import json
!mkdir -p input/config

# SageMaker delivers every hyperparameter to the container as a string
hyperparameters = {key: str(value) for key, value in hyperparameters.items()}
with open('input/config/hyperparameters.json', 'w') as f:
    json.dump(hyperparameters, f)
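
Because every value arrives as a string, `train()` converts numeric strings back to numbers and routes each key to an algorithm by its prefix. The sketch below reproduces that logic as a standalone function so it can be tried without a container; the function name `split_hyperparameters` is hypothetical.

```python
import re

IS_FLOAT = re.compile(r'^\d+\.\d+$')
IS_INTEGER = re.compile(r'^\d+$')


def split_hyperparameters(raw, prefixes=('logistic', 'random_forest')):
    """Group string hyperparameters by prefix and restore numeric types."""
    grouped = {prefix: {} for prefix in prefixes}
    for key, value in raw.items():
        # Convert strings that look like numbers back to float or int.
        if IS_FLOAT.match(value):
            value = float(value)
        elif IS_INTEGER.match(value):
            value = int(value)
        # Strip the "<prefix>_" part and file the value under that algorithm.
        for prefix in prefixes:
            if key.startswith(prefix + '_'):
                grouped[prefix][key[len(prefix) + 1:]] = value
                break
    return grouped


raw = {
    'logistic_max_iter': '100',
    'logistic_solver': 'lbfgs',
    'random_forest_max_depth': '10',
}
grouped = split_hyperparameters(raw)
print(grouped['logistic'])       # {'max_iter': 100, 'solver': 'lbfgs'}
print(grouped['random_forest'])  # {'max_depth': 10}
```

Each resulting dict can be passed to `set_params(**...)` on the matching estimator, which is what the training code does.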

## Then, let's prepare a dataset


In [None]:
!mkdir -p input/data/training

import pandas as pd
import numpy as np

from sklearn import datasets
iris = datasets.load_iris()

dataset = np.insert(iris.data, 0, iris.target,axis=1)

df = pd.DataFrame(data=dataset, columns=['iris_id'] + iris.feature_names)
df.to_csv('input/data/training/iris.csv', header=False, index=False, sep=',', encoding='utf-8')

df.head()
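
The training code expects exactly this layout: a headerless CSV with the label in the first column and the four features after it. The sketch below illustrates the split using only the standard library; the two sample rows and the helper name `split_label_and_features` are illustrative.

```python
import csv
import io

# Two illustrative rows in the same layout as iris.csv: label first, then features.
sample = "0.0,5.1,3.5,1.4,0.2\n1.0,7.0,3.2,4.7,1.4\n"


def split_label_and_features(csv_text):
    """Split headerless CSV text into a list of labels and a list of feature rows."""
    labels, features = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        values = [float(v) for v in row]
        labels.append(int(values[0]))   # first column is the class id
        features.append(values[1:])     # remaining columns are the features
    return labels, features


labels, features = split_label_and_features(sample)
print(labels)       # [0, 1]
print(features[0])  # [5.1, 3.5, 1.4, 0.2]
```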

## Then, let's test the training process

In [None]:
!mkdir -p model
!rm -f model/*

In [None]:
print( "Training ...")
!docker run --rm --name 'my_model' \
    -v "$PWD/model:/opt/ml/model" \
    -v "$PWD/input:/opt/ml/input" model:1.0 train
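
If training succeeds, the mounted `model` directory should contain one pickle file per algorithm, named `<algo>_model.pkl`. A quick check along those lines could look like the sketch below; the helper name `missing_model_files` is hypothetical, and the demo uses a temporary directory rather than the real `model` folder.

```python
import os
import tempfile

EXPECTED_ALGORITHMS = ('logistic', 'random_forest')


def missing_model_files(model_dir, algorithms=EXPECTED_ALGORITHMS):
    """Return the expected model file names that are absent from model_dir."""
    expected = ['%s_model.pkl' % algo for algo in algorithms]
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]


# Demo: a directory that holds only one of the two expected files.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, 'logistic_model.pkl'), 'wb').close()

missing = missing_model_files(demo_dir)
print(missing)  # ['random_forest_model.pkl']
```

In the notebook, calling `missing_model_files('model')` after the training cell should return an empty list.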

## Now, a basic test with a direct call to our container

In [None]:
print( "Testing with logistic")
!docker run --rm --name 'my_model' \
    -v "$PWD/model:/opt/ml/model" \
    -v "$PWD/input:/opt/ml/input" model:1.0 test logistic "[[4.6, 3.1, 1.5, 0.2]]"
        
print( "Testing with random_forest")
!docker run --rm --name 'my_model' \
    -v "$PWD/model:/opt/ml/model" \
    -v "$PWD/input:/opt/ml/input" model:1.0 test random_forest "[[4.6, 3.1, 1.5, 0.2]]"

## This is the serving test. It simulates an endpoint exposed by SageMaker

The next cell blocks for as long as the container is running, so this notebook will appear frozen. Meanwhile, a web service is exposed on port 8080.

In [None]:
!docker run --rm --name 'my_model' \
    -p 8080:8080 \
    -v "$PWD/model:/opt/ml/model" \
    -v "$PWD/input:/opt/ml/input" model:1.0 serve

> While the above cell is running, click here [TEST NOTEBOOK](02_Testing%20our%20local%20model%20server.ipynb) to run some tests.

> After you finish the tests, press **STOP**
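
For orientation, a request against the local server could be built roughly as follows. This is a sketch under several assumptions: that the container follows the usual SageMaker contract and serves predictions at `/invocations`, that it accepts a JSON list of rows, and that the custom attributes header carries just the algorithm name. The base image may expect a different payload format, so treat the content type and body as placeholders.

```python
import json
import urllib.request


def build_invocation_request(rows, algo, host='http://localhost:8080'):
    """Build (but do not send) a prediction request for the local container."""
    body = json.dumps(rows).encode('utf-8')
    return urllib.request.Request(
        url=host + '/invocations',
        data=body,
        method='POST',
        headers={
            # Assumption: the server accepts a JSON payload.
            'Content-Type': 'application/json',
            # Header through which SageMaker forwards CustomAttributes.
            'X-Amzn-SageMaker-Custom-Attributes': algo,
        },
    )


request = build_invocation_request([[4.6, 3.1, 1.5, 0.2]], 'logistic')
print(request.get_method(), request.full_url)

# With the serving container running, the request could be sent with:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode('utf-8'))
```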

### Before we push our code to the repo, let's check the building process

In [None]:
import boto3

sts_client = boto3.client("sts")
session = boto3.session.Session()

account_id = sts_client.get_caller_identity()["Account"]
region = session.region_name
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()

repo_name='model'
image_tag='test'

In [None]:
!mkdir -p tests
!cp model.py Dockerfile buildspec.yml tests/
with open('tests/vars.env', 'w') as f:
    f.write("AWS_ACCOUNT_ID=%s\n" % account_id)
    f.write("IMAGE_TAG=%s\n" % image_tag)
    f.write("IMAGE_REPO_NAME=%s\n" % repo_name)
    f.write("AWS_DEFAULT_REGION=%s\n" % region)
    f.write("AWS_ACCESS_KEY_ID=%s\n" % credentials.access_key)
    f.write("AWS_SECRET_ACCESS_KEY=%s\n" % credentials.secret_key)
    f.write("AWS_SESSION_TOKEN=%s\n" % credentials.token)

!cat tests/vars.env

In [None]:
%%time

!/tmp/aws-codebuild/local_builds/codebuild_build.sh \
    -a "$PWD/tests/output" \
    -s "$PWD/tests" \
    -i "samirsouza/aws-codebuild-standard:2.0" \
    -e "$PWD/tests/vars.env"

## Ok, now it's time to push everything to the repo

In [None]:
%%bash

cd ../docker
cp $OLDPWD/buildspec.yml $OLDPWD/model.py $OLDPWD/Dockerfile .

git add --all
git commit -a -m " - files for building an iris model image"
git push

### Ok, now open the AWS console in another tab and go to the CodePipeline console to see the status of our building pipeline