Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Amazon Software License (the "License"). You may not
use this file except in compliance with the License. A copy of the
License is located at:
http://aws.amazon.com/asl/
or in the "license" file accompanying this file. This file is distributed
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express
or implied. See the License for the specific language governing permissions
and limitations under the License.

# Hyperparameter Tuning using Your Own Tensorflow Container

This notebook shows how to build your own Keras (TensorFlow) container, test it locally using the SageMaker Python SDK's local mode, and bring it to SageMaker for training with hyperparameter tuning.

The model used in this notebook is a ResNet model trained on the CIFAR-10 dataset. The example is based on https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py

## Set up the notebook instance to support local mode
Currently you need to install docker-compose in order to use local mode (i.e., testing the container in the notebook instance without pushing it to ECR).

In [None]:
!/bin/bash setup.sh
!pip install git+https://github.com/aws/sagemaker-python-sdk

## Set up the environment
We will set up a few things before starting the workflow:

1. Create the SageMaker client (a boto3 client for the chosen region) that is used to call the hyperparameter tuning APIs.
2. Get the execution role, which is passed to SageMaker so that it can access your resources, such as the S3 bucket.
3. Specify the S3 bucket and prefix where the training data set and model artifacts are stored.

In [None]:
import os
import numpy as np
import tempfile

import tensorflow as tf

import sagemaker
import smhpolib
import boto3
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator


# region = 'us-west-2'   # uncomment to use a region other than the one the notebook instance is in
region = boto3.Session().region_name   # use the same region as the notebook instance

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
# Note: this rebinds the name `sagemaker` from the SDK module to a boto3 client,
# so the SDK module is no longer reachable through that name after this line.
sagemaker = boto3.client('sagemaker', region_name=region)

bucket = 'sagemaker-west-2'  # your S3 bucket name; make sure it is in the same region as the one specified above
output_location = 's3://{}/data/DEMO-keras-cifar10/output'.format(bucket)  # S3 location where model artifacts are written

ecr_repository = '811689727410.dkr.ecr.%s.amazonaws.com/test' % region  # your ECR repository, which you should have created before running the notebook

role = get_execution_role()

NUM_CLASSES = 10   # the data set has 10 categories of images

## Complete source code
- [trainer/start.py](trainer/start.py): Keras model
- [trainer/environment.py](trainer/environment.py): Contains information about the SageMaker environment

## Building the image
We will build the Docker image on top of the TensorFlow images published on Docker Hub. The full list of available TensorFlow tags can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/


In [None]:
import shlex
import subprocess

def get_image_name(ecr_repository, tensorflow_version_tag):
    return '%s:tensorflow-%s' % (ecr_repository, tensorflow_version_tag)

def build_image(name, version):
    cmd = 'docker build -t %s --build-arg VERSION=%s -f Dockerfile .' % (name, version)
    subprocess.check_call(shlex.split(cmd))

# Version tags are listed at https://hub.docker.com/r/tensorflow/tensorflow/tags/
# e.g., the latest CPU version is 'latest', while the latest GPU version is 'latest-gpu'
tensorflow_version_tag = 'latest'

image_name = get_image_name(ecr_repository, tensorflow_version_tag)

# TODO: the build logs appear in the console, not in the notebook
print('building image: ' + image_name)
build_image(image_name, tensorflow_version_tag)

## Upload the data to an S3 bucket

In [None]:
def upload_channel(channel_name, x, y):
    y = tf.keras.utils.to_categorical(y, NUM_CLASSES)

    file_path = tempfile.mkdtemp()
    np.savez_compressed(os.path.join(file_path, 'cifar-10-npz-compressed.npz'), x=x, y=y)

    return sagemaker_session.upload_data(path=file_path, bucket=bucket, key_prefix='data/DEMO-keras-cifar10/%s' % channel_name)


def upload_training_data():
    # The data, split between train and test sets:
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

    train_data_location = upload_channel('train', x_train, y_train)
    test_data_location = upload_channel('test', x_test, y_test)

    return {'train': train_data_location, 'test': test_data_location}

channels = upload_training_data()
channels
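
The upload step above writes one compressed `.npz` file per channel. As a minimal sketch of the other side of that hand-off, the snippet below reads such a file back the way a training script might. It assumes the file name used in `upload_channel` and assumes that, in File input mode, SageMaker makes each channel available under `/opt/ml/input/data/<channel_name>` inside the container. The helper `load_channel` is hypothetical; the actual loading code in `trainer/start.py` may differ.

```python
import os
import tempfile

import numpy as np


def load_channel(channel_dir, filename='cifar-10-npz-compressed.npz'):
    """Load the x and y arrays saved by np.savez_compressed from a channel directory."""
    with np.load(os.path.join(channel_dir, filename)) as archive:
        return archive['x'], archive['y']


# Local round trip with small placeholder arrays standing in for CIFAR-10.
demo_dir = tempfile.mkdtemp()
x_demo = np.zeros((4, 32, 32, 3), dtype=np.uint8)
y_demo = np.eye(10, dtype=np.float32)[[0, 1, 2, 3]]  # one-hot labels, 10 classes
np.savez_compressed(os.path.join(demo_dir, 'cifar-10-npz-compressed.npz'), x=x_demo, y=y_demo)

x_loaded, y_loaded = load_channel(demo_dir)
print(x_loaded.shape, y_loaded.shape)
```

Inside the container, the same helper would presumably be called with a path such as `/opt/ml/input/data/train`.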

## Testing the container locally (optional)

You can test the container locally using the local mode of the SageMaker Python SDK. A training container is created in the notebook instance from the Docker image you built. Note that the image has not been pushed to ECR yet, since this step only runs in local mode. You can skip ahead to the tuning step, but testing the container locally helps you find issues quickly before kicking off the tuning job.

### Setting the hyperparameters

In [None]:
hyperparameters = dict(batch_size=32, data_augmentation=True, learning_rate=.0001, 
                       width_shift_range=.1, height_shift_range=.1)
hyperparameters
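
SageMaker hands hyperparameters to a custom container as a JSON map of strings, so the training script has to convert them back to numbers and booleans. The sketch below shows one way to do that. It assumes the file location `/opt/ml/input/config/hyperparameters.json` used by SageMaker training containers and reuses the defaults from the cell above. The helpers `parse_bool` and `load_hyperparameters` are hypothetical and are not taken from `trainer/start.py`.

```python
import json
import os
import tempfile


def parse_bool(value):
    """Interpret common string spellings of a boolean."""
    return str(value).strip().lower() in ('true', '1', 'yes')


def load_hyperparameters(path):
    """Read a hyperparameters JSON file and convert the string values to Python types."""
    with open(path) as f:
        raw = json.load(f)
    return {
        'batch_size': int(raw.get('batch_size', 32)),
        'data_augmentation': parse_bool(raw.get('data_augmentation', 'True')),
        'learning_rate': float(raw.get('learning_rate', 0.0001)),
        'width_shift_range': float(raw.get('width_shift_range', 0.1)),
        'height_shift_range': float(raw.get('height_shift_range', 0.1)),
    }


# Demo with a temporary file standing in for the file SageMaker would provide.
demo_path = os.path.join(tempfile.mkdtemp(), 'hyperparameters.json')
with open(demo_path, 'w') as f:
    json.dump({'batch_size': '32', 'data_augmentation': 'True', 'learning_rate': '0.0001'}, f)

params = load_hyperparameters(demo_path)
print(params)
```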

### Create a training job using local mode

In [None]:
estimator = Estimator(image_name, role=role, output_path=output_location,
                      train_instance_count=1, 
                      train_instance_type='local', hyperparameters=hyperparameters)
estimator.fit(channels)

## Pushing the container to ECR
Before kicking off the tuning job, you need to push the Docker image to ECR. The ECR repository was specified at the beginning of this notebook; please make sure you entered your own ECR repository information there.

In [None]:
def push_image(name):
    cmd = 'aws ecr get-login --no-include-email --region '+region
    login = subprocess.check_output(shlex.split(cmd)).strip()

    subprocess.check_call(shlex.split(login.decode()))

    cmd = 'docker push %s' % name
    subprocess.check_call(shlex.split(cmd))

# TODO: the push logs appear in the console, not in the notebook
print('pushing image: ' + image_name)
push_image(image_name)
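
The `aws ecr get-login` command used above is not available in AWS CLI version 2. The following is a hedged alternative sketch that uses `aws ecr get-login-password` piped into `docker login --password-stdin`. The function names `registry_from_repository` and `push_image_v2` are hypothetical, and the sketch assumes the repository string has the form `<registry host>/<repository name>`, as `ecr_repository` does in this notebook.

```python
import subprocess


def registry_from_repository(repository):
    """Return the registry host, i.e. everything before the first '/'."""
    return repository.split('/', 1)[0]


def push_image_v2(name, repository, region):
    """Log in to ECR with a password token, then push the image."""
    registry = registry_from_repository(repository)
    # Fetch a temporary password for the registry.
    password = subprocess.check_output(
        ['aws', 'ecr', 'get-login-password', '--region', region])
    # Feed the password on stdin so it does not appear in the process list.
    subprocess.run(
        ['docker', 'login', '--username', 'AWS', '--password-stdin', registry],
        input=password, check=True)
    subprocess.check_call(['docker', 'push', name])


# Only the pure helper is exercised here; the push itself needs Docker and AWS credentials.
demo_registry = registry_from_repository('123456789012.dkr.ecr.us-west-2.amazonaws.com/test')
print(demo_registry)
```

In this notebook the call would look like `push_image_v2(image_name, ecr_repository, region)`.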

## Specify hyperparameter tuning job configuration
Now you configure the tuning job by defining a JSON object that you pass as the value of the HyperParameterTuningJobConfig parameter to the create_hyper_parameter_tuning_job call. In this JSON object, you specify:
* The ranges of the hyperparameters you want to tune
* The limits on the resources the tuning job can consume
* The objective metric for the tuning job

In [None]:
import json
from time import gmtime, strftime

tuning_job_name = 'Tensorflow-tuningjob-' + strftime("%d-%H-%M-%S", gmtime())

print(tuning_job_name)

tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "0.01",
          "MinValue": "0.001",
          "Name": "learning_rate",          
        }
      ],
      "IntegerParameterRanges": []
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 4,
      "MaxParallelTrainingJobs": 2
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "loss",
      "Type": "Minimize"
    }
  }
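
Because the ranges are passed as strings, a typo in the configuration is easy to miss until the API call fails. The sketch below runs a few local sanity checks on a configuration shaped like `tuning_job_config`. The function `validate_tuning_job_config` is hypothetical, and the checks are ordinary consistency checks chosen for illustration, not a reproduction of the validation that the service performs.

```python
def validate_tuning_job_config(config):
    """Return a list of human-readable problems found in a tuning job config."""
    problems = []
    ranges = config['ParameterRanges']
    numeric_ranges = (ranges.get('ContinuousParameterRanges', []) +
                      ranges.get('IntegerParameterRanges', []))
    for r in numeric_ranges:
        if float(r['MinValue']) >= float(r['MaxValue']):
            problems.append('%s: MinValue must be smaller than MaxValue' % r['Name'])
    limits = config['ResourceLimits']
    if limits['MaxParallelTrainingJobs'] > limits['MaxNumberOfTrainingJobs']:
        problems.append('MaxParallelTrainingJobs exceeds MaxNumberOfTrainingJobs')
    if config['HyperParameterTuningJobObjective']['Type'] not in ('Minimize', 'Maximize'):
        problems.append('objective Type must be Minimize or Maximize')
    return problems


# A config with the same shape and values as the one defined above.
demo_config = {
    'ParameterRanges': {
        'CategoricalParameterRanges': [],
        'ContinuousParameterRanges': [
            {'Name': 'learning_rate', 'MinValue': '0.001', 'MaxValue': '0.01'},
        ],
        'IntegerParameterRanges': [],
    },
    'ResourceLimits': {'MaxNumberOfTrainingJobs': 4, 'MaxParallelTrainingJobs': 2},
    'Strategy': 'Bayesian',
    'HyperParameterTuningJobObjective': {'MetricName': 'loss', 'Type': 'Minimize'},
}

demo_problems = validate_tuning_job_config(demo_config)
print(demo_problems)
```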


## Specify training job configuration
Now you configure the training jobs that the tuning job launches by defining a JSON object that you pass as the value of the TrainingJobDefinition parameter to the create_hyper_parameter_tuning_job call.
In this JSON object, you specify:
* Metrics that the training jobs emit
* The container image for the algorithm to train
* The input configuration for your training and test data
* Configuration for the output of the algorithm
* The values of any algorithm hyperparameters that are not tuned in the tuning job
* The type of instance to use for the training jobs
* The stopping condition for the training jobs

This example defines one metric that the TensorFlow container emits: loss.
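
To see what the metric regular expression used in this notebook, `loss: ([0-9\\.]+)`, captures, it can be tried locally against a log line. The sketch below uses Python's `re` module as a stand-in for the matching done by SageMaker, and the sample log line is a hypothetical Keras-style progress line, not output captured from this container.

```python
import re

# Same pattern as the "Regex" value in the metric definition.
LOSS_REGEX = "loss: ([0-9\\.]+)"

# Hypothetical log line in the style of a Keras epoch summary.
sample_log_line = ('1563/1563 - 120s - loss: 1.5321 - acc: 0.4412 '
                   '- val_loss: 1.2345 - val_acc: 0.5512')

first_match = re.search(LOSS_REGEX, sample_log_line)
all_matches = re.findall(LOSS_REGEX, sample_log_line)

print(first_match.group(1))
print(all_matches)
```

Note that the pattern also matches the tail of `val_loss: ...`, so a line that reports both values yields two candidates. If the validation loss should be tracked separately, a more specific pattern or log format may be worth considering.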

In [None]:
training_image = image_name

output_location = 's3://{}/tensorflowhpo/{}/output'.format(bucket,tuning_job_name) # where model artifact is written to

print('training artifacts will be uploaded to: {}'.format(output_location))

training_job_definition = {
    "AlgorithmSpecification": {
      "MetricDefinitions": [
        {
          "Name": "loss",
          "Regex": "loss: ([0-9\\.]+)"
        }
      ],
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": channels['train'],
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        },
        {
            "ChannelName": "test",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": channels['test'],
                    "S3DataDistributionType": "FullyReplicated"
                }
            },            
            "CompressionType": "None",
            "RecordWrapperType": "None"            
        }
    ],
    "OutputDataConfig": {
      "S3OutputPath": output_location
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.c4.8xlarge",
      "VolumeSizeInGB": 50
    },
    "RoleArn": role,
    "StaticHyperParameters": {
        "batch_size":"32",
        "data_augmentation":"True",
        "height_shift_range":"0.1",
        "width_shift_range":"0.1"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}


## Create and launch a hyperparameter tuning job
Now you can launch a hyperparameter tuning job by calling the create_hyper_parameter_tuning_job API. Pass the name and the JSON objects you created in the previous steps as the parameter values. After the tuning job is created, you can describe it to see its progress in the next step, and you can go to the SageMaker console (Jobs) to check the progress of each training job it has created.

In [None]:
sagemaker.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                               HyperParameterTuningJobConfig = tuning_job_config,
                                               TrainingJobDefinition = training_job_definition)

## Track hyperparameter tuning job progress
After you launch a tuning job, you can see its progress by calling the describe_hyper_parameter_tuning_job API. The response is a JSON object that contains information about the current state of the tuning job.

In [None]:
sagemaker.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name)
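
Instead of re-running the cell above by hand, the status can be polled in a loop. The following is a minimal sketch that reads the `HyperParameterTuningJobStatus` field of the describe response until the job reaches a terminal state. The helper `wait_for_tuning_job` and the `FakeClient` class are hypothetical; the fake client only exists so the sketch can run without AWS access.

```python
import time

TERMINAL_STATUSES = ('Completed', 'Failed', 'Stopped')


def wait_for_tuning_job(client, job_name, poll_seconds=60, max_polls=None):
    """Poll the tuning job until it reaches a terminal status or max_polls is hit."""
    polls = 0
    while True:
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name)
        status = response['HyperParameterTuningJobStatus']
        polls += 1
        if status in TERMINAL_STATUSES:
            return status
        if max_polls is not None and polls >= max_polls:
            return status  # give up and report the last status seen
        time.sleep(poll_seconds)


class FakeClient:
    """Stand-in for the boto3 client: reports InProgress twice, then Completed."""

    def __init__(self):
        self._statuses = iter(['InProgress', 'InProgress', 'Completed'])

    def describe_hyper_parameter_tuning_job(self, HyperParameterTuningJobName):
        return {'HyperParameterTuningJobStatus': next(self._statuses)}


final_status = wait_for_tuning_job(FakeClient(), 'demo-job', poll_seconds=0)
print(final_status)
```

In this notebook the real call would be `wait_for_tuning_job(sagemaker, tuning_job_name)`, since `sagemaker` is the name bound to the boto3 client.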


In [None]:
# list all training jobs that have been created by the tuning job
list_training_result = sagemaker.list_training_jobs_for_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name, MaxResults=20)
training_job_names = [tjs['TrainingJobName'] for tjs in list_training_result['TrainingJobSummaries']]
print(training_job_names)  # a bare expression is only displayed when it is the last line of a cell
list_training_result
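
Once some training jobs have finished, their summaries can be ranked by the objective metric. The sketch below picks the job with the lowest final objective value, assuming each finished summary carries a `FinalHyperParameterTuningJobObjectiveMetric` entry with a `Value` field. The helper `best_training_job` and the sample summaries are hypothetical and only illustrate the shape of the data.

```python
def best_training_job(summaries, objective_type='Minimize'):
    """Return the summary with the best final objective value, or None if none is available."""
    metric_key = 'FinalHyperParameterTuningJobObjectiveMetric'
    # Jobs that are still running or that failed may not report a final metric yet.
    scored = [s for s in summaries if metric_key in s]
    if not scored:
        return None

    def objective_value(summary):
        return summary[metric_key]['Value']

    if objective_type == 'Minimize':
        return min(scored, key=objective_value)
    return max(scored, key=objective_value)


# Made-up summaries with the same keys used above; the values are illustrative only.
demo_summaries = [
    {'TrainingJobName': 'job-a',
     'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'loss', 'Value': 1.42}},
    {'TrainingJobName': 'job-b',
     'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'loss', 'Value': 0.97}},
    {'TrainingJobName': 'job-c'},  # no final metric yet
]

best = best_training_job(demo_summaries)
print(best['TrainingJobName'])
```

With the real response, the input would be `list_training_result['TrainingJobSummaries']`.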

## Analyze tuning job results - after tuning job is completed
Please refer to "HPO_Analyze_TuningJob_Results.ipynb" to see example code to analyze the tuning job results.

## Deploy the best model
Please refer to "HPO_XGBoost_insurance_claim_prediction" example to see example code to deploy a model. You can also refer to SageMaker documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-deploy-model.html