Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Amazon Software License (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at: http://aws.amazon.com/asl/ or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and limitations under the License.

# Hyperparameter Tuning on SageMaker XGBoost algorithm

This sample notebook shows how to use [Amazon SageMaker's built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to predict whether a driver will file an auto insurance claim in the next year, based on a public dataset provided by an insurance company on Kaggle (for more information about the dataset, see https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data). It uses hyperparameter tuning to automatically launch training jobs with different hyperparameter combinations and find the combination that yields the best model.

You can also use your own dataset. In that case, change the training data location to your own S3 bucket, as shown later in the notebook.

After the tuning job completes, we also show how to deploy the best model and make predictions against the endpoint, following the same steps used in other SageMaker sample notebooks.

---
## Prerequisites and Preprocessing

### Permissions and environment variables

Here we set up the linkage and authentication to AWS services.

#### Get the HPO client, which is region specific

In [1]:
import boto3
import smhpolib

region = boto3.Session().region_name
account = boto3.Session().client('sts').get_caller_identity()['Account']
sagemaker = boto3.Session().client('sagemaker')

#### Get the execution role that is to be passed to training jobs

In [2]:
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

arn:aws:iam::306280812807:role/service-role/AmazonSageMaker-ExecutionRole-20180117T091311


#### Specify S3 bucket and prefix
Set up the S3 bucket that you want to use for training data and model data. This example uses a public dataset stored in a public S3 bucket that we prepared. If you want to use your own datasets, put them in the bucket you specify here.

In [3]:
bucket = 'sagemaker-{}-{}'.format(region, account)    # put your s3 bucket here
prefix = 'hpo/xgboost'       # specify the s3 prefix (i.e., subfolder) for this exercise
bucket

'sagemaker-us-east-1-306280812807'

## Specify hyperparameter tuning job configuration
Now you configure the tuning job by defining a JSON object that you pass as the value of the HyperParameterTuningJobConfig parameter to the create_hyper_parameter_tuning_job call. In this JSON object, you specify:
* The ranges of hyperparameters you want to tune
* The limits on the resources the tuning job can consume
* The objective metric for the tuning job


In [75]:
from time import gmtime, strftime, sleep
tuning_job_name = 'xgboost-tuningjob-' + strftime("%d-%H-%M-%S", gmtime())

print (tuning_job_name)

tuning_job_config = {
    "ParameterRanges": { 
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta",
        },
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "rate_drop",
        },
        {
          "MaxValue": "10",
          "MinValue": "0",
          "Name": "gamma",
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight",
        },
        {
          "MaxValue": "2",
          "MinValue": "1",
          "Name": "tweedie_variance_power",
        }
      ],

      "IntegerParameterRanges": [
        {
          "MaxValue": "20",
          "MinValue": "5",
          "Name": "max_depth",
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "max_delta_step",
        },
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 20,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:auc",
      "Type": "Maximize"
    }
  }

xgboost-tuningjob-22-05-31-55
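
Before submitting the job, it can help to sanity-check the configuration locally. The following is a minimal sketch, not part of the SageMaker API: `validate_tuning_config` is a hypothetical helper, and `example_config` is a made-up config with the same shape as `tuning_job_config`. It checks only range bounds and resource limits; the service performs its own, more complete validation.

```python
def validate_tuning_config(config):
    """Return a list of problems found in a tuning job config dict (empty if none)."""
    problems = []
    seen = set()
    ranges = config.get("ParameterRanges", {})
    for kind in ("ContinuousParameterRanges", "IntegerParameterRanges"):
        for r in ranges.get(kind, []):
            name = r["Name"]
            if name in seen:
                problems.append("%s: defined more than once" % name)
            seen.add(name)
            # The API takes range bounds as strings, so convert before comparing.
            low, high = float(r["MinValue"]), float(r["MaxValue"])
            if low > high:
                problems.append("%s: MinValue is greater than MaxValue" % name)
            if kind == "IntegerParameterRanges" and not (low.is_integer() and high.is_integer()):
                problems.append("%s: integer range has non-integer bounds" % name)
    limits = config.get("ResourceLimits", {})
    if limits.get("MaxParallelTrainingJobs", 0) > limits.get("MaxNumberOfTrainingJobs", 0):
        problems.append("MaxParallelTrainingJobs exceeds MaxNumberOfTrainingJobs")
    return problems


# Hypothetical example with the same shape as tuning_job_config.
example_config = {
    "ParameterRanges": {
        "ContinuousParameterRanges": [{"Name": "eta", "MinValue": "0", "MaxValue": "1"}],
        "IntegerParameterRanges": [{"Name": "max_depth", "MinValue": "5", "MaxValue": "20"}],
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 20, "MaxParallelTrainingJobs": 3},
}
# An empty list means the helper found no problems.
print(validate_tuning_config(example_config))
```

In the notebook, the same check could be run as `validate_tuning_config(tuning_job_config)` before launching the job.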


## Specify training job configuration
Now you configure the training jobs that the tuning job launches by defining a JSON object that you pass as the value of the TrainingJobDefinition parameter to the create_hyper_parameter_tuning_job call.
In this JSON object, you specify:
* Metrics that the training jobs emit
* The container image for the algorithm to train
* The input configuration for your training and validation data
* Configuration for the output of the algorithm
* The values of any algorithm hyperparameters that are not tuned in the tuning job
* The type of instance to use for the training jobs
* The stopping condition for the training jobs

The built-in XGBoost algorithm emits predefined metrics, including validation:auc and train:auc, so this example does not need to define them. We set static values for the eval_metric, num_round, and objective hyperparameters of the built-in XGBoost algorithm; these stay fixed across all training jobs.

In [76]:
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
           'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
           'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
           'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}
           
training_image = containers[region]

training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://public-test-hpo-datasets-{}/kaggle/porto-seguro/xgb/train/".format(region)
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://public-test-hpo-datasets-{}/kaggle/porto-seguro/xgb/val/".format(region)
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}".format(bucket)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.c4.8xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "auc",
      "num_round": "160",
      "objective": "binary:logistic"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}
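
The two entries in InputDataConfig differ only in their channel name and S3 location. As a sketch of how that repetition could be factored out, the hypothetical helper `csv_channel` below builds one channel entry with the same fields used above. The bucket name in the example is a placeholder, not a real location.

```python
def csv_channel(name, s3_uri):
    """Build one InputDataConfig entry for uncompressed CSV data under an S3 prefix."""
    return {
        "ChannelName": name,
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
            }
        },
    }


# Placeholder bucket and prefixes, for illustration only.
example_input_data_config = [
    csv_channel("train", "s3://my-example-bucket/xgb/train/"),
    csv_channel("validation", "s3://my-example-bucket/xgb/val/"),
]
print([channel["ChannelName"] for channel in example_input_data_config])
```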


## Create and launch a hyperparameter tuning job
Now you can launch the hyperparameter tuning job by calling the create_hyper_parameter_tuning_job API. Pass the name and the JSON objects you created in the previous steps as the parameter values. After the tuning job is created, you can describe it to see its progress, as shown in the next step, and you can go to SageMaker console->Jobs to check the progress of each training job it has created.

In [77]:
sagemaker.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                            HyperParameterTuningJobConfig = tuning_job_config,
                                            TrainingJobDefinition = training_job_definition)

{'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-east-1:306280812807:hyper-parameter-tuning-job/xgboost-tuningjob-22-05-31-55',
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '130',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 22 May 2018 05:32:14 GMT',
   'x-amzn-requestid': '79b3063f-b323-4e60-9ce7-53846d06fd4b'},
  'HTTPStatusCode': 200,
  'RequestId': '79b3063f-b323-4e60-9ce7-53846d06fd4b',
  'RetryAttempts': 0}}

## Track hyperparameter tuning job progress
After you launch a tuning job, you can see its progress by calling the describe_hyper_parameter_tuning_job API. The output is a JSON object that contains information about the current state of the tuning job.

In [7]:
# run this cell to check current status of hyperparameter tuning job
sagemaker.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)

{'CreationTime': datetime.datetime(2018, 5, 21, 14, 40, 43, tzinfo=tzlocal()),
 'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-east-1:306280812807:hyper-parameter-tuning-job/xgboost-tuningjob-21-14-31-09',
 'HyperParameterTuningJobConfig': {'HyperParameterTuningJobObjective': {'MetricName': 'validation:auc',
   'Type': 'Maximize'},
  'ParameterRanges': {'CategoricalParameterRanges': [],
   'ContinuousParameterRanges': [{'MaxValue': '1',
     'MinValue': '0',
     'Name': 'eta'},
    {'MaxValue': '10', 'MinValue': '0', 'Name': 'gamma'},
    {'MaxValue': '10', 'MinValue': '1', 'Name': 'min_child_weight'}],
   'IntegerParameterRanges': [{'MaxValue': '10',
     'MinValue': '1',
     'Name': 'max_depth'}]},
  'ResourceLimits': {'MaxNumberOfTrainingJobs': 20,
   'MaxParallelTrainingJobs': 3},
  'Strategy': 'Bayesian'},
 'HyperParameterTuningJobName': 'xgboost-tuningjob-21-14-31-09',
 'HyperParameterTuningJobStatus': 'InProgress',
 'LastModifiedTime': datetime.datetime(2018, 5, 21, 14, 4
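
Instead of re-running the describe cell by hand, the status can be polled in a loop. The sketch below assumes only the `HyperParameterTuningJobStatus` field shown in the output above. `wait_for_tuning_job`, its parameters, and the chosen set of terminal statuses are illustrative rather than part of boto3.

```python
import time

# Statuses after which the tuning job no longer changes (assumed set for this sketch).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(client, job_name, poll_seconds=60, max_polls=None, sleep_fn=time.sleep):
    """Poll the tuning job until it reaches a terminal status or max_polls is hit.

    Returns the last status seen.
    """
    polls = 0
    while True:
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name)
        status = response["HyperParameterTuningJobStatus"]
        polls += 1
        if status in TERMINAL_STATUSES:
            return status
        if max_polls is not None and polls >= max_polls:
            return status
        sleep_fn(poll_seconds)
```

In this notebook it could be called as `wait_for_tuning_job(sagemaker, tuning_job_name)`. The `sleep_fn` parameter exists only so the loop can be exercised without real waiting.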

You can call list_training_jobs_for_hyper_parameter_tuning_job to see a detailed list of the training jobs that the tuning job launched.

In [8]:

# list all training jobs that have been created by the tuning job
list_training_result = sagemaker.list_training_jobs_for_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name, MaxResults=20)
list_training_result

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '1080',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 21 May 2018 14:40:58 GMT',
   'x-amzn-requestid': 'c7db69e1-25d5-4441-8b4e-5e3c5edb3a9e'},
  'HTTPStatusCode': 200,
  'RequestId': 'c7db69e1-25d5-4441-8b4e-5e3c5edb3a9e',
  'RetryAttempts': 0},
 'TrainingJobSummaries': [{'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:306280812807:training-job/xgboost-tuningjob-21-14-31-09-003-af0c2c92',
   'TrainingJobName': 'xgboost-tuningjob-21-14-31-09-003-af0c2c92',
   'TrainingJobStatus': 'InProgress',
   'TunedHyperParameters': {'eta': '0.18771938145441602',
    'gamma': '0.23709200061741154',
    'max_depth': '6',
    'min_child_weight': '5.648140042272571'}},
  {'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:306280812807:training-job/xgboost-tuningjob-21-14-31-09-002-428b2ff3',
   'TrainingJobName': 'xgboost-tuningjob-21-14-31-09-002-428b2ff3',
   'TrainingJobStatus': 'InProgress',
   '
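
For tuning jobs with more training jobs than fit in a single response, the list call can be paged through with `NextToken`, and the statuses can be tallied for a quick overview. The following is a sketch under those assumptions; the helper names `iter_training_job_summaries` and `count_statuses` are hypothetical.

```python
from collections import Counter


def iter_training_job_summaries(client, job_name, page_size=20):
    """Yield every training job summary for a tuning job, following NextToken pages."""
    kwargs = {"HyperParameterTuningJobName": job_name, "MaxResults": page_size}
    while True:
        page = client.list_training_jobs_for_hyper_parameter_tuning_job(**kwargs)
        for summary in page.get("TrainingJobSummaries", []):
            yield summary
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def count_statuses(summaries):
    """Count training jobs per TrainingJobStatus value."""
    return Counter(summary["TrainingJobStatus"] for summary in summaries)
```

In the notebook this could be used as `count_statuses(iter_training_job_summaries(sagemaker, tuning_job_name))`.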

In [None]:
# Don't go beyond here with Run All. When the tuning job is completed, skip this cell and move on.
assert False

## Analyze tuning job results - after tuning job is completed
Once the tuning job is completed (i.e., all training jobs have finished), we can list the hyperparameters and objective metrics of all training jobs and pick the training job with the best objective metric.

In [None]:
import pandas as pd
from smhpolib import analysis    # analysis library provided through smhpolib; the source code is under the /smhpolib folder

tuning = analysis.TuningJob(tuning_job_name = tuning_job_name)

HPO_params = tuning.hyperparam_dataframe()

# Define df up front so that later cells still work when no job has finished yet.
df = HPO_params
if len(HPO_params) > 0:
    # Keep only training jobs that reported a finite objective value
    df = HPO_params[HPO_params['FinalObjectiveValue'] > -float('inf')]

if len(df) > 0:
    df = df.sort_values('FinalObjectiveValue', ascending=False)
    print("Valid objective: %d" % len(df))
    print({"lowest": min(df['FinalObjectiveValue']), "highest": max(df['FinalObjectiveValue'])})
    best_model = df.iloc[0]
    print("best model information: \n%s" % best_model)
    best_training_job_name = best_model['TrainingJobName']
    pd.set_option('display.max_colwidth', None)  # Don't truncate TrainingJobName (pandas < 1.0 used -1 here)
else:
    print("Training jobs launched are not completed yet. Try again in a few minutes.")

df
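
The same selection can be sketched without smhpolib, working directly on the summaries returned by list_training_jobs_for_hyper_parameter_tuning_job. This assumes each completed job's summary carries a `FinalHyperParameterTuningJobObjectiveMetric` entry with a `Value` field. The helper name `best_training_job` and the sample values are made up for illustration.

```python
def best_training_job(summaries, maximize=True):
    """Return the completed summary with the best objective value, or None if there is none."""
    scored = []
    for summary in summaries:
        metric = summary.get("FinalHyperParameterTuningJobObjectiveMetric")
        # Skip jobs that are still running or never reported the objective metric.
        if summary.get("TrainingJobStatus") == "Completed" and metric is not None:
            scored.append((metric["Value"], summary))
    if not scored:
        return None
    pick = max if maximize else min
    return pick(scored, key=lambda pair: pair[0])[1]


# Made-up summaries, for illustration only.
example_summaries = [
    {"TrainingJobName": "job-001", "TrainingJobStatus": "Completed",
     "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "validation:auc", "Value": 0.61}},
    {"TrainingJobName": "job-002", "TrainingJobStatus": "Completed",
     "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "validation:auc", "Value": 0.63}},
    {"TrainingJobName": "job-003", "TrainingJobStatus": "InProgress"},
]
print(best_training_job(example_summaries)["TrainingJobName"])  # prints job-002
```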

## See TuningJob results vs time
Next, we show how the objective metric changes over time as the tuning job progresses.

In [None]:
import bokeh
import bokeh.io
bokeh.io.output_notebook()
from bokeh.plotting import figure, show
import bokeh.palettes

def big_warp_palette(size, palette_func, warp=1):
    """Setting warp < 1 exaggerates the high end.
    Setting warp > 1 exaggerates the low end."""
    p = palette_func(256)
    out = []
    for i in range(size):
        f = i / size # fraction in [0, 1): 0 is included, 1 is not
        f **= warp
        idx = int(f * 255)
        out.append(p[idx])
    return out

if len(df) > 0:
    palette = big_warp_palette(len(df),bokeh.palettes.plasma, 0.4)
    df['color'] = palette
    hover = smhpolib.viz.SmhpoHover(tuning)

    p = figure(plot_width=900, plot_height=400, tools=hover.tools(), x_axis_type='datetime')
    p.circle(source=df, x='TrainingCreationTime', y='FinalObjectiveValue', color='color')
    show(p)
else:
    print("Training jobs launched are not completed yet. Try again in a few minutes.")
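
To see what the warp parameter does, the index arithmetic in big_warp_palette can be reproduced without bokeh. The helper `warp_indices` below is a hypothetical stand-in that returns the palette positions the function would pick for each row. With warp below 1, the indices shift toward the upper part of the 256-color palette.

```python
def warp_indices(size, warp=1, palette_len=256):
    """Palette indices that big_warp_palette would select for `size` rows."""
    return [int((i / size) ** warp * (palette_len - 1)) for i in range(size)]


print(warp_indices(4, warp=1))    # prints [0, 63, 127, 191]
print(warp_indices(4, warp=0.5))  # prints [0, 127, 180, 220]
```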


## Analyze the correlation between objective metric and individual hyperparameters 
Now that you have finished a tuning job, you may want to know the correlation between your objective metric and the individual hyperparameters you selected to tune. That insight helps you decide whether it makes sense to adjust the search ranges for certain hyperparameters and start another tuning job. For example, if you see a positive trend between the objective metric and a numerical hyperparameter, you probably want to set a higher tuning range for that hyperparameter in your next tuning job.

The following cell draws a graph for each hyperparameter to show its correlation with your objective metric.

In [None]:
import bokeh.layouts

# Hyperparameters to check for correlation with the objective metric
all_hyperparameters = tuning.hyperparam_ranges().keys()
all_hyperparameters

figures = []
for hp in all_hyperparameters:
    p = figure(plot_width=500, plot_height=500, 
                title="Final objective vs %s" % hp,
                tools=hover.tools(),
                x_axis_label=hp, y_axis_label="objective")
    p.circle(source=df,x=hp,y='FinalObjectiveValue',color='color')
    figures.append(p)
show(bokeh.layouts.Column(*figures))
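
The plots give a visual impression of each trend. A single number per hyperparameter can complement them. The following is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the sample values are illustrative, and a linear correlation can miss non-monotonic relationships that the scatter plots would reveal.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a hyperparameter and an objective that rises with it.
example_eta = [0.1, 0.2, 0.3, 0.4]
example_auc = [0.60, 0.61, 0.63, 0.64]
print(round(pearson(example_eta, example_auc), 3))
```

Assuming the hyperparameter columns in `df` are numeric, this could be applied as `pearson(df[hp].astype(float), df['FinalObjectiveValue'])` for each `hp`.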


## Deploy the best model
Now we are ready to deploy the best model so we can make inferences against it. To deploy a model, we first import the model from training into hosting, then create an endpoint configuration, and finally create an endpoint from that endpoint configuration.

### Import model into hosting

In [None]:
%%time
from time import gmtime, strftime

model_name=best_training_job_name
print(model_name)

info = sagemaker.describe_training_job(TrainingJobName=best_training_job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
hosting_image = training_image  # For XGBoost algorithm, training and hosting share the same image

primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data
}

create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

### Create endpoint configuration
Now, we'll create an endpoint configuration which provides the instance type and count for model deployment.

In [None]:
endpoint_config_name = 'XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.c5.xlarge',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

### Create endpoint
Lastly, we create the endpoint that serves the model by specifying the endpoint name and the configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications.

In [None]:
endpoint_name = 'XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

try:
    sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
finally:
    resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Arn: " + resp['EndpointArn'])
    print("Create endpoint ended with status: " + status)

    if status != 'InService':
        message = sagemaker.describe_endpoint(EndpointName=endpoint_name)['FailureReason']
        print('Endpoint creation failed with the following error: {}'.format(message))
        raise Exception('Endpoint creation did not succeed')


## Validate the model for use
Finally, you can validate the model for use by invoking the endpoint you just created and passing in a sample record for prediction.

### Get some sample data
You can simply use the first row in the validation data, which is:

In [None]:
# First row from validation dataset
sample_record="0,5,1,4,0,0,0,0,0,1,0,0,0,0,0,6,1,0,0,0.9,1.8,2.332648709,10,0,-1,0,0,14,1,1,0,1,104,2,0.445982062,0.879049073,0.40620192,3,0.7,0.8,0.4,3,1,8,2,11,3,8,4,2,0,9,0,1,0,1,1,1"
label,payload = sample_record.split(',',maxsplit=1)

### Run prediction

In [None]:
import boto3

# 'sagemaker-runtime' is the canonical boto3 name for the SageMaker runtime service
runtime_client = boto3.client('sagemaker-runtime')

response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                          ContentType='text/csv',
                                          Body=payload)
result = response['Body'].read()
result = float(result.decode("utf-8"))  # a single input row yields a single score
print('Label: ', label, '\nPrediction: ', result)
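
With the binary:logistic objective, the endpoint returns a score between 0 and 1 rather than a class label. The sketch below turns a response body into (score, label) pairs. It assumes that a multi-row request yields scores separated by commas or newlines. The 0.5 threshold is an arbitrary choice for illustration and would normally be tuned for an imbalanced dataset. `parse_predictions` is a hypothetical helper.

```python
def parse_predictions(body, threshold=0.5):
    """Parse a CSV response body into a list of (score, predicted_label) pairs."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    # Accept either commas or newlines between scores (assumed response layout).
    tokens = text.replace("\n", ",").split(",")
    scores = [float(token) for token in tokens if token.strip()]
    return [(score, int(score >= threshold)) for score in scores]


print(parse_predictions(b"0.03,0.72\n"))  # prints [(0.03, 0), (0.72, 1)]
```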