### Scenario Description

    In this notebook we do the following:
    - Use training and test data in CSV format
    - Perform binary classification
    - Use a pre-built Amazon SageMaker container for XGBoost
    - Tune hyperparameters with the SageMaker hyperparameter tuner
    - Specify debugging configurations

In [1]:
import sagemaker
import boto3
from sagemaker.session import s3_input

session = sagemaker.Session()
sm = boto3.Session().client('sagemaker')
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print ('Role-',role)
print ('Region-',region)

Role- arn:aws:iam::951135073253:role/service-role/AmazonSageMaker-ExecutionRole-20200722T234773
Region- eu-west-1


In [45]:
import time

BUCKET_NAME = 'snowflake-getting-started'
BASE_PREFIX = 'bank-marketing'

EXPERIMENTS_OUTPUT_LOC = 's3://'+BUCKET_NAME+'/'+BASE_PREFIX+'/experiments-xboost-hyperparametertuning'
print ('Experiment metadata would be published at -',EXPERIMENTS_OUTPUT_LOC)

EXP_CHECKPOINT=EXPERIMENTS_OUTPUT_LOC+'/checkpoint'
EXP_DEBUGGING_OUTPUTS=EXPERIMENTS_OUTPUT_LOC+'/debugging'
EXP_TRAINED_MODELS=EXPERIMENTS_OUTPUT_LOC+'/trained_models'
EXP_SOURCE_CODE= EXPERIMENTS_OUTPUT_LOC+'/code'

print ('Experiment debugging data available at -',EXP_DEBUGGING_OUTPUTS)
print ('Experiment trained models available at -',EXP_TRAINED_MODELS)
print ('Experiment checkpoints available at -',EXP_CHECKPOINT)
print ('Experiment code available at -',EXP_SOURCE_CODE)

Experiment metadata would be published at - s3://snowflake-getting-started/bank-marketing/experiments-xboost-hyperparametertuning
Experiment debugging data available at - s3://snowflake-getting-started/bank-marketing/experiments-xboost-hyperparametertuning/debugging
Experiment trained models available at - s3://snowflake-getting-started/bank-marketing/experiments-xboost-hyperparametertuning/trained_models
Experiment checkpoints available at - s3://snowflake-getting-started/bank-marketing/experiments-xboost-hyperparametertuning/checkpoint
Experiment code available at - s3://snowflake-getting-started/bank-marketing/experiments-xboost-hyperparametertuning/code


In [4]:
# define the data type and paths to the training and validation datasets
content_type = "text/csv"
train_input = s3_input("s3://{}/{}/{}".format(BUCKET_NAME, BASE_PREFIX, 'train/train_data.csv'), content_type=content_type)
validation_input = s3_input("s3://{}/{}/{}".format(BUCKET_NAME, BASE_PREFIX, 'test/test_data.csv'), content_type=content_type)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
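
The deprecation warning above refers to the rename of `s3_input` to `TrainingInput` in SageMaker Python SDK v2. A minimal sketch of the same cell under v2 follows, assuming `sagemaker>=2` is installed. The helper names `channel_uris` and `make_training_inputs` are hypothetical and not part of the SDK.

```python
def channel_uris(bucket, prefix):
    """Build the S3 URIs of the train and validation CSV files."""
    return {
        "train": "s3://{}/{}/train/train_data.csv".format(bucket, prefix),
        "validation": "s3://{}/{}/test/test_data.csv".format(bucket, prefix),
    }


def make_training_inputs(uris, content_type="text/csv"):
    """Wrap each URI in a TrainingInput (the SDK v2 name for s3_input)."""
    # Imported lazily so the URI helper above works without the SDK installed.
    from sagemaker.inputs import TrainingInput

    return {
        channel: TrainingInput(uri, content_type=content_type)
        for channel, uri in uris.items()
    }
```

With the SDK available, `make_training_inputs(channel_uris(BUCKET_NAME, BASE_PREFIX))` would return a dictionary that can be passed as `inputs` to `fit`.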


In [51]:
from sagemaker.estimator import Estimator
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig
from sagemaker.amazon.amazon_estimator import get_image_uri

# we use the Hyperparameter Tuner
from sagemaker.tuner import IntegerParameter
from sagemaker.tuner import ContinuousParameter
from sagemaker.tuner import HyperparameterTuner

# keep this an integer: with a string, save_interval * 2 below would yield '11' instead of 2
save_interval = 1
container = get_image_uri(region, 'xgboost',repo_version='1.0-1')

print (container)
algorithm_mode_default_estimator = Estimator(container,
                                              train_instance_type='ml.m4.xlarge',
                                              train_instance_count=1,
                                              sagemaker_session = session,
                                              role = role,
                                              input_mode='File',
                                              enable_network_isolation = True, # disallow internet access from the training container
                                              checkpoint_s3_uri = EXP_CHECKPOINT,
                                              enable_sagemaker_metrics=True,
                                              debugger_hook_config=DebuggerHookConfig(
                                                          s3_output_path=EXP_DEBUGGING_OUTPUTS, 
                                                          hook_parameters={
                                                            'save_interval': str(save_interval)
                                                          },
                                                          # Required - See https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections for supported collections
                                                          collection_configs=[ 
                                                              CollectionConfig( name="metrics"), 
                                                              CollectionConfig( name="feature_importance"), 
                                                              CollectionConfig( name="full_shap"), 
                                                              CollectionConfig( name="average_shap"), 
                                                          ],
                                                        ),
                                              rules=[ 
                                                  Rule.sagemaker( 
                                                      rule_configs.loss_not_decreasing(), 
                                                      rule_parameters={ "collection_names": "metrics", "num_steps": str(save_interval * 2), }, 
                                                  ), 
                                              ],
                                              output_path = EXP_TRAINED_MODELS
                                        )

# Define exploration boundaries (default suggested values from Amazon SageMaker Documentation)
hyperparameter_ranges = {
    'alpha': ContinuousParameter(0, 1000, scaling_type="Auto"),
    'eta': ContinuousParameter(0.1, 0.5, scaling_type='Logarithmic'),
    'max_depth': IntegerParameter(0,10,scaling_type='Auto'),
    'min_child_weight': ContinuousParameter(0,10,scaling_type='Auto'),
    'num_round': IntegerParameter(1,4000,scaling_type='Auto'),
    'subsample': ContinuousParameter(0.5,1,scaling_type='Logarithmic')}

objective_metric_name = 'validation:auc'

algorithm_mode_hyper_tuning_estimator = HyperparameterTuner(
                                                            algorithm_mode_default_estimator,
                                                            objective_metric_name,
                                                            hyperparameter_ranges,
                                                            max_jobs=5,
                                                            max_parallel_jobs=2,
                                                            strategy='Bayesian'
                                                        )

algorithm_mode_hyper_tuning_estimator.fit(
                                            inputs={'train': train_input, 'validation': validation_input},    
                                            logs=True,
                                            # This is a fire-and-forget call. By setting wait=False, you just submit the job to run in the background.
                                            # Amazon SageMaker starts the tuning job and releases control to the next cells in the notebook.
                                            # Continue with the cells below to inspect the status of its training jobs.
                                            wait=False
                                        )



141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3
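
Because the tuning job was submitted without waiting, its progress has to be checked separately. One way to do that, sketched below, is to poll `describe_hyper_parameter_tuning_job` on the boto3 SageMaker client until the job reaches a terminal status. The function name `wait_for_tuning_job` and the 60-second default interval are assumptions made for illustration.

```python
import time

# Statuses after which a tuning job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll a tuning job until it reaches a terminal status and return it."""
    while True:
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )
        status = response["HyperParameterTuningJobStatus"]
        # Per-status counts of the training jobs launched so far, if present.
        counters = response.get("TrainingJobStatusCounters", {})
        print("Tuning job", job_name, "is", status, counters)
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

In this notebook the call would look like `wait_for_tuning_job(sm, algorithm_mode_hyper_tuning_estimator.latest_tuning_job.name)`, assuming the tuner exposes the submitted job through `latest_tuning_job`.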


### Compare the model training runs for an experiment

We now use the analytics capabilities of the SageMaker Python SDK to query and compare the training runs and identify the best model produced by the experiment. Trial components can also be retrieved with a search expression.

In [52]:
# Retrieve analytics object
algorithm_mode_hyper_tuning_estimator_analytics = algorithm_mode_hyper_tuning_estimator.analytics()

# Look at summary of associated training jobs
tuner_dataframe = algorithm_mode_hyper_tuning_estimator_analytics.dataframe()

tuner_dataframe
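
The summary dataframe can be ranked to compare runs side by side. The sketch below assumes the dataframe has the `TrainingJobName`, `TrainingJobStatus` and `FinalObjectiveValue` columns that the tuner analytics object normally produces. The helper name `rank_training_jobs` is hypothetical.

```python
import pandas as pd


def rank_training_jobs(df, maximize=True, top_n=5):
    """Return completed training jobs ordered by final objective value."""
    # Jobs that failed or were stopped have no meaningful objective value.
    completed = df[df["TrainingJobStatus"] == "Completed"]
    ranked = completed.sort_values("FinalObjectiveValue", ascending=not maximize)
    return ranked.head(top_n).reset_index(drop=True)
```

Since the objective here is `validation:auc`, higher is better, so `rank_training_jobs(tuner_dataframe)` with the default `maximize=True` would list the strongest runs first.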

In [30]:
#get the best training job
algorithm_mode_hyper_tuning_estimator.best_training_job()

'sagemaker-xgboost-200730-2005-007-f01bf205'
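
The job name returned above can be turned into the location of its model archive. The sketch below assumes the layout `<output_path>/<training job name>/output/model.tar.gz`, which is inferred from the download path used in the next cell. The helper name `model_artifact_uri` is hypothetical.

```python
def model_artifact_uri(output_path, training_job_name):
    """Return the S3 URI where a training job stores its model archive."""
    return "{}/{}/output/model.tar.gz".format(
        output_path.rstrip("/"), training_job_name
    )
```

For example, `model_artifact_uri(EXP_TRAINED_MODELS, algorithm_mode_hyper_tuning_estimator.best_training_job())` would give the URI to pass to `aws s3 cp`, which avoids copying a job name by hand.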

In [42]:
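# download a trained model and load it; it can then be used for local predictions
# (the path below points at job ...-001-8b586175; substitute the name returned by best_training_job() to fetch the best model)
!aws s3 cp s3://snowflake-getting-started/bank-marketing/experiments-xboost-hyperparametertuning/trained_models/sagemaker-xgboost-200730-2005-001-8b586175/output/model.tar.gz .

# unpack the model archive
import tarfile
import pickle as pkl

with tarfile.open('model.tar.gz') as tar:
    tar.extractall()

# the archive holds a pickled xgboost Booster, so the xgboost package must be importable
with open("xgboost-model", "rb") as f:
    model = pkl.load(f)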

download: s3://snowflake-getting-started/bank-marketing/experiments-xboost-hyperparametertuning/trained_models/sagemaker-xgboost-200730-2005-001-8b586175/output/model.tar.gz to ./model.tar.gz
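
Extracting the whole archive is more than this step needs, since only the `xgboost-model` file is loaded afterwards. A minimal sketch of pulling out a single named member with the standard library follows. The helper name `extract_member` is hypothetical.

```python
import os
import tarfile


def extract_member(archive_path, member_name, dest_dir="."):
    """Extract one named regular file from a tar archive and return its path."""
    with tarfile.open(archive_path) as tar:
        member = tar.getmember(member_name)  # raises KeyError if absent
        if not member.isfile():
            raise ValueError("{} is not a regular file".format(member_name))
        tar.extract(member, path=dest_dir)
    return os.path.join(dest_dir, member_name)
```

Calling `extract_member('model.tar.gz', 'xgboost-model')` would leave the file in the working directory, ready for the `pkl.load` call shown above.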


In [41]:
model

<xgboost.core.Booster at 0x7f1b4e27a810>