# Random search with SageMaker XGBoost and Automatic Model Tuning

---

This notebook is a simplified version of https://sagemaker-examples.readthedocs.io/en/latest/hyperparameter_tuning/xgboost_random_log/hpo_xgboost_random_log.html


## Contents

1. [Introduction](#Introduction)
1. [Preparation](#Preparation)
1. [Download and prepare the data](#Download-and-prepare-the-data)
1. [Setup hyperparameter tuning](#Setup-hyperparameter-tuning)
1. [Specify hyperparameter ranges](#Specify-hyperparameter-ranges)
1. [Random search](#Random-search)


---

## Introduction

This notebook showcases the **random search** strategy for hyperparameter tuning.

We will use the SageMaker Python SDK, a high-level SDK, to simplify the way we interact with SageMaker Hyperparameter Tuning.

---

## Preparation

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as SageMaker training.
- The IAM role used to give training access to your data. See the SageMaker documentation for how to create these.

In [48]:
import sagemaker
import boto3
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import os
from time import gmtime, strftime

region = boto3.Session().region_name
smclient = boto3.Session().client("sagemaker")

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# bucket = sagemaker.Session().default_bucket()
# prefix = "sagemaker/CS611-project"

bucket = 'sagemaker-mle-group7'   # Set a default S3 bucket
prefix = 'xgboost'

sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1", role=role, instance_type="ml.m5.xlarge", instance_count=1
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


---

## Download and prepare the data
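The original example downloads the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository. This notebook instead uses a stroke-prediction dataset (`healthcare-dataset-stroke-data.csv`), so the download cell below is left commented out.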

In [None]:
# !wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
# !unzip -o bank-additional.zip

Now let us load the data, apply some preprocessing, and upload the processed data to S3.

In [24]:
import pandas as pd

s3 = boto3.client("s3")

df = pd.read_csv(r"data/healthcare-dataset-stroke-data.csv")
df.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [56]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

output_bucket_path = 's3://sagemaker-mle-group7'
output_data_uri = f'{output_bucket_path}/output'

input_data_uri = f'{output_bucket_path}/healthcare-dataset-stroke-data.csv'

sklearn_processor.run(
    code='preprocessing.py',
    inputs=[
        # `source` is the S3 object; `destination` is the container path it is copied to
        ProcessingInput(
            input_name='input',
            source=input_data_uri,
            destination='/opt/ml/processing/input',
        )
    ],
    outputs=[
        # `source` is a container path; `destination` is where the results land in S3
        ProcessingOutput(
            output_name='output',
            source='/opt/ml/processing/output',
            destination=output_data_uri,
        )
    ],
)

In [35]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    # arguments=["arg1", "arg2"],  # Arguments can optionally be specified here
    inputs=[
        # The S3 URI needs the full "s3://" scheme; `destination` is a container path
        ProcessingInput(
            source="s3://sagemaker-mle-group7/healthcare-dataset-stroke-data.csv",
            destination="/opt/ml/processing/input",
        )
    ],
    # Output sources are container paths and must start with /opt/ml/processing
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output/train"),
        ProcessingOutput(source="/opt/ml/processing/output/validation"),
        ProcessingOutput(source="/opt/ml/processing/output/test"),
    ],
)
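
The `CreateProcessingJob` API rejects any `LocalPath` that is not an absolute path beginning with `/opt/ml/processing`. A minimal sketch of a local pre-flight check is shown here; `is_valid_local_path` is a hypothetical helper, not part of the SageMaker SDK, and it only approximates the rule (the service may apply further checks).

```python
from pathlib import PurePosixPath

# Root directory that processing containers use for inputs and outputs.
PROCESSING_ROOT = PurePosixPath("/opt/ml/processing")


def is_valid_local_path(path: str) -> bool:
    """Return True if `path` is absolute and located at or under /opt/ml/processing.

    Hypothetical helper for illustration only.
    """
    candidate = PurePosixPath(path)
    return candidate.is_absolute() and (
        candidate == PROCESSING_ROOT or PROCESSING_ROOT in candidate.parents
    )


examples = [
    "/opt/ml/processing/input",
    "/opt/ml/processing/output/train",
    "/s3://sagemaker-mle-group7//processing/input",
    "s3://sagemaker-mle-group7/opt/ml/processing/output/train",
]
for example in examples:
    print(example, "->", is_valid_local_path(example))
```

Running a check like this before `sklearn_processor.run(...)` catches S3 URIs passed where a container path is expected, without waiting for a round trip to the service.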

In [21]:
sklearn_processor.jobs[0].describe()

{'ProcessingInputs': [{'InputName': 'input-1',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-34-37-146/input/input-1/healthcare-dataset-stroke-data.csv',
    'LocalPath': '/opt/ml/processing/input',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}},
  {'InputName': 'code',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-34-37-146/input/code/preprocessing.py',
    'LocalPath': '/opt/ml/processing/input/code',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}}],
 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'output-1',
    'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-34-37-146/output/output-1',
     

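In [22]:
import boto3

s3_client = boto3.client("s3")
default_bucket = sagemaker.Session().default_bucket()

# List the processing-job objects once; keys come back in alphabetical order.
contents = s3_client.list_objects(
    Bucket=default_bucket, Prefix="sagemaker-scikit-learn"
)["Contents"]

# Print the last three keys. The loop variable is named `key` so that the
# S3 `prefix` defined earlier in the notebook is not overwritten.
for i in range(1, 4):
    key = contents[-i]["Key"]
    print("s3://" + default_bucket + "/" + key)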

s3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-44-05-795/output/output-3/test.csv
s3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-44-05-795/output/output-2/validation.csv
s3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-44-05-795/output/output-1/train.csv


In [None]:
train_path = 's3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-44-05-795/output/output-1/'
validation_path = 's3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-44-05-795/output/output-2/'
test_path = 's3://sagemaker-us-east-1-600187469140/sagemaker-scikit-learn-2023-06-03-05-44-05-795/output/output-3/'

In [None]:
# input for SageMaker

from sagemaker.inputs import TrainingInput

s3_input_train = TrainingInput(
    s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv"
)

s3_input_validation = TrainingInput(
    s3_data="s3://{}/{}/validation".format(bucket, prefix), content_type="csv"
)
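
The training inputs above are plain `s3://bucket/prefix/...` strings, and a missing slash in the scheme (`s3:/`) is an easy mistake to make. As a small sketch, the URIs could be built in one place; `s3_uri` is a hypothetical helper written for this notebook, not an SDK function, and the bucket and prefix values are the ones defined earlier.

```python
import posixpath


def s3_uri(bucket: str, *key_parts: str) -> str:
    """Join a bucket name and key segments into an ``s3://bucket/key`` URI.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    if not bucket or "/" in bucket:
        raise ValueError(f"invalid bucket name: {bucket!r}")
    # Strip stray slashes so joined segments never produce '//' inside the key.
    parts = [part.strip("/") for part in key_parts if part.strip("/")]
    key = posixpath.join(*parts) if parts else ""
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


train_uri = s3_uri("sagemaker-mle-group7", "xgboost", "train")
validation_uri = s3_uri("sagemaker-mle-group7", "xgboost", "validation")
print(train_uri)
print(validation_uri)
```

`posixpath.join` is used rather than `os.path.join` because S3 keys always use forward slashes, whatever the local operating system.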

In [None]:
# Load data
data = pd.read_csv("./data/healthcare-dataset-stroke-data.csv")
pd.set_option("display.max_columns", 500)  # Make sure we can see all of the columns
pd.set_option("display.max_rows", 50)  # Keep the output on one page

# `id` is a row identifier, not a feature, so it should not be included in the input
data = data.drop(["id"], axis=1)

# Convert categorical variables (gender, work_type, ...) to sets of 0/1 indicators
model_data = pd.get_dummies(data, dtype=int)

# split data: 70% train, 20% validation, 10% test
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)

# save preprocessed files: the target column `stroke` goes first and no header is written
# (missing `bmi` values are written as empty fields)
for name, split in [("train", train_data), ("validation", validation_data), ("test", test_data)]:
    pd.concat([split["stroke"], split.drop(["stroke"], axis=1)], axis=1).to_csv(
        f"{name}.csv", index=False, header=False
    )

# upload the training and validation files to S3
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/train.csv")
).upload_file("train.csv")
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation/validation.csv")
).upload_file("validation.csv")

---

## Setup hyperparameter tuning
In this example, we use the SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs that the tuning job will launch by instantiating an estimator, then define the static hyperparameters and the objective metric.

In [None]:
from sagemaker.image_uris import retrieve

sess = sagemaker.Session()

container = retrieve("xgboost", region, "latest")

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    base_job_name="xgboost-random-search",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)

xgb.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    num_round=10,
    rate_drop=0.3,
    tweedie_variance_power=1.4,
)
objective_metric_name = "validation:auc"

## Specify hyperparameter ranges

We list the hyperparameters we want to tune, together with the range to search for each.

In [None]:
hyperparameter_ranges = {
    "alpha": ContinuousParameter(0.01, 10),
    "lambda": ContinuousParameter(0.01, 10),
}
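
Both ranges span three orders of magnitude (0.01 to 10), so the scale on which values are drawn matters. The sketch below is plain Python and is not SageMaker's internal sampler; it only contrasts drawing uniformly on a linear scale with drawing uniformly on a logarithmic scale. In the SDK, the scale is selected through the `scaling_type` argument of `ContinuousParameter` (for example `"Logarithmic"`); the function names here are hypothetical.

```python
import math
import random


def sample_linear(low: float, high: float, rng: random.Random) -> float:
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log(low: float, high: float, rng: random.Random) -> float:
    """Draw uniformly in log space, then map back to the original scale."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
n = 10_000
linear_draws = [sample_linear(0.01, 10, rng) for _ in range(n)]
log_draws = [sample_log(0.01, 10, rng) for _ in range(n)]

# Share of draws that fall below 1.0 under each scale.
share_linear_below_1 = sum(x < 1 for x in linear_draws) / n
share_log_below_1 = sum(x < 1 for x in log_draws) / n
print(f"linear: {share_linear_below_1:.2f}, log: {share_log_below_1:.2f}")
```

On a linear scale only about 10% of draws land below 1.0, whereas on a logarithmic scale about two thirds do, because two of the three decades in the range lie below 1.0. Small regularization values are therefore explored far more often with logarithmic scaling.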

## Random search

We now start a tuning job using random search. The main advantage of random search is that each hyperparameter configuration is sampled independently of earlier results, so the training jobs can run with a high level of parallelism.

In [None]:
tuner = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=5,
    max_parallel_jobs=5,
    strategy="Random",
)

tuner.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    include_cls_metadata=False,
    job_name="xgb-randsearch-" + strftime("%Y%m%d-%H-%M-%S", gmtime()),
)
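
Setting `max_parallel_jobs` equal to `max_jobs` works for random search because no draw depends on the outcome of another. The following is a toy sketch of that idea in plain Python, not a model of how SageMaker schedules jobs: `toy_objective` is a made-up stand-in for a training job, and the thread pool stands in for parallel training instances.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def toy_objective(params: dict) -> float:
    """Made-up stand-in for a training job's validation score (higher is better)."""
    return -(math.log10(params["alpha"]) ** 2 + (math.log10(params["lambda"]) + 1) ** 2)


def draw(rng: random.Random) -> dict:
    """Sample one configuration from the same ranges as `hyperparameter_ranges`."""
    return {"alpha": rng.uniform(0.01, 10), "lambda": rng.uniform(0.01, 10)}


rng = random.Random(42)
max_jobs = 5

# Every configuration is drawn before any result is known...
candidates = [draw(rng) for _ in range(max_jobs)]

# ...so all of them can be evaluated at the same time.
with ThreadPoolExecutor(max_workers=max_jobs) as pool:
    scores = list(pool.map(toy_objective, candidates))

best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
print(best_score, best_params)
```

A strategy that uses earlier results to choose the next configuration cannot draw everything up front in this way, which is why its useful degree of parallelism is lower.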

Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully.

In [None]:
boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]
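
The call above returns the status once. To block until the tuning job finishes, the same status lookup can be wrapped in a polling loop. This is a sketch under stated assumptions: `wait_for_terminal_status` is a hypothetical helper, the terminal statuses are assumed to be `Completed`, `Failed`, and `Stopped`, and the status source is passed in as a function so the example runs without AWS access.

```python
import time
from typing import Callable

# Statuses assumed to mean the tuning job will not change state any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 240,
) -> str:
    """Poll `get_status` until it returns a terminal status, then return it.

    Hypothetical helper for illustration; raises TimeoutError after `max_polls` polls.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Stand-in for the boto3 call: replays a fixed sequence of statuses.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In the notebook, `get_status` would be a small function that wraps the `describe_hyper_parameter_tuning_job` call shown above and returns its `HyperParameterTuningJobStatus` field.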

## Analyze tuning job results - after tuning job is completed


In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# check jobs have finished
status_tuner = boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]

assert status_tuner == "Completed", "Tuning job must be completed, was {}".format(status_tuner)

df_tuner = sagemaker.HyperparameterTuningJobAnalytics(
    tuner.latest_tuning_job.job_name
).dataframe()
df_tuner

## Deploy the best model

In [None]:
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

## Delete the endpoint

In [None]:
sess.delete_endpoint(endpoint_name=predictor.endpoint_name)