# Direct Marketing with Amazon SageMaker XGBoost and Hyperparameter Tuning
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---


Kernel `Python 3 (Data Science)` works well with this notebook.

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data Downloading](#Data_Downloading)
1. [Data Transformation](#Data_Transformation)
1. [Setup Hyperparameter Tuning](#Setup_Hyperparameter_Tuning)
1. [Launch Hyperparameter Tuning](#Launch_Hyperparameter_Tuning)
1. [Analyze Hyperparameter Tuning Results](#Analyze_Hyperparameter_Tuning_Results)
1. [Deploy The Best Model](#Deploy_The_Best_Model)
1. [Evaluation](#Evaluation)


---

## Background
Direct marketing, whether through mail, email, or phone, is a common tactic to acquire customers. Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer. Predicting those prospects from readily available information such as demographics, past interactions, and environmental factors is a common machine learning problem.

This notebook will train a model which can be used to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls. Hyperparameter tuning will be used in order to try multiple hyperparameter settings and produce the best model.

We will use the SageMaker Python SDK, a high-level SDK, to simplify how we interact with SageMaker Hyperparameter Tuning.

---

## Preparation

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as SageMaker training.
- The IAM role used to give training access to your data. See SageMaker documentation for how to create these.

In [29]:
import sagemaker
import boto3
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker.inputs import TrainingInput

import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import os
import awswrangler as wr

region = boto3.Session().region_name
smclient = boto3.Session().client("sagemaker")

role = sagemaker.get_execution_role()

boto3.setup_default_session(region_name=region)

s3_client = boto3.client("s3", region_name=region)


In [14]:
%store -r
%store

Stored variables and their in-db values:
bucket                             -> 'sagemaker-us-east-1-875692608981'
claims_fg_name                     -> 'fraud-detect-demo-claims'
claims_preprocessed                ->       policy_id  incident_severity  num_vehicles_i
claims_table                       -> 'fraud-detect-demo-claims-1636518800'
clarify_expl_job_name              -> 'Clarify-Explainability-2021-11-10-14-35-21-747'
col_order                          -> ['fraud', 'num_injuries', 'incident_severity', 'in
customers_fg_name                  -> 'fraud-detect-demo-customers'
customers_preprocessed             ->       policy_id  customer_age  customer_education 
customers_table                    -> 'fraud-detect-demo-customers-1636518803'
database_name                      -> 'sagemaker_featurestore'
endpoint_config_name               -> 'fraud-detect-demo-xgboost-post-smote-endpoint-con
endpoint_name                      -> 'fraud-detect-demo-xgboost-post-smote-endpoint'
hyperp

In [22]:
# ======> This is your DataFlow output path if you decide to redo the work in DataFlow on your own
dataset = wr.s3.read_csv(path=f"s3://{bucket}/{prefix}/data/dataset/", dataset=True)
dataset

Unnamed: 0.1,Unnamed: 0,policy_id,authorities_contacted_none,policy_liability,customer_gender_female,fraud,incident_severity,total_claim_amount,driver_relationship_child,policy_state_az,...,policy_state_wa,collision_type_rear,incident_dow,policy_state_id,auto_year,driver_relationship_na,customer_age,authorities_contacted_fire,driver_relationship_other,collision_type_na
0,0,3643,0,1,1,1,1,7500.0,0,0,...,1,1,2,0,2009,0,24,0,0,0
1,1,3826,0,1,0,0,1,51000.0,0,1,...,0,0,6,0,2019,0,25,0,0,0
2,2,2188,0,1,0,0,2,46500.0,0,0,...,0,1,6,0,2013,0,24,0,0,0
3,3,605,0,0,1,0,1,19000.0,0,1,...,0,0,3,0,2019,0,18,0,0,0
4,4,4325,0,1,1,0,1,19000.0,0,0,...,1,0,2,0,2020,0,42,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4995,1096,1,0,0,0,0,9000.0,0,0,...,0,0,4,0,2013,0,61,0,0,0
4996,4996,4587,0,1,0,0,1,11000.0,0,0,...,0,0,1,0,2018,0,46,0,0,0
4997,4997,4655,0,2,0,0,1,15000.0,0,0,...,0,0,0,0,2017,0,53,0,0,0
4998,4998,2926,1,0,0,0,0,11000.0,0,0,...,0,0,2,0,2016,0,18,0,0,0


In [23]:
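os.makedirs("data", exist_ok=True)  # make sure the local output directory exists
dataset.to_csv("./data/claims_customer.csv", index=False)  # skip the index so no extra "Unnamed" column is written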

In [27]:
# shuffle the rows, then split them 70% / 20% / 10% into train, validation, and test sets
train_data, validation_data, test_data = np.split(
    dataset.sample(frac=1, random_state=1729),
    [int(0.7 * len(dataset)), int(0.9 * len(dataset))],
)

# save the splits locally with the label ("fraud") as the first column and no header
pd.concat([train_data["fraud"], train_data.drop(["fraud", "policy_id"], axis=1)], axis=1).to_csv(
    "data/hpo_train.csv", index=False, header=False
)
pd.concat(
    [validation_data["fraud"], validation_data.drop(["fraud", "policy_id"], axis=1)], axis=1
).to_csv("data/hpo_validation.csv", index=False, header=False)

pd.concat([test_data["fraud"], test_data.drop(["fraud", "policy_id"], axis=1)], axis=1).to_csv(
    "data/hpo_test.csv", index=False, header=False
)
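
The two cut points passed to `np.split` partition the shuffled rows at the 70% and 90% marks, which gives a 70/20/10 split. A minimal standard-library sketch of the same index arithmetic follows. It assumes a hypothetical row count of 5000, uses integer percentages instead of float fractions, and `split_indices` is a made-up helper, not part of this notebook's pipeline.

```python
import random


def split_indices(n_rows, train_pct=70, validation_pct=20, seed=1729):
    """Shuffle row indices and cut them at the train and train+validation marks."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    first_cut = n_rows * train_pct // 100
    second_cut = n_rows * (train_pct + validation_pct) // 100
    return indices[:first_cut], indices[first_cut:second_cut], indices[second_cut:]


# Hypothetical row count, chosen only for illustration.
train_idx, validation_idx, test_idx = split_indices(5000)
print(len(train_idx), len(validation_idx), len(test_idx))
```

Because the three slices come from one shuffled list, every row lands in exactly one split.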


### Write train, validation, and test data to S3

In [33]:
s3_client.upload_file(Filename="data/hpo_train.csv", Bucket=bucket, Key=f"{prefix}/hpo/train/train.csv")
s3_client.upload_file(Filename="data/hpo_validation.csv", Bucket=bucket, Key=f"{prefix}/hpo/validation/validation.csv")
s3_client.upload_file(Filename="data/hpo_test.csv", Bucket=bucket, Key=f"{prefix}/hpo/test/test.csv")
s3_client.upload_file(Filename="data/claims_customer.csv", Bucket=bucket, Key=f"{prefix}/hpo/dataset/dataset.csv")

---

## Setup_Hyperparameter_Tuning 
*Note, with the default setting below, the hyperparameter tuning job can take about 30 minutes to complete.*


Now that we have prepared the dataset, we are ready to train models. Before we do, note that algorithms have settings called "hyperparameters" that can dramatically affect the performance of the trained models. For example, the XGBoost algorithm has dozens of hyperparameters, and we need to pick the right values for them to achieve the desired training results. Because the best hyperparameter setting also depends on the dataset, it is almost impossible to pick it without searching, and a good search algorithm can find it in an automated and effective way.

We will use SageMaker hyperparameter tuning to automate the search. Specifically, we specify a range, or a list of possible values in the case of categorical hyperparameters, for each hyperparameter that we plan to tune. SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate the results of those training jobs against a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. We give each hyperparameter tuning job a budget (a maximum number of training jobs), and the job completes once that many training jobs have been executed.
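
The budgeted search loop can be sketched in plain Python. This is only an illustration under stated assumptions: it samples configurations uniformly at random rather than choosing them from previous results as the tuning service does, and `toy_objective` is a made-up stand-in for a training job's validation score. The names `SEARCH_SPACE`, `sample_config`, `toy_objective`, and `run_search` are hypothetical.

```python
import random

# Hypothetical search space: (low, high) bounds per hyperparameter.
SEARCH_SPACE = {"eta": (0.0, 1.0), "alpha": (0.0, 2.0)}


def sample_config(rng, space):
    """Draw one value uniformly from each range."""
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}


def toy_objective(config):
    """Made-up score that peaks at eta=0.3, alpha=1.0 (stands in for validation AUC)."""
    return 1.0 - (config["eta"] - 0.3) ** 2 - (config["alpha"] - 1.0) ** 2


def run_search(objective, space, max_jobs, seed=0):
    """Evaluate max_jobs sampled configurations and return (history, best)."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_jobs):
        config = sample_config(rng, space)
        history.append((objective(config), config))
    best = max(history, key=lambda item: item[0])
    return history, best


history, (best_score, best_config) = run_search(toy_objective, SEARCH_SPACE, max_jobs=9)
print(f"jobs run: {len(history)}, best score: {best_score:.3f}")
print(best_config)
```

The loop stops after exactly `max_jobs` evaluations, which mirrors how the budget bounds a tuning job regardless of how good the best result is.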

In this example, we use the SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs that the hyperparameter tuning job will launch by instantiating an estimator, which includes:
* The container image for the algorithm (XGBoost)
* Configuration for the output of the training jobs
* The values of static algorithm hyperparameters (those that are not specified are given default values)
* The type and number of instances to use for the training jobs

In [34]:
sess = sagemaker.Session()

container = sagemaker.image_uris.retrieve("xgboost", region, "latest")
job_name = 'hp-tuning-fraud-detect'

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    base_job_name=job_name,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/hp-tuner/output".format(bucket, prefix),
    sagemaker_session=sess,
)

xgb.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    num_round=100,
    rate_drop=0.3,
    tweedie_variance_power=1.4,
)

We will tune four hyperparameters in this example:
* *eta*: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative. 
* *alpha*: L1 regularization term on weights. Increasing this value makes models more conservative. 
* *min_child_weight*: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. 
* *max_depth*: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 
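
To make the *min_child_weight* rule concrete, the sketch below computes the hessian sum for a candidate leaf under two losses. For squared error each instance contributes a hessian of 1, so the sum equals the instance count; for logistic loss each instance contributes p(1 - p), where p is the predicted probability. The helper names and sample probabilities are hypothetical, and this illustrates the rule only, not XGBoost's internal implementation.

```python
def hessian_sum_squared_error(n_instances):
    """Squared error has a constant hessian of 1 per instance."""
    return float(n_instances)


def hessian_sum_logistic(probabilities):
    """Logistic loss has a hessian of p * (1 - p) per instance."""
    return sum(p * (1.0 - p) for p in probabilities)


def split_allowed(child_hessian_sum, min_child_weight):
    """A child is kept only if its hessian sum reaches min_child_weight."""
    return child_hessian_sum >= min_child_weight


# Hypothetical leaves with five instances each.
uncertain_leaf = [0.5] * 5   # predictions near the decision boundary
confident_leaf = [0.99] * 5  # predictions already close to 1

for name, leaf in [("uncertain", uncertain_leaf), ("confident", confident_leaf)]:
    total = hessian_sum_logistic(leaf)
    print(name, round(total, 4), split_allowed(total, min_child_weight=1.0))
```

With logistic loss, a leaf of already-confident predictions carries little hessian weight, so the same *min_child_weight* blocks it sooner than a leaf of uncertain predictions with the same number of instances.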

In [35]:
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
}

Next we'll specify the objective metric that we'd like to optimize and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using the built-in XGBoost algorithm here, it emits two predefined metrics, *validation:auc* and *train:auc*, and we have elected to monitor *validation:auc*, as you can see below. In this case, we only need to specify the metric name and do not need to provide a regex. If you bring your own algorithm, it emits metrics on its own. In that case, you'll need to add a MetricDefinition object here to define the format of those metrics through regex, so that SageMaker knows how to extract those metrics from your CloudWatch logs.
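
For the bring-your-own-algorithm case, the regex extraction can be tried locally before launching any job. The sketch below is an assumption-laden illustration: the log line format, the pattern, and the helper `extract_metric` are all hypothetical, and only the `Name`/`Regex` dictionary shape follows the way metric definitions are commonly passed to an estimator.

```python
import re

# Hypothetical metric definition in the Name/Regex shape; the pattern is
# illustrative and not taken from any real algorithm container.
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation-auc=([0-9]*\.?[0-9]+)"},
]


def extract_metric(log_lines, definition):
    """Return the last value the regex captures from the log lines, or None."""
    pattern = re.compile(definition["Regex"])
    value = None
    for line in log_lines:
        match = pattern.search(line)
        if match:
            value = float(match.group(1))
    return value


# Made-up log lines in an assumed format.
sample_log = [
    "epoch=1 train-auc=0.7912 validation-auc=0.7605",
    "epoch=2 train-auc=0.8420 validation-auc=0.8011",
]
print(extract_metric(sample_log, metric_definitions[0]))
```

Checking the pattern against a few real log lines from your own container is a cheap way to catch a regex that silently captures nothing.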

In [36]:
objective_metric_name = "validation:auc"

Now, we'll create a `HyperparameterTuner` object, to which we pass:
- The XGBoost estimator we created above
- Our hyperparameter ranges
- Objective metric name and definition
- Tuning resource configurations, such as the total number of training jobs to run and how many training jobs can run in parallel

In [37]:
tuner = HyperparameterTuner(
    xgb, objective_metric_name, hyperparameter_ranges, max_jobs=9, max_parallel_jobs=3
)

## Launch_Hyperparameter_Tuning
Now we can launch a hyperparameter tuning job by calling the *fit()* function. After the hyperparameter tuning job is created, we can go to the SageMaker console to track its progress until it completes.

In [38]:
from sagemaker.inputs import TrainingInput  # same import path used in the Preparation cell

train_data_uri = f"s3://{bucket}/{prefix}/hpo/train/train.csv"
validation_data_uri = f"s3://{bucket}/{prefix}/hpo/validation/validation.csv"

train_input = TrainingInput(train_data_uri, content_type="csv")
validation_input = TrainingInput(validation_data_uri, content_type="csv")


In [39]:
tuner.fit({"train": train_input, "validation": validation_input}, include_cls_metadata=False)

.....................................................................................................................................................................!


Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully.

In [40]:
boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]

'Completed'

## Analyze tuning job results - after tuning job is completed
Please refer to "HPO_Analyze_TuningJob_Results.ipynb" to see example code to analyze the tuning job results.
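
As a lightweight alternative, the training job summaries of a tuning job can be ranked by their final objective value. The sketch below assumes summaries in the shape returned by the SageMaker API call `list_training_jobs_for_hyper_parameter_tuning_job` (keys `TrainingJobName`, `TunedHyperParameters`, and `FinalHyperParameterTuningJobObjectiveMetric`). The sample records are made up for illustration, and `rank_training_jobs` is a hypothetical helper.

```python
METRIC_KEY = "FinalHyperParameterTuningJobObjectiveMetric"


def rank_training_jobs(summaries, maximize=True):
    """Sort summaries by final objective value, skipping jobs that reported none."""
    scored = [s for s in summaries if METRIC_KEY in s]
    return sorted(scored, key=lambda s: s[METRIC_KEY]["Value"], reverse=maximize)


# Made-up summaries; real ones would come from the API response.
sample_summaries = [
    {
        "TrainingJobName": "job-a",
        "TunedHyperParameters": {"eta": "0.12", "max_depth": "4"},
        METRIC_KEY: {"MetricName": "validation:auc", "Value": 0.81},
    },
    {
        "TrainingJobName": "job-b",
        "TunedHyperParameters": {"eta": "0.35", "max_depth": "6"},
        METRIC_KEY: {"MetricName": "validation:auc", "Value": 0.84},
    },
    # A job that failed before reporting a metric has no objective entry.
    {"TrainingJobName": "job-c", "TunedHyperParameters": {"eta": "0.90", "max_depth": "9"}},
]

ranked = rank_training_jobs(sample_summaries)
for job in ranked:
    print(job["TrainingJobName"], job[METRIC_KEY]["Value"], job["TunedHyperParameters"])
```

With a live job, the summaries would be read from the `TrainingJobSummaries` field of the API response; results may span several pages, which this sketch does not handle.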

## Deploy the best model
Now that we have got the best model, we can deploy it to an endpoint. Please refer to other SageMaker sample notebooks or SageMaker documentation to see how to deploy a model.