# Amazon SageMaker XGBoost and Hyperparameter Tuning for Automobile Insurance Fraud Detection
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---


Kernel `Python 3 (Data Science)` works well with this notebook.

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data Downloading](#Data_Downloading)
1. [Data Transformation](#Data_Transformation)
1. [Setup Hyperparameter Tuning](#Setup_Hyperparameter_Tuning)
1. [Launch Hyperparameter Tuning](#Launch_Hyperparameter_Tuning)
1. [Analyze Hyperparameter Tuning Results](#Analyze_Hyperparameter_Tuning_Results)
1. [Deploy The Best Model](#Deploy_The_Best_Model)
1. [Evaluation](#Evaluation)


---

## Background
Only a small fraction of automobile insurance claims are fraudulent, so the goal is to flag the subset of claims that are most likely to be fraudulent. Predicting those claims from readily available information, such as customer demographics, policy details, and the circumstances of the incident, is a common machine learning problem: binary classification with unbalanced classes.

This notebook trains a model that predicts whether an automobile insurance claim is fraudulent. Hyperparameter tuning is used to try multiple hyperparameter settings and keep the best-performing model.

We will use the SageMaker Python SDK, a high-level SDK, to simplify the way we interact with SageMaker Hyperparameter Tuning.

---

## Preparation

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be in the same region as SageMaker training.
- The IAM role used to give training access to your data. See the SageMaker documentation for how to create one.

In [1]:
import sagemaker
import boto3
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.inputs import TrainingInput

import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import os
import awswrangler as wr

region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)

sagemaker_client = boto3.Session().client("sagemaker")
sagemaker_role = sagemaker.get_execution_role()

s3_client = boto3.client("s3", region_name=region)
sagemaker_session = sagemaker.session.Session(boto_session=boto_session, sagemaker_client=sagemaker_client)

In [2]:
%store -r
%store

Stored variables and their in-db values:
bucket                              -> 'sagemaker-us-east-1-875692608981'
claims_fg_name                      -> 'fraud-detect-demo-claims'
claims_preprocessed                 ->       policy_id  incident_severity  num_vehicles_i
claims_table                        -> 'fraud-detect-demo-claims-1637200704'
clarify_bias_job_1_name             -> 'Clarify-Bias-2021-11-18-02-16-43-192'
clarify_bias_job_2_name             -> 'Clarify-Bias-2021-11-18-02-37-10-321'
clarify_expl_job_name               -> 'Clarify-Explainability-2021-11-18-02-49-06-695'
col_order                           -> ['fraud', 'customer_gender_male', 'customer_educat
context_name                        -> 'fraud-detect-1637201755'
customers_fg_name                   -> 'fraud-detect-demo-customers'
customers_preprocessed              ->       policy_id  customer_age  customer_education 
customers_table                     -> 'fraud-detect-demo-customers-1637200706'
database_name 

In [3]:
# Create a pandas DataFrame from the combined dataset
dataset = wr.s3.read_csv(path=f"s3://{bucket}/{prefix}/data/dataset/", dataset=True)
dataset

Unnamed: 0.1,Unnamed: 0,policy_id,customer_gender_male,customer_education,incident_type_collision,fraud,num_injuries,policy_liability,customer_gender_female,incident_severity,...,police_report_available,incident_dow,num_vehicles_involved,num_insurers_past_5_years,collision_type_side,driver_relationship_self,policy_state_wa,policy_deductable,driver_relationship_other,authorities_contacted_police
0,0,3397,1,1,0,0,0,2,0,0,...,0,6,1,1,0,0,0,750,0,1
1,1,4142,0,4,1,0,4,1,1,1,...,1,4,4,1,1,1,0,750,0,1
2,2,2389,1,4,1,0,0,2,0,0,...,0,3,1,1,0,1,0,750,0,0
3,3,4164,1,4,1,1,0,0,0,1,...,1,1,2,1,0,1,0,750,0,1
4,4,2702,0,3,0,0,0,0,1,0,...,0,4,1,2,0,0,0,750,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4995,2585,1,1,1,0,0,1,0,0,...,0,6,2,1,1,1,0,750,0,0
4996,4996,4622,1,4,1,0,0,0,0,0,...,0,5,2,1,0,1,0,750,0,0
4997,4997,4630,1,3,1,0,0,2,0,1,...,1,3,2,1,0,1,0,750,0,1
4998,4998,2852,1,1,1,0,0,1,0,0,...,1,2,3,1,0,1,0,750,0,1


In [4]:
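os.makedirs("data", exist_ok=True)  # make sure the local output directory exists
# index=False avoids writing the row index as yet another "Unnamed" column
dataset.to_csv("./data/claims_customer.csv", index=False)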

In [5]:
# Shuffle the rows, then split them 70% / 20% / 10% into train, validation, and test sets
train_data, validation_data, test_data = np.split(
    dataset.sample(frac=1, random_state=1729),
    [int(0.7 * len(dataset)), int(0.9 * len(dataset))],
)

# Save each split locally as a headerless CSV with the label ("fraud") in the first column,
# the layout the built-in XGBoost algorithm expects for CSV training data
pd.concat([train_data["fraud"], train_data.drop(["fraud", "policy_id"], axis=1)], axis=1).to_csv(
    "data/hpo_train.csv", index=False, header=False
)
pd.concat([validation_data["fraud"], validation_data.drop(["fraud", "policy_id"], axis=1)], axis=1).to_csv(
    "data/hpo_validation.csv", index=False, header=False
)
pd.concat([test_data["fraud"], test_data.drop(["fraud", "policy_id"], axis=1)], axis=1).to_csv(
    "data/hpo_test.csv", index=False, header=False
)
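
The split above cuts the shuffled rows at the 70% and 90% marks. As a minimal standard-library sketch of the same idea (the helper name `split_indices` and the row count of 5000 are assumptions for illustration, and integer percentages are used instead of the notebook's float fractions), the resulting set sizes can be checked like this:

```python
import random


def split_indices(n_rows, train_pct=70, validation_pct=20, seed=1729):
    """Shuffle row indices and cut them into train / validation / test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic keeps the cut points exact (no float rounding)
    train_end = n_rows * train_pct // 100
    validation_end = n_rows * (train_pct + validation_pct) // 100
    return indices[:train_end], indices[train_end:validation_end], indices[validation_end:]


# 5000 is an assumed row count, used only for illustration
train_idx, validation_idx, test_idx = split_indices(5000)
print(len(train_idx), len(validation_idx), len(test_idx))
```

Every row lands in exactly one of the three sets, so the test set stays unseen during tuning.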


### Write the train, validation, and test data to S3

In [6]:
s3_client.upload_file(Filename="data/hpo_train.csv", Bucket=bucket, Key=f"{prefix}/hpo/train/train.csv")
s3_client.upload_file(Filename="data/hpo_validation.csv", Bucket=bucket, Key=f"{prefix}/hpo/validation/validation.csv")
s3_client.upload_file(Filename="data/hpo_test.csv", Bucket=bucket, Key=f"{prefix}/hpo/test/test.csv")
s3_client.upload_file(Filename="data/claims_customer.csv", Bucket=bucket, Key=f"{prefix}/hpo/dataset/dataset.csv")

---

## Setup_Hyperparameter_Tuning 
*Note: with the default settings below, the hyperparameter tuning job takes about 30 minutes to complete.*


Now that we have prepared the dataset, we are ready to train models. First, note that algorithms have settings called "hyperparameters" that can dramatically affect the performance of the trained models. For example, the XGBoost algorithm has dozens of hyperparameters, and we need to pick the right values for them to achieve the desired training results. Because the best hyperparameter setting also depends on the dataset, it is almost impossible to pick it without searching, and a good search algorithm can find it in an automated and effective way.

We will use SageMaker hyperparameter tuning to automate the search. Specifically, we specify a range (or, for categorical hyperparameters, a list of possible values) for each hyperparameter that we plan to tune. SageMaker hyperparameter tuning then automatically launches multiple training jobs with different hyperparameter settings, evaluates the results of those training jobs against a predefined "objective metric", and selects the hyperparameter settings for future attempts based on previous results. We give each hyperparameter tuning job a budget (a maximum number of training jobs), and the job completes once that many training jobs have run.
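
To make the launch, evaluate, and select loop concrete, here is a toy sketch that runs locally. It is not SageMaker's search strategy: it uses plain random search, and the search space, the `toy_objective` function, and all helper names are assumptions made up for illustration. The objective stands in for "train a model and report the validation metric".

```python
import random

# Hypothetical search space: (low, high, kind) for each hyperparameter
SEARCH_SPACE = {
    "eta": (0.0, 1.0, "continuous"),
    "max_depth": (1, 10, "integer"),
}


def sample_setting(rng):
    """Draw one hyperparameter setting uniformly from the search space."""
    setting = {}
    for name, (low, high, kind) in SEARCH_SPACE.items():
        if kind == "integer":
            setting[name] = rng.randint(low, high)
        else:
            setting[name] = rng.uniform(low, high)
    return setting


def toy_objective(setting):
    """Made-up score that peaks at eta=0.3 and max_depth=4 (higher is better)."""
    return 1.0 - (setting["eta"] - 0.3) ** 2 - 0.01 * abs(setting["max_depth"] - 4)


def random_search(max_jobs, seed=0):
    """Evaluate max_jobs random settings and return the best one with the history."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_jobs):
        setting = sample_setting(rng)
        history.append((toy_objective(setting), setting))
    best_score, best_setting = max(history, key=lambda item: item[0])
    return best_score, best_setting, history


best_score, best_setting, history = random_search(max_jobs=20)
print(round(best_score, 4), best_setting)
```

A strategy that uses earlier results to pick later settings replaces only the `sample_setting` step; the budget and the "keep the best" bookkeeping stay the same.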

In this example, we use the SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs that the tuning job will launch by instantiating an estimator, which includes:
* The container image for the algorithm (XGBoost)
* Configuration for the output of the training jobs
* The values of static algorithm hyperparameters (those that are not specified are given default values)
* The type and number of instances to use for the training jobs

In [7]:
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")

xgb = sagemaker.estimator.Estimator(
    container,
    sagemaker_role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/tuning_jobs".format(bucket, prefix),
    sagemaker_session=sagemaker_session,
)

xgb.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    rate_drop=0.3,
)

We will tune four hyperparameters in this example:
* *eta*: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative.
* *alpha*: L1 regularization term on weights. Increasing this value makes models more conservative.
* *num_round*: The number of rounds to run the training.
* *max_depth*: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
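
As a rough illustration of why larger *alpha* and smaller *eta* make updates more conservative, the sketch below computes a single leaf update in the style of gradient boosted trees with L1 and L2 regularization. It is a simplified model rather than the library's implementation: the function names are hypothetical, and `reg_lambda=1.0` is an assumed L2 term.

```python
def soft_threshold(gradient_sum, alpha):
    """Shrink the gradient sum toward zero by alpha (the effect of L1 regularization)."""
    if gradient_sum > alpha:
        return gradient_sum - alpha
    if gradient_sum < -alpha:
        return gradient_sum + alpha
    return 0.0


def leaf_update(gradient_sum, hessian_sum, eta, alpha, reg_lambda=1.0):
    """Leaf weight after L1/L2 regularization, scaled by the step size eta."""
    raw_weight = -soft_threshold(gradient_sum, alpha) / (hessian_sum + reg_lambda)
    return eta * raw_weight


no_shrinkage = leaf_update(-4.0, 3.0, eta=1.0, alpha=0.0)
conservative = leaf_update(-4.0, 3.0, eta=0.5, alpha=1.0)
suppressed = leaf_update(-0.5, 3.0, eta=0.5, alpha=1.0)
print(no_shrinkage, conservative, suppressed)
```

With these inputs the unregularized update is 1.0, the update with `eta=0.5` and `alpha=1.0` shrinks to 0.375, and a leaf whose gradient sum is smaller in magnitude than `alpha` contributes nothing.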

In [8]:
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "max_depth": IntegerParameter(1, 10),
    "num_round": IntegerParameter(1, 150),
    "alpha": ContinuousParameter(0, 2),
}

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. Because we are using the built-in XGBoost algorithm, which emits two predefined metrics, *validation:auc* and *train:auc*, we only need to specify the metric name and do not need to provide a regex. We chose to monitor *validation:auc*, as you can see below. If you bring your own algorithm, it emits its own metrics, so you'll need to add a MetricDefinition object here that defines the format of those metrics through regex, so that SageMaker knows how to extract them from your CloudWatch logs.

In [9]:
objective_metric_name = "validation:auc"
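
For the bring-your-own-algorithm case mentioned above, a metric definition pairs a metric name with a regex whose first capture group is the metric value. The sketch below shows that shape and how such a regex would pull a value out of a log line. The log line format and the `extract_metric` helper are assumptions for illustration; a custom container decides its own log format.

```python
import re

# Metric definitions use the {"Name": ..., "Regex": ...} shape.
# This particular regex is a hypothetical example for a custom container.
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation-auc:([0-9\.]+)"},
]

# Hypothetical log line written by a custom training script
log_line = "[12] train-auc:0.912300 validation-auc:0.849125"


def extract_metric(definition, line):
    """Return the first captured group as a float, or None if the line has no match."""
    match = re.search(definition["Regex"], line)
    return float(match.group(1)) if match else None


value = extract_metric(metric_definitions[0], log_line)
print(value)
```

Testing the regex against a few real log lines before launching a tuning job is a cheap way to avoid a job whose objective metric is never found.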

Now, we'll create a `HyperparameterTuner` object, to which we pass:
- The XGBoost estimator we created above
- Our hyperparameter ranges
- The objective metric name and definition
- Tuning resource configurations, such as the total number of training jobs to run and how many training jobs can run in parallel

In [10]:
tuner = HyperparameterTuner(
    xgb, objective_metric_name, hyperparameter_ranges, max_jobs=20, max_parallel_jobs=3
)
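
The two resource settings interact: with 20 jobs and at most 3 running at once, the jobs need at least 7 sequential rounds. The back-of-the-envelope sketch below turns that into a rough duration estimate. The 4 minutes per job is an assumption (it is meant to include instance start-up), not a measured value.

```python
import math

max_jobs = 20
max_parallel_jobs = 3
assumed_minutes_per_job = 4  # assumption: training time plus instance start-up

# Lower bound on the number of sequential rounds needed to run every job
sequential_rounds = math.ceil(max_jobs / max_parallel_jobs)
estimated_minutes = sequential_rounds * assumed_minutes_per_job
print(sequential_rounds, estimated_minutes)
```

Raising `max_parallel_jobs` shortens the wall-clock time, but a search strategy that learns from earlier results then has fewer completed jobs to learn from before it picks each new setting.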

## Launch_Hyperparameter_Tuning
Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the job is created, we can track its progress in the SageMaker console until it completes.

In [11]:
from sagemaker.inputs import TrainingInput

train_data_uri = f"s3://{bucket}/{prefix}/hpo/train/train.csv"
validation_data_uri = f"s3://{bucket}/{prefix}/hpo/validation/validation.csv"

train_input = TrainingInput(train_data_uri, content_type="csv")
validation_input = TrainingInput(validation_data_uri, content_type="csv")


In [12]:
import time

tuning_job = f"tuning-{time.strftime('%d-%H-%M-%S', time.gmtime())}"

tuner.fit({"train": train_input, "validation": validation_input}, include_cls_metadata=False, job_name=tuning_job)

..................................................................................................................................................................................................................................................................................................................................................................................!


Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully.

In [13]:
sagemaker_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]

'Completed'
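
If you need to wait for a tuning job from a script, the single status check above can be wrapped in a polling loop. The sketch below is one possible shape: `wait_for_tuning_job` and `fake_describe` are hypothetical names, and the fake function stands in for the real API call so the example runs offline.

```python
import time

# Statuses after which the tuning job will not change any more
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe() until the tuning job reaches a terminal status, then return it."""
    while True:
        response = describe(HyperParameterTuningJobName=job_name)
        status = response["HyperParameterTuningJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Offline stand-in for the real API: reports InProgress twice, then Completed
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(HyperParameterTuningJobName):
    return {"HyperParameterTuningJobStatus": next(_fake_statuses)}


final_status = wait_for_tuning_job(fake_describe, "example-job", sleep=lambda seconds: None)
print(final_status)
```

Against a real job, you would pass `sagemaker_client.describe_hyper_parameter_tuning_job` as `describe` and keep the default `sleep`.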

## Analyze Results of a Hyperparameter Tuning Job

Once you have completed a tuning job (or even while the job is still running), you can use this notebook to analyze the results and understand how each hyperparameter affects the quality of the model.

In [14]:
tuning_job_name=tuning_job
tuning_job_name

'tuning-18-04-19-20'

## Track hyperparameter tuning job progress
After you launch a tuning job, you can check its progress by calling the `describe_hyper_parameter_tuning_job` API. Its response is a JSON object that contains information about the current state of the tuning job. You can call `list_training_jobs_for_hyper_parameter_tuning_job` to see a detailed list of the training jobs that the tuning job launched.

In [15]:
# run this cell to check current status of hyperparameter tuning job
tuning_job_result = sagemaker_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

objective = tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]
is_minimize = objective["Type"] != "Maximize"
objective_name = objective["MetricName"]

20 training jobs have completed


In [16]:
from pprint import pprint

if tuning_job_result.get("BestTrainingJob", None):
    print("Best model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")

Best model found so far:
{'CreationTime': datetime.datetime(2021, 11, 18, 4, 24, 18, tzinfo=tzlocal()),
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:auc',
                                                 'Value': 0.8491250276565552},
 'ObjectiveStatus': 'Succeeded',
 'TrainingEndTime': datetime.datetime(2021, 11, 18, 4, 28, 43, tzinfo=tzlocal()),
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:875692608981:training-job/tuning-18-04-19-20-006-88aa457c',
 'TrainingJobName': 'tuning-18-04-19-20-006-88aa457c',
 'TrainingJobStatus': 'Completed',
 'TrainingStartTime': datetime.datetime(2021, 11, 18, 4, 27, 54, tzinfo=tzlocal()),
 'TunedHyperParameters': {'alpha': '1.652930944637491',
                          'eta': '0.9788657917131687',
                          'max_depth': '1',
                          'num_round': '8'}}
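
Note that `TunedHyperParameters` reports every value as a string. If you want to reuse the winning values in code, they need converting first. A minimal sketch, in which the helper name and the choice of which names are integers are assumptions based on the ranges tuned in this notebook:

```python
def parse_tuned_hyperparameters(tuned, integer_names=("max_depth", "num_round")):
    """Convert the string values of TunedHyperParameters to int or float."""
    parsed = {}
    for name, raw_value in tuned.items():
        parsed[name] = int(raw_value) if name in integer_names else float(raw_value)
    return parsed


# Values copied from the best training job shown above
tuned = {
    "alpha": "1.652930944637491",
    "eta": "0.9788657917131687",
    "max_depth": "1",
    "num_round": "8",
}
best_hyperparameters = parse_tuned_hyperparameters(tuned)
print(best_hyperparameters)
```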


## Fetch all results as DataFrame
We can list the hyperparameters and objective metrics of all training jobs and pick out the training job with the best objective metric.

In [17]:
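import pandas as pd

# Note: this rebinds `tuner` from the HyperparameterTuner to its analytics object,
# which the plotting cells below rely on
tuner = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner.dataframe()
df = full_df  # fall back to the unfiltered frame so `df` is always defined

if len(full_df) > 0:
    # Keep only the jobs that reported a finite objective value
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        # Best job first: ascending when minimizing, descending when maximizing
        df = df.sort_values("FinalObjectiveValue", ascending=is_minimize)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df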

Number of training jobs with valid objective: 20
{'lowest': 0.4989669919013977, 'highest': 0.8491250276565552}


Unnamed: 0,alpha,eta,max_depth,num_round,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
14,1.652931,0.978866,1.0,8.0,tuning-18-04-19-20-006-88aa457c,Completed,0.849125,2021-11-18 04:27:54+00:00,2021-11-18 04:28:43+00:00,49.0
7,1.459183,1.0,1.0,35.0,tuning-18-04-19-20-013-3ab5fe21,Completed,0.846607,2021-11-18 04:39:51+00:00,2021-11-18 04:41:01+00:00,70.0
6,1.882423,0.564315,2.0,30.0,tuning-18-04-19-20-014-95ff7ef9,Completed,0.844331,2021-11-18 04:40:32+00:00,2021-11-18 04:41:37+00:00,65.0
19,1.768307,0.573277,3.0,8.0,tuning-18-04-19-20-001-a34179ec,Completed,0.837455,2021-11-18 04:22:55+00:00,2021-11-18 04:24:05+00:00,70.0
3,1.965919,0.573348,1.0,20.0,tuning-18-04-19-20-017-5bdae550,Completed,0.836987,2021-11-18 04:44:32+00:00,2021-11-18 04:45:41+00:00,69.0
15,1.970924,0.599899,2.0,85.0,tuning-18-04-19-20-005-0eb2f35d,Completed,0.824283,2021-11-18 04:27:07+00:00,2021-11-18 04:27:57+00:00,50.0
2,1.932838,0.416793,7.0,19.0,tuning-18-04-19-20-018-87d05168,Completed,0.813953,2021-11-18 04:45:18+00:00,2021-11-18 04:46:20+00:00,62.0
5,1.611244,0.318496,7.0,7.0,tuning-18-04-19-20-015-a05f52c1,Completed,0.810757,2021-11-18 04:40:49+00:00,2021-11-18 04:42:02+00:00,73.0
0,1.050697,0.368769,10.0,27.0,tuning-18-04-19-20-020-81cf7dfb,Completed,0.810563,2021-11-18 04:48:53+00:00,2021-11-18 04:49:50+00:00,57.0
17,1.778448,0.479802,3.0,46.0,tuning-18-04-19-20-003-2d84be6f,Completed,0.808335,2021-11-18 04:22:34+00:00,2021-11-18 04:23:40+00:00,66.0
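
The objective values above are AUC scores. AUC can be read as the probability that a randomly chosen positive example (here, a fraudulent claim) receives a higher score than a randomly chosen negative one, so 0.5 means a ranking no better than chance. That is why the lowest value reported above, about 0.499, indicates a model that learned essentially nothing. The sketch below computes AUC directly from that definition on made-up labels and scores; it is meant to show the idea, not to be an efficient implementation.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up, unbalanced example: 8 negatives and 2 positives
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.6, 0.25, 0.1, 0.7, 0.4]
constant_scores = [0.5] * 10  # a model that gives every claim the same score

good_auc = auc_from_scores(labels, good_scores)
chance_auc = auc_from_scores(labels, constant_scores)
print(good_auc, chance_auc)
```

Because AUC depends only on how the examples are ranked, it stays informative when one class is rare, which suits an unbalanced problem like this one.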


## See TuningJob results vs time
Next we will show how the objective metric changes over time as the tuning job progresses. With the Bayesian strategy, you should expect to see a general trend towards better results, but this progress will not be steady because the algorithm needs to balance _exploration_ of new areas of parameter space against _exploitation_ of known good areas. This can give you a sense of whether the number of training jobs is sufficient for the complexity of your search space.

In [18]:
import bokeh
import bokeh.io

bokeh.io.output_notebook()
from bokeh.plotting import figure, show
from bokeh.models import HoverTool


class HoverHelper:
    def __init__(self, tuning_analytics):
        self.tuner = tuning_analytics

    def hovertool(self):
        tooltips = [
            ("FinalObjectiveValue", "@FinalObjectiveValue"),
            ("TrainingJobName", "@TrainingJobName"),
        ]
        for k in self.tuner.tuning_ranges.keys():
            tooltips.append((k, "@{%s}" % k))

        ht = HoverTool(tooltips=tooltips)
        return ht

    def tools(self, standard_tools="pan,crosshair,wheel_zoom,zoom_in,zoom_out,undo,reset"):
        return [self.hovertool(), standard_tools]


hover = HoverHelper(tuner)

p = figure(plot_width=900, plot_height=400, tools=hover.tools(), x_axis_type="datetime")
p.circle(source=df, x="TrainingStartTime", y="FinalObjectiveValue")
show(p)
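
Alongside the scatter plot, it can help to track the best objective value seen so far: if that running best flattens out early, more training jobs are unlikely to help much. The sketch below computes it for a handful of jobs copied from the results table above. This is only a subset of the 20 jobs, so treat the output as an illustration.

```python
from itertools import accumulate

# (start time, FinalObjectiveValue) pairs copied from the results table above
jobs = [
    ("04:27:54", 0.849125),
    ("04:39:51", 0.846607),
    ("04:40:32", 0.844331),
    ("04:22:55", 0.837455),
    ("04:27:07", 0.824283),
    ("04:22:34", 0.808335),
]

# Order the jobs by start time, then keep the running maximum of the objective
jobs.sort(key=lambda job: job[0])
running_best = list(accumulate((value for _, value in jobs), max))
print(running_best)
```

For an objective that is minimized, pass `min` instead of `max`.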

## Analyze the correlation between objective metric and individual hyperparameters 
Now that you have finished a tuning job, you may want to know the correlation between your objective metric and the individual hyperparameters you selected to tune. That insight helps you decide whether it makes sense to adjust the search ranges for certain hyperparameters and start another tuning job. For example, if you see a positive trend between the objective metric and a numerical hyperparameter, you probably want to set a higher tuning range for that hyperparameter in your next tuning job.

The following cell draws a graph for each hyperparameter to show its correlation with your objective metric.

In [19]:
ranges = tuner.tuning_ranges
figures = []
for hp_name, hp_range in ranges.items():
    categorical_args = {}
    if hp_range.get("Values"):
        # This is marked as categorical.  Check if all options are actually numbers.
        def is_num(x):
            try:
                float(x)
                return 1
            except (TypeError, ValueError):
                return 0

        vals = hp_range["Values"]
        if sum([is_num(x) for x in vals]) == len(vals):
            # Bokeh has issues plotting a "categorical" range that's actually numeric, so plot as numeric
            print("Hyperparameter %s is tuned as categorical, but all values are numeric" % hp_name)
        else:
            # Set up extra options for plotting categoricals.  A bit tricky when they're actually numbers.
            categorical_args["x_range"] = vals

    # Now plot it
    p = figure(
        plot_width=500,
        plot_height=500,
        title="Objective vs %s" % hp_name,
        tools=hover.tools(),
        x_axis_label=hp_name,
        y_axis_label=objective_name,
        **categorical_args,
    )
    p.circle(source=df, x=hp_name, y="FinalObjectiveValue")
    figures.append(p)
show(bokeh.layouts.Column(*figures))
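
The plots show the trend visually. A single number such as the Pearson correlation coefficient can summarize it. The sketch below computes it by hand for `max_depth` against the objective, using the ten jobs listed in the results table above. Since this is only part of the tuning job and the sample is small, read the result as a rough indication; the `pearson` helper is written here for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# max_depth and FinalObjectiveValue for the ten jobs shown in the results table
max_depth = [1, 1, 2, 3, 1, 2, 7, 7, 10, 3]
objective = [0.849125, 0.846607, 0.844331, 0.837455, 0.836987,
             0.824283, 0.813953, 0.810757, 0.810563, 0.808335]

r = pearson(max_depth, objective)
print(round(r, 2))
```

A negative coefficient here suggests that shallower trees did better on this dataset, which would argue for keeping the lower end of the `max_depth` range in a follow-up tuning job.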

## Deploy the best model
Now that we have the best model, we can deploy it to an endpoint. Please refer to other SageMaker sample notebooks or the SageMaker documentation to see how to deploy a model.