# Direct Marketing with Amazon SageMaker Autopilot

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio.

---


## Contents

1. [Introduction](#Introduction)
1. [Restore Shared Variables](#Restore-Shared-Variables)
1. [Setting up the SageMaker Autopilot Job](#Settingup)
1. [Launching the SageMaker Autopilot Job](#Launching)
1. [Tracking SageMaker Autopilot Job Progress](#Tracking)
1. [Results](#Results)
1. [Cleanup](#Cleanup)

## Introduction

In this notebook, we use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) to create and launch an [Autopilot experiment job](https://aws.amazon.com/sagemaker/autopilot/). At the end, we demonstrate how to use [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) for batch inference.

> **_NOTE_** Complete the [01_sagemaker_autopilot_data_preparation.ipynb](./01_sagemaker_autopilot_data_preparation.ipynb) notebook first, so that the training dataset is available in the S3 bucket.

### Why Amazon SageMaker Python SDK?

The [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) provides high-level APIs that make it easier for data scientists and ML engineers to use the Amazon SageMaker service.

## Restore Shared Variables

Retrieve the shared variables created by the [01_sagemaker_autopilot_data_preparation.ipynb](./01_sagemaker_autopilot_data_preparation.ipynb) notebook. They hold the S3 URIs, bucket, and prefix needed to set up the Autopilot experiment.

In [None]:
%store -r train_data_s3_path
%store -r test_data_s3_path
%store -r bucket
%store -r prefix

try:
    train_data_s3_path
except NameError:
    raise ValueError("Training dataset S3 URI is missing, please execute the data preparation notebook!")

## Setting up the SageMaker Autopilot Job<a name="Settingup"></a>

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset. 

The required inputs for invoking an Autopilot job are:
* Amazon S3 location for input dataset and for all output artifacts
* Name of the column of the dataset you want to predict (`y` in this case) 
* An IAM role

Currently Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order, is expected to have a header row.
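
As a rough illustration of the header rule above, the sketch below checks that the file sorting first carries a header row naming the target column. The helper names and the sample CSV content are hypothetical, and the check runs on in-memory text rather than on S3 objects.

```python
import csv
import io


def header_has_target(csv_text: str, target: str) -> bool:
    """Return True if the first row of `csv_text` contains the `target` column name."""
    first_row = next(csv.reader(io.StringIO(csv_text)), [])
    return target in first_row


def first_file_in_lexical_order(keys):
    """Return the key that sorts first, i.e. the file expected to carry the header."""
    return sorted(keys)[0]


# Made-up sample content, for illustration only.
sample = "age,job,y\n35,technician,no\n"
print(header_has_target(sample, "y"))
print(first_file_in_lexical_order(["part-0002.csv", "part-0001.csv"]))
```

A check like this can be run on the local copy of the training file before it is uploaded.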

In [None]:
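from datetime import datetime
from time import sleep

import boto3
import sagemaker
from sagemaker.automl import automl

role = sagemaker.get_execution_role()  # IAM role attached to this notebook
sm = boto3.client("sagemaker")  # low-level client, used later to poll the transform job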

timestamp_suffix = f"{datetime.now():%Y-%m-%d-%H-%M-%S}"
auto_ml_job_name = f"automl-banking-{timestamp_suffix}"
print(f"AutoMLJobName: {auto_ml_job_name}")

automl_job = automl.AutoML(
    role=role,
    target_attribute_name="y",
    output_path=f"s3://{bucket}/{prefix}/automl-output",
    problem_type="BinaryClassification",
    max_candidates=10,  # (We've set this low to prioritize demo speed over accuracy)
    job_objective={"MetricName": "F1"},
)

Specifying the type of problem you want to solve with your dataset (`Regression`, `MulticlassClassification`, or `BinaryClassification`) is **optional**. If you omit it, SageMaker Autopilot infers the problem type from statistics of the target column (the column you want to predict).

You can limit the running time of a SageMaker Autopilot job either by capping the number of pipeline evaluations (each evaluation is called a `Candidate` because it produces a candidate model) or by setting the total time allocated to the overall job. Under default settings, this job takes about four hours to run; the duration varies between runs because of the exploratory process Autopilot uses to find optimal training parameters.
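
A minimal sketch of how these limits might be collected before constructing the job. The helper `autopilot_limits` is hypothetical; the keyword names it emits (`max_candidates`, `total_job_runtime_in_seconds`, `max_runtime_per_training_job_in_seconds`) are assumed to match the `AutoML` constructor arguments of the SageMaker Python SDK version in use, so check them against your installed version.

```python
def autopilot_limits(max_candidates=None, total_hours=None, per_job_minutes=None):
    """Build keyword arguments that cap an Autopilot job's size or duration."""
    limits = {}
    if max_candidates is not None:
        limits["max_candidates"] = int(max_candidates)
    if total_hours is not None:
        limits["total_job_runtime_in_seconds"] = int(total_hours * 3600)
    if per_job_minutes is not None:
        limits["max_runtime_per_training_job_in_seconds"] = int(per_job_minutes * 60)
    return limits


limits = autopilot_limits(max_candidates=10, total_hours=1)
print(limits)

# Hypothetical usage, not executed here:
# automl.AutoML(role=role, target_attribute_name="y", **limits)
```

Keeping the limits in one dictionary makes it easy to switch between a quick demo run and a longer, more thorough search.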

## Launching the SageMaker Autopilot Job<a name="Launching"></a>

You can now launch the Autopilot job by calling the `fit()` method as described in the [SageMaker Python SDK AutoML doc](https://sagemaker.readthedocs.io/en/stable/api/training/automl.html#sagemaker.automl.automl.AutoML.fit).

In [None]:
automl_job.fit(inputs=train_data_s3_path, wait=False, logs=False)

## Tracking SageMaker Autopilot Job Progress<a name="Tracking"></a>

A SageMaker Autopilot job consists of the following high-level steps:
* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). 

In [None]:
print("JobStatus - Secondary Status\n----------------------------")

while True:
    sleep(60)
    describe_response = automl_job.describe_auto_ml_job()
    print("{AutoMLJobStatus} - {AutoMLJobSecondaryStatus}".format(**describe_response))
    if describe_response["AutoMLJobStatus"] in ("Failed", "Completed", "Stopped"):
        break
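
The loop above polls until the job reaches a terminal status, with no upper bound on the wait. A possible variant, sketched here under the assumption that a timeout is wanted, wraps the same logic in a hypothetical helper `wait_for_job`. The describe call is passed in as a function, so the sketch runs against canned responses instead of AWS.

```python
import time

TERMINAL_STATUSES = ("Failed", "Completed", "Stopped")


def wait_for_job(describe, status_key, poll_seconds=60, timeout_seconds=6 * 3600,
                 sleep=time.sleep):
    """Poll `describe()` until `status_key` is terminal or the timeout expires."""
    waited = 0
    while True:
        status = describe()[status_key]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Canned responses standing in for automl_job.describe_auto_ml_job (illustration only).
fake_responses = iter([
    {"AutoMLJobStatus": "InProgress"},
    {"AutoMLJobStatus": "InProgress"},
    {"AutoMLJobStatus": "Completed"},
])
final_status = wait_for_job(
    lambda: next(fake_responses), "AutoMLJobStatus", sleep=lambda seconds: None
)
print(final_status)
```

In the notebook, the same helper could be called with `automl_job.describe_auto_ml_job` as `describe` and `"AutoMLJobStatus"` as `status_key`.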

## Results

The Autopilot job is completed, and we now have a set of models with their associated performance metric.
Let's consider the top 5.

In [None]:
import pandas as pd

# Sort best-first and keep only the top 5 candidates.
candidates_list = automl_job.list_candidates(
    sort_by="FinalObjectiveMetricValue", sort_order="Descending", max_results=5
)

models = pd.json_normalize(candidates_list)[
    [
        "CandidateName",
        "FinalAutoMLJobObjectiveMetric.Value",
        "FinalAutoMLJobObjectiveMetric.MetricName",
    ]
].rename(
    columns={
        "FinalAutoMLJobObjectiveMetric.Value": "metric_value",
        "FinalAutoMLJobObjectiveMetric.MetricName": "metric_name",
        "CandidateName": "candidate_name",
    }
)

models
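
For readers who prefer to rank candidates without pandas, here is a small sketch in plain Python. It relies only on the `CandidateName` and `FinalAutoMLJobObjectiveMetric` fields used in the cell above; the helper `top_candidates` and the sample values are made up, and the ranking assumes a higher-is-better metric such as F1.

```python
def top_candidates(candidates, n=5):
    """Return (name, metric value) pairs for the `n` highest-scoring candidates."""
    ranked = sorted(
        candidates,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=True,
    )
    return [
        (c["CandidateName"], c["FinalAutoMLJobObjectiveMetric"]["Value"])
        for c in ranked[:n]
    ]


# Made-up candidates, for illustration only.
sample_candidates = [
    {"CandidateName": "candidate-a", "FinalAutoMLJobObjectiveMetric": {"Value": 0.58}},
    {"CandidateName": "candidate-b", "FinalAutoMLJobObjectiveMetric": {"Value": 0.61}},
    {"CandidateName": "candidate-c", "FinalAutoMLJobObjectiveMetric": {"Value": 0.47}},
]
best_two = top_candidates(sample_candidates, n=2)
print(best_two)
```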

In [None]:
automl_job.best_candidate()

### Perform batch inference using the best candidate

Now that you have successfully completed the SageMaker Autopilot job on the dataset, create a model from any of the candidates by using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). 

For classification problem types, the inference containers generated by SageMaker Autopilot let you select the response content for predictions. The valid response keys for binary and multiclass classification are:

- `predicted_label` - predicted class
- `probability` - In binary classification, the probability that the result is predicted as the second or True class in the target column. In multiclass classification, the probability of the winning class.
- `labels` - list of all possible classes
- `probabilities` - list of all probabilities for all classes (order corresponds with `labels`)

By default, the inference containers are configured to generate the `predicted_label` only.

In this binary classification example we'll request both `predicted_label` and `probability` - demonstrating how this additional "confidence" output from the model can be used.
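
One way the `probability` output can be put to work is to apply a decision threshold other than the model's default. The sketch below assumes that `probability` refers to the positive class `yes`, as described in the list above, and uses made-up output lines; `relabel` is a hypothetical helper.

```python
import csv
import io


def relabel(rows, threshold, positive="yes", negative="no"):
    """Assign `positive` when the probability meets `threshold`, else `negative`."""
    return [
        positive if float(probability) >= threshold else negative
        for _, probability in rows
    ]


# Made-up batch output: one "predicted_label,probability" pair per line.
raw_output = "no,0.08\nno,0.41\nyes,0.77\n"
rows = list(csv.reader(io.StringIO(raw_output)))

default_labels = relabel(rows, threshold=0.5)
eager_labels = relabel(rows, threshold=0.3)
print(default_labels)
print(eager_labels)
```

Lowering the threshold flags more customers as likely responders, trading precision for recall.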

In [None]:
model_name = "automl-banking-model-" + timestamp_suffix
inference_response_keys = ["predicted_label", "probability"]
model = automl_job.create_model(
    name=model_name,
    candidate=automl_job.best_candidate(),
    inference_response_keys=inference_response_keys,
)

You can run batch inference with Amazon SageMaker batch transform. The same model can also be deployed for online inference using Amazon SageMaker hosting.

In [None]:
output_path = f"s3://{bucket}/{prefix}/inference-results/"

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    assemble_with="Line",
    output_path=output_path,
)

We can now start the transform job.

In [None]:
transformer.transform(
    data=test_data_s3_path,
    data_type="S3Prefix",
    content_type="text/csv",
    split_type="Line",
    wait=False,
)

Watch the transform job for completion.

In [None]:
print("JobStatus\n----------")

while True:
    sleep(30)
    describe_response = sm.describe_transform_job(
        TransformJobName=transformer.latest_transform_job.job_name
    )
    job_run_status = describe_response["TransformJobStatus"]
    print(job_run_status)
    if job_run_status in ("Failed", "Completed", "Stopped"):
        break

Now let's view the results of the transform job:

In [None]:
test_data_preds = pd.read_csv(
    transformer.output_path + "test_data.csv.out",
    header=None,
    names=inference_response_keys,
)

test_data_preds
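
The file name used above follows from how batch transform names its results: each output object takes the input object's name with `.out` appended. A minimal sketch of that mapping follows, using a hypothetical helper and placeholder URIs; it assumes the input URI points at a single file rather than at a prefix.

```python
import posixpath


def transform_output_uri(output_path: str, input_uri: str) -> str:
    """Return the S3 URI where the predictions for `input_uri` are expected."""
    input_name = posixpath.basename(input_uri.rstrip("/"))
    return output_path.rstrip("/") + "/" + input_name + ".out"


# Placeholder bucket and prefix, for illustration only.
uri = transform_output_uri(
    "s3://example-bucket/example-prefix/inference-results/",
    "s3://example-bucket/example-prefix/test/test_data.csv",
)
print(uri)
```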

### Additional metrics

We can use the result of the transform job to evaluate additional metrics on the test dataset, using the [scikit-learn](https://scikit-learn.org/stable/index.html) library.

Common metrics for classification problems are [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) and [AP](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html).

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    average_precision_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

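# `test_data` is assumed to be the labelled test DataFrame (including the "y" column)
# produced by the data preparation notebook; load it before running this cell.
labels = test_data["y"]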
AUC = roc_auc_score(labels == "yes", test_data_preds.probability)
AP = average_precision_score(labels, test_data_preds.probability, pos_label="yes")

print(f"AUC: {AUC:.3f}\nAP {AP:.3f}")
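
To make the two metrics less of a black box, here is a small from-scratch sketch. It treats AUC as the probability that a random positive example scores higher than a random negative one, and AP as the mean precision at the ranks where positives appear. It is an illustration on toy data, not the scikit-learn implementation, and the AP function assumes no tied scores.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y]
    negatives = [s for y, s in zip(y_true, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a positive is found."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Toy data: 1 marks the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(f"AUC: {roc_auc(y_true, scores):.3f}")
print(f"AP: {average_precision(y_true, scores):.3f}")
```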

We can also generate a classification report and a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix):

In [None]:
print(classification_report(labels == "yes", test_data_preds.predicted_label == "yes"))
cm = confusion_matrix(labels, test_data_preds.predicted_label, labels=["yes", "no"])
ConfusionMatrixDisplay(cm, display_labels=["yes", "no"]).plot(
    include_values=True, cmap=plt.cm.Blues, values_format="d"
);
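
Since the job objective is F1, it can help to see how that score falls out of the confusion matrix counts. The sketch below uses hypothetical helpers and made-up labels, with `yes` as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="yes"):
    """Return (tp, fp, fn, tn) with `positive` treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when both are undefined or zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels, for illustration only.
y_true = ["yes", "no", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "no", "no", "yes", "yes"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
print(f"F1: {f1_from_counts(tp, fp, fn):.3f}")
```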

And present the model performance using [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and Precision-Recall curves.

In [None]:
f, [ax0, ax1] = plt.subplots(1, 2, figsize=(16, 9))

fpr, tpr, _ = roc_curve(labels == "yes", test_data_preds.probability)
ax0.step(fpr, tpr, where="post")
ax0.fill_between(fpr, tpr, step="post", alpha=0.2, color="b")
ax0.plot([0, 1], [0, 1], linestyle="--")
ax0.set_xlabel("False Positive Rate")
ax0.set_ylabel("True Positive Rate")
ax0.set_title("ROC Curve")

precision, recall, _ = precision_recall_curve(
    labels == "yes", test_data_preds.probability
)
ax1.step(recall, precision, where="post")
ax1.fill_between(recall, precision, step="post", alpha=0.2, color="b")
ax1.set_xlabel("Recall")
ax1.set_ylabel("Precision")
ax1.set_title("Precision-Recall Curve")

for ax in [ax0, ax1]:
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    ax.set_aspect(1)
    ax.grid()

### Exploration and Modelling Notebooks

As well as the results and candidate models themselves, Autopilot generates other artifacts including:

- A **Data Exploration Notebook**: produced during the analysis phase of the job, that helps you identify problems in your dataset.
- A **Candidate Definitions Notebook**: interactively stepping through the steps taken by Autopilot to define and train candidates, and select the best one.
- **Supporting Python code**: Including the actual code used for the different pre-processing steps.

To get a good overview of the available assets, we'll download not just the notebooks but the whole output folder:

In [None]:
automl_job_desc = automl_job.describe_auto_ml_job()

automl_output_s3uri = automl_job_desc["OutputDataConfig"]["S3OutputPath"]
print(f"Autopilot output:\n{automl_output_s3uri}")

print(f"Downloading to autopilot_output/...")
sagemaker.s3.S3Downloader.download(automl_output_s3uri, "autopilot_output/")
print("Done")

From this download we can view not just the notebooks, but also other assets like the generated Python code they link to, and pre-processed datasets. Explore the notebooks linked below, but also check out the other contents in the `autopilot_output` folder!

In [None]:
from IPython.display import Markdown

candidate_notebook_s3uri = automl_job_desc["AutoMLJobArtifacts"][
    "CandidateDefinitionNotebookLocation"
]
candidate_notebook_path = "autopilot_output" + candidate_notebook_s3uri[len(automl_output_s3uri):]

print(f"Candidate definition notebook:\n{candidate_notebook_s3uri}")
print(f"\nDownloaded at:")
display(Markdown(f"[{candidate_notebook_path}]({candidate_notebook_path})"))

In [None]:
dataexp_notebook_s3uri = automl_job_desc["AutoMLJobArtifacts"][
    "DataExplorationNotebookLocation"
]
dataexp_notebook_path = "autopilot_output" + dataexp_notebook_s3uri[len(automl_output_s3uri):]

print(f"Data exploration notebook:\n{dataexp_notebook_s3uri}")
print(f"\nDownloaded at:")
display(Markdown(f"[{dataexp_notebook_path}]({dataexp_notebook_path})"))
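
The two cells above turn an S3 URI into a local path by slicing off the output prefix. A slightly more defensive sketch of the same idea is shown below: it tolerates a trailing slash on the output URI and rejects URIs outside the output location. The helper name and the URIs are hypothetical.

```python
import posixpath


def local_path_for(s3_uri: str, output_s3_uri: str, local_dir: str = "autopilot_output") -> str:
    """Map an object under `output_s3_uri` to its location inside `local_dir`."""
    root = output_s3_uri.rstrip("/") + "/"
    if not s3_uri.startswith(root):
        raise ValueError(f"{s3_uri} is not under {output_s3_uri}")
    return posixpath.join(local_dir, s3_uri[len(root):])


# Placeholder URIs, for illustration only.
path = local_path_for(
    "s3://example-bucket/prefix/automl-output/job/notebooks/candidates.ipynb",
    "s3://example-bucket/prefix/automl-output",
)
print(path)
```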

### Best Model Explainability Artifacts

SageMaker Autopilot uses SageMaker Clarify to generate an explainability report for the best candidate.

These Clarify artifacts are also available from S3, and were already included in our download above:

In [None]:
explainability_s3uri = automl_job.best_candidate()["CandidateProperties"][
    "CandidateArtifactLocations"
]["Explainability"]
explainability_path = "autopilot_output" + explainability_s3uri[len(automl_output_s3uri):]

print(f"Explainability artifacts:\n{explainability_s3uri}")
print(f"\nDownloaded to folder:")
print(f"{explainability_path}")

## Cleanup

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. The code below, when uncommented, deletes them, including all the generated models and the auto-generated notebooks.

In [None]:
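# import boto3

# s3 = boto3.resource("s3")
# s3_bucket = s3.Bucket(bucket)  # separate name, so the `bucket` string is not overwritten

# # Matches the `output_path` passed to `automl.AutoML` above.
# job_outputs_prefix = f"{prefix}/automl-output/"
# s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()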