# Direct Marketing with Amazon SageMaker Autopilot

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio.

---

## Contents

1. [Introduction](#Introduction)
1. [Prerequisites](#Prerequisites)
1. [Downloading the dataset](#Downloading)
1. [Upload the dataset to Amazon S3](#Uploading)

## Introduction

[Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/) is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways:

- On autopilot (hence the name) or with human guidance
- Without code (through the SageMaker Studio UI) or with code (using the AWS SDKs)

This notebook explains the ML problem and the dataset, then uses the Autopilot service in the SageMaker Studio UI, without writing code, to train and deploy a model automatically. It then uses the AWS SDKs to create and deploy a machine learning model.

### The Problem

A typical introductory task in machine learning (the "Hello World" equivalent) is one that uses a dataset to predict whether a customer will enroll for a term deposit at a bank, after one or more phone calls. For more information about the task and the dataset used, see [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing).

Direct marketing, through mail, email, phone, etc., is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem. You can imagine that this task would readily translate to marketing lead prioritization in your own organization.

This notebook demonstrates how you can use Autopilot on this dataset to get the most accurate ML pipeline by exploring a number of potential options, or "candidates". Each candidate generated by Autopilot consists of two steps. The first step performs automated feature engineering on the dataset, and the second step trains and tunes an algorithm to produce a model. When you deploy this model, it follows similar steps: feature engineering followed by inference, to decide whether the lead is worth pursuing or not. The notebook contains instructions on how to train the model and how to deploy it to perform batch predictions on a set of leads. Where possible, it uses the Amazon SageMaker Python SDK, a high-level SDK, to simplify the way you interact with Amazon SageMaker.
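To make the two-step shape of a candidate concrete, here is a minimal sketch using scikit-learn. It is not what Autopilot runs internally; the toy data, the choice of transformers, and the use of logistic regression are all assumptions made for illustration. It only shows the idea of chaining feature engineering with an algorithm and reusing the same chain at inference time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data invented for illustration; column names mirror the bank marketing dataset.
toy = pd.DataFrame(
    {
        "age": [25, 38, 47, 52, 33, 61, 29, 44],
        "job": ["admin.", "services", "retired", "admin.",
                "services", "retired", "admin.", "services"],
        "y": ["no", "no", "yes", "no", "no", "yes", "no", "yes"],
    }
)

# Step 1: feature engineering (scale numeric columns, one-hot encode categorical ones).
feature_engineering = ColumnTransformer(
    [
        ("numeric", StandardScaler(), ["age"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ]
)

# Step 2: an algorithm trained on the engineered features.
candidate = Pipeline(
    [
        ("features", feature_engineering),
        ("model", LogisticRegression()),
    ]
)
candidate.fit(toy[["age", "job"]], toy["y"])

# At inference time the same two steps run in order on new leads.
new_leads = pd.DataFrame({"age": [58, 27], "job": ["retired", "admin."]})
predictions = candidate.predict(new_leads)
print(predictions)
```

Autopilot explores many such candidates with different feature transformations and algorithms, then ranks them by the chosen objective metric.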

Other examples demonstrate how to customize models in various ways. For instance, models deployed to devices typically have memory constraints that need to be satisfied as well as accuracy. Other use cases have real-time deployment requirements and latency constraints. For now, keep it simple.

## Prerequisites

In this first cell, we'll:

- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to SageMaker with the open-source [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), `sagemaker`

Note that while [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the general-purpose AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.

In [2]:
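import sagemaker

# Session that wraps the AWS region, credentials, and default resources
session = sagemaker.Session()

# Default SageMaker bucket for this account and region, plus a prefix to keep things tidy
bucket = session.default_bucket()
prefix = "mlu-workshop/autopilot-dm"

# Low-level SageMaker client (boto3) exposed by the session
sm = session.sagemaker_client

# IAM execution role of this notebook environment
role = sagemaker.get_execution_role()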
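The bucket and prefix together determine where files end up in S3. As a small sketch, the helper below (`s3_uri` is a hypothetical name, not part of the `sagemaker` SDK) composes the kind of URI that the upload step later in this notebook prints. The bucket name used here is a placeholder.

```python
import posixpath


def s3_uri(bucket: str, *key_parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI."""
    key = posixpath.join(*key_parts) if key_parts else ""
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# "example-bucket" is a placeholder; in the notebook it would be `bucket`.
example_prefix = "mlu-workshop/autopilot-dm"
expected_uri = s3_uri("example-bucket", example_prefix, "train", "train_data.csv")
print(expected_uri)
# s3://example-bucket/mlu-workshop/autopilot-dm/train/train_data.csv
```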

## Downloading the dataset<a name="Downloading"></a>

In this example we'll use the **direct marketing dataset** as per:

> *\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014*

Here, we'll download [the dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the SageMaker sample data S3 bucket.

In [3]:
!wget -P data/ -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

import zipfile

with zipfile.ZipFile("data/bank-additional.zip", "r") as zip_ref:
    print("Unzipping...")
    zip_ref.extractall("data")
print("Done")

local_data_path = "./data/bank-additional/bank-additional-full.csv"

--2022-02-10 01:15:15--  https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
Resolving sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)... 52.218.224.97
Connecting to sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)|52.218.224.97|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 432828 (423K) [application/zip]
Saving to: ‘data/bank-additional.zip’


2022-02-10 01:15:17 (768 KB/s) - ‘data/bank-additional.zip’ saved [432828/432828]

Unzipping...
Done
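If `wget` is not available in your environment, the same download and extraction can be sketched with the Python standard library alone. The function names below are hypothetical helpers. Unlike `wget -N`, which compares timestamps, this sketch simply skips the download when the file already exists.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless the file already exists."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_zip(zip_path: Path, target_dir: Path) -> list:
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Usage (requires network access):
# url = "https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip"
# members = extract_zip(download_file(url, Path("data/bank-additional.zip")), Path("data"))
```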


## Upload the dataset to Amazon S3<a name="Uploading"></a>

Before you run Autopilot on the dataset, first check the dataset to make sure that it has no obvious errors. The Autopilot process can take a long time, and it is generally good practice to inspect the dataset before you start a job. This particular dataset is small, so you can inspect it in the notebook instance itself. If you have a larger dataset that will not fit in the notebook instance's memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. Autopilot is capable of handling datasets up to 5 GB.

Read the data into a pandas DataFrame and take a look.

In [4]:
import pandas as pd

data = pd.read_csv(local_data_path)
with pd.option_context("display.max_columns", 500):
    # Make sure we can see all of the columns
    display(data)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,334,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,383,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,189,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,442,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


Note that there are 20 features to help predict the target column 'y'.
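Beyond eyeballing the table, a few numeric checks can catch obvious problems before a long job starts. The sketch below is one possible set of checks, not a prescribed procedure. `summarize` is a hypothetical helper, and the toy frame is invented for illustration. In the notebook you would call it with `data` and `"y"`.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str) -> dict:
    """Collect a few basic facts worth checking before starting a long job."""
    return {
        "rows": len(df),
        "feature_columns": df.shape[1] - 1,
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "target_counts": df[target].value_counts().to_dict(),
    }


# Toy frame invented for illustration; in the notebook, pass `data` and "y".
toy = pd.DataFrame(
    {
        "age": [56, 57, 37, 37],
        "job": ["housemaid", None, "services", "services"],
        "y": ["no", "no", "yes", "yes"],
    }
)
summary = summarize(toy, target="y")
print(summary)
```

A heavily skewed `target_counts` is worth noting, because class imbalance affects which objective metric is appropriate for the Autopilot job.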

Amazon SageMaker Autopilot takes care of preprocessing your data for you. You do not need to apply conventional data preprocessing techniques such as handling missing values, converting categorical features to numeric features, scaling data, or handling more complicated data types.

Moreover, splitting the dataset into training and validation splits is not necessary. Autopilot takes care of this for you. You may, however, want to split out a test set. The next step does that, although here the held-out data is used for batch inference at the end rather than for evaluating the model.


### Reserve some data for calling batch inference on the model

Divide the data into training and testing splits. The training split is used by SageMaker Autopilot. The testing split is reserved to perform inference using the suggested model.


In [5]:
train_data = data.sample(frac=0.8, random_state=200)

test_data = data.drop(train_data.index)

test_data_no_target = test_data.drop(columns=["y"])
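The `sample` then `drop` pattern relies on the DataFrame index being unique, which is the case for the default integer index created by `pd.read_csv`. As a quick sketch on a toy frame (invented for illustration), you can confirm that the two splits are disjoint and together cover every row:

```python
import pandas as pd

# Toy frame invented for illustration; in the notebook this would be `data`.
toy = pd.DataFrame({"age": range(20, 30), "y": ["no", "yes"] * 5})

train = toy.sample(frac=0.8, random_state=200)
test = toy.drop(train.index)

# The two splits should not share rows and should cover the whole frame.
assert train.index.intersection(test.index).empty
assert len(train) + len(test) == len(toy)
print(len(train), len(test))  # 8 2
```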

### Upload the dataset to Amazon S3
Copy the file to Amazon Simple Storage Service (Amazon S3) in CSV format for Amazon SageMaker training to use.

**Note the S3 object URI stored in the variable `train_data_s3_path`. You will need it when creating the Autopilot job in the SageMaker Studio UI.**

In [15]:
# Training split: keep the header row so that columns, including the target `y`, are named
train_file = "data/train_data.csv"
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print("Train data uploaded to: " + train_data_s3_path)

# Test split without the target column, written without a header row
test_file = "data/test_data_without_label.csv"
test_data_no_target.to_csv(test_file, index=False, header=False)
# test_data_without_label_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
# print("Test data without label uploaded to: " + test_data_without_label_s3_path)

# Test split with the target column, kept as a local file
test_file_label = "data/test_data_with_label.csv"
test_data.to_csv(test_file_label, index=False, header=True)
# test_data_with_label_s3_path = session.upload_data(path=test_file_label, key_prefix=prefix + "/test")
# print("Test data uploaded to: " + test_data_with_label_s3_path)

Train data uploaded to: s3://sagemaker-ap-southeast-2-452533547478/mlu-workshop/autopilot-dm/train/train_data.csv
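Some tools ask for the bucket and object key separately rather than the full URI. As a small sketch, the hypothetical helper `split_s3_uri` below separates the two using only the standard library. The URI shown is a placeholder, and in the notebook you would pass `train_data_s3_path`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI; in the notebook, pass `train_data_s3_path`.
bucket_name, key = split_s3_uri(
    "s3://example-bucket/mlu-workshop/autopilot-dm/train/train_data.csv"
)
print(bucket_name)  # example-bucket
print(key)  # mlu-workshop/autopilot-dm/train/train_data.csv
```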


### Store the shared variables

In [16]:
%store bucket
%store prefix
%store train_data_s3_path
%store test_file_label

Stored 'bucket' (str)
Stored 'prefix' (str)
Stored 'train_data_s3_path' (str)
Stored 'test_file_label' (str)


## Next

Now that the data is stored in an S3 bucket, we are ready to kick off an Autopilot experiment:

a. [Using SageMaker Studio UI](./02a_sagemaker_autopilot_experiment_with_studio_ui.ipynb)
 
b. [Using SageMaker Python SDK](./02b_sagemaker_autopilot_experiment_with_sdk.ipynb)