# Targeting Direct Marketing

Direct marketing, whether by mail, email, or phone, is a common tactic for acquiring customers. Because resources and customers' attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer. Predicting which prospects those are from readily available information such as demographics, past interactions, and environmental factors is a common machine learning problem.

This notebook presents an example problem: predicting whether a customer will enroll in a term deposit at a bank after one or more phone calls.

To execute this notebook in SageMaker Studio, select the `Data Science` image.

In [None]:
import numpy as np                                
import pandas as pd                              
import matplotlib.pyplot as plt      
import zipfile
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

Run the cell below to import (or install) the Data Wrangler widget, which shows automatic visualizations and generates code to fix data quality issues.


In [None]:

try:
    import sagemaker_datawrangler
except ImportError:
    !pip install --upgrade sagemaker-datawrangler
    import sagemaker_datawrangler

# Display Pandas DataFrame to view the widget: df, display(df), df.sample()... 

In [None]:
pd.__version__

Make sure the pandas version is `1.2.4` or later. If it is not, restart the kernel before going further.

---

Let's start by downloading the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket.

\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014


In [None]:
!wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

with zipfile.ZipFile('bank-additional.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

Now let's read this into a pandas DataFrame and take a look. Because we imported the `sagemaker_datawrangler` library, displaying the DataFrame automatically shows distributions, data quality issues, and other helpful recommendations and built-in transformations.

In [None]:
data = pd.read_csv('./bank-additional/bank-additional-full.csv')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page

Let's use the Data Wrangler widget to visualize our data.

In [None]:
data

In [None]:
# Note: These transformations can be done through the graphical widget generated above.
# The data prep widget automatically generates code for the transformations that you apply.

# Work on a copy so that `data` stays untouched; the cells below expect `output_df` to exist.
output_df = data.copy(deep=True)

# # Code to Drop missing for column: marital to resolve warning: Disguised missing values
# missing_values = ['unknown']
# output_df = output_df[~output_df['marital'].isin(missing_values)]

# # Code to Drop column for column: contact to resolve warning: Constant column
# output_df = output_df.drop(columns=['contact'])
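
The "disguised missing values" warning refers to placeholder strings such as `'unknown'` that stand in for missing data. As a minimal sketch of how such placeholders could be counted by hand, the cell below uses a small made-up DataFrame (`toy` and its values are illustrative assumptions, not rows from the bank dataset):

```python
import pandas as pd

# Toy frame standing in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "marital": ["married", "unknown", "single", "married"],
    "job": ["admin.", "technician", "unknown", "unknown"],
    "age": [34, 51, 27, 45],
})

# Count how often the placeholder string appears in each column.
unknown_counts = toy.isin(["unknown"]).sum()
print(unknown_counts)
```

Columns with a high count may be candidates for dropping rows or for treating `'unknown'` as its own category, depending on the use case.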

Encode the target column.

In [None]:
output_df['y'] = output_df['y'].replace({'no': 0, 'yes': 1})

Drop unused columns.

In [None]:
output_df = output_df.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

Encode the features.

In [None]:
model_data = pd.get_dummies(output_df)
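
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies` does to a frame that mixes numeric and text columns. The `toy` frame is an illustrative assumption, not part of the dataset:

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [34, 51, 27],
    "marital": ["married", "single", "married"],
})

# Numeric columns pass through unchanged; each category becomes its own indicator column.
encoded = pd.get_dummies(toy)
print(encoded)
```

Each distinct value of `marital` becomes a separate column named `marital_<value>`, which is why the number of features can grow considerably after encoding.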

Split the data.

In [None]:
X = model_data.loc[:, model_data.columns != 'y']
y = model_data['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
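
If the target classes are imbalanced, a plain random split can leave the train and test sets with different class proportions. One option, shown here only as a sketch on made-up labels, is the `stratify` argument of `train_test_split`; the `*_toy`, `*_tr`, and `*_te` names are illustrative and not used elsewhere in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 180 negatives and 20 positives (10% positive).
X_toy = np.arange(200).reshape(-1, 1)
y_toy = np.array([0] * 180 + [1] * 20)

# stratify=y_toy keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(f"Positive rate in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
```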


Train a decision tree with scikit-learn.

In [None]:
estimator = DecisionTreeClassifier()
estimator.fit(X_train, y_train)

In [None]:
y_pred = estimator.predict(X_test)

Calculate model accuracy.

In [None]:
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc}')
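
Accuracy alone can be misleading when one class dominates, which is often the case for marketing responses. The sketch below uses made-up labels and a trivial "always no" predictor (both assumptions for illustration) to show how recall and a confusion matrix expose what accuracy hides:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 90 customers who did not enroll and 10 who did.
y_true = np.array([0] * 90 + [1] * 10)
# A trivial "model" that always predicts "no".
y_always_no = np.zeros_like(y_true)

toy_accuracy = accuracy_score(y_true, y_always_no)
toy_recall = recall_score(y_true, y_always_no)
toy_cm = confusion_matrix(y_true, y_always_no)

print(f"Accuracy: {toy_accuracy}")
print(f"Recall:   {toy_recall}")
print(toy_cm)  # rows are true classes, columns are predicted classes
```

The same functions can be applied to `y_test` and `y_pred` from the cells above to check how many of the actual enrollments the decision tree finds.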

Save the model with `joblib`.

In [None]:
import joblib

model_path = 'sklearn_model.joblib'

joblib.dump(estimator, model_path) 
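
A saved model can be read back with `joblib.load`. The following is a minimal round-trip sketch on a tiny made-up dataset; `toy_model`, `toy_path`, and the data are illustrative assumptions, and the file is written to a temporary directory so that it does not overwrite `sklearn_model.joblib`:

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset in which the first feature determines the label.
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
toy_y = [0, 0, 1, 1]
toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(toy_model, toy_path)
    restored = joblib.load(toy_path)

# The restored model should give the same predictions as the original.
same_predictions = list(restored.predict(toy_X)) == list(toy_model.predict(toy_X))
print(same_predictions)
```

Loading generally works best with the same scikit-learn version that was used for saving.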

Use the SageMaker SDK to upload the model to S3.

In [None]:
import sagemaker

session = sagemaker.Session()

bucket = session.default_bucket()
print(f"S3 bucket: {bucket}")

key_prefix = "my-sklearn-model"
session.upload_data(model_path, bucket, key_prefix=key_prefix)

Use the AWS CLI to check that the file exists in S3.

In [None]:
! aws s3 ls {bucket}/{key_prefix}/