# Workshop Setup and Data Preparation

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio.

---

## Contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Downloading the dataset](#Downloading)
1. [View the downloaded dataset](#View)

## Introduction

> ***This notebook must be completed before all other labs in the workshop.***

In this notebook, you will set up the workshop environment and prepare the data for the other labs. To run the notebook, you can either:
 * Run all `code` cells with Menu `Run` -> `Run All Cells`
 * Run each `code` cell with `Shift + Enter`
 
In the labs, we'll use the **[Direct Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing)** as per:

> *\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014*

Please walk through the cell execution results and try to understand the dataset before moving on to the other labs.

## Setup

Set up the session, the S3 location, and the local folder used for the dataset download.

In [None]:
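import zipfile

import pandas as pd
import sagemaker

# SageMaker session and the IAM execution role attached to this notebook
session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Default S3 bucket for the session and the key prefix used by the workshop
bucket = session.default_bucket()
prefix = "mlu-workshop/direct-marketing"

# Local folder where the dataset is downloaded and extracted
data_folder = "../data"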

## Downloading the dataset<a name="Downloading"></a>

Here, we'll download [the dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the SageMaker sample data S3 bucket.

In [None]:
!wget -P $data_folder -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip


with zipfile.ZipFile(f"{data_folder}/bank-additional.zip", "r") as zip_ref:
    print("Unzipping...")
    zip_ref.extractall(data_folder)
print("Done")

data_file_path = f"{data_folder}/bank-additional/bank-additional-full.csv"
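
Before extracting, it can help to confirm that the archive really contains the CSV the later cells expect. The following is a minimal sketch using only the standard library. `find_member` is a hypothetical helper, not part of the workshop code, and the demo builds a tiny in-memory archive as a stand-in for `bank-additional.zip` so that it runs without the download.

```python
import io
import zipfile


def find_member(archive, suffix):
    """Return the first archive member whose name ends with `suffix`, or None."""
    with zipfile.ZipFile(archive, "r") as zip_ref:
        for name in zip_ref.namelist():
            if name.endswith(suffix):
                return name
    return None


# Demo on a tiny in-memory archive (a stand-in, not the real dataset).
demo_archive = io.BytesIO()
with zipfile.ZipFile(demo_archive, "w") as zf:
    zf.writestr("bank-additional/bank-additional-full.csv", "age,y\n30,no\n")

found = find_member(demo_archive, "bank-additional-full.csv")
print(found)
```

With the real download, the same helper could be called as `find_member(f"{data_folder}/bank-additional.zip", "bank-additional-full.csv")`.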

## View the downloaded dataset<a name="View"></a>

It's recommended to perform a check of the dataset to make sure that it has no obvious errors. 

> [The dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) is small, so it is easy to inspect in the notebook environment. If you have a larger dataset that will not fit in the notebook instance's memory, inspect it offline using a big data analytics tool such as Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. If you plan to use SageMaker Autopilot, note that it can handle datasets of up to 5 GB.

Read the data into a Pandas data frame and take a look.

In [None]:
data = pd.read_csv(data_file_path)
with pd.option_context("display.max_columns", 500):
    # Make sure we can see all of the columns
    display(data)

Let's talk about the data.  At a high level, we can see:

* We have a little over 40K customer records, and 20 features for each customer
* The features are mixed; some numeric, some categorical
* The data appears to be sorted, at least chronologically and by `contact`, maybe more
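
The high-level observations above can be checked programmatically rather than by eye. This is a minimal sketch, assuming the target column is named `y` as described in this notebook. `summarize` is a hypothetical helper, and the sample frame is hand-made for illustration, not real customer records.

```python
import pandas as pd


def summarize(df, target="y"):
    """Return row count, feature count, and numeric/categorical feature names."""
    features = df.drop(columns=[target])
    numeric = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return {
        "rows": len(df),
        "features": features.shape[1],
        "numeric": numeric,
        "categorical": categorical,
    }


# Tiny hand-made sample (illustrative values only).
sample = pd.DataFrame(
    {
        "age": [56, 37, 40],
        "job": ["housemaid", "services", "admin."],
        "duration": [261, 226, 151],
        "y": ["no", "no", "yes"],
    }
)
summary = summarize(sample)
print(summary)
```

Calling `summarize(data)` on the full data frame should report the record and feature counts listed above.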

_**Specifics on each of the features:**_

*Demographics:*
* `age`: Customer's age (numeric)
* `job`: Type of job (categorical: 'admin.', 'services', ...)
* `marital`: Marital status (categorical: 'married', 'single', ...)
* `education`: Level of education (categorical: 'basic.4y', 'high.school', ...)

*Past customer events:*
* `default`: Has credit in default? (categorical: 'no', 'unknown', ...)
* `housing`: Has housing loan? (categorical: 'no', 'yes', ...)
* `loan`: Has personal loan? (categorical: 'no', 'yes', ...)

*Past direct marketing contacts:*
* `contact`: Contact communication type (categorical: 'cellular', 'telephone', ...)
* `month`: Last contact month of year (categorical: 'may', 'nov', ...)
* `day_of_week`: Last contact day of the week (categorical: 'mon', 'fri', ...)
* `duration`: Last contact duration, in seconds (numeric). Important note: If duration = 0 then `y` = 'no'.
 
*Campaign information:*
* `campaign`: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
* `pdays`: Number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
* `previous`: Number of contacts performed before this campaign and for this client (numeric)
* `poutcome`: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)

*External environment factors:*
* `emp.var.rate`: Employment variation rate - quarterly indicator (numeric)
* `cons.price.idx`: Consumer price index - monthly indicator (numeric)
* `cons.conf.idx`: Consumer confidence index - monthly indicator (numeric)
* `euribor3m`: Euribor 3 month rate - daily indicator (numeric)
* `nr.employed`: Number of employees - quarterly indicator (numeric)

*Target variable:*
* `y`: Has the client subscribed to a term deposit? (binary: 'yes','no')
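
Two of the notes above lend themselves to a quick check: the rule that `duration` = 0 implies `y` = 'no', and the balance of the target variable. The following is a minimal sketch. `check_zero_duration` and `target_share` are hypothetical helpers, and the sample frame is hand-made for illustration.

```python
import pandas as pd


def check_zero_duration(df):
    """Return True if every row with duration == 0 has y == 'no'."""
    zero_calls = df[df["duration"] == 0]
    return bool((zero_calls["y"] == "no").all())


def target_share(df):
    """Return the fraction of rows for each value of the target column `y`."""
    return df["y"].value_counts(normalize=True).to_dict()


# Tiny hand-made sample (illustrative values only).
sample = pd.DataFrame(
    {
        "duration": [0, 120, 300, 0],
        "y": ["no", "yes", "no", "no"],
    }
)
print(check_zero_duration(sample))
print(target_share(sample))
```

On the real data, `check_zero_duration(data)` and `target_share(data)` give a first impression of whether the rule holds and how imbalanced the classes are.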

### Store the shared variables

In [None]:
%store bucket
%store prefix
%store data_folder
%store data_file_path

### Next

Now that the data has been downloaded, we are ready to kick off our first lab, the Autopilot experiment. Next, let's prepare the training data for Autopilot.

* [Preparing Autopilot experiment training data](../1.autopilot/01_sagemaker_autopilot_setup.ipynb)