# Assignment 1: local development in a notebook
In this assignment, you write code to process data, engineer features, train a model, and evaluate the model on the test dataset. You do all processing in the local notebook, trading scalability and reproducibility for speed of deployment and fast iterations, so you can experiment with different feature engineering approaches and model types.

Optionally, you can run this notebook headlessly as a SageMaker on-demand or scheduled notebook job.

Refer to the notebook [`01-idea-development.ipynb`](../01-idea-development.ipynb) for code snippets and general guidance for the exercises in this assignment.

Feel free to implement your own use case with your own dataset and model.

## Install and import packages
You can use `%` commands and `pip install` to install any packages in the notebook kernel.

In [10]:
%pip install --upgrade pip
%pip install -q  xgboost sagemaker-experiments

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [11]:
import pandas as pd
import numpy as np 
import json
import joblib
import xgboost as xgb
import sagemaker
import boto3
import os
from time import gmtime, strftime, sleep
from sklearn.metrics import roc_auc_score
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

sagemaker.__version__

'2.165.0'

In [12]:
session = sagemaker.Session()
sm = session.sagemaker_client

## Load data
- Create variables to keep literal constants, like file names and paths
- Load data from a local file to a Pandas dataframe
- Explore the data

In [13]:
# Write data load code
file_source = "EFS"
file_name = "bank-additional-full.csv"
input_path = "./data/bank-additional" 
output_path = "./data"

In [14]:
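# bucket_name and bucket_prefix are not defined in this notebook;
# set them before switching file_source to anything other than "EFS"
if file_source != "EFS":
    session.download_data(
        path=os.path.join(input_path, ""),
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input/{file_name}"
    )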

In [15]:
df_data = pd.read_csv(os.path.join(input_path, file_name), sep=";")

pd.set_option("display.max_columns", 500)  # View all of the columns
df_data  # show first 5 and last 5 rows of the dataframe

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,334,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,383,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,189,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,442,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
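
The "Explore the data" step is left open above. As a starting point, here is a minimal sketch of a few checks that are often useful before modeling. The inline sample and the helper name `summarize` are hypothetical; with the real data you would pass `df_data` and `"y"` instead.

```python
import io

import pandas as pd

# Tiny semicolon-separated sample that mimics the layout of the bank dataset
# (values are illustrative, not taken from the real file)
sample_csv = io.StringIO(
    "age;job;pdays;y\n"
    "56;housemaid;999;no\n"
    "57;services;999;no\n"
    "37;services;6;yes\n"
    "40;admin.;999;no\n"
)
df_sample = pd.read_csv(sample_csv, sep=";")


def summarize(df: pd.DataFrame, target: str) -> dict:
    """Collect a few basic facts that are useful to check before modeling."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        # Every non-numeric column is treated as categorical here
        "categorical_columns": df.select_dtypes(exclude="number").columns.tolist(),
        # Share of each target class, to spot class imbalance early
        "target_share": df[target].value_counts(normalize=True).to_dict(),
    }


summary = summarize(df_sample, target="y")
print(summary)
```

A strongly skewed `target_share` is a hint that accuracy alone would be a misleading metric, which is one reason this assignment evaluates the model with AUC.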


## [Optional] create an experiment
Use the [Amazon SageMaker Experiments Python SDK](https://sagemaker-experiments.readthedocs.io/en/latest/) to create and manage your experiments.

In [16]:
# Write code to create an experiment
experiment_name = f"from-idea-to-prod-experiment-{strftime('%d-%H-%M-%S', gmtime())}"

%store experiment_name

Stored 'experiment_name' (str)


## Exercise 1: EDA and feature engineering
- Implement data processing
- Implement EDA
- Implement feature engineering

In [17]:
target_col = "y"

# Indicator variable to capture when pdays takes a value of 999
df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)

# Indicator for individuals not actively employed
df_data["not_working"] = np.where(
    np.isin(df_data["job"], ["student", "retired", "unemployed"]), 1, 0
)

# Drop the call duration and the macroeconomic indicator columns
df_model_data = df_data.drop(
    ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
    axis=1,
)

df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

# Replace "y_no" and "y_yes" with a single label column, and bring it to the front:
df_model_data = pd.concat(
    [
        df_model_data["y_yes"].rename(target_col),
        df_model_data.drop(["y_no", "y_yes"], axis=1),
    ],
    axis=1,
)
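
To see what the encoding step produces, here is a small sketch on a toy frame. The data and the names `toy`, `encoded`, and `model_data` are hypothetical. It uses the same `pd.get_dummies` and `pd.concat` pattern as the cell above and adds `dtype=int` as an assumption, so that the indicator columns are 0/1 integers regardless of the pandas default.

```python
import pandas as pd

# Toy data with one numeric column, one categorical column, and a yes/no target
toy = pd.DataFrame(
    {
        "age": [30, 45, 52],
        "job": ["student", "admin.", "retired"],
        "y": ["no", "yes", "no"],
    }
)

# dtype=int forces 0/1 indicator columns; depending on the pandas version,
# the default indicator dtype may be boolean instead
encoded = pd.get_dummies(toy, dtype=int)

# Keep y_yes as the single label column and move it to the front
model_data = pd.concat(
    [encoded["y_yes"].rename("y"), encoded.drop(["y_no", "y_yes"], axis=1)],
    axis=1,
)
print(model_data.columns.tolist())
print(model_data["y"].tolist())
```

Dropping `y_no` matters: it is the exact complement of `y_yes`, so leaving it among the features would leak the label into training.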

## Exercise 2: Split data
- Prepare data for training, split the dataset

In [19]:
train_data, validation_data, test_data = np.split(
    df_model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(df_model_data)), int(0.9 * len(df_model_data))],
)
print(f"Data split > train:{train_data.shape} | validation:{validation_data.shape} | test:{test_data.shape}")

Data split > train:(28831, 60) | validation:(8238, 60) | test:(4119, 60)
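
The split above puts 70% of the shuffled rows into training, the next 20% into validation, and the last 10% into test. The same cut can be sketched with positional slicing, which avoids passing a DataFrame to `np.split` (newer pandas releases may flag that pattern with a deprecation warning). The toy frame and variable names below are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Shuffle once with a fixed seed so the split is reproducible
shuffled = toy.sample(frac=1, random_state=1729)

# Cut points at 70% and 90% of the rows
train_end = int(0.7 * len(shuffled))
validation_end = int(0.9 * len(shuffled))

train_part = shuffled.iloc[:train_end]
validation_part = shuffled.iloc[train_end:validation_end]
test_part = shuffled.iloc[validation_end:]

print(len(train_part), len(validation_part), len(test_part))
```

Because the three slices are taken from one shuffled frame, every row lands in exactly one of them.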


## Exercise 3: Model training
- Train the model
- Optional: track your model training runs as trials and trial components

In [20]:
# Exercise 3 - write code here
train_features = train_data.drop(target_col, axis=1)
train_label = pd.DataFrame(train_data[target_col])

In [21]:
dtrain = xgb.DMatrix(train_features, label=train_label)

In [22]:
hyperparams = {
    "max_depth": 5,
    "eta": 0.5,
    "alpha": 2.5,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 3,
}

num_boost_round = 150
nfold = 3
early_stopping_rounds = 10

In [23]:
# Cross-validate on training data
cv_results = xgb.cv(
    params=hyperparams,
    dtrain=dtrain,
    num_boost_round=num_boost_round,
    nfold=nfold,
    early_stopping_rounds=early_stopping_rounds,
    metrics=["auc"],
    seed=10,
)
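
With `early_stopping_rounds=10`, cross-validation stops once the validation metric has not improved for 10 consecutive rounds. The following pure-Python sketch illustrates that patience rule only; it is not XGBoost's implementation, and the function name and the per-round scores are hypothetical.

```python
def best_round_with_patience(scores, patience):
    """Return (best_index, stop_index) for a metric where higher is better.

    Scanning stops once `patience` consecutive rounds fail to beat the best score.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    # No early stop: the scan reached the last round
    return best_index, len(scores) - 1


# Hypothetical validation AUC per boosting round
auc_per_round = [0.70, 0.74, 0.76, 0.77, 0.769, 0.768, 0.767]
best, stopped = best_round_with_patience(auc_per_round, patience=3)
print(best, stopped)
```

According to the XGBoost documentation for `xgb.cv`, the last entry of the returned evaluation history represents the best iteration when early stopping is active, which is why the following cells read `cv_results.iloc[-1]`.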

## Exercise 4: Validate model
- Validate the model on the test dataset

In [24]:
metrics_data = {
    "binary_classification_metrics": {
        "validation:auc": {
            "value": cv_results.iloc[-1]["test-auc-mean"],
            "standard_deviation": cv_results.iloc[-1]["test-auc-std"]
        },
        "train:auc": {
            "value": cv_results.iloc[-1]["train-auc-mean"],
            "standard_deviation": cv_results.iloc[-1]["train-auc-std"]
        },
    }
}
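
The `metrics_data` dictionary only lives in memory. If you want to keep it next to your other outputs, a minimal sketch for writing it to a JSON file could look like this. The helper `save_metrics`, the file name `metrics.json`, and the numbers are hypothetical stand-ins; in the notebook you would pass `metrics_data` and `output_path`.

```python
import json
import os
import tempfile

# Hypothetical values standing in for the cross-validation results
metrics_example = {
    "binary_classification_metrics": {
        "validation:auc": {"value": 0.77, "standard_deviation": 0.01},
        "train:auc": {"value": 0.79, "standard_deviation": 0.005},
    }
}


def save_metrics(metrics: dict, directory: str, file_name: str = "metrics.json") -> str:
    """Write the metrics dictionary as JSON and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, file_name)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return path


# A temporary directory keeps this sketch from touching your project files
metrics_path = save_metrics(metrics_example, tempfile.mkdtemp())
with open(metrics_path) as f:
    reloaded = json.load(f)
print(reloaded)
```

If a value taken from a DataFrame turns out not to be JSON serializable, wrapping it in `float()` before saving is a cheap safeguard.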

In [29]:
# Note: num_boost_round is not passed here, so xgb.train uses its default of 10 boosting rounds
model = xgb.train(hyperparams, dtrain)

In [33]:
test_features = test_data.drop(target_col, axis=1)
test_label = pd.DataFrame(test_data[target_col])
dtest = xgb.DMatrix(test_features, label=test_label)
train_pred = model.predict(dtrain)
test_pred = model.predict(dtest)

In [34]:
# Exercise 4 - write code here

test_auc = roc_auc_score(test_label, test_pred)
train_auc = roc_auc_score(train_label, train_pred)
print(f"Train-auc:{train_auc:.2f}, Test-auc:{test_auc:.2f}")

Train-auc:0.79, Test-auc:0.77
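
To build intuition for the AUC values above: ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs. It is an illustration of the definition with a hypothetical helper name, not how `roc_auc_score` is implemented, and it is too slow for large datasets.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Three of the four (positive, negative) pairs are ordered correctly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the reported test AUC of 0.77 sits between the two.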


In [25]:
print(f"Cross-validated train-auc:{cv_results.iloc[-1]['train-auc-mean']:.2f}")
print(f"Cross-validated validation-auc:{cv_results.iloc[-1]['test-auc-mean']:.2f}")

Cross-validated train-auc:0.79
Cross-validated validation-auc:0.77


## Exercise 5: [Optional] explore your experiment in Studio
Refer to [View and Compare Amazon SageMaker Experiments, Trials, and Trial Components](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-view-compare.html) in the developer guide to understand how to work with experiments and trials.

## Exercise 6: [Optional] run the notebook as a SageMaker job
Adapt your notebook code and follow the instructions in [Notebook-based Workflows](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html) to run this notebook headlessly as an on-demand SageMaker job.

## Continue with assignment 2
Navigate to the [assignment 2](02-assignment-sagemaker-containers.ipynb) notebook.