# SageMaker demo

SageMaker offers three options for training and deploying models:
1. A standard SageMaker model, such as XGBoost.
2. A custom `sklearn` model using a pre-built SageMaker scikit-learn Docker image.
3. Any other custom-built model using your own custom image.

In this notebook we demonstrate options 1 and 2, which can be done without writing a `Dockerfile`. Option 3 involves a bit more engineering.

For part one we start by following this [step-by-step guide](https://aws.amazon.com/getting-started/tutorials/build-train-deploy-machine-learning-model-sagemaker/) on how to build, train, and deploy a model with SageMaker.

In [16]:
import boto3
import os

bucket_name = 'playbucket-steven'

## Example data

Here we are going to predict Black Friday sales based on this [Kaggle data set](https://www.kaggle.com/mehdidag/black-friday/version/1).
- It's around 5 MB
- It has roughly 538k rows and 12 columns

### Getting data from S3

Boto is the AWS SDK for Python. Access S3 using boto:

In [3]:
# create s3 resource
s3 = boto3.resource('s3')

# Print out bucket names
for bucket in s3.buckets.all():
    print(bucket.name)
    # print out file names in bucket, called 'keys'
    for objects in bucket.objects.all():
        print('- ', objects.key)

Amazon S3 is a storage service and does not provide compute, so it cannot compress or decompress zip files for you. Instead, you need to write a program that:
- Downloads the zip file
- Extracts the files
- Processes the extracted files

In [6]:
import pandas as pd
import zipfile

# Download if not there yet
if not os.path.isfile('./data/black-friday.zip'):
    s3.Bucket(bucket_name).download_file('black-friday.zip', 'data/black-friday.zip')

# Extract and read with pandas
zf = zipfile.ZipFile('data/black-friday.zip') 
df = pd.read_csv(zf.open('BlackFriday.csv'))    
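
The extract-and-read step can also be tried without S3. The block below is a minimal sketch that builds a tiny zip archive in memory and reads a CSV member from it using only the standard library. The helper name `read_csv_from_zip`, the file name `sample.csv`, and the sample rows are made up for illustration and are not part of the Black Friday data.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes, member):
    """Return the rows of a CSV file stored inside a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            # Wrap the binary member so the csv module can read it as text
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a small archive in memory (hypothetical data, for illustration only)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.csv", "Marital_Status,Purchase\n0,8370\n1,15200\n")

rows = read_csv_from_zip(buffer.getvalue(), "sample.csv")
print(rows[0]["Purchase"])
```

In the notebook itself, `pd.read_csv(zf.open(...))` does the same job in one line; the sketch only makes the individual steps visible.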

### Examine and preprocess dataset

Target column to predict: `Purchase`, the purchase amount in dollars.

In [7]:
df.shape

(537577, 12)

Create X and y data sets

In [9]:
X = df.iloc[:, 7:] # keep only the numeric columns to avoid one-hot encoding
y = X.pop('Purchase')

Split the data into train and test sets; the models are created in the approaches below.

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Approach 1: Using a standard `Sagemaker` model

To prepare the data, train the machine learning model, and deploy it, you will need to import some libraries and define a few environment variables in your Jupyter notebook environment. 

In [15]:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np                                
import pandas as pd                               
import matplotlib.pyplot as plt                   
from IPython.display import Image                 
from IPython.display import display               
from time import gmtime, strftime                 
from sagemaker.predictor import csv_serializer   

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + containers[my_region] + " container for your SageMaker endpoint.")

Success - the MySageMakerInstance is in the eu-west-1 region. You will use the 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest container for your SageMaker endpoint.


To use a SageMaker pre-built XGBoost model, you will need to reformat the header and first column of the training data and load the data from the S3 bucket. 

We use the built-in [SageMaker XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) algorithm. Also check out this [example XGBoost notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb).

For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input does not have the label column.
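
To make the two CSV layouts concrete, here is a small sketch using only the standard library. The helper `to_xgboost_csv` and the sample values are hypothetical; the notebook itself produces the training file with `pd.concat(...).to_csv(...)`.

```python
import csv
import io


def to_xgboost_csv(features, labels=None):
    """Serialise rows as header-less CSV; labels, if given, go in the first column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(features):
        if labels is not None:
            writer.writerow([labels[i], *row])  # training layout: target first
        else:
            writer.writerow(list(row))          # inference layout: features only
    return out.getvalue()


features = [[0, 3, 4.0, 12.0], [1, 1, 6.0, 14.0]]  # hypothetical feature rows
labels = [8370, 15200]                             # hypothetical Purchase values

train_csv = to_xgboost_csv(features, labels)
inference_csv = to_xgboost_csv(features)
print(train_csv)
print(inference_csv)
```

The training payload carries the target in column 0 and no header row, while the inference payload drops the target column entirely.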



In [21]:
# write training data to csv in the required XGBoost format
pd.concat([y_train, X_train], axis=1).to_csv('./data/train.csv', index=False, header=False)

# upload to bucket used for training
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('./data/train.csv')

# load training data for sagemaker
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')


Next, you need to set up the SageMaker session, create an instance of the XGBoost model (an estimator), and define the model’s hyperparameters. 

Note, for a regression task you will need to set `objective='reg:linear'`.

In [30]:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[my_region],
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(max_depth=5,eta=0.2,gamma=4,min_child_weight=6,subsample=0.8,silent=0,objective='reg:linear',num_round=100)  # set objective function here

With the data loaded and the XGBoost estimator set up, train the gradient-boosted model on an ml.m4.xlarge instance by calling `fit`.

This kicks off a training job (visible in the SageMaker console). The training runs on that separate instance, not on this notebook instance!

In [31]:
xgb.fit({'train': s3_input_train})

INFO:sagemaker:Creating training-job with name: xgboost-2019-04-05-09-29-20-327


2019-04-05 09:29:20 Starting - Starting the training job...
2019-04-05 09:29:21 Starting - Launching requested ML instances......
2019-04-05 09:30:24 Starting - Preparing the instances for training......
2019-04-05 09:31:45 Downloading - Downloading input data
2019-04-05 09:31:45 Training - Downloading the training image.
Arguments: train
[2019-04-05:09:31:53:INFO] Running standalone xgboost training.
[2019-04-05:09:31:53:INFO] Path /opt/ml/input/data/validation does not exist!
[2019-04-05:09:31:53:INFO] File size need to be processed in the node: 5.15mb. Available memory size in the node: 8388.12mb
[2019-04-05:09:31:53:INFO] Determined delimiter of CSV input is ','
[09:31:53] S3DistributionType set as FullyReplicated
[09:31:53] 360176x4 matrix with 1190311 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[09:31:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 54 extra node

### Deploy the model

You will deploy the trained model to an endpoint, serialize the test data as CSV, and then call the endpoint to create predictions.

In [32]:
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m5.large')

INFO:sagemaker:Creating model with name: xgboost-2019-04-05-09-38-06-958
INFO:sagemaker:Creating endpoint with name xgboost-2019-04-05-09-29-20-327


---------------------------------------------------------------------------!

Making a prediction

In [35]:
# load the test features into a NumPy array (`as_matrix()` is deprecated in favour of `.values`)
X_test_array = X_test.values

xgb_predictor.content_type = 'text/csv' # set the data type for an inference
xgb_predictor.serializer = csv_serializer # set the serializer type

predictions = xgb_predictor.predict(X_test_array).decode('utf-8') # predict!

predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)

(177401,)
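
The exact text returned by an endpoint depends on the container, so it can help to parse the response defensively. The sketch below assumes the payload is plain text with comma- or newline-separated numbers, optionally wrapped in square brackets. The helper `parse_predictions` is hypothetical and not part of the SageMaker SDK.

```python
def parse_predictions(payload):
    """Parse a text prediction payload into a list of floats.

    Assumes comma- or newline-separated numbers, optionally wrapped in [ ].
    """
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    cleaned = payload.strip().strip("[]")
    tokens = cleaned.replace("\n", ",").split(",")
    # Skip empty tokens, e.g. from a trailing newline or an empty payload
    return [float(token) for token in tokens if token.strip()]


print(parse_predictions(b"1.5,2.0,3.25"))
print(parse_predictions("[1.5, 2.0]"))
```

Checking the length of the parsed list against `len(X_test)` is a quick way to confirm that every row received a prediction.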


### Evaluate your model

In this step, you will evaluate the model on the held-out test set using the R² score.


In [41]:
from sklearn.metrics import r2_score

r2_score(y_test, predictions_array, multioutput='variance_weighted')

0.646891500834572
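
For reference, the R² score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The plain-Python sketch below spells that out; the function name `r2` and the sample numbers are illustrative only. With a single target column, as here, the `multioutput` setting should not change the result.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot


# Hypothetical values, for illustration only
print(r2([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of `y_test`.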

### Terminate your resources

Delete the SageMaker endpoint when you are done to avoid ongoing charges, and optionally delete the objects in your S3 bucket:

In [58]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)

# bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
# bucket_to_delete.objects.all().delete()

INFO:sagemaker:Deleting endpoint with name: xgboost-2019-04-05-09-29-20-327


## Approach 2: Using a custom `sklearn` model

Let's first develop a custom model and a grid search pipeline in the notebook.

In [12]:
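from sklearn.preprocessing import Imputer  # replaced by sklearn.impute.SimpleImputer in scikit-learn 0.20+
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20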


pipeline = Pipeline([
    ('imputer', Imputer()),
    ('regressor', DecisionTreeRegressor())
])


params = {'regressor__max_depth': [2, 3, 4, 5]}

# 5-fold cross-validated grid search over the tree depth
grid_search = GridSearchCV(pipeline, 
                           n_jobs=-1,
                           param_grid=params, 
                           cv=5)

grid_search.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('regressor', DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'regressor__max_depth': [2, 3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
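
The parameter name `regressor__max_depth` follows the pipeline convention `<step name>__<parameter name>`, and the grid search tries every combination of the listed values. The sketch below illustrates both ideas with the standard library only; `expand_grid` and `split_param` are hypothetical helpers, not scikit-learn functions.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def split_param(name):
    """Split 'step__param' into (step, param)."""
    step, _, param = name.partition("__")
    return step, param


params = {'regressor__max_depth': [2, 3, 4, 5]}
candidates = list(expand_grid(params))

n_folds = 5
print(len(candidates), "candidates x", n_folds, "folds =", len(candidates) * n_folds, "fits")
print(split_param('regressor__max_depth'))
```

With four candidate depths and `cv=5`, the search fits 20 models, followed by one final refit of the best candidate on the full training set because `refit=True`.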

In [13]:
grid_search.predict(X_test)

array([13330.77036981, 13330.77036981,  8976.93940026, ...,
       13780.45761711,  4710.39645491,  6412.51659019])

In [14]:
grid_search.score(X_test, y_test)

0.493346139146699

### Now what?

We have a model; how do we deploy it and create an endpoint?

Use the pre-built sklearn image, as is done in this [tutorial](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/Scikit-learn%20Estimator%20Example%20With%20Batch%20Transform.ipynb).

In [45]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'pipeline.py'

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.m5.large",
    role=role,
    sagemaker_session=sess,
    hyperparameters={'regressor_max_depth': 4})  # hyperparameters that are not tuned

The setup above is probably meant for training a single, standard model. For now the grid search also lives inside `pipeline.py`, but it could be removed from the script and handled instead by the hyperparameter tuning jobs that SageMaker offers, see [here](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html).
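
The contents of `pipeline.py` are not shown in this notebook. As a rough sketch of how such an entry-point script usually receives its inputs, the block below parses the hyperparameter passed to the estimator as a command-line argument and reads the data and model locations from the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables set inside the training container. The function name `parse_args` and the local fallback paths are assumptions for illustration, not the author's actual script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameter passed via SKLearn(hyperparameters={'regressor_max_depth': 4})
    parser.add_argument("--regressor_max_depth", type=int, default=4)
    # Locations set by the training container; the fallbacks are hypothetical
    # defaults that only matter when running the script outside SageMaker
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "./data"))
    return parser.parse_args(argv)


# Simulate the arguments the container would pass to the script
args = parse_args(["--regressor_max_depth", "3"])
print(args.regressor_max_depth, args.model_dir, args.train)
```

A complete script would then load the CSV files from `args.train`, fit the pipeline, save it under `args.model_dir`, and define a `model_fn(model_dir)` function so the hosting container can load the model.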

In [54]:
# train_input = sess.upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY) )

# sklearn.fit({'train': train_input})
sklearn.fit({'train': s3_input_train})

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2019-04-05-14-25-08-068


2019-04-05 14:25:08 Starting - Starting the training job...
2019-04-05 14:25:09 Starting - Launching requested ML instances......
2019-04-05 14:26:11 Starting - Preparing the instances for training......
2019-04-05 14:27:35 Downloading - Downloading input data
2019-04-05 14:27:35 Training - Training image download completed. Training in progress..
2019-04-05 14:27:35,369 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-04-05 14:27:35,372 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-04-05 14:27:35,394 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-04-05 14:27:35,756 sagemaker-containers INFO     Module pipeline does not provide a setup.py. 
Generating setup.py
2019-04-05 14:27:35,757 sagemaker-containers INFO     Generating setup.cfg
2019-04-05 14:27:35,757 sagemaker-containers INFO     Generating MANIFEST.in

#### Deploy the model 

Deploying the model to SageMaker hosting just requires a deploy call on the fitted model. This call takes an instance count and instance type.

In [55]:
sklearn_predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m5.large")

INFO:sagemaker:Creating model with name: sagemaker-scikit-learn-2019-04-05-14-25-08-068
INFO:sagemaker:Creating endpoint with name sagemaker-scikit-learn-2019-04-05-14-25-08-068


--------------------------------------------------------------!

In [57]:
# same as for the xgboost inference job

sklearn_predictor.content_type = 'text/csv' # set the data type for an inference
sklearn_predictor.serializer = csv_serializer # set the serializer type

sklearn_predictions = sklearn_predictor.predict(X_test_array)#.decode('utf-8') # predict!

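# NOTE: the shape printed below is (0,), so the XGBoost-style parsing does not carry over
# as-is; the response of the sklearn endpoint needs to be inspected first
sklearn_predictions_array = np.fromstring(sklearn_predictions[1:], sep=',') # and turn the prediction into an array
print('predictions shape:', sklearn_predictions_array.shape)

# NOTE: this scores `predictions_array` (the XGBoost predictions from Approach 1), not
# `sklearn_predictions_array`, so the value printed below repeats the XGBoost score
print('r2_score:', r2_score(y_test, predictions_array, multioutput='variance_weighted'))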

predictions shape: (0,)
r2_score: 0.646891500834572


Done.