# <p style="text-align: center;"> Starbucks Capstone Challenge </p>

<img src="./Starbucks_Rewards_App.png" width="400" height="300">

## 0. Setting up the notebook

In [1]:
import pandas as pd
import numpy as np
import math
import json
import seaborn as sns
import boto3
import os
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
import matplotlib.pyplot as plt
from dateutil import parser
from datetime import datetime
from sklearn import preprocessing
import plotly.express as px
%matplotlib inline

In [2]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer

In [3]:
# Make sure that we use SageMaker 1.x
!pip install sagemaker==1.72.0

Collecting sagemaker==1.72.0
  Downloading sagemaker-1.72.0.tar.gz (297 kB)
     |████████████████████████████████| 297 kB 39.7 MB/s eta 0:00:01
Collecting smdebug-rulesconfig==0.1.4
  Downloading smdebug_rulesconfig-0.1.4-py2.py3-none-any.whl (10 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-1.72.0-py2.py3-none-any.whl size=386358 sha256=09ea764475ddb4dbf558f4c16e2924d5316fc85f13ff2051d8d89fbefbf54f1d
  Stored in directory: /home/ec2-user/.cache/pip/wheels/c3/58/70/85faf4437568bfaa4c419937569ba1fe54d44c5db42406bbd7
Successfully built sagemaker
Installing collected packages: smdebug-rulesconfig, sagemaker
  Attempting uninstall: smdebug-rulesconfig
    Found existing installation: smdebug-rulesconfig 1.0.1
    Uninstalling smdebug-rulesconfig-1.0.1:
      Successfully uninstalled smdebug-rulesconfig-1.0.1
  Attempting uninstall: sagemaker
    Found existing instal

In [3]:
session = sagemaker.Session()

role = get_execution_role()

prefix = 'starbucks-xgboost-capstone'

## 1. Training a XGBoost model

In the capstone proposal, I planned to use the DeepAR model to predict how much a customer will spend in the next few days. However, while analysing the data I realized that if I wanted to identify the customer, i.e. use the person id in the data, I wouldn't have hourly data to feed the DeepAR model. After several attempts to adapt the data into a correct and large enough dataset to train DeepAR, I concluded that my time would be better invested in another model. Thus, I chose an Amazon SageMaker XGBoost model.

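Since we have a regression problem to solve, namely predicting the transaction value for the next few days, we can use an XGBoost model. XGBoost builds an ensemble of decision trees by gradient boosting: each new tree is fitted to the errors left by the trees before it, and the final prediction is the sum of all tree outputs.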
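To make the boosting idea concrete, here is a minimal sketch in plain Python. It is a toy, not XGBoost's or SageMaker's implementation: it uses a single feature, one-split trees (stumps) and squared error, and the data values and function names are made up for illustration. Only `eta` and `num_round` mirror hyperparameter names used later in this notebook.

```python
# Toy gradient boosting for squared-error regression using one-split trees (stumps).
# Illustration only: this is not XGBoost's or SageMaker's actual implementation.

def fit_stump(xs, residuals):
    """Pick the threshold on a single feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, num_round=50, eta=0.2):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # eta shrinks each tree's contribution (the learning rate)
        preds = [p + eta * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, eta, stumps


def predict_boosted(model, x):
    base, eta, stumps = model
    return base + eta * sum(predict_stump(stump, x) for stump in stumps)


def mean_squared_error(model, xs, ys):
    return sum((predict_boosted(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up data: time of day as the only feature, amount spent as the target.
xs = [0, 6, 12, 18, 0, 6, 12, 18]
ys = [20.0, 17.0, 19.0, 18.5, 21.0, 17.5, 19.5, 18.0]
model = fit_boosted(xs, ys, num_round=50, eta=0.2)
```

Each round reduces the training error a little, which is why the number of rounds and the learning rate are usually tuned together.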

In [4]:
# Obtained from https://github.com/udacity/sagemaker-deployment/blob/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Batch%20Transform)%20-%20High%20Level.ipynb
# We use this utility method to construct the image name for the training container.
container = get_image_uri(session.boto_region_name, 'xgboost')

# Now we can construct the estimator object
xgb = sagemaker.estimator.Estimator(container, # The image name of the training container
                                    role,      # The IAM role to use (our current role in this case)
                                    train_instance_count=1, # The number of instances to use for training
                                    train_instance_type='ml.m4.xlarge', # The type of instance to use for training
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                                                        # Where to save the output (the model artifacts)
                                    sagemaker_session=session) # The current SageMaker session

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
There is a more up to date SageMaker XGBoost image. To use the newer image, please set 'repo_version'='1.0-1'. For example:
	get_image_uri(region, 'xgboost', '1.0-1').
Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


In [5]:
# Obtained from https://github.com/udacity/sagemaker-deployment/blob/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Batch%20Transform)%20-%20High%20Level.ipynb
xgb.set_hyperparameters(max_depth=20,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='reg:linear',
                        early_stopping_rounds=20,
                        num_round=300)
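The combination of `num_round=300` and `early_stopping_rounds=20` means training may stop before all 300 rounds if the validation error stops improving. The sketch below shows that stopping rule in plain Python under simplified assumptions; the function name and the error values are hypothetical, and the real logic lives inside the XGBoost container.

```python
def early_stopping_round(validation_errors, patience=20):
    """Return the index of the best round, giving up once `patience`
    consecutive rounds have passed without a new best validation error."""
    best_round = 0
    best_error = float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            best_round = round_index
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round


# Hypothetical validation errors, one per boosting round.
errors = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 2.0]
stopped_at = early_stopping_round(errors, patience=3)
```

With a patience of 3 the loop gives up before it ever sees the last value, which illustrates the trade-off: a small patience saves time but can miss a late improvement.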

In [6]:
data_dir = 'completed_offers_transactions'

train_location = "s3://sagemaker-us-east-1-839757017467/starbucks-capstone-project/train/train.csv"
val_location   = "s3://sagemaker-us-east-1-839757017467/starbucks-capstone-project/validation/validation.csv"
test_location  = "s3://sagemaker-us-east-1-839757017467/starbucks-capstone-project/test/test.csv"

In [7]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')


's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
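For CSV input, the built-in SageMaker XGBoost algorithm expects files without a header row and with the target in the first column. The following sketch shows how such a file could be produced with the standard library. The column names (`amount`, `time`, `person`, and so on) and the sample values are assumptions for illustration, not the actual schema of `train.csv`.

```python
import csv
import io


def to_training_csv(rows, label_key, feature_keys):
    """Serialize rows as header-less CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[label_key]] + [row[key] for key in feature_keys])
    return buffer.getvalue()


# Hypothetical, already label-encoded rows.
rows = [
    {"amount": 20.5, "time": 0, "person": 9421, "offer_id": 5, "offer_type": 0, "gender": 1},
    {"amount": 17.0, "time": 6, "person": 735, "offer_id": 6, "offer_type": 0, "gender": 0},
]
csv_text = to_training_csv(
    rows, "amount", ["time", "person", "offer_id", "offer_type", "gender"]
)
```

The test file used for batch transform or endpoint calls would contain the same feature columns but no label column.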


In [8]:
# Obtained from https://github.com/udacity/sagemaker-deployment/blob/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Batch%20Transform)%20-%20High%20Level.ipynb

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2021-08-11 00:16:11 Starting - Starting the training job...
2021-08-11 00:16:15 Starting - Launching requested ML instances......
2021-08-11 00:17:12 Starting - Preparing the instances for training.........
2021-08-11 00:18:44 Downloading - Downloading input data...
2021-08-11 00:19:39 Training - Training image download completed. Training in progress..Arguments: train
[2021-08-11:00:19:41:INFO] Running standalone xgboost training.
[2021-08-11:00:19:41:INFO] File size need to be processed in the node: 0.48mb. Available memory size in the node: 8420.02mb
[2021-08-11:00:19:41:INFO] Determined delimiter of CSV input is ','
[00:19:41] S3DistributionType set as FullyReplicated
[00:19:41] 17322x5 matrix with 86610 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-08-11:00:19:41:INFO] Determined delimiter of CSV input is ','
[00:19:41] S3DistributionType set as FullyReplicated
[

## 2. Testing the model

In [9]:
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [10]:
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

In [11]:
xgb_transformer.wait()

.................................
Arguments: serve
[2021-08-11 00:25:45 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-08-11 00:25:45 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-08-11 00:25:45 +0000] [1] [INFO] Using worker: gevent
[2021-08-11 00:25:45 +0000] [20] [INFO] Booting worker with pid: 20
[2021-08-11 00:25:45 +0000] [21] [INFO] Booting worker with pid: 21
[2021-08-11 00:25:45 +0000] [22] [INFO] Booting worker with pid: 22
[2021-08-11 00:25:45 +0000] [23] [INFO] Booting worker with pid: 23
  monkey.patch_all(subprocess=True)
  monkey.patch_all(subprocess=True)
[2021-08-11:00:25:45:INFO] Model loaded successfully for worker : 20
[2021-08-11:00:25:45:INFO] Model loaded successfully for worker : 21
  monkey.patch_all(subprocess=True)
[2021-08-11:00:25:45:INFO] Model loaded successfully for worker : 22
  monkey.patch_all(subprocess=True)
[2021

In [12]:
# Download the output for test set
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-us-east-1-839757017467/xgboost-2021-08-11-00-20-24-726/test.csv.out to completed_offers_transactions/test.csv.out


In [54]:
Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
Y_pred.rename(columns={0:'Predictions'}, inplace=True)

In [55]:
test_to_plot_key = os.path.join(data_dir, 'test_to_plot.csv')

Y_test = pd.read_csv(test_to_plot_key, header=None, usecols=[0])
Y_test.rename(columns={0:'Test Data'}, inplace=True)

#### Plotting predictions

In [56]:
# Creates a dataframe with concatenated data just to plot
df_to_plot = pd.concat([Y_pred, Y_test], axis=1)

# Plot 
fig = px.line(df_to_plot[1100:1200], y=['Predictions', 'Test Data'])

# Show plot 
fig.show()

---

## Deploy the Trained Model

In [27]:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Using already existing model: xgboost-2021-08-11-00-16-11-486


-----------------!

In [28]:
# Path to test dataset
test_key = os.path.join(data_dir, 'test.csv')

# Read the test.csv file
X_test = pd.read_csv(test_key, header=None)
X_test.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 9421 | 5 | 0 | 1 |
| 1 | 0 | 735 | 6 | 0 | 0 |
| 2 | 0 | 6944 | 3 | 0 | 1 |
| 3 | 0 | 1463 | 4 | 0 | 0 |
| 4 | 0 | 11221 | 7 | 1 | 0 |


In [29]:
def predict(data, content_type='text/csv'):
    """Get predictions from the endpoint for a pandas DataFrame of features."""
    # We need to tell the endpoint what format the data we are sending is in
    xgb_predictor.content_type = content_type
    xgb_predictor.serializer = csv_serializer
    
    Y_pred = xgb_predictor.predict(data.values).decode('utf-8')
    # Y_pred is currently a comma-delimited string, so we parse it
    # into a numpy array.
    predictions = np.fromstring(Y_pred, sep=',')
    
    return predictions

In [30]:
def predict_app(data, content_type='text/csv'):
    """Get predictions from the endpoint for a numpy array of features."""
    # We need to tell the endpoint what format the data we are sending is in
    xgb_predictor.content_type = content_type
    xgb_predictor.serializer = csv_serializer
    
    Y_pred = xgb_predictor.predict(data).decode('utf-8')
    # Y_pred is currently a comma-delimited string, so we parse it
    # into a numpy array.
    predictions = np.fromstring(Y_pred, sep=',')
    
    return predictions
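The two helpers above send all rows in a single request. If the input grows, one request may exceed the endpoint's payload limit, so a common workaround is to send the rows in chunks and join the parsed results. The sketch below illustrates that pattern without touching AWS: `fake_predict` is a hypothetical stand-in for the endpoint call, and the chunk size is an arbitrary assumption.

```python
def parse_predictions(response_text):
    """Turn a comma-delimited response string into a list of floats."""
    return [float(value) for value in response_text.split(",") if value.strip()]


def predict_in_chunks(rows, predict_fn, chunk_size=500):
    """Call predict_fn on successive slices of rows and concatenate the results."""
    predictions = []
    for start in range(0, len(rows), chunk_size):
        predictions.extend(predict_fn(rows[start:start + chunk_size]))
    return predictions


def fake_predict(chunk):
    # Hypothetical stand-in for the endpoint: "predicts" the sum of each row
    # and returns it the way the endpoint does, as comma-delimited text.
    response_text = ",".join(str(float(sum(row))) for row in chunk)
    return parse_predictions(response_text)


rows = [[0, 1], [2, 3], [4, 5]]
all_predictions = predict_in_chunks(rows, fake_predict, chunk_size=2)
```

In the notebook, `predict_fn` would be a function such as `predict_app` applied to slices of the input array.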

In [36]:
Y_pred = predict(X_test)
Y_pred

array([20.75829887, 17.2298584 , 19.05839348, ..., 14.55226517,
       19.05433846, 22.88606644])

In [48]:
time_list = [0,6,12,18]
df_to_plot = pd.DataFrame(np.append(time_list, Y_pred[:4]).reshape((2,4)).transpose())
df_to_plot.rename(columns={0:'time', 1:'Predicted Transaction'}, inplace=True)
df_to_plot

|   | time | Predicted Transaction |
|---|---|---|
| 0 | 0.0 | 20.758299 |
| 1 | 6.0 | 17.229858 |
| 2 | 12.0 | 19.058393 |
| 3 | 18.0 | 18.53643 |


### Calculating the metrics

Let's calculate the Mean Squared Logarithmic Error (MSLE) and the Root Mean Squared Error (RMSE) of our model to check its performance:

In [32]:
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error

# Calculate the MSLE (scikit-learn expects y_true first, then y_pred)
msle = mean_squared_log_error(Y_test, Y_pred)

# Calculate the RMSE
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))

print("Mean Squared Logarithmic Error: ", msle)
print("Root Mean Squared Error: ", rmse)

Mean Squared Logarithmic Error:  0.3569353451461626
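For reference, both metrics are easy to write out by hand. MSLE averages the squared difference of `log(1 + y)` between truth and prediction, so it penalizes relative rather than absolute errors; RMSE is the square root of the mean squared difference and is in the same unit as the target. The plain-Python sketch below is only meant to show the formulas; the sample numbers are made up.

```python
import math


def msle(y_true, y_pred):
    """Mean of squared differences between log(1 + y_true) and log(1 + y_pred)."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared difference."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Made-up values, not results from the model above.
y_true = [0.0, 0.0]
y_pred = [3.0, 4.0]
example_rmse = rmse(y_true, y_pred)
example_msle = msle(y_true, y_pred)
```

Because the two metrics live on different scales, they should not produce the same number except by coincidence.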


In [33]:
Y_pred = pd.DataFrame(Y_pred).rename(columns={0:"Y_pred"})
# Creates a dataframe with concatenated data just to plot
df_to_plot = pd.concat([Y_pred, Y_test], axis=1)

# Plot 
fig = px.line(df_to_plot[1100:1200], y=['Y_pred', 'Test Data'])

# Show plot 
fig.show()

In [34]:
xgb_predictor.endpoint

'xgboost-2021-08-11-00-16-11-486'

### Functions to preprocess input data for prediction

In [None]:
# Complete data to get all possible person ID and offers
data_dir = 'completed_offers_transactions'
df = pd.read_csv(data_dir+'/transactions_people_offers.csv', sep=',')

def input_data_prep(df, person_id, offer_id, offer_type):
    """
    Creates an encoded np.array to input into the model to get predictions.
    """
    # Look up the gender recorded for this person
    person_gender = df.loc[df['person'] == person_id, 'gender'].iloc[0]

    values_list = [person_id, offer_id, offer_type, person_gender]
    input_data = np.array(values_list).reshape((1,4))

    # Fit one encoder per column, then encode the single input row
    le_list = get_enconding_list(df)
    input_values = encode_input_data(input_data, le_list)
    
    return input_values


def get_enconding_list(df):
    """Fit one LabelEncoder per categorical column and return them in column order."""
    columns_to_encoding = ['person', 'offer_id', 'offer_type', 'gender']
    le_list = []

    for i in columns_to_encoding:
        # Create a LabelEncoder object
        le = preprocessing.LabelEncoder()
        # Get the list of values for the column
        values_to_encoding = df[i].values
        # Fit the encoder on all possible values of the column
        le.fit(values_to_encoding)

        # Save the LabelEncoder object so we can apply the inverse transform later
        le_list.append(le)

    return le_list

def encode_input_data(input_data, le_list):
    """Encode input data to input into Estimator endpoint."""
    encoded_list = []
    time_list = [0, 6, 12, 18]
    for i in range(0,4):
        le = le_list[i]
        df_input_values = pd.DataFrame(input_data).iloc[0, i:i+1].values
        encoded_list.append(le.transform(df_input_values))
    
    # Repeat the encoded row once per time slot
    encoded_input = np.array(encoded_list).reshape((1,4))
    encoded_input = np.ones((4,1), dtype='int') * encoded_input
    # Prepend a time column and fill it with the four time slots
    encoded_input = np.insert(encoded_input, 0, 0, axis=1)
    encoded_input[:, 0] = time_list
    
    return encoded_input
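The functions above boil down to two steps: map each categorical value to an integer, then repeat the encoded row once for each of the four time slots. The dependency-free sketch below shows the same idea. The identifiers (`p1`, `o2`, `bogo`, and so on) are invented for illustration, and the dictionary-based encoder only mimics the sorted ordering that scikit-learn's `LabelEncoder` uses.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def build_model_rows(encoders, person_id, offer_id, offer_type, gender,
                     time_slots=(0, 6, 12, 18)):
    """Encode one person/offer combination and emit one row per time slot."""
    encoded = [
        encoders["person"][person_id],
        encoders["offer_id"][offer_id],
        encoders["offer_type"][offer_type],
        encoders["gender"][gender],
    ]
    # Column order: time, person, offer_id, offer_type, gender
    return [[slot] + encoded for slot in time_slots]


# Hypothetical category values standing in for the real dataset columns.
encoders = {
    "person": fit_label_encoder(["p2", "p1", "p3"]),
    "offer_id": fit_label_encoder(["o1", "o2"]),
    "offer_type": fit_label_encoder(["bogo", "discount"]),
    "gender": fit_label_encoder(["F", "M", "O"]),
}
model_rows = build_model_rows(encoders, "p2", "o2", "bogo", "M")
```

The encoders must be fitted on the same full set of values that was used when the training data was encoded; otherwise the integers sent to the endpoint would not match what the model learned.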

In [None]:
xgb_predictor.delete_endpoint()

## Deploy the model for the web app

Now that we have seen the model working, we can write some inference code so that the web app can send one of the four possible times of day (0, 6, 12 or 18), the offer id and the person id, and get back a prediction of how much this person will spend in the next few days.

<img src="./assets/Architecture.PNG" width="600" height="600">

## API Gateway

<img src="./assets/API_Gateway.PNG" width="1700" height="1000">

In [None]:
# Api endpoint created
API_ENDPOINT = "https://xpfqnp3i55.execute-api.us-east-1.amazonaws.com/PROD"
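A client such as the web app would call this API with the encoded rows as CSV text in the request body. The sketch below only builds the request object with the standard library and does not send it. The placeholder URL, the `text/plain` content type and the helper name are assumptions, and it presumes the API passes the raw body through to the Lambda function.

```python
import urllib.request


def build_prediction_request(api_endpoint, rows):
    """Build a POST request whose body holds one CSV line per encoded row."""
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    return urllib.request.Request(
        api_endpoint,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


# Placeholder URL; in the notebook this would be API_ENDPOINT.
request = build_prediction_request(
    "https://example.invalid/PROD",
    [[0, 9421, 5, 0, 1], [6, 9421, 5, 0, 1]],
)
```

Sending it would then be a matter of passing `request` to `urllib.request.urlopen` and decoding the response body.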

### Lambda function

In [None]:
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the CSV data we were given
    response = runtime.invoke_endpoint(EndpointName = 'xgboost-2021-08-11-00-16-11-486',    # The name of the endpoint we created
                                       ContentType = 'text/csv',                 # The data format that is expected
                                       Body = event['body'])                       # The CSV-encoded input features

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
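The handler above can only be exercised against a live endpoint. One way to check its logic locally is to pass the runtime client in as a parameter and substitute a fake. The sketch below follows that approach; `make_handler`, `FakeRuntime` and the endpoint name are hypothetical, and the fake returns a canned response instead of calling SageMaker.

```python
import io


def make_handler(runtime_client, endpoint_name):
    """Build a Lambda-style handler around an injected SageMaker runtime client."""
    def handler(event, context):
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=event["body"],
        )
        result = response["Body"].read().decode("utf-8")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "text/plain", "Access-Control-Allow-Origin": "*"},
            "body": result,
        }
    return handler


class FakeRuntime:
    """Stand-in for the boto3 'sagemaker-runtime' client, for local testing only."""

    def __init__(self):
        self.calls = []

    def invoke_endpoint(self, **kwargs):
        self.calls.append(kwargs)
        return {"Body": io.BytesIO(b"20.75,17.22")}


fake_runtime = FakeRuntime()
handler = make_handler(fake_runtime, "hypothetical-endpoint-name")
result = handler({"body": "0,9421,5,0,1\n6,9421,5,0,1"}, None)
```

In the deployed function, the injected client would be `boto3.Session().client('sagemaker-runtime')`, as in the handler above.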

### Results

<img src="./assets/app_result_1.PNG" width="500" height="600">

<img src="./assets/app_result_2.PNG" width="500" height="600">

<img src="./assets/app_result_3.PNG" width="500" height="600">