# Model Building and Deployment

We will use this notebook to build **Logistic Regression** as baseline model. We will further build **Support Vector Classifier-Linear** and **XGBoost** models. Then, we will evaluate the three models based on F1, Precision, Recall and ROC-AUC Scores. The best model will be picked and deployed using AWS-Sagemaker. At last, all the resources will be cleared.

In [1]:
# uncomment if not already installed sagemaker
# !pip install sagemaker

# import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

# import sklearn libraries
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# import sagemaker libraries
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

In [2]:
df = pd.read_csv('./data/clean_data.csv')
df = df.drop(columns='Unnamed: 0')

In [3]:
df.head(5)

Unnamed: 0,bogo,customerid,discount,duration,email,gender,income,informational,mobile,offerid,...,social,time,totalamount,web,2013,2014,2015,2016,2017,2018
0,1,78afa995795e4d85b5d9ceeca43f5fef,0,7,1,0,100000.0,0,1,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,0,0.0,37.67,1,0,0,0,0,1,0
1,1,ff7cb44e72db4112b270560686f97a23,0,5,1,0,39000.0,0,1,4d5c57ea9a6940dd891ad53e9dbe8da0,...,1,0.0,48.31,1,0,0,1,0,0,0
2,0,97b6993c232946d3b6b9f90530ff8073,0,3,1,1,52000.0,1,1,5a8bc65990b245e5a138643cd4eb9837,...,1,0.0,23.43,0,0,0,0,0,1,0
3,1,81848348d5584aef9e7374a07ebe6ea1,0,7,1,0,118000.0,0,1,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,0,0.0,52.24,1,0,0,0,1,0,0
4,0,28f9666945804ab0bfc63f3ec6ae9af1,1,10,1,0,44000.0,0,1,fafdcd668e3743c1bb461111dcafc2a4,...,1,0.0,5.12,1,0,0,0,0,0,1


In [4]:
df = df.drop(columns=['time','customerid','email','informational'])

In [5]:
df.columns

Index(['bogo', 'discount', 'duration', 'gender', 'income', 'mobile', 'offerid',
       'offersuccessful', 'reward', 'social', 'totalamount', 'web', '2013',
       '2014', '2015', '2016', '2017', '2018'],
      dtype='object')

### Normalizing and Splitting the data

In [6]:
# normalizing the data using min max scaler
scaler = preprocessing.MinMaxScaler()

min_max_scaler = preprocessing.MinMaxScaler()

cols_to_scale = ['income','totalamount','duration','reward']
def scaleColumns(df, cols_to_scale):
    for col in cols_to_scale:
        df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(df[col])),columns=[col])
    return df

df = scaleColumns(df, cols_to_scale)
df = df.drop(['offerid'], axis=1)
df.head(5)

Unnamed: 0,bogo,discount,duration,gender,income,mobile,offersuccessful,reward,social,totalamount,web,2013,2014,2015,2016,2017,2018
0,1,0,0.571429,0,0.777778,1,1,0.5,0,0.031366,1,0,0,0,0,1,0
1,1,0,0.285714,0,0.1,1,1,1.0,1,0.040225,1,0,0,1,0,0,0
2,0,0,0.0,1,0.244444,1,0,0.0,1,0.019509,0,0,0,0,0,1,0
3,1,0,0.571429,0,0.977778,1,0,0.5,0,0.043497,1,0,0,0,1,0,0
4,0,1,1.0,0,0.155556,1,0,0.2,1,0.004263,1,0,0,0,0,0,1


In [7]:
# train test split
y = df.filter(['offersuccessful'])
X = df.drop('offersuccessful', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


#### Baseline using sklearn --> LOGISTIC REGRESSION

In [8]:
# declare the object
clf = LogisticRegression(random_state=0)

# train the model
clf.fit(X_train, y_train)

# predict
y_preds_lr = clf.predict(X_test)

data_lr = [['F1', f1_score(y_test, y_preds_lr)], 
           ['Precision', precision_score(y_test, y_preds_lr)],
           ['Recall', recall_score(y_test, y_preds_lr)],
           ['ROC-AUC', roc_auc_score(y_test, y_preds_lr)]]  
  
# Create the pandas DataFrame  
results_lr = pd.DataFrame(data_lr, columns = ['Measure', 'Log_Reg'])  

  y = column_or_1d(y, warn=True)


#### Model1 --> SUPPORT VECTOR CLASSIFIER-LINEAR

In [9]:
# declare object
clf = SVC(gamma='auto', kernel="linear")

# train
clf.fit(X_train, y_train)

# predict
y_preds_svm = clf.predict(X_test)

# evaluate
data_svm = [['F1', f1_score(y_test, y_preds_svm)], 
           ['Precision', precision_score(y_test, y_preds_svm)],
           ['Recall', recall_score(y_test, y_preds_svm)],
           ['ROC-AUC', roc_auc_score(y_test, y_preds_svm)]]  
  
# Create the pandas DataFrame  
results_svm = pd.DataFrame(data_svm, columns = ['Measure', 'SVC'])  


  y = column_or_1d(y, warn=True)


#### Model2 -->XGBoost

In [10]:
# create the object
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)

# train
xgb_model.fit(X_train, y_train)

# predict
y_preds_xg = xgb_model.predict(X_test)

# evaluate
data_xg = [['F1', f1_score(y_test, y_preds_xg)], 
           ['Precision', precision_score(y_test, y_preds_xg)],
           ['Recall', recall_score(y_test, y_preds_xg)],
           ['ROC-AUC', roc_auc_score(y_test, y_preds_xg)]]  
  
# Create the pandas DataFrame  
results_xg = pd.DataFrame(data_xg, columns = ['Measure', 'XGBoost'])  

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


#### Results Comparison and Interpretation

In [11]:
results_temp = results_lr.merge(results_svm, left_on='Measure', right_on='Measure')
results = results_temp.merge(results_xg, left_on='Measure', right_on='Measure')
results

Unnamed: 0,Measure,Log_Reg,SVC,XGBoost
0,F1,0.841543,0.851419,0.911694
1,Precision,0.871918,0.908318,0.896412
2,Recall,0.813213,0.801228,0.927507
3,ROC-AUC,0.852745,0.86415,0.915427


- Clearly both support vector classifier(linear) and xgboost outperforms the baseline.
- <font color="green">XGBoost</font> is the best model with 89% precision and 92% recall.


In [12]:
# define data directory
data_dir = 'starbucks_data'

def make_csv(x, y, filename, data_dir):
    '''Merges features and labels and converts them into one csv file with labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param file_name: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    
    pd.concat([pd.DataFrame(y), pd.DataFrame(x)], axis=1)\
            .to_csv(os.path.join(data_dir, filename), header=False, index=False)
    
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

# convert the data into proper format
make_csv(X_train, y_train, filename='train.csv', data_dir=data_dir)
make_csv(X_test, y_test, filename='test.csv', data_dir=data_dir)

Path created: starbucks_data/train.csv
Path created: starbucks_data/test.csv


### Sagemaker starts!

**This task will be broken down into the following steps:**

- Upload the data to S3. 
- Define xgboost model and a training script.
- Train the model and deploy it.
- Evaluate the deployed classifier.

**NOTE: ** Different models were tried using sklearn and xgboost was picked as the best model.

In [13]:
# establish session with aws
b_session = boto3.session.Session(region_name='eu-central-1')
session = sagemaker.Session(boto_session=b_session)
role='arn:aws:iam::668154071669:role/service-role/AmazonSageMaker-ExecutionRole-20200120T151312'

# create an S3 bucket
bucket = session.default_bucket()

 ###  Upload the data to S3 bucket

In [14]:
# should be the name of directory created to save the features data
data_dir = 'starbucks_data'

# set prefix, a descriptive name for a directory  
prefix = 'starbucks'

# upload all data to S3
input_data = session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

In [15]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

starbucks/test.csv
starbucks/train.csv
Test passed!


In [16]:
import os

# First we make sure that the local directory in which we'd like to store the training and validation csv files exists.
data_dir = './data/xgboost'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

    # creating the validation set from the training set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42)

The documentation for the XGBoost algorithm in SageMaker requires that the training and validation datasets should contain no headers or index and that the label should occur first for each sample.

In [17]:
pd.DataFrame(X_test).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)
pd.concat([y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

In [18]:
# upload the data to the s3 bucket
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)

In [19]:
# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.
# As a matter of convenience, the training and inference code both use the same container.
container = get_image_uri(session.boto_region_name, 'xgboost')

	get_image_uri(region, 'xgboost', '0.90-1').


In [20]:
# First we create a SageMaker estimator object for our model.
xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # What is our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

# And then set the algorithm specific parameters.
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

Now we simply need to attach the training and validation datasets and then ask SageMaker to set up the computation.

In [21]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

In [22]:
# train the model
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2020-02-26 14:02:42 Starting - Starting the training job...
2020-02-26 14:02:43 Starting - Launching requested ML instances...
2020-02-26 14:03:44 Starting - Preparing the instances for training......
2020-02-26 14:04:44 Downloading - Downloading input data...
2020-02-26 14:05:22 Training - Training image download completed. Training in progress..[34mArguments: train[0m
[34m[2020-02-26:14:05:22:INFO] Running standalone xgboost training.[0m
[34m[2020-02-26:14:05:22:INFO] File size need to be processed in the node: 3.57mb. Available memory size in the node: 8515.52mb[0m
[34m[2020-02-26:14:05:22:INFO] Determined delimiter of CSV input is ','[0m
[34m[14:05:22] S3DistributionType set as FullyReplicated[0m
[34m[14:05:22] 29440x16 matrix with 471040 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,[0m
[34m[2020-02-26:14:05:22:INFO] Determined delimiter of CSV input is ','[0m
[34m[14:05:22] S3DistributionType set as FullyReplicated[0m
[34m[14:0


2020-02-26 14:05:34 Uploading - Uploading generated training model
2020-02-26 14:05:34 Completed - Training job completed
Training seconds: 50
Billable seconds: 50


### Deploy the XGBoost Model

In [23]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

-----------!

In [24]:
from sagemaker.predictor import csv_serializer

# We need to tell the endpoint what format the data we are sending is in so that SageMaker can perform the serialization.
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

In [25]:
# We split the data into chunks and send each chunk seperately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])
    
    return np.fromstring(predictions[1:], sep=',')

test_X = pd.read_csv(os.path.join(data_dir, 'test.csv'), header=None).values

predictions = predict(test_X)
predictions = [round(num) for num in predictions]

### Evaluate the model

In [26]:
aws_xg = [['F1', f1_score(y_test, predictions)], 
           ['Precision', precision_score(y_test, predictions)],
           ['Recall', recall_score(y_test, predictions)],
           ['ROC-AUC', roc_auc_score(y_test, predictions)]] 

# Create the pandas DataFrame  
aws_results_xg = pd.DataFrame(data_xg, columns = ['Measure', 'XGBoost'])    

In [27]:
aws_results_xg

Unnamed: 0,Measure,XGBoost
0,F1,0.911694
1,Precision,0.896412
2,Recall,0.927507
3,ROC-AUC,0.915427


### Clear the AWS resources- Very Important!

In [28]:
xgb_predictor.delete_endpoint()

In [29]:
# deleting bucket
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '7273072B96AA6CD3',
   'HostId': '7bfzdy+zoMRIB/ys4AGHB4HNWqUq+jyEq9PpipBUnNVXMvMTK4utdh/JRJtnJYM6F6DFwJ+Pxwo=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': '7bfzdy+zoMRIB/ys4AGHB4HNWqUq+jyEq9PpipBUnNVXMvMTK4utdh/JRJtnJYM6F6DFwJ+Pxwo=',
    'x-amz-request-id': '7273072B96AA6CD3',
    'date': 'Wed, 26 Feb 2020 14:17:59 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'starbucks/output/xgboost-2020-02-26-14-02-41-707/output/model.tar.gz'},
   {'Key': 'starbucks/train.csv'},
   {'Key': 'starbucks/validation.csv'},
   {'Key': 'starbucks/test.csv'}]}]

### Interpreting the results

In [30]:
aws_results_xg

Unnamed: 0,Measure,XGBoost
0,F1,0.911694
1,Precision,0.896412
2,Recall,0.927507
3,ROC-AUC,0.915427


In [40]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predictions))

[[10268  1113]
 [  678  9585]]


- There are 1113 false positives and 678 false negatives in the data.
- ROC-AUC scores tells the degree of separability. It tells us how the model is capable of separating/distinguishing between the two classes. Our model has ROC-AUC score of 91% which indicates a good separability between the classes.
- Precision is of 89% and recall is of 92%.
- F1 is of 91% which is quite a good figure.