# Predicting Appointment No-Shows with Amazon SageMaker Linear Learner
_**Supervised Learning with Logistic Regression: A Binary Prediction Problem**_

---

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Transformation](#Transformation)
1. [Training](#Training)
1. [Hosting](#Hosting)
1. [Evaluation](#Evaluation)
1. [Extensions](#Extensions)

---

## Background


This notebook uses the linear learner algorithm to predict whether a patient will be a no-show for a medical appointment. The dataset was downloaded from Kaggle: https://www.kaggle.com/joniarroba/noshowappointments. It contains 110,527 appointment records collected from medical clinics in the city of Vitoria, Brazil, over a three-month period in 2016.

The following steps were undertaken:

* Preparing your Amazon SageMaker notebook
* Downloading data from the internet into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Linear Learner algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

---

## Preparation

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

## Notes


Change the data files to recordIO-wrapped protobuf format to see if I can get the predictions working

In [45]:
bucket = 'sagemaker-sf-strategenics'
prefix = 'sagemaker/DEMO-xgboost-noShow'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Bring in the Python libraries that we'll use throughout the analysis

In [None]:
from datetime import date
from datetime import time
from datetime import datetime
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
from sagemaker.predictor import csv_serializer    # Converts strings for HTTP POST requests on inference
import seaborn as sns

---

## Data
The csv file containing the data is stored in an S3 bucket. First let's read the data file into a Pandas data frame.

In [3]:
data_key = 'appointmentData.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

noShow= pd.read_csv(data_location)

pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
noShow
noShow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


Let's start by renaming some of the columns

In [4]:
noShow.rename(columns = {'Hipertension': 'Hypertension',
                         'Handcap': 'Disabilities'}, inplace = True)

print(noShow.columns)

Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hypertension',
       'Diabetes', 'Alcoholism', 'Disabilities', 'SMS_received', 'No-show'],
      dtype='object')


As we have the date that the appointment was scheduled and the date of the appointment, we can calculate the number of days that the patient waited for the appointment.

First we have to convert the two date columns to a date format.

In [5]:
#convert date columns to a date format
noShow['tempSchedDate'] = pd.to_datetime(noShow['ScheduledDay'])
noShow['tempAppDate'] = pd.to_datetime(noShow['AppointmentDay'])
#get the date part of the date columns, as the Scheduled date has a time component but the appointment day does not
noShow['AppointmentDate']= noShow['tempAppDate'].dt.date
noShow['AppointmentBooked']= noShow['tempSchedDate'].dt.date
#calculate the waiting time
noShow['WaitingTime'] = (noShow.AppointmentDate - noShow.AppointmentBooked).dt.days

print(noShow.AppointmentBooked.head())
print(noShow.AppointmentDate.head())
print(noShow.WaitingTime.head())
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page


0    2016-04-29
1    2016-04-29
2    2016-04-29
3    2016-04-29
4    2016-04-29
Name: AppointmentBooked, dtype: object
0    2016-04-29
1    2016-04-29
2    2016-04-29
3    2016-04-29
4    2016-04-29
Name: AppointmentDate, dtype: object
0    0
1    0
2    0
3    0
4    0
Name: WaitingTime, dtype: int64
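
The waiting-time calculation above subtracts two columns of Python `date` objects. A minimal alternative sketch, assuming the raw columns hold ISO timestamps like the hypothetical sample rows below, keeps both columns as pandas datetimes and uses `dt.normalize()` to discard the time of day, so the difference is a proper timedelta column:

```python
import pandas as pd

# Hypothetical sample rows; the timestamp format is an assumption about the raw CSV.
sample = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z", "2016-05-10T10:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z", "2016-05-09T00:00:00Z"],
})

scheduled = pd.to_datetime(sample["ScheduledDay"])
appointment = pd.to_datetime(sample["AppointmentDay"])

# normalize() sets the time of day to midnight but keeps the datetime dtype,
# so the subtraction yields timedeltas and .dt.days is available.
sample["WaitingTime"] = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days

print(sample["WaitingTime"].tolist())
```

The third sample row is booked after its appointment date and therefore produces a negative waiting time, the same kind of inconsistent record that is set to missing later in the notebook.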


Calculate the day of the week for the appointment day and drop the original scheduled and appointment columns.

In [6]:
#Find the day of the week of the appointment
noShow['DayOfWeek'] = noShow['tempAppDate'].dt.day_name()

#drop the columns no longer needed
noShow = noShow.drop(['ScheduledDay','AppointmentDay','tempSchedDate','tempAppDate'], axis=1)

print(noShow.AppointmentDate.head())
print(noShow.DayOfWeek.head())
print('Day of week:', sorted(noShow.DayOfWeek.unique()))

0    2016-04-29
1    2016-04-29
2    2016-04-29
3    2016-04-29
4    2016-04-29
Name: AppointmentDate, dtype: object
0    Friday
1    Friday
2    Friday
3    Friday
4    Friday
Name: DayOfWeek, dtype: object
Day of week: ['Friday', 'Monday', 'Saturday', 'Thursday', 'Tuesday', 'Wednesday']


There are no duplicate Appointment IDs, so we will index the data by AppointmentID. We also need to convert PatientId to an integer.

In [7]:
noShow.PatientId = noShow.PatientId.astype('int64')
print(noShow.PatientId.head())
noShow.set_index('AppointmentID', inplace = True)

0     29872499824296
1    558997776694438
2      4262962299951
3       867951213174
4      8841186448183
Name: PatientId, dtype: int64
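
The uniqueness claim can be checked in code before the index is set. A small sketch on made-up rows (the values are illustrative, not taken from the dataset): `Series.is_unique` reports whether a column has duplicates, and `set_index(..., verify_integrity=True)` refuses to build an index that contains them.

```python
import pandas as pd

# Illustrative rows only; the IDs and ages are invented for this sketch.
toy = pd.DataFrame({"AppointmentID": [5642903, 5642503, 5642549], "Age": [62, 56, 8]})

ids_unique = toy["AppointmentID"].is_unique
indexed = toy.set_index("AppointmentID", verify_integrity=True)

# A frame with a repeated ID is rejected instead of silently indexed.
dup = pd.DataFrame({"AppointmentID": [1, 1], "Age": [30, 31]})
try:
    dup.set_index("AppointmentID", verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(ids_unique, duplicate_rejected)
```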


In [None]:
noShow
noShow.info()


# Data Summary

* There are 110,527 appointment records, with 13 features each
* The features are mixed; some numeric, some categorical


*Features:*
* `Age`: Patient's age. Integer, -1 to 115
* `Gender`: Patient's gender. String, M or F
* `Alcoholism`: Binary, 1=yes
* `Diabetes`: Binary, 1=yes
* `Hypertension`: Binary, 1=yes
* `Disabilities`: The number of disabilities for a patient. Integer, 0-4
* `Scholarship`: Whether the patient receives financial support from the government. Binary, 1=yes
* `Neighbourhood`: The location of the medical clinic. String, 80 values
* `SMS_received`: Whether the patient received an SMS reminder before the appointment. Binary, 1=yes
* `AppointmentBooked`: Date that the appointment was booked
* `AppointmentDate`: Date of the appointment
* `DayOfWeek`: The weekday of the appointment. String, 6 values (Monday to Saturday)
* `WaitingTime`: The number of days between booking the appointment and the appointment date. Integer, -6 to 179


*Target variable:*
* `No-show`: Was the patient a no-show? String, Yes or No (recoded below as 1/0 in `NoShow`)

Overall, about 20% of the appointments (22,319 of 110,527) were no-shows.

In [8]:
noShow['Age'] = np.where(noShow['Age']<0, np.nan, noShow['Age'])
noShow['Age'].describe()
noShow['WaitingTime'] = np.where(noShow['WaitingTime']<0, np.nan, noShow['WaitingTime'])
noShow['WaitingTime'].describe()

count    110522.000000
mean         10.184253
std          15.255115
min           0.000000
25%           0.000000
50%           4.000000
75%          15.000000
max         179.000000
Name: WaitingTime, dtype: float64
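
The `np.where(..., np.nan, ...)` pattern above marks impossible values as missing. An equivalent pandas idiom, shown here on made-up ages, is `Series.mask`, which replaces every value for which the condition holds with `NaN`:

```python
import pandas as pd

# Made-up ages, including one impossible negative value.
ages = pd.Series([-1, 0, 35, 115])

# mask() blanks out the entries where the condition is True.
cleaned = ages.mask(ages < 0)

print(cleaned.tolist())
print(cleaned.count())  # number of non-missing values
```

Either way the column becomes floating point, because `NaN` has no integer representation in a plain NumPy-backed column. This is why `Age` and `WaitingTime` show up as `float64` later on.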

`Disabilities` has values from 0 to 4, indicating the number of disabilities a patient has. We will turn this into a binary column, `Disability`, to indicate whether the patient has a disability or not.

In [9]:
noShow['Disability'] = np.where(noShow['Disabilities']>1, 1, noShow['Disabilities'])
print('Disability:', sorted(noShow.Disability.unique()))
count = noShow.groupby(['Disability', 'Disabilities']).size() 
print(count)  


Disability: [0, 1]
Disability  Disabilities
0           0               108286
1           1                 2042
            2                  183
            3                   13
            4                    3
dtype: int64


Now we will look at patient history, as it is possible that people who have a previous no-show are more likely to no-show again.

In [10]:
#determine if a patient has had a previous appointment
noShow.sort_values(by=['PatientId','AppointmentDate'], inplace=True)
pd.options.display.max_rows=100
noShow['PreviousAppointment'] = noShow.sort_values(by = ['PatientId','AppointmentDate']).groupby(['PatientId']).cumcount()
#print(noShow[['PatientId','AppointmentDate', 'PreviousAppointment']].head(100)) 


a = noShow.groupby(pd.cut(noShow.PreviousAppointment, bins = [-1, 0,1,2,3,4,5, 85], include_lowest = True))[['PreviousAppointment']].count()
b = pd.DataFrame(a)
b.set_index(pd.Series(['0', '1', '2', '3', '4', '5', '> 5']))

Unnamed: 0,PreviousAppointment
0,62299
1,24379
2,10484
3,4984
4,2617
5,1498
> 5,4264
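
The binning above uses a fixed upper edge of 85, while the summary statistics later in the notebook report a maximum `PreviousAppointment` of 87, so values above the last edge fall outside every bin. A hedged sketch of the same binning with an open-ended top bin (`np.inf` as the last edge), run on made-up counts:

```python
import numpy as np
import pandas as pd

# Made-up numbers of previous appointments, including one very large value.
previous = pd.Series([0, 0, 1, 2, 5, 6, 87])

labels = ["0", "1", "2", "3", "4", "5", "> 5"]
edges = [-1, 0, 1, 2, 3, 4, 5, np.inf]  # the last bin has no upper limit

binned = pd.cut(previous, bins=edges, labels=labels)

# Count rows per label, keeping empty bins in the table.
counts = binned.astype(str).value_counts().reindex(labels, fill_value=0)
print(counts)
```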


Calculate whether the patient has a previous no-show.

In [11]:
noShow['NoShow']=np.where(noShow['No-show'] == "Yes", 1,0)
count = noShow.groupby(['No-show', 'NoShow']).size() 
print(count)

#noShow['PreviousNoShow'] = (noShow[noShow['PreviousAppointment'] > 0].sort_values(['PatientId', 'AppointmentDate']).groupby(['PatientId'])['NoShow'].cumsum())
noShow['NumberOfPreviousNoShow'] = (noShow.sort_values(['PatientId', 'AppointmentDate']).groupby(['PatientId'])['NoShow'].cumsum())
noShow['PreviousNoShows']=noShow['NumberOfPreviousNoShow']-noShow['NoShow']
noShow['PreviousNoShowProp'] = noShow['PreviousNoShows']/ noShow[noShow['PreviousAppointment'] > 0]['PreviousAppointment']

pd.options.display.max_rows=100
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)
print(noShow[['PatientId','AppointmentDate', 'No-show', 'PreviousAppointment','PreviousNoShows','PreviousNoShowProp']].head(100)) 

noShow['PreviousNoShowProp'].describe()

No-show  NoShow
No       0         88208
Yes      1         22319
dtype: int64
               PatientId AppointmentDate No-show  PreviousAppointment  PreviousNoShows  PreviousNoShowProp
AppointmentID                                                                                             
5751990            39217      2016-06-03      No                    0                0                 NaN
5760144            43741      2016-06-01      No                    0                0                 NaN
5712759            93779      2016-05-18      No                    0                0                 NaN
5637648           141724      2016-05-02      No                    0                0                 NaN
5637728           537615      2016-05-06      No                    0                0                 NaN
5680449          5628261      2016-05-13     Yes                    0                0                 NaN
5718578         11831856      2016-05-19      No                 

count    48228.000000
mean         0.198413
std          0.343029
min          0.000000
25%          0.000000
50%          0.000000
75%          0.333333
max          1.000000
Name: PreviousNoShowProp, dtype: float64
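
The history features depend on the difference between a cumulative sum that includes the current appointment and one that stops just before it. The sketch below reproduces the same steps on a made-up five-row history; the `Prior...` column names are chosen for this example only. The inclusive sum already contains the outcome of the current row, so only the prior count is known before the appointment takes place.

```python
import pandas as pd

# Made-up appointment history for two patients.
visits = pd.DataFrame({
    "PatientId": [1, 1, 1, 2, 2],
    "AppointmentDate": pd.to_datetime(
        ["2016-05-02", "2016-05-09", "2016-05-16", "2016-05-03", "2016-05-10"]),
    "NoShow": [1, 0, 1, 0, 1],
}).sort_values(["PatientId", "AppointmentDate"])

# Number of earlier appointments for the same patient.
visits["PriorAppointments"] = visits.groupby("PatientId").cumcount()

# Inclusive cumulative sum counts the current row's outcome as well.
inclusive = visits.groupby("PatientId")["NoShow"].cumsum()

# Subtracting the current outcome leaves only earlier no-shows.
visits["PriorNoShows"] = inclusive - visits["NoShow"]

# Proportion of earlier appointments missed; undefined for a first appointment.
denominator = visits["PriorAppointments"].where(visits["PriorAppointments"] > 0)
visits["PriorNoShowProp"] = visits["PriorNoShows"] / denominator

print(visits)
```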

In [12]:
#drop columns that we no longer need
noShow = noShow.drop(['NoShow','AppointmentBooked','PreviousNoShows'], axis=1)
noShow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110527 entries, 5751990 to 5660958
Data columns (total 18 columns):
PatientId                 110527 non-null int64
Gender                    110527 non-null object
Age                       110526 non-null float64
Neighbourhood             110527 non-null object
Scholarship               110527 non-null int64
Hypertension              110527 non-null int64
Diabetes                  110527 non-null int64
Alcoholism                110527 non-null int64
Disabilities              110527 non-null int64
SMS_received              110527 non-null int64
No-show                   110527 non-null object
AppointmentDate           110527 non-null object
WaitingTime               110522 non-null float64
DayOfWeek                 110527 non-null object
Disability                110527 non-null int64
PreviousAppointment       110527 non-null int64
NumberOfPreviousNoShow    110527 non-null int64
PreviousNoShowProp        48228 non-null float64
dtypes: 

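In [None]:
# `PreviousNoShows` was dropped in the previous cell, so cross-tabulate against the remaining cumulative count instead
display(pd.crosstab(noShow.PreviousAppointment, noShow.NumberOfPreviousNoShow))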

In [None]:
noShow.info()

Change age and waiting time to categorical columns


In [14]:
def WaitingTimeCat(days):
    if days == 0:
        return '0 days'
    elif  days in range(1,8):
        return '1-7 days'
    elif  days in range(8,15):
        return '8-14 days'
    elif days in range(15, 29):
        return '15-28 days'
    else:
        return '> 28 days'
    
def AgeCat(years):
    if years in range(0,5):
        return '0-4 years'
    elif  years in range(5,15):
        return '05-14 years'
    elif  years in range(15,25):
        return '15-24 years'
    elif years in range(25, 35):
        return '25-34 years'
    elif years in range(35, 45):
        return '35-44 years'
    elif years in range(45, 55):
        return '45-54 years'
    elif years in range(55, 65):
        return '55-64 years'
    else:
        return '> 64 years'   
noShow['WaitingTimeCat'] = noShow.WaitingTime.apply(WaitingTimeCat)
noShow['AgeCat'] = noShow.Age.apply(AgeCat)
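
One detail of the range-based helpers above: a missing value is never `in range(...)`, so it falls through to the final `else` branch and is labelled as the top category. A minimal alternative sketch using `pd.cut`, which leaves missing values missing; the sample values are made up and the bin edges mirror the waiting-time categories above:

```python
import numpy as np
import pandas as pd

# Made-up waiting times, including one missing value.
waiting = pd.Series([0, 3, 10, 20, 100, np.nan])

labels = ["0 days", "1-7 days", "8-14 days", "15-28 days", "> 28 days"]
edges = [-1, 0, 7, 14, 28, np.inf]  # right-closed bins: (-1, 0], (0, 7], ...

waiting_cat = pd.cut(waiting, bins=edges, labels=labels)

# NaN is not a member of any range, which is why the if/elif chain reaches `else`.
nan_in_range = float("nan") in range(1, 8)

print(waiting_cat.tolist())
print(nan_in_range)
```

With this approach a later `dropna()` actually removes the rows whose waiting time or age was set to missing.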



In [15]:
noShow.describe()

Unnamed: 0,PatientId,Age,Scholarship,Hypertension,Diabetes,Alcoholism,Disabilities,SMS_received,WaitingTime,Disability,PreviousAppointment,NumberOfPreviousNoShow,PreviousNoShowProp
count,110527.0,110526.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110522.0,110527.0,110527.0,110527.0,48228.0
mean,147496300000000.0,37.089219,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026,10.184253,0.020276,1.270314,0.406561,0.198413
std,256094900000000.0,23.110026,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873,15.255115,0.140942,3.913419,0.797339,0.343029
min,39217.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0,15.0,0.0,1.0,1.0,0.333333
max,999981600000000.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0,179.0,1.0,87.0,18.0,1.0


In [None]:
display(pd.crosstab(noShow.Age,noShow.AgeCat))

In [17]:
# Keep only the modelling columns; PatientId, Neighbourhood and the date columns are left out (AppointmentID remains as the index)
#noShow = noShow.drop(['PreviousNoShowProp'], axis=1)
noShow['NoShow']=np.where(noShow['No-show'] == "Yes", 1,0)
df = noShow[['NoShow', 'Gender', 'AgeCat','WaitingTimeCat','NumberOfPreviousNoShow','Scholarship','Hypertension','Alcoholism','Disability','Diabetes']].copy()
df2=df.dropna()
df2.describe()
print(df2.shape)


#create indicator columns for categorical columns
model_data = pd.get_dummies(df2) 
model_data.head()
model_data.info()

(110527, 10)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 110527 entries, 5751990 to 5660958
Data columns (total 22 columns):
NoShow                       110527 non-null int64
NumberOfPreviousNoShow       110527 non-null int64
Scholarship                  110527 non-null int64
Hypertension                 110527 non-null int64
Alcoholism                   110527 non-null int64
Disability                   110527 non-null int64
Diabetes                     110527 non-null int64
Gender_F                     110527 non-null uint8
Gender_M                     110527 non-null uint8
AgeCat_0-4 years             110527 non-null uint8
AgeCat_05-14 years           110527 non-null uint8
AgeCat_15-24 years           110527 non-null uint8
AgeCat_25-34 years           110527 non-null uint8
AgeCat_35-44 years           110527 non-null uint8
AgeCat_45-54 years           110527 non-null uint8
AgeCat_55-64 years           110527 non-null uint8
AgeCat_> 64 years            110527 non-null uint8
Wa
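
As a small illustration of the indicator-column step, the sketch below applies `pd.get_dummies` to a made-up frame. Numeric columns pass through unchanged and each string column is expanded into one column per category. The `drop_first=True` variant is shown as an option, not as what this notebook does: it drops one category per variable, which avoids perfectly collinear indicator pairs such as `Gender_F` and `Gender_M`.

```python
import pandas as pd

# Made-up rows with one numeric and one categorical column.
toy = pd.DataFrame({"NoShow": [0, 1, 0], "Gender": ["F", "M", "F"]})

encoded = pd.get_dummies(toy)                    # one indicator per category
reduced = pd.get_dummies(toy, drop_first=True)   # first category becomes the baseline

print(list(encoded.columns))
print(list(reduced.columns))
```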

Split the dataset randomly into training data (about 70%), validation data (about 20%) and test data (about 10%).

In [61]:
# Draw one uniform random number per row and use it to assign the row to
# train (~70%), validation (~20%) or test (~10%)
rand_split = np.random.rand(len(model_data))
train_list = rand_split < 0.7
val_list = (rand_split >= 0.7) & (rand_split < 0.9)
test_list = rand_split >= 0.9

data_train = model_data[train_list]
data_val = model_data[val_list]
data_test = model_data[test_list]

# The label (NoShow) is the first column; the remaining columns are the features.
# `.values` replaces the deprecated `.as_matrix()`.
train_y = data_train.iloc[:, 0].values
train_X = data_train.iloc[:, 1:].values

val_y = data_val.iloc[:, 0].values
val_X = data_val.iloc[:, 1:].values

test_y = data_test.iloc[:, 0].values
test_X = data_test.iloc[:, 1:].values

# NumPy arrays have no `.info()` method, so print the shape instead
print(test_y.shape)

In [19]:
print(test_X.shape)

(10908, 21)
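
Because the split draws fresh random numbers on every run, the three subsets change each time the cell is executed. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable; the seed and the row count here are arbitrary choices for the example:

```python
import numpy as np

n_rows = 1000  # stand-in for len(model_data)

def split_masks(n, seed):
    """Return boolean masks for a roughly 70/20/10 train/validation/test split."""
    rng = np.random.default_rng(seed)
    u = rng.random(n)
    return u < 0.7, (u >= 0.7) & (u < 0.9), u >= 0.9

train_mask, val_mask, test_mask = split_masks(n_rows, seed=8147)

# Every row lands in exactly one subset.
membership = train_mask.astype(int) + val_mask.astype(int) + test_mask.astype(int)
print(train_mask.sum(), val_mask.sum(), test_mask.sum(), membership.min(), membership.max())
```

Calling `split_masks` again with the same seed returns identical masks, which keeps the uploaded training and validation files consistent with the held-out test rows across reruns.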


In [20]:
import io
import sagemaker.amazon.common as smac 

train_file = 'linear_train.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)


s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, train_file)
print('uploaded training data location: {}'.format(s3_train_data))




uploaded training data location: s3://sagemaker-sf-strategenics/sagemaker/DEMO-xgboost-noShow/train/linear_train.data


In [21]:

validation_file = 'linear_validation.data'

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, val_X.astype('float32'), val_y.astype('float32'))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', validation_file)).upload_fileobj(f)
s3_validation_data = 's3://{}/{}/validation/{}'.format(bucket, prefix, validation_file)
print('uploaded validation data location: {}'.format(s3_validation_data))

uploaded validation data location: s3://sagemaker-sf-strategenics/sagemaker/DEMO-xgboost-noShow/validation/linear_validation.data


---

## Training
The model inputs are a set of binary indicator columns plus one count, and the target is binary, so a linear classifier trained with a logistic loss (logistic regression) is a natural baseline. Its coefficients are also straightforward to inspect.

Amazon SageMaker's Linear Learner trains linear models for classification and regression. With `predictor_type='binary_classifier'` it minimizes a cross-entropy objective, and it can train several candidate models in parallel and select among them using the validation set. Both are visible in the training log further down, which reports `train_binary_classification_cross_entropy_objective` separately for `"model": 0`, `"model": 1`, and so on.

Let's start with a simple Linear Learner model, trained using Amazon SageMaker's managed training framework.
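
The binary classifier trained in this notebook reports a cross-entropy objective in its training log. To make that objective concrete, here is a minimal NumPy sketch of logistic regression fitted by plain gradient descent on synthetic data. It is an illustration of the loss only, not SageMaker's implementation; the data, learning rate and epoch count are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic(X, y, learning_rate=0.5, epochs=500):
    """Full-batch gradient descent on the mean cross-entropy."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        error = sigmoid(X @ w + b) - y       # gradient of the loss w.r.t. the logit
        w -= learning_rate * (X.T @ error) / n
        b -= learning_rate * error.mean()
    return w, b

# Synthetic binary features and labels drawn from a known logistic model.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 3)).astype(float)
true_logit = -1.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(2000) < sigmoid(true_logit)).astype(float)

loss_before = cross_entropy(y, sigmoid(np.zeros(len(y))))  # all-zero model predicts 0.5
w, b = fit_logistic(X, y)
loss_after = cross_entropy(y, sigmoid(X @ w + b))
print(loss_before, loss_after)
```

An untrained model that predicts 0.5 for every row has a cross-entropy of ln 2, about 0.693, which gives a reference point for reading the objective values in the log.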

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of Linear Learner.

In [22]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

Next we'll need to specify training parameters to the estimator.  This includes:
1. The `Linear Learner` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies:
1. S3 location for input data.  In this case we have both a training and validation set which are passed in.

In [50]:
output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://sagemaker-sf-strategenics/sagemaker/DEMO-xgboost-noShow/output


In [51]:


sess = sagemaker.Session()

linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess)
linear.set_hyperparameters(feature_dim=21,
                           predictor_type='binary_classifier',
                           mini_batch_size=200)

linear.fit({'train': s3_train_data, 'validation': s3_validation_data}) 

2019-11-12 00:06:10 Starting - Starting the training job...
2019-11-12 00:06:12 Starting - Launching requested ML instances...
2019-11-12 00:07:10 Starting - Preparing the instances for training......
2019-11-12 00:08:11 Downloading - Downloading input data
2019-11-12 00:08:11 Training - Downloading the training image....[31mDocker entrypoint called with argument(s): train[0m
[31m[11/12/2019 00:08:44 INFO 139624987936576] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_method': u'uniform', u'init_sigma': u'0.01', u'lr_scheduler_mini

[31m[2019-11-12 00:09:10.547] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 5, "duration": 11110, "num_examples": 388, "num_bytes": 9932672}[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2804456025254203, "sum": 0.2804456025254203, "min": 0.2804456025254203}}, "EndTime": 1573517350.548035, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 1}, "StartTime": 1573517350.547935}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.273484602491985, "sum": 0.273484602491985, "min": 0.273484602491985}}, "EndTime": 1573517350.54815, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 1}, "StartTime": 1573517350.548119}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max":

[31m[2019-11-12 00:09:24.530] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 7, "duration": 11297, "num_examples": 388, "num_bytes": 9932672}[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.27019484409066136, "sum": 0.27019484409066136, "min": 0.27019484409066136}}, "EndTime": 1573517364.530162, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 2}, "StartTime": 1573517364.530072}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26599806433192213, "sum": 0.26599806433192213, "min": 0.26599806433192213}}, "EndTime": 1573517364.530266, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 2}, "StartTime": 1573517364.530245}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count":

[31m[2019-11-12 00:09:38.381] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 9, "duration": 11206, "num_examples": 388, "num_bytes": 9932672}[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26709349957547446, "sum": 0.26709349957547446, "min": 0.26709349957547446}}, "EndTime": 1573517378.381332, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 3}, "StartTime": 1573517378.381234}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26463987207535933, "sum": 0.26463987207535933, "min": 0.26463987207535933}}, "EndTime": 1573517378.38142, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 3}, "StartTime": 1573517378.3814}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1,

[31m[2019-11-12 00:09:52.283] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 11, "duration": 11138, "num_examples": 388, "num_bytes": 9932672}[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2657582830581862, "sum": 0.2657582830581862, "min": 0.2657582830581862}}, "EndTime": 1573517392.283171, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 4}, "StartTime": 1573517392.283067}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2644154839429436, "sum": 0.2644154839429436, "min": 0.2644154839429436}}, "EndTime": 1573517392.28326, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 4}, "StartTime": 1573517392.283241}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "m

[31m[2019-11-12 00:10:06.185] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 13, "duration": 11147, "num_examples": 388, "num_bytes": 9932672}[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26506755010718214, "sum": 0.26506755010718214, "min": 0.26506755010718214}}, "EndTime": 1573517406.186142, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 5}, "StartTime": 1573517406.186043}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.264304254196719, "sum": 0.264304254196719, "min": 0.264304254196719}}, "EndTime": 1573517406.186228, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 5}, "StartTime": 1573517406.186207}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "

[31m[2019-11-12 00:10:19.912] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 15, "duration": 11043, "num_examples": 388, "num_bytes": 9932672}[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2647189456732698, "sum": 0.2647189456732698, "min": 0.2647189456732698}}, "EndTime": 1573517419.912516, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 6}, "StartTime": 1573517419.912411}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26428059570549073, "sum": 0.26428059570549073, "min": 0.26428059570549073}}, "EndTime": 1573517419.912614, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 6}, "StartTime": 1573517419.912593}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1

[31m[2019-11-12 00:10:33.675] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 17, "duration": 11084, "num_examples": 388, "num_bytes": 9932672}[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2644902133695223, "sum": 0.2644902133695223, "min": 0.2644902133695223}}, "EndTime": 1573517433.675403, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 7}, "StartTime": 1573517433.675307}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2643407517070918, "sum": 0.2643407517070918, "min": 0.2643407517070918}}, "EndTime": 1573517433.67549, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 7}, "StartTime": 1573517433.67547}
[0m
[31m#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "ma

[2019-11-12 00:10:47.466] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 19, "duration": 11083, "num_examples": 388, "num_bytes": 9932672}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2643434050163249, "sum": 0.2643434050163249, "min": 0.2643434050163249}}, "EndTime": 1573517447.467105, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 8}, "StartTime": 1573517447.46701}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2643066661056006, "sum": 0.2643066661056006, "min": 0.2643066661056006}}, "EndTime": 1573517447.467191, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 8}, "StartTime": 1573517447.467171}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "m

[2019-11-12 00:11:01.317] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 21, "duration": 11170, "num_examples": 388, "num_bytes": 9932672}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2642415314804984, "sum": 0.2642415314804984, "min": 0.2642415314804984}}, "EndTime": 1573517461.318109, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 9}, "StartTime": 1573517461.318018}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2642759272848913, "sum": 0.2642759272848913, "min": 0.2642759272848913}}, "EndTime": 1573517461.318193, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 9}, "StartTime": 1573517461.318174}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "

[2019-11-12 00:11:15.304] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 23, "duration": 11318, "num_examples": 388, "num_bytes": 9932672}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.264172780384389, "sum": 0.264172780384389, "min": 0.264172780384389}}, "EndTime": 1573517475.30498, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 10}, "StartTime": 1573517475.304871}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26424627752575147, "sum": 0.26424627752575147, "min": 0.26424627752575147}}, "EndTime": 1573517475.305065, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 10}, "StartTime": 1573517475.305045}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, 

[2019-11-12 00:11:29.418] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 25, "duration": 11401, "num_examples": 388, "num_bytes": 9932672}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2641288077430824, "sum": 0.2641288077430824, "min": 0.2641288077430824}}, "EndTime": 1573517489.418823, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 11}, "StartTime": 1573517489.418734}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.264218672991415, "sum": 0.264218672991415, "min": 0.264218672991415}}, "EndTime": 1573517489.418908, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 11}, "StartTime": 1573517489.418887}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "m

[2019-11-12 00:11:43.410] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 27, "duration": 11259, "num_examples": 388, "num_bytes": 9932672}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26409929186798803, "sum": 0.26409929186798803, "min": 0.26409929186798803}}, "EndTime": 1573517503.410713, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 12}, "StartTime": 1573517503.410607}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2641904809000572, "sum": 0.2641904809000572, "min": 0.2641904809000572}}, "EndTime": 1573517503.410805, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 12}, "StartTime": 1573517503.410785}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count":

[2019-11-12 00:11:57.146] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 29, "duration": 10966, "num_examples": 388, "num_bytes": 9932672}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26407751021767156, "sum": 0.26407751021767156, "min": 0.26407751021767156}}, "EndTime": 1573517517.146526, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 13}, "StartTime": 1573517517.146436}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2641647591455346, "sum": 0.2641647591455346, "min": 0.2641647591455346}}, "EndTime": 1573517517.146651, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 13}, "StartTime": 1573517517.146628}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count":

[2019-11-12 00:12:11.131] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 31, "duration": 11266, "num_examples": 388, "num_bytes": 9932672}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.2640662849534698, "sum": 0.2640662849534698, "min": 0.2640662849534698}}, "EndTime": 1573517531.131929, "Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 14}, "StartTime": 1573517531.131838}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count": 1, "max": 0.26413920865810503, "sum": 0.26413920865810503, "min": 0.26413920865810503}}, "EndTime": 1573517531.132023, "Dimensions": {"model": 1, "Host": "algo-1", "Operation": "training", "Algorithm": "Linear Learner", "epoch": 14}, "StartTime": 1573517531.132003}
#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": {"count":


2019-11-12 00:12:27 Uploading - Uploading generated training model
2019-11-12 00:12:27 Completed - Training job completed
Training seconds: 270
Billable seconds: 270
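
Each complete `#metrics` line in the training log above is a JSON object after the `#metrics` prefix, so the per-epoch training loss can be pulled out programmatically. The sketch below is a minimal illustration, not part of the original notebook: `parse_metrics_line` is a hypothetical helper name, and the sample string is one line copied from the log. Lines that were cut off in the log are not valid JSON and are skipped.

```python
import json

LOSS_KEY = "train_binary_classification_cross_entropy_objective"

def parse_metrics_line(line):
    """Return (model, epoch, loss) for a complete '#metrics' line, else None.

    Hypothetical helper for illustration only.
    """
    marker = line.find("#metrics")
    if marker == -1:
        return None
    start = line.find("{", marker)
    if start == -1:
        return None
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None  # truncated or otherwise malformed line
    metric = record.get("Metrics", {}).get(LOSS_KEY)
    dims = record.get("Dimensions", {})
    if metric is None or "model" not in dims or "epoch" not in dims:
        return None
    return dims["model"], dims["epoch"], metric["sum"]

# One line copied from the training log (model 0, epoch 8).
sample = (
    '#metrics {"Metrics": {"train_binary_classification_cross_entropy_objective": '
    '{"count": 1, "max": 0.2643434050163249, "sum": 0.2643434050163249, '
    '"min": 0.2643434050163249}}, "EndTime": 1573517447.467105, '
    '"Dimensions": {"model": 0, "Host": "algo-1", "Operation": "training", '
    '"Algorithm": "Linear Learner", "epoch": 8}, "StartTime": 1573517447.46701}'
)

print(parse_metrics_line(sample))
```

Applied to every line of the log, this gives one loss value per model and epoch, which makes it easy to see that the cross-entropy objective is still decreasing slowly at epoch 14.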


In [46]:
#!conda install -y -c conda-forge mxnet 

Solving environment: done


  current version: 4.5.12
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs: 
    - mxnet


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _mutex_mxnet-0.0.30        |         openblas           2 KB
    numexpr-2.7.0              |   py36hb3f55d8_0         196 KB  conda-forge
    blas-2.14                  |         openblas          10 KB  conda-forge
    liblapacke-3.8.0           |      14_openblas          10 KB  conda-forge
    mxnet-1.5.0                |       hea8a0af_0           5 KB
    numpy-base-1.17.3          |   py36h2f8d375_0         5.2 MB
    libmxnet-1.5.0             |openblas_ha1db078_0        24.0 MB
    libopenblas-0.3.7          |       h6e990d7_3         7.6 MB  conda-forge

In [60]:
import os
import boto3
import mxnet as mx

# S3 key of the model artifact produced by the training job above
key = 'sagemaker/DEMO-xgboost-noShow/output/linear-learner-2019-11-12-00-06-10-451/output/model.tar.gz'
boto3.resource('s3').Bucket(bucket).download_file(key, 'model.tar.gz')

# Extract the downloaded archive
os.system('tar -zxvf model.tar.gz')

# The Linear Learner model is itself a zip file containing an MXNet model and other metadata.
# Unzip it first.
os.system('unzip model_algo-1')

# Load the MXNet module
mod = mx.module.Module.load("mx-mod", 0)

# Model weights (one per input feature)
weights = mod._arg_params['fc0_weight'].asnumpy().flatten()

# Model bias
bias = mod._arg_params['fc0_bias'].asnumpy().flatten()

# Display the weights for all features
weights



array([ 2.87782   ,  0.08626004, -0.06458283, -0.11891662, -0.00443158,
        0.0787802 ,  0.07766209,  0.09922467,  0.33689326,  0.26571706,
        0.29169926,  0.18138276,  0.04187226, -0.05654851, -0.15616146,
       -0.18547378, -1.6494722 ,  0.8219359 ,  1.3115785 ,  0.98427737,
        1.4681982 ], dtype=float32)
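
For a binary classifier trained with a logistic loss, weights and a bias like the ones extracted above are combined with a feature vector through the sigmoid function to give a score between 0 and 1. The sketch below illustrates that calculation on made-up numbers. The function name `logistic_score`, the toy weights, bias and features are all assumptions for illustration, and it assumes the features are on the same scale the weights were fitted on. The endpoint's `predicted_label` may also use a tuned threshold rather than 0.5.

```python
import numpy as np

def logistic_score(features, weights, bias):
    """Score = sigmoid(w . x + b). Hypothetical helper for illustration."""
    logit = np.dot(features, weights) + bias[0]
    return 1.0 / (1.0 + np.exp(-logit))

# Toy values, NOT the fitted model: two features instead of the 21 above.
toy_weights = np.array([2.0, -1.0], dtype=np.float32)
toy_bias = np.array([0.5], dtype=np.float32)      # same shape as the flattened bias
toy_features = np.array([1.0, 2.5], dtype=np.float32)

score = logistic_score(toy_features, toy_weights, toy_bias)
print(score)  # the logit is 2.0 - 2.5 + 0.5 = 0, so the score is 0.5
```

Read this way, a large positive weight (such as the first one in the array above) pushes the score towards `1`, while a large negative weight pushes it towards `0`.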

In [None]:
train_data.info()

---

## Hosting
Now that we've trained the `Linear Learner` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [24]:
linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

--------------------------------------------------------------------------------------------------!

---

## Evaluation
There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're simply predicting whether the patient was a no-show for the appointment (`1`) or not (`0`), which produces a simple confusion matrix.

First we'll need to determine how we pass data into and receive data from our endpoint.  Our data is currently stored as NumPy arrays in the memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the JSON response that the endpoint returns.

*Note: For inference with CSV format, the payload sent to the Linear Learner endpoint must NOT include the target variable.*

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the Linear Learner endpoint
1. Collect predictions and convert from the JSON output our model provides into a NumPy array
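
To make the serialization steps above concrete, here is a small offline sketch of what happens to one mini-batch: the rows become a CSV payload, and the labels are read back out of a JSON response. In the notebook this work is done by the SDK's `csv_serializer` and `json_deserializer`. The helpers `to_csv_payload` and `labels_from_response` are hypothetical names used only for illustration, and the response string follows the shape of the endpoint output shown further down in this notebook.

```python
import io
import json
import numpy as np

def to_csv_payload(batch):
    """Turn a 1-D or 2-D array of features (no target column) into CSV text."""
    buf = io.StringIO()
    np.savetxt(buf, np.atleast_2d(batch), delimiter=",", fmt="%g")
    return buf.getvalue()

def labels_from_response(body):
    """Extract predicted labels from a JSON response body."""
    parsed = json.loads(body)
    return np.array([p["predicted_label"] for p in parsed["predictions"]])

# Toy mini-batch with two rows and two features (illustrative values only).
payload = to_csv_payload(np.array([[1.0, 0.5], [0.0, 2.0]]))
print(payload)

# Response body in the same shape the endpoint returns.
body = '{"predictions": [{"score": 0.9366288185119629, "predicted_label": 1.0}]}'
print(labels_from_response(body))
```

Each row of the mini-batch becomes one line of the payload, and the endpoint returns one entry in `predictions` per row.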

In [28]:
# test_X is a NumPy array, which has no describe() method, so wrap it in a DataFrame first
pd.DataFrame(test_X).describe()

In [40]:
from sagemaker.predictor import csv_serializer, json_deserializer

linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

result = linear_predictor.predict(test_X[10900])
print(result)

{'predictions': [{'score': 0.9366288185119629, 'predicted_label': 1.0}]}


Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [42]:

import numpy as np

predictions = []
for array in np.array_split(test_X, 100):
    result = linear_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)


IndexError: too many indices for array

In [38]:
pd.crosstab(test_y, predictions, rownames=['actuals'], colnames=['predictions'], margins=True)

| actuals \ predictions | 0.0 | 1.0 | All |
|---|---|---|---|
| 0 | 7966 | 776 | 8742 |
| 1 | 124 | 2042 | 2166 |
| All | 8090 | 2818 | 10908 |


In [None]:
print(predictions.shape)


In summary, of the 10,908 patients, we predicted that 2,818 would be no-shows: 2,042 of them actually missed their appointment, while 776 turned up.  We also had 124 patients who were no-shows but whom we predicted would turn up.
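
The counts in the confusion matrix above can be turned into the usual summary metrics. The short sketch below does this, assuming `1` denotes a no-show (the positive class). The variable names are chosen here for illustration and are not part of the original notebook.

```python
# Counts read from the confusion matrix (rows = actuals, columns = predictions).
tn, fp = 7966, 776    # actual 0: predicted 0, predicted 1
fn, tp = 124, 2042    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total     # share of all patients classified correctly
precision = tp / (tp + fp)       # share of predicted no-shows that were real
recall = tp / (tp + fn)          # share of real no-shows that were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On these counts the model catches most real no-shows (high recall), at the cost of flagging a number of patients who did attend (lower precision).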


### (Optional) Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [43]:
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)