# Project info - supervised learning

## Description

The task is to build a model that predicts whether a patient will miss (no-show) an appointment. This is a binary classification problem.

## Data dictionary

- Label:
    - NoShow - Yes/No indicator describing the patient's appointment attendance. 'Yes' means the patient was a no-show. Loaded below as the column [Label].
- Features:
    - PatientId - Identifier of the patient
    - AppointmentId - Identifier of the appointment
    - Gender - Male or Female
    - AppointmentDay - The date of the appointment
    - ScheduledDay - The date the appointment was scheduled
    - Age - Patient age in years
    - Neighborhood - Appointment location
    - Scholarship - True or False
    - Hypertension - True or False
    - Diabetes - True or False
    - Alcoholism - True or False
    - Handicap - True or False
    - SMSReceived - True or False

# Import modules and tools

In [1]:
# Standard library and settings
import os
import sys
import warnings
warnings.simplefilter('ignore')
from IPython.display import display, HTML
display(HTML("<style>.container { width:78% !important; }</style>"))


# Data extensions and settings
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.6f}'.format
np.set_printoptions(threshold = np.inf, suppress = True)


# Modeling extensions
import sklearn.svm as svm
import sklearn.base as base
import sklearn.metrics as metrics
import sklearn.pipeline as pipeline
import sklearn.ensemble as ensemble
import sklearn.linear_model as linear_model
import sklearn.preprocessing as preprocessing
import sklearn.model_selection as model_selection
import sklearn.feature_selection as feature_selection


# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt


# Magic functions
%matplotlib inline


from scipy.stats import ttest_ind
from statsmodels.stats.weightstats import ztest
from IPython.display import display

from matplotlib.pyplot import rc_context

# Load, inspect, clean, prepare data

In [2]:
# Load and inspect data

df = pd.read_csv('kaggleApptNoShow.csv'
                ,header = 0
                ,names = ['PatientId', 'AppointmentId', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighborhood', 'Scholarship', 'Hypertension',
       'Diabetes', 'Alcoholism', 'Handicap', 'SMSReceived', 'Label'])
df.info()
display(df[:5])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentId     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighborhood      110527 non-null object
Scholarship       110527 non-null int64
Hypertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handicap          110527 non-null int64
SMSReceived       110527 non-null int64
Label             110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


| | PatientId | AppointmentId | Gender | ScheduledDay | AppointmentDay | Age | Neighborhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMSReceived | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997776694438.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962299951.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951213174.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186448183.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


> Remarks - Zero nulls in this dataset

## Preliminary data cleansing and feature engineering

In [3]:
# review unique values of select columns

for col in df[['Gender','Age','Neighborhood','Scholarship','Hypertension','Diabetes','Alcoholism','Handicap','SMSReceived','Label']]:
    print(col, df[col].unique())
    

Gender ['F' 'M']
Age [ 62  56   8  76  23  39  21  19  30  29  22  28  54  15  50  40  46   4
  13  65  45  51  32  12  61  38  79  18  63  64  85  59  55  71  49  78
  31  58  27   6   2  11   7   0   3   1  69  68  60  67  36  10  35  20
  26  34  33  16  42   5  47  17  41  44  37  24  66  77  81  70  53  75
  73  52  74  43  89  57  14   9  48  83  72  25  80  87  88  84  82  90
  94  86  91  98  92  96  93  95  97 102 115 100  99  -1]
Neighborhood ['JARDIM DA PENHA' 'MATA DA PRAIA' 'PONTAL DE CAMBURI' 'REPÚBLICA'
 'GOIABEIRAS' 'ANDORINHAS' 'CONQUISTA' 'NOVA PALESTINA' 'DA PENHA'
 'TABUAZEIRO' 'BENTO FERREIRA' 'SÃO PEDRO' 'SANTA MARTHA' 'SÃO CRISTÓVÃO'
 'MARUÍPE' 'GRANDE VITÓRIA' 'SÃO BENEDITO' 'ILHA DAS CAIEIRAS'
 'SANTO ANDRÉ' 'SOLON BORGES' 'BONFIM' 'JARDIM CAMBURI' 'MARIA ORTIZ'
 'JABOUR' 'ANTÔNIO HONÓRIO' 'RESISTÊNCIA' 'ILHA DE SANTA MARIA'
 'JUCUTUQUARA' 'MONTE BELO' 'MÁRIO CYPRESTE' 'SANTO ANTÔNIO' 'BELA VISTA'
 'PRAIA DO SUÁ' 'SANTA HELENA' 'ITARARÉ' 'INHANGUETÁ' 'UNIVERSIT

> Remarks -
- [Gender] and [Neighborhood] are nominal features and will need to be encoded prior to modeling.
- [ScheduledDay] and [AppointmentDay] are both datetime strings. It appears that [AppointmentDay] does not include any time-of-day information, which is unfortunate. [ScheduledDay], on the other hand, does include the time. I'll convert both to dates and preserve the hour from [ScheduledDay] as a separate feature.
- The [Label] will also need to be encoded.
- [Handicap] should be a binary column. One solution is to reduce the values 2, 3 to 1.
- [Age] includes a negative value (-1) and a maximum of 115. An age of 115 is high but certainly possible. The negative age values will be changed to 0.
- There are several opportunities for feature engineering, including the number of days between the day the appointment was scheduled and the day of the appointment, and the day of the week of both the scheduling date and the appointment itself.

In [4]:
# Clean up and engineer time features

def parseHour(time):
    hour = int(time[11:13])
    return hour

df['ScheduledHour'] = df['ScheduledDay'].apply(parseHour)
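# Parse both columns as UTC, drop the timezone, and truncate to midnight so that
# the two columns share one dtype and can be subtracted and compared (recent pandas
# versions parse the trailing 'Z' as timezone-aware and refuse to mix aware and naive values)
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'], utc = True).dt.tz_localize(None).dt.normalize()
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'], utc = True).dt.tz_localize(None).dt.normalize()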

df['DaysUntilAppointment'] = (df['AppointmentDay'] - df['ScheduledDay']).dt.days
df.loc[df['DaysUntilAppointment'] > 90, 'DaysUntilAppointment'] = 90

df['ScheduledDayOfWeek'] = df['ScheduledDay'].dt.day_name()
df['AppointmentDayOfWeek'] = df['AppointmentDay'].dt.day_name()
df['SameDayAppointment'] = np.where(df['ScheduledDay'] == df['AppointmentDay'], 'Yes', 'No')

# Convert [Handicap] values higher than 1 to 1

df.loc[df['Handicap'] > 1, 'Handicap'] = 1

# Convert negative [Age] values to 0

df.loc[df['Age'] <= 0, 'Age'] = 0

# Drop unnecessary columns

del df['AppointmentId']
del df['PatientId']
del df['ScheduledDay']
del df['AppointmentDay']

# Inspect changes

df[:5]


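| | Gender | Age | Neighborhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMSReceived | Label | ScheduledHour | DaysUntilAppointment | ScheduledDayOfWeek | AppointmentDayOfWeek | SameDayAppointment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | 18 | 0 | Friday | Friday | Yes |
| 1 | M | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | 16 | 0 | Friday | Friday | Yes |
| 2 | F | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No | 16 | 0 | Friday | Friday | Yes |
| 3 | F | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No | 17 | 0 | Friday | Friday | Yes |
| 4 | F | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No | 16 | 0 | Friday | Friday | Yes |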


# Modeling

## Modeling tools

In [5]:
# This class allows for evaluating several models, each with their own parameter grid

class EstimatorSelectionHelper:
    
    
    def __init__(self, models, params):
        if not set(models.keys()).issubset(set(params.keys())):
            missing_params = list(set(models.keys()) - set(params.keys()))
            raise ValueError('Some estimators are missing parameters: {0}'.format(missing_params))
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}
    
    # Full GridSearchCV
    def fitGs(self, X, y, cv = 5, n_jobs = 1, verbose = 0, scoring = None, refit = True):
        for key in self.keys:
            print('Running GridSearchCV for {0}'.format(key))
            model = self.models[key]
            params = self.params[key]
            gs = model_selection.GridSearchCV(model
                              ,params
                              ,cv = cv
                              ,n_jobs = n_jobs
                              ,verbose = verbose
                              ,scoring = scoring
                              ,refit = refit
                              ,return_train_score = True)
            gs.fit(X,y)
            self.grid_searches[key] = gs    
        return gs
    
    # RandomizedSearchCV
    def fitRgs(self, X, y, cv = 5, n_jobs = 1, verbose = 0, scoring = None, refit = True, n_iter = 15):
        for key in self.keys:
            print('Running RandomizedSearchCV for {0}'.format(key))
            model = self.models[key]
            params = self.params[key]        
            rgs = model_selection.RandomizedSearchCV(model
                                    ,params
                                    ,cv = cv
                                    ,n_jobs = n_jobs
                                    ,verbose = verbose
                                    ,scoring = scoring
                                    ,refit = refit
                                    ,return_train_score = True
                                    ,n_iter = n_iter)
            rgs.fit(X,y)
            self.grid_searches[key] = rgs    
        return rgs
        
    def scoreSummary(self, sort_by = 'mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key
                 ,'min_score': min(scores)
                 ,'max_score': max(scores)
                 ,'mean_score': np.mean(scores)
                 ,'std_score': np.std(scores)
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            #print(k)
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = 'split{}_test_score'.format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis = 1).T.sort_values([sort_by], ascending = False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns]


In [6]:
#  Basic class for selecting attributes by name

class DataFrameSelector(base.BaseEstimator, base.TransformerMixin):
    
    
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        return X[self.attribute_names].values


In [7]:
# Helper that performs the train/test split, selects the numerical columns,
# selects the categorical columns, and recombines them into the datasets used
# to train and evaluate models

def splitPrepPipe(df, label, catFeatures = []):
    dfTrain, dfTest = model_selection.train_test_split(df, test_size = 0.1, random_state = 42)

    yTrain = dfTrain[label]
    yTest = dfTest[label]

    XTrain = dfTrain.drop([label], axis = 1)
    XTest = dfTest.drop([label], axis = 1)

    allCols = XTrain.columns.values
    index = [np.argwhere(allCols == i)[0][0] for i in catFeatures]
    numCols = np.delete(allCols, index)
    
    numPipeline = pipeline.Pipeline([
        ('selector', DataFrameSelector(numCols)),
        #('std_scaler', preprocessing.StandardScaler()),
    ])

    catPipeline = pipeline.Pipeline([
        ('selector', DataFrameSelector(catFeatures)),
    ])

    fullPipeline = pipeline.FeatureUnion(transformer_list = [
        ('numPipeline', numPipeline),
        ('catPipeline', catPipeline),
    ])    
    
    XTrain = fullPipeline.fit_transform(XTrain)
    XTest = fullPipeline.transform(XTest)
    
    return XTrain, XTest, yTrain, yTest


## Prepare data for model

In [8]:
# Convert nominal columns to dummy columns with binary indicators

df = pd.get_dummies(df, columns = ['Gender','Neighborhood','ScheduledDayOfWeek', 'AppointmentDayOfWeek', 'SameDayAppointment'], drop_first = True)

# Encode [Label]

le = preprocessing.LabelEncoder()
df['Label'] = le.fit_transform(df['Label'])

# Inspect changes

df[:5]


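| | Age | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMSReceived | Label | ScheduledHour | DaysUntilAppointment | ... | ScheduledDayOfWeek_Saturday | ScheduledDayOfWeek_Thursday | ScheduledDayOfWeek_Tuesday | ScheduledDayOfWeek_Wednesday | AppointmentDayOfWeek_Monday | AppointmentDayOfWeek_Saturday | AppointmentDayOfWeek_Thursday | AppointmentDayOfWeek_Tuesday | AppointmentDayOfWeek_Wednesday | SameDayAppointment_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 56 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 16 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |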
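
As a rough illustration of what `drop_first = True` does in the cell above, here is a sketch on a made-up three-row frame. The `toy` frame and its values are hypothetical and not taken from the dataset. With two categories, only one indicator column is kept, and the dropped category is implied when that indicator is 0.

```python
import pandas as pd

# Hypothetical toy frame with one nominal column and one numeric column
toy = pd.DataFrame({'Gender': ['F', 'M', 'F'], 'Age': [62, 56, 8]})

# drop_first = True drops the first category ('F'), leaving a single Gender_M column
encoded = pd.get_dummies(toy, columns = ['Gender'], drop_first = True)

print(list(encoded.columns))
print(encoded['Gender_M'].astype(int).tolist())
```

Dropping the first level avoids a redundant column per feature, which matters mostly for linear models such as LogisticRegression.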


## Split data into train and test sets

In [9]:
# Create train test split and review data sizes

XTrain, XTest, yTrain, yTest = splitPrepPipe(df, 'Label')

print('XTrain shape: {0}'.format(XTrain.shape))
print('yTrain shape: {0}'.format(yTrain.shape))
print('XTest shape: {0}'.format(XTest.shape))
print('yTest shape: {0}'.format(yTest.shape))


XTrain shape: (99474, 101)
yTrain shape: (99474,)
XTest shape: (11053, 101)
yTest shape: (11053,)


## Perform SMOTE - Synthetic minority over-sampling technique

In [10]:
# This is an imbalanced dataset, meaning that one of the label categories occurs
# far more often than the other. To remedy that, one strategy is SMOTE, which
# creates additional samples that have the label of the minority (least represented)
# class, with the intention of making it easier for the model to differentiate the
# minority class in the original dataset from the majority class. SMOTE works by adding
# observations that are similar, but not identical, to minority class samples in the
# original dataset

from imblearn.over_sampling import SMOTE

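# Recent imbalanced-learn releases use sampling_strategy and fit_resample
# in place of the older ratio and fit_sample
sm = SMOTE(random_state = 1, sampling_strategy = 1.0)
XTrainOS, yTrainOS = sm.fit_resample(XTrain, yTrain)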

print('XTrainOS shape: {0}'.format(XTrainOS.shape))
print('yTrainOS shape: {0}'.format(yTrainOS.shape))


XTrainOS shape: (158760, 101)
yTrainOS shape: (158760,)
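
To make the "similar, but not identical" idea concrete, here is a minimal NumPy sketch of the interpolation step behind SMOTE. It is an illustration under simplifying assumptions, not imbalanced-learn's implementation: the function name `smote_like_sample`, the toy array `X_minority`, and the choice `k = 2` are all hypothetical.

```python
import numpy as np

def smote_like_sample(X_minority, k = 2, seed = 0):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)

    # Pick a random minority sample as the starting point
    i = rng.integers(len(X_minority))
    x = X_minority[i]

    # Find its k nearest minority neighbors (position 0 of the sort is the point itself)
    dists = np.linalg.norm(X_minority - x, axis = 1)
    neighbors = np.argsort(dists)[1:k + 1]

    # Interpolate a random fraction of the way toward one of those neighbors
    j = rng.choice(neighbors)
    gap = rng.random()
    return x + gap * (X_minority[j] - x)

# Hypothetical minority-class samples with two numeric features
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
synthetic = smote_like_sample(X_minority)
print(synthetic)
```

Because the new point is a blend of two real samples, 0/1 dummy columns like the ones created earlier can end up with fractional values in the oversampled data.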


## Models and parameters

In [11]:
# Construct parameter grids for LogisticRegression and RandomForestClassifier

models = {
    'LogisticRegression' : linear_model.LogisticRegression()
    ,'RandomForestClassifier': ensemble.RandomForestClassifier()    
}

params = {
    'LogisticRegression' : {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}    
    ,'RandomForestClassifier' : {
                                'n_estimators': np.arange(800, 1100, 100)
                                ,'max_features' : [None, 'sqrt']
                                ,'max_depth': np.arange(12, 20, 2)
                                ,'min_samples_split': np.arange(20, 40, 2)
                                ,'min_samples_leaf': np.arange(2, 40, 2)
                                ,'bootstrap': [True]
                                }
}


In [12]:
# Execute RandomizedSearchCV (fitRgs) for each model

helper = EstimatorSelectionHelper(models, params)

gridSearch = helper.fitRgs(XTrainOS
                ,yTrainOS
                ,n_iter = 10
                ,verbose = 0
                ,cv = 5
                ,scoring = 'roc_auc')


Running RandomizedSearchCV for LogisticRegression




Running RandomizedSearchCV for RandomForestClassifier


In [13]:
# Review CV results for each model and parameter set

scores = helper.scoreSummary()
scores.fillna('')


| | estimator | min_score | mean_score | max_score | std_score | C | bootstrap | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | RandomForestClassifier | 0.735064 | 0.916796 | 0.977607 | 0.093841 | | True | 16.0 | | 8.0 | 30.0 | 1000.0 |
| 11 | RandomForestClassifier | 0.735566 | 0.915813 | 0.976647 | 0.093094 | | True | 18.0 | sqrt | 8.0 | 36.0 | 900.0 |
| 10 | RandomForestClassifier | 0.732406 | 0.912623 | 0.973239 | 0.093028 | | True | 16.0 | sqrt | 6.0 | 20.0 | 900.0 |
| 16 | RandomForestClassifier | 0.730471 | 0.911347 | 0.972092 | 0.093343 | | True | 16.0 | sqrt | 8.0 | 22.0 | 1000.0 |
| 15 | RandomForestClassifier | 0.72638 | 0.908232 | 0.968911 | 0.093886 | | True | 14.0 | | 18.0 | 32.0 | 800.0 |
| 13 | RandomForestClassifier | 0.722751 | 0.904922 | 0.965676 | 0.094018 | | True | 16.0 | | 38.0 | 36.0 | 1000.0 |
| 9 | RandomForestClassifier | 0.722976 | 0.903297 | 0.963463 | 0.09304 | | True | 14.0 | sqrt | 12.0 | 32.0 | 900.0 |
| 12 | RandomForestClassifier | 0.723086 | 0.902441 | 0.962403 | 0.092526 | | True | 14.0 | sqrt | 14.0 | 24.0 | 900.0 |
| 8 | RandomForestClassifier | 0.71969 | 0.89751 | 0.957096 | 0.091726 | | True | 14.0 | sqrt | 30.0 | 24.0 | 900.0 |
| 14 | RandomForestClassifier | 0.718418 | 0.897128 | 0.95723 | 0.092217 | | True | 12.0 | sqrt | 6.0 | 22.0 | 900.0 |


In [14]:
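# Review best model params. Note that fitRgs returns the search object for the
# last estimator it fit, which here is RandomForestClassifier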

gridSearch.best_params_


{'n_estimators': 1000,
 'min_samples_split': 30,
 'min_samples_leaf': 8,
 'max_features': None,
 'max_depth': 16,
 'bootstrap': True}

In [20]:
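# Evaluate the best model's ROC AUC, precision and recall on both the train and test set.
# best_estimator_ was already refit on the full oversampled training set (refit = True),
# so no additional call to fit is needed.
# Note: the metrics below are computed from hard class predictions (predict).

bestModel = gridSearch.best_estimator_
rf = bestModel

yPredsTrain = rf.predict(XTrainOS)
yPredsTest = rf.predict(XTest)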

print('Training data: \n')
print('ROC AUC: {}'.format(round(metrics.roc_auc_score(yTrainOS, yPredsTrain), 5)))
print('Precision: {}'.format(round(metrics.precision_score(yTrainOS, yPredsTrain), 5)))
print('Recall: {}'.format(round(metrics.recall_score(yTrainOS, yPredsTrain), 5)))

print('\nTest data: \n')
print('ROC AUC: {}'.format(round(metrics.roc_auc_score(yTest, yPredsTest), 5)))
print('Precision: {}'.format(round(metrics.precision_score(yTest, yPredsTest), 5)))
print('Recall: {}'.format(round(metrics.recall_score(yTest, yPredsTest), 5)))


Training data: 

ROC AUC: 0.85232
Precision: 0.92704
Recall: 0.76483

Test data: 

ROC AUC: 0.56585
Precision: 0.42462
Recall: 0.2
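
The ROC AUC values above come from hard class predictions returned by `predict`. ROC AUC is more commonly computed from scores or probabilities, such as the second column of `predict_proba`, and the two approaches can give different numbers. The sketch below shows the difference on made-up values, not the notebook's data. The arrays `y_true` and `y_score` are hypothetical, and the 0.5 cutoff is an assumption.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.7, 0.9])

# Hard predictions at an assumed 0.5 cutoff
y_pred = (y_score >= 0.5).astype(int)

# ROC AUC from the continuous scores uses the full ranking of the samples
auc_from_scores = metrics.roc_auc_score(y_true, y_score)

# ROC AUC from hard labels only sees a single operating point
auc_from_labels = metrics.roc_auc_score(y_true, y_pred)

print(round(auc_from_scores, 4), round(auc_from_labels, 4))  # 0.8889 0.6667
```

Applied to this notebook, that would mean passing something like `rf.predict_proba(XTest)[:, 1]` to `metrics.roc_auc_score`, while precision and recall would still use the hard predictions.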
