# Predicting a 'no show' for a medical appointment based on historical data
This notebook uses a historical dataset from 2016 to predict someone not showing up for a medical appointment.
## Packages
The following packages were used.

In [243]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

## Data input
First the dataset was read in.

In [165]:
df = pd.read_csv('data/medical_no_show.csv')
print('Count of rows', str(df.shape[0]))
print('Count of Columns', str(df.shape[1]))
df.head()

Count of rows 110527
Count of Columns 14


Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In the next part we check for missing data.

In [166]:
df.isnull().any().any()

False

As no missing data was found, we proceeded with verifying the dtypes for each of the columns.

In [167]:
df.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

Furthermore, we check how many unique values there are for each column.

In [168]:
for i in df.columns:
    print(i+":",len(df[i].unique()))

PatientId: 62299
AppointmentID: 110527
Gender: 2
ScheduledDay: 103549
AppointmentDay: 27
Age: 104
Neighbourhood: 81
Scholarship: 2
Hipertension: 2
Diabetes: 2
Alcoholism: 2
Handcap: 5
SMS_received: 2
No-show: 2


## Data Cleaning
First all column names are converted lowercase to achieve consistency.

In [169]:
df.columns = df.columns.str.lower().str.strip()

The `appointmentid` is set as index for the dataset.

In [170]:
df.set_index('appointmentid', inplace = True)

`patientid` needs to be converted to `int`.  
`no-show` needs to be converted to `int`.  
`gender` needs to be converted to `int`.  

In [171]:
df['patientid'] = df['patientid'].astype('int64')
df['no-show'] = df['no-show'].map({'No':0, 'Yes':1})
df['gender'] = df['gender'].map({'F':0, 'M':1})

`neighbourhood` is converted using one hot encoding.

In [172]:
df = pd.get_dummies(df, columns = ['neighbourhood'])
df.columns = df.columns.str.lower().str.strip()

A couple of features were added:
- `num_app`: count how many previous appointments the patient has had (starting with 0)
- `apps_missed`: number of appointments missed previously
- `previous_noshow`: percentage of previously missed appointments

In [173]:
df['num_app'] = df.sort_values(by = ['patientid','scheduledday']).groupby(['patientid']).cumcount() + 1
df['apps_missed'] = df.sort_values(by = ['patientid','scheduledday']).groupby(['patientid'])['no-show'].cumsum().shift(1, axis = 0)
df['noshow_pct'] = df['apps_missed'] / (df.sort_values(by = ['patientid','scheduledday']).groupby(['patientid'])['num_app'].shift(1, axis =0))
df['noshow_pct'].fillna(0, inplace = True)
df['apps_missed'].fillna(0, inplace = True)

Below is an example of a patient that has had multiple appointments and missed some as well.

In [174]:
df[df['patientid'] == 838284762259].sort_values(by = ['patientid','scheduledday'])[['scheduledday', 'no-show', 'num_app', 'noshow_pct', 'apps_missed']]

Unnamed: 0_level_0,scheduledday,no-show,num_app,noshow_pct,apps_missed
appointmentid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5566277,2016-04-11T10:09:42Z,0,1,0.0,0.0
5640434,2016-04-29T10:43:19Z,0,2,0.0,0.0
5640443,2016-04-29T10:44:22Z,1,3,0.0,0.0
5653643,2016-05-03T12:59:01Z,0,4,0.333333,1.0
5674766,2016-05-09T11:52:43Z,1,5,0.25,1.0
5685329,2016-05-11T10:10:16Z,0,6,0.4,2.0
5685501,2016-05-11T10:30:03Z,0,7,0.333333,2.0
5716528,2016-05-18T17:48:31Z,0,8,0.285714,2.0
5716529,2016-05-18T17:48:31Z,0,9,0.25,2.0
5719659,2016-05-19T11:45:34Z,0,10,0.222222,2.0


Convert `scheduledday` and `appointmentday` to the datetime format.

In [175]:
df['scheduledday'] = pd.to_datetime(df['scheduledday']).dt.strftime('%Y-%m-%d')
df['scheduledday'] = pd.to_datetime(df['scheduledday'])
df['appointmentday'] = pd.to_datetime(df['appointmentday']).dt.strftime('%Y-%m-%d')
df['appointmentday'] = pd.to_datetime(df['appointmentday'])

Calculate the difference between the day that the appointment was scheduled and when the appointment actually occurred.  
Next we filter out those that have a difference less than zero, as this is likely erroneous data where the appointment occurred before the scheduled date.  
Also, people with an age lower or equal to 0 are filtered out, as these are likely wrong entries.

In [176]:
df['day_diff'] = (df['appointmentday'] - df['scheduledday']).dt.days
# Filter by day_diff
df = df[df['day_diff'] >= 0]
# Filter by age
df = df[df['age'] > 0]

Dummy variables are generated for `handcap` in the next step.

In [177]:
# Convert to Categorical
df['handcap'] = pd.Categorical(df['handcap'])
# Convert to Dummy Variables
Handicap = pd.get_dummies(df['handcap'], prefix = 'handicap')
df = pd.concat([df, Handicap], axis=1)

Unnecessary columns are subsequently dropped.

In [178]:
df.drop(['scheduledday'], axis=1, inplace=True)
df.drop(['appointmentday'], axis=1, inplace=True)
df.drop(['handcap'], axis=1, inplace = True)

## Exploratory analysis

## Machine learning

A random seed was set to ensure reproducability of the data.

In [245]:
np.random.seed(123)

80% of the time people did show up for an appointment, leading to an inbalanced dataset.  
Just always predicting 0, therefore already leads to an accuracy of 80%, making this scoring parameter not very useful.
As we are more interested in the positive class, average precision is used to evaluate different models.

In [246]:
df['no-show'].value_counts(normalize=True)

0    0.797396
1    0.202604
Name: no-show, dtype: float64

Scaling of the data occurred using a robust scaler.

In [247]:
X = df.drop(['no-show'], axis=1)
y = df['no-show']
scaler = RobustScaler()
X = scaler.fit_transform(X)

Next the dataset is split into a training and test set after shuffling and stratification.

In [248]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify = y, test_size = 0.25)

In [257]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
lr = LogisticRegression(solver='newton-cg',)
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_train)

avg_precision = cross_val_score(estimator = lr, X = X_train, y =y_train, cv = 5, scoring='average_precision')
print("avg precision: ",np.mean(avg_precision))
print("acg precision: ",np.std(avg_precision))

avg precision:  0.3252984899031944
acg precision:  0.005361513862354794


In [250]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_train)

avg_precision = cross_val_score(estimator = knn, X = X_train, y =y_train, cv = 5, scoring='average_precision')
print("avg precision: ",np.mean(avg_precision))
print("acg precision: ",np.std(avg_precision))

avg precision:  0.2775209476162753
acg precision:  0.002018414599599695


In [251]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_pred_dtc = dtc.predict(X_train)
clf_report = classification_report(y_train, y_pred_dtc)

avg_precision = cross_val_score(estimator = dtc, X = X_train, y =y_train, cv = 5, scoring='average_precision')
print("avg precision: ",np.mean(avg_precision))
print("acg precision: ",np.std(avg_precision))

avg precision:  0.23993610677770788
acg precision:  0.0018807656178568973


In [252]:
from sklearn.ensemble import RandomForestClassifier
rd_clf = RandomForestClassifier()
rd_clf.fit(X_train, y_train)

y_pred_rd_clf = rd_clf.predict(X_train)
clf_report = classification_report(y_train, y_pred_rd_clf)

avg_precision = cross_val_score(estimator = rd_clf, X = X_train, y =y_train, cv = 5, scoring='average_precision')
print("avg precision: ",np.mean(avg_precision))
print("acg precision: ",np.std(avg_precision))

avg precision:  0.3763614078511817
acg precision:  0.006303351151190706


In [253]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(base_estimator = dtc)
ada.fit(X_train, y_train)

y_pred_ada = ada.predict(X_train)
clf_report = classification_report(y_train, y_pred_ada)

avg_precision = cross_val_score(estimator = ada, X = X_train, y =y_train, cv = 5, scoring='average_precision')
print("avg precision: ",np.mean(avg_precision))
print("acg precision: ",np.std(avg_precision))

avg precision:  0.23951160344975123
acg precision:  0.001746407794692656


The random forest classifier seem to be performing the best.

In [258]:
# Model parameters
from random import random


max_depth = list(np.arange(10,1000, 10))
max_depth.append(None)
n_estimators = list(np.arange(10,1000, 10))
criterion = ['gini', 'entropy']
max_features = list(np.arange(5,50, 1))
max_features.append(None)
max_leaf_nodes = list(np.arange(4,100, 1))
max_leaf_nodes.append(None)
min_samples_split = list(np.arange(2,30, 1))
min_samples_split.append(None)

pipe = Pipeline([
    ('scaler', RobustScaler()),
    ('reg', RandomForestClassifier())
    ])
param_grid = [
    {
        'reg': [RandomForestClassifier(random_state = 123)],
        'reg__n_estimators': n_estimators,
        'reg__criterion': criterion,
        'reg__max_depth': max_depth,
        'reg__max_features': max_features,
        'reg__warm_start': [True], 
        'reg__min_samples_split': min_samples_split,
        'reg__n_jobs': [-1],
        'reg__max_leaf_nodes': max_leaf_nodes,
        'reg__bootstrap': [True]
    },
]
grid_pipeline = RandomizedSearchCV(pipe,param_grid, cv = KFold(5, random_state=123, shuffle = True), n_jobs=-1, n_iter=5, return_train_score=True, scoring = ['accuracy', 'f1', 'roc_auc', 'average_precision'], refit = 'average_precision')

In [259]:
grid_pipeline.fit(X_train,y_train)

In [None]:
grid_pipeline.best_params_

{'reg__warm_start': True,
 'reg__n_jobs': -1,
 'reg__n_estimators': 20,
 'reg__max_depth': 160,
 'reg__bootstrap': True,
 'reg': RandomForestClassifier(max_depth=160, n_estimators=20, n_jobs=-1,
                        random_state=123, warm_start=True)}

In [260]:
print('Accuracy is: {}'.format(round(np.nanmax(grid_pipeline.cv_results_['mean_test_accuracy']), 2)))
print('F1 score is: {}'.format(round(np.nanmax(grid_pipeline.cv_results_['mean_test_f1']), 2)))
print('ROC AUC is: {}'.format(round(np.nanmax(grid_pipeline.cv_results_['mean_test_roc_auc']), 2)))
print('Average precision is: {}'.format(round(np.nanmax(grid_pipeline.cv_results_['mean_test_average_precision']), 2)))

Accuracy is: 0.8
F1 score is: 0.09
ROC AUC is: 0.74
Average precision is: 0.4


On the test set, the model performs as follows:

In [261]:
grid_pipeline_pred = grid_pipeline.predict(X_test)
clf_report = classification_report(y_test, grid_pipeline_pred)
print(f"Classification Report : \n{clf_report}")
print(f"Average precision is : {round(metrics.average_precision_score(y_test, grid_pipeline.predict(X_test)), 2)}")

Classification Report : 
              precision    recall  f1-score   support

           0       0.80      0.99      0.89     21327
           1       0.67      0.05      0.09      5419

    accuracy                           0.80     26746
   macro avg       0.74      0.52      0.49     26746
weighted avg       0.78      0.80      0.73     26746

Average precision is : 0.22


In [263]:
y_test.value_counts()

0    21327
1     5419
Name: no-show, dtype: int64