# Predicting a 'no show' for a medical appointment based on historical data
This notebook uses a historical dataset from 2016 to predict whether a patient will fail to show up for a medical appointment.
## Packages
The following packages were used.

In [253]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

## Data input
First the dataset was read in.

In [254]:
df = pd.read_csv('data/medical_no_show.csv')
print('Count of rows', str(df.shape[0]))
print('Count of Columns', str(df.shape[1]))
df.head()

Count of rows 110527
Count of Columns 14


,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In the next part we check for missing data.

In [255]:
df.isnull().any().any()

False
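
`df.isnull().any().any()` only answers whether anything at all is missing. Had it returned `True`, a per-column count would show where the gaps are. A minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age; values are made up for illustration
toy = pd.DataFrame({
    'Age': [62, np.nan, 8],
    'Gender': ['F', 'M', 'F'],
})

# Same check as in the notebook: is any value missing at all?
has_missing = bool(toy.isnull().any().any())

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(has_missing)
print(missing_per_column)
```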

No missing data was found, so we next verify the dtype of each column.

In [256]:
df.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

Furthermore, we check how many unique values there are for each column.

In [257]:
for i in df.columns:
    print(i+":",len(df[i].unique()))

PatientId: 62299
AppointmentID: 110527
Gender: 2
ScheduledDay: 103549
AppointmentDay: 27
Age: 104
Neighbourhood: 81
Scholarship: 2
Hipertension: 2
Diabetes: 2
Alcoholism: 2
Handcap: 5
SMS_received: 2
No-show: 2


## Data Cleaning
First, all column names are converted to lowercase for consistency.

In [258]:
df.columns = df.columns.str.lower().str.strip()

The `appointmentid` is set as index for the dataset.

In [259]:
df.set_index('appointmentid', inplace = True)

`patientid` needs to be converted to `int`.  
`no-show` needs to be converted to `int`.  
`gender` needs to be converted to `int`.  

In [260]:
df['patientid'] = df['patientid'].astype('int64')
df['no-show'] = df['no-show'].map({'No':0, 'Yes':1})
df['gender'] = df['gender'].map({'F':0, 'M':1})
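
`Series.map` with a dictionary turns every label that is not a key into `NaN`, so an unexpected spelling would silently introduce missing values. A small sketch of a check that could follow the mapping; the `labels` series is made up, with a lower-case `'yes'` standing in for an unexpected value:

```python
import pandas as pd

# Toy labels; 'yes' (lower case) stands in for an unexpected spelling
labels = pd.Series(['No', 'Yes', 'yes'])
encoded = labels.map({'No': 0, 'Yes': 1})

# Unmapped labels become NaN, so count them before relying on the column
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```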

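`neighbourhood` is converted using one-hot encoding. The column names are lowercased again because the new dummy columns inherit the upper-case neighbourhood names.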

In [261]:
df = pd.get_dummies(df, columns = ['neighbourhood'])
df.columns = df.columns.str.lower().str.strip()

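A few features describing each patient's appointment history were added:
- `num_app`: running count of the patient's appointments, ordered by `scheduledday` (starting at 1)
- `noshow_pct`: fraction of the patient's appointments missed so far, including the current one
- `apps_missed`: cumulative number of appointments missed, accumulated in the dataset's row order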

In [262]:
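# Running appointment count per patient in chronological order (1 for the first appointment)
df['num_app'] = df.sort_values(by = ['patientid','scheduledday']).groupby(['patientid']).cumcount() + 1
# Fraction of appointments missed so far; the cumulative sum includes the current appointment's outcome
df['noshow_pct'] = (df.sort_values(['patientid', 'scheduledday']).groupby(['patientid'])['no-show'].cumsum() / df['num_app'])
# Cumulative misses per patient; unlike the two features above, this follows the current row order, not scheduledday
df['apps_missed'] = df.groupby('patientid')['no-show'].cumsum()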

In [263]:
df[df['patientid'] == 838284762259].sort_values(by = ['patientid','scheduledday'])[['no-show', 'num_app', 'noshow_pct', 'apps_missed']]

appointmentid,no-show,num_app,noshow_pct,apps_missed
5566277,0,1,0.0,0
5640434,0,2,0.0,0
5640443,1,3,0.333333,1
5653643,0,4,0.25,0
5674766,1,5,0.4,2
5685329,0,6,0.333333,0
5685501,0,7,0.285714,0
5716528,0,8,0.25,0
5716529,0,9,0.222222,0
5719659,0,10,0.2,0
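
In the example above, `noshow_pct` on each row includes that row's own `no-show` value: the third appointment is a no-show and `noshow_pct` is already 0.333333 on that row. If the features are meant to describe only the history before an appointment, the current outcome can be left out. A minimal sketch under that assumption; the names `prev_apps`, `prev_missed` and `prev_noshow_pct` and the toy data are hypothetical:

```python
import pandas as pd

# Toy appointments for two patients; values are made up for illustration
toy = pd.DataFrame({
    'patientid':    [1, 1, 1, 2, 2],
    'scheduledday': ['2016-04-01', '2016-04-05', '2016-04-09',
                     '2016-04-02', '2016-04-03'],
    'no-show':      [0, 1, 0, 1, 1],
})

# Sort chronologically per patient so cumulative features follow time order
toy = toy.sort_values(['patientid', 'scheduledday'])

# Number of earlier appointments (0 for a patient's first appointment)
toy['prev_apps'] = toy.groupby('patientid').cumcount()

# Earlier no-shows only: cumulative sum minus the current row's outcome
toy['prev_missed'] = toy.groupby('patientid')['no-show'].cumsum() - toy['no-show']

# Share of earlier appointments missed; 0 when there is no history yet
toy['prev_noshow_pct'] = (toy['prev_missed'] / toy['prev_apps']).fillna(0.0)
print(toy)
```

Subtracting the current row from the cumulative sum keeps the computation vectorised and avoids a per-group `shift`.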


In [264]:
df[(df['no-show'] == 1) & (df['num_app'] > 2)]

appointmentid,patientid,gender,scheduledday,appointmentday,age,scholarship,hipertension,diabetes,alcoholism,handcap,...,neighbourhood_são benedito,neighbourhood_são cristóvão,neighbourhood_são josé,neighbourhood_são pedro,neighbourhood_tabuazeiro,neighbourhood_universitário,neighbourhood_vila rubim,num_app,noshow_pct,apps_missed
5629610,37976483781944,0,2016-04-27T13:46:37Z,2016-04-29T00:00:00Z,18,0,0,0,0,0,...,0,0,0,0,0,0,0,3,1.000000,1
5640178,653745118443,0,2016-04-29T10:13:22Z,2016-04-29T00:00:00Z,33,1,0,0,0,0,...,0,0,0,0,0,0,0,3,0.333333,1
5599192,56548277857,0,2016-04-19T08:35:26Z,2016-04-29T00:00:00Z,40,0,0,0,0,0,...,0,0,0,0,0,0,0,4,0.750000,1
5625977,343735171537732,1,2016-04-27T07:46:31Z,2016-04-29T00:00:00Z,43,0,0,0,0,0,...,0,0,0,0,1,0,0,3,0.666667,1
5637240,236326746564753,1,2016-04-28T18:02:54Z,2016-04-29T00:00:00Z,46,0,1,0,0,0,...,0,0,0,0,0,0,0,3,1.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5574038,2934141357952,0,2016-04-12T14:01:07Z,2016-06-06T00:00:00Z,41,0,0,0,0,0,...,0,0,0,0,0,0,0,4,0.250000,1
5736999,49861634253456,0,2016-05-25T09:01:33Z,2016-06-01T00:00:00Z,57,0,0,0,0,0,...,0,0,0,0,0,0,0,4,0.750000,3
5786741,645634214296344,1,2016-06-08T08:50:19Z,2016-06-08T00:00:00Z,33,0,1,0,0,0,...,0,0,0,0,0,0,0,3,0.666667,1
5779046,85442954737999,0,2016-06-06T17:35:38Z,2016-06-08T00:00:00Z,37,0,1,0,0,0,...,0,0,0,0,0,0,0,3,1.000000,3


Convert `scheduledday` and `appointmentday` to the datetime format, keeping only the date so that both columns can be compared at day resolution.

In [265]:
df['scheduledday'] = pd.to_datetime(df['scheduledday']).dt.strftime('%Y-%m-%d')
df['scheduledday'] = pd.to_datetime(df['scheduledday'])
df['appointmentday'] = pd.to_datetime(df['appointmentday']).dt.strftime('%Y-%m-%d')
df['appointmentday'] = pd.to_datetime(df['appointmentday'])
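
The round trip through `strftime` drops the time of day and the UTC marker. A possible shorter equivalent, shown here on two sample timestamps in the dataset's format, chains `dt.tz_localize(None)` and `dt.normalize()`. This is a sketch of an alternative, not the notebook's own code:

```python
import pandas as pd

# Two timestamps in the dataset's ISO 8601 format
raw = pd.Series(['2016-04-29T18:38:08Z', '2016-04-29T00:00:00Z'])

# Approach used in the notebook: format as a date string, then parse again
via_strftime = pd.to_datetime(pd.to_datetime(raw).dt.strftime('%Y-%m-%d'))

# Alternative: parse as UTC, drop the timezone, and reset the time to midnight
via_normalize = pd.to_datetime(raw, utc=True).dt.tz_localize(None).dt.normalize()

print(via_normalize)
```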

Calculate the difference in days between the day the appointment was scheduled and the day it took place.  
Next, we filter out rows where this difference is negative, as these are likely erroneous entries in which the appointment took place before it was scheduled.  
Patients with an age of 0 or lower are also filtered out, as these are likely wrong entries.

In [266]:
df['day_diff'] = (df['appointmentday'] - df['scheduledday']).dt.days
# Filter by day_diff
df = df[df['day_diff'] >= 0]
# Filter by age
df = df[df['age'] > 0]
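
To see how many rows such filters remove, both conditions can be combined in one boolean mask and counted before filtering. A small sketch on made-up rows (the `toy` frame, `keep` and `n_dropped` are illustrative names):

```python
import pandas as pd

# Toy rows: same-day, two-day wait, and an appointment dated before scheduling
toy = pd.DataFrame({
    'scheduledday':   pd.to_datetime(['2016-04-29', '2016-04-27', '2016-05-10']),
    'appointmentday': pd.to_datetime(['2016-04-29', '2016-04-29', '2016-05-09']),
    'age':            [62, 0, 33],
})

toy['day_diff'] = (toy['appointmentday'] - toy['scheduledday']).dt.days

# Combine both conditions in one mask and count the rows that will be dropped
keep = (toy['day_diff'] >= 0) & (toy['age'] > 0)
n_dropped = int((~keep).sum())
filtered = toy[keep]
print(n_dropped, len(filtered))
```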

Dummy variables are generated for `handcap` in the next step.

In [267]:
# Convert to Categorical
df['handcap'] = pd.Categorical(df['handcap'])
# Convert to Dummy Variables
Handicap = pd.get_dummies(df['handcap'], prefix = 'handicap')
df = pd.concat([df, Handicap], axis=1)
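
Each level of `handcap` becomes its own indicator column, and exactly one of them is set per row. A minimal sketch on made-up values; passing `dtype=int` is an addition here so that the columns hold 0/1 regardless of the default in the installed pandas version:

```python
import pandas as pd

# Toy handicap levels; values are made up for illustration
toy = pd.DataFrame({'handcap': [0, 1, 0, 2]})

# One indicator column per level, named handicap_<level>
dummies = pd.get_dummies(toy['handcap'], prefix='handicap', dtype=int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```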

Unnecessary columns are subsequently dropped.

In [268]:
df.drop(['scheduledday'], axis=1, inplace=True)
df.drop(['appointmentday'], axis=1, inplace=True)
df.drop(['handcap'], axis=1, inplace = True)

## Exploratory analysis

## Machine learning

A random seed was set to ensure reproducibility of the results.

In [269]:
np.random.seed(123)

The features were scaled using a robust scaler.

In [270]:
X = df.drop(['no-show'], axis=1)
y = df['no-show']
scaler = RobustScaler()
X = scaler.fit_transform(X)
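
With its default settings, `RobustScaler` subtracts each feature's median and divides by its interquartile range, which makes it less sensitive to outliers than scaling by mean and standard deviation. The sketch below checks this by hand on one made-up feature with an outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One toy feature with an outlier; values are made up for illustration
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# The same result by hand: (x - median) / (75th percentile - 25th percentile)
median = np.median(x)
iqr = np.percentile(x, 75) - np.percentile(x, 25)
manual = (x - median) / iqr
print(scaled.ravel())
```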

Next the dataset is split into a training and test set after shuffling and stratification.

In [271]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify = y, test_size = 0.25)
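
Stratifying on `y` keeps the share of no-shows roughly equal in the training and test parts. A small sketch on made-up labels with 20% positives; `random_state=123` is an addition here that fixes the shuffle for this call alone, as an alternative to the global seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with 20% positives; values are made up for illustration
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, shuffle=True, stratify=y_toy, test_size=0.25, random_state=123
)

# With stratification both parts keep roughly the original share of positives
print(y_tr.mean(), y_te.mean())
```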

In [283]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
lr = LogisticRegression(solver='newton-cg')
lr.fit(X_train, y_train)
# Accuracy and classification report below are computed on the training set
print(lr.score(X_train,y_train))
y_pred_lr = lr.predict(X_train)
clf_report = classification_report(y_train, y_pred_lr)
print(f"Classification Report : \n{clf_report}")

0.9452988683384017
Classification Report : 
              precision    recall  f1-score   support

           0       0.96      0.98      0.97     63980
           1       0.90      0.82      0.86     16256

    accuracy                           0.95     80236
   macro avg       0.93      0.90      0.91     80236
weighted avg       0.94      0.95      0.94     80236



In [284]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_train)

clf_report = classification_report(y_train, y_pred_knn)
print(f"Classification Report : \n{clf_report}")

Classification Report : 
              precision    recall  f1-score   support

           0       0.97      0.98      0.98     63980
           1       0.92      0.89      0.90     16256

    accuracy                           0.96     80236
   macro avg       0.95      0.93      0.94     80236
weighted avg       0.96      0.96      0.96     80236



In [285]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_pred_dtc = dtc.predict(X_train)
clf_report = classification_report(y_train, y_pred_dtc)

print(f"Classification Report : \n{clf_report}")

Classification Report : 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     63980
           1       1.00      1.00      1.00     16256

    accuracy                           1.00     80236
   macro avg       1.00      1.00      1.00     80236
weighted avg       1.00      1.00      1.00     80236



In [286]:
from sklearn.ensemble import RandomForestClassifier
rd_clf = RandomForestClassifier()
rd_clf.fit(X_train, y_train)

y_pred_rd_clf = rd_clf.predict(X_train)
clf_report = classification_report(y_train, y_pred_rd_clf)

print(f"Classification Report : \n{clf_report}")

Classification Report : 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     63980
           1       1.00      1.00      1.00     16256

    accuracy                           1.00     80236
   macro avg       1.00      1.00      1.00     80236
weighted avg       1.00      1.00      1.00     80236



In [287]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(estimator = dtc)  # this parameter was named base_estimator before scikit-learn 1.2
ada.fit(X_train, y_train)

y_pred_ada = ada.predict(X_train)
clf_report = classification_report(y_train, y_pred_ada)

print(f"Classification Report : \n{clf_report}")

Classification Report : 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     63980
           1       1.00      1.00      1.00     16256

    accuracy                           1.00     80236
   macro avg       1.00      1.00      1.00     80236
weighted avg       1.00      1.00      1.00     80236



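On the training set, the decision tree classifier, the random forest classifier and the AdaBoostClassifier score highest, each with perfect precision and recall. Because all of the reports above are computed on the same data the models were fitted on, these scores measure how well each model reproduces the training labels rather than how well it generalises; the cross-validation below gives an estimate on held-out folds.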
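
The reports above are computed on `X_train`, the data the models were fitted on. The sketch below shows on synthetic data why training scores alone can mislead: an unconstrained decision tree reaches perfect training accuracy even when part of the labels is noise. The data from `make_classification` and all variable names are stand-ins, not the notebook's data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels stands in for the notebook's X and y
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.25, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# An unconstrained tree can reproduce its own training labels
train_acc = tree.score(X_tr, y_tr)
# The held-out score is the one that reflects generalisation
test_acc = tree.score(X_te, y_te)
print(train_acc, test_acc)
```

In the notebook, the equivalent step would be to call `predict` on `X_test` and pass `y_test` to `classification_report`.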

In [279]:
from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(estimator = rd_clf, X = X, y =y, cv = 8)
print("avg acc: ",np.mean(accuracy))
print("acc std: ",np.std(accuracy))

avg acc:  0.952187666833259
acc std:  0.018522848683311624
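
In the cross-validation above, `X` was scaled on all rows beforehand, so the scaling statistics were computed with each fold's held-out rows included. One way to avoid this is to put the scaler and the classifier in a pipeline, so the scaler is refitted on the training part of every fold. A hedged sketch on synthetic data; `make_classification`, the smaller `n_estimators` and the variable names are choices made for this example only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic, unscaled data stands in for the notebook's features and target
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)

# The scaler is refitted on the training part of every fold
model = make_pipeline(
    RobustScaler(),
    RandomForestClassifier(n_estimators=25, random_state=0),
)
scores = cross_val_score(model, X_syn, y_syn, cv=8)
print("avg acc: ", np.mean(scores))
print("acc std: ", np.std(scores))
```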
