# Cyft Data Investigation

Goal:


#### Description:

_30-Day All-Cause Hospital Readmissions_ is a quality measure that many healthcare organizations use to track their performance. Lower readmission rates indicate better patient outcomes, while higher ones tend to indicate system problems that are negatively impacting patients. The goal of this exercise is to analyze a dataset that simulates hospitalizations for a geriatric patient population in 2015 and 2016 to predict __if a patient is likely to have a readmission based on the information available at the time of their initial admission.__

You have 3 hours to complete the exercise. If you don't get through all the objectives, that's OK. After 3 hours, please finish what you're working on and send in whatever code, analyses, and visualizations (such as images) you have available. Include comments documenting any assumptions you've made as well as other ideas you would have tried if you had more time.

Feel free to use the language and statistical/machine learning libraries that you are most comfortable with, and ask questions along the way if any clarifications are necessary.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score
from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
import seaborn as sns



In [2]:
Dxs_summarized = {'A41': 'Sepsis',
 'E11': 'diabetes mellitus',
 'E56': 'Vitamin deficiency',
 'E86': 'Dehydration',
 'F03': 'dementia',
 'F05': 'Delirium',
 'F19': 'Drug abuse',
 'G31': 'Degeneration of nervous system',
 'G89': 'Chronic Pain',
 'H53': 'Visual discomfort',
 'H91': 'Hearing loss',
 'I10': 'hypertension',
 'I50': 'Heart Failure',
 'I51': 'heart disease',
 'J18': 'Pneumonia',
 'J44': 'COPD',
 'J45': 'asthma',
 'M54': 'Radiculopathy/Panniculitis/Sciatica',
 'N18': 'Chronic kidney disease',
 'N39': 'Urinary incontinence',
 'R05': 'cough',
 'R26': 'abnormalities of gait and mobility',
 'R39': 'Urgency of urination',
 'R41': 'Cognitive functions and awareness symptoms',
 'R51': 'headache',
 'T88': 'Anesthesia Complication',
 'W19': 'Fall'}

In [3]:
def add_target(full):
    full['Readmitted'] = full.groupby(['Patient'])['Patient'].transform('count')
    full['Readmitted'] = full['Readmitted'].map({2:1,1:0})
    full['DaysSinceAdmission'] = full[full['Readmitted'] == 1].groupby(['Patient']).diff()['AdmitDate']
    full['<=30Days'] = (full['DaysSinceAdmission'] <= pd.Timedelta('30 days')).astype(np.int)
    full['WillBe<=30Days'] = full[full['Readmitted'] == 1].groupby('Patient').shift(-1)[['<=30Days']].fillna(0).astype(np.int)
    full['WillBe<=30Days'] = full['WillBe<=30Days'].fillna(0)
    full['<=30Days'] = full['WillBe<=30Days']
    full.drop(['Readmitted','DaysSinceAdmission','WillBe<=30Days'],axis=1,inplace=True)
    return full

In [4]:
def add_combined_dx_feats(full_orig):
    full_orig['PrimaryDx_Dx2'] = full_orig['PrimaryDx']+full_orig['Dx2']
    full_orig['PrimaryDx_Dx3'] = full_orig['PrimaryDx']+full_orig['Dx3']
    full_orig['Dx2_Dx3'] = full_orig['Dx2']+full_orig['Dx3']
    return full_orig

In [5]:
def add_month_feat(df):
    df['MonthAdmit'] = df['AdmitDate'].apply(lambda x: x.month)
    df['MonthAdmit'] = df['MonthAdmit'].map({1:'Jan',
                                            2:'Feb',
                                            3:'Mar',
                                            4:'Apr',
                                            5:'May',
                                            6:'Jun',
                                            7:'Jul',
                                            8:'Aug',
                                            9:'Sep',
                                            10:'Oct',
                                            11:'Nov',
                                            12:'Dec'})
    return df

In [6]:
def replace_nulls(full):
    full = full.replace('@NA',np.NaN) # replace @NA
    full = full.replace('',np.NaN) # didn't find any empty strings
    full = full.replace(np.NaN,'')
    return full

#### Load and transform data

Functions created in other eda notebook and copied over

In [7]:
admit = pd.read_csv('../../Cyft/readmissions/admissions.csv')
claims = pd.read_csv('../../Cyft/readmissions/claims.csv')

full = pd.merge(admit,claims,on=['Patient','AdmitDate'])
full['AdmitDate'] = pd.to_datetime(full['AdmitDate'])

full = replace_nulls(full)
full = add_target(full)
#full = add_combined_dx_feats(full)
full = add_month_feat(full)

In [8]:
full_orig = full.copy()

full = full.set_index('AdmitDate')

for col in [col for col in full.columns if full[col].dtype == 'object' and 'Patient' not in col]:
    dummies = pd.get_dummies(full[col],prefix=col)
    full.drop(col,axis=1,inplace=True)
    full = pd.concat([full,dummies],axis=1)    

In [9]:
train = full['2015']
test = full['2016']

train = train.drop(['Patient'],axis=1)
test = test.drop(['Patient'],axis=1)

print('train: {} rows and {} columns'.format(*train.shape))
print('test: {} rows and {} columns'.format(*test.shape))

train: 2856 rows and 49 columns
test: 2938 rows and 49 columns


In [10]:
print('Accuracy baseline is: {:.2f}%'.format(100*(1-full['<=30Days'].mean())))


Accuracy baseline is: 76.44%


In [11]:
# X_train = train[top_feats]
# y_train = train['<=30Days']
# X_test = test[top_feats]
# y_test = test['<=30Days']

In [12]:
X_train = train.drop(['<=30Days'],axis=1)
y_train = train['<=30Days']
X_test = test.drop(['<=30Days'],axis=1)
y_test = test['<=30Days']

ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.fit_transform(X_test)

#### Only Top Feats

In [43]:
top_feats = ['Age',
 'Dx2_E11',
 'Dx2_E86',
 'Dx2_F03',
 'Dx2_I51',
 'Dx2_T88',
 'Dx2_W19',
 'Dx3_',
 'Dx3_J18',
 'Gender_F',
 'Gender_M',
 'LOS',
 'PastPCPVisits',
 'PrimaryDx_E11',
 'PrimaryDx_I50',
 'PrimaryDx_J44',
 'PrimaryDx_N18']

top_feats = ['Age',
 'Dx2_E11',
 'Dx2_E86',
 'Dx2_F03',
 'Dx2_F05',
 'Dx2_F19',
 'Dx2_G31',
 'Dx2_I51',
 'Dx2_R39',
 'Dx2_T88',
 'Dx2_W19',
 'Dx3_',
 'Dx3_H53',
 'Dx3_J18',
 'Dx3_M54',
 'Dx3_R05',
 'Dx3_R26',
 'Gender_F',
 'Gender_M',
 'LOS',
 'PastPCPVisits',
 'PrimaryDx_A41',
 'PrimaryDx_E11',
 'PrimaryDx_I50',
 'PrimaryDx_J44',
 'PrimaryDx_N18']

#### Weighted Ensemble

In [15]:
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.ensemble import VotingClassifier

In [16]:
knn_params = {
    'n_neighbors': 5,
    'weights': 'distance',
    'p': 1
}

lr_params = {
    'random_state': 0,
    'C': 1,
    'class_weight': 'balanced',
    'penalty': 'l1'
}

rf_params = {
    'random_state': 0,
    'class_weight': 'balanced',
    'criterion': 'gini',
    'max_depth': 50,
    'min_samples_split': 80,
    'n_estimators': 400
}

In [78]:
clf1 = KNeighborsClassifier(n_neighbors=5,weights='distance',p=1)
clf2 = LogisticRegression()#class_weight='balanced')
clf3 = RandomForestClassifier(n_estimators=400,class_weight='balanced')#,max_depth=50,min_samples_split=80,criterion='gini')

# clf1 = LogisticRegression(lr_params)
# clf2 = KNeighborsClassifier(knn_params)
# clf3 = RandomForestClassifier(rf_params)

In [79]:
weights = [1,1,1]
eclf = VotingClassifier(estimators=[('knn',clf1),('lr',clf2),('rf',clf3)],weights=weights)

In [93]:
for clf, label in zip([clf1, clf2, clf3, eclf], ['KNN', 'Logistic Regression', 'Random Forest', 'Ensemble']):

    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1')
    print("Accuracy: {:0.3f} (+/- {:0.3f}) [{}]".format(scores.mean(), scores.std(), label))

Accuracy: 0.436 (+/- 0.015) [KNN]
Accuracy: 0.474 (+/- 0.026) [Logistic Regression]
Accuracy: 0.455 (+/- 0.035) [Random Forest]
Accuracy: 0.474 (+/- 0.026) [Ensemble]


In [94]:
df = pd.DataFrame(columns=('w1', 'w2', 'w3', 'mean', 'std'))

i = 0
for w1 in range(1,4):
    for w2 in range(1,4):
        for w3 in range(1,4):

            if len(set((w1,w2,w3))) == 1: # skip if all weights are equal
                continue

            eclf = VotingClassifier(estimators=[('lr',clf1),('knn',clf2),('rf',clf3)], weights=[w1,w2,w3])
            scores = cross_val_score(estimator=eclf,
                                            X=X_train,
                                            y=y_train,
                                            cv=5,
                                            scoring='f1')

            df.loc[i] = [w1, w2, w3, scores.mean(), scores.std()]
            i += 1

In [95]:
df.sort_values(by='mean',ascending=False).head()

Unnamed: 0,w1,w2,w3,mean,std
5,1.0,3.0,1.0,0.474467,0.026103
1,1.0,1.0,3.0,0.45737,0.030004
18,3.0,1.0,3.0,0.451496,0.017208
23,3.0,3.0,2.0,0.451304,0.025148
7,1.0,3.0,3.0,0.450718,0.03196


In [96]:
eclf = VotingClassifier(estimators=[('knn',clf1),('lr',clf2),('rf',clf3)], weights=[1,3,1])

preds = pd.DataFrame()
clfs = [('knn',clf1),('lr',clf2),('rf',clf3),('ens',eclf)]
for name,clf in clfs:
    clf.fit(X_train,y_train)
    clf_pred = clf.predict(X_test)
    preds['{}'.format(name)] = clf_pred

#### Analyze the predictions

This is an interesting outcome. Talk through what the precision vs recall mean and calculate them out if there were real numbers involved

In [97]:
for col in [col for col in preds if '<=30Days' not in col]:
    y_pred = preds[col]
    print('Model: {}'.format(col))
    print('\tprecision: {:.3f}\n\trecall: {:.3f}\n\taccuracy: {:.3f}'.format(precision_score(y_test,y_pred),
                                                       recall_score(y_test,y_pred),
                                                       accuracy_score(y_test,y_pred)))

Model: knn
	precision: 0.421
	recall: 0.374
	accuracy: 0.736
Model: lr
	precision: 0.586
	recall: 0.415
	accuracy: 0.797
Model: rf
	precision: 0.547
	recall: 0.293
	accuracy: 0.780
Model: ens
	precision: 0.586
	recall: 0.415
	accuracy: 0.797
