## Module-1

## Predicting 30-Days Readmission from Hospital Lab Records & Discharge Summary for Heart Failure ICU Patients 



## Inspiration
Hospital Readmissions are a critical problem to address for hospitals.  Readmissions affect hospitals both Clinically and Financially. The hospitals are penalized due to 30-day readmission cases. 

As per CMS guidelines, [link](https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Readmissions-Reduction-Program) 
 - CMS began penalizing hospitals for 30-day readmissions on Oct. 1, 2012, at 1 percent, upping the penalty rate to 2 percent for the fiscal year 2014
 - CMS will cut payments to the penalized hospitals by as much as 3 percent for each Medicare case during fiscal 2020, which runs from Oct. 1 through September 2020

CMS includes the following condition or procedure-specific 30-day risk-standardized unplanned readmission measures in the program:
 - Acute myocardial infarction (AMI)
 - Chronic obstructive pulmonary disease (COPD)
 -  Heart failure (HF)
 -  Pneumonia
 -  Coronary artery bypass graft (CABG) surgery
 -  Elective primary total hip arthroplasty and/or total knee arthroplasty (THA/TKA)

If we can build a predictive model to predict the re-admission cases in advance, hospitals can take preventive action and take special care of those patients with higher re-admission risks.
Also Model should be able to predict the top influencing factors which increase the re-admission risks. Hospitals can focus on these critical factors and plan to prevent the re-admission.

### Project Objective:

 - Build a model which can predict 30-Day Re-admission cases for Heart Failure ICU Patients.
 - Identify how can we extract information from clinical text like clinical notes or discharge summaries (unstructured data) and use the various clinical attributes to train the model.
 - Identify how can we use AWS Comprehend Medical and extract important attributes from clinical text quickly and build a quick prototye model.
 - The model will identify the top factors which increase the re-admission risks in heart failure ICU patients.


In [95]:
import pandas as pd

data_path = ''
# notes = pd.read_csv(note_path+'NOTEEVENTS.csv', skiprows= lambda x: x in [1289580])
lab_items = pd.read_csv(data_path+'lab_items_v01.csv')
print(lab_items.columns)

Index(['subject_id', 'hadm_id', 'admission_type', 'age', 'insurance',
       'ethnicity', 'religion', 'marital_status', 'gender', 'value',
       'valuenum', 'label', 'admittime', 'dischtime'],
      dtype='object')


In [96]:
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [97]:
lab_items.head(5)

Unnamed: 0,subject_id,hadm_id,admission_type,age,insurance,ethnicity,religion,marital_status,gender,value,valuenum,label,admittime,dischtime
0,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,136,136.0,Sodium,2134-09-11 12:17:00.000,2134-09-24 16:15:00.000
1,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,71,71.0,Urea Nitrogen,2134-09-11 12:17:00.000,2134-09-24 16:15:00.000
2,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,84,84.0,Urea Nitrogen,2134-09-11 12:17:00.000,2134-09-24 16:15:00.000
3,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,101,101.0,Urea Nitrogen,2134-09-11 12:17:00.000,2134-09-24 16:15:00.000
4,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,102,102.0,Urea Nitrogen,2134-09-11 12:17:00.000,2134-09-24 16:15:00.000


In [98]:
lab_items['dischtime'] = pd.to_datetime(lab_items['dischtime'],infer_datetime_format=True)
lab_items['admittime'] = pd.to_datetime(lab_items['admittime'],infer_datetime_format=True)

In [99]:
lab_items.head(5)

Unnamed: 0,subject_id,hadm_id,admission_type,age,insurance,ethnicity,religion,marital_status,gender,value,valuenum,label,admittime,dischtime
0,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,136,136.0,Sodium,2134-09-11 12:17:00,2134-09-24 16:15:00
1,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,71,71.0,Urea Nitrogen,2134-09-11 12:17:00,2134-09-24 16:15:00
2,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,84,84.0,Urea Nitrogen,2134-09-11 12:17:00,2134-09-24 16:15:00
3,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,101,101.0,Urea Nitrogen,2134-09-11 12:17:00,2134-09-24 16:15:00
4,21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,102,102.0,Urea Nitrogen,2134-09-11 12:17:00,2134-09-24 16:15:00


In [100]:
group_idx = ['subject_id', 'hadm_id', 'admission_type', 'age', 'insurance',
       'ethnicity', 'religion', 'marital_status', 'gender','label','admittime','dischtime']

In [101]:
grouped_df = lab_items.groupby(group_idx,as_index=False).apply(lambda x: round(x[['valuenum']].mean()))

In [102]:
grouped_df.shape

(21036, 13)

In [103]:
grouped_df = grouped_df.pivot(index=['subject_id', 'hadm_id', 'admission_type', 'age', 'insurance', 'ethnicity', 'religion', 
                        'marital_status', 'gender','admittime','dischtime'],
                 columns='label',
                 values='valuenum').reset_index()

In [104]:
grouped_df.shape

(9584, 15)

In [105]:
grouped_df = grouped_df.set_index(['subject_id','hadm_id'], drop = True)
grouped_df

Unnamed: 0_level_0,label,admission_type,age,insurance,ethnicity,religion,marital_status,gender,admittime,dischtime,"Creatinine, Serum",NTproBNP,Sodium,Urea Nitrogen
subject_id,hadm_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,2134-09-11 12:17:00,2134-09-24 16:15:00,,,138.0,71.0
21,111970,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,2135-01-30 20:50:00,2135-02-08 02:08:00,,,139.0,38.0
34,115799,EMERGENCY,300.0,Medicare,WHITE,CATHOLIC,MARRIED,M,2186-07-18 16:46:00,2186-07-20 16:00:00,,,141.0,25.0
34,144319,EMERGENCY,304.0,Medicare,WHITE,CATHOLIC,MARRIED,M,2191-02-23 05:23:00,2191-02-25 20:20:00,,,141.0,32.0
68,108329,EMERGENCY,41.0,Medicare,BLACK/AFRICAN AMERICAN,PROTESTANT QUAKER,SINGLE,F,2174-01-04 22:21:00,2174-01-19 11:30:00,,64499.0,134.0,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99897,162913,EMERGENCY,53.0,Private,BLACK/HAITIAN,7TH DAY ADVENTIST,MARRIED,M,2181-08-06 02:22:00,2181-08-07 16:30:00,,,129.0,70.0
99897,181057,EMERGENCY,54.0,Private,BLACK/HAITIAN,7TH DAY ADVENTIST,MARRIED,M,2182-07-03 19:50:00,2182-07-08 19:52:00,,,134.0,65.0
99982,112748,EMERGENCY,65.0,Medicare,WHITE,CATHOLIC,MARRIED,M,2157-01-05 17:27:00,2157-01-12 13:00:00,,,137.0,29.0
99982,151454,EMERGENCY,65.0,Medicare,WHITE,CATHOLIC,MARRIED,M,2156-11-28 11:56:00,2156-12-08 13:45:00,,,138.0,22.0


In [106]:
grouped_df['target'] = 2

#### We are setting the target . The readmission cases are selected where number of admissions are more than one with first admission and 30-days duration. 

In [107]:


last_id = 0
for d in grouped_df.index:
    if last_id == d[0]:
        # print(d)
        # (lab_items['dischtime'] - lab_items['admittime']).dt.days
        # disch_dt = np.datetime64(grouped_df.loc[(21,109451),['dischtime']].values[0])
        # admit_dt = np.datetime64(grouped_df.loc[(21,109451),['admittime']].values[0])
        dur = pd.Timedelta(grouped_df.loc[last_d,['dischtime']].values[0] - grouped_df.loc[d,['admittime']].values[0])
        if dur.days < 30:
            grouped_df.loc[last_d,['target']] = 1
        else:
            grouped_df.loc[last_d,['target']] = 0

    last_d = d
    last_id = d[0]

In [108]:
grouped_df['duration'] = (grouped_df['dischtime'] - grouped_df['admittime']).apply(lambda x: x.days)
grouped_df = grouped_df.drop(['admittime','dischtime'],axis=1)

In [109]:
grouped_df.isna().sum()

label
admission_type          0
age                     0
insurance               0
ethnicity               0
religion                0
marital_status          0
gender                  0
Creatinine, Serum    9563
NTproBNP             7766
Sodium                  9
Urea Nitrogen           1
target                  0
duration                0
dtype: int64

In [110]:
grouped_df.fillna(0,inplace=True)

In [111]:
grouped_df

Unnamed: 0_level_0,label,admission_type,age,insurance,ethnicity,religion,marital_status,gender,"Creatinine, Serum",NTproBNP,Sodium,Urea Nitrogen,target,duration
subject_id,hadm_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
21,109451,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,0.0,0.0,138.0,71.0,1,13
21,111970,EMERGENCY,87.0,Medicare,WHITE,JEWISH,MARRIED,M,0.0,0.0,139.0,38.0,2,8
34,115799,EMERGENCY,300.0,Medicare,WHITE,CATHOLIC,MARRIED,M,0.0,0.0,141.0,25.0,1,1
34,144319,EMERGENCY,304.0,Medicare,WHITE,CATHOLIC,MARRIED,M,0.0,0.0,141.0,32.0,2,2
68,108329,EMERGENCY,41.0,Medicare,BLACK/AFRICAN AMERICAN,PROTESTANT QUAKER,SINGLE,F,0.0,64499.0,134.0,40.0,0,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99897,162913,EMERGENCY,53.0,Private,BLACK/HAITIAN,7TH DAY ADVENTIST,MARRIED,M,0.0,0.0,129.0,70.0,1,1
99897,181057,EMERGENCY,54.0,Private,BLACK/HAITIAN,7TH DAY ADVENTIST,MARRIED,M,0.0,0.0,134.0,65.0,2,5
99982,112748,EMERGENCY,65.0,Medicare,WHITE,CATHOLIC,MARRIED,M,0.0,0.0,137.0,29.0,0,6
99982,151454,EMERGENCY,65.0,Medicare,WHITE,CATHOLIC,MARRIED,M,0.0,0.0,138.0,22.0,1,10


In [112]:
grouped_df = grouped_df.loc[~(grouped_df['target']==2)]

In [113]:
grouped_df.target.value_counts()

1    3352
0    2867
Name: target, dtype: int64

In [114]:
grouped_df.columns

Index(['admission_type', 'age', 'insurance', 'ethnicity', 'religion',
       'marital_status', 'gender', 'Creatinine, Serum', 'NTproBNP', 'Sodium',
       'Urea Nitrogen', 'target', 'duration'],
      dtype='object', name='label')

In [115]:
grouped_df.shape

(6219, 13)

In [116]:
ohe_features = ['admission_type', 'insurance', 'ethnicity', 'religion', 'marital_status', 'gender']
# scaling_features = ['age','Creatinine, Serum', 'NTproBNP', 'Sodium', 'Urea Nitrogen', 'duration']
scaling_features = list((set(grouped_df.columns) - set(ohe_features)) - set(['target']))
print(scaling_features)

['Sodium', 'Urea Nitrogen', 'age', 'NTproBNP', 'duration', 'Creatinine, Serum']


In [117]:
len(scaling_features)

6

In [118]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(grouped_df[scaling_features]),columns=scaling_features,index=grouped_df.index)

In [119]:
df_ohe = pd.get_dummies(grouped_df[ohe_features])

In [120]:
df_scaled

Unnamed: 0_level_0,Unnamed: 1_level_0,Sodium,Urea Nitrogen,age,NTproBNP,duration,"Creatinine, Serum"
subject_id,hadm_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
21,109451,-0.088430,1.726392,0.039553,-0.269632,0.219167,-0.035574
34,115799,0.438340,-0.443299,3.368046,-0.269632,-0.814844,-0.035574
68,108329,-0.790791,0.264209,-0.679277,8.322805,0.305334,-0.035574
105,128744,0.087160,-1.056472,-0.773037,-0.269632,-0.556341,-0.035574
107,174162,-0.264021,2.575401,-0.226101,0.276962,-0.642509,-0.035574
...,...,...,...,...,...,...,...
99660,168541,0.087160,0.405711,-0.147968,-0.039165,-0.556341,-0.035574
99883,150755,0.262750,-0.207463,-0.179221,-0.269632,-0.642509,-0.035574
99897,162913,-1.668742,1.679225,-0.491756,-0.269632,-0.814844,-0.035574
99982,112748,-0.264021,-0.254630,-0.304235,-0.269632,-0.384006,-0.035574


In [121]:
df_ohe

Unnamed: 0_level_0,Unnamed: 1_level_0,admission_type_ELECTIVE,admission_type_EMERGENCY,admission_type_URGENT,insurance_Government,insurance_Medicaid,insurance_Medicare,insurance_Private,insurance_Self Pay,ethnicity_AMERICAN INDIAN/ALASKA NATIVE,ethnicity_ASIAN,...,religion_UNOBTAINABLE,marital_status_DIVORCED,marital_status_LIFE PARTNER,marital_status_MARRIED,marital_status_SEPARATED,marital_status_SINGLE,marital_status_UNKNOWN (DEFAULT),marital_status_WIDOWED,gender_F,gender_M
subject_id,hadm_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
21,109451,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
34,115799,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
68,108329,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
105,128744,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
107,174162,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99660,168541,0,1,0,0,0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
99883,150755,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
99897,162913,0,1,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
99982,112748,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [122]:
tmp_df = df_scaled.merge(df_ohe,left_index=True,right_index=True)
df_final = tmp_df.merge(grouped_df['target'],left_index=True,right_index=True)

In [123]:
df_final

Unnamed: 0_level_0,Unnamed: 1_level_0,Sodium,Urea Nitrogen,age,NTproBNP,duration,"Creatinine, Serum",admission_type_ELECTIVE,admission_type_EMERGENCY,admission_type_URGENT,insurance_Government,...,marital_status_DIVORCED,marital_status_LIFE PARTNER,marital_status_MARRIED,marital_status_SEPARATED,marital_status_SINGLE,marital_status_UNKNOWN (DEFAULT),marital_status_WIDOWED,gender_F,gender_M,target
subject_id,hadm_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
21,109451,-0.088430,1.726392,0.039553,-0.269632,0.219167,-0.035574,0,1,0,0,...,0,0,1,0,0,0,0,0,1,1
34,115799,0.438340,-0.443299,3.368046,-0.269632,-0.814844,-0.035574,0,1,0,0,...,0,0,1,0,0,0,0,0,1,1
68,108329,-0.790791,0.264209,-0.679277,8.322805,0.305334,-0.035574,0,1,0,0,...,0,0,0,0,1,0,0,1,0,0
105,128744,0.087160,-1.056472,-0.773037,-0.269632,-0.556341,-0.035574,0,1,0,0,...,0,0,0,0,1,0,0,1,0,1
107,174162,-0.264021,2.575401,-0.226101,0.276962,-0.642509,-0.035574,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99660,168541,0.087160,0.405711,-0.147968,-0.039165,-0.556341,-0.035574,0,1,0,0,...,0,0,1,0,0,0,0,1,0,1
99883,150755,0.262750,-0.207463,-0.179221,-0.269632,-0.642509,-0.035574,0,1,0,0,...,0,0,1,0,0,0,0,0,1,1
99897,162913,-1.668742,1.679225,-0.491756,-0.269632,-0.814844,-0.035574,0,1,0,0,...,0,0,1,0,0,0,0,0,1,1
99982,112748,-0.264021,-0.254630,-0.304235,-0.269632,-0.384006,-0.035574,0,1,0,0,...,0,0,1,0,0,0,0,0,1,0


In [124]:
def predict(model, X, Y, test=True):
    import pandas as pd
    from sklearn.metrics import classification_report
    
    y_pred = pd.DataFrame(model.predict(X))
# print(ConfusionMatrixDisplay.from_predictions(Y, y_pred).confusion_matrix)
    # ConfusionMatrixDisplay.from_predictions(Y, y_pred)
    print(classification_report(Y, y_pred))
    return y_pred


def feature_importance(importance, features, top_n):
    d = {}
    for i,v in enumerate(importance):
        d[features[i]] = v
        # print('feature: %s, score: %.5f' % (features[i],v))
    d_srt = dict(sorted(d.items(), key=lambda x: x[1], reverse=True))
    
    cntr = 0
    print("Feature importance from higher to lower:")
    for k,v in d_srt.items():
        cntr += 1
        print('%s : %.2f' % (k,round(v,2)))
        if cntr > top_n:
            break

def train_and_test(X, y, epochs=100):
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report,confusion_matrix
    import time
    from xgboost import XGBClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.manifold import Isomap


    seed = 7
    test_size = 0.1
    x_train, x_test, y_train, y_test=train_test_split(X, y, test_size=test_size, random_state=seed)    

    models = [
            DecisionTreeClassifier(),
            LogisticRegression(),
            RandomForestClassifier(max_depth=10,min_samples_leaf=20,criterion='entropy', max_features='sqrt', n_estimators=50),
            MLPClassifier(hidden_layer_sizes=(10,30,10),activation='tanh',solver='sgd',alpha=0.0001,learning_rate='adaptive'),
            XGBClassifier(),
            # TODO: LightGBM
            # SVC(),
            # KNeighborsClassifier(n_neighbors = 15)
            ]

    for m in models:
        st=time.time()
        class_name = m.__class__.__name__
        print("#" * 50)
        print(class_name)
        print("-" * len(class_name))
        m.fit(x_train,y_train)

        print("Result on test set")
        # predict_and_report(m, x_test, y_test)
        y_pred = predict(m, x_test, y_test)

        # TODO: need to find the best model and return it back
        print("-" * 50)
        # result = report(original_data, x_test, y_test, y_pred)
        if class_name == 'LogisticRegression':
            feature_importance(m.coef_[0], x_train.columns, 50)
        elif class_name in ['DecisionTreeClassifier', 'RandomForestClassifier']:
            feature_importance(m.feature_importances_, x_train.columns, 20)    

        print("#" * 50)
        #result = report(df_h, x_test, y_test, y_pred)
        

        # if class_name == 'MLPClassifier':
        #result.to_csv(OUTPUT_PATH + class_name + '_test_set_hcode.csv' )
        #classification_report.to_csv(classification_report+'report_'+m+'.csv')
        print()
        et=time.time()
        print('time_elapsed',et-st)

# TODO: torch implementation block
    # import torch
    # from common.torch_mlp import FFNModel,DNNModel

    # ffn = DNNModel(x_train.shape[1])
    # ffn.train(torch.FloatTensor(x_train.values), 
    #           torch.FloatTensor(y_train.values), 
    #           torch.FloatTensor(x_test.values), 
    #           torch.FloatTensor(y_test.values),
    #           epochs = epochs)
# torch implementation block

    return models[0] # TODO: this should be the best model


In [125]:
!pip install xgboost



In [126]:
df_final.columns

Index(['Sodium', 'Urea Nitrogen', 'age', 'NTproBNP', 'duration',
       'Creatinine, Serum', 'admission_type_ELECTIVE',
       'admission_type_EMERGENCY', 'admission_type_URGENT',
       'insurance_Government', 'insurance_Medicaid', 'insurance_Medicare',
       'insurance_Private', 'insurance_Self Pay',
       'ethnicity_AMERICAN INDIAN/ALASKA NATIVE', 'ethnicity_ASIAN',
       'ethnicity_ASIAN - ASIAN INDIAN', 'ethnicity_ASIAN - CAMBODIAN',
       'ethnicity_ASIAN - CHINESE', 'ethnicity_ASIAN - FILIPINO',
       'ethnicity_ASIAN - OTHER', 'ethnicity_ASIAN - THAI',
       'ethnicity_ASIAN - VIETNAMESE', 'ethnicity_BLACK/AFRICAN',
       'ethnicity_BLACK/AFRICAN AMERICAN', 'ethnicity_BLACK/CAPE VERDEAN',
       'ethnicity_BLACK/HAITIAN', 'ethnicity_HISPANIC OR LATINO',
       'ethnicity_HISPANIC/LATINO - CENTRAL AMERICAN (OTHER)',
       'ethnicity_HISPANIC/LATINO - CUBAN',
       'ethnicity_HISPANIC/LATINO - DOMINICAN',
       'ethnicity_HISPANIC/LATINO - GUATEMALAN',
       'ethnicity

In [127]:
X = df_final.drop(['target'],axis=1)
y = pd.DataFrame(df_final['target'],columns=['target'],index=X.index)
model = train_and_test(X, y, epochs=1000)
# save_model(model,model_output_path,model_file_name)

##################################################
DecisionTreeClassifier
----------------------
Result on test set
              precision    recall  f1-score   support

           0       0.47      0.50      0.49       285
           1       0.56      0.53      0.54       337

    accuracy                           0.52       622
   macro avg       0.51      0.51      0.51       622
weighted avg       0.52      0.52      0.52       622

--------------------------------------------------
Feature importance from higher to lower:
age : 0.19
Urea Nitrogen : 0.19
duration : 0.15
Sodium : 0.13
NTproBNP : 0.05
religion_CATHOLIC : 0.02
religion_JEWISH : 0.02
ethnicity_WHITE : 0.02
insurance_Private : 0.02
gender_F : 0.01
insurance_Medicare : 0.01
gender_M : 0.01
marital_status_SINGLE : 0.01
religion_PROTESTANT QUAKER : 0.01
insurance_Medicaid : 0.01
marital_status_DIVORCED : 0.01
marital_status_MARRIED : 0.01
religion_OTHER : 0.01
ethnicity_BLACK/AFRICAN AMERICAN : 0.01
religion_NOT SPECIFIE

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  m.fit(x_train,y_train)


Result on test set
              precision    recall  f1-score   support

           0       0.50      0.26      0.34       285
           1       0.56      0.78      0.65       337

    accuracy                           0.54       622
   macro avg       0.53      0.52      0.50       622
weighted avg       0.53      0.54      0.51       622

--------------------------------------------------
Feature importance from higher to lower:
age : 0.21
Urea Nitrogen : 0.18
duration : 0.14
NTproBNP : 0.10
Sodium : 0.09
admission_type_EMERGENCY : 0.04
admission_type_ELECTIVE : 0.03
insurance_Medicare : 0.03
religion_CATHOLIC : 0.02
marital_status_MARRIED : 0.02
marital_status_WIDOWED : 0.01
religion_NOT SPECIFIED : 0.01
ethnicity_WHITE : 0.01
insurance_Private : 0.01
gender_M : 0.01
marital_status_SINGLE : 0.01
religion_JEWISH : 0.01
gender_F : 0.01
insurance_Medicaid : 0.01
admission_type_URGENT : 0.01
marital_status_DIVORCED : 0.01
##################################################

time_elaps

  y = column_or_1d(y, warn=True)


Result on test set
              precision    recall  f1-score   support

           0       0.50      0.26      0.34       285
           1       0.55      0.78      0.65       337

    accuracy                           0.54       622
   macro avg       0.53      0.52      0.49       622
weighted avg       0.53      0.54      0.51       622

--------------------------------------------------
##################################################

time_elapsed 2.484727144241333
##################################################
XGBClassifier
-------------
Result on test set
              precision    recall  f1-score   support

           0       0.49      0.42      0.45       285
           1       0.56      0.63      0.59       337

    accuracy                           0.53       622
   macro avg       0.53      0.53      0.52       622
weighted avg       0.53      0.53      0.53       622

--------------------------------------------------
############################################

### MODEL EVALUATION

##################################################

DecisionTreeClassifier
----------------------
Result on test set
              precision    recall  f1-score   support

           0       0.50      0.50      0.50       296
           1       0.57      0.56      0.56       342

    accuracy                           0.53       638
   macro avg       0.53      0.53      0.53       638
weighted avg       0.53      0.53      0.53       638

--------------------------------------------------
Feature importance from higher to lower:

age : 0.17
Urea Nitrogen : 0.17
duration : 0.14
Sodium : 0.13
NTproBNP : 0.05
Topic : 0.04
religion_CATHOLIC : 0.02
gender_M : 0.02
marital_status_SINGLE : 0.02
religion_PROTESTANT QUAKER : 0.02
ethnicity_BLACK/AFRICAN AMERICAN : 0.02
religion_JEWISH : 0.02
gender_F : 0.01
marital_status_DIVORCED : 0.01
religion_NOT SPECIFIED : 0.01
marital_status_MARRIED : 0.01
admission_type_ELECTIVE : 0.01
marital_status_WIDOWED : 0.01
insurance_Medicare : 0.01
ethnicity_WHITE : 0.01
religion_OTHER : 0.01

##################################################

time_elapsed 0.04962730407714844

##################################################
LogisticRegression
------------------
Result on test set
              precision    recall  f1-score   support

           0       0.56      0.27      0.37       296
           1       0.56      0.81      0.67       342

    accuracy                           0.56       638
   macro avg       0.56      0.54      0.52       638
weighted avg       0.56      0.56      0.53       638

--------------------------------------------------
Feature importance from higher to lower:
ethnicity_BLACK/AFRICAN : 1.51
religion_CHRISTIAN SCIENTIST : 0.62
ethnicity_WHITE - BRAZILIAN : 0.60
ethnicity_UNABLE TO OBTAIN : 0.46
religion_ROMANIAN EAST. ORTH : 0.45
religion_MUSLIM : 0.45
ethnicity_ASIAN - FILIPINO : 0.43
religion_GREEK ORTHODOX : 0.42
ethnicity_AMERICAN INDIAN/ALASKA NATIVE : 0.37
ethnicity_UNKNOWN/NOT SPECIFIED : 0.36
ethnicity_ASIAN - OTHER : 0.36
ethnicity_PORTUGUESE : 0.34
ethnicity_BLACK/CAPE VERDEAN : 0.32
ethnicity_WHITE - EASTERN EUROPEAN : 0.26
marital_status_SEPARATED : 0.25
insurance_Self Pay : 0.24
ethnicity_PATIENT DECLINED TO ANSWER : 0.24
religion_HEBREW : 0.23
marital_status_MARRIED : 0.23
marital_status_WIDOWED : 0.23
ethnicity_MIDDLE EASTERN : 0.18
admission_type_URGENT : 0.17
ethnicity_HISPANIC/LATINO - GUATEMALAN : 0.16
ethnicity_BLACK/AFRICAN AMERICAN : 0.13
marital_status_SINGLE : 0.13
insurance_Medicaid : 0.12
ethnicity_WHITE : 0.11
admission_type_ELECTIVE : 0.07
duration : 0.07
religion_NOT SPECIFIED : 0.06
gender_M : 0.05
ethnicity_HISPANIC OR LATINO : 0.04
Sodium : 0.04
Creatinine, Serum : 0.03
marital_status_DIVORCED : 0.02
insurance_Private : 0.01
age : 0.01
Topic : 0.01
religion_OTHER : 0.01
gender_F : -0.00
religion_EPISCOPALIAN : -0.01
religion_UNOBTAINABLE : -0.02
marital_status_LIFE PARTNER : -0.04
religion_PROTESTANT QUAKER : -0.05
religion_7TH DAY ADVENTIST : -0.06
religion_JEWISH : -0.08
ethnicity_ASIAN - ASIAN INDIAN : -0.11
religion_BUDDHIST : -0.11
insurance_Medicare : -0.12
ethnicity_ASIAN : -0.12
NTproBNP : -0.13

##################################################

time_elapsed 0.04867839813232422

##################################################


RandomForestClassifier
----------------------


Result on test set
              precision    recall  f1-score   support

           0       0.55      0.21      0.31       296
           1       0.56      0.85      0.67       342

    accuracy                           0.55       638
   macro avg       0.55      0.53      0.49       638
weighted avg       0.55      0.55      0.50       638

--------------------------------------------------
Feature importance from higher to lower:
age : 0.20
Urea Nitrogen : 0.18
duration : 0.11
NTproBNP : 0.11
Sodium : 0.09
Topic : 0.05
admission_type_EMERGENCY : 0.03
religion_CATHOLIC : 0.02
insurance_Medicare : 0.02
admission_type_ELECTIVE : 0.02
gender_M : 0.02
marital_status_MARRIED : 0.01
ethnicity_WHITE : 0.01
gender_F : 0.01
marital_status_WIDOWED : 0.01
insurance_Private : 0.01
marital_status_SINGLE : 0.01
ethnicity_UNKNOWN/NOT SPECIFIED : 0.01
religion_PROTESTANT QUAKER : 0.01
admission_type_URGENT : 0.01
insurance_Medicaid : 0.01

##################################################

time_elapsed 0.17884421348571777

##################################################

MLPClassifier
-------------

Result on test set
              precision    recall  f1-score   support

           0       0.57      0.27      0.37       296
           1       0.57      0.82      0.67       342

    accuracy                           0.57       638
   macro avg       0.57      0.55      0.52       638
weighted avg       0.57      0.57      0.53       638

--------------------------------------------------
##################################################

time_elapsed 2.324270486831665
##################################################
XGBClassifier
-------------
Result on test set
              precision    recall  f1-score   support

           0       0.49      0.42      0.45       296
           1       0.55      0.63      0.59       342

    accuracy                           0.53       638
   macro avg       0.52      0.52      0.52       638
weighted avg       0.53      0.53      0.52       638

--------------------------------------------------

Feature importance from higher to lower:

admission_type_EMERGENCY : 0.04
admission_type_URGENT : 0.04
marital_status_SEPARATED : 0.04
religion_GREEK ORTHODOX : 0.03
marital_status_WIDOWED : 0.03
marital_status_SINGLE : 0.03
age : 0.03
marital_status_DIVORCED : 0.02
ethnicity_BLACK/AFRICAN AMERICAN : 0.02
duration : 0.02
admission_type_ELECTIVE : 0.02
Topic : 0.02
Urea Nitrogen : 0.02
religion_OTHER : 0.02
insurance_Medicaid : 0.02
religion_UNOBTAINABLE : 0.02
ethnicity_HISPANIC OR LATINO : 0.02
insurance_Medicare : 0.02
ethnicity_ASIAN : 0.02
insurance_Government : 0.02
NTproBNP : 0.02
##################################################

time_elapsed 0.45507001876831055
​
​
​


#### Next Step

We will now extract an unstructured discharge summary and process it with **AWS Comprehend Medical** to extract diagnosis information in a semi-structured way. The semi structured text information will be further processed through NLP-based Topic Modelling and build a clustered dataset.
 