# Diabetes 130-US hospitals for years 1999-2008

The dataset consists of hospital admissions lasting between 1 and 14 days that did not result in a patient death or discharge to a hospice. Each encounter belongs to a patient diagnosed with diabetes, although the primary diagnosis may be different. During each of the analyzed encounters, lab tests were ordered and medication was administered.
Since we are primarily interested in factors that lead to early readmission, we defined the readmission attribute (outcome) as having two values: “readmitted,” if the patient was readmitted within 30 days of discharge or “otherwise,” which covers both readmission after 30 days and no readmission at all.
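
A minimal sketch of this binary outcome, assuming the raw column uses the labels '<30', '>30' and 'NO'; the toy Series and the name `early_readmission` are illustrative only:

```python
import pandas as pd

# Toy stand-in for the raw 'readmitted' column (labels assumed: '<30', '>30', 'NO').
readmitted = pd.Series(['NO', '<30', '>30', '<30', 'NO'])

# 1 = readmitted within 30 days, 0 = otherwise (later readmission or none at all).
early_readmission = (readmitted == '<30').astype(int)

print(early_readmission.tolist())
```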

#### **Data Set Description**
* ***Encounter ID***: Unique identifier of an encounter
* ***Patient number***: Unique identifier of a patient
* ***Race***: Values: Caucasian, Asian, African American, Hispanic, and other
* ***Gender***: Values: male, female, and unknown/invalid
* ***Age***: Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100)
* ***Weight***: Weight in pounds
* ***Admission type***: Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available
* ***Discharge disposition***: Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available
* ***Admission source***: Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital
* ***Time in hospital***: Integer number of days between admission and discharge
* ***Payer code***: Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay
* ***Medical specialty***: Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon
* ***Number of lab procedures***: Number of lab tests performed during the encounter
* ***Number of procedures***: Number of procedures (other than lab tests) performed during the encounter
* ***Number of medications***: Number of distinct generic names administered during the encounter
* ***Number of outpatient visits***: Number of outpatient visits of the patient in the year preceding the encounter
* ***Number of emergency visits***: Number of emergency visits of the patient in the year preceding the encounter
* ***Number of inpatient visits***: Number of inpatient visits of the patient in the year preceding the encounter
* ***Diagnosis 1***: The primary diagnosis (coded as first three digits of ICD9); 848 distinct values
* ***Diagnosis 2***: Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values
* ***Diagnosis 3***: Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values
* ***Number of diagnoses***: Number of diagnoses entered to the system
* ***Glucose serum test result***: Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured
* ***A1c test result***: Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.
* ***Change of medications***: Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”
* ***Diabetes medications***: Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”
* ***24 features for medications***: For the generic names metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed
* ***Readmitted***: Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission

In [None]:
import os
import pandas as pd
import numpy as np

In [None]:
# keep_default_na takes a boolean; False keeps markers such as '?' and 'None' as plain strings
df = pd.read_csv('diabetic_data.csv', keep_default_na=False)
df.head(10)

In [None]:
df.shape

In [None]:
df.info()

## Missing Values

In [None]:
def get_missing_values(dataframe):
  '''Return a table with each object-typed feature, its number of '?' values, and that number as a fraction of the rows.'''
  rows = []
  for column in dataframe.columns:
    if dataframe[column].dtype == object:
      # missing values are encoded as '?' in this dataset
      quantity = (dataframe[column] == '?').sum()
      percentage = quantity / dataframe[column].count()
      rows.append({"feature": column,
                   "quantity": quantity,
                   "percentage": percentage})
  # DataFrame.append was removed in pandas 2.0, so build the table from a list of rows
  return pd.DataFrame(rows, columns=['feature', 'quantity', 'percentage'])

In [None]:
missing_values = get_missing_values(df)
missing_values

In [None]:
df.gender.unique()

In [None]:
print('gender',df[df['gender']=='Unknown/Invalid']['gender'].count())

Drop variables with at least 39% missing values

In [None]:
null_features = missing_values[missing_values['percentage']>= 0.39]['feature']
null_features

In [None]:
for feature in null_features:
  df = df.drop(feature,axis=1)

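Next we drop rows that have "?" in race or in any of the three diagnosis columns, rows with an unknown/invalid gender, and rows with discharge_disposition_id 11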

In [None]:
# Flag rows with a missing diagnosis or race, an invalid gender, or discharge disposition 11
drop_mask = (
    (df['diag_1'] == '?') | (df['diag_2'] == '?') | (df['diag_3'] == '?')
    | (df['race'] == '?')
    | (df['discharge_disposition_id'] == 11)  # 11 corresponds to "Expired" in the dataset's id mapping
    | (df['gender'] == 'Unknown/Invalid')
)
df = df[~drop_mask]

In [None]:
df.shape

We are also going to delete these two features, which have the same value in all rows
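
Constant columns can also be found generically rather than by name. A small sketch on a toy frame (hypothetical data; `constant_cols` is an illustrative name, not something defined in this notebook):

```python
import pandas as pd

# Toy frame: 'b' and 'c' hold a single value in every row.
toy = pd.DataFrame({'a': ['No', 'Steady', 'No'],
                    'b': ['No', 'No', 'No'],
                    'c': [0, 0, 0]})

# A column is constant when it has exactly one distinct value.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
```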

In [None]:
# Series.append was removed in pandas 2.0; pd.concat gives the same combined series
pd.concat([df['citoglipton'], df['examide']]).unique()

In [None]:
df = df.drop(['citoglipton','examide'], axis = 1)

In [None]:
df.shape

## Exploratory Data Analysis

#### Readmitted

In [None]:
df['readmitted'].value_counts()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df.readmitted.value_counts().plot.pie(autopct = "%.1f%%")
plt.title("Proportion of Target Value")
plt.show()

#### Race

In [None]:
df.race.value_counts()

In [None]:
sns.countplot(x=df.race, data = df)
plt.xticks(rotation=90)
plt.title("Number of Race values")
plt.show()

print("Proportion of Race")
print(df.race.value_counts(normalize = True)*100)

In [None]:
mapped_race = {"Asian":"Other","Hispanic":"Other"}
#df.race = df.race.replace(mapped_race)

sns.countplot(x=df.race.replace(mapped_race), data = df)
plt.title("Number of Race values")
plt.show()

print("Proportion of Race After the Mapping")
print(df.race.replace(mapped_race).value_counts(normalize= True)*100)

In [None]:
sns.countplot(x="race", hue= "readmitted", data = df)
plt.title("Readmitted by Race")
plt.show()

In [None]:
fig = plt.figure(figsize=(8,8))
sns.countplot(y = df['race'], hue = df['readmitted'])

#### Gender

In [None]:
sns.countplot(x = "gender", data = df)
plt.title("Distribution of Number of Gender")
plt.show()

print("Proportions of gender")
print(df.gender.value_counts(normalize = True))

In [None]:
sns.countplot(x = "gender", hue = "readmitted", data = df)
plt.title("Gender - Readmitted")
plt.show()

#### Age

In [None]:
sns.countplot(x="age", data = df)
plt.xticks(rotation = 90)
plt.show()

In [None]:
df.age = df.age.replace({"[70-80)":75,
                         "[60-70)":65,
                         "[50-60)":55,
                         "[80-90)":85,
                         "[40-50)":45,
                         "[30-40)":35,
                         "[90-100)":95,
                         "[20-30)":25,
                         "[10-20)":15,
                         "[0-10)":5})

sns.countplot(x="age", data = df)
#plt.xticks(rotation = 90)
plt.show()

In [None]:
fig = plt.figure(figsize=(15,10))
sns.countplot(y= df['age'], hue = df['readmitted']).set_title('Age of Patient VS. Readmission')

#### Admission Type ID


* Emergency : 1
* Urgent : 2
* Elective : 3
* Newborn : 4
* Not Available : 5
* NULL : 6
* Trauma Center : 7
* Not Mapped : 8

In [None]:
sns.countplot(x = "admission_type_id", data = df)
plt.title("Distribution of Admission IDs")
plt.show()

print("Distribution of Admission IDs")
print(df.admission_type_id.value_counts())

In [None]:
mapped = {1.0:"Emergency",
          2.0:"Emergency",
          3.0:"Elective",
          4.0:"New Born",
          5.0:np.nan,
          6.0:np.nan,
          7.0:"Trauma Center",
          8.0:np.nan}

sns.countplot(x = df.admission_type_id.replace(mapped), data = df)
plt.title("-Distribution of Admission IDs-")
plt.show()

print("-Distribution of ID's-")
print(df.admission_type_id.replace(mapped).value_counts())

In [None]:
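# 'readmitted' is still a string here, so plot a 0/1 indicator of readmission within 30 days
plot_df = df.assign(readmitted_lt30=(df['readmitted'] == '<30').astype(int))
g = sns.catplot(x = "admission_type_id", y ="readmitted_lt30", 
                    data = plot_df, height = 6, kind = "bar")
g.set_ylabels("Readmitted Probability")
plt.show()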

#### Discharge Disposition ID

Integer identifier corresponding to 29 distinct values. For example, discharged to home, expired, and not available

In [None]:
sns.countplot(x ="discharge_disposition_id", data = df)
plt.show()

#### Admission Source ID

Integer identifier corresponding to 21 distinct values. For example, physician referral, emergency room, and transfer from a hospital

In [None]:
sns.countplot(x = "admission_source_id", data = df)
plt.show()

* We'll group similar sources together, such as Referral or Transfer
* We will replace Null, Not Mapped, and Unknown values with NaN

The readmission probability for Referral is very close to that for Emergency, although Emergency has more samples than the other groups
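
The grouping described above could look roughly like the sketch below. The id-to-group assignments in `source_groups` are an assumption made for illustration and should be checked against the id mapping that ships with the dataset; the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: illustrative id-to-group assignments; verify against the dataset's id mapping.
source_groups = {1: 'Referral', 2: 'Referral', 3: 'Referral',
                 4: 'Transfer', 5: 'Transfer', 6: 'Transfer',
                 7: 'Emergency',
                 9: np.nan, 17: np.nan, 20: np.nan, 21: np.nan}

toy = pd.DataFrame({'admission_source_id': [1, 7, 7, 4, 17, 2],
                    'readmitted': ['<30', 'NO', '<30', '>30', 'NO', 'NO']})

toy['admission_source'] = toy['admission_source_id'].map(source_groups)
toy['early'] = (toy['readmitted'] == '<30').astype(int)

# Share of early readmissions and sample count per group (rows mapped to NaN are skipped).
summary = toy.groupby('admission_source')['early'].agg(['mean', 'count'])
print(summary)
```

Reporting the count next to the mean makes it easier to see when two groups have similar readmission rates but very different sample sizes.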

#### Number of Lab Procedures

Number of lab tests performed during the encounter

In [None]:
sns.countplot(x = "num_lab_procedures", data = df)
plt.show()

print("Proportions of Column")
print(df.num_lab_procedures.value_counts().head(10))

#### Time in Hospital VS. Readmission

In [None]:
fig = plt.figure(figsize=(10,8))
# 'shade' is deprecated in seaborn; 'fill' is its replacement
ax=sns.kdeplot(df.loc[(df['readmitted'] != '<30'),'time_in_hospital'] , color='g',fill=True,label='Not Readmitted')
ax=sns.kdeplot(df.loc[(df['readmitted'] == '<30'),'time_in_hospital'] , color='r',fill=True, label='Readmitted')
ax.set(xlabel='Time in Hospital', ylabel='Frequency')
ax.legend(loc='upper right')
plt.title('Time in Hospital VS. Readmission')
plt.show()

Most patients stayed 2 to 3 days in hospital

In [None]:
fig = plt.figure(figsize=(10,5))

# 'shade' is deprecated in seaborn; 'fill' is its replacement
ax = sns.kdeplot(df.loc[(df.readmitted == 'NO'), "num_lab_procedures"],
                 color = "g", fill = True,label = "Not Readmitted")

ax = sns.kdeplot(df.loc[(df.readmitted != 'NO'), "num_lab_procedures"],
                 color = "r", fill = True, label = "Readmitted")

ax.legend(loc="upper right")
ax.set_xlabel("Number of Lab Procedures")
ax.set_ylabel("Frequency")
ax.set_title("Number of Lab Procedures - Readmission")

plt.show()

## Feature Engineering

The numbers of outpatient, inpatient, and emergency room visits in the year before the hospitalization together measure how much a person has used hospital services.

In [None]:
df['hospital_service_usage'] = df['number_inpatient'] + df['number_outpatient'] + df['number_emergency']

#### Number of medication changes

We are going to create a new feature that counts the medications whose dosage changed during the encounter

In [None]:
drugs = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone','metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin', 'troglitazone', 'tolbutamide', 'acetohexamide']
for col in drugs:
    colname = str(col) + 'temp'
    df[colname] = df[col].apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1) # here we care about changes in the drug so put 1 if 'Up' or 'Down'
df['numchange'] = 0
for col in drugs:
    colname = str(col) + 'temp'
    df['numchange'] = df['numchange'] + df[colname]
    del df[colname]
    
df['numchange'].value_counts() 

Now we can check the patients with the most changes in their drugs

In [None]:
df[df.numchange == 4]

## Encoding variables

In [None]:
df.change.unique(), df.gender.unique() ,df.diabetesMed.unique()

In [None]:
df['change'] = df['change'].replace('Ch', 1)
df['change'] = df['change'].replace('No', 0)
df['gender'] = df['gender'].replace('Male', 1)
df['gender'] = df['gender'].replace('Female', 0)
df['diabetesMed'] = df['diabetesMed'].replace('Yes', 1)
df['diabetesMed'] = df['diabetesMed'].replace('No', 0)
# drugs is the same as before
for col in drugs: #here we care about having or not the drug
    df[col] = df[col].replace('No', 0)
    df[col] = df[col].replace('Steady', 1)
    df[col] = df[col].replace('Up', 1)
    df[col] = df[col].replace('Down', 1)

We also reduced both A1C test result and Glucose serum test result into categories of Normal, Abnormal and Not tested.

In [None]:
df.A1Cresult.unique(), df.max_glu_serum.unique()

In [None]:
df['A1Cresult'] = df['A1Cresult'].replace('>7', 1)
df['A1Cresult'] = df['A1Cresult'].replace('>8', 1)
df['A1Cresult'] = df['A1Cresult'].replace('Norm', 0)
df['A1Cresult'] = df['A1Cresult'].replace('None', -99)
df['max_glu_serum'] = df['max_glu_serum'].replace('>200', 1)
df['max_glu_serum'] = df['max_glu_serum'].replace('>300', 1)
df['max_glu_serum'] = df['max_glu_serum'].replace('Norm', 0)
df['max_glu_serum'] = df['max_glu_serum'].replace('None', -99)

In [None]:
df.shape

Some patients in the dataset had more than one encounter. We can't count these as independent encounters because that would bias the results towards patients who had several encounters.

We can consider the first and the last encounter separately as possible representations of multiple encounters, so we are going to evaluate the balance of the data to see which approach is better.
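
To make the two options concrete, here is a small sketch on toy data (hypothetical values, rows assumed to be in chronological order). It also shows how `duplicated` relates to `drop_duplicates`, since the two are easy to confuse:

```python
import pandas as pd

# Toy encounters: patient 1 has three encounters, patient 2 has one.
toy = pd.DataFrame({'encounter_id': [10, 11, 12, 13],
                    'patient_nbr': [1, 1, 2, 1],
                    'readmitted': ['<30', '>30', 'NO', 'NO']})

# One row per patient: the first or the last encounter.
first_enc = toy.drop_duplicates(subset=['patient_nbr'], keep='first')
last_enc = toy.drop_duplicates(subset=['patient_nbr'], keep='last')

# duplicated(keep='last') marks the rows that drop_duplicates(keep='last') removes,
# i.e. every encounter except each patient's last one.
dropped = toy[toy.duplicated(subset=['patient_nbr'], keep='last')]

for name, frame in [('first', first_enc), ('last', last_enc)]:
    rate = (frame['readmitted'] == '<30').mean()
    print(name, frame['encounter_id'].tolist(), f'{rate:.0%} readmitted within 30 days')
```

Selecting with `~toy.duplicated(...)` gives the same rows as `drop_duplicates(...)`; selecting without the negation gives the complement.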


#### Readmissions vs No Readmissions using this approach

As defined at the top of this notebook, the outcome has two values: “readmitted,” if the patient was readmitted within 30 days of discharge, and “otherwise,” which covers both readmission after 30 days and no readmission at all.

In [None]:
# last-encounter approach; note that duplicated(keep='last') flags every encounter except each patient's last one
duplicated_last_approach = df[df.duplicated(subset=['patient_nbr'], keep='last')]
len(duplicated_last_approach)

In [None]:
last_encounter_readmission = duplicated_last_approach[duplicated_last_approach['readmitted'] == '<30']
len(last_encounter_readmission)

In [None]:
percentage_of_readmission = len(last_encounter_readmission)/len(duplicated_last_approach)
print(f'keeping the last encounters we get {len(duplicated_last_approach)} records, of which {round(percentage_of_readmission * 100)}% were readmitted within 30 days')

In [None]:
# first-encounter approach; note that duplicated(keep='first') flags every encounter except each patient's first one
duplicated_first_approach = df[df.duplicated(subset=['patient_nbr'], keep='first')]
len(duplicated_first_approach)

In [None]:
first_encounter_readmission = duplicated_first_approach[duplicated_first_approach['readmitted'] == '<30']
len(first_encounter_readmission)

In [None]:
percentage_of_readmission = len(first_encounter_readmission)/len(duplicated_first_approach)
print(f'keeping the first encounters we get {len(duplicated_first_approach)} records, of which {round(percentage_of_readmission * 100)}% were readmitted within 30 days')

Using the last-encounter approach we end up with less imbalanced data for readmissions (27/73 readmissions vs. no readmissions), so we are going to use the last encounter of each patient

In [None]:
df = df.drop_duplicates(subset= ['patient_nbr'], keep = 'last')
df.shape

In [None]:
df['readmitted'] = df['readmitted'].replace('>30', 0)
df['readmitted'] = df['readmitted'].replace('<30', 1)
df['readmitted'] = df['readmitted'].replace('NO', 0)

df.readmitted.value_counts().plot.pie(autopct = "%.1f%%")
plt.title("Proportion of Target Value")
plt.show()

## Pre-Modeling Data Preprocessing

In [None]:
# convert data type of nominal features in dataframe to 'object' type
nominal_cols = ['encounter_id', 'patient_nbr', 'gender', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
          'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide',
          'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose','miglitol',
          'troglitazone', 'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
          'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed',
          'age', 'max_glu_serum']  # 'A1Cresult' was listed twice; once is enough

df[nominal_cols] = df[nominal_cols].astype('object')

In [None]:
df.dtypes

In [None]:
# number of medications used: the drug columns were cast to object above,
# so cast them back to integers before summing to keep num_med numeric
df['num_med'] = df[drugs].astype('int64').sum(axis=1)
df['num_med'].value_counts()

In [None]:
# get list of only numeric features
num_col = list(set(df.select_dtypes(include='number').columns) - {'readmitted'})
num_col

In [None]:
# Reduce skewness and kurtosis with a log transformation when both exceed a threshold of 2

statdataframe = pd.DataFrame()
statdataframe['numeric_column'] = num_col
skew_before = []
skew_after = []

kurt_before = []
kurt_after = []

standard_deviation_before = []
standard_deviation_after = []

log_transform_needed = []

log_type = []

for i in num_col:
    skewval = df[i].skew()
    skew_before.append(skewval)
    
    kurtval = df[i].kurtosis()
    kurt_before.append(kurtval)
    
    sdval = df[i].std()
    standard_deviation_before.append(sdval)
    
    if (abs(skewval) >2) & (abs(kurtval) >2):
        log_transform_needed.append('Yes')
        
        if len(df[df[i] == 0])/len(df) <=0.02:
            log_type.append('log')
            # 'train_data' is not defined in this notebook; the filter belongs on df itself
            skewvalnew = np.log(pd.DataFrame(df[df[i] > 0])[i]).skew()
            skew_after.append(skewvalnew)

            kurtvalnew = np.log(pd.DataFrame(df[df[i] > 0])[i]).kurtosis()
            kurt_after.append(kurtvalnew)

            sdvalnew = np.log(pd.DataFrame(df[df[i] > 0])[i]).std()
            standard_deviation_after.append(sdvalnew)
            
        else:
            log_type.append('log1p')
            skewvalnew = np.log1p(pd.DataFrame(df[df[i] >= 0])[i]).skew()
            skew_after.append(skewvalnew)
        
            kurtvalnew = np.log1p(pd.DataFrame(df[df[i] >= 0])[i]).kurtosis()
            kurt_after.append(kurtvalnew)
            
            sdvalnew = np.log1p(pd.DataFrame(df[df[i] >= 0])[i]).std()
            standard_deviation_after.append(sdvalnew)
            
    else:
        log_type.append('NA')
        log_transform_needed.append('No')
        
        skew_after.append(skewval)
        kurt_after.append(kurtval)
        standard_deviation_after.append(sdval)

statdataframe['skew_before'] = skew_before
statdataframe['kurtosis_before'] = kurt_before
statdataframe['standard_deviation_before'] = standard_deviation_before
statdataframe['log_transform_needed'] = log_transform_needed
statdataframe['log_type'] = log_type
statdataframe['skew_after'] = skew_after
statdataframe['kurtosis_after'] = kurt_after
statdataframe['standard_deviation_after'] = standard_deviation_after

In [None]:
statdataframe

In [None]:
# perform log transformation.

for i in range(len(statdataframe)):
    if statdataframe['log_transform_needed'][i] == 'Yes':
        colname = str(statdataframe['numeric_column'][i])
        
        if statdataframe['log_type'][i] == 'log':
            df = df[df[colname] > 0]
            df[colname + "_log"] = np.log(df[colname])
            
        elif statdataframe['log_type'][i] == 'log1p':
            df = df[df[colname] >= 0]
            df[colname + "_log1p"] = np.log1p(df[colname])

In [None]:
# drop the original columns that now have log1p-transformed versions
df = df.drop(['number_outpatient', 'number_inpatient', 'number_emergency','hospital_service_usage'], axis = 1)

In [None]:
len(df.columns)

In [None]:
# get list of only numeric features
numerics = list(set(df.select_dtypes(include='number').columns) - {'readmitted'})
numerics

In [None]:
# correlation matrix; DataFrame.corr() uses the Pearson coefficient by default
sns.heatmap(df[numerics].corr(), annot =  True)


### bivariate analysis of related features

In [None]:
#number of emergency visit/hospital usage
var = 'number_emergency_log1p'
data = pd.concat([df['hospital_service_usage_log1p'], df[var]], axis=1)
data.plot.scatter(x=var, y='hospital_service_usage_log1p');

In [None]:
#number of inpatient visit/hospital usage
var = 'number_inpatient_log1p'
data = pd.concat([df['hospital_service_usage_log1p'], df[var]], axis=1)
data.plot.scatter(x=var, y='hospital_service_usage_log1p');

In [None]:
#number of outpatient visit/hospital usage
var = 'number_outpatient_log1p'
data = pd.concat([df['hospital_service_usage_log1p'], df[var]], axis=1)
data.plot.scatter(x=var, y='hospital_service_usage_log1p');

In [None]:
# cast the identifier and binary columns back to int64
df.encounter_id = df.encounter_id.astype('int64')
df.patient_nbr = df.patient_nbr.astype('int64')
df.diabetesMed = df.diabetesMed.astype('int64')
df.change = df.change.astype('int64')

# convert the medication indicators and A1Cresult from 'object' back to int64
i = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', \
          'glipizide', 'glyburide', 'tolbutamide','pioglitazone', 'rosiglitazone', 'acarbose','miglitol', \
          'troglitazone', 'tolazamide','insulin', 'glyburide-metformin', 'glipizide-metformin', \
          'glimepiride-pioglitazone','metformin-rosiglitazone', 'metformin-pioglitazone','A1Cresult']
df[i] = df[i].astype('int64')

In [None]:
dfcopy = df.copy(deep = True)

In [None]:
df.shape

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# standardize the numeric columns and write the result back into the dataframe
df[numerics] = scaler.fit_transform(df[numerics])

In [None]:
df[numerics]

In [None]:
df.columns

In [None]:
non_num_cols = ['race', 'gender', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 
                'max_glu_serum', 'A1Cresult' ]

In [None]:
# num_cols = list(set(list(df._get_numeric_data().columns))- {'readmitted', 'change'})
# len(num_cols)

### One Hot Encoder - variables with many categories

In [None]:
#how many categories each variable has
for col in df[non_num_cols]:
    print(f'{col}: {len(df[col].unique())} categories')

The features 'discharge_disposition_id' and 'admission_source_id' have a lot of categories; if we use one-hot encoding on them we end up with 40 new features.

In the winning solution of the KDD 2009 cup, "Winning the KDD Cup Orange Challenge with Ensemble Selection", the authors limit one-hot encoding to the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, which in this case is dropped. Thus, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation
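
The same idea can be wrapped in a small reusable function. This is only a sketch: `one_hot_top_k` is a hypothetical helper name, and the toy frame stands in for the real data.

```python
import pandas as pd

def one_hot_top_k(frame, column, k=10):
    """Add 0/1 indicator columns for the k most frequent labels of `column`.

    Hypothetical helper for illustration; rows with rarer labels get all-zero indicators.
    """
    top_labels = frame[column].value_counts().head(k).index
    out = frame.copy()
    for label in top_labels:
        out[f'{column}_{label}'] = (out[column] == label).astype(int)
    return out.drop(columns=[column])

toy = pd.DataFrame({'admission_source_id': [7, 7, 7, 1, 1, 4]})
encoded = one_hot_top_k(toy, 'admission_source_id', k=2)
print(encoded)
```

Passing the column name as an argument avoids reusing the label list of one variable when encoding another.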

In [None]:
#lets find the top 10 most frequent categories for the variable discharge_disposition_id
df.discharge_disposition_id.value_counts().sort_values(ascending = False).head(10)

In [None]:
#lets make a list with the most frequent categories of the variable
top_10_dd = [ x for x in df.discharge_disposition_id.value_counts().sort_values(ascending=False).head(10).index]
top_10_dd

In [None]:
# and now we make the 10 binary variables
for label in top_10_dd:
  df['discharge_disposition_id'+'_'+ str(label)]= np.where(df['discharge_disposition_id']==label,1,0)

In [None]:
df.shape

In [None]:
df = df.drop(columns= ['discharge_disposition_id'])

Same process for admission_source_id

In [None]:
top_10_as = [ x for x in df.admission_source_id.value_counts().sort_values(ascending=False).head(10).index]
top_10_as

In [None]:
for label_ in top_10_as:
  df['admission_source_id'+'_'+ str(label_)]= np.where(df['admission_source_id']==label_,1,0)

In [None]:
df = df.drop(columns= ['admission_source_id'])

In [None]:
df.columns

In [None]:
df = df.drop(columns= ['encounter_id','patient_nbr','diag_1','diag_2','diag_3'])

In [None]:
df.A1Cresult.unique()

In [None]:
#showing some columns
df.iloc[:10,37:47]

In [None]:
df = pd.get_dummies(df,columns=['race', 'admission_type_id',
                                                   'max_glu_serum', 'A1Cresult' ],drop_first = True)

In [None]:
df.shape

In [None]:
df.readmitted.unique()

In [None]:
X = df.loc[:, ~df.columns.isin(['readmitted'])]
y = df['readmitted']

In [None]:
X.shape

In [None]:
df['readmitted'].value_counts()

In [None]:
X.columns

## Logistic Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
logit = LogisticRegression(fit_intercept=True, penalty='l2',solver='lbfgs', max_iter=1000) 
logit.fit(X_train, y_train)

In [None]:
logit_pred = logit.predict(X_test)

In [None]:
logit.score(X_test,y_test)

In [None]:
from sklearn.metrics import confusion_matrix
cmtx = pd.DataFrame(
    confusion_matrix(y_test, logit_pred), 
    index=['actual:0', 'actual:1'], 
    columns=['pred:0', 'pred:1']
)
print(cmtx)


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score , f1_score
print(f'Accuracy is {accuracy_score(y_test, logit_pred):.2f}')
print(f'Precision is {precision_score(y_test, logit_pred):.2f}')
print(f'Recall is {recall_score(y_test, logit_pred):.2f}')
print(f'f1Score is {f1_score(y_test, logit_pred):.2f}')

Since our target variable has a class imbalance problem, we will use the SMOTE technique to address it
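
SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step on toy points, using a single nearest neighbour; it is a simplified illustration, not the imblearn implementation, and `synthesize` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points (two features each); purely illustrative.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthesize(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                      # ignore the point itself
    j = int(np.argmin(dists))              # nearest minority neighbour (k = 1 here)
    lam = rng.random()                     # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_point = synthesize(minority, rng)
print(new_point)
```

Because the new point lies on the segment between two existing minority samples, it stays inside the region the minority class already occupies.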

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter
print(f'Original dataset shape {Counter(y_train)}')
sm = SMOTE(random_state=20)
train_input_new, train_output_new = sm.fit_resample(X_train, y_train)
print(f'New dataset shape {Counter(train_output_new)}')

In [None]:
train_input_new = pd.DataFrame(train_input_new, columns = list(X.columns))
# train_test_split and LogisticRegression were imported above; this split is taken from the resampled training data
X_train, X_test, y_train, y_test = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)
logit = LogisticRegression(fit_intercept=True, penalty='l2', solver='lbfgs', max_iter=1000)
logit.fit(X_train, y_train)

In [None]:
logit_pred = logit.predict(X_test)

In [None]:
cmtx = pd.DataFrame(
    confusion_matrix(y_test, logit_pred), 
    index=['actual:0', 'actual:1'], 
    columns=['pred:0', 'pred:1']
)
print(cmtx)

In [None]:
print(f"Accuracy is {accuracy_score(y_test, logit_pred):.2f}")
print(f"Precision is {precision_score(y_test, logit_pred):.2f}")
print(f"Recall is {recall_score(y_test, logit_pred):.2f}")
print(f"f1Score is {f1_score(y_test, logit_pred):.2f}")

We give more importance to recall because we consider it more relevant to correctly detect the patients who will be readmitted
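
As a reminder of what these metrics measure, here is the arithmetic on made-up confusion-matrix counts (the numbers are hypothetical and are not results from this notebook):

```python
# Hypothetical confusion-matrix counts, chosen only to illustrate the arithmetic.
tn, fp, fn, tp = 80, 10, 6, 4

precision = tp / (tp + fp)   # share of predicted readmissions that were real
recall = tp / (tp + fn)      # share of real readmissions that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Recall only falls when a real readmission is missed (a false negative), which is why it is the metric to watch when missing a readmission is the costlier error.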

In [None]:
df._get_numeric_data()

## Decision Tree

In [None]:
feature_set_no_int = ['age', 'time_in_hospital', 'num_procedures', 'num_medications', 'number_outpatient_log1p', 
                 'number_emergency_log1p', 'number_inpatient_log1p', 'number_diagnoses', 'metformin', 
                 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'glipizide', 
                 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 
                 'tolazamide', 'insulin', 'glyburide-metformin',
                 'AfricanAmerican', 'Asian', 'Caucasian', 
                 'Hispanic', 'Other', 'gender_1', 
                 'admission_type_id_3', 'admission_type_id_5', 
                 'discharge_disposition_id_2', 'discharge_disposition_id_7', 
                 'discharge_disposition_id_10', 'discharge_disposition_id_18', 
                 'admission_source_id_4', 'admission_source_id_7', 
                 'admission_source_id_9', 'max_glu_serum_0', 
                 'max_glu_serum_1', 'A1Cresult_0', 'A1Cresult_1', 
                 ]

In [None]:
print(f'Original dataset shape {Counter(y)}')
smt = SMOTE(random_state=20)
train_input_new, train_output_new = smt.fit_resample(X, y)
print(f'New dataset shape {Counter(train_output_new)}')
train_input_new = pd.DataFrame(train_input_new, columns = list(X.columns))
X_train, X_test, y_train, y_test = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=28, criterion = "entropy", min_samples_split=10)
dtree.fit(X_train, y_train)

In [None]:
dtree_pred = dtree.predict(X_test)


In [None]:
cmtx = pd.DataFrame(
    confusion_matrix(y_test, dtree_pred), 
    index=['actual:0', 'actual:1'], 
    columns=['pred:0', 'pred:1']
)
print(cmtx)

In [None]:
print(f"Accuracy is {accuracy_score(y_test, dtree_pred):.2f}")
print(f"Precision is {precision_score(y_test, dtree_pred):.2f}")
print(f"Recall is {recall_score(y_test, dtree_pred):.2f}")
print(f"f1Score is {f1_score(y_test, dtree_pred):.2f}")


In [None]:
# features with most impact
feature_names = X_train.columns
feature_imports = dtree.feature_importances_
most_imp_features = pd.DataFrame([f for f in zip(feature_names,feature_imports)], columns=["Feature", "Importance"]).nlargest(10, "Importance")
most_imp_features.sort_values(by="Importance", inplace=True)
print(most_imp_features)
plt.figure(figsize=(10,6))
plt.barh(range(len(most_imp_features)), most_imp_features.Importance, align='center', alpha=0.8)
plt.yticks(range(len(most_imp_features)), most_imp_features.Feature, fontsize=14)
plt.xlabel('Importance')
plt.title('Most important features - Decision Tree')
plt.show()

## Random Forest

In [None]:
print(f'Original dataset shape {Counter(y)}')
smt = SMOTE(random_state=20)
train_input_new, train_output_new = smt.fit_resample(X, y)
print(f'New dataset shape {Counter(train_output_new)}')
train_input_new = pd.DataFrame(train_input_new, columns = list(X.columns))
X_train, X_test, y_train, y_test = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rm = RandomForestClassifier(n_estimators = 10, max_depth=25, criterion = "gini", min_samples_split=10)
rm.fit(X_train, y_train)

In [None]:
rm_prd = rm.predict(X_test)

In [None]:
cmtx = pd.DataFrame(
    confusion_matrix(y_test, rm_prd), 
    index=['actual:0', 'actual:1'], 
    columns=['pred:0', 'pred:1']
)
print(cmtx)

In [None]:
print(f"Accuracy is {accuracy_score(y_test, rm_prd):.2f}")
print(f"Precision is {precision_score(y_test, rm_prd):.2f}")
print(f"Recall is {recall_score(y_test, rm_prd):.2f}")
print(f"f1Score is {f1_score(y_test, rm_prd):.2f}")


In [None]:
# features with most impact
feature_names = X_train.columns
feature_imports = rm.feature_importances_
most_imp_features = pd.DataFrame({"Feature": feature_names, "Importance": feature_imports}).nlargest(10, "Importance")
# Sort ascending so the most important feature ends up at the top of the horizontal bar chart
most_imp_features = most_imp_features.sort_values(by="Importance")
plt.figure(figsize=(10,6))
plt.barh(range(len(most_imp_features)), most_imp_features.Importance, align='center', alpha=0.8)
plt.yticks(range(len(most_imp_features)), most_imp_features.Feature, fontsize=14)
plt.xlabel('Importance')
plt.title('Most important features - Random Forest')
plt.show()

In [None]:
import sklearn.metrics as metrics
# calculate the fpr and tpr for all thresholds of the classification
probs = rm.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

# plotting the curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = f'AUC = {roc_auc:.2f}')
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

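The closer the curve comes to the top-left corner of the plot (false positive rate 0, true positive rate 1), the better the model, since it maximizes correct predictions while minimizing incorrect ones. The dashed red diagonal corresponds to random guessing.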
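
To make the curve and its area concrete, here is a small hand-rolled sketch on four made-up predictions. It sweeps the threshold over every distinct score, records the (false positive rate, true positive rate) pairs, and integrates with the trapezoid rule. `roc_points` and `trapezoid_auc` are hypothetical helpers written for this illustration, not the scikit-learn implementation.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over each distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Made-up labels and scores: one negative is ranked above one positive
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
auc = trapezoid_auc(fpr, tpr)
print(list(zip(fpr, tpr)))
print(f"AUC = {auc:.2f}")

# A perfect ranking hugs the top-left corner and gives the maximum area
perfect_auc = trapezoid_auc(*roc_points(y_true, [0.1, 0.2, 0.8, 0.9]))
print(f"AUC (perfect ranking) = {perfect_auc:.2f}")
```

The first set of scores passes through (0, 0.5) and (0.5, 1) rather than the corner (0, 1), so its area is smaller than that of the perfect ranking.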

# XGBoost

Before proceeding, we need to convert the object-type columns to numeric values, since XGBoost does not accept columns of `object` dtype.

In [None]:
# Select the object-typed columns of X itself (not of df, whose columns may differ from X's)
object_columns = X.select_dtypes(include="object").columns
object_columns

In [None]:
for col in object_columns:
    print(col)
    X[col] = pd.to_numeric(X[col])

In [None]:
import xgboost as xgb
smt = SMOTE(random_state=20)
train_input_new, train_output_new = smt.fit_resample(X, y)
print(f'New dataset shape {Counter(train_output_new)}')
train_input_new = pd.DataFrame(train_input_new, columns = list(X.columns))
X_train, X_test, y_train, y_test = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)


In [None]:
xgb_model = xgb.XGBClassifier(max_depth=10, n_estimators=100)
xgb_model.fit(X_train, y_train)

In [None]:
xgb_pred = xgb_model.predict(X_test)
xgb_pred

In [None]:
cmtx = pd.DataFrame(
    confusion_matrix(y_test, xgb_pred), 
    index=['actual:0', 'actual:1'], 
    columns=['pred:0', 'pred:1']
)
print(cmtx)

In [None]:
print(f"Accuracy is {accuracy_score(y_test, xgb_pred):.2f}")
print(f"Precision is {precision_score(y_test, xgb_pred):.2f}")
print(f"Recall is {recall_score(y_test, xgb_pred):.2f}")
print(f"F1 score is {f1_score(y_test, xgb_pred):.2f}")

In [None]:
# features with most impact
feature_names = X_train.columns
feature_imports = xgb_model.feature_importances_
most_imp_features = pd.DataFrame({"Feature": feature_names, "Importance": feature_imports}).nlargest(10, "Importance")
# Sort ascending so the most important feature ends up at the top of the horizontal bar chart
most_imp_features = most_imp_features.sort_values(by="Importance")
plt.figure(figsize=(10,6))
plt.barh(range(len(most_imp_features)), most_imp_features.Importance, align='center', alpha=0.8)
plt.yticks(range(len(most_imp_features)), most_imp_features.Feature, fontsize=14)
plt.xlabel('Importance')
plt.title('Most important features - XGB Classifier')
plt.show()

# Artificial Neural Network

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Model
model = Sequential()
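# Take the input width from the training data instead of hard-coding the feature count (71)
model.add(Dense(4, input_dim=X_train.shape[1], activation='relu', name = 'input_layer'))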
model.add(Dense(10, activation='relu', name = 'hidden_layer'))
model.add(Dense(1, activation='sigmoid', name = 'output_layer'))

# Compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit
model.fit(X_train, y_train, epochs=100, batch_size=25)

# Evaluation on the training data (this is training accuracy, not a held-out estimate)
scores = model.evaluate(X_train, y_train)
print(f'\nTraining {model.metrics_names[1]}: {(scores[1]*100):.2f}%')


In [None]:
nn_pred = model.predict(X_test)

In [None]:
# Turn the sigmoid outputs (one probability per row) into hard 0/1 labels with a 0.5 cut-off
nn_pred = [int(z[0] > 0.5) for z in nn_pred]

In [None]:
cmtx = pd.DataFrame(
    confusion_matrix(y_test, nn_pred), 
    index=['actual:0', 'actual:1'], 
    columns=['pred:0', 'pred:1']
)
print(cmtx)
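
The confusion matrix of the network depends on the cut-off used to turn its sigmoid outputs into labels, and 0.5 is a convention rather than a requirement. The sketch below, on made-up probabilities, shows how moving the cut-off trades precision against recall; `precision_recall_at` is a hypothetical helper written for this illustration.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when probabilities above `threshold` count as class 1."""
    preds = [int(p > threshold) for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.05, 0.20, 0.45, 0.40, 0.60, 0.55, 0.70, 0.95]

results = {t: precision_recall_at(y_true, probs, t) for t in (0.3, 0.5, 0.7)}
for t, (precision, recall) in results.items():
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

For a readmission model, a lower cut-off flags more patients (higher recall, lower precision), so the appropriate value depends on the relative cost of a missed readmission versus an unnecessary follow-up.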

In [None]:
print(f"Accuracy is {accuracy_score(y_test, nn_pred):.2f}")
print(f"Precision is {precision_score(y_test, nn_pred):.2f}")
print(f"Recall is {recall_score(y_test, nn_pred):.2f}")
print(f"F1 score is {f1_score(y_test, nn_pred):.2f}")
