# Capstone Two: Pre-processing & Training Data Development - Hospital readmission for patients with diabetes

### Our aim is to predict whether a patient is likely to be readmitted within 30 days of discharge. In the Data Wrangling step we combined the '>30' and 'NO' readmission categories into a single class (0) and mapped '<30' to class 1, making this a binary classification problem.

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
from matplotlib import pyplot as plt
import random
import numpy as np
%matplotlib inline
from scipy.stats import norm
from scipy.stats import t
from IPython.core.interactiveshell import InteractiveShell

**Load the `df_clean.csv` file created during the Data Wrangling step into a dataframe and view the first 5 rows**

In [2]:
#load the dataframe
df = pd.read_csv('data/df_clean.csv')
df.head()

| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | ... | tolazamide | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change_of_meds | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [0-10) | Elective | Not Mapped | Physician Referral | 1 | Pediatrics-Endocrinology | 41 | 0 | ... | No | No | No | No | No | No | No | No | No | 0 |
| 1 | Caucasian | Female | [10-20) | Emergency | Discharged to home | Emergency Room | 3 | Surgery-General | 59 | 0 | ... | No | Up | No | No | No | No | No | Ch | Yes | 0 |
| 2 | AfricanAmerican | Female | [20-30) | Emergency | Discharged to home | Emergency Room | 2 | Orthopedics | 11 | 5 | ... | No | No | No | No | No | No | No | No | Yes | 0 |
| 3 | Caucasian | Male | [30-40) | Emergency | Discharged to home | Emergency Room | 2 | InternalMedicine | 44 | 1 | ... | No | Up | No | No | No | No | No | Ch | Yes | 0 |
| 4 | Caucasian | Male | [40-50) | Emergency | Discharged to home | Emergency Room | 1 | Surgery-General | 51 | 0 | ... | No | Steady | No | No | No | No | No | Ch | Yes | 0 |


## Feature Engineering
We will create features for our predictive model. For each section, we will add new variables to the dataframe and then keep track of which columns of the dataframe we want to use as part of the predictive model features.

### Numerical Features
Numerical features are the easiest to use: they can go into the model without any encoding (they are only scaled later on).

In [3]:
# The columns that are numerical that we will use are shown below
cols_num = ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'num_outpatient_visit', 'num_emerg_visit', 'num_inpatient_visit', 'number_diagnoses']

### Categorical Features
To turn non-numerical data into model inputs, the simplest approach is one-hot encoding.

In [4]:
# The columns that are categorical that we will use are shown below
cols_cat = ['race', 'gender', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'medical_specialty',
       'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change_of_meds', 'diabetesMed']
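
Before encoding, it can help to count the distinct values in each categorical column, because every distinct value becomes (roughly) one indicator column. The following is a minimal sketch on a made-up toy dataframe, not the real data; `toy`, `cardinality` and `n_dummy_cols` are illustrative names. Columns with many distinct values (diagnosis codes are a likely candidate) will dominate the width of the encoded table.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the values are made up.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "diag_1": ["250.83", "276", "648", "8"],
    "insulin": ["No", "Up", "No", "Steady"],
})

# Number of distinct values per column, largest first.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# With drop_first=True, each column contributes (n_unique - 1) indicator columns.
n_dummy_cols = int((cardinality - 1).sum())
print("indicator columns after encoding:", n_dummy_cols)
```

Running the same two lines on `df[cols_cat]` would show which columns are worth bucketing before they are encoded.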

**Let’s investigate medical specialty before we begin with one-hot encoding.**

In [5]:
# list the unique values of 'medical_specialty'
print('# Medical Specialties Unique Values: ', df.medical_specialty.nunique())
df.groupby('medical_specialty').size().sort_values(ascending=False).head(25)

# Medical Specialties Unique Values:  72


medical_specialty
InternalMedicine                     28089
Emergency/Trauma                     14550
Family/GeneralPractice               14419
Cardiology                           10384
Surgery-General                       5998
Nephrology                            3083
Orthopedics                           2699
Orthopedics-Reconstructive            2422
Radiologist                           2238
Pulmonology                           1645
Psychiatry                            1641
Urology                               1364
ObstetricsandGynecology               1312
Surgery-Cardiovascular/Thoracic       1266
Gastroenterology                      1112
Surgery-Vascular                      1055
Surgery-Neuro                          925
PhysicalMedicineandRehabilitation      750
Oncology                               652
Pediatrics                             486
Hematology/Oncology                    380
Neurology                              378
Pediatrics-Endocrinology            

We don’t want to add 72 new variables during one-hot encoding, since some specialties have only a few samples. To reduce the number of categories, we create a new variable that buckets the values into 17 options: the top 16 specialties plus an 'Other' category.

In [6]:
# the 16 most frequent 'medical_specialty' categories
top_16 = ['InternalMedicine', 'Emergency/Trauma', 'Family/GeneralPractice', 'Cardiology',
          'Surgery-General', 'Nephrology', 'Orthopedics', 'Orthopedics-Reconstructive',
          'Radiologist', 'Pulmonology', 'Psychiatry', 'Urology', 'ObstetricsandGynecology',
          'Surgery-Cardiovascular/Thoracic', 'Gastroenterology', 'Surgery-Vascular']

# copy the original column so that it stays unchanged
df['med_spec'] = df['medical_specialty'].copy()

# replace all specialties not in the top 16 with the 'Other' category
df.loc[~df.med_spec.isin(top_16), 'med_spec'] = 'Other'

# display the new 'medical_specialty' category buckets
df.groupby('med_spec').size()

med_spec
Cardiology                         10384
Emergency/Trauma                   14550
Family/GeneralPractice             14419
Gastroenterology                    1112
InternalMedicine                   28089
Nephrology                          3083
ObstetricsandGynecology             1312
Orthopedics                         2699
Orthopedics-Reconstructive          2422
Other                               6765
Psychiatry                          1641
Pulmonology                         1645
Radiologist                         2238
Surgery-Cardiovascular/Thoracic     1266
Surgery-General                     5998
Surgery-Vascular                    1055
Urology                             1364
dtype: int64
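
The hard-coded list of specialties can also be derived from the counts themselves, which avoids typos in the category names. This is a sketch on toy data under stated assumptions: `top_n = 2` is chosen only to keep the example small (the notebook uses 16), and `toy` and `top_specialties` are illustrative names.

```python
import pandas as pd

# Toy stand-in for df['medical_specialty']; the counts are made up.
toy = pd.DataFrame({
    "medical_specialty": ["InternalMedicine"] * 5 + ["Cardiology"] * 3
                         + ["Urology"] * 2 + ["Neurology", "Oncology"],
})

top_n = 2  # hypothetical value for the demo; the notebook keeps the top 16

# Labels of the top_n most frequent categories.
top_specialties = toy["medical_specialty"].value_counts().nlargest(top_n).index

# Keep a value if it is in the top list, otherwise replace it with 'Other'.
toy["med_spec"] = toy["medical_specialty"].where(
    toy["medical_specialty"].isin(top_specialties), "Other"
)
print(toy["med_spec"].value_counts())
```

`Series.where` keeps the entries for which the condition is true and substitutes the second argument elsewhere, so the original column is left untouched.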

**To convert the categorical features to numbers, we one-hot encode them by creating dummy (indicator) features.**

In [7]:
# create dummy features of the categorical columns list and the medical specialty list and save into a new variable, df_cat
# Using the drop_first option, which will drop the first categorical value for each column.
df_cat = pd.get_dummies(df[cols_cat + ['med_spec']], drop_first = True)
df_cat.head()

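| | race_Asian | race_Caucasian | race_Hispanic | race_Other | gender_Male | admission_type_id_Emergency | admission_type_id_Newborn | admission_type_id_Not Available | admission_type_id_Not Mapped | admission_type_id_Trauma Center | ... | med_spec_Orthopedics | med_spec_Orthopedics-Reconstructive | med_spec_Other | med_spec_Psychiatry | med_spec_Pulmonology | med_spec_Radiologist | med_spec_Surgery-Cardiovascular/Thoracic | med_spec_Surgery-General | med_spec_Surgery-Vascular | med_spec_Urology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |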
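
The effect of `drop_first` is easier to see on a single toy column. This is a small illustration with made-up values; `toy`, `full` and `reduced` are illustrative names, and `dtype=int` is passed only so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# One categorical column with three distinct values (made-up data).
toy = pd.DataFrame({"race": ["Asian", "Caucasian", "Hispanic", "Caucasian"]})

# One indicator column per category.
full = pd.get_dummies(toy, columns=["race"], dtype=int)

# drop_first=True removes the first category (in sorted order);
# a row of all zeros then stands for the dropped category.
reduced = pd.get_dummies(toy, columns=["race"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per column removes the redundancy among the indicators, since the dropped level is implied when all the others are 0.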


**To add the one-hot encoding columns to the original dataframe we use the concat function.**

In [8]:
# concatenate the original data frame, df with the dummy variables, df_cat
df = pd.concat([df, df_cat], axis=1)
df.head()

| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | ... | med_spec_Orthopedics | med_spec_Orthopedics-Reconstructive | med_spec_Other | med_spec_Psychiatry | med_spec_Pulmonology | med_spec_Radiologist | med_spec_Surgery-Cardiovascular/Thoracic | med_spec_Surgery-General | med_spec_Surgery-Vascular | med_spec_Urology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [0-10) | Elective | Not Mapped | Physician Referral | 1 | Pediatrics-Endocrinology | 41 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Caucasian | Female | [10-20) | Emergency | Discharged to home | Emergency Room | 3 | Surgery-General | 59 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | AfricanAmerican | Female | [20-30) | Emergency | Discharged to home | Emergency Room | 2 | Orthopedics | 11 | 5 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Caucasian | Male | [30-40) | Emergency | Discharged to home | Emergency Room | 2 | InternalMedicine | 44 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Caucasian | Male | [40-50) | Emergency | Discharged to home | Emergency Room | 1 | Surgery-General | 51 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [9]:
# saving the column names of the categorical data to keep track of them.
cat_cols_names = list(df_cat.columns)

**The last column we want to make features for is `age`. `age` is categorical in this dataset.**

In [10]:
df.groupby('age').size()

age
[0-10)        160
[10-20)       690
[20-30)      1650
[30-40)      3763
[40-50)      9615
[50-60)     17086
[60-70)     22173
[70-80)     25539
[80-90)     16700
[90-100)     2666
dtype: int64

Since there is a natural order to these values, it might make more sense to convert these to numerical data that is ordered.

In [11]:
# map each age bracket to its lower bound (0, 10, ..., 90) so that the values are ordered numbers
age_id = {'[0-10)':0,
        '[10-20)':10, 
        '[20-30)':20,
        '[30-40)':30,
        '[40-50)':40,
        '[50-60)':50,
        '[60-70)':60,
        '[70-80)':70,
        '[80-90)':80,
        '[90-100)':90}

# create a new column, 'age_group', holding the numeric value for each age bracket
df['age_group'] = df['age'].map(age_id)
df.head()

| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | ... | med_spec_Orthopedics-Reconstructive | med_spec_Other | med_spec_Psychiatry | med_spec_Pulmonology | med_spec_Radiologist | med_spec_Surgery-Cardiovascular/Thoracic | med_spec_Surgery-General | med_spec_Surgery-Vascular | med_spec_Urology | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [0-10) | Elective | Not Mapped | Physician Referral | 1 | Pediatrics-Endocrinology | 41 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Caucasian | Female | [10-20) | Emergency | Discharged to home | Emergency Room | 3 | Surgery-General | 59 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 10 |
| 2 | AfricanAmerican | Female | [20-30) | Emergency | Discharged to home | Emergency Room | 2 | Orthopedics | 11 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 |
| 3 | Caucasian | Male | [30-40) | Emergency | Discharged to home | Emergency Room | 2 | InternalMedicine | 44 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 |
| 4 | Caucasian | Male | [40-50) | Emergency | Discharged to home | Emergency Room | 1 | Surgery-General | 51 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 40 |
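
As an alternative to writing the dictionary by hand, the lower bound of each bracket could be parsed out of the label itself. This is only a sketch, assuming every label has the form `[low-high)`; `ages` and `lower` are illustrative names.

```python
import pandas as pd

# A few bracket labels in the same format as the 'age' column.
ages = pd.Series(["[0-10)", "[40-50)", "[90-100)"])

# Capture the digits between '[' and '-' and convert them to integers.
lower = ages.str.extract(r"\[(\d+)-", expand=False).astype(int)
print(lower.tolist())
```

A label that does not match the pattern would yield a missing value, and the `astype(int)` call would then raise an error, which makes unexpected labels easy to spot.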


In [12]:
#Let’s keep track of these extra columns too.
extra_cols_age = ['age_group']

**Let’s make a new dataframe that only has the features**

In [13]:
# create a new variable only containing the features
cols2use = cols_num + cat_cols_names + extra_cols_age
df_final = df[cols2use]
df_final.head()

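| | time_in_hospital | num_lab_procedures | num_procedures | num_medications | num_outpatient_visit | num_emerg_visit | num_inpatient_visit | number_diagnoses | race_Asian | race_Caucasian | ... | med_spec_Orthopedics-Reconstructive | med_spec_Other | med_spec_Psychiatry | med_spec_Pulmonology | med_spec_Radiologist | med_spec_Surgery-Cardiovascular/Thoracic | med_spec_Surgery-General | med_spec_Surgery-Vascular | med_spec_Urology | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 59 | 0 | 18 | 0 | 0 | 0 | 9 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 10 |
| 2 | 2 | 11 | 5 | 13 | 2 | 0 | 1 | 6 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 |
| 3 | 2 | 44 | 1 | 16 | 0 | 0 | 0 | 7 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 |
| 4 | 1 | 51 | 0 | 8 | 0 | 0 | 0 | 5 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 40 |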


## Scale the data
Because the features are measured in different units and their values vary by orders of magnitude, we standardize them so that every feature has zero mean and unit variance.

In [14]:
features = list(df_final.columns)

In [15]:
from sklearn.preprocessing import StandardScaler
X = df_final
y = df['readmitted'].to_numpy()

# Make a scaler object
scaler = StandardScaler()
# Fit the scaler to the data and transform it
X_scaled = scaler.fit_transform(X)

## Train/Test Split

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, shuffle=True, random_state=42)

In [17]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((70029, 2445), (30013, 2445), (70029,), (30013,))
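
The cells above fit the scaler on all rows before splitting, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; it is not the notebook's pipeline, and `X_demo`, `y_demo`, `demo_scaler` and the 10% positive rate are assumptions made for the example. It also passes `stratify`, which keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, 10% positive class.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = np.array([1] * 20 + [0] * 180)

# Split first; stratify keeps the class ratio equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both splits.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # training means are ~0 by construction
print(y_tr.mean(), y_te.mean())           # class ratio in each split
```

The test-set means will generally not be exactly zero, which is expected: the test rows are scaled with statistics they did not help compute.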

In [18]:
# scale the full feature set with the fitted scaler and prepend the 'readmitted' target column
data = pd.concat([df[['readmitted']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(df_final[features]), columns=features)],
                    axis=1)
data.head()

| | readmitted | time_in_hospital | num_lab_procedures | num_procedures | num_medications | num_outpatient_visit | num_emerg_visit | num_inpatient_visit | number_diagnoses | race_Asian | ... | med_spec_Orthopedics-Reconstructive | med_spec_Other | med_spec_Psychiatry | med_spec_Pulmonology | med_spec_Radiologist | med_spec_Surgery-Cardiovascular/Thoracic | med_spec_Surgery-General | med_spec_Surgery-Vascular | med_spec_Urology | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -1.139526 | -0.099004 | -0.782552 | -1.851737 | -0.292355 | -0.212081 | -0.5015 | -3.306971 | -0.080681 | ... | -0.157513 | 3.713243 | -0.129138 | -0.129298 | -0.15127 | -0.113212 | -0.252544 | -0.103237 | -0.11757 | -3.814251 |
| 1 | 0 | -0.46704 | 0.81828 | -0.782552 | 0.249529 | -0.292355 | -0.212081 | -0.5015 | 0.82063 | -0.080681 | ... | -0.157513 | -0.269306 | -0.129138 | -0.129298 | -0.15127 | -0.113212 | 3.9597 | -0.103237 | -0.11757 | -3.187236 |
| 2 | 0 | -0.803283 | -1.62781 | 2.158794 | -0.36849 | 1.289418 | -0.212081 | 0.291029 | -0.72722 | -0.080681 | ... | -0.157513 | -0.269306 | -0.129138 | -0.129298 | -0.15127 | -0.113212 | -0.252544 | -0.103237 | -0.11757 | -2.56022 |
| 3 | 0 | -0.803283 | 0.053877 | -0.194283 | 0.002322 | -0.292355 | -0.212081 | -0.5015 | -0.21127 | -0.080681 | ... | -0.157513 | -0.269306 | -0.129138 | -0.129298 | -0.15127 | -0.113212 | -0.252544 | -0.103237 | -0.11757 | -1.933205 |
| 4 | 0 | -1.139526 | 0.410598 | -0.782552 | -0.98651 | -0.292355 | -0.212081 | -0.5015 | -1.243171 | -0.080681 | ... | -0.157513 | -0.269306 | -0.129138 | -0.129298 | -0.15127 | -0.113212 | 3.9597 | -0.103237 | -0.11757 | -1.306189 |


In [None]:
# save the scaled dataset (target plus features) into a new csv file
#data.to_csv('data/train_data.csv',index=False)