> ## HCA Data Science Summit Python Tutorial  

- Brooke Hamilton, Brooke.Hamilton2@hcahealthcare.com
- Trevor Townsend, trevor.townsend@hcahealthcare.com

This tutorial walks through the process of training a machine learning model to predict whether a diabetic patient will be readmitted within 30 days of an inpatient stay.

## Roadmap
1.  Where are we?  Jupyter notebooks in Kaggle
2.  Mock business case
3.  Import the data
4.  Data exploration / Getting to know the data
5.  Clean the data and get it ready for building a model
6.  Train a model
7.  Evaluate the model's performance
8.  Conclusions (Back to the business case)

## 1.  Where are we?  Jupyter notebooks in Kaggle
Kaggle is an online community of data scientists and machine learners, owned by Google LLC. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. 

<font size="6">**WARNING:  No PHI in Kaggle!  Don't upload any real HCA data here!**</font>

### Intro to Jupyter Notebooks  

Jupyter notebooks
- Installation: https://jupyter.org/install.html
- Docs: https://jupyter-notebook.readthedocs.io/en/stable/

Great things about Jupyter notebooks:
- You can run small segments of code instead of the whole script at once
- Easy to visualize the output of the code
- Inline graphics 
- Markdown comments

Caveats:
- Not for production; only for development
- Hidden state
- You can get into trouble if you run the cells out of order
- I Don't Like Notebooks by Joel Grus:  https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/preview?slide=id.g362da58057_0_1

## 2.  Mock business case
- **Problem**:  Preventable readmissions after inpatient stays
- **Goal:**  Predict whether a patient will be readmitted within 30 days of discharge 
- **Patient population:**  Diabetic patients currently inpatient in one of our facilities  
- **Timeline:**  Deliver a prediction ("score") to the facility during the patient's stay, before discharge
- **Intervention**:  The facility will take actions to reduce the chance of readmission for patients flagged by the model
- **Success Metric**:  Fewer readmissions

### Import Python libraries
A Python library is a collection of functions and methods that lets you perform many actions without writing the code yourself.

In [None]:
import pandas as pd       ## data analysis
import numpy as np        ## mathematical functions
import sklearn            ## machine learning
import xgboost            ## machine learning - a particular model

## 3.  Import the data

For this tutorial, we will use a public dataset with diabetic patients from 130 US hospitals.  This dataset is hosted on Kaggle.  

`File --> Add or Upload Data --> Search for "Diabetes"-->   
Select "Diabetes 130 US hospitals for years 1999-2008"`

In [None]:
## Import the dataset and name it 'data'
data = pd.read_csv('/kaggle/input/diabetes/diabetic_data.csv')    

## 4.  Data exploration / Getting to know the data 

### View the dataframe to get an idea what we're working with

In [None]:
print(data.shape)                               ## print the shape of the dataframe (rows, columns)
print(f'number of rows: {data.shape[0]}')       ## print the number of rows in the dataframe 
print(f'number of columns: {data.shape[1]}')    ## print the number of columns in the dataframe

In [None]:
data.head(8)    ## view the first 8 rows

In [None]:
data.head(3).T     ## .T transposes the rows and columns

### Intro to lists and loops
Lists are defined by square brackets `[ ]` and can contain any number of elements, or none at all.

In [None]:
pets = ['dog','cat','bird']     ## create a list of pets
pets

Loops can iterate over a list and perform a task for each item in the list.

In [None]:
for pet in pets:                ## loop through each pet in the list
    print(f'I have a {pet}')    ## print the string 'I have a ' + pet

### Look at each column's contents 
Let's create a loop to print the distinct values found in each column. <br />
To do this, we can use the `unique()` method.

In [None]:
for column in data.columns:                           ## loop through each column in the dataframe 
    print(column)                                     ## print the name of the column
    print(data[column].unique())                      ## show unique values
    print('\n')                                       ## print an empty line
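
The loop above lists only the distinct values. If the number of rows per value is also of interest, `value_counts()` returns it. A minimal sketch on a small made-up dataframe (`toy` and its contents are illustrative, not taken from the diabetes dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Female'],
    'race': ['?', 'Caucasian', 'Caucasian', '?'],
})

for column in toy.columns:
    print(column)
    # dropna=False keeps missing values in the tally instead of hiding them
    print(toy[column].value_counts(dropna=False))
    print('\n')

gender_counts = toy['gender'].value_counts(dropna=False)
```

On a wide dataframe this prints a lot of output, so it may be more practical to run it on a handful of columns at a time.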

## 5.  Clean the data and get it ready for building a model

### Data cleaning tasks that are usually a good idea:
* Eyeball the data and make sure it has the right number of rows and columns
* Make sure missing values are coded correctly (usually as NaN in pandas)
* Make sure the outcome variable is coded correctly for your project
* Check the data type of each feature (*e.g.*, numeric, string, datetime).  Correct as necessary.
* Drop any feature that's mostly missing data

### Make sure missing values are coded correctly

In [None]:
## We noticed up above that there were some "?" values in some fields
data[['payer_code']]

Pandas `DataFrame.replace()` function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

In [None]:
## Anywhere in the dataframe that has a value of `?`, replace with NaN
data = data.replace('?', np.nan)

In [None]:
## Check that the `?` values have changed to `NaN`
data[['payer_code']]
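
Another way to confirm the replacement is to count missing values per column before and after. The sketch below uses made-up data (`toy` is hypothetical) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with '?' used as a placeholder for missing values
toy = pd.DataFrame({
    'payer_code': ['?', 'MC', '?', 'HM'],
    'weight': ['?', '?', '?', '[75-100)'],
})

missing_before = toy.isnull().sum()      # pandas does not treat '?' as missing
toy_clean = toy.replace('?', np.nan)
missing_after = toy_clean.isnull().sum() # now the '?' cells count as missing

print(missing_before)
print(missing_after)
```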

### Make sure the outcome variable is coded correctly for your project

What are the unique values of the outcome variable (`readmitted`), and how many patients had each outcome?

In [None]:
data['readmitted'].unique()   ## show the unique values in the readmitted column

In [None]:
data['readmitted'].value_counts(normalize=True)    ## show value counts as proportions of the total

So there are three values:
* No readmission (~54%)
* Readmitted after more than 30 days (~35%)
* Readmitted within 30 days (~11%)
    
How we treat this variable would depend on the business purpose of the model.  Do we just want to predict readmissions within 30 days, or any readmission?  This would be a good question to discuss with the business owner.  For this workshop, we will assume we only want to predict readmissions **within 30 days**.  More than 30 days will be combined with the "NO" group.

In [None]:
## Anywhere where `readmitted` is '<30', give a 1, otherwise give a 0
data['readmitted_30'] = np.where(data['readmitted'] == '<30', 1, 0)    ## Create new column

data[['readmitted', 'readmitted_30']].head(15)                         ## Compare to original
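
`np.where(condition, a, b)` returns `a` wherever the condition is true and `b` everywhere else. A small sketch with made-up values, next to an equivalent boolean-to-integer cast (both variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up outcome values in the same coding as the `readmitted` column
readmitted = pd.Series(['NO', '>30', '<30', 'NO', '<30'])

flag_where = np.where(readmitted == '<30', 1, 0)   # 1 where '<30', else 0
flag_cast = (readmitted == '<30').astype(int)      # same result via True/False -> 1/0

print(flag_where.tolist())
print(flag_cast.tolist())
```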

### Check the data type of each feature (*e.g.*, numeric, string, datetime).  Correct as necessary  
How has pandas interpreted the data type of each feature?  

Pandas data types:
- object
    - Text 
    - `"Dog", "Cat", "Frog"`
- int64	
    - Integer numbers
    - `-12, 5, 1064`
- float64
    - Floating point (decimal) numbers
    - `3.14159`
- datetime64
    - Date and time values
    - `2019-10-28 11:32:04`

In [None]:
data.dtypes    ## print each column name and its data type

Most of the data types look correct, except for a few features that look numeric but are actually categorical:

`admission_type_id` <br />
`discharge_disposition_id` <br />
`admission_source_id` <br />
<br />
These three columns look like numbers but they are actually categorical values. <br />
Let's change their type to 'object'

In [None]:
## create a list of columns
columns_to_correct = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']

## Look at those columns
data[columns_to_correct]

In [None]:
## Check what the current data types are
data[columns_to_correct].dtypes

In [None]:
## Convert each column in the list to type "object"
data[columns_to_correct] = data[columns_to_correct].astype('object')    

In [None]:
## Look at those columns again now that we've corrected them
data[columns_to_correct]

In [None]:
data[columns_to_correct].dtypes

### Patient age  
* The `age` column is formatted as a string.  It might be more meaningful as a number.

In [None]:
data[['age']]

In [None]:
## Look at the value for the first row to see that it is a string (text)
data.age[0]

Before we fix `age`, let's talk about how Python uses **zero-based indexing**. Each element of a sequence is assigned a number - its position or index. The first index is zero, the second index is one, and so forth.

In [None]:
pets = ['dog', 'cat', 'bird']   ## create a list of pets
pets

In [None]:
## Get the first element of `pets`
pets[0]

In [None]:
## Get the first element of `pets`, then get the first element of that
pets[0][0]

Now, let's correct the `age` column by taking a substring of each value and then converting it to a number:

In [None]:
## As an example, look at the value of age for the row with index 5 (the sixth row)
data['age'][5]

In [None]:
## Get a substring from positions 1 and 2 
## (python goes up to but doesn't include the last value in the range) 
data['age'][5][1:3]
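
Slicing can be tried on a plain string without the dataframe. The values below are typed in by hand in the same `[lo-hi)` format as the `age` column:

```python
age_band = '[50-60)'

print(age_band[0])      # '[' : position 0 is the opening bracket
print(age_band[1:3])    # '50': positions 1 and 2; position 3 is excluded
print('[0-10)'[1:3])    # '0-': a one-digit lower bound pulls the hyphen in

two_chars = age_band[1:3]
edge_case = '[0-10)'[1:3]
```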

That looks like what we want. Take the substring for every value in the column:

In [None]:
data['age'] = data['age'].str[1:3]    ## keep the characters at positions 1 and 2 of each string

In [None]:
## Does it look correct now?
data['age'].value_counts()

This worked for every value except `[0-10)`.  We can remove the hyphen using the `np.where()` function again:

In [None]:
## Find where age == '0-' and replace it with '0'; otherwise keep its existing value
data['age'] = np.where(data['age'] == '0-', '0', data['age'])

In [None]:
## Does it look right now?
data['age'].value_counts()

In [None]:
## It looks like numbers but it's still a string, so now convert to integer
data['age'] = data['age'].astype('int64')    ## convert age column to int64
data['age'].dtype                            ## print new data type

In [None]:
## Does the `age` column look like a number now, instead of text?
data['age'][0]
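
The slice, patch, and convert steps above can also be collapsed into a single regular-expression extraction. This is an alternative sketch rather than the approach used in this notebook, and the series below is made-up sample data in the same `[lo-hi)` format:

```python
import pandas as pd

# Made-up age bands, including the one-digit edge case
age_bands = pd.Series(['[0-10)', '[10-20)', '[70-80)', '[90-100)'])

# Capture the digits between '[' and '-', i.e. the lower bound of each band
age_lower = age_bands.str.extract(r'\[(\d+)-', expand=False).astype('int64')

print(age_lower.tolist())
```

Because the pattern matches one or more digits, the `[0-10)` band needs no special handling.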

Now that `age` is correctly coded as a number, we can look at a histogram:

In [None]:
data['age'].hist()    ## create a histogram with the age column 

### Drop features that have a high percentage of missing values  
If a feature has too many missing values, it probably won't be helpful to the model.

In [None]:
## Find columns with more than 30% missing values
for column in data.columns:
    if sum(data[column].isnull())/len(data[column]) > 0.3:
        print(column)
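
The same check can be written without an explicit loop, because the mean of a boolean mask is the fraction of `True` values. A sketch on made-up data, using the same 30% threshold (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'weight' is 75% missing, 'race' 25%, 'age' 0%
toy = pd.DataFrame({
    'weight': [np.nan, np.nan, np.nan, '[75-100)'],
    'race': ['Caucasian', np.nan, 'Asian', 'Caucasian'],
    'age': [5, 15, 25, 35],
})

missing_fraction = toy.isnull().mean()    # fraction of missing values per column
mostly_missing = missing_fraction[missing_fraction > 0.3].index.tolist()

print(missing_fraction)
print(mostly_missing)
```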

In [None]:
## show number of columns before   
print(f'number of columns before: {data.shape[1]}')                       

## drop columns in list
data = data.drop(['weight','payer_code','medical_specialty'], axis = 1)    

## show number of columns after
print(f'number of columns after: {data.shape[1]}')                            

### Columns with little or no variation  
Find and drop columns with no meaningful variation:  

In [None]:
no_variation_cols = []                           ## initialize empty list

for column in data.columns:                      ## loop through each column in the dataframe
    if len(data[column].unique()) == 1:          ## if only 1 unique value exists
        no_variation_cols.append(column)         ## add the column name to the list


no_variation_cols                                ## print the list of columns with no variation

Read more about list comprehensions here: https://www.pythonforbeginners.com/basics/list-comprehensions-in-python

In [None]:
## A different way to do the same thing, with a list comprehension
no_variation_cols = [i for i in list(data) if len(data[i].unique()) == 1]
no_variation_cols
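
One subtlety worth knowing: `unique()` counts `NaN` as a value of its own, while `nunique()` ignores it by default. The sketch below shows the difference on made-up data (all column names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'constant': ['No', 'No', 'No', 'No'],               # one value, nothing missing
    'constant_with_gap': ['No', np.nan, 'No', 'No'],    # one value plus a missing cell
    'varied': ['No', 'Up', 'Steady', 'No'],
})

# unique() sees NaN as a second value, so the column with a gap is kept
strict = [c for c in toy.columns if len(toy[c].unique()) == 1]

# nunique() drops NaN before counting, so the column with a gap is flagged too
ignoring_nan = [c for c in toy.columns if toy[c].nunique() == 1]

print(strict)
print(ignoring_nan)
```

Which behavior is preferable depends on whether "missing" carries information for the model.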

In [None]:
## show number of columns before   
print(f'number of columns before: {data.shape[1]}')                       

## drop columns in list
data = data.drop(no_variation_cols, axis=1) 

## show number of columns after
print(f'number of columns after: {data.shape[1]}')  

### Choose features for training  
Lastly, we identify the features (predictors) to include for modeling.  You can use any subset of features, but you will want to remove those that have little or no bearing on the outcome.

For this model we can assume the patient identifiers will not be useful for predicting the likelihood of readmission.  Since we created our own 30-day readmission indicator flag from the `readmitted` column, we want to be sure to remove that column as well; otherwise the model could read the outcome directly from it.

In [None]:
## Drop unique identifiers and the old outcome variable
print(f'number of columns before: {data.shape[1]}')                             ## show number of columns before    

data.drop(['encounter_id','patient_nbr','readmitted'], axis=1, inplace=True)    ## drop columns

print(f'number of columns after: {data.shape[1]}')                              ## show number of columns after  

### Timeliness of Features  
We should be careful not to train a model using features that wouldn't be available at the time of scoring, especially features that may unfairly "peek" at what the outcome is.  

Let's look at the features again and see if any of them wouldn't be available.

In [None]:
list(data)

`discharge_disposition_id` sticks out as a feature that probably wouldn't be available at the time of scoring, assuming we would score the patients *before* they're discharged.  We should remove this from the training set.

In [None]:
data = data.drop('discharge_disposition_id', axis=1)

### Categorical Variable Handling 

Since many models can't use string (text) values, we need to convert the categorical features into numerical ones. We can do this in two ways.

### Option 1:  Label Encoding
We can give each categorical value a number.  This allows us to keep the variable in one column.  However, encoding variables in this way may imply an order to the values that doesn't actually exist.

In [None]:
## Show an example of a text column
pets = ['dog', 'cat', 'dog', 'bird', 'turtle']
pets_df = pd.DataFrame(pets, columns=['pet'])  ## display as DataFrame
pets_df

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

pets_transformed = le.fit_transform(pets)  ## Change each text value to a number

## Show the original column side-by-side with new
pd.DataFrame(list(zip(pets, pets_transformed)), columns=['original', 'label_encoded']) 
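
`LabelEncoder` assigns codes in sorted order of the values, which is where the implied (and here meaningless) ordering comes from. A short sketch using the same pets list; the names `encoder`, `mapping`, and `recovered` are illustrative:

```python
from sklearn import preprocessing

pets = ['dog', 'cat', 'dog', 'bird', 'turtle']
encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(pets)

# classes_ holds the original values; each value's position is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original text
recovered = [str(p) for p in encoder.inverse_transform(codes)]
print(recovered)
```

Here `bird` gets the lowest code only because it sorts first alphabetically.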

### Option 2:  One-Hot Encoding  
One-Hot Encoding makes a new column for each value of the feature, with a 0 or 1 for each value.

Note:  XGBoost recommends one-hot encoding (https://xgboost.readthedocs.io/en/latest/python/python_intro.html)

The benefit of this method is there is no inherent order to the numbers. The downside is it can make your dataframe very large and unwieldy.

In [None]:
pets_df

Pandas `get_dummies()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

In [None]:
## Create a one hot encoded dataframe
pd.get_dummies(pets_df)    

For this tutorial, we will choose one-hot encoding for our categorical features.  However, there are three problematic columns:

`diag_1`, `diag_2`, & `diag_3` each have over 700 potential options. <br /> 
We will remove these 3 columns for now.

In [None]:
print(f'diag_1: {len(data.diag_1.unique())}')    ## show number of distinct values
print(f'diag_2: {len(data.diag_2.unique())}')    ## show number of distinct values
print(f'diag_3: {len(data.diag_3.unique())}')    ## show number of distinct values

In [None]:
print(f'number of columns before: {data.shape[1]}')                ## show number of columns before 

data.drop(['diag_1', 'diag_2', 'diag_3'], axis=1, inplace=True)    ## drop columns

print(f'number of columns after: {data.shape[1]}')                 ## show number of columns after

Now we will proceed with one-hot encoding the rest of the data:

In [None]:
data = pd.get_dummies(data)     ## use the get_dummies function to one hot encode

In [None]:
data.shape    ## get the shape of the dataframe in rows and columns

In [None]:
## Look at the dataframe to see if everything looks correct
data.head(10)
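
By default `get_dummies()` only expands text-like columns (object, string, or category) and passes numeric columns through unchanged. That is why the ID columns were converted to `object` earlier. A sketch on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'admission_type_id': [1, 2, 1],     # integer codes, not real quantities
    'num_medications': [12, 7, 20],     # a genuine numeric feature
})

# While the ID column is an integer, get_dummies leaves it alone
as_numbers = pd.get_dummies(toy)

# After converting it to object, each code gets its own 0/1 column
toy['admission_type_id'] = toy['admission_type_id'].astype('object')
as_categories = pd.get_dummies(toy)

print(list(as_numbers.columns))
print(list(as_categories.columns))
```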

### Train/Test/Validate Data Split  
We will split the data into three dataframes:
- Training (80%):  For building the model  
- Validation (10%):  For evaluating the trained model as we go  
- Testing (10%):  For evaluating the *final* trained model chosen from all the models we tried

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
## Split the data into two dataframes, train_df with 90% and val_df with 10%
train_df, val_df = train_test_split(data, test_size=0.10, random_state=7,
                                     stratify=data.readmitted_30)

In [None]:
## Split the data again, pulling out another 10% for the testing set
train_df, test_df = train_test_split(train_df, test_size=(10/90),random_state=7, 
                                    stratify=train_df.readmitted_30)

In [None]:
data.shape

In [None]:
## Print each dataframe's shape and its share of the full dataset as a percentage
print(f"train_df: {train_df.shape}\n{train_df.shape[0] / data.shape[0]:.0%}\n")
print(f"val_df: {val_df.shape}\n{val_df.shape[0] / data.shape[0]:.0%}\n")
print(f"test_df: {test_df.shape}\n{test_df.shape[0] / data.shape[0]:.0%}\n")
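
The second split uses `10/90` because, once 10% has been set aside for validation, another 10% of the original data is about 11.1% of what remains. The sketch below repeats the two-step split on made-up data (all `toy_*` names are hypothetical) and checks that stratifying keeps the positive rate similar in every part:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 10% positive outcomes
toy = pd.DataFrame({
    'feature': np.arange(1000),
    'readmitted_30': [1] * 100 + [0] * 900,
})

# Step 1: hold out 10% for validation
toy_rest, toy_val = train_test_split(toy, test_size=0.10, random_state=7,
                                     stratify=toy['readmitted_30'])

# Step 2: hold out 10/90 of the remainder, i.e. 10% of the original, for testing
toy_train, toy_test = train_test_split(toy_rest, test_size=10/90, random_state=7,
                                       stratify=toy_rest['readmitted_30'])

for name, part in [('train', toy_train), ('val', toy_val), ('test', toy_test)]:
    share = len(part) / len(toy)
    positive_rate = part['readmitted_30'].mean()
    print(f'{name}: {len(part)} rows ({share:.0%}), positive rate {positive_rate:.2f}')
```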

Divide each of the dataframes into predictors (X) and the outcome (Y):

In [None]:
X_train = train_df.drop('readmitted_30', axis=1)
Y_train = train_df[['readmitted_30']]

X_val = val_df.drop('readmitted_30', axis=1)
Y_val = val_df[['readmitted_30']]

X_test = test_df.drop('readmitted_30', axis=1)
Y_test = test_df[['readmitted_30']]

In [None]:
Y_train

In [None]:
X_train

## 6.  Train a model

In [None]:
# bring in the model classifier you want to use
from xgboost import XGBClassifier

Create a model object, including hyperparameters:

In [None]:
estimator = XGBClassifier(n_estimators=500,
                          objective= 'binary:logistic', 
                          nthread=50,
                          seed=27)

Look at the object.  What are the default values for hyperparameters it chose?

In [None]:
estimator

### Train the model

In [None]:
estimator.fit(X_train, 
              Y_train.values.ravel(), 
              eval_metric=['logloss','aucpr'],
              verbose=True)

In [None]:
# bring in sklearn metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [None]:
val_preds = estimator.predict(X_val)

In [None]:
accuracy_score(Y_val, val_preds)

In [None]:
## Write a function to generate a nicely formatted confusion matrix

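def get_cm(Y_val, val_preds):
    ## rows are actual outcomes, columns are predicted outcomes
    cm = pd.DataFrame(confusion_matrix(Y_val, val_preds))
    cm = cm.rename(columns={0: 'predict not readmitted', 1: 'predict readmitted'})
    cm['Actual'] = ['not readmitted', 'readmitted']
    cm = cm.set_index('Actual')
    cm.index.name = None    ## clear the index label (deleting the attribute fails in newer pandas)
    
    return cm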

In [None]:
get_cm(Y_val, val_preds)

### Better metrics than Accuracy

**Precision:** Out of all the cases we predicted would be readmitted, what percent were actually readmitted?  
`tp / (tp + fp)`

In [None]:
precision_score(Y_val, val_preds, pos_label=1)

**Recall:**  Out of all the true readmissions, what percent did we correctly flag?  
`tp / (tp + fn)`

In [None]:
recall_score(Y_val, val_preds)
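
To see why accuracy alone can mislead on an imbalanced outcome, the sketch below compares two sets of predictions on made-up labels. The labels, the two prediction lists, and the helper `summarize` are all hypothetical and unrelated to the model above:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 18 patients not readmitted, 2 readmitted
y_true = [0] * 18 + [1] * 2

always_no = [0] * 20                       # never flags anyone
flags_four = [0] * 16 + [1, 1] + [1, 1]    # flags 4 patients, 2 of them correctly

def summarize(y_true, y_pred):
    # ravel() on a 2x2 confusion matrix yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # precision is undefined when nothing is flagged, so return NaN in that case
    precision = tp / (tp + fp) if (tp + fp) > 0 else float('nan')
    recall = tp / (tp + fn) if (tp + fn) > 0 else float('nan')
    return float(accuracy), float(precision), float(recall)

print(summarize(y_true, always_no))
print(summarize(y_true, flags_four))
```

Both sets of predictions reach 90% accuracy, yet the first finds none of the readmissions (recall 0) while the second finds all of them (recall 1, precision 0.5).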

### Variable Importances

Interpreting variable importance with xgboost:  https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7

In [None]:
from xgboost import plot_importance

In [None]:
import matplotlib.pyplot as plt
xgboost.plot_importance(estimator, importance_type='gain', max_num_features=15, 
                        show_values=False, height=0.6, grid=False)
plt.show()

In [None]:
data[['readmitted_30', 'number_inpatient']].groupby('readmitted_30').mean()

### Now try weighting readmissions more heavily with `scale_pos_weight`:  
https://xgboost.readthedocs.io/en/latest/parameter.html

In [None]:
estimator_weighted = XGBClassifier(n_estimators=500,
                          objective= 'binary:logistic', 
                          nthread=50,
                          scale_pos_weight=9,
                          seed=27)
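
The XGBoost parameter documentation suggests `sum(negative instances) / sum(positive instances)` as a typical starting value for `scale_pos_weight`. A sketch of that calculation, using made-up labels with roughly the 11% positive rate seen earlier (the array `y` is hypothetical):

```python
import numpy as np

# Made-up outcome labels: 11 readmitted, 89 not readmitted
y = np.array([1] * 11 + [0] * 89)

n_positive = int((y == 1).sum())
n_negative = int((y == 0).sum())
suggested_weight = n_negative / n_positive

print(f'{n_negative} negatives / {n_positive} positives = {suggested_weight:.2f}')
```

On the real data the same ratio would be computed from the training outcome, and the result is only a starting point for tuning.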

In [None]:
estimator_weighted.fit(X_train, 
              Y_train.values.ravel(),              ## flatten to a 1-D array, as in the first model
              eval_metric=['logloss','aucpr'],
              verbose=True)

In [None]:
val_preds_weighted = estimator_weighted.predict(X_val)

In [None]:
get_cm(Y_val, val_preds_weighted)

In [None]:
accuracy_score(Y_val, val_preds_weighted)

In [None]:
precision_score(Y_val, val_preds_weighted)

In [None]:
recall_score(Y_val, val_preds_weighted)

In [None]:
xgboost.plot_importance(estimator_weighted, importance_type='gain', max_num_features=15, 
                        show_values=False, height=0.6, grid=False)
plt.show()

## 7.  Evaluate the model's performance 

Although the first model we trained looked promising at first, when we dug deeper into the metrics, we found out that it would be of little use to our business owners.  

The second model, while it still has room for improvement, was moving in the right direction.  

## 8. Conclusions (Back to the business case)

Reasonable next steps for this model would be:
* Hyperparameter tuning to improve the performance of the model
* Look for additional features
    * Add back in diag_1, diag_2, diag_3
* Try other model types (*e.g.*, Random forest)
* When we land on a model we're happy with, check its performance on the final holdout dataset (the "Test" set) to make sure we don't have overfitting
* Communicate progress to the business owner and get feedback on the business requirements