# Building a Binary Classification Machine Learning Model To Predict Hospital Readmission in Patients with Diabetes

In this tutorial, we'll be looking at hospital admission data in patients with diabetes. This dataset was collected from 130 hospitals in the United States from 1999 to 2008. More details can be found on the UCI Machine Learning Repository [website](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).

This notebook is incomplete. You'll need to fill in the blanks for the cells to run successfully. To avoid setting up an environment locally, you can run this notebook in the cloud using Google Colab; see the [Colab notebook](https://colab.research.google.com/drive/1LBth_Yk2jAyegg-elx9P7ljrYhojhe0z).

## Step 1: Importing Dependencies

Before getting started, we'll need to import several packages. These include:

- [pandas](https://pandas.pydata.org/pandas-docs/stable/) - a package for performing data analysis and manipulation
- [matplotlib](https://matplotlib.org/) - the standard Python plotting package
- [seaborn](https://seaborn.pydata.org/) - a dataframe-centric visualization package built on top of **matplotlib**

In [None]:
import os
import sys
sys.path.insert(0, os.path.abspath('..'))

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Step 2: Load the Data

We will be loading in the data as a pandas DataFrame.

The data is stored in a CSV file. We'll import this data using a pandas function called `read_csv`.

In [None]:
data = pd.read_csv("../data/patient_data.csv")
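
As a rough illustration of what `read_csv` returns, the sketch below parses a small, made-up CSV held in memory. The column names and values are hypothetical and are not taken from the patient dataset.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, kept in memory so no file on disk is needed.
csv_text = """visit_id,age_group,days_in_hospital
1,[50-60),3
2,[70-80),
3,[60-70),5
"""

toy_csv = pd.read_csv(io.StringIO(csv_text))

print(type(toy_csv).__name__)  # the result is a DataFrame
print(toy_csv.dtypes)          # column types are inferred from the text
print(toy_csv.isna().sum())    # empty fields are read as missing values
```

Note that the empty field in the second row is read as a missing value, which is why missing data needs attention later on.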

To get a glimpse of our data, we can use either `head()`, which shows the first 5 rows, or `sample()`, which randomly samples rows.

In [None]:
data.___() # look at first 5 rows

In [None]:
data.sample(n= __) # set n to equal the number of rows you want to sample

### How many rows and columns are in our dataset?

In [None]:
data.____ # get shape of dataset

In [None]:
print(f"There are {} columns (features) and {} rows (hospital admissions).")

### Does each row represent a unique patient?

In [None]:
n_patients = data['   '].nunique()
n_admissions = data['   '].nunique()


print(f"There are {n_patients} patients in this dataset.")
print(f"There are {n_admissions} hospital admissions in this dataset.")
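
To see why comparing unique counts answers this question, here is a minimal sketch on made-up data. The column names `visit` and `patient` are hypothetical and are not the names used in the dataset.

```python
import pandas as pd

# Hypothetical data: four visits made by three people.
toy_visits = pd.DataFrame({
    "visit": [101, 102, 103, 104],
    "patient": ["a", "b", "a", "c"],
})

n_toy_visits = toy_visits["visit"].nunique()
n_toy_people = toy_visits["patient"].nunique()
print(n_toy_visits, n_toy_people)

# Fewer unique patients than rows means some patients appear more than once.
repeat_counts = toy_visits["patient"].value_counts()
repeat_patients = repeat_counts[repeat_counts > 1]
print(repeat_patients)
```

If the two counts differ, rows from the same patient are not independent observations, which matters when splitting data for modelling.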

## Step 3: Data Cleaning 

There are 3 columns in our dataset which represent IDs that link to descriptors in separate files:

1. `admission_type_id`
2. `admission_source_id`
3. `discharge_disposition_id`

In [None]:
data[['admission_type_id', 'admission_source_id', 'discharge_disposition_id']].head()

We'll update these 3 columns so that they hold the descriptor name instead of just the ID number.

Our mapper files are located in `data/id_mappers/` as shown below.

In [None]:
os.listdir('data/id_mappers/')

### i) Decoding `admission_type_id`

In [None]:
admission_type = pd.read_csv("data/id_mappers/admission_type_id.csv")
admission_type

We can see that the admission type mapper file has 3 values which represent missing data:

1. NaN
2. 'Not Mapped'
3. 'Not Available'

Let's collapse these into one category that represents a missing value. 

In [None]:
missing_values = {'nan': None, 'Not Available': None, 'Not Mapped': None}
admission_type['description'] = admission_type['description'].replace(missing_values)
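
A real `NaN` is not the same thing as the string `'nan'`, so it can help to check the result with `isna()` after collapsing the categories. The sketch below shows an alternative way to do the same collapse using `Series.mask` with `isin`, on a made-up mapper. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical mapper: one true NaN plus two placeholder strings.
toy_types = pd.DataFrame({
    "description": ["Emergency", float("nan"), "Not Mapped", "Not Available"],
})

placeholders = ["Not Available", "Not Mapped"]

# mask() sets an entry to missing wherever the condition is True;
# the true NaN is already missing and is left untouched.
toy_cleaned = toy_types["description"].mask(toy_types["description"].isin(placeholders))

# isna() counts the original NaN and the collapsed placeholders together.
print(toy_cleaned.isna().sum())
```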

In [None]:
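# Key the mapper by the ID values rather than by the row index.
# This assumes the first column of the mapper file holds the ID.
admission_type_mapper = admission_type.set_index(admission_type.columns[0])['description'].to_dict()
admission_type_mapper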

Now that we have a "clean" mapper, we can apply it to our dataset. We can use [pandas.Series.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) to map `admission_type_id` values in our original dataframe to the descriptors in our `admission_type_mapper` dictionary.

In [None]:
data['admission_type'] = data['admission_type_id'].map(admission_type_mapper)
data[['admission_type']].head()
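
The behaviour of `map` with a dictionary can be checked on a small example. The IDs and descriptors below are made up for illustration. The dictionary keys must be the ID values themselves, and any ID without an entry comes back as missing.

```python
import pandas as pd

# Hypothetical IDs and lookup table.
toy_ids = pd.Series([1, 2, 2, 9], name="type_id")
toy_lookup = {1: "Emergency", 2: "Urgent"}

toy_decoded = toy_ids.map(toy_lookup)
print(toy_decoded.tolist())

# ID 9 has no dictionary entry, so it is decoded as a missing value.
print(toy_decoded.isna().sum())
```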

### ii) Decoding `admission_source_id`

In [None]:
admission_source = pd.read_csv("data/id_mappers/admission_source_id.csv")
admission_source.shape

There are significantly more IDs in the `admission_source_id.csv` file than in `admission_type_id.csv`. Let's take a look at the list of all descriptions.

In [None]:
admission_source['description'].tolist()

Here, we can see that there are 4 values that represent missing data:

- 'Not Available' 
- 'Unknown/Invalid'
- 'Not Mapped'
- 'nan'
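
One way to handle these, following the same pattern used for `admission_type_id`, is sketched below. The helper name `build_mapper`, the ID column name `source_id`, and the toy table are all hypothetical. The real mapper file may use different column names.

```python
import pandas as pd

MISSING_LABELS = ["Not Available", "Unknown/Invalid", "Not Mapped"]


def build_mapper(table, id_col, desc_col="description"):
    """Return {id: description}, with missing-value labels set to None."""
    desc = table[desc_col]
    # Treat both real NaN and the placeholder labels as missing.
    is_missing = desc.isna() | desc.isin(MISSING_LABELS)
    return {
        key: (None if missing else value)
        for key, value, missing in zip(table[id_col], desc, is_missing)
    }


# Hypothetical stand-in for the contents of the mapper file.
toy_source = pd.DataFrame({
    "source_id": [1, 7, 9, 20],
    "description": ["Physician Referral", "Emergency Room", "Not Available", "Not Mapped"],
})

toy_source_mapper = build_mapper(toy_source, id_col="source_id")
print(toy_source_mapper)

# Applying the mapper works the same way as for admission type.
toy_source_decoded = pd.Series([7, 20]).map(toy_source_mapper)
print(toy_source_decoded.isna().tolist())
```

Wrapping the steps in a function keeps the list of missing-value labels in one place, so the same helper could be reused for the discharge disposition mapper.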

### iii) Decoding `discharge_disposition_id`

In [None]:
discharge_disposition = pd.read_csv("data/id_mappers/discharge_disposition_id.csv")

discharge_disposition['description'].tolist()
# discharge_disposition['description'] = discharge_disposition['description'].replace({'Not Available': None, 'NaN': None, 'nan': None, 'Unknown/Invalid': None, 'Not Mapped': None})
# discharge_disposition_mapper = discharge_disposition.to_dict()['description']

# data['discharge_disposition'] = data['discharge_disposition_id'].map(discharge_disposition_mapper)

In [None]:
data = data.drop(columns=['admission_type_id', 'admission_source_id', 'discharge_disposition_id'])

### Missing Values Assessment

To get a better sense of the missing values in our data, let's visualize it using [missingno](https://github.com/ResidentMario/missingno)'s "nullity" matrix.

In [None]:
import missingno as msno

msno.matrix(data)

# other methods to check out:
# - msno.bar
# - msno.heatmap
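
A numeric summary can complement the plot. The sketch below computes the fraction of missing values per column with plain pandas, on a made-up frame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps.
toy_gaps = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, 80.0],
    "race": ["A", None, "B", "B"],
    "age": [50, 60, 70, 80],
})

# isna() gives a boolean frame; the column-wise mean is the fraction missing.
missing_fraction = toy_gaps.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Columns near the top of this ranking are candidates for dropping or imputing.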

### Age

In [None]:
plt.figure(figsize=(8,6))
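# Pass the column by keyword; newer seaborn versions treat the first positional argument as `data`
sns.countplot(x='age', data=data, palette='viridis')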

### Time in Hospital

In [None]:
plt.figure(figsize=(8,6))
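# Pass the column by keyword; newer seaborn versions treat the first positional argument as `data`
sns.countplot(x='time_in_hospital', data=data, palette='viridis')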

### Number of Emergency Visits, Procedures, Diagnoses, and Inpatient Visits

In [None]:
features = ['number_emergency', 'num_procedures', 'number_diagnoses', 'number_inpatient']
plt.figure(figsize=(12,8))
for i, f in enumerate(features):
    plt.subplot(2,2,i+1)
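    sns.countplot(x=f, data=data)  # pass the column by keyword
    plt.title(f)
plt.tight_layout()  # keep subplot titles and axis labels from overlapping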

### Medical Specialty

The `medical_specialty` column records the medical specialty of the attending physician.

In [None]:
data['medical_specialty'].unique()

In [None]:
data['diabetesMed'] = data['diabetesMed'].map({'Yes': 1, 'No':0})
data['diabetesMed'].value_counts()

In [None]:
data['A1Cresult']

In [None]:
data['A1Cresult'].value_counts()

In [None]:
msno.matrix(data)

### Readmitted

In [None]:
data['readmitted'].value_counts()

In [None]:
data['readmitted_bool'] = np.where(data['readmitted']=='NO', 0, 1)
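
The effect of this `np.where` call can be checked on a small example. The labels `'<30'` and `'>30'` below are assumed for illustration. Only `'NO'` appears in the condition above, so every other label is mapped to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical labels; 'NO' is the only value the condition singles out.
toy_readmitted = pd.Series(["NO", "<30", ">30", "NO"])

toy_flag = np.where(toy_readmitted == "NO", 0, 1)
print(toy_flag.tolist())

# Cross-tabulate to confirm which labels ended up in which class.
print(pd.crosstab(toy_readmitted, toy_flag))
```

Because the condition only tests for `'NO'`, missing or unexpected labels would also be assigned 1, so it is worth checking `value_counts()` on the original column first.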

In [None]:
data.columns

In [None]:
# Any gender value not listed in the dictionary is mapped to NaN
data['gender'] = data['gender'].map({'Female': 0, 'Male': 1})
data['gender'].value_counts()