# Capstone Project: Predicting the need for hospital admittance using machine learning
##### The aim of this report is to predict whether a patient needs to be admitted to the hospital. The dataset covers over 560,000 unidentified patients and consists of 972 unique features.

#### Author: Yash Nagpaul (Data Science Diploma Candidate, BrainStation)

## Table of Contents:
### 1. <a href="part-1">Part 1 - Data Cleaning</a>

In [41]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.read_csv('./data/admit_data.csv')

In [3]:
# work with a 10% sample, drawn within each disposition group so the class balance is preserved
df = df.groupby('disposition').sample(frac=0.1, random_state=1011)

In [4]:
# drop duplicate index column
df = df.drop(columns=['Unnamed: 0'])

In [14]:
# reset index values since we're working with a random 10% sample
df.reset_index(drop=True, inplace=True)

In [6]:
# Check for duplicate rows
df.duplicated().sum()

0

In [7]:
# The check above reports 0 duplicate rows in this sample, so this call is a no-op here.
# If a handful of duplicates do turn up, we can drop them with confidence because:
# they would be an extremely small proportion of the data, and
# with over 900 features, an exact duplicate row is very unlikely to be intentional
df.drop_duplicates(inplace=True)

In [23]:
# TODO: deal with nulls

# peek at the null count per column, highest first
df.isna().sum().sort_values(ascending=False)

phencyclidine(pcp)screen,urine,noconf._median    56049
phencyclidine(pcp)screen,urine,noconf._max       56049
phencyclidine(pcp)screen,urine,noconf._min       56049
benzodiazepinesscreen,urine,noconf._last         56049
phencyclidine(pcp)screen,urine,noconf._last      56049
                                                 ...  
othcnsinfx                                           0
othematldx                                           0
othercvd                                             0
othereardx                                           0
dep_name                                             0
Length: 972, dtype: int64

#### NOTES:
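- 56,049 nulls equals the number of rows in the sample, so these columns are entirely null here
    - Drop those columns? (check first whether they are also empty in the full dataset)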
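
The question in the note above can be checked directly: a column is entirely null when every one of its values is missing. The following is a minimal sketch on a toy frame, not the notebook's actual cleaning step; `toy`, its column names, and `all_null_cols` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sampled dataframe (hypothetical columns).
toy = pd.DataFrame({
    "age": [34, 51, 29],
    "pcp_screen_last": [np.nan, np.nan, np.nan],  # entirely null
    "arrivalmode": ["Car", None, "Ambulance"],    # partially null
})

# True for columns where every value is missing.
null_mask = toy.isna().all()
all_null_cols = null_mask[null_mask].index.tolist()
print(all_null_cols)

# Drop only the columns that carry no information at all.
toy_clean = toy.dropna(axis="columns", how="all")
print(toy_clean.columns.tolist())
```

Because the notebook works on a 10% sample, a column that is empty here could still hold values in the full dataset, so it may be worth running the same check on the full file before dropping anything.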

In [39]:
# How many non-numeric columns do we have?

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df.select_dtypes(exclude=numerics).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56049 entries, 0 to 56048
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   dep_name          56049 non-null  object
 1   gender            56049 non-null  object
 2   ethnicity         56049 non-null  object
 3   race              56046 non-null  object
 4   lang              56049 non-null  object
 5   religion          56049 non-null  object
 6   maritalstatus     56049 non-null  object
 7   employstatus      56049 non-null  object
 8   insurance_status  56049 non-null  object
 9   disposition       56049 non-null  object
 10  arrivalmode       53870 non-null  object
 11  arrivalmonth      56049 non-null  object
 12  arrivalday        56049 non-null  object
 13  arrivalhour_bin   56049 non-null  object
 14  previousdispo     56049 non-null  object
dtypes: object(15)
memory usage: 6.4+ MB


#### NOTES:
- There are 15 non-numeric columns; each needs to be either encoded as numeric or dropped
- Two of them also contain missing values: `race` (3 nulls) and `arrivalmode` (2,179 nulls)
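
One way to decide between encoding and dropping is to look at how many distinct values each non-numeric column holds, since a column with few categories is cheap to turn into dummy variables. This is a sketch under assumptions: the toy data and the names `toy`, `non_numeric`, and `cardinality` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the dataframe (hypothetical values).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "religion": ["Catholic", "Other", "Jewish", "Catholic"],
    "age": [34, 51, 29, 62],
})

# Keep only the non-numeric columns, then count distinct values in each.
non_numeric = toy.select_dtypes(exclude="number")
cardinality = non_numeric.nunique().sort_values(ascending=False)
print(cardinality)
```

Running the same two lines on `df` would list the 15 columns above by number of categories.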

### `disposition` column
- Convert the column values to binary
- `Admit` = 1
- `Discharge` = 0

In [43]:
df.disposition = df.disposition.apply(lambda x: 1 if x == 'Admit' else 0)
df.disposition

0        1
1        1
2        1
3        1
4        1
        ..
56044    0
56045    0
56046    0
56047    0
56048    0
Name: disposition, Length: 56049, dtype: int64

### `religion` column
- multiple categories
- make dummy variables

In [40]:
# df['religion'].value_counts()

# # instantiate a OneHotEncoder object
# ohe = OneHotEncoder()

# # make a separate dataframe consisting of just 1 column (religion)
# religion_df = pd.DataFrame(df.religion)

# # fit the one hot encoder and transform the column
# dummified_religion_df = ohe.fit_transform(religion_df)



Discharge    39385
Admit        16664
Name: disposition, dtype: int64

#### NOTES:
- Repeating this process by hand for every categorical column would be tedious
- We can write a function that runs once for each column
- And applies the same encoding steps to all of them (TODO: check script 21)
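
The helper described in the notes could look roughly like the sketch below. It uses `pd.get_dummies` rather than the `OneHotEncoder` from the commented-out cell, purely because it returns a labelled DataFrame in one call; the function name `dummify_columns` and the toy data are hypothetical.

```python
import pandas as pd


def dummify_columns(frame, columns):
    """Return a copy of `frame` with each listed column replaced by 0/1 dummy columns."""
    return pd.get_dummies(frame, columns=columns, dtype=int)


# Toy stand-in for the dataframe (hypothetical values).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "religion": ["Catholic", "Other", "Catholic"],
    "disposition": [1, 0, 1],
})

encoded = dummify_columns(toy, ["gender", "religion"])
print(encoded.columns.tolist())
```

For the modelling stage, a `OneHotEncoder` fitted on the training split only remains the alternative, since it remembers the categories it saw and can apply the same encoding to new data.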