# Capstone 2: Analysis and Prediction of Length of Stay (LOS) in Hospital

## The Data Science Method

1. Problem Identification

2. Data Wrangling

3. Exploratory Data Analysis

#### 4. Pre-processing and Training Data Development

5. Modeling
    * Fit Models with Training Data Set
    * Review Model Outcomes — Iterate over additional models as needed.
    * Identify the Final Model


6. Documentation
    * Review the Results
    * Present and share your findings - storytelling
    * Finalize Code
    * Finalize Documentation


In the **Pre-processing and Training Data Development** step of the guided capstone, the following activities are carried out:

* Create dummy or indicator features for categorical variables
* Standardize the magnitude of numeric features using a scaler
* Split the data into testing and training datasets

In [54]:
#Import necessary packages and load dataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn

# Department holds category names (e.g. radiotherapy), so no date parsing is needed
df = pd.read_csv('LOS_train_Data.csv')
sns.set(style='ticks')

In [55]:
df.describe()

| | case_id | Hospital_code | City_Code_Hospital | Available Extra Rooms in Hospital | Bed Grade | patientid | City_Code_Patient | Visitors with Patient | Admission_Deposit |
|---|---|---|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 2998.0 | 3000.0 | 2978.0 | 3000.0 | 3000.0 |
| mean | 1500.5 | 19.021333 | 4.701667 | 3.155667 | 2.501668 | 68701.302333 | 6.896239 | 3.362 | 4903.961 |
| std | 866.169729 | 8.684278 | 3.277333 | 1.163853 | 0.815575 | 37660.818352 | 4.247663 | 1.89688 | 1045.71649 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 208.0 | 1.0 | 1.0 | 2039.0 |
| 25% | 750.75 | 11.0 | 2.0 | 2.0 | 2.0 | 36990.0 | 4.0 | 2.0 | 4202.0 |
| 50% | 1500.5 | 21.0 | 4.0 | 3.0 | 2.0 | 71329.5 | 8.0 | 3.0 | 4775.0 |
| 75% | 2250.25 | 26.0 | 7.0 | 4.0 | 3.0 | 101597.0 | 8.0 | 4.0 | 5430.0 |
| max | 3000.0 | 32.0 | 13.0 | 10.0 | 4.0 | 131488.0 | 28.0 | 24.0 | 9423.0 |


In [39]:
#check datatypes of all features
df.dtypes

case_id                              float64
Hospital_code                        float64
Hospital_type_code                    object
City_Code_Hospital                   float64
Hospital_region_code                  object
Available Extra Rooms in Hospital    float64
Department                            object
Ward_Type                             object
Ward_Facility_Code                    object
Bed Grade                            float64
patientid                            float64
City_Code_Patient                    float64
Type of Admission                     object
Severity of Illness                   object
Visitors with Patient                float64
Age                                   object
Admission_Deposit                    float64
Stay                                  object
dtype: object

I struggled with the column names that contain spaces, so I replaced the spaces with underscores below.

In [40]:
# remove spaces in columns name
df.columns = df.columns.str.replace(' ','_')

In [41]:
# Check the renamed columns and their datatypes
df.dtypes

case_id                              float64
Hospital_code                        float64
Hospital_type_code                    object
City_Code_Hospital                   float64
Hospital_region_code                  object
Available_Extra_Rooms_in_Hospital    float64
Department                            object
Ward_Type                             object
Ward_Facility_Code                    object
Bed_Grade                            float64
patientid                            float64
City_Code_Patient                    float64
Type_of_Admission                     object
Severity_of_Illness                   object
Visitors_with_Patient                float64
Age                                   object
Admission_Deposit                    float64
Stay                                  object
dtype: object

In [47]:
# Number of distinct values in each column of the training dataset
for col in df.columns:
    print(col, ':', df[col].nunique())

case_id : 3000
Hospital_code : 32
Hospital_type_code : 7
City_Code_Hospital : 11
Hospital_region_code : 3
Available_Extra_Rooms_in_Hospital : 10
Department : 5
Ward_Type : 5
Ward_Facility_Code : 6
Bed_Grade : 4
patientid : 573
City_Code_Patient : 24
Type_of_Admission : 3
Severity_of_Illness : 3
Visitors_with_Patient : 16
Age : 10
Admission_Deposit : 2007
Stay : 11


In [56]:
df.Stay.unique()

array(['0-10', '41-50', '31-40', '20-Nov', '51-60', '21-30', '71-80',
       'More than 100 Days', '81-90', '61-70', '91-100', nan],
      dtype=object)

In [57]:
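# '20-Nov' is presumably the '11-20' bin that a spreadsheet converted to a date
df['Stay'] = df['Stay'].replace('20-Nov', '11-20')
# Give the open-ended bin the same 'low-high' shape as the other bins
df['Stay'] = df['Stay'].replace('More than 100 Days', '100-200')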

In [58]:
df.Stay.unique()

array(['0-10', '41-50', '31-40', '11-20', '51-60', '21-30', '71-80',
       '100-200', '81-90', '61-70', '91-100', nan], dtype=object)
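
After the cleanup, the bins are still plain strings, so sorting them alphabetically would put '100-200' before '11-20'. A minimal sketch of one way to give them an explicit order follows. It assumes every bin has the form 'low-high', and the names `stay`, `bins`, `stay_cat` and `stay_codes` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the Stay column (values taken from the output above).
stay = pd.Series(['0-10', '41-50', '20-Nov', 'More than 100 Days', None, '21-30'])

# Same two replacements as in the notebook cell.
stay = stay.replace({'20-Nov': '11-20', 'More than 100 Days': '100-200'})

# Order the bins by their lower bound so that '100-200' comes last.
bins = sorted(stay.dropna().unique(), key=lambda s: int(s.split('-')[0]))
stay_cat = pd.Categorical(stay, categories=bins, ordered=True)

# Integer codes follow the bin order; missing values get the code -1.
stay_codes = pd.Series(stay_cat.codes, index=stay.index)
print(bins)
print(stay_codes.tolist())
```

An ordered categorical keeps the original labels for display while still allowing comparisons and sorting by length of stay.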

In [59]:
df.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1.0,8.0,c,3.0,Z,3.0,radiotherapy,R,F,2.0,31397.0,7.0,Emergency,Extreme,2.0,51-60,4911.0,0-10
1,2.0,2.0,c,5.0,Z,2.0,radiotherapy,S,F,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,5954.0,41-50
2,3.0,10.0,e,1.0,X,2.0,anesthesia,S,E,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,4745.0,31-40
3,4.0,26.0,b,2.0,Y,2.0,radiotherapy,R,D,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,7272.0,41-50
4,5.0,26.0,b,2.0,Y,2.0,radiotherapy,S,D,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,5558.0,41-50


In [60]:
# Create dummy (indicator) columns from the Stay categories
dummies = pd.get_dummies(df.Stay)
merged = pd.concat([df, dummies], axis=1)
final = merged.drop(['Stay'], axis=1)
df1 = final
df1.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,...,100-200,11-20,21-30,31-40,41-50,51-60,61-70,71-80,81-90,91-100
0,1.0,8.0,c,3.0,Z,3.0,radiotherapy,R,F,2.0,...,0,0,0,0,0,0,0,0,0,0
1,2.0,2.0,c,5.0,Z,2.0,radiotherapy,S,F,2.0,...,0,0,0,0,1,0,0,0,0,0
2,3.0,10.0,e,1.0,X,2.0,anesthesia,S,E,2.0,...,0,0,0,1,0,0,0,0,0,0
3,4.0,26.0,b,2.0,Y,2.0,radiotherapy,R,D,2.0,...,0,0,0,0,1,0,0,0,0,0
4,5.0,26.0,b,2.0,Y,2.0,radiotherapy,S,D,2.0,...,0,0,0,0,1,0,0,0,0,0
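
The cell above expands `Stay`, which later serves as the target. If the goal is instead to encode the categorical predictors, a sketch along the following lines may help. The toy rows reuse values from the `head()` output, and `toy`, `X_toy`, `y_toy`, `cat_cols` and `X_encoded` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Toy rows using column names and values from the head() output above.
toy = pd.DataFrame({
    'Hospital_type_code': ['c', 'c', 'e', 'b'],
    'Department': ['radiotherapy', 'radiotherapy', 'anesthesia', 'radiotherapy'],
    'Admission_Deposit': [4911.0, 5954.0, 4745.0, 7272.0],
    'Stay': ['0-10', '41-50', '31-40', '41-50'],
})

# Keep the target aside so it is not one-hot encoded with the predictors.
y_toy = toy['Stay']
X_toy = toy.drop(columns=['Stay'])

# Encode only the non-numeric columns; numeric columns pass through unchanged.
cat_cols = [c for c in X_toy.columns
            if not pd.api.types.is_numeric_dtype(X_toy[c])]
X_encoded = pd.get_dummies(X_toy, columns=cat_cols)

print(X_encoded.columns.tolist())
```

Each categorical column becomes one indicator column per category, named `<column>_<category>`, while `Admission_Deposit` is left as it is.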


In [70]:
import sklearn
print(sklearn.__version__)

0.24.1


In [80]:
df = df.dropna()
df

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1.0,8.0,c,3.0,Z,3.0,radiotherapy,R,F,2.0,31397.0,7.0,Emergency,Extreme,2.0,51-60,4911.0,0-10
1,2.0,2.0,c,5.0,Z,2.0,radiotherapy,S,F,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,5954.0,41-50
2,3.0,10.0,e,1.0,X,2.0,anesthesia,S,E,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,4745.0,31-40
3,4.0,26.0,b,2.0,Y,2.0,radiotherapy,R,D,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,7272.0,41-50
4,5.0,26.0,b,2.0,Y,2.0,radiotherapy,S,D,2.0,31397.0,7.0,Trauma,Extreme,2.0,51-60,5558.0,41-50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,2996.0,17.0,e,1.0,X,2.0,gynecology,R,E,4.0,10553.0,4.0,Trauma,Moderate,2.0,71-80,5200.0,11-20
2996,2997.0,26.0,b,2.0,Y,4.0,gynecology,Q,D,3.0,10553.0,4.0,Trauma,Moderate,4.0,71-80,3982.0,11-20
2997,2998.0,3.0,c,3.0,Z,2.0,gynecology,R,A,2.0,10553.0,4.0,Trauma,Moderate,3.0,71-80,5245.0,21-30
2998,2999.0,28.0,b,11.0,X,4.0,gynecology,R,F,3.0,10553.0,4.0,Trauma,Moderate,2.0,71-80,5199.0,51-60
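
`dropna()` removes every row that has at least one missing value. The `describe()` output earlier shows fewer than 3000 non-null values for `Bed Grade` and `City_Code_Patient`, so it can be worth checking how many rows are lost. The following is a small sketch on made-up data; the frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few missing values in different columns.
toy = pd.DataFrame({
    'Bed Grade': [2.0, np.nan, 3.0, 4.0],
    'City_Code_Patient': [7.0, 8.0, np.nan, 4.0],
    'Stay': ['0-10', '11-20', '21-30', None],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Compare row counts before and after dropna() to see what is lost.
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(f'rows dropped: {rows_before - rows_after}')
```

If the loss turns out to be large on the real data, imputing the affected columns could be considered instead of dropping rows.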


In [85]:
# First we import the preprocessing package from the sklearn library
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Separate the predictors (X) from the target (y)
X = df.drop(['Stay'], axis=1)
y = df.Stay
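
`StandardScaler` is imported in this cell, and one way it could be applied is sketched here on toy numeric data. The sketch assumes the scaler is fitted on the training rows only and then reused for the test rows, and that only numeric columns are passed to it (the object columns in `X` would need encoding first). The names `toy`, `X_tr`, `X_te` and `scaler` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric features; the values are random, not taken from the dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Admission_Deposit': rng.normal(4900, 1000, size=20),
    'Visitors with Patient': rng.integers(1, 10, size=20).astype(float),
})

# Split first so that the test rows do not influence the scaling statistics.
X_tr, X_te = train_test_split(toy, train_size=0.65, random_state=101)

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Training columns now have mean close to 0 and standard deviation close to 1.
print(X_tr_scaled.mean(axis=0))
print(X_tr_scaled.std(axis=0))
```

Fitting the scaler before splitting would let information from the test rows leak into the training features, which is why the split comes first here.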



In [86]:
import sklearn.model_selection as model_selection

# Keep 65% of the rows for training and 35% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.65, test_size=0.35, random_state=101)
print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

X_train:        case_id  Hospital_code Hospital_type_code  City_Code_Hospital  \
270     271.0           28.0                  b                11.0   
2038   2039.0            6.0                  a                 6.0   
2179   2180.0            2.0                  c                 5.0   
1731   1732.0           12.0                  a                 9.0   
1495   1496.0           26.0                  b                 2.0   
...       ...            ...                ...                 ...   
608     609.0           14.0                  a                 1.0   
1611   1612.0           19.0                  a                 7.0   
1373   1374.0           15.0                  c                 5.0   
1559   1560.0           19.0                  a                 7.0   
872     873.0           28.0                  b                11.0   

     Hospital_region_code  Available Extra Rooms in Hospital    Department  \
270                     X                                3.