<a href="https://colab.research.google.com/github/stevenazeez/Health-Hospital-Analytics/blob/main/Health%20Hospital%20Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Welcome!*

# **Health Hospital Analytics**

This project predicts each patient's length of stay, bucketed into ranges such as 0-10 or 11-20 days, using XGBoost and Naive Bayes classifiers. The goal is to help hospitals improve efficiency and reduce costs.
>Let's Begin!

**Importing Libraries**

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

**Getting Data**

In [4]:
train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_data.csv')

## **Exploratory Data Analysis**

> Let's take a quick look at the data




In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318438 entries, 0 to 318437
Data columns (total 18 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   case_id                            318438 non-null  int64  
 1   Hospital_code                      318438 non-null  int64  
 2   Hospital_type_code                 318438 non-null  object 
 3   City_Code_Hospital                 318438 non-null  int64  
 4   Hospital_region_code               318438 non-null  object 
 5   Available Extra Rooms in Hospital  318438 non-null  int64  
 6   Department                         318438 non-null  object 
 7   Ward_Type                          318438 non-null  object 
 8   Ward_Facility_Code                 318438 non-null  object 
 9   Bed Grade                          318325 non-null  float64
 10  patientid                          318438 non-null  int64  
 11  City_Code_Patient                  3139

In [6]:
# Number of distinct observations in test dataset
for i in test.columns:
    print(i, ':', test[i].nunique())

case_id : 137057
Hospital_code : 32
Hospital_type_code : 7
City_Code_Hospital : 11
Hospital_region_code : 3
Available Extra Rooms in Hospital : 15
Department : 5
Ward_Type : 6
Ward_Facility_Code : 6
Bed Grade : 4
patientid : 39607
City_Code_Patient : 37
Type of Admission : 3
Severity of Illness : 3
Visitors with Patient : 27
Age : 10
Admission_Deposit : 6609
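The loop above can also be written as a single call, since `DataFrame.nunique()` returns the per-column distinct counts as a Series. A minimal sketch on a made-up frame (the `toy` data is illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for `test`
toy = pd.DataFrame({
    "Department": ["gynecology", "gynecology", "radiotherapy"],
    "Bed Grade": [2.0, None, 4.0],
})

# One call instead of a loop; missing values are not counted by default
distinct_counts = toy.nunique()
print(distinct_counts)
```

Passing `dropna=False` would count the missing value as its own category, which can be useful when checking columns such as `Bed Grade` that contain nulls.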


### **Data Cleansing**
>Counting the null values in each column

In [7]:
train.isnull().sum().sort_values(ascending = False)

City_Code_Patient                    4532
Bed Grade                             113
Stay                                    0
Ward_Type                               0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Facility_Code                      0
Admission_Deposit                       0
patientid                               0
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
case_id                                 0
dtype: int64

In [8]:
# Replace null values in the Bed Grade column of the train dataset with the most frequent value
train['Bed Grade'] = train['Bed Grade'].fillna(train['Bed Grade'].mode()[0])

# Replace null values in the City_Code_Patient column of the train dataset with the most frequent value
train['City_Code_Patient'] = train['City_Code_Patient'].fillna(train['City_Code_Patient'].mode()[0])

In [9]:
test.isnull().sum().sort_values(ascending = False)

City_Code_Patient                    2157
Bed Grade                              35
Admission_Deposit                       0
Department                              0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Ward_Facility_Code                      0
Ward_Type                               0
Age                                     0
patientid                               0
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
case_id                                 0
dtype: int64

In [10]:
# Replace null values in the Bed Grade column of the test dataset with the most frequent value
test['Bed Grade'] = test['Bed Grade'].fillna(test['Bed Grade'].mode()[0])

# Replace null values in the City_Code_Patient column of the test dataset with the most frequent value
test['City_Code_Patient'] = test['City_Code_Patient'].fillna(test['City_Code_Patient'].mode()[0])
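The four imputation statements follow one pattern, so they could be folded into a small helper. The sketch below is one possible shape, not the notebook's own code: `fill_with_mode` and its `reference` parameter are hypothetical names, and the toy frames are made up. Unlike the cells above, which fill the test set with its own mode, the second call fills it with the mode learned from the training data, a common variation that keeps test-set statistics out of preprocessing.

```python
import pandas as pd

def fill_with_mode(df, columns, reference=None):
    """Return a copy of df with NaNs in `columns` replaced by the mode of `reference`.

    `reference` defaults to df itself (hypothetical helper, for illustration only).
    """
    reference = df if reference is None else reference
    filled = df.copy()
    for col in columns:
        most_common = reference[col].mode()[0]   # first mode if there are ties
        filled[col] = filled[col].fillna(most_common)
    return filled

# Made-up data standing in for the real train/test frames
toy_train = pd.DataFrame({"Bed Grade": [2.0, 2.0, 3.0, None]})
toy_test = pd.DataFrame({"Bed Grade": [4.0, None]})

toy_train_filled = fill_with_mode(toy_train, ["Bed Grade"])
toy_test_filled = fill_with_mode(toy_test, ["Bed Grade"], reference=toy_train)
print(toy_test_filled)
```

Because the helper works on a copy and assigns the result back to the column, it avoids `inplace=True` on a column selection.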

In [11]:
train.shape

(318438, 18)

In [12]:
test.shape

(137057, 17)

In [13]:
train.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50


In [14]:
test.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0


In [15]:
# Label Encoding Stay column in train dataset to numeric
from sklearn import preprocessing
lencoder = preprocessing.LabelEncoder()

train['Stay'] = lencoder.fit_transform(train['Stay'].astype('str'))
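`LabelEncoder` assigns integer codes in the sorted order of the labels and stores them in `classes_`, so the code-to-label mapping does not have to be typed out by hand when predictions are converted back. A minimal sketch with a few made-up `Stay` values follows. The name `encoder` is an assumption; in this notebook the `lencoder` variable is re-created later for other columns, so the encoder fitted on `Stay` would need to be kept under its own name for this to work.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of Stay labels
stay_labels = ["0-10", "41-50", "31-40", "11-20", "More than 100 Days", "41-50"]

encoder = LabelEncoder()
codes = encoder.fit_transform(stay_labels)

# classes_ holds the original labels in sorted order; position == integer code
code_to_label = dict(enumerate(encoder.classes_))

# inverse_transform turns integer codes back into the original strings
decoded = encoder.inverse_transform(codes)
print(code_to_label)
```

For these labels the lexicographic sort happens to match the natural order of the buckets, because each label starts with a different digit and `More than 100 Days` sorts after all of them.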

In [16]:
train.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,4
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,3
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,4
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,4


In [17]:
# Add a placeholder Stay value (-1) to the test dataset so its rows can be identified after concatenation
test['Stay'] = -1

In [18]:
# Concatenate train and test so the categorical columns are encoded consistently
data_concat = pd.concat([train, test])

In [19]:
data_concat.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,4
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,3
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,4
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,4


In [20]:
# Label-encode the categorical columns across the combined train and test data
for i in ['Hospital_type_code', 'Hospital_region_code', 'Department',
          'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', 'Age']:
    lencoder = preprocessing.LabelEncoder()
    data_concat[i] = lencoder.fit_transform(data_concat[i].astype(str))

In [21]:
# Separating train and test datasets using the placeholder Stay value
train = data_concat[data_concat['Stay'] != -1]
test = data_concat[data_concat['Stay'] == -1]

## **Feature Engineering**

In [22]:
def get_countid(train, test, cols, name):
    """Add a column `name` holding the number of cases that share the same values in `cols`."""
    # Count cases per group, separately for train and test
    temp_train = train.groupby(cols)['case_id'].count().reset_index().rename(columns={'case_id': name})
    temp_test = test.groupby(cols)['case_id'].count().reset_index().rename(columns={'case_id': name})
    # Attach the counts back onto each row
    train = pd.merge(train, temp_train, how='left', on=cols)
    test = pd.merge(test, temp_test, how='left', on=cols)
    train[name] = train[name].astype('float')
    test[name] = test[name].astype('float')
    # Fall back to the median count for any rows without a match
    train[name] = train[name].fillna(np.median(temp_train[name]))
    test[name] = test[name].fillna(np.median(temp_test[name]))
    return train, test

In [23]:
train, test = get_countid(train, test, ['patientid'], name='CountPatientID')
train, test = get_countid(train, test,
                          ['patientid', 'Hospital_region_code'], name='countPatientIDHospitalCode')
train, test = get_countid(train, test,
                          ['patientid', 'Ward_Facility_Code'], name='CountIDPatientWardFacilityCode')
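Within a single frame, the group-count-then-merge pattern used by `get_countid` gives the same per-row values as `groupby(...).transform('count')`. The sketch below shows this on a made-up frame. The column names mirror the notebook, but the data is hypothetical and this is an alternative formulation rather than the notebook's method.

```python
import pandas as pd

# Made-up frame: patient 10 has three cases, patients 20 and 30 have one each
toy = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "patientid": [10, 10, 10, 20, 30],
    "Ward_Facility_Code": [5, 5, 4, 3, 3],
})

# Number of cases per patient, broadcast back onto every row
toy["CountPatientID"] = (
    toy.groupby("patientid")["case_id"].transform("count").astype(float)
)

# Number of cases per (patient, ward facility) pair
toy["CountIDPatientWardFacilityCode"] = (
    toy.groupby(["patientid", "Ward_Facility_Code"])["case_id"]
    .transform("count")
    .astype(float)
)
print(toy)
```

`transform` keeps the original row order and index, so no merge or median fallback is needed when the grouping keys contain no missing values.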

In [24]:
train

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay,CountPatientID,countPatientIDHospitalCode,CountIDPatientWardFacilityCode
0,1,8,2,3,2,3,3,2,5,2.0,31397,7.0,0,0,2,5,4911.0,0,14.0,4.0,5.0
1,2,2,2,5,2,2,3,3,5,2.0,31397,7.0,1,0,2,5,5954.0,4,14.0,4.0,5.0
2,3,10,4,1,0,2,1,3,4,2.0,31397,7.0,1,0,2,5,4745.0,3,14.0,4.0,2.0
3,4,26,1,2,1,2,3,2,3,2.0,31397,7.0,1,0,2,5,7272.0,4,14.0,6.0,3.0
4,5,26,1,2,1,2,3,3,3,2.0,31397,7.0,1,0,2,5,5558.0,4,14.0,6.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318433,318434,6,0,6,0,3,3,1,5,4.0,86499,23.0,0,2,3,4,4144.0,1,1.0,1.0,1.0
318434,318435,24,0,1,0,2,1,1,4,4.0,325,8.0,2,2,4,8,6699.0,3,1.0,1.0,1.0
318435,318436,7,0,4,0,3,2,2,5,4.0,125235,10.0,0,1,3,7,4235.0,1,1.0,1.0,1.0
318436,318437,11,1,2,1,3,1,1,3,3.0,91081,8.0,1,1,5,1,3761.0,1,1.0,1.0,1.0


In [25]:
# Dropping the identifier columns and the columns already summarised by the count features
test_updated = test.drop(['Stay', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code'], axis =1)
train_updated = train.drop(['case_id', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code'], axis =1)

**Splitting train data for Naive Bayes and XGBoost**

In [26]:
X1 = train_updated.drop('Stay', axis =1)
y1 = train_updated['Stay']

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size =0.20, random_state =100)
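If the `Stay` classes are imbalanced, a plain random split can leave the rarest buckets under-represented in the validation set. `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. A minimal sketch on synthetic data (the arrays and class sizes are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 150 + [1] * 40 + [2] * 10)   # deliberately imbalanced

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=100, stratify=y_toy
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```

Applied to the notebook's split, this would amount to adding `stratify=y1` to the existing call.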

## **XGBoost Model**

In [28]:
import xgboost
from sklearn.metrics import accuracy_score

In [29]:
XBG = xgboost.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=800,
                                  objective='multi:softmax', reg_alpha=0.5, reg_lambda=1.5,
                                  booster='gbtree', n_jobs=4, min_child_weight=2, base_score= 0.75)

In [30]:
XBG_Classifier = XBG.fit(X_train, y_train)

In [31]:
XGB_Predict = XBG_Classifier.predict(X_test)

In [32]:
Accuracy1 = accuracy_score(y_test, XGB_Predict)
print("XGBoost Model Accuracy Score:", Accuracy1*100)

XGBoost Model Accuracy Score: 43.047355859816605
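Plain accuracy treats every miss the same, but the `Stay` buckets are ordered, so a prediction that lands in a neighbouring bucket is arguably a smaller error than one several buckets away. The sketch below computes exact accuracy, a "within one bucket" accuracy, and a confusion matrix on made-up labels. The within-one metric is an illustrative addition and is not computed in the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up encoded Stay buckets (true vs. predicted)
y_true_toy = np.array([0, 1, 2, 2, 3, 4, 4, 1])
y_pred_toy = np.array([0, 2, 2, 1, 3, 2, 4, 1])

# Share of exact matches
exact = accuracy_score(y_true_toy, y_pred_toy)

# Share of predictions at most one bucket away from the truth
within_one = np.mean(np.abs(y_true_toy - y_pred_toy) <= 1)

# Rows are true buckets, columns are predicted buckets
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(exact, within_one)
print(cm)
```

The same three lines could be applied to `y_test` and `XGB_Predict`, assuming the integer codes follow the natural bucket order.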


## **Naive Bayes Model**

In [33]:
from sklearn.naive_bayes import GaussianNB

In [34]:
target = y_train.values
features = X_train.values

In [35]:
classifier_nb = GaussianNB()
model_nb = classifier_nb.fit(features, target)

In [36]:
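# fit() returns the estimator itself, so NaiveBayes, model_nb and classifier_nb all refer to the same fitted model
NaiveBayes = classifier_nb.fit(features, target)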

In [37]:
NaiveBayes_Predict = model_nb.predict(X_test.values)

In [38]:
Accuracy2 = accuracy_score(y_test, NaiveBayes_Predict)
print("Accuracy Score for Naive Bayes Model:", Accuracy2*100)

Accuracy Score for Naive Bayes Model: 34.55439015199096
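The two accuracy scores (about 43% for XGBoost and about 35% for Naive Bayes) are easier to judge against a trivial baseline that always predicts the most frequent class. A minimal sketch using scikit-learn's `DummyClassifier` on made-up labels follows. The baseline value for the real data is not known from this notebook and would have to be computed on `X_train`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up data: features are ignored by the dummy model
X_toy = np.zeros((10, 1))
y_toy = np.array([2] * 5 + [1] * 3 + [0] * 2)

# Always predicts the most frequent class seen during fit (class 2 here)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```

A model is only adding value to the extent that it beats this number on the held-out split.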


## **Predictions & Results**

>**XGBoost**

In [51]:
XBG_prediction = XBG_Classifier.predict(test_updated.iloc[:,1:])

In [52]:
XBG_output = pd.DataFrame(XBG_prediction, columns=['Stay'])
XBG_output['case_id'] = test_updated['case_id']
XBG_output = XBG_output[['case_id', 'Stay']]

In [43]:
XBG_output['Stay'] = XBG_output['Stay'].replace({0:'0-10', 1: '11-20', 2: '21-30', 3:'31-40', 4: '41-50', 5: '51-60', 6: '61-70', 7: '71-80', 8: '81-90', 9: '91-100', 10: 'More than 100 Days'})
XBG_output.head()

Unnamed: 0,case_id,Stay
0,318439,0-10
1,318440,51-60
2,318441,21-30
3,318442,21-30
4,318443,51-60


In [60]:
print(XBG_output.groupby('Stay')['case_id'].nunique())

Stay
0-10                   4373
11-20                 39337
21-30                 58261
31-40                 12100
41-50                    61
51-60                 19217
61-70                    16
71-80                   302
81-90                  1099
91-100                   78
More than 100 Days     2213
Name: case_id, dtype: int64


> **Naive Bayes**

In [54]:
NaiveBayes_Prediction = classifier_nb.predict(test_updated.iloc[:,1:])

In [55]:
NaiveBayes_output = pd.DataFrame(NaiveBayes_Prediction, columns=['Stay'])
NaiveBayes_output['case_id'] = test_updated['case_id']
NaiveBayes_output = NaiveBayes_output[['case_id', 'Stay']]

In [58]:
NaiveBayes_output['Stay'] = NaiveBayes_output['Stay'].replace({0:'0-10', 1: '11-20', 2: '21-30', 3:'31-40', 4: '41-50', 5: '51-60', 6: '61-70', 7: '71-80', 8: '81-90', 9: '91-100', 10: 'More than 100 Days'})
NaiveBayes_output.head()

Unnamed: 0,case_id,Stay
0,318439,21-30
1,318440,51-60
2,318441,21-30
3,318442,21-30
4,318443,31-40


In [59]:
print(NaiveBayes_output.groupby('Stay')['case_id'].nunique())

Stay
0-10                   2598
11-20                 26827
21-30                 72206
31-40                 15639
41-50                   469
51-60                 13651
61-70                    92
71-80                   955
81-90                   296
91-100                    2
More than 100 Days     4322
Name: case_id, dtype: int64
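The two prediction tables can be compared case by case by merging them on `case_id`. The sketch below does this on two made-up frames and reports how often the models agree. The frame names and values are hypothetical; with the real outputs, `XBG_output` and `NaiveBayes_output` would take their place.

```python
import pandas as pd

# Made-up prediction tables in the same shape as the notebook's outputs
xgb_toy = pd.DataFrame({"case_id": [1, 2, 3, 4],
                        "Stay": ["0-10", "51-60", "21-30", "21-30"]})
nb_toy = pd.DataFrame({"case_id": [1, 2, 3, 4],
                       "Stay": ["21-30", "51-60", "21-30", "31-40"]})

# Align predictions per case; suffixes keep the two Stay columns apart
merged = xgb_toy.merge(nb_toy, on="case_id", suffixes=("_xgb", "_nb"))

# Fraction of cases where both models predict the same bucket
agreement = (merged["Stay_xgb"] == merged["Stay_nb"]).mean()
print(agreement)
```

A low agreement rate would suggest the two models rely on different signals, which is worth knowing before choosing one or combining them.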


## **Conclusion**

> Predicting a patient's length of stay at the time of admission allows hospitals to allocate resources and manage patients more effectively. Identifying the factors associated with length of stay could also aid resource management and the design of new treatment regimens. Using patient-level and hospital-level data, this project built per-patient count features and compared two classifiers on a held-out 20% split: XGBoost reached about 43.0% accuracy and Gaussian Naive Bayes about 34.6%.