## Problem Statement
- Recent Covid-19 Pandemic has raised alarms over one of the most overlooked area to focus: Healthcare Management. While healthcare management has various use cases for using data science, patient length of stay is one critical parameter to observe and predict if one wants to improve the efficiency of the healthcare management in a hospital. 


- This parameter helps hospitals to identify patients of high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plan optimized to miminize LOS and lower the chance of staff/visitor infection. Also, prior knowledge of LOS can aid in logistics such as room and bed allocation planning.


- Suppose you have been hired as Data Scientist of HealthMan – a not for profit organization dedicated to manage the functioning of Hospitals in a professional and optimal manner.


- The task is to accurately predict the Length of Stay for each patient on case by case basis so that the Hospitals can use this information for optimal resource allocation and better functioning. The length of stay is divided into 11 different classes ranging from 0-10 days to more than 100 days.


## Data Description

### Train.zip contains 1 csv alongside the data dictionary that contains definitions for each variable

**train.csv** – File containing features related to patient, hospital and Length of stay on case basis

**train_data_dict.csv** – File containing the information of the features in train file

| Column	                            |Description                                                  |
|-------------------------------        |-------------------------------------------------------------|
|case_id	                            |Case_ID registered in Hospital                               |
|Hospital_code	                        |Unique code for the Hospital                                 |
|Hospital_type_code	                    |Unique code for the type of Hospital                         |
|City_Code_Hospital	                    |City Code of the Hospital                                    |
|Hospital_region_code	                |Region Code of the Hospital                                  |
|Available Extra Rooms in Hospital	    |Number of Extra rooms available in the Hospital              |
|Department	                            |Department overlooking the case                              |
|Ward_Type	                            |Code for the Ward type                                       |
|Ward_Facility_Code	                    |Code for the Ward Facility                                   |
|Bed Grade	                            |Condition of Bed in the Ward                                 |
|patientid	                            |Unique Patient Id                                            |
|City_Code_Patient	                    |City Code for the patient                                    |
|Type of Admission	                    |Admission Type registered by the Hospital                    |
|Severity of Illness	                |Severity of the illness recorded at the time of admission    |
|Visitors with Patient	                |Number of Visitors with the patient                          |
|Age	                                |Age of the patient                                           |
|Admission_Deposit	                    |Deposit at the Admission Time                                |
|Stay	                                |Stay Days by the patient                                     |


### Test Set

**test.csv** – File containing features related to patient, hospital. Need to predict the Length of stay for each case_id



### Sample Submission

**case_id:** Unique id for each case

**Stay:** Length of stay for the patient w.r.t each case id in test data

-------------

In [47]:
def reduce_mem_usage_colwise(col):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    col_type = col.dtypes
    if col_type in numerics:
        c_min = col.min()
        c_max = col.max()
        if str(col_type)[:3] == 'int':
            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                col = col.astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                col = col.astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                col = col.astype(np.int32)
            elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                col = col.astype(np.int64)  
        else:
            if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                col = col.astype(np.float16)
            elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                col = col.astype(np.float32)
            else:
                col = col.astype(np.float64)    
#     gc.collect()
    return col

<h2 style="color:blue" align="left"> 1. Import necessary Libraries </h2>

In [48]:
# Read data
import numpy as np                           # Linear Algebra (calculate the mean and standard deviation)
import pandas as pd                          # manipulate data, data processing, load csv file I/O (e.g. pd.read_csv)

# Visualization
import matplotlib.pyplot as plt              # Visualization using matplotlib
%matplotlib inline
import seaborn as sns                        # Visualization using seaborn

# style
plt.style.use("fivethirtyeight")             # Set Graphs Background style using matplotlib
sns.set_style("darkgrid")                    # Set Graphs Background style using seaborn

In [49]:
from sklearn.preprocessing import LabelEncoder # import the LabelEncoder from sklrean library
le = LabelEncoder()    # create the instance of LabelEncoder
import category_encoders as ce

In [50]:
# ML model building; Pre Processing & Evaluation
from sklearn.model_selection import train_test_split                     # split  data into training and testing sets
from sklearn.linear_model import LogisticRegression                      # LogisticRegression
from sklearn.tree import DecisionTreeClassifier                          # Decision tree Classifier
from sklearn.ensemble import RandomForestClassifier                      # this will make a Random Forest Classifier
import xgboost
from xgboost import XGBClassifier                                        # XGBoost Classifier
from sklearn.preprocessing import StandardScaler                         # Standard Scalar
from sklearn.metrics import confusion_matrix, classification_report      # this creates a confusion matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV     # this will do cross validation

import warnings                              # To ignore any warnings
warnings.filterwarnings("ignore")

<h2 style="color:blue" align="left"> 2. Load data </h2>

In [51]:
# Read train and test dataset
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission_lfbv3c3.csv")

In [52]:
# Import first 5 rows
display(train.head())
display(test.head())
display(submission.head())

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50


Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0


Unnamed: 0,case_id,Stay
0,318439,0-10
1,318440,0-10
2,318441,0-10
3,318442,0-10
4,318443,0-10


In [53]:
# checking dimension (num of rows and columns) of dataset
print("Training data shape (Rows, Columns):",train.shape)
print("Test data shape (Rows, Columns):",test.shape)

Training data shape (Rows, Columns): (318438, 18)
Test data shape (Rows, Columns): (137057, 17)


In [54]:
train_original=train.copy() 
test_original=test.copy()

In [55]:
display(train.info())
display(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318438 entries, 0 to 318437
Data columns (total 18 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   case_id                            318438 non-null  int64  
 1   Hospital_code                      318438 non-null  int64  
 2   Hospital_type_code                 318438 non-null  object 
 3   City_Code_Hospital                 318438 non-null  int64  
 4   Hospital_region_code               318438 non-null  object 
 5   Available Extra Rooms in Hospital  318438 non-null  int64  
 6   Department                         318438 non-null  object 
 7   Ward_Type                          318438 non-null  object 
 8   Ward_Facility_Code                 318438 non-null  object 
 9   Bed Grade                          318325 non-null  float64
 10  patientid                          318438 non-null  int64  
 11  City_Code_Patient                  3139

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137057 entries, 0 to 137056
Data columns (total 17 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   case_id                            137057 non-null  int64  
 1   Hospital_code                      137057 non-null  int64  
 2   Hospital_type_code                 137057 non-null  object 
 3   City_Code_Hospital                 137057 non-null  int64  
 4   Hospital_region_code               137057 non-null  object 
 5   Available Extra Rooms in Hospital  137057 non-null  int64  
 6   Department                         137057 non-null  object 
 7   Ward_Type                          137057 non-null  object 
 8   Ward_Facility_Code                 137057 non-null  object 
 9   Bed Grade                          137022 non-null  float64
 10  patientid                          137057 non-null  int64  
 11  City_Code_Patient                  1349

None

<h2 style="color:blue" align="left"> 3. EDA (Exploratory Data Analysis) </h2>

### Missing Values

In [56]:
display(train.isnull().sum())
display(test.isnull().sum())

case_id                                 0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                             113
patientid                               0
City_Code_Patient                    4532
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
Stay                                    0
dtype: int64

case_id                                 0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                              35
patientid                               0
City_Code_Patient                    2157
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
dtype: int64

In [57]:
train = train.drop(['case_id', 'patientid', 'Stay'], axis=1) 
test = test.drop(['case_id', 'patientid'], axis=1)

### Hash Encoding

Train_category = Train_category.drop('Age', axis=1) 
Test_category = Test_category.drop('Age', axis=1)

## 1. Hospital_type_code
 1. Hash Encoding : Rank 1
 2. map : Rank 2

In [12]:
display(train['Hospital_type_code'].value_counts())
display(test['Hospital_type_code'].value_counts())

a    143425
b     68946
c     45928
e     24770
d     20389
f     10703
g      4277
Name: Hospital_type_code, dtype: int64

a    61305
b    29938
c    20219
e    10658
d     8659
f     4549
g     1729
Name: Hospital_type_code, dtype: int64

In [13]:
train['Hospital_type_code'] = train['Hospital_type_code'].map({'a':0, 'b':1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6})
test['Hospital_type_code'] = test['Hospital_type_code'].map({'a':0, 'b':1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6})

In [14]:
display(train['Hospital_type_code'].value_counts())
display(test['Hospital_type_code'].value_counts())

0    143425
1     68946
2     45928
4     24770
3     20389
5     10703
6      4277
Name: Hospital_type_code, dtype: int64

0    61305
1    29938
2    20219
4    10658
3     8659
5     4549
6     1729
Name: Hospital_type_code, dtype: int64

## 2. Hospital_region_code
 1. Hash Encoding : Rank 1
 2. map : Rank 2

In [16]:
display(train['Hospital_region_code'].value_counts())
display(test['Hospital_region_code'].value_counts())

X    133336
Y    122428
Z     62674
Name: Hospital_region_code, dtype: int64

X    57513
Y    52279
Z    27265
Name: Hospital_region_code, dtype: int64

## 3. Department
 1. OHE : Rank 1
 2. Label Encoding : Rank 2

In [58]:
display(train['Department'].value_counts())
display(test['Department'].value_counts())

gynecology            249486
anesthesia             29649
radiotherapy           28516
TB & Chest disease      9586
surgery                 1201
Name: Department, dtype: int64

gynecology            107202
anesthesia             12709
radiotherapy           12517
TB & Chest disease      4165
surgery                  464
Name: Department, dtype: int64

In [59]:
train_Depa = pd.get_dummies(train['Department'], drop_first=True)
test_Depa = pd.get_dummies(test['Department'], drop_first=True)

In [60]:
display(train_Depa.head())
display(test_Depa.head())

Unnamed: 0,anesthesia,gynecology,radiotherapy,surgery
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0


Unnamed: 0,anesthesia,gynecology,radiotherapy,surgery
0,0,1,0,0
1,0,1,0,0
2,0,1,0,0
3,0,1,0,0
4,0,1,0,0


In [61]:
train = train.drop('Department', axis=1)
test = test.drop('Department', axis=1)

## 4. Ward_Type
 1. Label Encoding : Rank 1
 2. OHE : Rank 2

In [15]:
display(train['Ward_Type'].value_counts())
display(test['Ward_Type'].value_counts())

R    127947
Q    106165
S     77794
P      5046
T      1477
U         9
Name: Ward_Type, dtype: int64

R    54992
Q    45881
S    33372
P     2153
T      656
U        3
Name: Ward_Type, dtype: int64

In [16]:
train['Ward_Type'] = le.fit_transform(train['Ward_Type'])
test['Ward_Type'] = le.fit_transform(test['Ward_Type'])

In [17]:
display(train['Ward_Type'].value_counts())
display(test['Ward_Type'].value_counts())

2    127947
1    106165
3     77794
0      5046
4      1477
5         9
Name: Ward_Type, dtype: int64

2    54992
1    45881
3    33372
0     2153
4      656
5        3
Name: Ward_Type, dtype: int64

## 5. Ward_Facility_Code
 1. Label Encoding : Rank 1
 2. OHE : Rank 2

In [18]:
display(train['Ward_Facility_Code'].value_counts())
display(test['Ward_Facility_Code'].value_counts())

F    112753
E     55351
D     51809
C     35463
B     35156
A     27906
Name: Ward_Facility_Code, dtype: int64

F    48717
E    23707
D    22503
B    14960
C    14816
A    12354
Name: Ward_Facility_Code, dtype: int64

In [19]:
train['Ward_Facility_Code'] = le.fit_transform(train['Ward_Facility_Code'])
test['Ward_Facility_Code'] = le.fit_transform(test['Ward_Facility_Code'])

In [20]:
display(train['Ward_Facility_Code'].value_counts())
display(test['Ward_Facility_Code'].value_counts())

5    112753
4     55351
3     51809
2     35463
1     35156
0     27906
Name: Ward_Facility_Code, dtype: int64

5    48717
4    23707
3    22503
1    14960
2    14816
0    12354
Name: Ward_Facility_Code, dtype: int64

## 6. Type of Admission
 1. Label Encoding : Rank 1
 2. OHE : Rank 2

In [62]:
display(train['Type of Admission'].value_counts())
display(test['Type of Admission'].value_counts())

Trauma       152261
Emergency    117676
Urgent        48501
Name: Type of Admission, dtype: int64

Trauma       65411
Emergency    50687
Urgent       20959
Name: Type of Admission, dtype: int64

In [63]:
train['Type of Admission'] = le.fit_transform(train['Type of Admission'])
test['Type of Admission'] = le.fit_transform(test['Type of Admission'])

In [64]:
display(train['Type of Admission'].value_counts())
display(test['Type of Admission'].value_counts())

1    152261
0    117676
2     48501
Name: Type of Admission, dtype: int64

1    65411
0    50687
2    20959
Name: Type of Admission, dtype: int64

## 7. Severity of Illness
 1. Label Encoding : Rank 1
 2. OHE : Rank 2

In [65]:
display(train['Severity of Illness'].value_counts())
display(test['Severity of Illness'].value_counts())

Moderate    175843
Minor        85872
Extreme      56723
Name: Severity of Illness, dtype: int64

Moderate    75722
Minor       36863
Extreme     24472
Name: Severity of Illness, dtype: int64

In [66]:
train['Severity of Illness'] = le.fit_transform(train['Severity of Illness'])
test['Severity of Illness'] = le.fit_transform(test['Severity of Illness'])

In [67]:
display(train['Severity of Illness'].value_counts())
display(test['Severity of Illness'].value_counts())

2    175843
1     85872
0     56723
Name: Severity of Illness, dtype: int64

2    75722
1    36863
0    24472
Name: Severity of Illness, dtype: int64

## Bed Grade

In [68]:
display(train['Bed Grade'].value_counts())
display(test['Bed Grade'].value_counts())

2.0    123671
3.0    110583
4.0     57566
1.0     26505
Name: Bed Grade, dtype: int64

2.0    52780
3.0    48359
4.0    24821
1.0    11062
Name: Bed Grade, dtype: int64

In [69]:
train['Bed Grade'] = train['Bed Grade'].fillna(2.0)
test['Bed Grade'] = test['Bed Grade'].fillna(2.0)

In [70]:
display(train['Bed Grade'].isnull().sum())
display(test['Bed Grade'].isnull().sum())

0

0

## City_Code_Patient

In [71]:
display(train['City_Code_Patient'].value_counts())
display(test['City_Code_Patient'].value_counts())

8.0     124011
2.0      38869
1.0      26377
7.0      23807
5.0      20079
4.0      15380
9.0      11795
15.0      8950
10.0      8174
6.0       6005
12.0      5647
3.0       3772
23.0      3698
14.0      2927
16.0      2254
13.0      1625
21.0      1602
20.0      1409
18.0      1404
19.0      1028
26.0      1023
25.0       798
27.0       771
11.0       658
28.0       521
22.0       405
24.0       360
30.0       133
29.0        98
33.0        78
31.0        59
37.0        57
32.0        52
34.0        46
35.0        16
36.0        12
38.0         6
Name: City_Code_Patient, dtype: int64

8.0     52814
2.0     16812
1.0     11395
7.0     10151
5.0      8899
4.0      6664
9.0      4897
15.0     3854
10.0     3635
6.0      2718
12.0     2477
3.0      1629
23.0     1622
14.0     1291
16.0      933
21.0      696
18.0      606
13.0      603
20.0      527
26.0      499
19.0      430
25.0      373
11.0      275
27.0      261
28.0      193
24.0      154
22.0      129
29.0       94
30.0       61
34.0       50
33.0       43
32.0       43
37.0       21
36.0       17
35.0       14
38.0       12
31.0        8
Name: City_Code_Patient, dtype: int64

In [72]:
train['City_Code_Patient'] = train['City_Code_Patient'].fillna(8.0)
test['City_Code_Patient'] = test['City_Code_Patient'].fillna(8.0)

In [73]:
display(train['City_Code_Patient'].isnull().sum())
display(test['City_Code_Patient'].isnull().sum())

0

0

## Age

In [74]:
display(train['Age'].value_counts())
display(test['Age'].value_counts())

41-50     63749
31-40     63639
51-60     48514
21-30     40843
71-80     35792
61-70     33687
11-20     16768
81-90      7890
0-10       6254
91-100     1302
Name: Age, dtype: int64

41-50     27746
31-40     26781
51-60     20992
21-30     17717
71-80     14945
61-70     14932
11-20      7103
81-90      3350
0-10       2886
91-100      605
Name: Age, dtype: int64

In [75]:
train['Age'] = le.fit_transform(train['Age'])
test['Age'] = le.fit_transform(test['Age'])

In [76]:
display(train['Age'].value_counts())
display(test['Age'].value_counts())

4    63749
3    63639
5    48514
2    40843
7    35792
6    33687
1    16768
8     7890
0     6254
9     1302
Name: Age, dtype: int64

4    27746
3    26781
5    20992
2    17717
7    14945
6    14932
1     7103
8     3350
0     2886
9      605
Name: Age, dtype: int64

## Stay

train["Stay"] = train["Stay"].astype("int64")
train.dtypes

train["Stay"] = train["Stay"].map({0-10:1, 11-20:2, 21-30:3, 31-40:4, 41-50:5, 51-60:6, 61-70:7,
                                   71-80:8, 81-90:9, 91-100:10, "More than 100 Days":11})

train["Stay"].head()

train.head()

In [78]:
display(train.head())
display(test.head())

Unnamed: 0,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,8,c,3,Z,3,2.0,7.0,0,0,2,5,4911.0
1,2,c,5,Z,2,2.0,7.0,1,0,2,5,5954.0
2,10,e,1,X,2,2.0,7.0,1,0,2,5,4745.0
3,26,b,2,Y,2,2.0,7.0,1,0,2,5,7272.0
4,26,b,2,Y,2,2.0,7.0,1,0,2,5,5558.0


Unnamed: 0,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,21,c,3,Z,3,2.0,2.0,0,2,2,7,3095.0
1,29,a,4,X,2,2.0,2.0,1,2,4,7,4018.0
2,26,b,2,Y,3,4.0,2.0,0,2,3,7,4492.0
3,6,a,6,X,3,2.0,2.0,1,2,3,7,4173.0
4,28,b,11,X,2,2.0,2.0,1,2,4,7,4161.0


In [79]:
display(train.shape)
display(test.shape)

(318438, 12)

(137057, 12)

In [80]:
train_regcode = ce.HashingEncoder(n_components=10, cols=['Hospital_region_code', 'Hospital_type_code'])
train4 = train_regcode.fit_transform(train)
train4

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,Hospital_code,City_Code_Hospital,Available Extra Rooms in Hospital,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,0,1,0,1,0,0,0,0,0,0,8,3,3,2.0,7.0,0,0,2,5,4911.0
1,0,1,0,1,0,0,0,0,0,0,2,5,2,2.0,7.0,1,0,2,5,5954.0
2,0,0,0,0,0,0,1,0,0,1,10,1,2,2.0,7.0,1,0,2,5,4745.0
3,0,0,1,1,0,0,0,0,0,0,26,2,2,2.0,7.0,1,0,2,5,7272.0
4,0,0,1,1,0,0,0,0,0,0,26,2,2,2.0,7.0,1,0,2,5,5558.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318433,0,0,0,0,0,0,0,1,0,1,6,6,3,4.0,23.0,0,2,3,4,4144.0
318434,0,0,0,0,0,0,0,1,0,1,24,1,2,4.0,8.0,2,2,4,8,6699.0
318435,0,0,0,0,0,0,0,1,0,1,7,4,3,4.0,10.0,0,1,3,7,4235.0
318436,0,0,1,1,0,0,0,0,0,0,11,2,3,3.0,8.0,1,1,5,1,3761.0


In [81]:
train4.shape

(318438, 20)

In [82]:
test_regcode = ce.HashingEncoder(n_components=10, cols=['Hospital_region_code', 'Hospital_type_code'])
test4 = test_regcode.fit_transform(test)
test4

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,Hospital_code,City_Code_Hospital,Available Extra Rooms in Hospital,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,0,1,0,1,0,0,0,0,0,0,21,3,3,2.0,2.0,0,2,2,7,3095.0
1,0,0,0,0,0,0,0,1,0,1,29,4,2,2.0,2.0,1,2,4,7,4018.0
2,0,0,1,1,0,0,0,0,0,0,26,2,3,4.0,2.0,0,2,3,7,4492.0
3,0,0,0,0,0,0,0,1,0,1,6,6,3,2.0,2.0,1,2,3,7,4173.0
4,0,0,0,1,0,0,0,0,0,1,28,11,2,2.0,2.0,1,2,4,7,4161.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137052,0,0,1,1,0,0,0,0,0,0,11,2,4,3.0,3.0,0,1,4,4,6313.0
137053,0,0,0,0,0,0,1,0,0,1,25,1,2,4.0,7.0,0,2,2,0,3510.0
137054,0,1,0,1,0,0,0,0,0,0,30,3,2,4.0,12.0,2,1,2,0,7190.0
137055,0,0,0,0,0,0,0,1,0,1,5,1,2,4.0,10.0,1,1,2,4,5435.0


In [83]:
test4.shape

(137057, 20)

In [84]:
# Independant variable
X = train4                                # All rows & columns exclude Target features

# Dependant variable
y = train_original['Stay']               # Only target feature

In [85]:
# split  data into training and testing sets of 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)

## Random Forest

In [41]:
rf = RandomForestClassifier(n_estimators=300, criterion='gini', max_features='auto', max_depth=7,
                            min_samples_split=2, min_samples_leaf=1, bootstrap=True)
rf.fit(X_train,y_train)

RandomForestClassifier(max_depth=7, n_estimators=300)

In [42]:
pred_rf = rf.predict(X_test)

In [43]:
accuracy_score(y_test, pred_rf )

0.4017554327345811

In [63]:
print("Train Score {:.2f} & Test Score {:.2f}".format(rf.score(X_train, y_train), rf.score(X_test, y_test)))

Train Score 0.40 & Test Score 0.40


## XGBOOST

In [56]:
reg_xgb = xgboost.XGBClassifier()
reg_xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [57]:
# predicting X_test
y_pred_xgb = reg_xgb.predict(X_test)

In [58]:
accuracy_score(y_test, y_pred_xgb)

0.3464373464373464

In [59]:
print("Train Score {:.2f} & Test Score {:.2f}".format(reg_xgb.score(X_train,y_train),reg_xgb.score(X_test,y_test)))

Train Score 0.69 & Test Score 0.35


------------

In [44]:
reg_xgb1 = xgboost.XGBClassifier(learning_rate=0.05, max_depth=5, n_estimators=500, objective='multi:softprob', 
                                 subsample=0.9, verbosity = 1, colsample_bytree=0.9, min_child_weight=2)
reg_xgb1.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.9, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.05, max_delta_step=0, max_depth=5,
              min_child_weight=2, missing=nan, monotone_constraints='()',
              n_estimators=500, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=0.9,
              tree_method='exact', validate_parameters=1, verbosity=1)

In [45]:
y_pred_xgb1 = reg_xgb1.predict(X_test)

In [46]:
accuracy_score(y_test, y_pred_xgb1)

0.426422559979902

In [67]:
print("Train Score {:.2f} & Test Score {:.2f}".format(reg_xgb1.score(X_train,y_train), reg_xgb1.score(X_test,y_test)))

Train Score 0.56 & Test Score 0.35


## LGBM

In [86]:
from lightgbm import LGBMClassifier

In [87]:
lgbm_model1 = LGBMClassifier(boosting_type='gbdt', max_depth=15, learning_rate=0.15, objective='multiclass',
                           random_state=100, n_estimators=200, reg_alpha=0, reg_lambda=1, n_jobs=-1)
lgbm_model1 = lgbm_model1.fit(X_train, y_train)

In [88]:
y_pred_LGBM = lgbm_model1.predict(X_test)

In [89]:
accuracy_score(y_test, y_pred_LGBM)

0.4098103253360131

In [53]:
print("Train Score {:.2f} & Test Score {:.2f}".format(lgbm_model.score(X_train,y_train),lgbm_model.score(X_test,y_test)))

Train Score 0.45 & Test Score 0.43


## CATBOOST

In [89]:
from catboost import CatBoostClassifier

CB = CatBoostClassifier(verbose=0, n_estimators=500)
CB.fit(X_train, y_train)

CatBoostError: c:/program files (x86)/go agent/pipelines/buildmaster/catboost.git/catboost/libs/data/features_layout.cpp:94: All feature names should be different, but 'Age' used more than once.

In [52]:
y_pred_cat = CB.predict(X_test)

In [53]:
accuracy_score(y_test, y_pred_cat)

0.42125675166436377

In [83]:
print("Train Score {:.2f} & Test Score {:.2f}".format(CB.score(X_train,y_train), CB.score(X_test,y_test)))

Train Score 0.43 & Test Score 0.42


## Gradient Boosting

In [54]:
from sklearn.ensemble import GradientBoostingClassifier
     
GBC = GradientBoostingClassifier(n_estimators=300)
GBC.fit(X_train, y_train)

KeyboardInterrupt: 

In [88]:
y_pred_gbc = GBC.predict(X_test)

In [89]:
accuracy_score(y_test, y_pred_gbc)

0.4214765732948122

In [90]:
print("Train Score {:.2f} & Test Score {:.2f}".format(GBC.score(X_train,y_train), GBC.score(X_test,y_test)))

Train Score 0.43 & Test Score 0.42


### Submission

In [54]:
# predicting X_test
y_pred_test = lgbm_model.predict(test3)

In [55]:
submission = pd.DataFrame({'case_id': test_original['case_id'], 'Stay': y_pred_test})
submission.to_csv('Stay.csv', index=False)