**Project Overview: Optimizing Patient Care with Advanced Predictive Modeling**

**Objective:**
The goal of this project was to use machine learning to predict each patient's length of stay, so that hospitals can allocate resources and plan operations more efficiently.

**Project Highlights:**

1. **Data Exploration and Preprocessing:**
   - Loaded and explored the training data from a CSV file.
   - Addressed missing values in 'Bed Grade' and 'City_Code_Patient' through mode imputation.
   - Applied Label Encoding to the target 'Stay' column and to the categorical feature columns.

2. **Model Exploration and Selection:**
   - Investigated several machine learning models: XGBoost, Random Forest, LightGBM, K-Nearest Neighbors, Logistic Regression, Naive Bayes, and a Neural Network.
   - Evaluated each model's accuracy on a held-out 20% split of the training data.
   - XGBoost and LightGBM reached the highest hold-out accuracy (0.42 each); XGBoost was used for the final predictions.

3. **Model Implementation and Predictions:**
   - Trained the selected XGBoost model on the training split.
   - Used the trained model to predict patient stays in the test data.
   - Organized and presented the prediction results for further analysis.

**Conclusion:**
This project gives an overview of using predictive modeling to support patient care and hospital efficiency. By comparing several models and generating predictions with the best-scoring gradient-boosted model (about 0.42 hold-out accuracy), I aim to predict patient stays well enough to support resource management and healthcare operations. Future work includes refining the model and potentially deploying it for real-time predictions in healthcare management.

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk(''):
    for filename in filenames:
        print(os.path.join(dirname, filename))

print("ok")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

ok


In [4]:
train = pd.read_csv('train.csv')

In [5]:
train.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50


In [6]:
test = pd.read_csv("test.csv")
test.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0


In [7]:
train.isnull().sum()

case_id                                 0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                             113
patientid                               0
City_Code_Patient                    4532
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
Stay                                    0
dtype: int64

In [8]:
test.isnull().sum()

case_id                                 0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                              35
patientid                               0
City_Code_Patient                    2157
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
dtype: int64

In [11]:
train['Bed Grade'] = train['Bed Grade'].fillna(train['Bed Grade'].mode()[0])
train['City_Code_Patient'] = train['City_Code_Patient'].fillna(train['City_Code_Patient'].mode()[0])


In [13]:
test['Bed Grade'] = test['Bed Grade'].fillna(test['Bed Grade'].mode()[0])
test['City_Code_Patient'] = test['City_Code_Patient'].fillna(test['City_Code_Patient'].mode()[0])
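The two cells above fill each gap with the most frequent value of the same frame. As a minimal sketch of the same idea, the helper below wraps the step in a function and optionally takes the mode from a reference frame (for example the training data) instead. The name `fill_with_mode` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def fill_with_mode(frame, columns, reference=None):
    """Return a copy of `frame` with NaNs in `columns` replaced by the mode.

    The mode comes from `reference` when given (e.g. the training frame),
    otherwise from `frame` itself, which mirrors the cells above.
    """
    source = frame if reference is None else reference
    filled = frame.copy()
    for col in columns:
        # Series.mode() ignores NaN and returns the most frequent values in
        # ascending order, so [0] picks the smallest one on ties.
        filled[col] = filled[col].fillna(source[col].mode()[0])
    return filled

# Toy data standing in for the real columns (values are illustrative only).
toy_train = pd.DataFrame({"Bed Grade": [2.0, 2.0, 3.0, None]})
toy_test = pd.DataFrame({"Bed Grade": [4.0, None, 4.0]})

own_mode = fill_with_mode(toy_test, ["Bed Grade"])
train_mode = fill_with_mode(toy_test, ["Bed Grade"], reference=toy_train)
```

Whether the test set should reuse the training mode is a design choice; the notebook itself uses each frame's own mode.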


In [14]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

train['Stay'] = le.fit_transform(train['Stay'].astype('str'))

In [15]:
test['Stay'] = -1

In [16]:
df = pd.concat([train, test], ignore_index=True)

In [17]:
# List of categorical column names
categorical_columns = ['Hospital_type_code', 'Hospital_region_code', 'Department', 
                       'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 
                       'Severity of Illness', 'Age']

# Iterate over each categorical column and apply label encoding
for col in categorical_columns:
    df[col] = le.fit_transform(df[col].astype(str))
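Because the loop above re-fits the same `le` object on every feature column, the classes it learned for 'Stay' in cell 14 are overwritten. One possible alternative, sketched here on toy data, keeps a dedicated encoder for the target so that predicted labels can later be decoded with `inverse_transform`. The names `stay_encoder` and `feature_encoders` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one categorical feature and the target (illustrative values).
toy = pd.DataFrame({
    "Department": ["radiotherapy", "anesthesia", "radiotherapy"],
    "Stay": ["0-10", "41-50", "31-40"],
})

# Dedicated encoder for the target; it is never re-fitted afterwards.
stay_encoder = LabelEncoder()
toy["Stay"] = stay_encoder.fit_transform(toy["Stay"].astype(str))

# One fresh encoder per feature column, kept in a dict for later reuse.
feature_encoders = {}
for col in ["Department"]:
    encoder = LabelEncoder()
    toy[col] = encoder.fit_transform(toy[col].astype(str))
    feature_encoders[col] = encoder

# Integer labels map back to the original stay ranges.
decoded = stay_encoder.inverse_transform(toy["Stay"])
```

With this layout, integer predictions can be turned back into the original stay ranges rather than placeholder names.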

In [18]:
# Creating a new DataFrame 'train' by filtering rows where 'Stay' is not equal to -1
train = df[df['Stay'] != -1]

In [19]:
# Creating a new DataFrame 'test' by filtering rows where 'Stay' is equal to -1
test = df[df['Stay'] == -1]

In [20]:
columns_to_remove = ['Stay', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code']

# Creating a new DataFrame 'test1' by removing specified columns from 'test'
test1 = test.drop(columns=columns_to_remove)

In [21]:
columns_to_remove_train = ['case_id', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code']

# Creating a new DataFrame 'train1' by removing specified columns from 'train'
train1 = train.drop(columns=columns_to_remove_train)
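The train and test frames are trimmed with two different column lists, and the model later expects the same feature columns in the same order at prediction time. A small sanity check along these lines can catch a mismatch early. The helper `check_feature_alignment` and the toy frames are hypothetical.

```python
import pandas as pd

def check_feature_alignment(train_frame, test_frame, target="Stay", id_col="case_id"):
    """Return the shared feature list, or raise if the two frames disagree."""
    train_features = [c for c in train_frame.columns if c != target]
    test_features = [c for c in test_frame.columns if c != id_col]
    if train_features != test_features:
        raise ValueError(
            f"Feature mismatch: train has {train_features}, test has {test_features}"
        )
    return train_features

# Toy frames shaped like train1 (features + target) and test1 (id + features).
toy_train1 = pd.DataFrame({"Hospital_code": [8], "Age": [5], "Stay": [0]})
toy_test1 = pd.DataFrame({"case_id": [318439], "Hospital_code": [21], "Age": [7]})

features = check_feature_alignment(toy_train1, toy_test1)
```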

In [22]:
from sklearn.model_selection import train_test_split

X1 = train1.drop(columns=['Stay'])

y1 = train1['Stay']

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.2, random_state=100)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (254750, 13)
Shape of X_test: (63688, 13)
Shape of y_train: (254750,)
Shape of y_test: (63688,)
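The split above is purely random. Since the stay categories are unevenly represented, one option worth considering is a stratified split, which keeps the class proportions the same in both parts. The sketch below shows this on synthetic data; the arrays and class sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class problem (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 700 + [1] * 250 + [2] * 50)

# stratify=y_toy keeps each class's share equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
```

In the notebook this would correspond to passing `stratify=y1` to the existing `train_test_split` call.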



## XGBoost

In [24]:
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Creating an XGBoost classifier
classifier_xgb = xgb.XGBClassifier(
    max_depth=4,
    learning_rate=0.1,
    n_estimators=1000,
    objective='multi:softmax',
    reg_alpha=0.5,
    reg_lambda=1.5,
    booster='gbtree',
    n_jobs=4,
    min_child_weight=2,
    base_score=0.75
)


model_xgb = classifier_xgb.fit(X_train, y_train)


prediction_xgb = model_xgb.predict(X_test)

# Calculating accuracy of the XGBoost model's predictions
acc_score_xgb = accuracy_score(y_test, prediction_xgb)

# Rounding the accuracy score to two decimal places
acc_score_xgb = round(acc_score_xgb, 2)

print("Accuracy Score of XGBoost Model:", acc_score_xgb)

Accuracy Score of XGBoost Model: 0.42


## Random Forest

In [25]:
from sklearn.ensemble import RandomForestClassifier

classifier_rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

model_rf = classifier_rf.fit(X_train, y_train)

prediction_rf = model_rf.predict(X_test)

acc_score_rf = accuracy_score(y_test, prediction_rf)

acc_score_rf = round(acc_score_rf, 2)

print("Accuracy Score of Random Forest Model:", acc_score_rf)

Accuracy Score of Random Forest Model: 0.39


## LightGBM

In [27]:
import lightgbm as lgb

classifier_lgb = lgb.LGBMClassifier(max_depth=5, learning_rate=0.1, n_estimators=100, random_state=42)

model_lgb = classifier_lgb.fit(X_train, y_train)

prediction_lgb = model_lgb.predict(X_test)

acc_score_lgb = accuracy_score(y_test, prediction_lgb)

acc_score_lgb = round(acc_score_lgb, 2)

print("Accuracy Score of LightGBM Model:", acc_score_lgb)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002306 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 420
[LightGBM] [Info] Number of data points in the train set: 254750, number of used features: 13
[LightGBM] [Info] Start training from score -2.598954
[LightGBM] [Info] Start training from score -1.404748
[LightGBM] [Info] Start training from score -1.291316
[LightGBM] [Info] Start training from score -1.755525
[LightGBM] [Info] Start training from score -3.295221
[LightGBM] [Info] Start training from score -2.206793
[LightGBM] [Info] Start training from score -4.771564
[LightGBM] [Info] Start training from score -3.437125
[LightGBM] [Info] Start training from score -4.193769
[LightGBM] [Info] Start training from score -4.751371
[LightGBM] [Info] Start training from score -3.861879
Accuracy Score of LightGBM Model: 0.42
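The eleven "Start training from score" lines in the log above can be read as a rough picture of class balance. Assuming LightGBM's multiclass objective starts each class at the log of its frequency in the training split (my understanding of the library's behaviour, not something the log states), exponentiating the scores recovers the class shares and gives a majority-class baseline to compare the accuracies against.

```python
import math

# Start scores copied from the LightGBM log above, in class order 0..10.
start_scores = [
    -2.598954, -1.404748, -1.291316, -1.755525, -3.295221, -2.206793,
    -4.771564, -3.437125, -4.193769, -4.751371, -3.861879,
]

# Assumption: each score is log(class frequency), so exp() gives the share.
class_shares = [math.exp(score) for score in start_scores]

# Always predicting the most frequent class would score its share as accuracy.
majority_class = max(range(len(class_shares)), key=class_shares.__getitem__)
baseline_accuracy = class_shares[majority_class]

print(f"sum of shares: {sum(class_shares):.3f}")
print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Under that assumption the shares sum to about 1 and the largest class covers roughly 27% of the training split, so an accuracy near 0.27 is about what a constant prediction would achieve.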


## KNN

In [28]:
from sklearn.neighbors import KNeighborsClassifier


classifier_knn = KNeighborsClassifier(n_neighbors=5)

model_knn = classifier_knn.fit(X_train, y_train)

prediction_knn = model_knn.predict(X_test)

acc_score_knn = accuracy_score(y_test, prediction_knn)

acc_score_knn = round(acc_score_knn, 2)

print("Accuracy Score of K-Nearest Neighbors (KNN) Model:", acc_score_knn)

Accuracy Score of K-Nearest Neighbors (KNN) Model: 0.27


## Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression

classifier_lr = LogisticRegression(max_iter=100, random_state=42)

model_lr = classifier_lr.fit(X_train, y_train)

prediction_lr = model_lr.predict(X_test)

acc_score_lr = accuracy_score(y_test, prediction_lr)

acc_score_lr = round(acc_score_lr, 2)

print("Accuracy Score of Logistic Regression Model:", acc_score_lr)

Accuracy Score of Logistic Regression Model: 0.31


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
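The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of both together, using a scikit-learn pipeline on synthetic data, is shown below. The dataset, the inflated first column, and `max_iter=1000` are illustrative assumptions; this was not run on the hospital data, so no accuracy is claimed for it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the encoded features.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=42
)
# Inflate one column to mimic a large-valued feature such as a deposit amount.
X_toy[:, 0] *= 5000.0

# Standardize the features first, then fit logistic regression with more iterations.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_toy, y_toy)
toy_predictions = scaled_lr.predict(X_toy)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, which avoids leaking statistics from the hold-out split.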


## Gaussian Naive Bayes

In [30]:
from sklearn.naive_bayes import GaussianNB

classifier_nb = GaussianNB()

model_nb = classifier_nb.fit(X_train, y_train)

prediction_nb = model_nb.predict(X_test)

acc_score_nb = accuracy_score(y_test, prediction_nb)

acc_score_nb = round(acc_score_nb, 2)

print("Accuracy Score of Naive Bayes Model:", acc_score_nb)

Accuracy Score of Naive Bayes Model: 0.36


## Neural network

In [31]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_nn = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

model_nn.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model_nn.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=2)

probabilities = model_nn.predict(X_test_scaled)

prediction_nn = np.argmax(probabilities, axis=1)

acc_score_nn = accuracy_score(y_test, prediction_nn)

acc_score_nn = round(acc_score_nn, 2)

print("Accuracy Score of Neural Network Model:", acc_score_nn)

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/10
6369/6369 - 9s - 1ms/step - accuracy: 0.3841 - loss: 1.6209 - val_accuracy: 0.3969 - val_loss: 1.5823
Epoch 2/10
6369/6369 - 9s - 1ms/step - accuracy: 0.3976 - loss: 1.5653 - val_accuracy: 0.4034 - val_loss: 1.5611
Epoch 3/10
6369/6369 - 8s - 1ms/step - accuracy: 0.4024 - loss: 1.5517 - val_accuracy: 0.4037 - val_loss: 1.5539
Epoch 4/10
6369/6369 - 8s - 1ms/step - accuracy: 0.4041 - loss: 1.5448 - val_accuracy: 0.4021 - val_loss: 1.5500
Epoch 5/10
6369/6369 - 9s - 1ms/step - accuracy: 0.4068 - loss: 1.5388 - val_accuracy: 0.4026 - val_loss: 1.5481
Epoch 6/10
6369/6369 - 9s - 1ms/step - accuracy: 0.4108 - loss: 1.5333 - val_accuracy: 0.4087 - val_loss: 1.5430
Epoch 7/10
6369/6369 - 9s - 1ms/step - accuracy: 0.4125 - loss: 1.5292 - val_accuracy: 0.4118 - val_loss: 1.5396
Epoch 8/10
6369/6369 - 9s - 1ms/step - accuracy: 0.4136 - loss: 1.5258 - val_accuracy: 0.4104 - val_loss: 1.5371
Epoch 9/10
6369/6369 - 8s - 1ms/step - accuracy: 0.4141 - loss: 1.5236 - val_accuracy: 0.4120 - 

## XGBoost and LightGBM tie for the highest hold-out accuracy (0.42); XGBoost is used for the final predictions
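The hold-out accuracies printed in the sections above can be collected side by side for comparison. The sketch below only restates those rounded numbers; the neural network is left out because its final accuracy line is not shown in the output.

```python
import pandas as pd

# Rounded hold-out accuracies as printed by the cells above.
reported_accuracy = pd.Series(
    {
        "XGBoost": 0.42,
        "Random Forest": 0.39,
        "LightGBM": 0.42,
        "KNN": 0.27,
        "Logistic Regression": 0.31,
        "Naive Bayes": 0.36,
    },
    name="accuracy",
)

# Rank the models and list every model that shares the top score.
ranked = reported_accuracy.sort_values(ascending=False)
top_models = ranked[ranked == ranked.max()].index.tolist()

print(ranked)
print("Top models:", top_models)
```

Because the scores are rounded to two decimals, the tie between the two boosted models may hide a small difference; comparing unrounded values would settle it.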

In [32]:
pred_xgb = classifier_xgb.predict(test1.drop(columns=['case_id']))

# Creating a new DataFrame 'result_xgb' to organize prediction results
result_xgb = pd.DataFrame()

# Adding the 'case_id' column to 'result_xgb'
result_xgb['case_id'] = test1['case_id']

result_xgb['Stay'] = pred_xgb

# Reordering columns in 'result_xgb'
result_xgb = result_xgb[['case_id', 'Stay']]

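# Replacing numeric labels in the 'Stay' column with placeholder category names
# (the original stay ranges cannot be decoded from `le` here, because it was re-fitted on the feature columns in cell 17)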
label_mapping = {0: 'CategoryA', 1: 'CategoryB', 2: 'CategoryC', 3: 'CategoryD', 4: 'CategoryE', 5: 'CategoryF', 6: 'CategoryG', 7: 'CategoryH', 8: 'CategoryI', 9: 'CategoryJ', 10: 'CategoryK', 11: 'CategoryL', 12: 'CategoryM', 13: 'CategoryN', 14: 'CategoryO'}
result_xgb['Stay'] = result_xgb['Stay'].replace(label_mapping)

print(result_xgb)

        case_id       Stay
318438   318439  CategoryA
318439   318440  CategoryF
318440   318441  CategoryC
318441   318442  CategoryC
318442   318443  CategoryF
...         ...        ...
455490   455491  CategoryC
455491   455492  CategoryA
455492   455493  CategoryB
455493   455494  CategoryB
455494   455495  CategoryF

[137057 rows x 2 columns]


In [33]:
# Group by predicted 'Stay' value and count the unique 'case_id' values in each group
result_xgb.groupby('Stay')['case_id'].nunique()

Stay
CategoryA     4256
CategoryB    39321
CategoryC    58357
CategoryD    12779
CategoryE       68
CategoryF    18473
CategoryG       17
CategoryH      361
CategoryI     1177
CategoryJ       81
CategoryK     2167
Name: case_id, dtype: int64
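The counts above are easier to compare as shares of all predictions. The sketch below re-enters the printed counts by hand and normalizes them; in the notebook the same series could presumably be taken straight from the `groupby` result.

```python
import pandas as pd

# Predicted counts per placeholder category, copied from the output above.
predicted_counts = pd.Series({
    "CategoryA": 4256, "CategoryB": 39321, "CategoryC": 58357, "CategoryD": 12779,
    "CategoryE": 68, "CategoryF": 18473, "CategoryG": 17, "CategoryH": 361,
    "CategoryI": 1177, "CategoryJ": 81, "CategoryK": 2167,
})

# Share of all test predictions that fall in each category.
predicted_share = (predicted_counts / predicted_counts.sum()).round(4)

# Combined share of the three most frequently predicted categories.
top3_share = predicted_share.nlargest(3).sum()

print(predicted_share.sort_values(ascending=False))
```

The three most frequently predicted categories account for well over 80% of the test predictions, which suggests the model rarely predicts the rarer stay ranges.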