# Hospital Length of Stay Dataset
## Linear Regression Modeling (Prediction)


In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score



In [3]:
df = pd.read_csv("../data/raw/LengthOfStay.csv")
df.head()


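| Unnamed: 0 | eid | vdate | rcount | gender | dialysisrenalendstage | asthma | irondef | pneum | substancedependence | psychologicaldisordermajor | ... | glucose | bloodureanitro | creatinine | bmi | pulse | respiration | secondarydiagnosisnonicd9 | discharged | facid | lengthofstay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8/29/2012 | 0 | F | 0 | 0 | 0 | 0 | 0 | 0 | ... | 192.476918 | 12.0 | 1.390722 | 30.432418 | 96 | 6.5 | 4 | 9/1/2012 | B | 3 |
| 1 | 2 | 5/26/2012 | 5+ | F | 0 | 0 | 0 | 0 | 0 | 0 | ... | 94.078507 | 8.0 | 0.943164 | 28.460516 | 61 | 6.5 | 1 | 6/2/2012 | A | 7 |
| 2 | 3 | 9/22/2012 | 1 | F | 0 | 0 | 0 | 0 | 0 | 0 | ... | 130.530524 | 12.0 | 1.06575 | 28.843812 | 64 | 6.5 | 2 | 9/25/2012 | B | 3 |
| 3 | 4 | 8/9/2012 | 0 | F | 0 | 0 | 0 | 0 | 0 | 0 | ... | 163.377028 | 12.0 | 0.906862 | 27.959007 | 76 | 6.5 | 1 | 8/10/2012 | A | 1 |
| 4 | 5 | 12/20/2012 | 0 | F | 0 | 0 | 0 | 1 | 0 | 1 | ... | 94.886654 | 11.5 | 1.242854 | 30.258927 | 67 | 5.6 | 2 | 12/24/2012 | E | 4 |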


In [4]:
target = "lengthofstay"

X = df.drop(columns=[target])
y = df[target]
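
In the five rows displayed above, `discharged` minus `vdate` equals `lengthofstay` exactly, so keeping both date columns (and the row identifier `eid`) in `X` may leak the target into the features. The sketch below checks this on those five rows only and shows one possible way to drop the suspect columns. Whether the relationship holds for the full file is an assumption to verify against `df`. The names `sample`, `leaky_columns`, and `X_sample` are illustrative and not part of the notebook. Dropping these columns in the real pipeline would change the metrics reported later.

```python
import pandas as pd

# Five rows copied from the df.head() output above (relevant columns only).
sample = pd.DataFrame({
    "eid": [1, 2, 3, 4, 5],
    "vdate": ["8/29/2012", "5/26/2012", "9/22/2012", "8/9/2012", "12/20/2012"],
    "gender": ["F", "F", "F", "F", "F"],
    "discharged": ["9/1/2012", "6/2/2012", "9/25/2012", "8/10/2012", "12/24/2012"],
    "facid": ["B", "A", "B", "A", "E"],
    "lengthofstay": [3, 7, 3, 1, 4],
})

# Number of days between admission (vdate) and discharge.
stay_days = (
    pd.to_datetime(sample["discharged"], format="%m/%d/%Y")
    - pd.to_datetime(sample["vdate"], format="%m/%d/%Y")
).dt.days

# True if the two date columns reproduce the target on every row.
leaks_target = bool((stay_days == sample["lengthofstay"]).all())
print("dates reproduce the target:", leaks_target)

# Hypothetical alternative feature matrix: drop the identifier and both dates.
leaky_columns = ["eid", "vdate", "discharged"]
X_sample = sample.drop(columns=["lengthofstay"] + leaky_columns)
print(list(X_sample.columns))
```
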


In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)




In [6]:
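# Split columns by dtype: numeric columns are used as-is, everything else
# (strings such as gender, facid, the date columns, and rcount with its "5+" value)
# is treated as categorical.
numeric_features = X.select_dtypes(include=[np.number]).columns
categorical_features = X.select_dtypes(exclude=[np.number]).columns

preprocess = ColumnTransformer(
    transformers=[
        # Numeric columns pass through unchanged (no scaling).
        ("num", "passthrough", numeric_features),
        # Categories unseen during fit are encoded as all zeros at predict time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Chain preprocessing and the regressor so both are fitted on the training data only.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("regressor", LinearRegression())
])

model.fit(X_train, y_train)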


In [8]:
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)

mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5   # RMSE = sqrt(MSE)

r2 = r2_score(y_test, pred)

print("MAE :", mae)
print("RMSE:", rmse)
print("R^2 :", r2)



MAE : 0.8898838789160418
RMSE: 1.1573679842371645
R^2 : 0.7558861473632807


Model performance was evaluated on the held-out 20% test split using MAE, RMSE, and R².
RMSE was computed as the square root of the Mean Squared Error (MSE), which puts it in the same unit as the target (days).
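
For reference, the three metrics can be written out as follows. The symbols are introduced here for illustration: $y_i$ is the actual length of stay for test row $i$, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual test values, and $n$ the number of test rows.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

As a quick check against the printed output, an RMSE of about 1.157 corresponds to an MSE of about $1.157^2 \approx 1.34$.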


## Model Interpretation

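- **MAE** (about 0.89 days) is the average absolute difference between predicted and actual length of stay, so predictions miss by just under one day on average.
- **RMSE** (about 1.16 days) squares the errors before averaging, so it penalizes large errors more heavily than MAE. The gap between the two reflects that errors vary in size rather than all being equal.
- **R²** (about 0.76) is the share of the variance in length of stay on the test set that the model explains, roughly 76%.

This Linear Regression model serves as a baseline for predicting hospital length of stay.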
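
One way to put a baseline's numbers in context is to compare them with a model that always predicts the training mean, which by construction scores an R² near zero. The sketch below does this on synthetic data, not the hospital dataset. `X_demo`, `y_demo`, and `results` are hypothetical names, and the data-generating coefficients are arbitrary. `DummyRegressor` is a swapped-in reference point that the notebook itself does not use.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: linear signal plus Gaussian noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit a mean-only predictor and a linear model, then score both on the test split.
results = {}
for name, estimator in [
    ("mean", DummyRegressor(strategy="mean")),
    ("linear", LinearRegression()),
]:
    estimator.fit(X_tr, y_tr)
    pred_demo = estimator.predict(X_te)
    results[name] = {
        "mae": mean_absolute_error(y_te, pred_demo),
        "r2": r2_score(y_te, pred_demo),
    }
    print(name, results[name])
```

With the notebook's own `X_train`, `X_test`, `y_train`, and `y_test`, the same loop would show how much of the 0.89-day MAE is an improvement over always predicting the mean stay.
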
