This notebook is for testing out different ways to build a linear regression model for predicting length of stay from the NY State Hospital Discharge dataset. The steps are as following:<br />
<ol>
<b>1.)</b> Build the linear regression model with default parameters, trained on entire population, with basic label encoding for categorical variables<br />
<b>2.)</b> Build the linear regression model with default parameters, trained on subpopulation (population with heart failure), with basic label encoding for categorical variables<br />
<b>3.)</b> Build the linear regression model with default parameters, trained on subpopulation (population with heart failure), with OneHotEncoding for categorical variables<br />
<b>4.)</b> Try different models
</ol>

In [52]:
#Import statements for entire notebook
import pandas as pd
import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import prince
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

In [53]:
#Run ingension and cleaning files
#! python data/data_ingestion.py

In [54]:
#Run ingension and cleaning files
#! python data/data_cleaning.py

In [55]:
#Load in the data
df = pd.read_csv('data\Hospital_Inpatient_Discharges_17_18_cleaned.csv')
df.head()

#Create dataframe to hold scores at each step
evaluation_df = pd.DataFrame(columns=['Model', 'Features', 'Population', 'Target Variable Normalized', 'Encoding Type', 'R2', 'MSE', 'MAE'])

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [56]:
#Helper functions
def encode_categorical_variables(df_to_encode):
    encoded_df = df_to_encode
    categorical_columns = df_to_encode.select_dtypes(exclude='number')
    for column in categorical_columns:
        values = df_to_encode[column].unique()
        encoded_values = [i for i in range(1, len(values) + 1)]
        encoded_df[column].replace(values, encoded_values, inplace=True)
    return encoded_df

def get_scores(y_pred, y_true):
    mae = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)
    return [r2, mse, mae]


Step 1: Building default model with all features and label encoding for categorical variables, trained on entire population in dataset

In [57]:
#Reduce sample for training
sample_df = df.sample(2000000)

#Fill na with most common values
sample_df = sample_df.fillna(sample_df.mode().iloc[0])

X = sample_df.loc[:, ~sample_df.columns.isin(['Length of Stay', 'Total Costs', 'Total Charges'])]
y = sample_df[['Length of Stay']]

X_train_all_features, X_test_all_features, y_train_all_features, y_test_all_features = train_test_split(X, y, test_size = 0.25)

X_train_all_features = encode_categorical_variables(X_train_all_features)
X_test_all_features = encode_categorical_variables(X_test_all_features)

lr = LinearRegression()
lr.fit(X_train_all_features, y_train_all_features)
preds = lr.predict(X_test_all_features)

scores = get_scores(preds, y_test_all_features)
evaluation_df = evaluation_df.append({'Model': 'Linear Regression', 'Features': 'All Features', 'Population': 'All', 'Target Variable Normalized': 'No', 'Encoding Type': 'Label Encoding', 'R2':scores[0], 'MSE':scores[1], 'MAE': scores[2]}, ignore_index=True)
evaluation_df.head()

Unnamed: 0,Model,Features,Population,Target Variable Normalized,Encoding Type,R2,MSE,MAE
0,Linear Regression,All Features,All,No,Label Encoding,0.137862,45.465472,3.557802


Step 2: Try to improve the model by only training and predicting for a subgroup of the population. In this case, any patients who have heart failure when they get to the hospital

In [58]:
sample_df_heart_failure = sample_df[(sample_df['APR DRG Description'].str.startswith('Heart')) | (sample_df['APR DRG Code'] == 194)]

X = sample_df_heart_failure.loc[:, ~sample_df.columns.isin(['Length of Stay', 'Total Costs', 'Total Charges'])]
y = sample_df_heart_failure[['Length of Stay']]

X_train_all_features, X_test_all_features, y_train_all_features, y_test_all_features = train_test_split(X, y, test_size = 0.25)

X_train_all_features = encode_categorical_variables(X_train_all_features)
X_test_all_features = encode_categorical_variables(X_test_all_features)

lr = LinearRegression()
lr.fit(X_train_all_features, y_train_all_features)
preds = lr.predict(X_test_all_features)

scores = get_scores(preds, y_test_all_features)
evaluation_df = evaluation_df.append({'Model': 'Linear Regression', 'Features': 'All Features', 'Population': 'Subgroup - Patients w/Heart Failure', 'Target Variable Normalized': 'No', 'Encoding Type': 'Label Encoding', 'R2':scores[0], 'MSE':scores[1], 'MAE': scores[2]}, ignore_index=True)
evaluation_df.head()

Unnamed: 0,Model,Features,Population,Target Variable Normalized,Encoding Type,R2,MSE,MAE
0,Linear Regression,All Features,All,No,Label Encoding,0.137862,45.465472,3.557802
1,Linear Regression,Subgroup - Patients w/Heart Failure,All,No,Label Encoding,0.244538,28.799774,3.395329


Step 3: Building a model just for a subpopulation (patients with heart failure) improved the R2, MSE, and MAE scores. To try to improve the scores further, the next step is to try the model with a normalized target variable

In [59]:
sample_df_heart_failure = sample_df[(sample_df['APR DRG Description'].str.startswith('Heart')) | (sample_df['APR DRG Code'] == 194)]

X = sample_df_heart_failure.loc[:, ~sample_df.columns.isin(['Length of Stay', 'Total Costs', 'Total Charges'])]
y = sample_df_heart_failure[['Length of Stay']]

X_train_all_features, X_test_all_features, y_train_all_features, y_test_all_features = train_test_split(X, y, test_size = 0.25)

X_train_all_features = encode_categorical_variables(X_train_all_features)
X_test_all_features = encode_categorical_variables(X_test_all_features)

scaler = MinMaxScaler()
scaler.fit(y_train_all_features)

tt = TransformedTargetRegressor(regressor=LinearRegression(), transformer=MinMaxScaler())
tt.fit(X_train_all_features, y_train_all_features)
preds = tt.predict(X_test_all_features)
score = tt.score(X_test_all_features, y_test_all_features)

scores = get_scores(preds, scaler.transform(y_test_all_features))
evaluation_df = evaluation_df.append({'Model': 'Linear Regression', 'Features': 'All Features', 'Population': 'Subgroup - Patients w/Heart Failure', 'Target Variable Normalized': 'Yes - MinMaxScaler', 'Encoding Type': 'Label Encoding', 'R2':score, 'MSE':scores[1], 'MAE': scores[2]}, ignore_index=True)
evaluation_df.head()


tt2 = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log, inverse_func=np.exp)
tt2.fit(X_train_all_features, y_train_all_features)
preds = tt2.predict(X_test_all_features)
score = tt2.score(X_test_all_features, y_test_all_features)

scores = get_scores(preds, np.log(y_test_all_features))
evaluation_df = evaluation_df.append({'Model': 'Linear Regression', 'Features': 'All Features', 'Population': 'Subgroup - Patients w/Heart Failure', 'Target Variable Normalized': 'Yes - Log Transform', 'Encoding Type': 'Label Encoding', 'R2':score, 'MSE':scores[1], 'MAE': scores[2]}, ignore_index=True)
evaluation_df.head()

Unnamed: 0,Model,Features,Population,Target Variable Normalized,Encoding Type,R2,MSE,MAE
0,Linear Regression,All Features,All,No,Label Encoding,0.137862,45.465472,3.557802
1,Linear Regression,Subgroup - Patients w/Heart Failure,All,No,Label Encoding,0.244538,28.799774,3.395329
2,Linear Regression,All Features,Subgroup - Patients w/Heart Failure,Yes - MinMaxScaler,Label Encoding,0.274588,42.647823,5.652492
3,Linear Regression,All Features,Subgroup - Patients w/Heart Failure,Yes - Log Transform,Label Encoding,-0.261394,39.072295,3.289621


Step 4: Transformation of the target variable reduced the MAE and MSE, but did not improve the R2 score. Next, try different encoding methods such as OneHotEncoder

In [64]:
X = sample_df_heart_failure.loc[:, ~sample_df.columns.isin(['Length of Stay', 'Total Costs', 'Total Charges'])]
y = sample_df_heart_failure[['Length of Stay']]

curr_columns = X.columns
print(curr_columns)

X = pd.get_dummies(X)

X = X.loc[:, ~X.columns.isin(curr_columns)]

X_train_all_features, X_test_all_features, y_train_all_features, y_test_all_features = train_test_split(X, y, test_size = 0.25)
print(np.where(np.isinf(X_test_all_features)))
print(np.where(np.isinf(y_test_all_features)))
#X_train_all_features = encode_categorical_variables(X_train_all_features)
#X_test_all_features = encode_categorical_variables(X_test_all_features)

tt2 = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log, inverse_func=np.exp)
tt2.fit(X_train_all_features, y_train_all_features)
preds = tt2.predict(X_test_all_features)
score = tt2.score(X_test_all_features, y_test_all_features)

scores = get_scores(preds, np.log(y_test_all_features))

scores = get_scores(preds, y_test_all_features)
evaluation_df = evaluation_df.append({'Model': 'Linear Regression', 'Features': 'Manual Selection', 'Population': 'Subgroup - Patients w/Heart Failure', 'Target Variable Normalized': 'MinMaxScaler()', 'Encoding Type': 'OneHotEncoding', 'R2':scores[0], 'MSE':scores[1], 'MAE': scores[2]}, ignore_index=True)
evaluation_df.head(10)


Index(['Unnamed: 0', 'year', 'Hospital Service Area', 'Hospital County',
       'Operating Certificate Number', 'Permanent Facility Id',
       'Facility Name', 'Age Group', 'Zip Code - 3 digits', 'Gender', 'Race',
       'Ethnicity', 'Type of Admission', 'Patient Disposition',
       'Discharge Year', 'CCSR Diagnosis Code', 'CCSR Diagnosis Description',
       'CCSR Procedure Code', 'CCSR Procedure Description', 'APR DRG Code',
       'APR DRG Description', 'APR MDC Code', 'APR MDC Description',
       'APR Severity of Illness Code', 'APR Severity of Illness Description',
       'APR Risk of Mortality', 'APR Medical Surgical Description',
       'Payment Typology 1', 'Emergency Department Indicator'],
      dtype='object')


  return func(X, **(kw_args if kw_args else {}))
  return func(X, **(kw_args if kw_args else {}))


ValueError: Input contains infinity or a value too large for dtype('float64').