## Part 1: Get the Data
**The data used for this final project comes from:** https://data.cms.gov/provider-data/dataset/9n3s-kdb3

**The data description, provided by Data.cms.gov, is as follows:**<br/>
In October 2012, CMS began reducing Medicare payments for subsection(d) hospitals with excess readmissions under the Hospital Readmissions Reduction Program (HRRP). Excess readmissions are measured by a ratio, calculated by dividing a hospital's predicted rate of readmissions for heart attack (AMI), heart failure (HF), pneumonia, chronic obstructive pulmonary disease (COPD), hip/knee replacement (THA/TKA), and coronary artery bypass graft surgery (CABG) by the expected rate of readmissions, based on an average hospital with similar patients.
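
As a quick illustration of the ratio described above, the sketch below computes an excess readmission ratio for a single hospital and measure. The two rates are made-up numbers chosen for the example, not values from the CMS file, and the variable names are ours.

```python
# Hypothetical figures for one hospital and one measure (not from the CMS file)
predicted_rate = 17.2  # predicted 30-day readmission rate, in percent
expected_rate = 16.0   # expected rate for an average hospital with similar patients

# Excess readmission ratio = predicted rate / expected rate
excess_readmission_ratio = predicted_rate / expected_rate

# A ratio above 1 indicates more readmissions than expected
has_excess_readmissions = excess_readmission_ratio > 1

print(round(excess_readmission_ratio, 3), has_excess_readmissions)
```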


In [18]:
import pandas as pd

In [2]:
data = pd.read_csv('raw-data/FY_2024_Hospital_Readmissions_Reduction_Program_Hospital.csv')

## Part 2: Preprocessing the Data
* Handling missing/unhelpful values
    * `Start Date` and `End Date` hold the same values in every row, so both are dropped
* Encode categorical variables
    * `Facility Name`, `State`, `Measure Name`
* Scale numerical variables to the 0-1 range (min-max scaling)
    * `Number of Discharges`, `Excess Readmission Ratio`, `Predicted Readmission Rate`, `Expected Readmission Rate`, `Number of Readmissions`
* Define the target variable
* General data cleaning
    * Remove leading "READM-30-" and trailing "-HRRP" from `Measure Name`
    * Make a new column `Length of Stay` holding the number of days between `Start Date` and `End Date`

#### Part 2.1: General Data Cleaning

In [3]:
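# Clean the Measure Name column by stripping the "READM-30-" prefix and "-HRRP" suffix
# (regex=False treats both patterns as literal strings on every pandas version)
data['Measure Name'] = data['Measure Name'].str.replace('READM-30-', '', regex=False)
data['Measure Name'] = data['Measure Name'].str.replace('-HRRP', '', regex=False)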

# Determining the length of stay
data['Length of Stay'] = pd.to_datetime(data['End Date']) - pd.to_datetime(data['Start Date'])
data['Length of Stay'] = data['Length of Stay'].dt.days
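
The cleaning step above can also be written as a single anchored regular expression, which only strips the prefix at the start and the suffix at the end of each name. This is a minimal sketch on a toy Series; the measure names are illustrative values written for the example rather than read from the file.

```python
import pandas as pd

# Toy measure names in the "READM-30-<measure>-HRRP" pattern (illustrative values)
toy_names = pd.Series(['READM-30-AMI-HRRP', 'READM-30-HF-HRRP', 'READM-30-HIP-KNEE-HRRP'])

# ^ anchors the prefix to the start of the string, $ anchors the suffix to the end
cleaned_names = toy_names.str.replace(r'^READM-30-|-HRRP$', '', regex=True)

print(cleaned_names.tolist())
```

Anchoring the patterns guards against removing the same text if it ever appeared in the middle of a name.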

#### Part 2.2: Remove Unhelpful Data
* `Start Date` and `End Date` hold the same values in every row, so they carry no information and both are dropped
* `Facility Name` is already captured numerically by `Facility ID`, so it is dropped as well

In [4]:
# Drop the `Start Date` and `End Date` and 'Facility Name' columns
data = data.drop(columns=['Start Date', 'End Date', 'Facility Name'])

#### Part 2.3: Encode Categorical Variables

In [5]:
# One-hot encode the remaining categorical columns (State, Measure Name); Facility Name was dropped above
data = pd.get_dummies(data, columns=['State', 'Measure Name'])
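
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame. The rows are invented for the example; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with the same categorical columns as the dataset (rows are invented)
toy = pd.DataFrame({
    'Facility ID': [1, 2, 3],
    'State': ['AL', 'AK', 'AL'],
    'Measure Name': ['AMI', 'HF', 'AMI'],
})

# Each listed column is replaced by one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=['State', 'Measure Name'])

print(encoded.columns.tolist())
```

Each category becomes its own indicator column named `<column>_<value>`, so the number of columns grows with the number of distinct states and measures.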

#### Part 2.4: Standardize Numerical Variables

In [6]:
# Min-max scale the numerical columns to the 0-1 range
numeric_columns = [
    'Number of Discharges',
    'Excess Readmission Ratio',
    'Predicted Readmission Rate',
    'Expected Readmission Rate',
    'Number of Readmissions',
]

for column in numeric_columns:
    # Convert any non-numeric values to NaN so the arithmetic below works
    data[column] = pd.to_numeric(data[column], errors='coerce')
    # Min-max scaling: (x - min) / (max - min); NaN values stay NaN
    data[column] = (data[column] - data[column].min()) / (data[column].max() - data[column].min())
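
The scaling used in this step can be checked on a tiny example. The helper name `min_max_scale` and the placeholder string in the toy Series are assumptions made for this sketch, not part of the notebook's code or a confirmed value from the file.

```python
import pandas as pd

def min_max_scale(series):
    """Coerce a Series to numeric and rescale it to the 0-1 range."""
    numeric = pd.to_numeric(series, errors='coerce')  # non-numeric entries become NaN
    return (numeric - numeric.min()) / (numeric.max() - numeric.min())

# Toy column mixing numeric strings with a non-numeric placeholder (hypothetical value)
toy_column = pd.Series(['10', '20', 'Too Few to Report', '30'])
scaled = min_max_scale(toy_column)

print(scaled.tolist())
```

The smallest value maps to 0, the largest to 1, and the entry that could not be parsed stays NaN, which is why missing values still need handling later.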

#### Part 2.5: Define the Target Variable

In [7]:
# Use the scaled `Predicted Readmission Rate` as the target variable and rename it to `output`
data = data.rename(columns={'Predicted Readmission Rate': 'output'})

#### Part 2.6: Determine What to Do with NaN Values

In [21]:
# For each column that contains NaN values, record the percentage of rows that are NOT NaN
hit_list = []
for column_name in list(data.columns):
    data_without_na = data.dropna(subset=[column_name])
    percentage = len(data_without_na) / len(data) * 100
    if percentage != 100:
        hit_list.append((column_name, percentage))

hit_list

[('Number of Discharges', 43.11281559603707),
 ('Footnote', 35.671673591136674),
 ('Excess Readmission Ratio', 64.32832640886332),
 ('output', 64.32832640886332),
 ('Expected Readmission Rate', 64.32832640886332),
 ('Number of Readmissions', 42.026206455736656)]
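
The same per-column percentages can be computed without an explicit loop. The sketch below shows the idea on a toy frame with invented values, assuming the goal is the share of non-missing rows per column as in the output above.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (values are invented)
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [1, 2, 3, 4],
    'c': [np.nan, np.nan, 1.0, 2.0],
})

# notna() gives a boolean frame; its column-wise mean is the share of non-missing rows
non_missing_pct = toy.notna().mean() * 100

# Keep only the columns that have at least one missing value
incomplete = non_missing_pct[non_missing_pct < 100]

print(incomplete.to_dict())
```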

#### Part 2.7: Save the Cleaned Data for scikit-learn Models

In [23]:
# Save the cleaned data so the model-training cells can reload it
data.to_csv('cleaned_data.csv', index=False)

## Part 3: Random Forest Predictions

In [24]:
# Import the random forest model, the train/test splitter, and the MSE metric from sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


In [33]:
data = pd.read_csv('cleaned_data.csv')

# Replace NaN values in the `Number of Discharges`, `Footnote`, `Excess Readmission Ratio`, `Expected Readmission Rate`, and `Number of Readmissions` columns with 0
data['Number of Discharges'] = data['Number of Discharges'].fillna(0)
data['Footnote'] = data['Footnote'].fillna(0)
data['Excess Readmission Ratio'] = data['Excess Readmission Ratio'].fillna(0)
data['Expected Readmission Rate'] = data['Expected Readmission Rate'].fillna(0)
data['Number of Readmissions'] = data['Number of Readmissions'].fillna(0)

# Drop rows where the target `output` is NaN
data = data.dropna(subset=['output'])

In [34]:
# Instantiate a Random Forest Model
model = RandomForestRegressor(n_estimators=100, max_depth=10)

# Split the data into training and testing sets
X = data.drop(columns=['output'])
y = data['output']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the model
model.fit(X_train, y_train)

In [35]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(mse)

2.58725979705879e-05
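
Because `output` was min-max scaled, this MSE is expressed in scaled units. A rough way to read it in the original units is to take the square root and multiply by the original range of the target. In the sketch below, `target_min` and `target_max` are hypothetical placeholders; the real values would have to be saved before scaling.

```python
import math

# MSE reported above, in min-max scaled units
mse_scaled = 2.58725979705879e-05

# RMSE is in the same (scaled) units as the target
rmse_scaled = math.sqrt(mse_scaled)

# Hypothetical original range of `Predicted Readmission Rate` (assumed, not from the data)
target_min, target_max = 5.0, 30.0

# Undo the scaling: an error of e in scaled units is e * (max - min) in original units
rmse_original = rmse_scaled * (target_max - target_min)

print(rmse_scaled, rmse_original)
```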
