<a href="https://colab.research.google.com/github/rfaraz/shiftsc/blob/main/final/shiftsc_healthfinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏥 Health Track: Predicting Hospital Resource Utilization
You are a hospital manager trying to determine how you can allocate resources in your hospital to admitted patients. To solve this task, you decided that you will write a model that can predict the Admission Type of each patient (Urgent, Emergency, Elective, etc.) to determine how urgently the patient needs to be treated.

**But you need to watch out!** The dataset includes insurance information that the model may use to make its final prediction, and you do not want to discriminate against patients who come from different financial backgrounds. You must also make sure that patients' identities are not revealed. Finally, you should make sure your algorithm is safe from cyber attacks, as a compromised model would put the lives of the patients at severe risk.

So, let's get started!

---
# 📊 Your Records

To start you off, we have provided the patient records. To learn more about the complete data set, feel free to refer to this link: [Healthcare Dataset](https://www.kaggle.com/datasets/prasad22/healthcare-dataset)

The dataset has the following variables:
* Name (Categorical)
* Age (Numerical)
* Gender (Categorical)
* Blood Type (Categorical)
* Medical Condition (Categorical)
* Date of Admission (Date, stored as text)
* Doctor (Categorical)
* Hospital (Categorical)
* Insurance Provider (Categorical)
* Billing Amount (Numerical)
* Room Number (Numerical)
* Discharge Date (Date, stored as text)
* Medication (Categorical)
* Test Results (Categorical)
* **Target Variable:** Admission Type (Categorical)

It is up to you to determine which variables are needed to make an accurate prediction. As a hint, you can most likely remove input variables such as Room Number that do not indicate anything about the patient's situation. Remember, the goal is to build a strong model, so the way that you manipulate the data is up to you!



In [1]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
url = 'https://raw.githubusercontent.com/rfaraz/shiftsc/main/healthcare_dataset.csv'
df = pd.read_csv(url)  # Reads the dataset from the defined URL
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


# 🔎 Exploring the Data
Before you begin manipulating the data, here are some commands that will help you better understand it. Specifically, it helps to know how many distinct categories exist within each categorical variable. This will help you decide how many distinct age groups to use, as well as which method would be ideal for encoding each categorical input.
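
As a rough sketch of the category counting described above, the snippet below reports the number of distinct values in each non-numeric column. The `toy` frame and the `category_counts` helper are made up for illustration; on the real records you would pass `df` instead.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real patient records
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Hospital": ["Kim Inc", "Cook PLC", "White-White", "Sons and Miller"],
    "Age": [30, 62, 76, 28],
})

def category_counts(frame):
    """Return the number of distinct values in each non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number")
    return text_cols.nunique().sort_values(ascending=False)

counts = category_counts(toy)
print(counts)
```

Columns with only a handful of distinct values are usually easy to one-hot encode, while columns with thousands of values (names, hospitals, doctors) likely need grouping or dropping.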

In [None]:
# Inspect missing values and duplicates
df.info()
df.isnull().sum()

# Explore the number of categories within categorical variables (Variables that cannot be quantified by a number)

# Create some sort of visualization to tell you more about the data
# Use matplotlib to help you out

# 🧼 Cleaning the Data
Below, we have included some commands you will need to prepare the data. We have already covered the code you need to drop null (missing) values and to standardize all numerical values so they use the same scale. There are at least two more crucial steps you need to take to prepare your dataset:

* Encode categorical variables
* Drop unnecessary columns

**Encoding:** This step is a bit complicated, as some columns have far too many categories to be encoded effectively. There are ways to work around this, such as combining rare values into one category. This needs to be done carefully, however, as it will directly affect the accuracy of the final prediction. Use your exploration to guide your decisions.
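
One possible way to combine rare values is sketched below. The `group_rare` helper, the `min_count` threshold, and the toy `hospitals` column are all assumptions made for illustration, not part of the provided notebook; the right threshold for the real data depends on what your exploration shows.

```python
import pandas as pd

def group_rare(series, min_count=2, other_label="Other"):
    """Replace values seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent values as they are, swap rare ones for the shared label
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy column; the real Hospital column has far more categories
hospitals = pd.Series(["Kim Inc", "Kim Inc", "Cook PLC", "White-White", "Kim Inc"])
grouped = group_rare(hospitals, min_count=2)

# One-hot encode the reduced column
encoded = pd.get_dummies(grouped, prefix="Hospital")
print(encoded.columns.tolist())
```

Grouping before `pd.get_dummies()` keeps the number of new columns small, at the cost of losing whatever signal the rare values carried.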

As you continue working on improving the model, you may find the need to do some other data preprocessing, so feel free to add any other commands you feel are necessary.

In [2]:
# Drop or fill missing values
df = df.dropna()  # Or fill with mean/median as needed

# Normalize or scale numerical columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Encode categorical variables (Change strings to numeric data)
# Use pd.get_dummies() or LabelEncoder/OneHotEncoder


# Drop any values that are unnecessary to the final model training

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                55500 non-null  object 
 1   Age                 55500 non-null  int64  
 2   Gender              55500 non-null  object 
 3   Blood Type          55500 non-null  object 
 4   Medical Condition   55500 non-null  object 
 5   Date of Admission   55500 non-null  object 
 6   Doctor              55500 non-null  object 
 7   Hospital            55500 non-null  object 
 8   Insurance Provider  55500 non-null  object 
 9   Billing Amount      55500 non-null  float64
 10  Room Number         55500 non-null  int64  
 11  Admission Type      55500 non-null  object 
 12  Discharge Date      55500 non-null  object 
 13  Medication          55500 non-null  object 
 14  Test Results        55500 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 6.4

In [3]:
# Remove direct identifiers like name (Uncomment if you haven't done this step earlier)
# df = df.drop(columns=['Name'], errors='ignore')

# Generalize features (e.g., age → age group)
# Note: bin the raw ages, so run this before the StandardScaler step above
# df['age_group'] = pd.cut(df['Age'], bins=[0, 18, 35, 60, 100], labels=['0-18', '19-35', '36-60', '61+'])
# df.drop(columns='Age', inplace=True)

# Add noise to sensitive columns
# def add_noise(col, epsilon=0.1):
#     return col + np.random.normal(0, epsilon, size=len(col))
# df['Billing Amount'] = add_noise(df['Billing Amount'])

X = df.drop(columns=['Admission Type'])
y = df['Admission Type']

In [4]:
df.head()

Unnamed: 0,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,-1.098824,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,-0.470261,0.23312,Urgent,2024-02-02,Paracetamol,Normal
1,0.533639,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,0.57025,-0.313556,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,1.247842,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,0.16999,-0.834199,Emergency,2022-10-07,Aspirin,Normal
3,-1.200853,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,0.870465,1.291761,Elective,2020-12-18,Ibuprofen,Abnormal
4,-0.435636,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,-0.795211,1.36118,Urgent,2022-10-09,Penicillin,Abnormal


In [1]:
# Train/test split (X must contain only numeric columns, so encode first)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # fixed seed for reproducibility
)

# Train model (Random Forest)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate
from sklearn.metrics import classification_report, confusion_matrix
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Write a neural network
from sklearn.neural_network import MLPClassifier  # Here is one example

# ⚖️ Fairness Metrics
Now, run some calculations to evaluate the fairness of your algorithm. One common approach is to measure accuracy separately for each value of a sensitive variable. For example, to evaluate gender bias, you would compare accuracy for Male patients with accuracy for Female patients. In our case, the sensitive variables are those associated with a patient's financial situation, such as Insurance Provider and Billing Amount.

We have already provided one approach. We also recommend running your own fairness evaluations, such as the ones we went over in previous modules. If you find bias in the system, go back to the model training step and implement the approaches you learned in Module 2: Bias and Fairness.
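
As one possible way to compute fairness metrics by hand, the sketch below reports per-group accuracy and the gap between groups in how often a given label is predicted. The helper names, the labels, and the insurer groups are made up for illustration; with the real model you would pass `y_test`, `y_pred`, and the insurance column for the test rows.

```python
import numpy as np
import pandas as pd

def group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each value in `groups`."""
    frame = pd.DataFrame({
        "correct": np.asarray(y_true) == np.asarray(y_pred),
        "group": np.asarray(groups),
    })
    return frame.groupby("group")["correct"].mean()

def selection_rate_gap(y_pred, groups, positive_label):
    """Largest difference between groups in how often `positive_label` is predicted."""
    frame = pd.DataFrame({
        "selected": np.asarray(y_pred) == positive_label,
        "group": np.asarray(groups),
    })
    rates = frame.groupby("group")["selected"].mean()
    return rates.max() - rates.min()

# Made-up labels and insurance groups, for illustration only
y_true = ["Urgent", "Elective", "Urgent", "Elective",
          "Urgent", "Elective", "Urgent", "Elective"]
y_pred = ["Urgent", "Elective", "Urgent", "Elective",
          "Urgent", "Urgent", "Urgent", "Elective"]
insurer = ["Aetna"] * 4 + ["Medicare"] * 4

acc = group_accuracy(y_true, y_pred, insurer)
gap = selection_rate_gap(y_pred, insurer, positive_label="Urgent")
print(acc)
print(gap)
```

A large accuracy difference or selection-rate gap between groups is a signal to investigate further, not proof of bias on its own, since group sizes and underlying label rates also matter.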

In [None]:
# Split predictions by group and compare accuracy or other metrics based on billing amount/insurance

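# Example pseudocode (assumes df still holds the original 'Gender' column and X_test keeps df's index):
# groups = df.loc[X_test.index, 'Gender']
# for group in groups.unique():
#     idx = (groups == group).to_numpy()
#     print(group)
#     print(classification_report(y_test[idx], y_pred[idx]))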

# You can also compute fairness metrics manually

# 🛟 Safety Measurements
Similar to the fairness metrics, go through and test out the safety of your model. Is it prone to adversarial attacks?

To test the robustness of your system, one common approach is adding noise to the test data. If the model's performance on the noisy data stays close to its performance on the clean data, the system is robust to that kind of perturbation.

We have already provided the first approach. Now, go ahead and add one more testing method. You can refer back to Module 3: Safety and Robustness for some more formal mathematical definitions. If you find that the system is failing these robustness tests, go back to your model training step and add measures such as randomized smoothing or noisier training data to improve the robustness of the model.
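
One way to extend the noise test is to sweep over several noise levels and watch how quickly accuracy falls. The sketch below is only an illustration: `accuracy_under_noise` and `toy_predict` are hypothetical names, and the toy rule stands in for a trained model. With a fitted scikit-learn model you could pass `model.predict` and your numeric test features instead.

```python
import numpy as np

def accuracy_under_noise(predict, X, y, sigmas, seed=0):
    """Accuracy of `predict` on X after adding Gaussian noise at each scale in `sigmas`."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    results = {}
    for sigma in sigmas:
        X_noisy = X + rng.normal(0.0, sigma, size=X.shape)
        results[sigma] = float(np.mean(predict(X_noisy) == y))
    return results

# Hypothetical stand-in for a trained model: predicts 1 when the first feature is positive
def toy_predict(X):
    return (X[:, 0] > 0).astype(int)

X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([0, 0, 1, 1])

curve = accuracy_under_noise(toy_predict, X_demo, y_demo, sigmas=[0.0, 0.1, 5.0])
print(curve)
```

A model whose accuracy stays flat for small noise scales and degrades gradually is generally more robust than one whose accuracy drops sharply at the first nonzero scale.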

In [None]:
# Add noise to test data and compare performance
# def add_random_noise(X, epsilon=0.05):
#     return X + np.random.normal(0, epsilon, size=X.shape)

# X_test_noisy = add_random_noise(X_test)
# y_pred_noisy = model.predict(X_test_noisy)
# print(classification_report(y_test, y_pred_noisy))