Hospital readmissions within 30 days pose substantial challenges for healthcare systems, in terms of both clinical quality and financial cost. This project develops a machine-learning-based framework to predict early readmissions using the Diabetes 130-US Hospitals dataset from the UCI Machine Learning Repository.
The workflow includes data cleaning, feature engineering, ordinal and one-hot encoding, SMOTE oversampling to address extreme class imbalance, and training/evaluating five models: Logistic Regression, Random Forest, CatBoost, Artificial Neural Network (ANN), and Naive Bayes.
Results show that while accuracy remains high for most models, recall—which is crucial for clinical risk prediction—varies widely. CatBoost consistently achieves the strongest recall and AUC, making it the most effective model for detecting true readmissions. The project emphasizes the importance of evaluating models beyond accuracy when working with highly imbalanced healthcare datasets.
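To make the accuracy-versus-recall point concrete, here is a small arithmetic sketch. The confusion-matrix counts are hypothetical and are not results from this project; they only illustrate how a model can score high accuracy on imbalanced data while missing every readmission.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    """Share of true readmissions that the model actually flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical cohort: 1,000 encounters, 100 of them readmitted within 30 days.

# Model A predicts "no readmission" for every patient.
acc_a = accuracy(tp=0, fp=0, tn=900, fn=100)
rec_a = recall(tp=0, fn=100)

# Model B flags 60 of the 100 true readmissions, with 150 false alarms.
acc_b = accuracy(tp=60, fp=150, tn=750, fn=40)
rec_b = recall(tp=60, fn=40)

print(f"Model A: accuracy={acc_a:.2f}, recall={rec_a:.2f}")
print(f"Model B: accuracy={acc_b:.2f}, recall={rec_b:.2f}")
```

In this made-up case Model A wins on accuracy yet identifies no at-risk patients, which is why recall and AUC are reported alongside accuracy.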
This project uses the Diabetes 130-US Hospitals dataset from UCI: https://archive.ics.uci.edu/dataset/296/diabetes+130+us+hospitals+for+years+1999+2008
Download the dataset (diabetic_data.csv) and upload it to your Google Drive before running the Colab notebook.
├── Predicting_30_Days_Hospital_Readmission_ds633_projectcode.ipynb # Main Colab notebook for execution
├── README.md # Project documentation
└── diabetic_data.csv # Stored in your Google Drive, not in this repo
Upload and open Predicting_30_Days_Hospital_Readmission_ds633_projectcode.ipynb in Google Colab (or use the link provided in the report).
Place the file diabetic_data.csv in any folder in your Drive (recommended: MyDrive/).
Inside the notebook, you will find:
file_path = "/content/drive/MyDrive/diabetic_data.csv"
Change this path if your dataset is stored elsewhere in Drive.
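Because a wrong path is the most likely failure, it may help to check it before loading. The following is a minimal sketch; `check_dataset_path` is a hypothetical helper and is not part of the notebook.

```python
from pathlib import Path


def check_dataset_path(file_path: str) -> Path:
    """Return the path if it points to an existing file, else raise a clear error."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path}. Mount Drive and update file_path."
        )
    return path


# Same default as in the notebook; outside Colab this prints the error message.
file_path = "/content/drive/MyDrive/diabetic_data.csv"
try:
    check_dataset_path(file_path)
    print("Dataset found.")
except FileNotFoundError as err:
    print(err)
```

Run a check like this after mounting Drive, since the /content/drive folder does not exist before the mount.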
Start from the top of the notebook and execute each cell sequentially:
from google.colab import drive
drive.mount('/content/drive')
Load the dataset
If the df.head() output appears without errors, the dataset loaded correctly.
Perform data cleaning and feature engineering
Split, encode, and apply SMOTE
Train and evaluate all models
Review performance tables and plots
All results will appear directly in the Colab notebook.
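The loading and encoding steps above can be sketched on a tiny stand-in table. The three rows are hypothetical, and the choices shown (treating "<30" as the positive class, ordinal-encoding age, one-hot-encoding race) are assumptions about the notebook rather than a copy of its code.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for diabetic_data.csv, so the sketch runs
# without Drive access. Column names follow the UCI file.
sample_csv = io.StringIO(
    "encounter_id,race,age,time_in_hospital,readmitted\n"
    "1,Caucasian,[50-60),3,NO\n"
    "2,?,[70-80),7,<30\n"
    "3,AfricanAmerican,[60-70),2,>30\n"
)

# In the notebook this would read file_path instead of sample_csv.
# The UCI file marks missing values with "?", so map them to NaN on load.
df = pd.read_csv(sample_csv, na_values="?")

# Assumed target definition: 1 = readmitted within 30 days, 0 = otherwise.
df["readmitted_30d"] = (df["readmitted"] == "<30").astype(int)

# Ordinal encoding (assumed for age): ten-year brackets mapped to 0..9.
age_order = [f"[{lo}-{lo + 10})" for lo in range(0, 100, 10)]
df["age_ord"] = df["age"].map({label: i for i, label in enumerate(age_order)})

# One-hot encoding (assumed for race); rows with missing race get all zeros.
encoded = pd.get_dummies(df, columns=["race"], dtype=int)

print(encoded.drop(columns=["readmitted", "age"]))
```

Ordinal codes preserve the natural order of the age brackets, while one-hot columns avoid implying an order among unordered categories such as race.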
Ensure your dataset remains accessible in Google Drive while running the notebook.
The notebook will not run unless the file path is correct.
CatBoost uses raw (non-SMOTE) data; all other models use the SMOTE-balanced data.
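One way to picture this note is to keep two versions of the training data and leave the test set untouched. The sketch below uses synthetic data and a simplified SMOTE-style interpolation written in NumPy so that it runs without extra packages; the notebook presumably relies on a library implementation instead, and applying the oversampling after the split is an assumption. The function name `smote_like_oversample` is hypothetical.

```python
import numpy as np


def smote_like_oversample(X, y, minority_label=1, k=3, seed=0):
    """Simplified SMOTE-style oversampling for a binary target.

    Creates synthetic minority rows by interpolating between a minority
    sample and one of its k nearest minority neighbours until both
    classes have the same number of rows.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X.copy(), y.copy()
    # Pairwise distances among minority samples only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_needed)
    picked = neighbours[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))  # position along the connecting segment
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal


# Synthetic, imbalanced data: 20 positives and 180 negatives, 4 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.array([1] * 20 + [0] * 180)

# Hold out a stratified test set first (4 positives, 36 negatives).
pos_idx, neg_idx = np.where(y == 1)[0], np.where(y == 0)[0]
test_idx = np.concatenate([pos_idx[:4], neg_idx[:36]])
train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
X_test, y_test = X[test_idx], y[test_idx]

# Raw training data: what CatBoost would be fitted on.
X_train_raw, y_train_raw = X[train_idx], y[train_idx]

# Balanced training data: what the other models would be fitted on.
X_train_bal, y_train_bal = smote_like_oversample(X_train_raw, y_train_raw)

print("raw train class counts:     ", np.bincount(y_train_raw))
print("balanced train class counts:", np.bincount(y_train_bal))
print("test class counts:          ", np.bincount(y_test))
```

Every model is scored on the same untouched test set, so recall and AUC stay comparable even though the training data differ.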
Soumya Dayal — DS 633: Foundations of Data Science and Analytics
Rochester Institute of Technology