Predicting 30 Days Hospital Readmission

Project Overview

Hospital readmissions within 30 days pose substantial challenges for healthcare systems, both in clinical quality and in financial cost. This project develops a machine-learning framework to predict early readmissions using the Diabetes 130-US Hospitals dataset from the UCI Machine Learning Repository.

The workflow includes data cleaning, feature engineering, ordinal and one-hot encoding, SMOTE oversampling to address extreme class imbalance, and training/evaluating five models: Logistic Regression, Random Forest, CatBoost, Artificial Neural Network (ANN), and Naive Bayes.

Results show that while accuracy remains high for most models, recall, which is crucial for clinical risk prediction, varies widely. CatBoost consistently achieves the strongest recall and AUC, making it the most effective model for detecting true readmissions. The project emphasizes the importance of evaluating models beyond accuracy when working with highly imbalanced healthcare datasets.
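To illustrate why accuracy alone can mislead on imbalanced data, here is a minimal sketch in plain Python. The labels are made up for illustration (10 readmissions among 100 encounters) and are not results from this project; the helper functions `accuracy` and `recall` are hypothetical and not taken from the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that the model detected."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 1 = readmitted within 30 days, 0 = not readmitted.
y_true = [1] * 10 + [0] * 90

# A trivial "model" that never predicts a readmission.
always_negative = [0] * 100

acc = accuracy(y_true, always_negative)
rec = recall(y_true, always_negative)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

A classifier that never flags a readmission still scores 90% accuracy on these labels while detecting none of the positive cases, which is why recall and AUC are reported alongside accuracy.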

Dataset Source

This project uses the Diabetes 130-US Hospitals dataset from UCI: https://archive.ics.uci.edu/dataset/296/diabetes+130+us+hospitals+for+years+1999+2008

You must download the dataset (diabetic_data.csv) and upload it to your Google Drive before running the Colab notebook.

Repository Contents

├── Predicting_30_Days_Hospital_Readmission_ds633_projectcode.ipynb # Main Colab notebook for execution
├── README.md # Project documentation
└── diabetic_data.csv # Stored in your Google Drive, not in this repo

How to Run This Project in Google Colab

1. Open the Colab Notebook

Upload Predicting_30_Days_Hospital_Readmission_ds633_projectcode.ipynb to Google Colab and open it, or use the link provided in the report.

2. Upload the Dataset to Your Google Drive

Place the file diabetic_data.csv in any folder in your Drive (recommended: MyDrive/).

3. Update the File Path in the Notebook (If Needed)

Inside the notebook, you will find:

file_path = "/content/drive/MyDrive/diabetic_data.csv"

Change this path if your dataset is stored elsewhere in Drive.
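Because a wrong path is the most likely cause of a failed run, it can help to verify the path before loading. The following is a minimal sketch, not part of the notebook; the function name `check_dataset_path` is hypothetical, and it assumes Drive has already been mounted when you call it in Colab.

```python
from pathlib import Path


def check_dataset_path(file_path):
    """Return the path as a Path object, or raise if no file exists there."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path}. "
            "Check that Drive is mounted and that file_path matches "
            "the folder where diabetic_data.csv was uploaded."
        )
    return path
```

In the notebook you might call `check_dataset_path(file_path)` just before reading the CSV, so that a misplaced file produces a clear message early.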

4. Run with these steps

Start from the top of the notebook and execute each cell sequentially. Begin by mounting Google Drive:

from google.colab import drive
drive.mount('/content/drive')

After Drive is mounted, the remaining cells run in this order:

Load the dataset. If the df.head() output appears without errors, the dataset loaded correctly.
Perform data cleaning and feature engineering.
Split, encode, and apply SMOTE.
Train and evaluate all models.
Review performance tables and plots.
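For readers unfamiliar with SMOTE, the sketch below illustrates its core idea: each synthetic minority sample is an interpolation between an existing minority sample and one of its nearest minority neighbours. This is a simplified pure-Python illustration, not the notebook's code; the notebook presumably relies on a library implementation. The function name `smote_like`, the toy points, and the choice to oversample only the training split (a common practice that keeps the test set at its original class ratio) are all assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy training split (made-up feature values, two features per row).
train_majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                  [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
train_minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]

# Generate just enough synthetic rows to match the majority class size.
n_needed = len(train_majority) - len(train_minority)
new_points = smote_like(train_minority, n_new=n_needed)
balanced_minority = train_minority + new_points
print(len(train_majority), len(balanced_minority))
```

Every synthetic point lies on a line segment between two real minority samples, so the balanced set stays within the region the minority class already occupies.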

All results will appear directly in the Colab notebook.

Notes

Ensure your dataset remains accessible in Google Drive while running the notebook.

The notebook will not run unless the file path is correct.

CatBoost uses raw (non-SMOTE) data; all other models use the SMOTE-balanced data.

Author

Soumya Dayal — DS 633: Foundations of Data Science and Analytics
Rochester Institute of Technology
