# 🩺 Predicting Hospital Readmission for Diabetic Patients

## 📘 Project Description
This project uses a dataset of over 100,000 records from diabetic patients in 130 U.S. hospitals (1999–2008) to predict whether a patient will be readmitted within 30 days of discharge. Hospital readmissions are costly and may indicate poor quality of care. By identifying key risk factors and predicting readmissions, we can help healthcare providers improve outcomes and reduce costs.

Steps:

1. Data Cleaning
2. Exploratory Data Analysis (EDA)
3. Model Building (Logistic Regression, Random Forest, XGBoost)
4. Evaluation & Interpretability
5. Recommendations

## 🔄 Data Preprocessing

This project leverages the "Diabetes 130-US hospitals for years 1999–2008" dataset to explore the patterns and factors associated with hospital readmission in diabetic patients. The primary objective is to build a predictive model that estimates whether a patient is likely to be readmitted within 30 days of discharge, a critical issue in healthcare management because of its implications for patient outcomes and hospital costs.

The dataset includes over 100,000 medical records from diabetic patients across 130 hospitals in the United States, spanning a period of 10 years. It contains demographic information, diagnostic details, treatment regimens, lab results, and hospitalization histories.

### 🎯 Project Objectives

- Understand and preprocess the dataset to handle missing values, anomalies, and categorical variables.
- Perform exploratory data analysis (EDA) to uncover significant patterns and correlations.
- Build and evaluate classification models (e.g., logistic regression, random forest, XGBoost) to predict 30-day readmission.
- Identify key factors contributing to high readmission risk.
- Provide actionable insights to healthcare providers for improving discharge planning and patient monitoring.

### 📦 Deliverables

- Cleaned and well-documented dataset.
- EDA visualizations and statistical summaries.
- Machine learning models with evaluation metrics.
- A final report or dashboard summarizing findings and recommendations.

In [3]:
#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, roc_curve
from imblearn.over_sampling import SMOTE

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Set pandas display options for better viewing
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.max_rows', 100)     # Show more rows if needed

In [4]:
# File paths
data_path = r'C:\Users\Sam_Ke\Downloads\diabetic_data.csv'       # Main data
ids_mapping_path = r'C:\Users\Sam_Ke\Downloads\IDs_mapping.csv'  # If you have a separate ID mapping file

# --- Step 1: Load the data ---
print(f"Loading main data from: {data_path}")
try:
    df = pd.read_csv(data_path)
    print("Main data loaded successfully!")

    # Try to load ID mappings (optional)
    try:
        id_map = pd.read_csv(ids_mapping_path)
        print("ID mapping file loaded successfully!")
    except FileNotFoundError:
        print("ID mapping file not found. Continuing without it.")
        id_map = None

except FileNotFoundError:
    print(f"Error: File not found at {data_path}")
    print("Please ensure the file path is correct.")

Loading main data from: C:\Users\Sam_Ke\Downloads\diabetic_data.csv
Main data loaded successfully!
ID mapping file loaded successfully!


In [5]:
#Explore Data
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,?,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,?,?,1,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,?,?,59,0,18,0,0,0,276.0,250.01,255,9,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,?,?,11,5,13,2,0,1,648.0,250,V27,6,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,?,?,44,1,16,0,0,0,8.0,250.43,403,7,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,?,?,51,0,8,0,0,0,197.0,157,250,5,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_
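
The next cell drops `weight`, `max_glu_serum`, and `A1Cresult` for high missingness, but the audit behind that decision is not shown in this notebook. A minimal sketch of such an audit, run here on a toy frame rather than the real data, might look like the following. The helper name `placeholder_share` and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy frame standing in for the raw CSV (values are made up for illustration).
# '?' marks a missing value, and None stands in for cells pandas parsed as NaN.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "A1Cresult": [None, ">7", None, None],
    "time_in_hospital": [1, 3, 2, 2],
})

def placeholder_share(frame, placeholder="?"):
    """Hypothetical helper: fraction of rows per column that are NaN or equal to `placeholder`."""
    missing = frame.eq(placeholder) | frame.isna()
    return missing.mean().sort_values(ascending=False)

missing_share = placeholder_share(toy)
print(missing_share)
```

Columns whose share exceeds a chosen cutoff become candidates for dropping, while the rest can be filled with an explicit category. The cutoff itself is a judgment call.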

In [7]:
# Replace all '?' values with NaN
df.replace('?', np.nan, inplace=True)

print("Replaced '?' with NaN in the DataFrame.")

# Now, drop columns with high missingness as decided
columns_to_drop = ['weight', 'max_glu_serum', 'A1Cresult']
df.drop(columns=columns_to_drop, inplace=True)

print(f"Dropped columns: {columns_to_drop}")
print("DataFrame shape after dropping columns:", df.shape)

# For the remaining columns that had '?', replace NaN with specific categories
# Based on our earlier analysis, these were 'race', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3'
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection:
# that chained pattern raises a FutureWarning and stops working in pandas 3.0.
df['race'] = df['race'].fillna('Unknown')
df['payer_code'] = df['payer_code'].fillna('Unknown')
df['medical_specialty'] = df['medical_specialty'].fillna('Unknown')
df['diag_1'] = df['diag_1'].fillna('Missing Diagnosis')
df['diag_2'] = df['diag_2'].fillna('Missing Diagnosis')
df['diag_3'] = df['diag_3'].fillna('Missing Diagnosis')


print("Handled missing values in remaining columns.")

# Display info to check non-null counts after handling missing values
print("\nDataFrame Info after handling initial missing values:")
df.info()

Replaced '?' with NaN in the DataFrame.
Dropped columns: ['weight', 'max_glu_serum', 'A1Cresult']
DataFrame shape after dropping columns: (101766, 47)
Handled missing values in remaining columns.

DataFrame Info after handling initial missing values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 47 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   admission_type_id         101766 non-null  int64 
 6   discharge_disposition_id  101766 non-null  int64 
 7   admission_source_id       101766 non-null  int64 
 8   time_in_hospital          101766 non-null  int64 
 9   payer_code                10

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['race'].fillna('Unknown', inplace=True)

In [8]:
# Drop identifier columns
df = df.drop(columns=['encounter_id', 'patient_nbr'])

print("Dropped identifier columns.")
print("DataFrame shape after dropping identifiers:", df.shape)

Dropped identifier columns.
DataFrame shape after dropping identifiers: (101766, 45)


In [9]:
df.shape

(101766, 45)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 45 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   race                      101766 non-null  object
 1   gender                    101766 non-null  object
 2   age                       101766 non-null  object
 3   admission_type_id         101766 non-null  int64 
 4   discharge_disposition_id  101766 non-null  int64 
 5   admission_source_id       101766 non-null  int64 
 6   time_in_hospital          101766 non-null  int64 
 7   payer_code                101766 non-null  object
 8   medical_specialty         101766 non-null  object
 9   num_lab_procedures        101766 non-null  int64 
 10  num_procedures            101766 non-null  int64 
 11  num_medications           101766 non-null  int64 
 12  number_outpatient         101766 non-null  int64 
 13  number_emergency          101766 non-null  int64 
 14  numb

In [11]:
df['glimepiride'].nunique()

4

In [12]:
df['glimepiride'].value_counts()

glimepiride
No        96575
Steady     4670
Up          327
Down        194
Name: count, dtype: int64

In [13]:
df['glimepiride-pioglitazone'].value_counts()

glimepiride-pioglitazone
No        101765
Steady         1
Name: count, dtype: int64

In [14]:
df['metformin-pioglitazone'].value_counts()

metformin-pioglitazone
No        101765
Steady         1
Name: count, dtype: int64

In [15]:
# Define a threshold for rare categories (e.g., less than 1% of the total number of records)
threshold = len(df) * 0.01 # 1%

# Columns to check for rare categories
cols_to_group_rare = ['medical_specialty', 'payer_code', 'discharge_disposition_id']

for col in cols_to_group_rare:
    # Get value counts for the column
    value_counts = df[col].value_counts()
    # Identify rare categories
    rare_categories = value_counts[value_counts < threshold].index
    # Replace rare categories with 'Other'
    df[col] = df[col].replace(rare_categories, 'Other')

    print(f"\nValue counts for '{col}' after grouping rare categories (showing top 10):")
    print(df[col].value_counts().head(10).to_markdown()) # Displaying top 10
    print(f"Number of unique values in '{col}' after grouping:", df[col].nunique())


Value counts for 'medical_specialty' after grouping rare categories (showing top 10):
| medical_specialty          |   count |
|:---------------------------|--------:|
| Unknown                    |   49949 |
| InternalMedicine           |   14635 |
| Other                      |    8340 |
| Emergency/Trauma           |    7565 |
| Family/GeneralPractice     |    7440 |
| Cardiology                 |    5352 |
| Surgery-General            |    3099 |
| Nephrology                 |    1613 |
| Orthopedics                |    1400 |
| Orthopedics-Reconstructive |    1233 |
Number of unique values in 'medical_specialty' after grouping: 11

Value counts for 'payer_code' after grouping rare categories (showing top 10):
| payer_code   |   count |
|:-------------|--------:|
| Unknown      |   40256 |
| MC           |   32439 |
| HM           |    6274 |
| SP           |    5007 |
| BC           |    4655 |
| MD           |    3532 |
| CP           |    2533 |
| UN           |    2448 |
| CM 
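
The rare-category grouping above can be factored into a small reusable function. The sketch below is one possible form, assuming a share-based cutoff. The name `group_rare`, its parameters, and the toy payer codes are hypothetical and chosen only for illustration.

```python
import pandas as pd

def group_rare(series, min_share=0.01, other_label="Other"):
    """Hypothetical helper: relabel categories seen in fewer than `min_share` of rows."""
    counts = series.value_counts()
    rare = counts[counts < min_share * len(series)].index
    # Keep values that are not rare; replace the rest with the catch-all label
    return series.where(~series.isin(rare), other_label)

# Toy example: 'XX' appears in 3 of 100 rows, below a 5% cutoff
toy_payer = pd.Series(["MC"] * 60 + ["HM"] * 37 + ["XX"] * 3)
grouped = group_rare(toy_payer, min_share=0.05)
print(grouped.value_counts())
```

Returning a new Series rather than modifying the input keeps the function safe to apply column by column, for example `df[col] = group_rare(df[col])`.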

In [16]:
df['diag_1'].value_counts()

diag_1
428    6862
414    6581
786    4016
410    3614
486    3508
       ... 
373       1
314       1
684       1
217       1
V51       1
Name: count, Length: 717, dtype: int64

In [17]:
df['diag_2'].value_counts()

diag_2
276     6752
428     6662
250     6071
427     5036
401     3736
        ... 
E918       1
46         1
V13        1
E850       1
927        1
Name: count, Length: 749, dtype: int64

In [18]:
df[['diag_1', 'diag_2', 'diag_3' ]].nunique()

diag_1    717
diag_2    749
diag_3    790
dtype: int64

In [19]:
# Columns for diagnosis codes
diag_cols = ['diag_1', 'diag_2', 'diag_3']

# Define a threshold for rare individual diagnosis codes (e.g., less than 50 occurrences)
# You can adjust this threshold
diag_threshold = 50

for col in diag_cols:
    # Get value counts for the column
    value_counts = df[col].value_counts()
    # Identify rare individual codes (excluding the 'Missing Diagnosis' category)
    rare_codes = value_counts[(value_counts < diag_threshold) & (value_counts.index != 'Missing Diagnosis')].index
    # Replace rare codes with 'Rare Diagnosis'
    df[col] = df[col].replace(rare_codes, 'Rare Diagnosis')

    print(f"\nValue counts for '{col}' after grouping rare individual codes (showing top 10):")
    print(df[col].value_counts().head(10).to_markdown()) # Displaying top 10
    print(f"Number of unique values in '{col}' after grouping:", df[col].nunique())


Value counts for 'diag_1' after grouping rare individual codes (showing top 10):
| diag_1         |   count |
|:---------------|--------:|
| 428            |    6862 |
| 414            |    6581 |
| Rare Diagnosis |    5563 |
| 786            |    4016 |
| 410            |    3614 |
| 486            |    3508 |
| 427            |    2766 |
| 491            |    2275 |
| 715            |    2151 |
| 682            |    2042 |
Number of unique values in 'diag_1' after grouping: 212

Value counts for 'diag_2' after grouping rare individual codes (showing top 10):
| diag_2         |   count |
|:---------------|--------:|
| 276            |    6752 |
| 428            |    6662 |
| 250            |    6071 |
| Rare Diagnosis |    5549 |
| 427            |    5036 |
| 401            |    3736 |
| 496            |    3305 |
| 599            |    3288 |
| 403            |    2823 |
| 414            |    2650 |
Number of unique values in 'diag_2' after grouping: 192

Value counts for 'diag_3' a
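
Even after grouping rare codes, each diagnosis column keeps roughly 200 levels. One alternative, not used in this notebook, is to collapse ICD-9 codes into broad chapters before encoding. The sketch below assumes the standard ICD-9 chapter ranges. The function name `icd9_group` and the choice of chapters to single out are illustrative assumptions.

```python
def icd9_group(code):
    """Hypothetical helper: map an ICD-9 code string to a broad clinical group."""
    if code is None or code in ("?", "Missing Diagnosis"):
        return "Missing"
    if code.startswith(("V", "E")):
        return "Other"  # supplementary V and E codes
    try:
        value = float(code)
    except ValueError:
        return "Other"  # e.g. the 'Rare Diagnosis' label created above
    if 250 <= value < 251:
        return "Diabetes"
    if 140 <= value < 240:
        return "Neoplasms"
    if 390 <= value < 460:
        return "Circulatory"
    if 460 <= value < 520:
        return "Respiratory"
    if 520 <= value < 580:
        return "Digestive"
    if 580 <= value < 630:
        return "Genitourinary"
    if 710 <= value < 740:
        return "Musculoskeletal"
    if 800 <= value < 1000:
        return "Injury"
    return "Other"

# Codes taken from the value counts and sample rows shown earlier
examples = {code: icd9_group(code) for code in ["428", "250.83", "486", "V27", "Missing Diagnosis"]}
print(examples)
```

Applied with `df[col].map(icd9_group)`, this would reduce each diagnosis column to about ten levels, at the cost of losing distinctions within a chapter.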

In [20]:
# List of all individual medication columns
medication_cols = [
    'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
    'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
    'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
    'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
    'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone'
]

# Create aggregate medication features
# Vectorized: isin() flags each cell, sum(axis=1) counts the flags per row
df['num_active_meds'] = df[medication_cols].isin(['Steady', 'Up', 'Down']).sum(axis=1)
df['num_med_changes'] = df[medication_cols].isin(['Up', 'Down']).sum(axis=1)

print("Created aggregate medication features: 'num_active_meds' and 'num_med_changes'.")


# --- Selective Keeping/Dropping of Individual Medication Columns ---

# Keep an individual medication column only if it has at least
# `med_threshold` active ('Steady', 'Up', or 'Down') records
med_threshold = 100

# Sort the medication columns into keep/drop lists based on the threshold
med_cols_to_keep = []
med_cols_to_drop = []

for col in medication_cols:
    # Count records where the medication is active (any status other than 'No')
    non_no_count = df[col].isin(['Steady', 'Up', 'Down']).sum()

    if non_no_count >= med_threshold:
        med_cols_to_keep.append(col)
    else:
        med_cols_to_drop.append(col)

# Drop the individual medication columns that do not meet the threshold
df.drop(med_cols_to_drop, axis=1, inplace=True)

print(f"\nIndividual medication columns kept (>= {med_threshold} active instances): {med_cols_to_keep}")
print(f"Individual medication columns dropped (< {med_threshold} active instances): {med_cols_to_drop}")
print("DataFrame shape after handling individual medication columns:", df.shape)

# Display value counts for the kept individual medication columns to verify
print("\nValue counts for kept individual medication columns:")
for col in med_cols_to_keep:
    print(f"\nValue counts for '{col}':")
    print(df[col].value_counts().to_markdown())

Created aggregate medication features: 'num_active_meds' and 'num_med_changes'.

Individual medication columns kept (>= 100 active instances): ['metformin', 'repaglinide', 'nateglinide', 'glimepiride', 'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'insulin', 'glyburide-metformin']
Individual medication columns dropped (< 100 active instances): ['chlorpropamide', 'acetohexamide', 'tolbutamide', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']
DataFrame shape after handling individual medication columns: (101766, 35)

Value counts for kept individual medication columns:

Value counts for 'metformin':
| metformin   |   count |
|:------------|--------:|
| No          |   81778 |
| Steady      |   18346 |
| Up          |    1067 |
| Down        |     575 |

Value counts for 'repaglinide':
| repaglinide   |   count |
|:--------------|--------:|
|

In [21]:
# Create 'total_prior_visits' feature by summing the three columns
df['total_prior_visits'] = df['number_outpatient'] + df['number_emergency'] + df['number_inpatient']

# Create binary features for having any prior visit of each type
df['has_outpatient_prior'] = (df['number_outpatient'] > 0).astype(int)
df['has_emergency_prior'] = (df['number_emergency'] > 0).astype(int)
df['has_inpatient_prior'] = (df['number_inpatient'] > 0).astype(int)

print("Created engineered prior visit features: 'total_prior_visits', 'has_outpatient_prior', 'has_emergency_prior', 'has_inpatient_prior'.")

# Optional: Display the first few rows with the new features
print("\nDataFrame head with new prior visit features:")
print(df[['number_outpatient', 'number_emergency', 'number_inpatient',
          'total_prior_visits', 'has_outpatient_prior', 'has_emergency_prior',
          'has_inpatient_prior']].head().to_markdown(index=False, numalign="left", stralign="left"))

# Drop the original individual prior visit count columns
df = df.drop(columns=['number_outpatient', 'number_emergency', 'number_inpatient'])

print("\nDropped original prior visit count columns.")
print("DataFrame shape after handling prior visit columns:", df.shape)

Created engineered prior visit features: 'total_prior_visits', 'has_outpatient_prior', 'has_emergency_prior', 'has_inpatient_prior'.

DataFrame head with new prior visit features:
| number_outpatient   | number_emergency   | number_inpatient   | total_prior_visits   | has_outpatient_prior   | has_emergency_prior   | has_inpatient_prior   |
|:--------------------|:-------------------|:-------------------|:---------------------|:-----------------------|:----------------------|:----------------------|
| 0                   | 0                  | 0                  | 0                    | 0                      | 0                     | 0                     |
| 0                   | 0                  | 0                  | 0                    | 0                      | 0                     | 0                     |
| 2                   | 0                  | 1                  | 3                    | 1                      | 0                     | 1                     |
| 0       

In [22]:
df['diabetesMed'].value_counts()

diabetesMed
Yes    78363
No     23403
Name: count, dtype: int64

In [23]:
df['change'].value_counts()

change
No    54755
Ch    47011
Name: count, dtype: int64

In [24]:
# Map binary-like columns to 0 and 1
df['diabetesMed'] = df['diabetesMed'].map({'Yes': 1, 'No': 0})
df['change'] = df['change'].map({'Ch': 1, 'No': 0})

print("Mapped 'diabetesMed' and 'change' columns to 0/1.")

# Map ordinal 'age' column to numerical values
age_mapping = {'[0-10)': 5, '[10-20)': 15, '[20-30)': 25, '[30-40)': 35, '[40-50)': 45,
               '[50-60)': 55, '[60-70)': 65, '[70-80)': 75, '[80-90)': 85, '[90-100)': 95}
df['age'] = df['age'].map(age_mapping)

print("Mapped 'age' column to numerical values.")

# Display the first few rows of these columns to verify
print("\nDataFrame head showing mapped columns:")
print(df[['diabetesMed', 'change', 'age']].head().to_markdown(index=False, numalign="left", stralign="left"))

# Display info to check data types after mapping
print("\nDataFrame Info after mapping:")
df.info()

Mapped 'diabetesMed' and 'change' columns to 0/1.
Mapped 'age' column to numerical values.

DataFrame head showing mapped columns:
| diabetesMed   | change   | age   |
|:--------------|:---------|:------|
| 0             | 0        | 5     |
| 1             | 1        | 15    |
| 1             | 0        | 25    |
| 1             | 1        | 35    |
| 1             | 1        | 45    |

DataFrame Info after mapping:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 36 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   race                      101766 non-null  object
 1   gender                    101766 non-null  object
 2   age                       101766 non-null  int64 
 3   admission_type_id         101766 non-null  int64 
 4   discharge_disposition_id  101766 non-null  object
 5   admission_source_id       101766 non-null  int64 
 6   time_in_hospital       
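
The hand-written `age_mapping` dictionary assigns each ten-year bracket its midpoint. As a rough sketch, the same values can be derived from the bracket labels themselves, which avoids typos in the dictionary. The function name `bracket_midpoint` is a hypothetical choice for this example.

```python
import re

def bracket_midpoint(label):
    """Hypothetical helper: return the midpoint of an age bracket such as '[70-80)'."""
    match = re.fullmatch(r"\[(\d+)-(\d+)\)", label)
    if match is None:
        raise ValueError(f"Unexpected age bracket: {label!r}")
    low, high = map(int, match.groups())
    return (low + high) // 2

# Rebuild the mapping for the ten brackets used in the dataset
brackets = [f"[{low}-{low + 10})" for low in range(0, 100, 10)]
derived_mapping = {label: bracket_midpoint(label) for label in brackets}
print(derived_mapping)
```

Raising on an unexpected label makes a malformed bracket visible immediately, whereas `Series.map` with a dictionary would silently produce NaN.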

In [25]:
# Define the binary target variable
# Create a new column 'readmitted_within_30_days' which is 1 if 'readmitted' is '<30', and 0 otherwise
df['readmitted_within_30_days'] = (df['readmitted'] == '<30').astype(int)

print("Created binary target variable 'readmitted_within_30_days'.")

# Display the value counts of the new target variable to verify the imbalance
print("\nValue counts for the binary target variable:")
print(df['readmitted_within_30_days'].value_counts().to_markdown())

# The original 'readmitted' column is no longer needed in our feature set
# We will drop it before performing the final encoding on the features

Created binary target variable 'readmitted_within_30_days'.

Value counts for the binary target variable:
|   readmitted_within_30_days |   count |
|----------------------------:|--------:|
|                           0 |   90409 |
|                           1 |   11357 |


In [26]:
# Drop the original 'readmitted' column
df.drop('readmitted', axis=1, inplace=True)

print("Dropped the original 'readmitted' column.")
print("DataFrame shape after dropping original target:", df.shape)

# Identify all remaining categorical columns to encode
# These are the columns that are still of 'object' dtype
nominal_cols_to_ohe = df.select_dtypes(include='object').columns.tolist()


print(f"\nColumns to one-hot encode: {nominal_cols_to_ohe}")

# Perform one-hot encoding on the remaining nominal columns
# The binary target 'readmitted_within_30_days' is numerical and will not be encoded
df_encoded_refined = pd.get_dummies(df, columns=nominal_cols_to_ohe, dummy_na=False)


# Display the first 5 rows and the column information of the refined encoded DataFrame
print("\nFirst 5 rows of the refined encoded DataFrame (showing some columns):")
# Display a subset of columns for readability, including original numerical and some new one-hot encoded ones
sample_cols_to_show = ['age', 'time_in_hospital', 'num_lab_procedures', 'total_prior_visits',
                       'has_outpatient_prior', 'num_active_meds',
                       'race_Caucasian', 'gender_Female', 'medical_specialty_InternalMedicine',
                       'payer_code_MC', 'diag_1_428', 'insulin_Steady', 'readmitted_within_30_days'] # Include the new binary target


# Keep only the sample columns that exist after encoding.
# The kept medication columns were one-hot encoded above, so their original
# names no longer appear in df_encoded_refined and need no special handling.
final_cols_to_show = [col for col in sample_cols_to_show if col in df_encoded_refined.columns]


print(df_encoded_refined[final_cols_to_show].head().to_markdown(index=False, numalign="left", stralign="left"))


print("\nColumn information of the refined encoded DataFrame:")
# Summarize the dtypes and memory usage of the encoded DataFrame
df_encoded_refined.info()

Dropped the original 'readmitted' column.
DataFrame shape after dropping original target: (101766, 36)

Columns to one-hot encode: ['race', 'gender', 'discharge_disposition_id', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'metformin', 'repaglinide', 'nateglinide', 'glimepiride', 'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'insulin', 'glyburide-metformin']

First 5 rows of the refined encoded DataFrame (showing some columns):
| age   | time_in_hospital   | num_lab_procedures   | total_prior_visits   | has_outpatient_prior   | num_active_meds   | race_Caucasian   | gender_Female   | medical_specialty_InternalMedicine   | payer_code_MC   | diag_1_428   | insulin_Steady   | readmitted_within_30_days   |
|:------|:-------------------|:---------------------|:---------------------|:-----------------------|:------------------|:-----------------|:----------------|:-------------------------------------|:----------------|:-------------|:------------

In [27]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
# X includes all columns except the binary target variable
X = df_encoded_refined.drop('readmitted_within_30_days', axis=1)
# y is our binary target variable
y = df_encoded_refined['readmitted_within_30_days']

print("Separated features (X) and target (y).")
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


# Perform stratified train-test split
# We'll use a test set size of 20% (common practice) and a random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("\nPerformed stratified train-test split.")
# Print the shapes of the resulting sets to verify the split
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# Print the value counts of the target in train and test sets to verify stratification
print("\nValue counts for y_train:")
print(y_train.value_counts().to_markdown())

print("\nValue counts for y_test:")
print(y_test.value_counts().to_markdown())

Separated features (X) and target (y).
Shape of X: (101766, 702)
Shape of y: (101766,)

Performed stratified train-test split.
Shape of X_train: (81412, 702)
Shape of X_test: (20354, 702)
Shape of y_train: (81412,)
Shape of y_test: (20354,)

Value counts for y_train:
|   readmitted_within_30_days |   count |
|----------------------------:|--------:|
|                           0 |   72326 |
|                           1 |    9086 |

Value counts for y_test:
|   readmitted_within_30_days |   count |
|----------------------------:|--------:|
|                           0 |   18083 |
|                           1 |    2271 |
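
The counts above put the positive rate at about 11.2% in both the training set (9086 of 81412) and the test set (2271 of 20354), which is what `stratify=y` is meant to guarantee. The sketch below checks the same property on synthetic data. The toy arrays and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with an 11% positive rate (not the real data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 110 + [0] * 890)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits should keep close to the original positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.4f}, test positive rate: {test_rate:.4f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters most when the positive class is small.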


In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, f1_score

# Initialize the Logistic Regression model
# We set class_weight='balanced' to handle the class imbalance
# max_iter is increased to ensure convergence with potentially complex data
log_reg = LogisticRegression(solver='liblinear', class_weight='balanced', max_iter=1000, random_state=42)

print("Initialized Logistic Regression model with class_weight='balanced'.")

# Train the model on the training data
log_reg.fit(X_train, y_train)

print("Trained the Logistic Regression model.")

# Make predictions on the testing data
y_pred = log_reg.predict(X_test)
# Get predicted probabilities for ROC AUC
y_pred_proba = log_reg.predict_proba(X_test)[:, 1] # Probability of the positive class (class 1)

print("Made predictions on the test set.")

# Evaluate the model
print("\nLogistic Regression Model Evaluation (with class_weight='balanced'):")

# Classification Report provides Precision, Recall, F1-score for both classes
print(classification_report(y_test, y_pred))

# ROC AUC Score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC AUC Score: {roc_auc:.4f}")

# F1-score specifically for the minority class (class 1)
f1_minority = f1_score(y_test, y_pred, pos_label=1)
print(f"F1-score for minority class (Readmitted < 30): {f1_minority:.4f}")

Initialized Logistic Regression model with class_weight='balanced'.
Trained the Logistic Regression model.
Made predictions on the test set.

Logistic Regression Model Evaluation (with class_weight='balanced'):
              precision    recall  f1-score   support

           0       0.93      0.65      0.76     18083
           1       0.18      0.60      0.27      2271

    accuracy                           0.64     20354
   macro avg       0.55      0.62      0.52     20354
weighted avg       0.84      0.64      0.71     20354

ROC AUC Score: 0.6684
F1-score for minority class (Readmitted < 30): 0.2713


In [29]:
from sklearn.metrics import confusion_matrix
import pandas as pd

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Display the confusion matrix using a pandas DataFrame for better readability
# Rows represent actual classes, columns represent predicted classes
cm_df = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])

print("\nConfusion Matrix for Logistic Regression (with class_weight='balanced'):")
print(cm_df.to_markdown())

# You can also extract the counts directly
tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives (TN): {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"True Positives (TP): {tp}")


Confusion Matrix for Logistic Regression (with class_weight='balanced'):
|          |   Predicted 0 |   Predicted 1 |
|:---------|--------------:|--------------:|
| Actual 0 |         11709 |          6374 |
| Actual 1 |           914 |          1357 |

True Negatives (TN): 11709
False Positives (FP): 6374
False Negatives (FN): 914
True Positives (TP): 1357
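
The class 1 figures in the classification report follow directly from the confusion-matrix counts. As a quick check, the sketch below recomputes precision, recall, and F1 from the counts printed above. The helper name `minority_metrics` is made up for this illustration and is not part of the notebook.

```python
def minority_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 written in count form: 2TP / (2TP + FP + FN)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1

# Counts from the class-weighted logistic regression confusion matrix
precision, recall, f1 = minority_metrics(tp=1357, fp=6374, fn=914)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f}")
# precision=0.18 recall=0.60 f1=0.2713
```

Read this way, about 82% of the patients this model flags (6374 of 7731) are false alarms, which is the price paid for catching roughly 60% of the true 30-day readmissions.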


In [30]:
# If you don't have imbalanced-learn installed, uncomment and run the line below
# !pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
import pandas as pd

print("Applying SMOTE to the training data...")

# Initialize SMOTE
# random_state for reproducibility
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("SMOTE applied.")
print("Shape of original y_train:", y_train.shape)
print("Value counts of original y_train:")
print(y_train.value_counts().to_markdown())

print("\nShape of y_train_resampled:", y_train_resampled.shape)
print("Value counts of y_train_resampled after SMOTE:")
print(y_train_resampled.value_counts().to_markdown())

Applying SMOTE to the training data...
SMOTE applied.
Shape of original y_train: (81412,)
Value counts of original y_train:
|   readmitted_within_30_days |   count |
|----------------------------:|--------:|
|                           0 |   72326 |
|                           1 |    9086 |

Shape of y_train_resampled: (144652,)
Value counts of y_train_resampled after SMOTE:
|   readmitted_within_30_days |   count |
|----------------------------:|--------:|
|                           0 |   72326 |
|                           1 |   72326 |
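
For intuition about what the resampling step adds: SMOTE builds each synthetic minority row by taking a minority sample, choosing one of its nearest minority-class neighbours, and placing a new point at a random position on the line segment between the two. The sketch below shows only that interpolation step in plain Python. The function name `interpolate_sample` and the toy vectors are illustrative assumptions; the `imblearn` implementation also handles the neighbour search and decides how many rows to generate.

```python
import random

def interpolate_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
# Toy minority rows: one numeric feature followed by two one-hot columns
x = [3.0, 1.0, 0.0]
neighbor = [7.0, 0.0, 1.0]
synthetic = interpolate_sample(x, neighbor, rng)
print(synthetic)
```

Note that the one-hot columns of the synthetic row take fractional values whenever the two parent rows disagree on a category. If many of the 702 feature columns are one-hot encoded (an assumption, since the encoding step is not shown in this section), this is worth keeping in mind when interpreting results on the resampled data. `imblearn` also offers `SMOTENC` for data that mixes categorical and numeric features.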


In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, f1_score, confusion_matrix
import pandas as pd

print("Retraining Logistic Regression model on SMOTE-resampled data...")

# Initialize the Logistic Regression model WITHOUT class_weight='balanced'
# because the data is already balanced
log_reg_smote = LogisticRegression(solver='liblinear', max_iter=1000, random_state=42)

# Train the model on the RESAMPLED training data
log_reg_smote.fit(X_train_resampled, y_train_resampled)

print("Trained Logistic Regression model on resampled data.")

# Make predictions on the ORIGINAL testing data
y_pred_smote = log_reg_smote.predict(X_test)
# Get predicted probabilities for ROC AUC on the ORIGINAL testing data
y_pred_proba_smote = log_reg_smote.predict_proba(X_test)[:, 1] # Probability of the positive class (class 1)

print("Made predictions on the original test set.")

# Evaluate the model
print("\nLogistic Regression Model Evaluation (trained with SMOTE):")

# Classification Report
print(classification_report(y_test, y_pred_smote))

# ROC AUC Score
roc_auc_smote = roc_auc_score(y_test, y_pred_proba_smote)
print(f"ROC AUC Score: {roc_auc_smote:.4f}")

# F1-score for the minority class (class 1)
f1_minority_smote = f1_score(y_test, y_pred_smote, pos_label=1)
print(f"F1-score for minority class (Readmitted < 30): {f1_minority_smote:.4f}")

# Confusion Matrix
cm_smote = confusion_matrix(y_test, y_pred_smote)
cm_smote_df = pd.DataFrame(cm_smote, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])

print("\nConfusion Matrix for Logistic Regression (trained with SMOTE):")
print(cm_smote_df.to_markdown())

# Extract and print individual counts
tn_smote, fp_smote, fn_smote, tp_smote = cm_smote.ravel()
print(f"\nTrue Negatives (TN): {tn_smote}")
print(f"False Positives (FP): {fp_smote}")
print(f"False Negatives (FN): {fn_smote}")
print(f"True Positives (TP): {tp_smote}")

Retraining Logistic Regression model on SMOTE-resampled data...
Trained Logistic Regression model on resampled data.
Made predictions on the original test set.

Logistic Regression Model Evaluation (trained with SMOTE):
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     18083
           1       0.42      0.01      0.02      2271

    accuracy                           0.89     20354
   macro avg       0.65      0.50      0.48     20354
weighted avg       0.84      0.89      0.84     20354

ROC AUC Score: 0.6653
F1-score for minority class (Readmitted < 30): 0.0172

Confusion Matrix for Logistic Regression (trained with SMOTE):
|          |   Predicted 0 |   Predicted 1 |
|:---------|--------------:|--------------:|
| Actual 0 |         18055 |            28 |
| Actual 1 |          2251 |            20 |

True Negatives (TN): 18055
False Positives (FP): 28
False Negatives (FN): 2251
True Positives (TP): 20


In [32]:
import lightgbm as lgb
from sklearn.metrics import classification_report, roc_auc_score, f1_score, confusion_matrix
import pandas as pd

print("Initializing and training LightGBM model...")

# Calculate scale_pos_weight for class imbalance handling
# It's the ratio of the number of negative class instances to the number of positive class instances in the training data
scale_pos_weight_value = (y_train == 0).sum() / (y_train == 1).sum()
print(f"Calculated scale_pos_weight: {scale_pos_weight_value:.2f}")

# Initialize the LightGBM Classifier
# objective='binary' for binary classification
# metric='auc' sets the evaluation metric; it is only reported when an eval_set is passed to fit()
# scale_pos_weight up-weights the positive (minority) class
# random_state for reproducibility
# n_estimators and learning_rate are starting values and can be tuned
lgb_clf = lgb.LGBMClassifier(objective='binary',
                             metric='auc',
                             scale_pos_weight=scale_pos_weight_value,
                             n_estimators=1000, # Number of boosting rounds; no early stopping is used here
                             learning_rate=0.05, # Moderate step size
                             num_leaves=31, # Default, can be tuned
                             random_state=42,
                             n_jobs=-1) # Use all available cores

# Train the model on the ORIGINAL training data (LightGBM handles imbalance with scale_pos_weight)
lgb_clf.fit(X_train, y_train)

print("Trained the LightGBM model.")

# Make predictions on the ORIGINAL testing data
y_pred_lgbm = lgb_clf.predict(X_test)
# Get predicted probabilities for ROC AUC on the ORIGINAL testing data
y_pred_proba_lgbm = lgb_clf.predict_proba(X_test)[:, 1] # Probability of the positive class (class 1)

print("Made predictions on the original test set.")

# Evaluate the model
print("\nLightGBM Model Evaluation (with scale_pos_weight):")

# Classification Report
print(classification_report(y_test, y_pred_lgbm))

# ROC AUC Score
roc_auc_lgbm = roc_auc_score(y_test, y_pred_proba_lgbm)
print(f"ROC AUC Score: {roc_auc_lgbm:.4f}")

# F1-score for the minority class (class 1)
f1_minority_lgbm = f1_score(y_test, y_pred_lgbm, pos_label=1)
print(f"F1-score for minority class (Readmitted < 30): {f1_minority_lgbm:.4f}")

# Confusion Matrix
cm_lgbm = confusion_matrix(y_test, y_pred_lgbm)
cm_lgbm_df = pd.DataFrame(cm_lgbm, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])

print("\nConfusion Matrix for LightGBM (with scale_pos_weight):")
print(cm_lgbm_df.to_markdown())

# Extract and print individual counts
tn_lgbm, fp_lgbm, fn_lgbm, tp_lgbm = cm_lgbm.ravel()
print(f"\nTrue Negatives (TN): {tn_lgbm}")
print(f"False Positives (FP): {fp_lgbm}")
print(f"False Negatives (FN): {fn_lgbm}")
print(f"True Positives (TP): {tp_lgbm}")

Initializing and training LightGBM model...
Calculated scale_pos_weight: 7.96
[LightGBM] [Info] Number of positive: 9086, number of negative: 72326
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.016257 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1676
[LightGBM] [Info] Number of data points in the train set: 81412, number of used features: 694
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.111605 -> initscore=-2.074449
[LightGBM] [Info] Start training from score -2.074449
Trained the LightGBM model.
Made predictions on the original test set.

LightGBM Model Evaluation (with scale_pos_weight):
              precision    recall  f1-score   support

           0       0.92      0.74      0.82     18083
           1       0.19      0.49      0.28      2271

    accuracy                           0.71     20354
   macro avg       0.56
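
The `scale_pos_weight` value reported above and the `class_weight='balanced'` option used for logistic regression express the same idea in two forms. scikit-learn's balanced mode weights each class by `n_samples / (n_classes * class_count)`, and the ratio of the two resulting weights equals the negative-to-positive ratio passed to LightGBM. The sketch below checks this with the training counts from this notebook; the variable names are illustrative.

```python
n_neg, n_pos = 72326, 9086          # y_train value counts shown earlier
n_samples, n_classes = n_neg + n_pos, 2

# scikit-learn 'balanced' heuristic: n_samples / (n_classes * count)
w_neg = n_samples / (n_classes * n_neg)
w_pos = n_samples / (n_classes * n_pos)

# LightGBM-style ratio: negatives per positive
scale_pos_weight = n_neg / n_pos

print(f"w_neg={w_neg:.4f} w_pos={w_pos:.4f}")
print(f"w_pos / w_neg={w_pos / w_neg:.2f} scale_pos_weight={scale_pos_weight:.2f}")
# second line prints: w_pos / w_neg=7.96 scale_pos_weight=7.96
```

Both settings therefore make one positive example count about as much as eight negatives during training.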

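| Metric               | LR w/ class_weight | LR w/ SMOTE | LightGBM w/ scale_pos_weight |
|:---------------------|-------------------:|------------:|-----------------------------:|
| Precision (Class 1)  |               0.18 |        0.42 |                         0.19 |
| Recall (Class 1)     |               0.60 |        0.01 |                         0.49 |
| F1-score (Class 1)   |               0.27 |        0.02 |                         0.28 |
| ROC AUC              |             0.6684 |      0.6653 |                       0.6642 |
| Accuracy             |               0.64 |        0.89 |                         0.71 |
| True Positives (TP)  |               1357 |          20 |                         1109 |
| False Positives (FP) |               6374 |          28 |                         4653 |
| False Negatives (FN) |                914 |        2251 |                         1162 |
| True Negatives (TN)  |              11709 |       18055 |                        13430 |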

Let's use RandomizedSearchCV to tune some key hyperparameters of the LightGBM model, aiming to optimize the F1-score for the minority class.
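
A minimal sketch of what such a search could look like is given below. The search space, the 3-fold setting, and the helper name `build_search` are assumptions chosen for illustration, not values taken from this notebook. `scoring="f1"` is scikit-learn's binary F1 with `pos_label=1`, which here is the minority class.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Hypothetical search space; the ranges are illustrative starting points, not tuned values
param_distributions = {
    "num_leaves": randint(15, 64),          # integers 15..63
    "learning_rate": uniform(0.01, 0.09),   # samples from [0.01, 0.10]
    "n_estimators": randint(100, 601),      # integers 100..600
    "min_child_samples": randint(10, 101),  # integers 10..100
    "colsample_bytree": uniform(0.5, 0.5),  # samples from [0.5, 1.0]
}

def build_search(estimator, n_iter=20, seed=42):
    """Wrap an estimator in a randomized search scored on class-1 F1."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    return RandomizedSearchCV(
        estimator,
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring="f1",      # binary F1 for the positive (minority) class
        cv=cv,
        random_state=seed,
        n_jobs=-1,
    )

# Usage with the notebook's objects (assumes lightgbm is installed and that
# X_train, y_train, and scale_pos_weight_value exist as defined earlier):
#
#   import lightgbm as lgb
#   base = lgb.LGBMClassifier(objective="binary",
#                             scale_pos_weight=scale_pos_weight_value,
#                             random_state=42)
#   search = build_search(base).fit(X_train, y_train)
#   print(search.best_params_, search.best_score_)
```

Stratified folds keep the roughly 11% positive rate in every split, so the cross-validated F1 is computed under the same imbalance as the held-out test set.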

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, f1_score, confusion_matrix
import pandas as pd

print("Initializing and training Random Forest model...")

# Initialize the Random Forest Classifier
# n_estimators is set to a reasonable starting point
# class_weight='balanced' addresses class imbalance
# random_state for reproducibility
# n_jobs=-1 to use all available cores
rf_clf = RandomForestClassifier(n_estimators=200, # Number of trees in the forest
                              class_weight='balanced', # Handles class imbalance
                              random_state=42,
                              n_jobs=-1)

# Train the model on the ORIGINAL training data
# class_weight='balanced' reweights the classes during tree building, so no resampling step (e.g. SMOTE) is applied here
rf_clf.fit(X_train, y_train)

print("Trained the Random Forest model.")

# Make predictions on the ORIGINAL testing data
y_pred_rf = rf_clf.predict(X_test)
# Get predicted probabilities for ROC AUC on the ORIGINAL testing data
y_pred_proba_rf = rf_clf.predict_proba(X_test)[:, 1] # Probability of the positive class (class 1)

print("Made predictions on the original test set.")

# Evaluate the model
print("\nRandom Forest Model Evaluation (with class_weight='balanced'):")

# Classification Report
print(classification_report(y_test, y_pred_rf))

# ROC AUC Score
roc_auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
print(f"ROC AUC Score: {roc_auc_rf:.4f}")

# F1-score for the minority class (class 1)
f1_minority_rf = f1_score(y_test, y_pred_rf, pos_label=1)
print(f"F1-score for minority class (Readmitted < 30): {f1_minority_rf:.4f}")

# Confusion Matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_rf_df = pd.DataFrame(cm_rf, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])

print("\nConfusion Matrix for Random Forest (with class_weight='balanced'):")
print(cm_rf_df.to_markdown())

# Extract and print individual counts
tn_rf, fp_rf, fn_rf, tp_rf = cm_rf.ravel()
print(f"\nTrue Negatives (TN): {tn_rf}")
print(f"False Positives (FP): {fp_rf}")
print(f"False Negatives (FN): {fn_rf}")
print(f"True Positives (TP): {tp_rf}")

Initializing and training Random Forest model...
Trained the Random Forest model.
Made predictions on the original test set.

Random Forest Model Evaluation (with class_weight='balanced'):
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     18083
           1       0.70      0.01      0.01      2271

    accuracy                           0.89     20354
   macro avg       0.79      0.50      0.48     20354
weighted avg       0.87      0.89      0.84     20354

ROC AUC Score: 0.6763
F1-score for minority class (Readmitted < 30): 0.0122

Confusion Matrix for Random Forest (with class_weight='balanced'):
|          |   Predicted 0 |   Predicted 1 |
|:---------|--------------:|--------------:|
| Actual 0 |         18077 |             6 |
| Actual 1 |          2257 |            14 |

True Negatives (TN): 18077
False Positives (FP): 6
False Negatives (FN): 2257
True Positives (TP): 14


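| Metric               | LR w/ class_weight | LR w/ SMOTE | LightGBM w/ scale_pos_weight | Random Forest w/ class_weight |
|:---------------------|-------------------:|------------:|-----------------------------:|------------------------------:|
| Precision (Class 1)  |               0.18 |        0.42 |                         0.19 |                          0.70 |
| Recall (Class 1)     |               0.60 |        0.01 |                         0.49 |                          0.01 |
| F1-score (Class 1)   |               0.27 |        0.02 |                         0.28 |                          0.01 |
| ROC AUC              |             0.6684 |      0.6653 |                       0.6642 |                        0.6763 |
| Accuracy             |               0.64 |        0.89 |                         0.71 |                          0.89 |
| True Positives (TP)  |               1357 |          20 |                         1109 |                            14 |
| False Positives (FP) |               6374 |          28 |                         4653 |                             6 |
| False Negatives (FN) |                914 |        2251 |                         1162 |                          2257 |
| True Negatives (TN)  |              11709 |       18055 |                        13430 |                         18077 |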
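
One caveat when reading the accuracy row: with about 11% positives in the test set, a model that never predicts readmission already scores about 0.89. The short check below uses only the test-set counts reported above to compare that trivial baseline with the two models whose accuracy looks highest. The dictionary layout is an illustrative choice.

```python
n_neg, n_pos = 18083, 2271            # y_test value counts
total = n_neg + n_pos

# Accuracy of a trivial classifier that always predicts class 0
baseline_accuracy = n_neg / total

# (TN, TP) taken from the confusion matrices above
models = {
    "LR w/ SMOTE": (18055, 20),
    "Random Forest w/ class_weight": (18077, 14),
}

print(f"Always predict 0: {baseline_accuracy:.4f}")
for name, (tn, tp) in models.items():
    print(f"{name}: {(tn + tp) / total:.4f}")
# Always predict 0: 0.8884
# LR w/ SMOTE: 0.8880
# Random Forest w/ class_weight: 0.8888
```

Both models sit within 0.001 of the baseline, so for this task the class 1 recall and F1 rows are the more informative basis for comparison.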