## 🧾 1. Introduction

This notebook analyzes the Diabetes 130-US Hospitals dataset from the UCI Machine Learning Repository. The primary goal is to model patient readmission patterns and identify the key predictors of readmission using a Decision Tree model.

Dataset Source: UCI ML Repository (ID: 296)  
Domain: Healthcare  
Focus: 30-day hospital readmission for diabetic patients

## 📦 2. Dataset Import

We build the model on the cleaned dataset produced during EDA.

In [None]:
import pandas as pd
import numpy as np

# Load the cleaned dataset
df = pd.read_csv('cleaned_diabetes_data.csv')
print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
df.head()
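
scikit-learn's `DecisionTreeClassifier` expects numeric feature values, so any text columns that survive cleaning have to be encoded before training. The sketch below is one possible way to do that with one-hot encoding, assuming the cleaned file may still contain non-numeric columns. The helper name `one_hot_encode_features` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def one_hot_encode_features(frame: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """Return a copy in which every non-numeric feature column is one-hot encoded.

    The target column is left untouched so it can still be binarized later.
    """
    feature_cols = [c for c in frame.columns if c != target_col]
    non_numeric = (
        frame[feature_cols]
        .select_dtypes(exclude=["number", "bool"])
        .columns.tolist()
    )
    if not non_numeric:
        # Nothing to encode: hand back an unchanged copy
        return frame.copy()
    return pd.get_dummies(frame, columns=non_numeric, dtype=int)


# Toy frame for illustration only (not real patient data)
toy = pd.DataFrame({
    "time_in_hospital": [3, 7, 2],
    "gender": ["Female", "Male", "Female"],
    "readmitted": ["<30", "NO", ">30"],
})
toy_encoded = one_hot_encode_features(toy, target_col="readmitted")
print(toy_encoded.columns.tolist())
```

If the cleaned dataset is already fully numeric apart from `readmitted`, the helper returns it unchanged and this step can be skipped.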

## 🔍 3. Model Building

**Target Binarization**

1 = Readmitted within 30 days (`<30`)

0 = Not readmitted (`NO`) or readmitted after more than 30 days (`>30`)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Binarize the target: 1 if readmitted within 30 days, otherwise 0
df['Readmitted_30_Days'] = (df['readmitted'] == '<30').astype(int)

# Drop the original multi-class 'readmitted' column
df_model = df.drop('readmitted', axis=1)

# Separate Features (X) and Target (Y)
X = df_model.drop('Readmitted_30_Days', axis=1)
Y = df_model['Readmitted_30_Days']

print("Data separated into X and Y, and target binarized.")

# --- Train-Test Split ---
# Use a 70/30 split and stratify to maintain the readmission ratio in both sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
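
To illustrate what `stratify=Y` does, the following sketch runs the same kind of 70/30 split on synthetic labels and compares the positive-class rate in each partition. The 10% positive rate and all names (`X_toy`, `y_toy`) are assumptions chosen for the demonstration, not values from the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 100 positives out of 1000 (assumed ratio)
rng = np.random.default_rng(42)
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = rng.normal(size=(1000, 3))

_, _, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification, both partitions keep roughly the overall positive rate
print(f"Overall positive rate: {y_toy.mean():.3f}")
print(f"Train positive rate:   {y_toy_train.mean():.3f}")
print(f"Test positive rate:    {y_toy_test.mean():.3f}")
```

The same check can be run on the real split by comparing `Y_train.mean()` and `Y_test.mean()`.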

**Model Training & Prediction**

In [None]:
from sklearn.tree import DecisionTreeClassifier

# --- Model Training ---
dt_classifier = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,  # Limiting depth for interpretability and avoiding overfitting
    random_state=42
)

# Fit the model to the training data
dt_classifier.fit(X_train, Y_train)

print("Decision Tree Model Training Complete.")

# --- Prediction ---
Y_pred = dt_classifier.predict(X_test)
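
Because the depth is capped for interpretability, it can help to read the learned rules directly. The sketch below uses `sklearn.tree.export_text` on a small synthetic problem; the data and the feature names (`feat_a` to `feat_d`) are placeholders, and the depth is reduced to 2 only to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; feature names are hypothetical
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
demo_names = ["feat_a", "feat_b", "feat_c", "feat_d"]

demo_tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
demo_tree.fit(X_demo, y_demo)

# Render the fitted tree as indented if/else rules
rules = export_text(demo_tree, feature_names=demo_names)
print(rules)
```

For the model trained above, the equivalent call would be `export_text(dt_classifier, feature_names=list(X.columns))`.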

**Evaluation & Feature Importance**

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# --- Evaluation ---
print("\n--- Model Performance on Test Set ---")
print(f"Accuracy: {accuracy_score(Y_test, Y_pred):.4f}")

# Classification Report
print("\nClassification Report:")
# Note: 0 = 'Not Readmitted' (includes readmissions after 30 days), 1 = 'Readmitted' (within 30 days)
print(classification_report(Y_test, Y_pred, target_names=['Not Readmitted', 'Readmitted']))

# Confusion Matrix
conf_matrix = confusion_matrix(Y_test, Y_pred)
print("\nConfusion Matrix (Rows=True, Cols=Predicted):\n", conf_matrix)


# --- Feature Importance ---
feature_importances = pd.Series(
    dt_classifier.feature_importances_, 
    index=X.columns
).sort_values(ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importances.head(10))
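
Accuracy alone can be misleading when one class dominates. As a rough sanity check, the sketch below scores a majority-class baseline on synthetic labels, assuming the positive class is a minority (the 10% rate here is illustrative, not taken from the dataset). A model is only informative if it beats this baseline, which is why recall for the positive class deserves as much attention as accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 10 positives, 90 negatives (assumed imbalance)
y_true = np.array([1] * 10 + [0] * 90)
X_placeholder = np.zeros((100, 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_base = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_recall = recall_score(y_true, y_base)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # 0.90
print(f"Baseline recall:   {baseline_recall:.2f}")    # 0.00
```

On the real data, fitting the same `DummyClassifier` on `X_train, Y_train` and scoring it on `X_test, Y_test` gives the reference point for the Decision Tree's reported accuracy.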