# Part 2: Case Study Application – Hospital Readmission Prediction

## Problem Scope

**Problem Definition**
- Develop a predictive system to identify patients at high risk of unplanned hospital readmission within 30 days of discharge.
- Enable proactive discharge planning and post-discharge care coordination.

**Objectives**
- Achieve 75% recall in identifying patients who will be readmitted (minimize missed high-risk patients).
- Reduce 30-day readmission rates by 20% through targeted interventions.
- Provide interpretable risk scores and contributing factors to clinicians at least 24 hours before discharge.
- Maintain fairness across demographic groups (at most a 5 percentage point difference in false negative rates).
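
The fairness objective can be checked with a few lines of plain Python. The sketch below is illustrative only: the function names, the toy labels, and the group names "A" and "B" are assumptions, and the target is read here as a gap of at most 0.05 between the highest and lowest group false negative rate.

```python
from collections import defaultdict


def false_negative_rates(y_true, y_pred, groups):
    """Return {group: FNR}, where FNR = FN / (FN + TP) among actual positives."""
    false_negatives = defaultdict(int)
    positives = defaultdict(int)
    for actual, predicted, group in zip(y_true, y_pred, groups):
        if actual == 1:
            positives[group] += 1
            if predicted == 0:
                false_negatives[group] += 1
    # Groups with no actual positives are left out (FNR is undefined for them).
    return {g: false_negatives[g] / positives[g] for g in positives}


def max_fnr_gap(rates):
    """Largest difference in FNR between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy data (hypothetical): 1 = readmitted, 0 = not readmitted.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5

rates = false_negative_rates(y_true, y_pred, groups)
gap = max_fnr_gap(rates)
meets_target = gap <= 0.05

print(rates, gap, meets_target)
```

In this toy example group B's readmissions are missed twice as often as group A's, so the audit would fail and trigger a closer look at thresholds or features.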

**Stakeholders**
- Clinical teams (physicians, nurses, discharge planners): Use predictions for personalized discharge plans and follow-up.
- Hospital administrators: Monitor readmission rates for quality metrics and reimbursement optimization (CMS penalties).
- Patients and caregivers: Receive enhanced support if flagged high-risk.
- Insurance payers: Benefit from reduced costs associated with preventable readmissions.

---

## Data Strategy

**Proposed Data Sources**
- Electronic Health Records (EHR)
  - Demographics: age, gender, race, ZIP code
  - Medical history: diagnoses, comorbidities, previous hospitalizations
  - Current admission: primary diagnosis, length of stay, procedures, medications
  - Vital signs, lab results, discharge disposition, follow-up appointments
- Social Determinants of Health (SDOH)
  - Insurance type and coverage
  - Transportation access
  - Food security and housing stability
  - Social support network, health literacy
- Claims and Administrative Data
  - Prior healthcare utilization, medication adherence, post-discharge services

**Ethical Concerns**
- Patient Privacy (HIPAA Compliance)
  - Risk of re-identification with multiple data sources
  - Unauthorized access to sensitive data
  - Mitigation: differential privacy, secure encryption, privacy audits
- Algorithmic Bias and Health Equity
  - Historical disparities may under-identify risk in underserved populations
  - Mitigation: fairness audits, fairness constraints, stakeholder inclusion

**Preprocessing Pipeline**
- Data Integration & Cleaning
  - Merge EHR, pharmacy, and claims data
  - Remove duplicates, reconcile conflicts
  - Handle missing values: forward-fill vital signs; add an "unknown" category for categorical variables with more than 10% missing; drop records that are missing critical fields
- Feature Engineering
  - Temporal features: days since last admission, number of admissions in past year, length of current stay
  - Clinical complexity: Charlson Comorbidity Index, medication count, number of active diagnoses
  - Utilization patterns: ER visits past 6 months, missed appointments ratio
  - Discharge readiness: follow-up appointment scheduled, medication reconciliation
  - Risk flags: abnormal lab values, discharge against medical advice
- Handling Imbalanced Data
  - Oversample the minority class with SMOTE (training data only), or use class weights that penalize false negatives more heavily
- Feature Scaling & Encoding
  - Standardize continuous variables
  - One-hot encode categorical variables with <10 categories
  - Target encode high-cardinality variables
- Feature Selection
  - Remove highly correlated features (|r| > 0.85)
  - Recursive Feature Elimination for top 50 predictors
  - Retain interpretable clinical features
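
As an illustration of the temporal features listed under Feature Engineering, here is a minimal pandas sketch. The `encounters` table, its column names, and its dates are hypothetical, and "days since last admission" is interpreted here as days since the same patient's previous discharge; other definitions are equally plausible.

```python
import pandas as pd

# Hypothetical encounter table: one row per admission.
encounters = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "admit_date": pd.to_datetime(
        ["2023-01-10", "2023-03-01", "2023-12-20", "2023-06-05"]),
    "discharge_date": pd.to_datetime(
        ["2023-01-14", "2023-03-04", "2023-12-22", "2023-06-12"]),
})
encounters = encounters.sort_values(["patient_id", "admit_date"]).reset_index(drop=True)

# Length of the current stay, in days.
encounters["length_of_stay"] = (
    encounters["discharge_date"] - encounters["admit_date"]
).dt.days

# Days since the same patient's previous discharge (NaN for a first admission).
previous_discharge = encounters.groupby("patient_id")["discharge_date"].shift(1)
encounters["days_since_last_admission"] = (
    encounters["admit_date"] - previous_discharge
).dt.days

# Number of the same patient's admissions in the 365 days before this one.
# A plain loop keeps the logic readable; it is quadratic, so a real pipeline
# would want a vectorized or windowed version.
counts = []
for _, row in encounters.iterrows():
    same_patient = encounters["patient_id"] == row["patient_id"]
    window_start = row["admit_date"] - pd.Timedelta(days=365)
    in_window = (encounters["admit_date"] < row["admit_date"]) & (
        encounters["admit_date"] >= window_start
    )
    counts.append(int((same_patient & in_window).sum()))
encounters["admissions_past_year"] = counts

print(encounters)
```

Only information available before the current discharge is used, which avoids leaking the outcome into the features.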

---

## Model Development

**Selected Model:** Logistic Regression with L1 Regularization (LASSO)

**Justification**
- Interpretability: Clinicians understand why a patient is flagged
- Regulatory compliance: Transparent models are easier to validate and document under FDA and other healthcare regulations
- Clinical trust: Physicians more likely to act on understandable predictions
- Computational efficiency: Fast real-time inference at discharge
- L1 Regularization: Automatic feature selection, identifies strongest predictors
- Alternative considered: Random Forest (less interpretable for individual predictions)

**Hypothetical Confusion Matrix (1,000 patients; 180 readmitted)**
- Actual: No Readmit / Predicted: No Readmit – 730 (TN)
- Actual: No Readmit / Predicted: Readmit – 90 (FP)
- Actual: Readmit / Predicted: No Readmit – 45 (FN)
- Actual: Readmit / Predicted: Readmit – 135 (TP)

**Precision & Recall**
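- Precision = 135 / (135 + 90) = 60% → 40% of patients flagged as high-risk are false alarms
- Recall = 135 / (135 + 45) = 75% → 25% of patients who are readmitted are missed
- Trade-off: Prioritize recall to avoid missing at-risk patients, accepting lower precision (more false alarms) as the cost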
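
The arithmetic can be reproduced directly from the four cells of the hypothetical confusion matrix. Specificity, F1, and prevalence are not quoted in the text; they are added here only as a sanity check on the same counts.

```python
# Counts from the hypothetical confusion matrix (1,000 patients; 180 readmitted).
tn, fp, fn, tp = 730, 90, 45, 135

total = tn + fp + fn + tp
precision = tp / (tp + fp)      # share of flagged patients who are readmitted
recall = tp / (tp + fn)         # share of readmitted patients who were flagged
specificity = tn / (tn + fp)    # share of non-readmitted patients left unflagged
f1 = 2 * precision * recall / (precision + recall)
prevalence = (tp + fn) / total  # base rate of readmission

print(f"total={total}, prevalence={prevalence:.2f}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
print(f"specificity={specificity:.3f}, f1={f1:.3f}")
```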

---

## Deployment

**Integration Steps**
- System Architecture
  - RESTful API using Flask/FastAPI
  - Host on HIPAA-compliant cloud (AWS GovCloud / Azure Healthcare)
  - API authentication via OAuth 2.0
- EHR Integration
  - HL7 FHIR interface for patient data
  - Run predictions 24 hours before discharge
  - Output: risk score (0–100) + top 5 contributing factors
- Clinical Workflow
  - Display risk score in discharge module
  - Tiered interventions:
    - High (>70): case manager, home health, 48-hour follow-up
    - Medium (40–70): standard education, 7-day follow-up
    - Low (<40): standard care pathway
  - Clinician override allowed with documentation
- Monitoring Dashboard
  - Track predictions, readmissions, performance
  - Alerts for concept drift or failures
  - Audit logs
- Feedback Loop
  - Capture outcomes for continuous validation
  - Quarterly retraining, A/B testing
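
The tiering and the "top 5 contributing factors" output could look roughly like the sketch below. The function names and example feature names are hypothetical. The 0–100 score is assumed to be the predicted probability times 100. A factor's contribution is assumed to be its coefficient multiplied by its standardized value, which is one common choice for a linear model, not something the plan specifies.

```python
def score_from_probability(probability):
    """Convert a predicted probability (0-1) to a 0-100 risk score (assumed mapping)."""
    return round(100 * probability)


def risk_tier(score):
    """Map a 0-100 risk score to a tier: high (>70), medium (40-70), low (<40)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in [0, 100], got {score}")
    if score > 70:
        return "high"
    if score >= 40:
        return "medium"
    return "low"


# Interventions per tier, taken from the clinical workflow list.
INTERVENTIONS = {
    "high": ["case manager", "home health", "48-hour follow-up"],
    "medium": ["standard education", "7-day follow-up"],
    "low": ["standard care pathway"],
}


def top_factors(contributions, k=5):
    """Return the k feature names with the largest absolute contribution."""
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical patient: contribution = coefficient * standardized feature value.
contributions = {
    "prior_admissions": 0.9,
    "length_of_stay": 0.4,
    "age": 0.2,
    "missed_appointments": 0.6,
    "medication_count": 0.3,
    "follow_up_scheduled": -0.5,
}
score = score_from_probability(0.82)
tier = risk_tier(score)
print(score, tier, INTERVENTIONS[tier], top_factors(contributions))
```

Scores of exactly 40 and 70 fall in the medium tier, matching the ranges in the list.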

**HIPAA Compliance Measures**
- Encrypt data at rest (AES-256) & in transit (TLS 1.3)
- Role-based access control
- Minimum necessary data collection
- De-identification for research/model training
- BAAs with vendors
- Patient consent, access, and correction mechanisms

---

## Optimization

**Overfitting Solution:** K-Fold Cross-Validation with Regularization
- Cross-validation: 5-fold stratified
- L1 Regularization: tune λ over {0.001, 0.01, 0.1, 1, 10}
- Early stopping if validation performance plateaus
- Feature selection: top 30–40 features, retain clinical interpretability
- Expected Outcome: better generalization, i.e. a smaller gap between training and held-out test performance
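
A minimal scikit-learn sketch of the cross-validated L1 tuning follows. Several pieces are assumptions rather than part of the plan: the synthetic data from `make_classification` stands in for real patient records, the λ grid is mapped to scikit-learn's inverse-strength parameter as `C = 1/λ` (the exact correspondence depends on how the loss is scaled), and `average_precision` is used as a threshold-free selection metric so that a degenerate "flag everyone" model cannot win on recall alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 18% positives (assumption, not real data).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.82, 0.18], random_state=42,
)

# Lambda grid from the text, mapped to scikit-learn's C (assumed C = 1 / lambda).
lambdas = [0.001, 0.01, 0.1, 1, 10]
param_grid = {"classifier__C": [1 / lam for lam in lambdas]}

pipeline = Pipeline([
    ("scaler", StandardScaler()),  # refit inside each fold, so no leakage
    ("classifier", LogisticRegression(
        penalty="l1", solver="liblinear", class_weight="balanced")),
])

# 5-fold stratified cross-validation keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=cv)
search.fit(X, y)

best_lambda = 1 / search.best_params_["classifier__C"]
best_coefficients = search.best_estimator_.named_steps["classifier"].coef_[0]
n_selected = int((best_coefficients != 0).sum())

print("Best lambda:", best_lambda)
print("Features with non-zero coefficients:", n_selected)
```

Counting the non-zero coefficients shows how many features the L1 penalty retains at the chosen strength.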


In [1]:
# ======================================
# Part 2: Hospital Readmission Prediction
# Practical Starter Notebook
# ======================================

# =============================
# 1. Import Libraries
# =============================
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, classification_report

# =============================
# 2. Load Dataset
# =============================
# Replace the path with the location of your CSV file
df = pd.read_csv("diabetic_data.csv")

# Quick overview
print("Dataset Shape:", df.shape)
print(df.head())
df.info()  # prints its summary directly; wrapping it in print() would also print "None"

# =============================
# 3. Target Variable
# =============================
# For 30-day readmission prediction, convert the readmitted column to binary:
# 1 = readmitted within 30 days ('<30'), 0 = otherwise
df['readmitted_30'] = (df['readmitted'] == '<30').astype(int)

# Drop the original readmitted column (optional)
df = df.drop(columns='readmitted')

# =============================
# 4. Feature Selection (Example)
# =============================
# Select some features for demo; you can expand based on your theoretical plan
features = [
    'race', 'gender', 'age', 'admission_type_id', 'time_in_hospital',
    'num_lab_procedures', 'num_medications', 'number_outpatient',
    'number_inpatient', 'number_emergency', 'diag_1', 'diag_2', 'diag_3'
]

X = df[features]
y = df['readmitted_30']

# =============================
# 5. Preprocessing Pipeline
# =============================
# Identify categorical and numerical columns.
# admission_type_id is an integer code for a category, so it is one-hot
# encoded rather than scaled as if it were a quantity.
categorical_features = ['race', 'gender', 'age', 'admission_type_id',
                        'diag_1', 'diag_2', 'diag_3']
numerical_features = ['time_in_hospital', 'num_lab_procedures', 'num_medications',
                      'number_outpatient', 'number_inpatient', 'number_emergency']

# Create preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# =============================
# 6. Train-Test Split
# =============================
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# =============================
# 7. Model: Logistic Regression
# =============================
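# Note: scikit-learn's LogisticRegression applies an L2 penalty by default.
# The L1 (LASSO) variant from the plan needs penalty='l1' with a compatible
# solver such as 'liblinear' or 'saga'.
# class_weight='balanced' reweights the classes to offset the imbalance.
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500, class_weight='balanced'))
])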

# Train the model
clf.fit(X_train, y_train)

# =============================
# 8. Predictions & Evaluation
# =============================
y_pred = clf.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Precision & Recall
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Precision:", precision)
print("Recall:", recall)

# Full Classification Report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

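# =============================
# 9. Optional: Feature Importance
# (Coefficients follow the preprocessor's output columns: scaled numeric
#  features first, then the one-hot columns. Needs scikit-learn >= 1.0
#  for get_feature_names_out.)
# =============================
# feature_names = clf.named_steps['preprocessor'].get_feature_names_out()
# coef = pd.Series(clf.named_steps['classifier'].coef_[0], index=feature_names)
# print("Largest coefficients by magnitude:\n",
#       coef.sort_values(key=abs, ascending=False).head(10))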


Dataset Shape: (101766, 50)
   encounter_id  patient_nbr             race  gender      age weight  \
0       2278392      8222157        Caucasian  Female   [0-10)      ?   
1        149190     55629189        Caucasian  Female  [10-20)      ?   
2         64410     86047875  AfricanAmerican  Female  [20-30)      ?   
3        500364     82442376        Caucasian    Male  [30-40)      ?   
4         16680     42519267        Caucasian    Male  [40-50)      ?   

   admission_type_id  discharge_disposition_id  admission_source_id  \
0                  6                        25                    1   
1                  1                         1                    7   
2                  1                         1                    7   
3                  1                         1                    7   
4                  1                         1                    7   

   time_in_hospital  ... citoglipton insulin  glyburide-metformin  \
0                 1  ...          No 