## Step A1: ML Task Definition & Output

### Objective
The objective of this project is to predict the **future risk of adverse events** caused by polypharmacy (use of multiple medications) in hospitalized patients. The focus is on identifying high-risk patients early so that preventive actions can be considered before serious outcomes occur.

### ML Task Type
This is a **binary classification problem with a probabilistic output**.  
The model predicts whether a patient is likely to experience an adverse event (such as a fall or medication-related complication) within a **28-day prediction window**.
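As an illustration of how a binary label for the 28-day window could be derived, the sketch below uses plain Python rather than Spark. The function name `label_for_patient`, the notion of an index date, and the choice to exclude events on the index date itself are assumptions made for this example, not details taken from the project's pipeline.

```python
from datetime import date, timedelta

PREDICTION_WINDOW_DAYS = 28  # window length stated in the task definition


def label_for_patient(index_date, event_dates, window_days=PREDICTION_WINDOW_DAYS):
    """Return 1 if any adverse event falls inside the window after index_date, else 0."""
    window_end = index_date + timedelta(days=window_days)
    # Assumption: events on the index date itself are not "future" events.
    return int(any(index_date < d <= window_end for d in event_dates))


# Hypothetical patient: prediction made on 1 March, one event ten days later.
print(label_for_patient(date(2024, 3, 1), [date(2024, 3, 11)]))
```

Patients with no recorded event inside the window receive label 0, which is what makes the task binary.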

### Why Machine Learning (Not Rule-Based)
Polypharmacy risks are driven by **complex interactions between multiple drugs, patient conditions, and treatment duration**. These patterns are difficult to capture using fixed rules.  
Machine learning allows the system to learn risk patterns from historical data and adapt to combinations that may not be obvious or explicitly defined.

### Model Inputs
The model uses features engineered from structured healthcare data, including:
- Patient demographics (age, gender)
- Medication count and combinations
- Duration of medication exposure
- Historical clinical events
- Polypharmacy-related risk indicators

### Model Output
The model produces a **risk probability score between 0 and 1**, representing the likelihood of a patient experiencing an adverse event within the next 28 days.

### Success Criteria
Model performance and usefulness are evaluated using:
- AUC-ROC to measure predictive quality
- Stability of risk predictions
- Ability to translate predictions into meaningful risk levels for decision support


## Step A2: Risk Stratification & Decision Logic

### Purpose
While the machine learning model outputs a probability score, probabilities alone are not actionable for decision-makers. This step converts model outputs into **clear risk categories** that can support intervention planning and monitoring.

### Risk Stratification Logic
The predicted probability from the Logistic Regression model is mapped into three risk levels:

- **High Risk (Red)**: Probability ≥ 0.70  
- **Medium Risk (Amber)**: Probability ≥ 0.40 and < 0.70  
- **Low Risk (Green)**: Probability < 0.40  

These thresholds are heuristic starting points chosen to balance sensitivity and interpretability, and can be adjusted based on operational requirements.
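The mapping can be sketched in plain Python as follows. The function name `risk_category` and its keyword arguments are hypothetical; the point is the boundary handling, where a score of exactly 0.70 is Red and exactly 0.40 is Amber.

```python
def risk_category(probability, high=0.70, medium=0.40):
    """Map a predicted probability to a Red / Amber / Green risk level."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high:
        return "Red"
    if probability >= medium:
        return "Amber"
    return "Green"


for p in (0.85, 0.70, 0.55, 0.40, 0.10):
    print(p, risk_category(p))
```

Keeping the cutoffs as parameters makes it straightforward to adjust them later without touching the logic.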

### Output Representation
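Each scored patient record in the Gold output table (`gold_patient_risk`) contains:
- Patient identifier (`patient_id`), which links back to patient and medication context
- Risk probability score (`risk_score`)
- Risk category (`risk_bucket`: High / Medium / Low, corresponding to Red / Amber / Green)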

This makes the output easily consumable by downstream systems such as dashboards or alerting workflows.

### Decision Support Use Case
- **High Risk** patients can be prioritized for medication review or closer monitoring
- **Medium Risk** patients can be flagged for follow-up assessments
- **Low Risk** patients continue under routine care

### Why This Matters
This step transforms raw ML predictions into **actionable insights**, moving the system beyond model training and into real-world decision support.


## Step A3: Model Evaluation & Interpretation

### Evaluation Approach
The model was evaluated using a train-test split and assessed with **AUC-ROC** via Spark’s `BinaryClassificationEvaluator`.  
AUC was chosen because the problem involves **risk ranking**, not just binary accuracy.
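To make the ranking interpretation concrete, the sketch below computes AUC as the fraction of (positive, negative) patient pairs in which the positive patient receives the higher score, counting ties as half. This is the pairwise definition of AUC written in plain Python for illustration. It is not how Spark's evaluator is implemented, and the function name `auc_roc` is hypothetical.

```python
def auc_roc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc_roc([1, 0, 1, 0], [0.8, 0.6, 0.4, 0.2]))  # 0.75
```

Because only the ordering of scores matters, AUC is unaffected by where the Red / Amber / Green thresholds are placed.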

### Evaluation Result
- **AUC-ROC Score:** 0.59

### Interpretation of Results
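An AUC of 0.59 means that a randomly chosen patient who had an adverse event receives a higher score than a randomly chosen patient who did not about 59% of the time. The model therefore performs **better than random guessing** (0.50), but only modestly, and is not highly predictive.  
Likely contributing factors are: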
- High noise and sparsity in healthcare event data
- Limited feature depth in the free Databricks environment
- Absence of temporal and clinical severity signals available in real hospital systems

### Why the Model Is Still Useful
The goal of this system is **early risk signaling**, not medical diagnosis.  
Even modest predictive power is valuable when:
- Used to rank patients by relative risk
- Combined with human review or rule-based safeguards
- Integrated into a broader decision-support workflow
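A minimal sketch of the ranking use case, in plain Python: given risk scores keyed by patient, select the highest-scoring patients for human review. The function name `top_k_patients`, the patient identifiers, and the idea of a fixed review capacity `k` are assumptions for illustration only.

```python
def top_k_patients(risk_scores, k):
    """Return the ids of the k patients with the highest risk scores."""
    ranked = sorted(risk_scores.items(), key=lambda item: item[1], reverse=True)
    return [patient_id for patient_id, _ in ranked[:k]]


# Hypothetical scores for three patients; the review team can see two per day.
scores = {"P001": 0.20, "P002": 0.90, "P003": 0.50}
print(top_k_patients(scores, k=2))  # ['P002', 'P003']
```

Used this way, the model only needs to order patients sensibly; the absolute probabilities matter less than they would for automated decisions.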

### Limitations
- Limited feature richness from public datasets
- No longitudinal patient history beyond available records
- Thresholds are heuristic and not clinically validated

### Future Improvements
- Incorporate richer temporal features
- Add interaction features between medication classes
- Calibrate risk thresholds using domain feedback

This evaluation indicates that, at its current performance, the model is suitable only for **decision support and risk prioritization** alongside human review, not automated clinical decisions.


## Step A4: End-to-End Database ↔ AI Workflow

### Overview
This project is designed as a full **database-to-AI-to-decision** pipeline using Databricks Lakehouse principles. Each stage builds on the previous one, ensuring traceability, reproducibility, and scalability.

### Data Ingestion (Bronze Layer)
Raw healthcare datasets (patients, prescriptions, and events) are ingested into Databricks as **Bronze tables**.  
These tables represent the source data with minimal transformation and act as the system of record.

### Data Cleaning & Transformation (Silver Layer)
Silver tables are created by:
- Cleaning invalid or missing values
- Standardizing formats (dates, identifiers)
- Joining patient, medication, and event data
- Preparing structured inputs suitable for feature engineering

This layer ensures data quality and consistency for downstream ML tasks.
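The cleaning steps listed above can be illustrated for a single prescription record in plain Python. The column names `patient_id`, `drug_name`, and `start_date` appear in the notebook code; the raw `DD/MM/YYYY` date format, the upper/lower-casing rules, and the function name `clean_prescription` are assumptions made for this sketch.

```python
from datetime import datetime


def clean_prescription(raw):
    """Return a standardized record, or None if a required field is missing or invalid."""
    patient_id = (raw.get("patient_id") or "").strip().upper()
    drug_name = (raw.get("drug_name") or "").strip().lower()
    if not patient_id or not drug_name:
        return None  # drop rows without a usable identifier or drug name
    try:
        # Assumption: raw dates arrive as DD/MM/YYYY strings.
        start = datetime.strptime(raw["start_date"].strip(), "%d/%m/%Y").date()
    except (KeyError, AttributeError, ValueError):
        return None  # drop rows with a missing or unparseable date
    return {
        "patient_id": patient_id,
        "drug_name": drug_name,
        "start_date": start.isoformat(),  # standardized to ISO 8601
    }


print(clean_prescription(
    {"patient_id": " p001 ", "drug_name": "Warfarin ", "start_date": "05/03/2024"}
))
```

In the Lakehouse pipeline the same rules would be expressed as DataFrame transformations; the per-record form here is only meant to show the logic.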

### Feature Engineering & Aggregation (Gold Layer)
The Gold layer contains **ML-ready features**, including:
- Polypharmacy indicators
- Medication exposure duration
- Patient-level aggregates
- Event frequency and history signals

These curated tables are optimized for analytics, modeling, and reporting.
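As a sketch of what patient-level aggregation might look like, the plain-Python function below derives a medication count, a polypharmacy flag, and total exposure days from one patient's prescriptions. The feature names, the function name `patient_features`, and the cutoff of five concurrent medications are assumptions for illustration; the project's actual feature definitions are not shown in this document.

```python
from datetime import date

# Assumption: five or more distinct medications is used as the polypharmacy cutoff.
POLYPHARMACY_THRESHOLD = 5


def patient_features(prescriptions):
    """prescriptions: list of (drug_name, start_date, end_date) tuples for one patient."""
    distinct_drugs = {drug for drug, _, _ in prescriptions}
    # Exposure is counted inclusively, so a same-day prescription contributes 1 day.
    exposure_days = sum((end - start).days + 1 for _, start, end in prescriptions)
    return {
        "medication_count": len(distinct_drugs),
        "is_polypharmacy": int(len(distinct_drugs) >= POLYPHARMACY_THRESHOLD),
        "total_exposure_days": exposure_days,
    }


example = [
    ("warfarin", date(2024, 1, 1), date(2024, 1, 10)),
    ("aspirin", date(2024, 1, 5), date(2024, 1, 5)),
]
print(patient_features(example))
```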

### Machine Learning Integration
- Features are read from the curated feature table (the notebook code loads it as `silver_features`)
- A Logistic Regression model is trained using PySpark ML
- Predictions and probability scores are generated
- Risk levels (Red / Amber / Green) are derived from probabilities
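For intuition about how a logistic regression model turns features into a probability, the sketch below applies the sigmoid function to a weighted sum of feature values. It is plain Python, not PySpark, and the weights, intercept, and function names are made-up values for illustration rather than parameters of the trained model.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_risk(features, weights, intercept):
    """Probability of the positive class for one patient's feature vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two features, e.g. medication count and prior events.
print(predict_risk([6.0, 1.0], weights=[0.2, 0.5], intercept=-2.0))
```

The resulting value is the kind of score that is then compared against the 0.40 and 0.70 thresholds to assign a risk level.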

### Storing Results Back to the Database
Final risk scores and categories are written back as **Gold output tables**, closing the loop between:
**Data → AI → Decision Support**

This enables:
- Dashboarding
- Monitoring
- Future automation via Databricks Jobs

### Why This Workflow Matters
This architecture demonstrates:
- Clear separation of concerns across data layers
- Tight integration between data engineering and AI
- A realistic production-style workflow using only free Databricks capabilities


## Step A5: Business Impact & Practical Use

### Problem Impact
Polypharmacy is a common issue in healthcare, especially among elderly and chronically ill patients. Adverse events such as dizziness, falls, and fractures often occur **after a cascade of medication interactions**, making them difficult to detect early using static rules.

### Who This System Helps
- **Clinicians**: Identify patients who may need early medication review
- **Hospital Operations Teams**: Prioritize high-risk patients for monitoring
- **Quality & Safety Teams**: Reduce preventable adverse events
- **Patients**: Lower risk of medication-related harm

### How the Output Is Used
The system produces a **risk score and risk category** for each patient:
- **High Risk (Red)**: Immediate review or intervention recommended
- **Medium Risk (Amber)**: Increased monitoring and follow-up
- **Low Risk (Green)**: Routine care

This allows teams to focus effort where it matters most.

### Decision Support, Not Automation
This project is explicitly designed as a **decision-support system**, not a medical device.  
Predictions are intended to **assist human judgment**, not replace it.

### Practical Value
Even with a modest AUC score (0.59), the model provides value by:
- Ranking patients by relative risk
- Surfacing hidden risk patterns from medication combinations
- Enabling proactive care rather than reactive treatment

### Real-World Applicability
The system can be integrated with:
- Hospital dashboards
- Daily risk reports
- Automated alerts for care teams

This demonstrates how AI can be responsibly applied to improve patient safety using existing data.


In [0]:
import pyspark.sql.functions as F

In [0]:
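# Load the engineered feature table
data = spark.table("silver_features")
# Replace nulls with 0; with a numeric fill value, only numeric columns are affected
data = data.fillna(0)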

In [0]:
# 80/20 train/test split with a fixed seed for reproducibility
train, test = data.randomSplit([0.8, 0.2], seed=42)

In [0]:
# Raw columns that are not passed to the model (gender is dropped here, not encoded)
drop_cols = [
    "gender", "drug_name", "start_date", "end_date"
]

ml_data = data.drop(*drop_cols)

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
import mlflow
import mlflow.spark

features = [c for c in ml_data.columns if c not in ["patient_id", "label"]]

assembler = VectorAssembler(
    inputCols=features,
    outputCol="features"
)

train_vec = assembler.transform(train.select(*ml_data.columns))
test_vec = assembler.transform(test.select(*ml_data.columns))

lr = LogisticRegression(
    featuresCol="features",
    labelCol="label",
    probabilityCol="risk_score"
)

with mlflow.start_run():
    # Fit on the training split, then score the held-out test split
    model = lr.fit(train_vec)
    predictions = model.transform(test_vec)
    # mlflow.spark.log_model(model, "risk_model")

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

auc = evaluator.evaluate(predictions)
print("AUC:", auc)

In [0]:
# Relies on `mlflow` (imported above) and `auc` (computed in the evaluation cell).
# This starts a second MLflow run, separate from the run that wrapped model fitting.

mlflow.set_experiment("/Shared/Smart_Pill_Danger_Forecaster")

with mlflow.start_run(run_name="LogisticRegression_Baseline"):
    
    # Log params
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("features", "demographics + prescriptions + events")
    
    # Log metric
    mlflow.log_metric("AUC", auc)
    
print(f"AUC logged to MLflow: {auc}")

In [0]:
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

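# `predictions` holds only the held-out test split, so this table covers
# test-set patients rather than the full patient population.
gold_risk = (
    predictions
    # Convert the probability vector to an array so it can be indexed
    .withColumn("risk_array", vector_to_array("risk_score"))
    .select(
        "patient_id",
        # Index 1 is the predicted probability of the positive class (label = 1)
        F.col("risk_array")[1].alias("risk_score")
    )
    .withColumn(
        "risk_bucket",
        # High / Medium / Low correspond to Red / Amber / Green in the write-up
        F.when(F.col("risk_score") >= 0.7, "High")
         .when(F.col("risk_score") >= 0.4, "Medium")
         .otherwise("Low")
    )
)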

gold_risk.write.mode("overwrite").saveAsTable("gold_patient_risk")

In [0]:
spark.table("gold_patient_risk").display()