In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("ds701_midterm25_notebook.ipynb")

## Stroke Risk Prediction - DS701 Midterm Challenge

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context("talk")
sns.set_style("whitegrid")

# Load Dataset
df = pd.read_csv("strokeX.csv")
print("Dataset Shape:", df.shape)
display(df.head())

# Store the original columns
original_columns = df.columns.tolist()

# Unique patient record
print(f"Unique patients: {df.shape[0]} (1 record per patient assumed)")

# Basic descriptive stats
df.describe(include='all').T

### Part 1 - Exploratory Feature Analysis & Risk Engineering

<!-- BEGIN QUESTION -->

#### Q1.1 Age-Normalized Risk Index (ANRI) (5 points)

The **Age-Normalized Risk Index (ANRI)** identifies patients whose stroke risk is unusually high for their age.  
A higher ANRI indicates a higher stroke risk relative to age.

*Hint: ANRI is calculated by dividing each patient’s `stroke_risk_pct` by their `age`.*

**Tasks:**
1. Compute a new column `ANRI` (handle divide-by-zero safely).  
2. Plot the distribution of ANRI using a **histogram** (with KDE).   
3. Show the **top 10 patients** with the highest ANRI values and briefly interpret the results.
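
The safe-division step can be sketched on toy data as follows. This is a minimal illustration rather than a reference solution: the values are made up, and replacing an age of 0 with `NaN` is only one reasonable way to avoid infinite ratios.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); column names follow the task description.
toy = pd.DataFrame({
    "age": [50, 0, 80, 25],
    "stroke_risk_pct": [25.0, 10.0, 40.0, 50.0],
})

# Assumption: treat age 0 as missing so the ratio becomes NaN instead of inf.
safe_age = toy["age"].replace(0, np.nan)
toy["ANRI"] = toy["stroke_risk_pct"] / safe_age

# Rows with the highest ANRI; non-missing values are ranked first.
top = toy.nlargest(2, "ANRI")
print(top)
```

For the histogram, `sns.histplot(..., kde=True)` overlays a KDE curve on the bars.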

In [None]:
def compute_anri(df):
    """
    Compute and visualize the Age-Normalized Risk Index (ANRI)
    """
    
    ...

df = compute_anri(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q1.2 Chronic Condition Score (CCS) (5 points)

The **Chronic Condition Score (CCS)** represents the number of chronic cardiovascular conditions a patient has, specifically high blood pressure and irregular heartbeat.  
A higher CCS indicates greater chronic disease burden.

*Hint: CCS is the sum of the binary columns `high_bp` and `irregular_heartbeat`.*

**Tasks:**
1. Add a new column `CCS` to the DataFrame.
2. Compute the **average stroke risk percentage** for each CCS level (0, 1, 2).  
3. Visualize the results using a **bar plot**, and interpret how stroke risk changes with higher CCS.
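
As a rough sketch of the first two tasks, the sum and the per-level average can be computed like this on a made-up five-row table (the values are illustrative only):

```python
import pandas as pd

# Toy data (made up); column names follow the hint above.
toy = pd.DataFrame({
    "high_bp": [0, 1, 1, 0, 1],
    "irregular_heartbeat": [0, 0, 1, 1, 1],
    "stroke_risk_pct": [20.0, 40.0, 70.0, 50.0, 80.0],
})

# CCS is the row-wise sum of the two binary indicators (0, 1, or 2).
toy["CCS"] = toy["high_bp"] + toy["irregular_heartbeat"]

# Average stroke risk percentage per CCS level.
ccs_means = toy.groupby("CCS")["stroke_risk_pct"].mean()
print(ccs_means)
```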

In [None]:
def compute_ccs(df):
    """
    Compute and visualize the Chronic Condition Score (CCS)
    """
    
    ...

ccs_summary = compute_ccs(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q1.3 Symptom Burden Index (SBI) (5 points)

The **Symptom Burden Index (SBI)** quantifies the total number of symptoms reported by each patient.  
A higher SBI indicates a greater overall symptom load, which is expected to relate to higher stroke risk.

*Hint: SBI is the sum of all symptom indicator columns (values 0 or 1).*

**Tasks:**
1. Compute the total `SBI` for each patient using all symptom columns, then categorize patients into three groups based on SBI: *Low (0–3)*, *Moderate (4–6)*, and *High (7+)*.
2. Analyze how the **average stroke risk percentage** changes across these SBI groups.
3. Compute the **correlation** between `SBI` and `stroke_risk_pct`.
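
The summing, binning, and correlation steps might look like the sketch below. The symptom column names (`symptom_1` to `symptom_8`) are hypothetical placeholders, since the real names depend on the dataset, and the bin edges are one way to encode the three groups with `pd.cut`.

```python
import numpy as np
import pandas as pd

# Hypothetical symptom columns; the real dataset's column names will differ.
symptom_cols = [f"symptom_{i}" for i in range(1, 9)]
counts = [1, 3, 5, 6, 8]  # number of symptoms present for each toy patient
rows = [[1] * c + [0] * (8 - c) for c in counts]
toy = pd.DataFrame(rows, columns=symptom_cols)
toy["stroke_risk_pct"] = [15.0, 30.0, 50.0, 55.0, 85.0]

# SBI is the row-wise sum of the binary symptom indicators.
toy["SBI"] = toy[symptom_cols].sum(axis=1)

# Right-closed bins: (-1, 3] -> Low, (3, 6] -> Moderate, (6, inf) -> High.
toy["SBI_group"] = pd.cut(
    toy["SBI"],
    bins=[-1, 3, 6, np.inf],
    labels=["Low", "Moderate", "High"],
)

group_means = toy.groupby("SBI_group", observed=True)["stroke_risk_pct"].mean()
corr = toy["SBI"].corr(toy["stroke_risk_pct"])  # Pearson correlation by default
print(group_means)
```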

In [None]:
def compute_sbi(df):
    """
    Compute the Symptom Burden Index (SBI)
    """
   
    ...

sbi_summary = compute_sbi(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q1.4 Symptom Predictive Power using Mutual Information (MI) (5 points)

**Mutual Information (MI)** measures how informative a feature is for predicting a target variable.  
Here, it shows how strongly each symptom relates to the stroke risk label (`at_risk`).

*Hint: Use `mutual_info_classif` from `sklearn.feature_selection` to compute MI scores.*

**Tasks:**
1. Compute the MI score between each symptom and `at_risk`.  
2. Sort symptoms by MI score and list the **Top 10** most predictive ones.  
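
To illustrate what the MI ranking captures, the sketch below scores two synthetic binary features: one that agrees with the label about 90% of the time and one that is pure noise. The feature names and data are invented for the example; passing `discrete_features=True` is an assumption that suits 0/1 symptom indicators.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
at_risk = rng.integers(0, 2, size=n)

# Flip roughly 10% of the labels to build an informative but imperfect feature.
flip = rng.random(n) < 0.1
toy = pd.DataFrame({
    "symptom_informative": np.where(flip, 1 - at_risk, at_risk),  # hypothetical name
    "symptom_noise": rng.integers(0, 2, size=n),                  # hypothetical name
})

# Treat the features as discrete because they only take the values 0 and 1.
mi = mutual_info_classif(toy, at_risk, discrete_features=True, random_state=0)
mi_scores = pd.Series(mi, index=toy.columns).sort_values(ascending=False)
print(mi_scores)
```

The informative feature should receive a clearly higher score than the noise feature, whose MI stays close to zero.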

In [None]:
def compute_mi(df):
    """
    Compute the Mutual Information (MI) score
    """
    from sklearn.feature_selection import mutual_info_classif

    ...

mi_summary = compute_mi(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q1.5 Age-Adjusted Risk Z-Scores (AARZ) (5 points)

The **Age-Adjusted Risk Z-Score (AARZ)** compares a patient’s stroke risk to others in the same age group.  
It highlights patients whose risk levels are unusually high or low relative to their peers.

*Hint: For each 10-year age group, calculate how many standard deviations a patient’s stroke risk percentage is above or below the group’s mean value.*

**Tasks:**
1. Create 10-year age bins and, within each age group, compute the *Z-score* of `stroke_risk_pct`.  
2. Identify the **Top 5 patients** with the highest Z-scores (highest relative risk).  
3. For **each age group**, list the **Top 2 patients** with the highest Z-scores. 
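
A within-group Z-score can be sketched with `groupby(...).transform(...)`. The example below uses made-up data, labels each decade by its lower edge, and uses the sample standard deviation (pandas' default, `ddof=1`); these are assumptions, and a group with a single patient would yield `NaN`.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "age": [32, 35, 38, 61, 64, 67],
    "stroke_risk_pct": [10.0, 20.0, 30.0, 40.0, 50.0, 90.0],
})

# 10-year bins labelled by their lower edge (30 covers ages 30-39).
toy["age_group"] = (toy["age"] // 10) * 10

# Z-score relative to the mean and standard deviation of each age group.
grp = toy.groupby("age_group")["stroke_risk_pct"]
toy["AARZ"] = (toy["stroke_risk_pct"] - grp.transform("mean")) / grp.transform("std")

# Two highest Z-scores within each age group.
top2_by_group = toy.sort_values("AARZ", ascending=False).groupby("age_group").head(2)
print(top2_by_group)
```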

In [None]:
def compute_aarz(df):
    """
    Compute Age-Adjusted Risk Z-Scores (AARZ)
    """

    ...
    
top5_outliers, top_by_group = compute_aarz(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q1.6 Risk Consistency Index (RCI) (5 points)

The **Risk Consistency Index (RCI)** measures how well the continuous stroke risk scores align with the binary label `at_risk`.  
It captures how distinctly the two groups, *at risk* and *not at risk*, differ in their average stroke risk percentage.  
Higher RCI values indicate stronger consistency between risk scores and the `at_risk` label.

*Hint: RCI is calculated by taking the absolute difference between the group means and dividing it by the pooled standard deviation.*  

**Tasks:** 
1. Compute the mean and standard deviation of `stroke_risk_pct` for both groups.
2. Calculate the `RCI`.
3. Interpret whether stroke risk percentages are consistent with the binary classification labels.
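
The hint does not say which pooled standard deviation to use. The sketch below assumes the degrees-of-freedom-weighted form familiar from Cohen's d, which is only one common choice, and applies it to made-up data.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "at_risk": [0, 0, 0, 1, 1, 1],
    "stroke_risk_pct": [10.0, 20.0, 30.0, 60.0, 70.0, 80.0],
})

# Per-group mean, sample standard deviation, and size.
stats = toy.groupby("at_risk")["stroke_risk_pct"].agg(["mean", "std", "count"])
m0, s0, n0 = stats.loc[0]
m1, s1, n1 = stats.loc[1]

# Assumed pooled SD: group variances weighted by their degrees of freedom.
pooled_sd = np.sqrt(((n0 - 1) * s0**2 + (n1 - 1) * s1**2) / (n0 + n1 - 2))
rci = abs(m1 - m0) / pooled_sd
print(rci)
```

Here the group means differ by 50 and the pooled standard deviation is 10, so the index is 5.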

In [None]:
def compute_rci(df):
    """
    Compute the Risk Consistency Index (RCI) 
    """
    ...

RCI = compute_rci(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q1.7 Composite Health Risk Index (CHRI) (5 points)

The **Composite Health Risk Index (CHRI)** combines multiple cardiovascular indicators into a single, weighted risk score.  
It integrates *high blood pressure*, *irregular heartbeat*, and *age-normalized risk (ANRI)* to capture an individual’s overall health vulnerability.

*Hint:* Compute CHRI using the weighted formula:  *CHRI = 0.4 * High Blood Pressure + 0.4 * Irregular Heartbeat + 0.2 * ANRI*

**Tasks:**
1. Calculate the **Composite Health Risk Index (CHRI)** for each patient.   
2. Find the **Top 5 patients** with the highest CHRI values.
3. Compute correlations between `CHRI` and both `stroke_risk_pct` and `at_risk`.

In [None]:
def compute_chri(df):
    """
    Compute the Composite Health Risk Index (CHRI) 
    """
    
    ...

df, top_chri, corr_chri_risk, corr_chri_label = compute_chri(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q1.8 Risk Stratification Buckets (5 points)

The **Risk Stratification Buckets** divide patients into *Low*, *Moderate*, and *High* risk groups based on their normalized stroke risk percentage.  
This helps compare how *age*, *symptom burden (SBI)*, and *chronic conditions (CCS)* vary across different risk levels.

*Hint: Normalize `stroke_risk_pct` by dividing by 100, then assign each patient to a risk category.*

**Tasks:**
1. Categorize patients into three buckets — *Low (0–0.3)*, *Moderate (0.3–0.7)*, *High (0.7–1.0)*. For each group, compute the average **age**, **SBI**, and **CCS**.  
2. Visualize these averages using a bar plot with a log scale on the y-axis and interpret how patient characteristics differ by risk level.
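
The bucket ranges above share their edges (0.3 and 0.7), so an implementation has to decide which side each edge belongs to. The sketch below assumes right-closed intervals via `pd.cut`, with `include_lowest=True` so that a risk of exactly 0 lands in *Low*; the column name `risk_norm` is a hypothetical helper.

```python
import pandas as pd

# Toy data chosen to sit on and between the bucket edges.
toy = pd.DataFrame({"stroke_risk_pct": [0.0, 30.0, 45.0, 70.0, 100.0]})
toy["risk_norm"] = toy["stroke_risk_pct"] / 100  # hypothetical helper column

# Assumed convention: [0, 0.3] -> Low, (0.3, 0.7] -> Moderate, (0.7, 1.0] -> High.
toy["risk_bucket"] = pd.cut(
    toy["risk_norm"],
    bins=[0, 0.3, 0.7, 1.0],
    labels=["Low", "Moderate", "High"],
    include_lowest=True,
)
print(toy)
```

Under this convention a patient at exactly 0.3 is *Low* and one at exactly 0.7 is *Moderate*; passing `right=False` would flip both.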

In [None]:
def compute_risk_buckets(df):
    """
    Create risk stratification buckets based on normalized stroke risk percentage
    """
    ...

bucket_summary = compute_risk_buckets(df)

<!-- END QUESTION -->



In [None]:
# Newly Created Columns
current_columns = df.columns.tolist()
new_columns = [col for col in current_columns if col not in original_columns]

print("Columns present in the final DataFrame:\n")
print(current_columns)

print("\n Newly created columns:")
if new_columns:
    print(new_columns)
else:
    print("No new columns were created")

### Part 2 - Clustering: Patient Risk Profiles

<!-- BEGIN QUESTION -->

#### Q2.1 Feature Preparation for Clustering (5 points)

Before applying clustering algorithms, it is important to prepare the dataset by selecting relevant continuous features and standardizing them.

**Tasks:**
1. Select continuous features relevant to clustering: `age`, `SBI`, `CCS`, `ANRI`, and `stroke_risk_pct`.  
2. Scale the selected features using `StandardScaler` so that each has mean = 0 and standard deviation = 1.  
3. Display the shape of the scaled matrix and the first few transformed rows.
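
As a quick check of what standardization does, the sketch below scales a made-up two-column table and verifies the resulting column means and standard deviations. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "age": [30, 45, 60, 75],
    "stroke_risk_pct": [10.0, 30.0, 55.0, 80.0],
})

# fit_transform returns a NumPy array with one standardized column per feature.
X_toy = StandardScaler().fit_transform(toy)

print(X_toy.shape)
print(X_toy.mean(axis=0).round(6), X_toy.std(axis=0).round(6))
```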

In [None]:
from sklearn.preprocessing import StandardScaler

def prepare_clustering_features(df):
    """
    Select and standardize relevant continuous features for clustering.
    Returns the scaled feature matrix.
    """
    
    ...
    
X_scaled = prepare_clustering_features(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q2.2 Optimal Number of Clusters (Elbow & Silhouette Methods) (20 points)

To determine the appropriate number of clusters (**k**), we use two evaluation methods:

- **Elbow Method:** observes the point where inertia (within-cluster variance) stops decreasing sharply.  
- **Silhouette Score:** measures how well clusters are separated (higher = better).  

**Tasks:**
1. Run **K-Means** clustering for values of $k = 2$ to $10$.  
2. Compute and store **Inertia** and **Silhouette Score** for each k.  
3. Visualize both metrics to identify the k that balances compactness and separation.  
4. Display the numeric summary table.
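
The evaluation loop can be sketched on synthetic data with a known structure. The example below assumes three well-separated blobs generated with `make_blobs` and a shorter range of k for brevity; on real patient data the elbow and the silhouette peak are rarely this clear-cut.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (illustration only).
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

records = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    records.append({
        "k": k,
        "inertia": km.inertia_,  # within-cluster sum of squared distances
        "silhouette": silhouette_score(X_toy, km.labels_),
    })

summary = pd.DataFrame(records)
best_k = int(summary.loc[summary["silhouette"].idxmax(), "k"])
print(summary)
print("k with the highest silhouette score:", best_k)
```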

In [None]:
...

def evaluate_kmeans_clusters(X_scaled, k_min=2, k_max=10):
    """
    Evaluate optimal number of clusters (k) using Elbow and Silhouette methods.
    Computes inertia and silhouette scores for k in [k_min, k_max].
    """
    
    ...

opt_df = evaluate_kmeans_clusters(X_scaled)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q2.3 K-Means Clustering and Visualization (15 points)

Using the results from the previous analysis, choose an appropriate number of clusters (**k**, typically 3–5).

**K-Means** partitions patients into groups such that individuals in the same cluster are more similar to each other based on selected numeric features, while those in different clusters are more dissimilar.

**Tasks:**
1. Fit **K-Means** on the standardized feature matrix (`X_scaled`) with the chosen value of **k**.  
2. Reduce the dataset to 2 dimensions using **PCA** for visualization.  
3. Assign cluster labels to each patient and map descriptive names (e.g., *Low Risk*, *Moderate Risk*, *High Risk*).  
4. Plot the clusters in 2D PCA space and highlight the cluster centroids.
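
One detail worth sketching is how to place the centroids on the PCA plot: K-Means is fitted in the full feature space, and the same fitted PCA is then used to project both the points and the centroids. The data below is synthetic and the variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 5-feature data standing in for the standardized matrix.
X_toy, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=0)

# Cluster in the original 5-dimensional space.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Fit PCA once, then project the points and the centroids with the same transform.
pca = PCA(n_components=2, random_state=0).fit(X_toy)
X_2d = pca.transform(X_toy)
centroids_2d = pca.transform(km.cluster_centers_)

print(X_2d.shape, centroids_2d.shape)
```

The projected points can then be drawn with a scatter plot colored by `km.labels_`, with `centroids_2d` overlaid using a distinct marker.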

In [None]:
...

def perform_kmeans_clustering(df, X_scaled, k=3):
    """
    Fit K-Means clustering, project to 2D PCA space, and visualize clusters.
    Returns DataFrame with cluster labels and centroids in PCA space.
    """
    
    ...

df, centroids_pca = perform_kmeans_clustering(df, X_scaled, k=3)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q2.4 Cluster Profiling and Interpretation (10 points)

Once the clusters are formed, it is important to interpret what each group represents.  
We can do this by computing the **average feature values** for key indicators such as  
`age`, `SBI`, `CCS`, `ANRI`, and `stroke_risk_pct`.

**Tasks:**
1. Compute the mean values of these features for each cluster.  
2. Create a summary table showing the average characteristics per cluster.  
3. Briefly interpret what each cluster might represent (e.g., *young–low-risk*, *older–high-burden*).

In [None]:
def cluster_profiling(df):
    """
    Summarize and display average feature values for each cluster.
    """
    
    ...

cluster_profile = cluster_profiling(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q2.5 Cluster Risk Comparison (10 points)

To evaluate the clinical significance of each cluster, we compare their overall stroke risk levels.  
This helps determine whether certain clusters represent higher medical vulnerability.

**Tasks:**
1. For each cluster, compute:  
   - Mean `stroke_risk_pct`  
   - Proportion of patients where `at_risk = 1`  
2. Create a summary table showing these statistics.  
3. Visualize both metrics using side-by-side bar plots.

In [None]:
def cluster_risk_comparison(df):
    """
    Compare average stroke risk and at-risk proportion across clusters.
    """
    
    ...

cluster_risk = cluster_risk_comparison(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q2.6 Gaussian Mixture Model (GMM) Comparison (25 points)

While K-Means assigns each point to exactly one cluster,  
a **Gaussian Mixture Model (GMM)** allows *probabilistic* membership, capturing overlap between patient groups.

**Tasks:**
1. Fit a GMM with the same number of clusters (**k**) used in K-Means.  
2. Align GMM cluster labels with K-Means labels for consistency.  
3. Compute the **Adjusted Rand Index (ARI)** to quantify agreement between the two models.  
4. Visualize both K-Means and GMM results side by side in PCA space.
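
Label alignment is the least obvious step, because cluster IDs are arbitrary in both models. One possible approach, sketched below on synthetic blobs, builds a contingency table and solves an assignment problem with SciPy's `linear_sum_assignment`. The ARI itself does not depend on how clusters are numbered, so alignment mainly matters for consistent colors and names in the plots.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, confusion_matrix
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated blobs (illustration only).
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X_toy)

# Contingency table: rows are K-Means labels, columns are GMM labels.
cm = confusion_matrix(km_labels, gmm_labels)

# Negate the counts so that minimizing cost maximizes the overlap.
row_ind, col_ind = linear_sum_assignment(-cm)
mapping = {g: k for k, g in zip(row_ind, col_ind)}  # GMM label -> K-Means label
gmm_aligned = np.array([mapping[g] for g in gmm_labels])

ari = adjusted_rand_score(km_labels, gmm_aligned)
print("ARI:", ari)
```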

In [None]:
...

def compare_kmeans_gmm(df, X_scaled, k=3):
    """
    Compare K-Means and Gaussian Mixture Model (GMM) clustering.
    Computes Adjusted Rand Index (ARI) and visualizes both cluster assignments.
    """
    
    ...

df, ari_aligned = compare_kmeans_gmm(df, X_scaled, k=3)

<!-- END QUESTION -->

### Part 3 - Predictive Modeling: Stroke Risk Classification & Regression

<!-- BEGIN QUESTION -->

#### Q3.1 Feature Preparation (Symptoms Only) (5 points)

In this step, we identify and prepare symptom-based features for later modeling.  
These features capture physical and cardiovascular symptoms which may contribute to stroke risk.

**Tasks:**
1. Select all **symptom-related binary columns** from the dataset.  
2. Initialize a preprocessing pipeline using `StandardScaler` for numeric scaling.  
3. Display the list of selected features to confirm correct feature selection.

In [None]:
...

def prepare_symptom_features(df):
    """
    Select and scale symptom-related features for modeling.
    """
    ...

symptom_cols, preprocessor = prepare_symptom_features(df)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q3.2 Regression – Predict Stroke Risk Percentage (20 points + bonus points)

Train regression models using symptom-only features to predict the continuous variable `stroke_risk_pct`.  
Evaluate them using **RMSE**, **MAE**, and **R²**.

At a minimum, use the following models:
* k-nearest neighbors
* random forest
* linear regression

5 bonus points for each additional model you use, for up to 2 additional models.

Of course you are free to explore even more models.
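
The three regression metrics can be computed as in the sketch below, which uses made-up targets and predictions. Taking the square root of the MSE explicitly is a deliberate choice here, since it works the same way across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; every prediction is off by exactly 5.
y_true = np.array([20.0, 40.0, 60.0, 80.0])
y_pred = np.array([25.0, 35.0, 65.0, 75.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
mae = mean_absolute_error(y_true, y_pred)           # mean absolute error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

print(rmse, mae, r2)
```

With these values the RMSE and MAE are both 5, and R² is 1 - 100/2000 = 0.95.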

In [None]:
...

def train_regressors(df, symptom_cols, preprocessor):
    """
    Train and evaluate multiple regression models for stroke risk prediction.
    """
    
    ...

reg_df = train_regressors(df, symptom_cols, preprocessor)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q3.3 Best Regression Model – Feature Importance (20 points)

For the best-performing regression model, visualize the **most important features** influencing the predicted stroke risk percentage.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def analyze_best_regressor(df, reg_df, symptom_cols, preprocessor):
    """
    Display top features for the best-performing regression model.
    """
    
    ...

best_reg_name = analyze_best_regressor(df, reg_df, symptom_cols, preprocessor)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q3.4 Classification – Predict Stroke Risk Category (20 points + bonus points)

Using the same symptom-only features, train classification models to predict whether a patient is **at risk of stroke (1)** or **not at risk (0)**.  
Evaluate using **AUC**, **Accuracy**, **F1**, and **Balanced Accuracy**.

Use the following models:
* Random Forest
* k-nearest neighbors
* logistic regression

5 bonus points for each additional model you use, for up to 2 additional models.

Of course you are free to explore even more models.
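
The four classification metrics take different inputs: AUC is computed from predicted probabilities (or scores), while Accuracy, F1, and Balanced Accuracy need hard 0/1 predictions. The sketch below uses made-up values and assumes a 0.5 decision threshold.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    roc_auc_score,
)

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])

# Assumed threshold of 0.5 to turn probabilities into class predictions.
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)               # uses probabilities
acc = accuracy_score(y_true, y_pred)              # uses hard predictions
f1 = f1_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(auc, acc, f1, bal_acc)
```

Balanced accuracy averages the recall of each class, which makes it more informative than plain accuracy when the classes are imbalanced.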

In [None]:
...

def train_classifiers(df, symptom_cols, preprocessor):
    """
    Train and evaluate multiple classification models to predict at-risk status.
    """
    
    ...

cls_df = train_classifiers(df, symptom_cols, preprocessor)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q3.5 Best Classification Model – Confusion Matrix & Feature Importance (15 points)

For the best-performing classification model, display its **confusion matrix** and visualize the **most important features** influencing the stroke-risk classification.

In [None]:
...

def analyze_best_classifier(df, cls_df, symptom_cols, preprocessor):
    """
    Analyze and visualize the best classification model.
    """
    
    ...

best_cls_name = analyze_best_classifier(df, cls_df, symptom_cols, preprocessor)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Q3.6 Kaggle Submission (15 points + bonus points)

In the Kaggle competition, you will generate predictions for an unseen test dataset.  
Since the hidden test file is **not provided**, we will simulate this process using our model’s predictions on the validation (test) split.

**Tasks:**
1. Use your **best classification model** to predict stroke risk (`at_risk`) on the test set.  
2. Create a **submission DataFrame** with columns:  
   - `id` (sequential from 1 to n)  
   - `at_risk` (predicted 0 or 1)  
3. Save it as `kaggle_submission.csv` (the default `filename` in the function below) to simulate a Kaggle submission file.

The top 10 finishers get an additional 10 bonus points.

Finishers ranked 11–20 get an additional 5 bonus points.
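
The submission layout described in the tasks can be sketched as follows. The predictions are placeholders, and the example writes to an in-memory buffer instead of a file so that it has no side effects; passing a filename to `to_csv` would write the same content to disk.

```python
import io

import numpy as np
import pandas as pd

# Placeholder predictions standing in for the best classifier's output.
preds = np.array([0, 1, 1, 0])

submission_toy = pd.DataFrame({
    "id": np.arange(1, len(preds) + 1),  # sequential IDs from 1 to n
    "at_risk": preds.astype(int),        # predicted 0 or 1
})

# index=False keeps the row index out of the CSV.
buffer = io.StringIO()
submission_toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```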

In [None]:
def create_kaggle_submission_simulated(df, symptom_cols, preprocessor, cls_df, filename="kaggle_submission.csv"):
    """
    Create Kaggle-style submission using the best classification model from cls_df.
    """
    
    ...

submission = create_kaggle_submission_simulated(df, symptom_cols, preprocessor, cls_df)

<!-- END QUESTION -->

