# Machine Learning Course: Model Development & Risk Classification

## Notebook 4: Comprehensive Model Training, Evaluation, and Risk Stratification

### Learning Objectives
By the end of this notebook, you will be able to:
1. Implement time-independent classification models for risk prediction
2. Develop time-dependent models using survival analysis approaches
3. Evaluate models using appropriate metrics (C-index, precision, recall, calibration)
4. Stratify patients into meaningful risk groups (low, medium, high risk)
5. Compare model performance across different algorithms and feature sets
6. Validate model robustness using cross-validation and bootstrap methods
7. Interpret model predictions and feature importance
8. Select and export final production-ready models

### Prerequisites
- Completed `01_data_exploration.ipynb`, `02_data_preprocessing.ipynb`, and `03_feature_selection.ipynb`
- Understanding of machine learning concepts and evaluation metrics
- Basic knowledge of survival analysis and risk stratification

### Model Development Framework
This notebook implements a comprehensive modeling pipeline:

**1. Time-Independent Classification**
- Logistic Regression with regularization
- Support Vector Machines (SVM)
- Random Forest and Gradient Boosting
- Neural Networks (Multi-layer Perceptron)

**2. Time-Dependent Classification**
- Cox Proportional Hazards models
- Random Survival Forests
- Time-varying coefficient models
- Accelerated Failure Time models

**3. Evaluation Metrics**
- **C-index (Concordance Index)**: Primary metric for ranking predictions
- **Precision/Recall**: Focus on precision in low-risk group identification
- **Calibration**: How well predicted probabilities match actual outcomes
- **Risk Stratification**: Ability to separate patients into meaningful groups

**4. Risk Stratification Analysis**
- Patient classification into risk groups (low, medium, high)
- Survival curve analysis by risk group
- Clinical utility assessment
- Decision curve analysis

**5. Model Selection & Validation**
- Cross-validation strategies
- Bootstrap validation
- Model ensemble approaches
- Final model selection and export

---

## 1. Setup and Imports

Let's import all necessary libraries for comprehensive model development, evaluation, and risk stratification.

In [None]:
# 📝 ACTIVITY 1: Import Libraries for Model Development and Risk Classification
#
# Your task: Import comprehensive libraries for machine learning, survival analysis, and evaluation
#
# TODO: Import core data manipulation and visualization libraries:
# 1. pandas (as pd) - for data manipulation and analysis
# 2. numpy (as np) - for numerical operations and array handling
# 3. matplotlib.pyplot (as plt) - for plotting and visualization
# 4. seaborn (as sns) - for statistical visualization
# 5. os, warnings, json - for system operations and data handling
#
# TODO: Import machine learning libraries:
# 6. From sklearn.linear_model import: LogisticRegression, ElasticNet
# 7. From sklearn.svm import: SVC
# 8. From sklearn.ensemble import: RandomForestClassifier, GradientBoostingClassifier
# 9. From sklearn.neural_network import: MLPClassifier
# 10. From sklearn.model_selection import: cross_val_score, StratifiedKFold, GridSearchCV, train_test_split
# 11. From sklearn.metrics import: roc_auc_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, roc_curve, auc
# 12. From sklearn.calibration import: calibration_curve, CalibratedClassifierCV
#
# TODO: Import survival analysis libraries:
# 13. Try to import lifelines:
#     - from lifelines import CoxPHFitter, KaplanMeierFitter
#     - from lifelines.utils import concordance_index
#     - from lifelines.statistics import logrank_test
# 14. Try to import scikit-survival (alternative):
#     - from sksurv.linear_model import CoxPHSurvivalAnalysis
#     - from sksurv.ensemble import RandomSurvivalForest
#     - from sksurv.metrics import concordance_index_censored
#
# TODO: Import additional statistical and evaluation libraries:
# 15. From scipy.stats import: chi2_contingency, mannwhitneyu
# 16. From scipy import stats
# 17. import joblib - for model saving/loading
# 18. import time - for timing model training
#
# TODO: Configure environment:
# 19. Suppress warnings: warnings.filterwarnings('ignore')
# 20. Set matplotlib and seaborn styles
# 21. Set random seeds for reproducibility (np.random.seed(42))
# 22. Set pandas display options for better output
#
# TODO: Check library availability:
# 23. Test survival analysis library imports
# 24. Print versions of key libraries
# 25. Provide installation instructions for missing libraries
# 26. Set flags for available functionality
#
# Expected output: Successfully imported libraries with availability status for survival analysis

# Write your code below:
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# import os
# import warnings
# import json
# import joblib
# import time
# 
# # Machine Learning libraries
# from sklearn.linear_model import LogisticRegression, ElasticNet
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# from sklearn.neural_network import MLPClassifier
# from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, train_test_split
# from sklearn.metrics import (roc_auc_score, precision_score, recall_score, f1_score, 
#                              classification_report, confusion_matrix, roc_curve, auc)
# from sklearn.calibration import calibration_curve, CalibratedClassifierCV
# 
# # Statistical libraries
# from scipy.stats import chi2_contingency, mannwhitneyu
# from scipy import stats
# 
# # Check survival analysis libraries availability
# lifelines_available = False
# sksurv_available = False
# 
# try:
#     from lifelines import CoxPHFitter, KaplanMeierFitter
#     from lifelines.utils import concordance_index
#     from lifelines.statistics import logrank_test
#     lifelines_available = True
#     print("✓ Lifelines available for survival analysis")
# except ImportError:
#     print("⚠️  Lifelines not available. Install with: pip install lifelines")
# 
# try:
#     from sksurv.linear_model import CoxPHSurvivalAnalysis
#     from sksurv.ensemble import RandomSurvivalForest
#     from sksurv.metrics import concordance_index_censored
#     sksurv_available = True
#     print("✓ Scikit-survival available for survival analysis")
# except ImportError:
#     print("⚠️  Scikit-survival not available. Install with: pip install scikit-survival")
# 
# # Configure environment
# warnings.filterwarnings('ignore')
# plt.style.use('default')
# sns.set_palette("Set2")
# np.random.seed(42)
# pd.set_option('display.max_columns', 20)
# 
# import sklearn  # required for sklearn.__version__ below
# print("\n📦 Library versions:")
# print(f"Pandas: {pd.__version__}")
# print(f"NumPy: {np.__version__}")
# print(f"Scikit-learn: {sklearn.__version__}")
# print("\n🎯 Model development environment ready!")
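
The availability checks in Activity 1 can also be done without importing the optional packages first. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `check_library` is hypothetical and not part of the course materials.

```python
import importlib.util
from importlib import metadata


def check_library(module_name, dist_name=None):
    """Return (available, version_or_None) without importing the module.

    module_name is the import name (e.g. 'sksurv'); dist_name is the
    pip distribution name when it differs (e.g. 'scikit-survival').
    """
    if importlib.util.find_spec(module_name) is None:
        return False, None
    try:
        return True, metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. stdlib modules)
        return True, None


availability = {
    "lifelines": check_library("lifelines"),
    "sksurv": check_library("sksurv", dist_name="scikit-survival"),
    "sklearn": check_library("sklearn", dist_name="scikit-learn"),
}

for name, (available, version) in availability.items():
    status = f"available (version {version})" if available else "missing"
    print(f"{name}: {status}")
```

The resulting booleans can feed the `lifelines_available` and `sksurv_available` flags used later in the notebook.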

## 2. Load Data and Feature Sets

Load the preprocessed data and selected feature sets from previous notebooks.

In [None]:
# 📝 ACTIVITY 2: Load Preprocessed Data and Selected Feature Sets
#
# Your task: Load clean datasets and optimized feature sets for model development
#
# TODO: Set up data paths:
# 1. Define PREPROCESSED_PATH = '../results/preprocessed/'
# 2. Define FEATURE_SELECTION_PATH = '../results/feature_selection/'
# 3. Define MODELS_PATH = '../results/models/'
# 4. Create models directory: os.makedirs(MODELS_PATH, exist_ok=True)
#
# TODO: Load training and validation data:
# 5. Load training features: X_train = pd.read_csv(PREPROCESSED_PATH + 'splits/X_train.csv', index_col=0)
# 6. Load training target: y_train = pd.read_csv(PREPROCESSED_PATH + 'splits/y_train.csv', index_col=0)['target']
# 7. Load validation data: X_val, y_val
# 8. Load test data: X_test, y_test (for final evaluation)
# 9. Print data dimensions and class distributions
#
# TODO: Load selected feature sets:
# 10. Load feature selection results (if available):
#     - Load ensemble feature sets from feature selection notebook
#     - Load method-specific feature sets (RF, Boruta, correlation, etc.)
#     - Handle case where feature selection results might not exist
# 11. If feature selection results not available, use all features
#
# TODO: Prepare multiple feature sets for comparison:
# 12. Create different feature sets for model comparison:
#     - full_features: all available features
#     - top_100_features: top 100 by variance or correlation
#     - top_50_features: top 50 most important features
#     - ensemble_features: if available from feature selection
# 13. Print feature set sizes and overlap analysis
#
# TODO: Load survival data (if available):
# 14. Check if survival information is available in clinical data
# 15. If available, prepare survival DataFrame with:
#     - duration: survival time (OS_MONTHS or similar)
#     - event: event indicator (1 if event occurred, 0 if censored)
# 16. Align survival data with feature matrices
# 17. Handle missing survival data appropriately
#
# TODO: Create data summary:
# 18. Print comprehensive data loading summary:
#     - Sample counts for train/validation/test
#     - Feature set sizes and descriptions
#     - Class balance in each split
#     - Survival data availability
#     - Memory usage statistics
#
# TODO: Validate data integrity:
# 19. Check for any remaining missing values
# 20. Verify that indices align across X and y
# 21. Confirm data types are appropriate for modeling
# 22. Check for any data leakage between splits
#
# Expected output: Loaded datasets with multiple feature sets ready for model development

# Write your code below:
# PREPROCESSED_PATH = '../results/preprocessed/'
# FEATURE_SELECTION_PATH = '../results/feature_selection/'
# MODELS_PATH = '../results/models/'
# os.makedirs(MODELS_PATH, exist_ok=True)
# 
# print("LOADING PREPROCESSED DATA AND FEATURE SETS")
# print("="*50)
# ...
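
The integrity checks listed in Activity 2 (missing values, index alignment, leakage between splits) can be sketched as below. This is an illustration on toy data, not the course solution: the helper names `validate_splits` and `toy_split` and the patient IDs are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def validate_splits(splits):
    """Check a dict of {split_name: (X, y)} and return a list of problems."""
    problems = []
    for name, (X, y) in splits.items():
        # X and y must describe the same samples in the same order
        if not X.index.equals(y.index):
            problems.append(f"{name}: X and y indices differ")
        n_missing = int(X.isna().sum().sum())
        if n_missing:
            problems.append(f"{name}: {n_missing} missing feature values")
    # A sample ID present in two splits indicates leakage
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = splits[a][0].index.intersection(splits[b][0].index)
            if len(shared):
                problems.append(f"{a}/{b}: {len(shared)} sample IDs appear in both splits")
    return problems


rng = np.random.default_rng(42)


def toy_split(ids):
    """Build a small random (X, y) pair indexed by the given sample IDs."""
    X = pd.DataFrame(rng.normal(size=(len(ids), 3)), index=ids, columns=["f1", "f2", "f3"])
    y = pd.Series(rng.integers(0, 2, size=len(ids)), index=ids, name="target")
    return X, y


splits = {
    "train": toy_split(["p1", "p2", "p3", "p4"]),
    "val": toy_split(["p5", "p6"]),
    "test": toy_split(["p7", "p2"]),  # p2 is deliberately duplicated
}
problems = validate_splits(splits)
print(problems)
```

With the real data, the dictionary would hold `(X_train, y_train)`, `(X_val, y_val)`, and `(X_test, y_test)`; an empty list means none of these checks failed.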

## 3. Time-Independent Classification Models

Develop classification models that predict risk without considering time-to-event information.

In [None]:
# 📝 ACTIVITY 3: Train Time-Independent Classification Models
#
# Your task: Implement multiple classification algorithms for risk prediction
#
# TODO: Set up model training framework:
# 1. Print "TIME-INDEPENDENT CLASSIFICATION MODELS" header with separators
# 2. Initialize dictionary to store model results: model_results = {}
# 3. Set up cross-validation strategy: cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#
# TODO: Define model configurations:
# 4. Create dictionary of models to evaluate:
#    models = {
#        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
#        'Logistic Regression (L1)': LogisticRegression(penalty='l1', solver='liblinear', random_state=42),
#        'Logistic Regression (L2)': LogisticRegression(penalty='l2', random_state=42, max_iter=1000),
#        'SVM (RBF)': SVC(probability=True, random_state=42),
#        'SVM (Linear)': SVC(kernel='linear', probability=True, random_state=42),
#        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
#        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
#        'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
#    }
#
# TODO: Train models on each feature set:
# 5. For each feature set (full_features, top_100, top_50, ensemble):
#    - Create feature subset: X_train_subset = X_train[feature_set]
#    - For each model in models dictionary:
#      - Train model using cross-validation
#      - Calculate performance metrics
#      - Store results in model_results
#
# TODO: Implement comprehensive model evaluation:
# 6. For each model and feature set combination:
#    - Calculate cross-validation scores: cv_scores = cross_val_score(model, X_train_subset, y_train, cv=cv, scoring='roc_auc')
#    - Calculate mean and std of CV scores
#    - Fit model on full training set: model.fit(X_train_subset, y_train)
#    - Make predictions on validation set: y_val_pred = model.predict_proba(X_val_subset)[:, 1]
#    - Calculate validation metrics: AUC, precision, recall, F1-score
#
# TODO: Calculate additional evaluation metrics:
# 7. For each trained model:
#    - Calculate C-index (concordance index): c_index = roc_auc_score(y_val, y_val_pred)
#    - Calculate precision in low-risk group: identify patients predicted as low risk, calculate precision
#    - Calculate recall for high-risk group: sensitivity for identifying high-risk patients
#    - Calculate calibration metrics: calibration_curve(y_val, y_val_pred, n_bins=10)
#
# TODO: Implement model calibration:
# 8. For each model, create calibrated version:
#    - calibrated_model = CalibratedClassifierCV(base_model, method='sigmoid', cv=3)  # 'sigmoid' is Platt scaling
#    - Fit calibrated model and evaluate calibration improvement
#    - Store calibrated predictions for comparison
#
# TODO: Create performance comparison:
# 9. Create summary DataFrame with all model results:
#    - Columns: Model, Feature Set, CV AUC (mean±std), Validation AUC, Precision, Recall, F1, C-index
#    - Sort by validation AUC or C-index
#    - Identify top-performing models
#
# TODO: Visualize model performance:
# 10. Create ROC curves for top models
# 11. Create calibration plots comparing models
# 12. Create feature importance plots for tree-based models
# 13. Create performance comparison bar plots
#
# TODO: Analyze feature importance:
# 14. For models that provide feature importance (RF, GB):
#     - Extract feature importance scores
#     - Create feature importance rankings
#     - Compare feature importance across models
#     - Identify most consistently important features
#
# TODO: Save model results:
# 15. Save trained models: joblib.dump(model, MODELS_PATH + f'{model_name}_{feature_set}.pkl')
# 16. Save model performance summary
# 17. Save predictions on validation set
# 18. Create model training report
#
# Expected output: Trained time-independent models with comprehensive performance evaluation

# Write your code below:
# print("TIME-INDEPENDENT CLASSIFICATION MODELS")
# print("="*50)
# 
# # Initialize results storage
# model_results = {}
# cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ...
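
The train, evaluate, and calibrate loop from Activity 3 might look like the sketch below. It uses a synthetic dataset from `make_classification` as a stand-in for the preprocessed splits and only two of the listed models to keep it short; those choices are assumptions for illustration. Note that scikit-learn names Platt scaling `method="sigmoid"`.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the train/validation splits loaded in Activity 2
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model_results = {}
for name, model in models.items():
    # Cross-validated AUC on the training split only
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

    # Refit on the full training split, then score the validation split
    model.fit(X_train, y_train)
    val_prob = model.predict_proba(X_val)[:, 1]

    # Calibrated copy: CalibratedClassifierCV refits clones internally
    calibrated = CalibratedClassifierCV(model, method="sigmoid", cv=3)
    calibrated.fit(X_train, y_train)
    cal_prob = calibrated.predict_proba(X_val)[:, 1]

    model_results[name] = {
        "cv_auc_mean": cv_scores.mean(),
        "cv_auc_std": cv_scores.std(),
        "val_auc": roc_auc_score(y_val, val_prob),
        "brier_raw": brier_score_loss(y_val, val_prob),
        "brier_calibrated": brier_score_loss(y_val, cal_prob),
    }

for name, res in model_results.items():
    print(name, {k: round(v, 3) for k, v in res.items()})
```

Comparing `brier_raw` with `brier_calibrated` gives a quick read on whether calibration helped; it does not always, especially for models that are already well calibrated.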

## 4. Time-Dependent Classification Models

Develop survival analysis models that incorporate time-to-event information for risk prediction.

In [None]:
# 📝 ACTIVITY 4: Train Time-Dependent Classification Models
#
# Your task: Implement survival analysis models for time-dependent risk prediction
#
# TODO: Set up survival analysis framework:
# 1. Print "TIME-DEPENDENT CLASSIFICATION MODELS" header with separators
# 2. Check if survival data is available from data loading
# 3. Initialize dictionary to store survival model results: survival_results = {}
#
# TODO: Prepare survival data:
# 4. If survival data available:
#    - Create survival DataFrame with duration, event, and features
#    - Ensure proper data types: duration as float, event as bool/int
#    - Handle missing or negative survival times
# 5. If survival data not available:
#    - Create proxy survival data using target variable and synthetic times
#    - Note limitations of proxy data approach
#
# TODO: Implement Cox Proportional Hazards models (if lifelines available):
# 6. For each feature set:
#    - Create Cox model: cph = CoxPHFitter()
#    - Prepare data: survival_df = pd.concat([X_train_subset, survival_data], axis=1)
#    - Fit model: cph.fit(survival_df, duration_col='duration', event_col='event')
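#    - Calculate training C-index: c_index = cph.concordance_index_ (for held-out data, use cph.score(val_df, scoring_method='concordance_index'), where val_df is a validation DataFrame with the same feature, duration, and event columns)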
#    - Extract hazard ratios and confidence intervals
#    - Make risk predictions: risk_scores = cph.predict_partial_hazard(X_val_subset)
#
# TODO: Implement alternative Cox models (if scikit-survival available):
# 7. Use scikit-survival for Cox regression:
#    - Create structured array for survival: y_structured = np.array([(event, duration) for event, duration in zip(events, durations)], dtype=[('event', bool), ('duration', float)])
#    - Fit Cox model: cox_model = CoxPHSurvivalAnalysis().fit(X_train_subset, y_structured)
#    - Calculate C-index: c_index = concordance_index_censored(events_val, durations_val, risk_scores)[0]
#
# TODO: Implement Random Survival Forest (if available):
# 8. Train RSF model:
#    - rsf_model = RandomSurvivalForest(n_estimators=100, random_state=42)
#    - rsf_model.fit(X_train_subset, y_structured)
#    - Get risk scores: risk_scores = rsf_model.predict(X_val_subset)
#    - Calculate C-index and other metrics
#
# TODO: Alternative time-dependent approach (if survival libraries unavailable):
# 9. Implement time-stratified classification:
#    - Create time-based risk groups using target and available time information
#    - Train time-specific classifiers for different time horizons
#    - Combine predictions across time points
#    - Calculate time-dependent AUC
#
# TODO: Evaluate survival model performance:
# 10. For each survival model:
#     - Calculate C-index (primary metric for survival models)
#     - Calculate time-dependent AUC at different time points (if possible)
#     - Assess proportional hazards assumption (for Cox models)
#     - Calculate log-likelihood and AIC for model comparison
#
# TODO: Create risk stratification using survival models:
# 11. Use model predictions to create risk groups:
#     - Calculate risk score percentiles (tertiles or quartiles)
#     - Create low, medium, high risk groups based on score thresholds
#     - Validate risk group separation using log-rank tests
#
# TODO: Visualize survival model results:
# 12. Create Kaplan-Meier curves by risk group
# 13. Plot hazard ratios with confidence intervals (for Cox models)
# 14. Create time-dependent ROC curves (if possible)
# 15. Show survival probability predictions
#
# TODO: Compare time-dependent vs time-independent models:
# 16. Compare C-index between survival models and classification models
# 17. Assess which approach better stratifies patients
# 18. Analyze advantages and limitations of each approach
#
# TODO: Handle missing survival analysis libraries:
# 19. If neither lifelines nor scikit-survival available:
#     - Implement simplified time-dependent analysis
#     - Use logistic regression with time-stratified outcomes
#     - Provide guidance on installing survival analysis libraries
#     - Note limitations of simplified approach
#
# TODO: Save survival model results:
# 20. Save trained survival models
# 21. Save risk scores and group assignments
# 22. Create survival analysis report
# 23. Document model assumptions and limitations
#
# Expected output: Time-dependent models with C-index evaluation and risk stratification

# Write your code below:
# print("TIME-DEPENDENT CLASSIFICATION MODELS")
# print("="*50)
# 
# # Check survival data availability
# if 'survival_data' in locals() and survival_data is not None:
#     print("✓ Survival data available for time-dependent modeling")
#     survival_available = True
# else:
#     print("⚠️  No survival data available - using proxy approach")
#     survival_available = False
# 
# # Check survival analysis libraries
# if lifelines_available:
#     print("✓ Using lifelines for Cox regression")
# elif sksurv_available:
#     print("✓ Using scikit-survival for survival analysis")
# else:
#     print("⚠️  No survival analysis libraries - using time-stratified classification")
# ...
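
To make the survival C-index used throughout Activity 4 concrete, here is a simplified NumPy sketch of Harrell's concordance for right-censored data. It is an illustrative stand-in, not a replacement for `lifelines.utils.concordance_index` or `sksurv.metrics.concordance_index_censored`: the function name `harrell_c_index` is hypothetical, and pairs with tied durations are simply skipped.

```python
import numpy as np


def harrell_c_index(durations, events, risk_scores):
    """Simplified Harrell's C: higher risk score should mean earlier event.

    A pair (i, j) is comparable when subject i had an observed event and
    subject j was still under observation strictly after that time.
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk = np.asarray(risk_scores, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(durations)):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        later = durations > durations[i]
        comparable += int(later.sum())
        concordant += float((risk[i] > risk[later]).sum())
        concordant += 0.5 * float((risk[i] == risk[later]).sum())  # tied scores
    return concordant / comparable if comparable else float("nan")


# Toy example: the second subject is censored at t=4
durations = [2, 4, 6, 8]
events = [1, 0, 1, 1]
perfect = harrell_c_index(durations, events, [4, 3, 2, 1])
reversed_order = harrell_c_index(durations, events, [1, 2, 3, 4])
print(perfect, reversed_order)
```

One convention to watch: this sketch takes risk scores (higher means worse prognosis), whereas `lifelines.utils.concordance_index` expects scores where higher means longer survival, so partial hazards are usually negated before being passed to it.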

## 5. Comprehensive Model Evaluation

Evaluate all models using appropriate metrics including C-index, precision in low risk, and calibration.

In [None]:
# 📝 ACTIVITY 5: Comprehensive Model Evaluation with Clinical Metrics
#
# Your task: Evaluate all models using clinically relevant metrics and assessment frameworks
#
# TODO: Set up comprehensive evaluation framework:
# 1. Print "COMPREHENSIVE MODEL EVALUATION" header with separators
# 2. Combine results from time-independent and time-dependent models
# 3. Create unified evaluation DataFrame: all_model_results = pd.DataFrame()
#
# TODO: Calculate C-index for all models:
# 4. For time-independent models:
#    - For a binary outcome without censoring, the C-index equals the ROC AUC
#    - Calculate using: c_index = roc_auc_score(y_true, y_pred_proba)
# 5. For time-dependent models:
#    - Use survival-specific concordance index
#    - If lifelines: cph.concordance_index_
#    - If scikit-survival: concordance_index_censored()
# 6. Compare C-index values across all models
#
# TODO: Calculate precision in low-risk group:
# 7. For each model's risk predictions:
#    - Define low-risk threshold (e.g., bottom 25% or 33% of risk scores)
#    - low_risk_threshold = np.percentile(risk_scores, 25)
#    - low_risk_patients = risk_scores <= low_risk_threshold
#    - Calculate precision in low-risk group: precision_low_risk = (true_negatives) / (predicted_low_risk)
#    - This measures how well the model identifies truly low-risk patients
#
# TODO: Calculate precision in high-risk group:
# 8. Define high-risk threshold (e.g., top 25% of risk scores)
# 9. high_risk_threshold = np.percentile(risk_scores, 75)
# 10. Calculate precision in high-risk group: how many predicted high-risk are truly high-risk
# 11. Calculate recall for high-risk: sensitivity for detecting high-risk patients
#
# TODO: Assess model calibration:
# 12. For each model with probability outputs:
#     - Calculate calibration curve: fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_prob, n_bins=10)
#     - Calculate Brier score: brier_score = np.mean((y_prob - y_true) ** 2)
#     - Calculate calibration slope and intercept
#     - Assess reliability diagram visually
#
# TODO: Calculate additional clinical metrics:
# 13. Sensitivity (recall): ability to identify high-risk patients
# 14. Specificity: ability to identify low-risk patients correctly
# 15. Positive Predictive Value (PPV): precision for high-risk predictions
# 16. Negative Predictive Value (NPV): precision for low-risk predictions
# 17. Number Needed to Treat (NNT) estimates (if applicable)
#
# TODO: Implement risk stratification evaluation:
# 18. For each model, create risk groups (tertiles or quartiles):
#     - low_risk = bottom 33%, medium_risk = middle 33%, high_risk = top 33%
#     - Calculate event rates in each risk group
#     - Test statistical separation between groups using chi-square test
#     - Calculate hazard ratios between risk groups (if survival data available)
#
# TODO: Create comprehensive evaluation metrics table:
# 19. Create evaluation summary with columns:
#     - Model Name, Feature Set, C-index, AUC, Precision (Low Risk), Precision (High Risk)
#     - Sensitivity, Specificity, PPV, NPV, Brier Score, Calibration Slope
# 20. Sort by C-index or clinical utility metrics
# 21. Highlight top-performing models
#
# TODO: Visualize model evaluation results:
# 22. Create ROC curves comparison plot
# 23. Create calibration plots for all models
# 24. Create precision-recall curves
# 25. Create C-index comparison bar plot
# 26. Create risk group separation visualization
#
# TODO: Assess model clinical utility:
# 27. Calculate decision curve analysis (if possible):
#     - Plot net benefit across probability thresholds
#     - Compare with treat-all and treat-none strategies
# 28. Calculate area under the decision curve
# 29. Identify optimal probability thresholds for clinical decisions
#
# TODO: Evaluate model stability and robustness:
# 30. Bootstrap evaluation (if computationally feasible):
#     - Calculate confidence intervals for C-index and other metrics
#     - Assess metric stability across bootstrap samples
# 31. Cross-validation consistency analysis
# 32. Feature importance stability assessment
#
# TODO: Create model ranking and selection criteria:
# 33. Rank models by multiple criteria:
#     - Primary: C-index or AUC
#     - Secondary: Precision in low-risk group
#     - Tertiary: Calibration quality
#     - Clinical utility: Decision curve analysis
# 34. Create weighted scoring system combining multiple metrics
# 35. Identify top 3-5 models for further validation
#
# Expected output: Comprehensive evaluation report with clinical relevance assessment

# Write your code below:
# print("COMPREHENSIVE MODEL EVALUATION")
# print("="*50)
# 
# def calculate_precision_in_risk_group(y_true, risk_scores, percentile_low=25, percentile_high=75):
#     """Calculate precision in low and high risk groups"""
#     low_threshold = np.percentile(risk_scores, percentile_low)
#     high_threshold = np.percentile(risk_scores, percentile_high)
#     
#     # Low risk group precision (NPV-like metric)
#     low_risk_mask = risk_scores <= low_threshold
#     if np.sum(low_risk_mask) > 0:
#         precision_low = np.sum((y_true == 0) & low_risk_mask) / np.sum(low_risk_mask)
#     else:
#         precision_low = np.nan
#     
#     # High risk group precision (PPV-like metric)
#     high_risk_mask = risk_scores >= high_threshold
#     if np.sum(high_risk_mask) > 0:
#         precision_high = np.sum((y_true == 1) & high_risk_mask) / np.sum(high_risk_mask)
#     else:
#         precision_high = np.nan
#     
#     return precision_low, precision_high
# ...
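
The threshold-based clinical metrics from Activity 5 (sensitivity, specificity, PPV, NPV) and the Brier score can be derived from a confusion matrix as sketched below. The helper name `clinical_metrics`, the 0.5 default threshold, and the four-sample toy input are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, confusion_matrix


def clinical_metrics(y_true, y_prob, threshold=0.5):
    """Compute threshold-based clinical metrics plus the Brier score."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)

    # labels=[0, 1] guarantees a 2x2 matrix even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    def safe_div(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": safe_div(tp, tp + fn),  # recall for the high-risk class
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),          # precision for high-risk calls
        "npv": safe_div(tn, tn + fn),          # precision for low-risk calls
        "brier": brier_score_loss(y_true, y_prob),
    }


metrics = clinical_metrics([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
print(metrics)
```

Because these metrics depend on the chosen threshold, it is worth reporting the threshold alongside them; the Brier score is the exception, since it uses the probabilities directly.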

## 6. Risk Stratification Analysis

Stratify patients into meaningful risk groups and evaluate the clinical utility of these classifications.
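
One common way to quantify clinical utility is the net benefit used in decision curve analysis: true positives per patient minus false positives per patient, weighted by the odds of the threshold probability. The sketch below is a minimal illustration with hypothetical helper names (`net_benefit`, `treat_all_net_benefit`) and toy inputs; it is not taken from the course solution.

```python
import numpy as np


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    n = len(y_true)
    treated = y_prob >= threshold
    tp = np.sum(treated & (y_true == 1))
    fp = np.sum(treated & (y_true == 0))
    # False positives are weighted by the odds of the threshold probability
    return tp / n - (fp / n) * threshold / (1 - threshold)


def treat_all_net_benefit(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = np.mean(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]
nb_model = net_benefit(y_true, y_prob, threshold=0.2)
nb_all = treat_all_net_benefit(y_true, threshold=0.2)
print(nb_model, nb_all)  # the treat-none strategy always has net benefit 0
```

Evaluating both functions over a grid of thresholds (for example `np.linspace(0.05, 0.95, 19)`) and plotting the results gives the decision curve.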

In [None]:
# 📝 ACTIVITY 6: Patient Risk Stratification and Clinical Utility Analysis
#
# Your task: Create meaningful patient risk groups and evaluate their clinical utility
#
# TODO: Set up risk stratification framework:
# 1. Print "PATIENT RISK STRATIFICATION ANALYSIS" header with separators
# 2. Select top-performing models from evaluation results
# 3. Initialize risk stratification results storage: stratification_results = {}
#
# TODO: Create risk group definitions:
# 4. For each top model, define risk groups using different strategies:
#    - Tertiles: low (0-33%), medium (33-67%), high (67-100%)
#    - Quartiles: low (0-25%), low-medium (25-50%), medium-high (50-75%), high (75-100%)
#    - Clinical thresholds: if known risk score cutoffs exist
#    - Equal-sized groups vs outcome-based groups
#
# TODO: Implement risk group assignment:
# 5. For each model and stratification strategy:
#    - Calculate risk score percentiles: risk_percentiles = np.percentile(risk_scores, [33, 67])
#    - Assign patients to groups: 
#      * low_risk = risk_scores <= risk_percentiles[0]
#      * medium_risk = (risk_scores > risk_percentiles[0]) & (risk_scores <= risk_percentiles[1])
#      * high_risk = risk_scores > risk_percentiles[1]
#    - Store group assignments and score thresholds
#
# TODO: Evaluate risk group separation:
# 6. For each risk stratification:
#    - Calculate event rates in each group: event_rate_low, event_rate_medium, event_rate_high
#    - Test statistical significance: chi2, p_value, dof, expected = chi2_contingency([[events_low, non_events_low], [events_high, non_events_high]])
#    - Calculate odds ratios between groups
#    - Assess monotonic relationship: low < medium < high event rates
#
# TODO: Survival analysis by risk group (if survival data available):
# 7. Create Kaplan-Meier survival curves for each risk group:
#    - Use KaplanMeierFitter for each group
#    - kmf_low.fit(durations_low, events_low, label='Low Risk')
#    - Plot survival curves with confidence intervals
# 8. Perform log-rank tests between groups:
#    - Test low vs high: logrank_test(durations_low, durations_high, event_observed_A=events_low, event_observed_B=events_high)
#    - Test pairwise differences between all groups
# 9. Calculate hazard ratios between risk groups
#
# TODO: Calculate clinical utility metrics:
# 10. For each risk stratification strategy:
#     - Net Reclassification Improvement (NRI): improvement in patient classification
#     - Integrated Discrimination Improvement (IDI): improvement in risk discrimination
#     - Decision Curve Analysis: net benefit across probability thresholds
#     - Number needed to screen/treat for each risk group
#
# TODO: Assess risk group characteristics:
# 11. For each risk group, analyze:
#     - Demographic characteristics (age, stage, etc.)
#     - Biomarker profiles (if available)
#     - Treatment response patterns (if available)
#     - Survival outcomes and follow-up times
#
# TODO: Create comprehensive risk stratification visualization:
# 12. Create multi-panel figure showing:
#     - Risk score distributions by group
#     - Event rates by risk group (bar plot with confidence intervals)
#     - Kaplan-Meier survival curves (if applicable)
#     - ROC curves with group-specific performance
#
# TODO: Validate risk group stability:
# 13. Bootstrap analysis of risk group assignments:
#     - Calculate stability of group membership across bootstrap samples
#     - Assess consistency of event rate differences
#     - Evaluate robustness of survival curve separation
#
# TODO: Compare risk stratification approaches:
# 14. Compare different models' risk stratification:
#     - Agreement in risk group assignments between models
#     - Correlation of risk scores across models
#     - Clinical utility comparison (which provides better stratification)
#
# TODO: Create clinical interpretation:
# 15. For the best risk stratification approach:
#     - Define clinical meaning of each risk group
#     - Suggest treatment/monitoring strategies for each group
#     - Calculate absolute risk estimates with confidence intervals
#     - Provide risk communication guidelines
#
# TODO: Evaluate practical implementation:
# 16. Assess feasibility of risk stratification in clinical practice:
#     - Feature availability and measurement costs
#     - Model complexity and interpretability
#     - Integration with existing clinical workflows
#     - Potential for clinical decision support
#
# TODO: Save risk stratification results:
# 17. Export risk group assignments for validation set
# 18. Save risk score thresholds and group definitions
# 19. Create clinical implementation guide
# 20. Generate risk stratification report with interpretation
#
# Expected output: Clinically meaningful risk groups with comprehensive evaluation and implementation guidance

# Write your code below:
# print("PATIENT RISK STRATIFICATION ANALYSIS")
# print("="*50)
# 
# def create_risk_groups(risk_scores, strategy='tertiles'):
#     """Create risk groups using different strategies"""
#     if strategy == 'tertiles':
#         thresholds = np.percentile(risk_scores, [33.33, 66.67])
#         groups = np.where(risk_scores <= thresholds[0], 'Low',
#                          np.where(risk_scores <= thresholds[1], 'Medium', 'High'))
#     elif strategy == 'quartiles':
#         thresholds = np.percentile(risk_scores, [25, 50, 75])
#         groups = np.where(risk_scores <= thresholds[0], 'Low',
#                          np.where(risk_scores <= thresholds[1], 'Low-Medium',
#                                  np.where(risk_scores <= thresholds[2], 'Medium-High', 'High')))
#     else:  # binary
#         threshold = np.percentile(risk_scores, 50)
#         groups = np.where(risk_scores <= threshold, 'Low', 'High')
#         thresholds = [threshold]
#     
#     return groups, thresholds
# 
# def evaluate_risk_group_separation(y_true, risk_groups):
#     """Evaluate statistical separation between risk groups"""
#     unique_groups = np.unique(risk_groups)
#     event_rates = {}
#     
#     for group in unique_groups:
#         mask = risk_groups == group
#         event_rate = np.mean(y_true[mask])
#         n_events = np.sum(y_true[mask])
#         n_total = np.sum(mask)
#         
#         # Calculate confidence interval for event rate
#         se = np.sqrt(event_rate * (1 - event_rate) / n_total)
#         ci_lower = max(0, event_rate - 1.96 * se)
#         ci_upper = min(1, event_rate + 1.96 * se)
#         
#         event_rates[group] = {
#             'rate': event_rate,
#             'events': n_events,
#             'total': n_total,
#             'ci_lower': ci_lower,
#             'ci_upper': ci_upper
#         }
#     
#     return event_rates
# ...
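
The hint code above builds a normal-approximation (Wald) interval for each group's event rate. That interval collapses to zero width when a group has no events, which is common in a well-separated low-risk group. One widely used alternative is the Wilson score interval. The sketch below is a minimal illustration; the function name `wilson_interval` and the example counts are assumptions, not part of the course code.

```python
import math


def wilson_interval(n_events, n_total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_events / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: a low-risk group with 0 events out of 10 patients
lower, upper = wilson_interval(0, 10)
print(f"Wilson 95% CI for 0/10 events: [{lower:.3f}, {upper:.3f}]")
```

Unlike the Wald interval, the upper bound here stays above zero for an event-free group, which gives a more honest picture of uncertainty when groups are small.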

## 7. Model Comparison and Selection

Compare all models comprehensively and select the best performing model for final validation.

In [None]:
# 📝 ACTIVITY 7: Comprehensive Model Comparison and Final Selection
#
# Your task: Compare all models systematically and select the best model for clinical use
#
# TODO: Set up model comparison framework:
# 1. Print "MODEL COMPARISON AND SELECTION" header with separators
# 2. Consolidate results from all models (time-independent + time-dependent)
# 3. Create comprehensive comparison DataFrame with all metrics
#
# TODO: Create multi-criteria evaluation matrix:
# 4. Define evaluation criteria with weights:
#    - Primary criterion: C-index/AUC (weight: 0.3)
#    - Clinical utility: Precision in low-risk group (weight: 0.25)
#    - Discrimination: Risk group separation (weight: 0.2)
#    - Calibration: Brier score or calibration slope (weight: 0.15)
#    - Interpretability: Model complexity score (weight: 0.1)
# 5. Normalize all metrics to 0-1 scale for comparison
#
# TODO: Calculate composite performance scores:
# 6. For each model, calculate weighted composite score:
#    - composite_score = Σ(weight_i × normalized_metric_i)
# 7. Rank models by composite score
# 8. Identify top 5 models for detailed comparison
#
# TODO: Perform statistical significance testing:
# 9. Compare top models using statistical tests:
#    - Compare C-index differences using DeLong test (if available)
#    - Bootstrap confidence intervals for metric differences
#    - McNemar test for paired model comparisons (if applicable)
# 10. Identify models with significantly better performance
#
# TODO: Analyze model agreement and diversity:
# 11. Calculate prediction correlations between top models:
#     - Pearson correlation of risk scores
#     - Agreement in risk group assignments (kappa statistic)
#     - Identify complementary vs redundant models
# 12. Consider ensemble potential based on model diversity
#
# TODO: Evaluate model ensemble approaches:
# 13. Create ensemble models using top performers:
#     - Simple averaging: ensemble_score = mean(model_scores)
#     - Weighted averaging: based on individual model performance
#     - Stacking: train meta-model on model predictions
# 14. Evaluate ensemble performance vs individual models
# 15. Test ensemble stability and overfitting
#
# TODO: Assess practical considerations:
# 16. For each top model, evaluate:
#     - Feature requirements and data availability
#     - Computational complexity and inference time
#     - Model interpretability and explainability
#     - Robustness to missing data
#     - Calibration across different subgroups
#
# TODO: Create comprehensive comparison visualization:
# 17. Create radar chart showing model performance across metrics
# 18. Create scatter plot of C-index vs precision in low-risk
# 19. Create model ranking visualization with confidence intervals
# 20. Show feature importance comparison for top models
#
# TODO: Validate model selection criteria:
# 21. Sensitivity analysis: how does ranking change with different weights?
# 22. Cross-validation stability: consistent top performers across CV folds?
# 23. Subgroup analysis: performance consistency across patient subgroups
#
# TODO: Clinical decision-making analysis:
# 24. Decision curve analysis for top models:
#     - Net benefit across probability thresholds
#     - Clinical utility comparison
#     - Optimal decision thresholds for each model
# 25. Cost-effectiveness considerations (if cost data available)
#
# TODO: Final model selection:
# 26. Apply selection criteria in priority order:
#     - Statistical performance (C-index, calibration)
#     - Clinical utility (risk stratification quality)
#     - Practical implementation (interpretability, robustness)
#     - Regulatory considerations (if applicable)
# 27. Select final model with justification
# 28. Select backup model in case of implementation issues
#
# TODO: Document selection rationale:
# 29. Create detailed model selection report with:
#     - Comparison methodology and criteria
#     - Statistical analysis results
#     - Clinical utility assessment
#     - Implementation considerations
#     - Final recommendation with rationale
#
# Expected output: Selected final model with comprehensive justification and backup options

# Write your code below:
# print("MODEL COMPARISON AND SELECTION")
# print("="*50)
# 
# def normalize_metric(metric_values, higher_is_better=True):
#     """Normalize metrics to 0-1 scale"""
#     min_val, max_val = np.min(metric_values), np.max(metric_values)
#     if max_val == min_val:
#         return np.ones_like(metric_values)
#     
#     normalized = (metric_values - min_val) / (max_val - min_val)
#     if not higher_is_better:
#         normalized = 1 - normalized
#     return normalized
# 
# def calculate_composite_score(metrics_df, weights):
#     """Calculate weighted composite performance score"""
#     composite_scores = np.zeros(len(metrics_df))
#     
#     for metric, weight in weights.items():
#         if metric in metrics_df.columns:
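#             # Determine if higher is better for this metric. Only the names listed here are
#             # treated as lower-is-better, so a Brier score stored under 'calibration_score'
#             # must be added to this list (or stored as 'brier_score') to be scored correctly.
#             higher_is_better = metric not in ['brier_score', 'complexity_score']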
#             normalized_values = normalize_metric(metrics_df[metric].values, higher_is_better)
#             composite_scores += weight * normalized_values
#     
#     return composite_scores
# 
# # Define evaluation criteria and weights
# evaluation_criteria = {
#     'c_index': 0.30,
#     'precision_low_risk': 0.25,
#     'risk_group_separation': 0.20,
#     'calibration_score': 0.15,
#     'interpretability_score': 0.10
# }
# ...
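
The weighted composite score and the weight sensitivity check from the activity above can be tried out on a toy table before real results are available. The sketch below is self-contained and uses NumPy only; the model names, metric values, and both weight sets are hypothetical placeholders, not results from this course. Weights are rescaled to sum to 1 so that scores stay comparable if a criterion is dropped.

```python
import numpy as np


def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; constant columns map to 1 for every model."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.ones_like(values)
    scaled = (values - values.min()) / span
    return scaled if higher_is_better else 1.0 - scaled


def composite(metrics, weights):
    """Weighted sum of min-max normalized metrics (all higher-is-better here)."""
    total_weight = sum(weights.values())
    n_models = len(next(iter(metrics.values())))
    score = np.zeros(n_models)
    for name, weight in weights.items():
        score += (weight / total_weight) * min_max(metrics[name])
    return score


# Hypothetical metric values for three placeholder models
models = ["model_a", "model_b", "model_c"]
metrics = {
    "c_index": [0.70, 0.75, 0.80],
    "precision_low_risk": [0.90, 0.95, 0.85],
}

base = composite(metrics, {"c_index": 0.6, "precision_low_risk": 0.4})
alt = composite(metrics, {"c_index": 0.8, "precision_low_risk": 0.2})

print("Top model, base weights:", models[int(np.argmax(base))])
print("Top model, alternative weights:", models[int(np.argmax(alt))])
```

In this toy table the top-ranked model changes when more weight moves to the C-index, which is the kind of instability the sensitivity analysis in step 21 is meant to surface.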

## 8. Final Model Validation and Export

Validate the selected model on test data and prepare for clinical implementation.
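
Activity 8 computes the test C-index with `roc_auc_score`. For a binary outcome without censoring the two quantities coincide: both are the fraction of (event, non-event) pairs in which the event case receives the higher score, with ties counted as one half. The pure-Python sketch below illustrates that definition on made-up data; the function name `concordance_binary` is an assumption for illustration. It does not handle censored time-to-event data, which requires a survival-specific concordance such as Harrell's C.

```python
def concordance_binary(y_true, scores):
    """Pairwise concordance for a binary outcome (equals ROC AUC)."""
    event_scores = [s for y, s in zip(y_true, scores) if y == 1]
    non_event_scores = [s for y, s in zip(y_true, scores) if y == 0]
    if not event_scores or not non_event_scores:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for s_event in event_scores:
        for s_non_event in non_event_scores:
            if s_event > s_non_event:
                concordant += 1.0  # correctly ordered pair
            elif s_event == s_non_event:
                concordant += 0.5  # tied scores count as half
    return concordant / (len(event_scores) * len(non_event_scores))


# Made-up labels and scores: 3 of the 4 (event, non-event) pairs are ordered correctly
print(concordance_binary([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

The double loop is quadratic in sample size, so it is only meant for checking understanding on small examples.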

In [None]:
# 📝 ACTIVITY 8: Final Model Validation and Clinical Implementation
#
# Your task: Validate the selected model on test data and prepare for clinical deployment
#
# TODO: Set up final validation framework:
# 1. Print "FINAL MODEL VALIDATION AND EXPORT" header with separators
# 2. Load the selected final model from previous activity
# 3. Prepare test data with selected features: X_test_final = X_test[selected_features]
#
# TODO: Perform final model validation on test set:
# 4. Make predictions on test set: test_predictions = final_model.predict_proba(X_test_final)[:, 1]
# 5. Calculate final performance metrics:
#    - Test C-index: test_c_index = roc_auc_score(y_test, test_predictions)  (for a binary outcome, AUC equals the C-index)
#    - Test precision in low-risk: calculate using test predictions
#    - Test calibration: calibration_curve(y_test, test_predictions)
#    - Test risk stratification: create risk groups and evaluate separation
#
# TODO: Compare test performance to validation performance:
# 6. Calculate performance differences:
#    - c_index_difference = test_c_index - validation_c_index
#    - precision_difference = test_precision_low - validation_precision_low
# 7. Assess overfitting: a test score well below the validation score suggests the model (or the model selection) overfit the validation data
# 8. Calculate confidence intervals for test metrics using bootstrap
#
# TODO: Create final risk stratification on test set:
# 9. Apply risk group thresholds from validation to test set
# 10. Calculate event rates in each test risk group
# 11. Perform survival analysis on test risk groups (if applicable)
# 12. Validate risk group separation using statistical tests
#
# TODO: Model interpretation and feature importance:
# 13. Extract final feature importance/coefficients:
#     - For linear models: coefficient values and confidence intervals
#     - For tree-based models: feature importance scores
#     - For neural networks: permutation importance
# 14. Create feature importance visualization
# 15. Identify top 10 most important features with interpretation
#
# TODO: Create clinical implementation materials:
# 16. Generate risk calculator/nomogram (for interpretable models):
#     - Point-based scoring system
#     - Risk probability tables
#     - Clinical decision thresholds
# 17. Create feature collection guidelines:
#     - Required clinical variables
#     - Laboratory test requirements
#     - Data quality specifications
#
# TODO: Model robustness and sensitivity analysis:
# 18. Test model performance on subgroups:
#     - Performance by age groups, disease stage, etc.
#     - Identify potential bias or performance disparities
# 19. Missing data sensitivity:
#     - Test performance with simulated missing features
#     - Evaluate graceful degradation
# 20. Feature perturbation analysis:
#     - How sensitive are predictions to small feature changes?
#
# TODO: Create comprehensive validation report:
# 21. Compile final model report including:
#     - Model architecture and hyperparameters
#     - Training/validation/test performance
#     - Feature importance and interpretation
#     - Clinical utility and risk stratification results
#     - Implementation guidelines and requirements
#
# TODO: Export final model and artifacts:
# 22. Save final trained model: joblib.dump(final_model, MODELS_PATH + 'final_risk_model.pkl')
# 23. Save preprocessing pipeline: joblib.dump(preprocessing_pipeline, MODELS_PATH + 'preprocessing_pipeline.pkl')
# 24. Export selected features list: pd.Series(selected_features).to_csv(MODELS_PATH + 'selected_features.csv')
# 25. Save risk group thresholds (convert any NumPy arrays with .tolist() first, since json cannot serialize them): with open(MODELS_PATH + 'risk_thresholds.json', 'w') as f: json.dump(risk_thresholds, f)
#
# TODO: Create model deployment package:
# 26. Create prediction function template:
#     def predict_risk(patient_features):
#         # Load model and preprocessing
#         # Apply preprocessing
#         # Make prediction
#         # Return risk score and group
# 27. Create validation dataset for deployment testing
# 28. Generate model card/documentation for clinical users
#
# TODO: Final performance summary:
# 29. Create executive summary with:
#     - Final model performance metrics
#     - Clinical utility assessment
#     - Implementation requirements
#     - Monitoring and maintenance recommendations
# 30. Generate one-page model summary for clinical stakeholders
#
# Expected output: Production-ready model with comprehensive validation and implementation materials

# Write your code below:
# print("FINAL MODEL VALIDATION AND EXPORT")
# print("="*50)
# 
# def create_risk_calculator_points(model, feature_names, reference_values=None):
#     """Create point-based risk calculator for linear models"""
#     if hasattr(model, 'coef_'):
#         coefficients = model.coef_[0] if len(model.coef_.shape) > 1 else model.coef_
#         
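#         # Calculate points, scaled so the largest-magnitude coefficient maps to +/-100.
#         # Comparing coefficients this way assumes features are on comparable scales (e.g. standardized).
#         max_coef = np.max(np.abs(coefficients))
#         points = (coefficients / max_coef) * 100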
#         
#         risk_calculator = pd.DataFrame({
#             'Feature': feature_names,
#             'Coefficient': coefficients,
#             'Points': points.round(1)
#         })
#         
#         return risk_calculator
#     else:
#         print("Risk calculator only available for linear models")
#         return None
# 
# def bootstrap_test_performance(model, X_test, y_test, n_bootstrap=1000, random_state=42):
#     """Calculate bootstrap confidence intervals for test performance"""
#     rng = np.random.default_rng(random_state)  # seeded so the intervals are reproducible
#     bootstrap_scores = []
#     n_samples = len(X_test)
#     
#     for i in range(n_bootstrap):
#         # Bootstrap sample (rows drawn with replacement)
#         indices = rng.choice(n_samples, n_samples, replace=True)
#         X_boot = X_test.iloc[indices]
#         y_boot = y_test.iloc[indices]
#         
#         # roc_auc_score raises ValueError if a resample contains only one class
#         if y_boot.nunique() < 2:
#             continue
#         
#         # Calculate performance
#         pred_boot = model.predict_proba(X_boot)[:, 1]
#         score_boot = roc_auc_score(y_boot, pred_boot)
#         bootstrap_scores.append(score_boot)
#     
#     # Calculate confidence intervals
#     ci_lower = np.percentile(bootstrap_scores, 2.5)
#     ci_upper = np.percentile(bootstrap_scores, 97.5)
#     
#     return np.mean(bootstrap_scores), ci_lower, ci_upper
# ...
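
Steps 9, 25, and 26 above share one detail that is easy to get wrong: the thresholds fixed on the validation set must be stored and then reapplied unchanged, using the same `<=` convention as `create_risk_groups`. The sketch below shows one way this might look using only the standard library. The threshold values, the temporary file location, and the helper name `assign_risk_group` are illustrative assumptions; a real deployment would read from the exported `risk_thresholds.json`.

```python
import bisect
import json
import os
import tempfile


def assign_risk_group(score, thresholds, labels=("Low", "Medium", "High")):
    """Map a risk score to a group; a score equal to a threshold falls in the lower group."""
    thresholds = list(thresholds)
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    if thresholds != sorted(thresholds):
        raise ValueError("thresholds must be sorted in ascending order")
    return labels[bisect.bisect_left(thresholds, score)]


# Hypothetical tertile cut points learned on the validation set
validation_thresholds = [0.2, 0.5]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "risk_thresholds.json")
    # Store plain floats so the file does not depend on NumPy types
    with open(path, "w") as f:
        json.dump([float(t) for t in validation_thresholds], f)
    with open(path) as f:
        loaded = json.load(f)

groups = [assign_risk_group(s, loaded) for s in (0.1, 0.2, 0.35, 0.9)]
print(groups)  # ['Low', 'Low', 'Medium', 'High']
```

Recomputing percentiles on the test set instead of reloading the stored values would leak test information into the group definitions, so the round trip through the saved file is the safer habit.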

---

## 📚 Model Development Summary

In this comprehensive notebook, you have successfully completed:

### ✅ **Completed Tasks:**
1. **Library Setup**: Imported ML, survival analysis, and evaluation libraries with availability checking
2. **Data Loading**: Loaded preprocessed data and selected feature sets from previous notebooks
3. **Time-Independent Models**: Trained classification models (Logistic, SVM, RF, GB, Neural Networks)
4. **Time-Dependent Models**: Implemented survival analysis models (Cox regression, RSF)
5. **Comprehensive Evaluation**: Applied C-index, precision in low-risk, calibration metrics
6. **Risk Stratification**: Created meaningful patient risk groups with clinical validation
7. **Model Comparison**: Systematically compared all models using multi-criteria evaluation
8. **Final Validation**: Validated selected model on test data with clinical implementation prep

### 🎯 **Key Modeling Achievements:**
- **Comprehensive Approach**: Both time-independent and time-dependent modeling strategies
- **Clinical Focus**: Emphasis on C-index and precision in low-risk group identification
- **Risk Stratification**: Meaningful patient group classification with statistical validation
- **Robust Evaluation**: Multiple metrics including calibration and clinical utility assessment
- **Production Ready**: Final model with implementation materials and deployment package

### 📊 **Evaluation Metrics Implemented:**
- **C-index (Concordance Index)**: Primary ranking metric for both classification and survival models
- **Precision in Low-Risk**: Clinical utility metric for identifying low-risk patients accurately
- **Risk Group Separation**: Statistical validation of meaningful risk stratification
- **Model Calibration**: Assessment of prediction reliability using calibration curves
- **Clinical Utility**: Decision curve analysis and net benefit evaluation
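
The net benefit used in decision curve analysis can be written down in a few lines. At a probability threshold $p_t$, net benefit is $\text{TP}/n - (\text{FP}/n)\cdot p_t/(1-p_t)$, and it is usually compared with a "treat all" strategy and with "treat none" (net benefit 0). The sketch below is a minimal pure-Python illustration on made-up labels and probabilities; the function names are assumptions for this example.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up outcomes and predicted probabilities for four patients
y_demo = [1, 1, 0, 0]
p_demo = [0.9, 0.4, 0.6, 0.1]

for pt in (0.2, 0.5):
    print(
        f"threshold={pt}: model={net_benefit(y_demo, p_demo, pt):.4f}, "
        f"treat all={net_benefit_treat_all(y_demo, pt):.4f}"
    )
```

A model adds clinical value at a given threshold only when its net benefit exceeds both the treat-all value and zero.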

### 🏥 **Clinical Implementation Features:**
- **Risk Group Definitions**: Low, medium, high risk classifications with clear thresholds
- **Survival Analysis**: Time-to-event modeling with Kaplan-Meier curves and log-rank tests
- **Model Interpretability**: Feature importance analysis and clinical decision support
- **Robustness Testing**: Subgroup analysis and missing data sensitivity assessment

### 🔄 **Next Steps (Optional Advanced Work):**
1. **External Validation**: Validate model on independent datasets from other institutions
2. **Prospective Validation**: Design prospective clinical study for model validation
3. **Clinical Decision Support**: Integrate model into electronic health record systems
4. **Regulatory Submission**: Prepare materials for FDA/CE mark approval (if applicable)
5. **Continuous Learning**: Implement model updating pipeline with new data

### 📁 **Exported Files:**
- `../results/models/final_risk_model.pkl`: Production-ready trained model
- `../results/models/preprocessing_pipeline.pkl`: Data preprocessing pipeline
- `../results/models/selected_features.csv`: Final feature set for model
- `../results/models/risk_thresholds.json`: Risk group classification thresholds
- `../results/models/model_validation_report.json`: Comprehensive model documentation

### 🎯 **Model Performance Summary:**
- **Best Model**: [Selected based on comprehensive evaluation criteria]
- **Test C-index**: [Final performance on held-out test set]
- **Precision in Low-Risk**: [Fraction of patients classified as low-risk who did not experience the event]
- **Risk Stratification**: [Statistical significance of patient group separation]
- **Clinical Utility**: [Decision curve analysis and net benefit assessment]

---

**Exceptional work completing the comprehensive model development pipeline! 🎉**

You've successfully developed, evaluated, and validated both time-independent and time-dependent models for patient risk classification. The rigorous evaluation framework with clinical metrics (C-index, precision in low-risk) and comprehensive risk stratification analysis provides a solid foundation for clinical implementation. Your final model is production-ready with complete documentation and deployment materials.