# Machine Learning Course: Model Development & Clinical Evaluation

## Notebook 4: Risk Prediction Models and Clinical Validation

> **Prerequisites:** Complete notebooks 1-3 first. This notebook loads the optimized feature sets from notebook 3 and focuses exclusively on model training, evaluation, and clinical validation.

### Learning Objectives
By the end of this notebook, you will be able to:
1. **Train time-independent classification models** for risk prediction (Logistic, SVM, RF, GBM, NN)
2. **Train time-dependent survival models** using Cox regression and Random Survival Forests
3. **Evaluate models using clinical metrics**: C-index, precision in low-risk, calibration
4. **Stratify patients into risk groups** and validate clinical utility
5. **Compare and select optimal models** for clinical deployment
6. **Export production-ready models** with implementation materials

### Data Flow
- **Input**: Optimized feature sets from `03_feature_selection.ipynb`
- **Processing**: Model training, hyperparameter tuning, clinical evaluation
- **Output**: Production-ready risk prediction models

### Clinical Evaluation Framework
- **C-index (Concordance Index)**: Primary metric for ranking predictions
- **Precision in Low-Risk Group**: Identifying patients who truly don't need aggressive treatment
- **Calibration**: Agreement between predicted and observed event rates
- **Risk Stratification**: Statistical validation of low/medium/high risk groups
- **Survival Curves**: Kaplan-Meier analysis by risk group (if survival data available)

---

## 1. Load Data and Selected Features

Load the preprocessed data and optimized feature sets from notebook 3.

In [None]:
# 📝 ACTIVITY 1: Load Selected Features and Preprocessed Data
#
# Your task: Load the optimized feature sets from notebook 3 and preprocessed data
#
# TODO: Import required libraries:
# 1. Import pandas, numpy, matplotlib.pyplot, seaborn
# 2. Import sklearn models: LogisticRegression, SVC, RandomForestClassifier, GradientBoostingClassifier
# 3. Import sklearn.neural_network: MLPClassifier
# 4. Import sklearn.model_selection: cross_val_score, StratifiedKFold, GridSearchCV
# 5. Import sklearn.metrics: roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix
# 6. Import sklearn.calibration: calibration_curve, CalibratedClassifierCV
# 7. Try importing lifelines: CoxPHFitter, KaplanMeierFitter (plus lifelines.utils.concordance_index and lifelines.statistics.logrank_test)
# 8. Import scipy.stats: chi2_contingency
# 9. Import joblib, time
#
# TODO: Load feature selection results from notebook 3:
# 10. Load from '../results/feature_selection/':
#     - ensemble_minimal.csv (high-confidence features)
#     - ensemble_balanced.csv (recommended feature set)
#     - ensemble_comprehensive.csv (maximum coverage)
# 11. Choose which feature set to use for modeling (typically balanced)
#
# TODO: Load preprocessed data from notebook 2:
# 12. Load from '../results/preprocessing/':
#     - X_train_scaled.csv, X_val_scaled.csv, X_test_scaled.csv
#     - y_train.csv, y_val.csv, y_test.csv
# 13. Filter X datasets to only include selected features
#
# TODO: Verify data alignment:
# 14. Check shapes: X_train, X_val, X_test, y_train, y_val, y_test
# 15. Confirm target variables include RFS_MONTHS and RFS_STATUS (from preprocessing)
# 16. Verify no missing values in final datasets
#
# TODO: Display data summary:
# 17. Print number of features selected
# 18. Print dataset sizes (train/val/test)
# 19. Print target distribution
# 20. Confirm data is ready for modeling
#
# Expected output: Loaded feature sets and data ready for model training

# Write your code below:
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# 
# # Load selected features from notebook 3
# feature_selection_dir = '../results/feature_selection/'
# selected_features_df = pd.read_csv(f'{feature_selection_dir}ensemble_balanced.csv')
# selected_features = selected_features_df['feature'].tolist() if 'feature' in selected_features_df.columns else selected_features_df.iloc[:, 0].tolist()
# 
# # Load preprocessed data from notebook 2
# preprocessing_dir = '../results/preprocessing/'
# X_train = pd.read_csv(f'{preprocessing_dir}X_train_scaled.csv', index_col=0)[selected_features]
# X_val = pd.read_csv(f'{preprocessing_dir}X_val_scaled.csv', index_col=0)[selected_features]
# X_test = pd.read_csv(f'{preprocessing_dir}X_test_scaled.csv', index_col=0)[selected_features]
# y_train = pd.read_csv(f'{preprocessing_dir}y_train.csv', index_col=0)
# y_val = pd.read_csv(f'{preprocessing_dir}y_val.csv', index_col=0)
# y_test = pd.read_csv(f'{preprocessing_dir}y_test.csv', index_col=0)
# ...

## 2. Time-Independent Classification Models

Train multiple classification algorithms to predict risk without time-to-event information.

In [None]:
# 📝 ACTIVITY 2: Train Time-Independent Classification Models
#
# Your task: Train multiple classification algorithms for risk prediction
#
# TODO: Set up model training framework:
# 1. Define target variable: y_binary = y_train['RFS_STATUS'] (0=no recurrence, 1=recurrence)
# 2. Initialize results storage: model_results = {}
# 3. Set up cross-validation: cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#
# TODO: Define models to train:
# 4. Create dictionary of models:
#    models = {
#        'Logistic Regression (L1)': LogisticRegression(penalty='l1', solver='liblinear', C=1.0, random_state=42),
#        'Logistic Regression (L2)': LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42),
#        'SVM (RBF)': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
#        'SVM (Linear)': SVC(kernel='linear', C=1.0, probability=True, random_state=42),
#        'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
#        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
#        'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
#    }
#
# TODO: Train and evaluate each model:
# 5. For each model in models:
#    - Fit on training data: model.fit(X_train, y_binary)
#    - Get training predictions: train_pred_proba = model.predict_proba(X_train)[:, 1]
#    - Get validation predictions: val_pred_proba = model.predict_proba(X_val)[:, 1]
#    - Calculate training AUC: train_auc = roc_auc_score(y_binary, train_pred_proba)
#    - Calculate validation AUC: val_auc = roc_auc_score(y_val['RFS_STATUS'], val_pred_proba)
#    - Perform cross-validation: cv_scores = cross_val_score(model, X_train, y_binary, cv=cv, scoring='roc_auc')
#    - Store results: model_results[model_name] = {'train_auc': ..., 'val_auc': ..., 'cv_mean': ..., 'cv_std': ..., 'val_predictions': val_pred_proba}
#
# TODO: Calculate additional metrics:
# 6. For each model's validation predictions:
#    - Calculate precision, recall, F1 at optimal threshold
#    - Calculate confusion matrix
#    - Store metrics in model_results
#
# TODO: Visualize model performance:
# 7. Create comparison plot: model performance (AUC) with error bars
# 8. Create ROC curves for top 3 models on validation set
# 9. Print performance summary table
#
# TODO: Identify top models:
# 10. Sort models by validation AUC
# 11. Select top 3 models for further analysis
# 12. Print top model summary
#
# Expected output: Trained classification models with performance comparison

# Write your code below:
# print("TIME-INDEPENDENT CLASSIFICATION MODELS")
# print("="*60)
# 
# from sklearn.linear_model import LogisticRegression
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# from sklearn.neural_network import MLPClassifier
# from sklearn.model_selection import cross_val_score, StratifiedKFold
# from sklearn.metrics import roc_auc_score
# ...

## 3. Time-Dependent Survival Models

Train survival analysis models that incorporate time-to-event information using RFS_MONTHS and RFS_STATUS.

In [None]:
# 📝 ACTIVITY 3: Train Time-Dependent Survival Models
#
# Your task: Train Cox regression and survival models using RFS data from preprocessing
#
# TODO: Check survival library availability:
# 1. Try importing: from lifelines import CoxPHFitter, KaplanMeierFitter
# 2. Try importing: from lifelines.utils import concordance_index
# 3. Set flag: lifelines_available = True/False
# 4. If not available, provide installation guidance
#
# TODO: Prepare survival data:
# 5. Extract survival variables from y_train:
#    - duration_train = y_train['RFS_MONTHS']
#    - event_train = y_train['RFS_STATUS']
# 6. Create survival DataFrame: survival_df_train = pd.concat([X_train, duration_train, event_train], axis=1)
# 7. Handle any missing or zero survival times
# 8. Verify data format: duration > 0, event in {0, 1}
#
# TODO: Train Cox Proportional Hazards model (if lifelines available):
# 9. Create Cox model: cph = CoxPHFitter()
# 10. Fit model: cph.fit(survival_df_train, duration_col='RFS_MONTHS', event_col='RFS_STATUS')
# 11. Print model summary: cph.print_summary()
# 12. Get concordance index: c_index_train = cph.concordance_index_
#
# TODO: Evaluate Cox model on validation set:
# 13. Prepare validation survival data
# 14. Calculate risk scores: risk_scores_val = cph.predict_partial_hazard(X_val)
# 15. Calculate validation C-index: c_index_val = concordance_index(y_val['RFS_MONTHS'], -risk_scores_val, y_val['RFS_STATUS'])
# 16. Compare train vs validation C-index
#
# TODO: Extract hazard ratios for top features:
# 17. Get hazard ratios: hazard_ratios = cph.hazard_ratios_
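# 18. Sort by effect size on the log scale, so protective features (HR < 1) rank alongside harmful ones: top_hr = hazard_ratios.loc[cph.params_.abs().sort_values(ascending=False).head(20).index]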
# 19. Visualize top hazard ratios with confidence intervals
# 20. Interpret: HR > 1 means increased risk, HR < 1 means decreased risk
#
# TODO: Alternative if lifelines not available:
# 21. Use logistic regression with RFS_STATUS as proxy
# 22. Note limitations of this approach
# 23. Recommend installing lifelines for proper survival analysis
#
# TODO: Save Cox model results:
# 24. Store C-index, hazard ratios, and predictions
# 25. Add to model_results dictionary
# 26. Compare C-index with classification model AUCs
#
# Expected output: Cox regression model with C-index and hazard ratio analysis

# Write your code below:
# print("TIME-DEPENDENT SURVIVAL MODELS")
# print("="*60)
# 
# try:
#     from lifelines import CoxPHFitter, KaplanMeierFitter
#     from lifelines.utils import concordance_index
#     lifelines_available = True
#     print("✓ Lifelines library available")
# except ImportError:
#     lifelines_available = False
#     print("⚠️  Lifelines not available")
#     print("Install with: pip install lifelines")
# ...

## 4. Clinical Evaluation: C-index and Precision in Low-Risk

Calculate clinical metrics including C-index and precision in low-risk group identification.

In [None]:
# 📝 ACTIVITY 4: Clinical Evaluation Metrics
#
# Your task: Calculate C-index and precision in low-risk group for all models
#
# TODO: Set up clinical evaluation framework:
# 1. Create results DataFrame: clinical_metrics = pd.DataFrame(columns=['Model', 'C-index', 'Precision_Low_Risk', 'Precision_High_Risk'])
# 2. Define low-risk threshold: typically bottom 25% or 33% of risk scores
# 3. Define high-risk threshold: typically top 25% or 33% of risk scores
#
# TODO: Calculate C-index for all models:
# 4. For classification models: C-index = ROC AUC (the two coincide for a binary outcome when censoring is ignored)
# 5. For Cox models: use concordance_index from lifelines
# 6. Store C-index values for each model
#
# TODO: Calculate precision in low-risk group:
# 7. For each model's validation predictions:
#    - Define low-risk patients: bottom 25% of risk scores
#    - low_risk_threshold = np.percentile(risk_scores, 25)
#    - low_risk_mask = risk_scores <= low_risk_threshold
#    - Calculate precision: true_negatives / predicted_low_risk
#    - precision_low = np.sum((y_val['RFS_STATUS'] == 0) & low_risk_mask) / np.sum(low_risk_mask)
# 8. This measures: "Of patients predicted as low-risk, what % truly have no recurrence?"
#
# TODO: Calculate precision in high-risk group:
# 9. Define high-risk patients: top 25% of risk scores
# 10. high_risk_threshold = np.percentile(risk_scores, 75)
# 11. high_risk_mask = risk_scores >= high_risk_threshold
# 12. Calculate precision: true_positives / predicted_high_risk
# 13. precision_high = np.sum((y_val['RFS_STATUS'] == 1) & high_risk_mask) / np.sum(high_risk_mask)
#
# TODO: Calculate calibration metrics:
# 14. For models with probability outputs:
#     - Use calibration_curve: fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_prob, n_bins=10)
#     - Calculate Brier score: brier_score = np.mean((y_prob - y_true) ** 2)
#     - Lower Brier score = better probabilistic predictions (it reflects both calibration and discrimination)
#
# TODO: Create comprehensive metrics table:
# 15. Combine all metrics for each model:
#     - Model name, C-index, Precision (Low), Precision (High), Brier Score
# 16. Sort by C-index or precision in low-risk
# 17. Identify best models for different objectives
#
# TODO: Visualize clinical metrics:
# 18. Create bar plot: C-index comparison across models
# 19. Create scatter plot: C-index vs Precision in Low-Risk
# 20. Create calibration plots for top 3 models
# 21. Highlight clinical utility trade-offs
#
# TODO: Interpret clinical relevance:
# 22. Explain what C-index values mean (rough rules of thumb: 0.5 = chance, >0.7 good, >0.8 excellent)
# 23. Explain precision in low-risk importance for treatment de-escalation
# 24. Identify models suitable for clinical decision support
#
# Expected output: Clinical metrics table with C-index and precision analysis

# Write your code below:
# print("CLINICAL EVALUATION METRICS")
# print("="*60)
# 
# from sklearn.calibration import calibration_curve
# 
# def calculate_precision_in_risk_groups(y_true, risk_scores, low_percentile=25, high_percentile=75):
#     """Calculate precision in low and high risk groups"""
#     low_threshold = np.percentile(risk_scores, low_percentile)
#     high_threshold = np.percentile(risk_scores, high_percentile)
#     
#     # Low risk precision (Negative Predictive Value-like)
#     low_risk_mask = risk_scores <= low_threshold
#     precision_low = np.sum((y_true == 0) & low_risk_mask) / np.sum(low_risk_mask) if np.sum(low_risk_mask) > 0 else 0
#     
#     # High risk precision (Positive Predictive Value)
#     high_risk_mask = risk_scores >= high_threshold
#     precision_high = np.sum((y_true == 1) & high_risk_mask) / np.sum(high_risk_mask) if np.sum(high_risk_mask) > 0 else 0
#     
#     return precision_low, precision_high
# ...

## 5. Risk Stratification and Survival Analysis

Create low/medium/high risk groups and validate their clinical utility using survival curves.

In [None]:
# 📝 ACTIVITY 5: Risk Stratification and Kaplan-Meier Analysis
#
# Your task: Create risk groups and validate using survival curves
#
# TODO: Select best model for risk stratification:
# 1. Based on Activity 4, select model with best balance of C-index and precision
# 2. Get risk scores on validation set: risk_scores = best_model.predict_proba(X_val)[:, 1]
#
# TODO: Create risk groups using tertiles:
# 3. Calculate tertile thresholds: percentiles = np.percentile(risk_scores, [33.33, 66.67])
# 4. Assign risk groups:
#    - risk_groups = np.where(risk_scores <= percentiles[0], 'Low',
#                             np.where(risk_scores <= percentiles[1], 'Medium', 'High'))
# 5. Create risk group DataFrame with scores and groups
#
# TODO: Analyze risk group characteristics:
# 6. Calculate event rates by risk group:
#    - For each group: event_rate = mean(y_val['RFS_STATUS'])
# 7. Test statistical significance: chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# 8. Calculate sample sizes per group
# 9. Verify monotonic trend: low < medium < high event rates
#
# TODO: Create Kaplan-Meier survival curves (if lifelines available):
# 10. For each risk group:
#     - kmf = KaplanMeierFitter()
#     - kmf.fit(durations=y_val['RFS_MONTHS'][risk_groups=='Low'], 
#               event_observed=y_val['RFS_STATUS'][risk_groups=='Low'],
#               label='Low Risk')
#     - Plot survival curve: kmf.plot()
# 11. Repeat for Medium and High risk groups
# 12. Create combined plot with all three curves
#
# TODO: Perform log-rank test:
# 13. Test survival curve differences:
#     - from lifelines.statistics import logrank_test
#     - Compare Low vs High: logrank_test(durations_low, durations_high, events_low, events_high)
#     - Print p-value and test statistic
# 14. Test pairwise differences: Low vs Medium, Medium vs High
# 15. Confirm significant separation (p < 0.05)
#
# TODO: Calculate hazard ratios between groups:
# 16. Use Cox regression with risk groups as categorical variable
# 17. Calculate HR for Medium vs Low, High vs Low
# 18. Display with confidence intervals
#
# TODO: Visualize risk stratification:
# 19. Create figure with 3 panels:
#     - Panel 1: Risk score distribution by group (violin plot)
#     - Panel 2: Event rates by group (bar plot with CI)
#     - Panel 3: Kaplan-Meier curves by group
# 20. Add statistical test results to plots
#
# TODO: Create risk group summary table:
# 21. For each risk group:
#     - N patients, Event rate, Median survival time
#     - Hazard ratio vs Low risk (reference)
#     - 95% confidence intervals
#
# TODO: Validate risk thresholds:
# 22. Test different stratification strategies (quartiles, quintiles)
# 23. Compare clinical utility of different strategies
# 24. Select optimal stratification for clinical use
#
# Expected output: Risk-stratified patient groups with validated survival curves

# Write your code below:
# print("RISK STRATIFICATION AND SURVIVAL ANALYSIS")
# print("="*60)
# 
# from scipy.stats import chi2_contingency
# 
# # Create risk groups
# best_model_name = 'Random Forest'  # Replace with actual best model
# risk_scores = model_results[best_model_name]['val_predictions']
# 
# percentiles = np.percentile(risk_scores, [33.33, 66.67])
# risk_groups = np.where(risk_scores <= percentiles[0], 'Low',
#                       np.where(risk_scores <= percentiles[1], 'Medium', 'High'))
# ...

## 6. Model Selection and Hyperparameter Tuning

Select the best model and optimize hyperparameters using grid search with clinical metrics.

In [None]:
# 📝 ACTIVITY 6: Model Selection and Hyperparameter Optimization
#
# Your task: Select best model and optimize its hyperparameters
#
# TODO: Review model performance across all activities:
# 1. Create comprehensive comparison table:
#    - Model, Train AUC, Val AUC, C-index, Precision Low-Risk, Precision High-Risk
# 2. Rank models by validation C-index
# 3. Identify top 3 candidates for hyperparameter tuning
#
# TODO: Set up hyperparameter tuning framework:
# 4. Choose top model (e.g., Random Forest or Gradient Boosting)
# 5. Define parameter grid:
#    - For Random Forest: n_estimators=[50, 100, 200], max_depth=[5, 10, 15, None], min_samples_split=[2, 5, 10]
#    - For GBM: n_estimators=[50, 100, 200], learning_rate=[0.01, 0.1, 0.2], max_depth=[3, 5, 7]
#    - For Logistic: C=[0.01, 0.1, 1, 10], penalty=['l1', 'l2'] (use solver='liblinear' or 'saga'; the default solver does not support 'l1')
#
# TODO: Implement GridSearchCV with clinical scoring:
# 6. Create GridSearchCV:
#    - grid_search = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
# 7. Fit on training data: grid_search.fit(X_train, y_train['RFS_STATUS'])
# 8. Get best parameters: best_params = grid_search.best_params_
# 9. Get best score: best_cv_score = grid_search.best_score_
#
# TODO: Evaluate optimized model:
# 10. Get best model: best_model = grid_search.best_estimator_
# 11. Predict on validation: val_pred = best_model.predict_proba(X_val)[:, 1]
# 12. Calculate validation metrics: AUC, C-index, precision in low/high risk
# 13. Compare to baseline (non-tuned) model performance
#
# TODO: Feature importance analysis:
# 14. Extract feature importances from best model
# 15. Rank features by importance
# 16. Visualize top 20 features
# 17. Interpret clinical relevance of top features
#
# TODO: Model calibration:
# 18. Check if model needs calibration:
#     - Plot calibration curve
#     - Calculate calibration slope and intercept
# 19. If poorly calibrated, apply calibration:
#     - calibrated_model = CalibratedClassifierCV(best_model, method='sigmoid', cv=5)
#     - calibrated_model.fit(X_train, y_train['RFS_STATUS'])
#     - Evaluate calibration improvement
#
# TODO: Create final model comparison:
# 20. Compare baseline vs tuned vs calibrated versions
# 21. Select final model based on:
#     - Validation C-index (primary)
#     - Precision in low-risk (secondary)
#     - Calibration quality (tertiary)
# 22. Document selection rationale
#
# Expected output: Optimized final model with feature importance and calibration

# Write your code below:
# print("MODEL SELECTION AND HYPERPARAMETER TUNING")
# print("="*60)
# 
# from sklearn.model_selection import GridSearchCV
# from sklearn.calibration import CalibratedClassifierCV
# 
# # Define parameter grid for top model
# param_grid_rf = {
#     'n_estimators': [50, 100, 200],
#     'max_depth': [5, 10, 15, None],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4]
# }
# ...

## 7. Final Model Validation on Test Set

Validate the selected model on the held-out test set and prepare for clinical deployment.

In [None]:
# 📝 ACTIVITY 7: Final Model Validation and Deployment Preparation
#
# Your task: Validate final model on test set and create deployment package
#
# TODO: Perform final test set evaluation:
# 1. Get final optimized model from Activity 6
# 2. Make predictions on test set: test_pred = final_model.predict_proba(X_test)[:, 1]
# 3. Calculate test metrics:
#    - Test C-index: roc_auc_score(y_test['RFS_STATUS'], test_pred)
#    - Precision in low-risk: using calculate_precision_in_risk_groups()
#    - Test calibration: calibration_curve on test set
#
# TODO: Compare performance across all splits:
# 4. Create summary table:
#    - Metric | Train | Validation | Test
#    - C-index | ... | ... | ...
#    - Precision Low-Risk | ... | ... | ...
#    - Precision High-Risk | ... | ... | ...
# 5. Check for overfitting: large train-test gaps
# 6. Assess generalization: similar val and test performance
#
# TODO: Create final risk stratification on test set:
# 7. Apply risk group thresholds from validation to test data
# 8. Calculate event rates by risk group on test set
# 9. Create Kaplan-Meier curves for test risk groups
# 10. Perform log-rank test to confirm risk group separation
#
# TODO: Bootstrap confidence intervals:
# 11. Calculate bootstrap CI for test C-index:
#     - Resample test set 1000 times
#     - Calculate C-index on each bootstrap sample
#     - Report mean and 95% CI
# 12. Calculate CI for precision metrics
#
# TODO: Feature importance and interpretation:
# 13. Extract final feature importances/coefficients
# 14. Create visualization of top 20 features
# 15. Provide clinical interpretation for each top feature
# 16. Document biological relevance
#
# TODO: Create clinical deployment package:
# 17. Save final model: joblib.dump(final_model, '../results/models/final_model.pkl')
# 18. Save feature list: pd.Series(selected_features).to_csv('../results/models/model_features.csv')
# 19. Save risk thresholds: with open('../results/models/risk_thresholds.json', 'w') as f: json.dump(risk_thresholds, f)
# 20. Save preprocessing info: document normalization parameters used
#
# TODO: Create prediction function for deployment:
# 21. Write function:
#     def predict_patient_risk(patient_features):
#         # Load model
#         # Apply preprocessing
#         # Make prediction
#         # Assign risk group
#         # Return risk score and group
# 22. Test function with sample patient data
#
# TODO: Generate model documentation:
# 23. Create model card with:
#     - Model type and hyperparameters
#     - Training dataset characteristics
#     - Performance metrics (C-index, precision)
#     - Feature requirements
#     - Risk group definitions
#     - Clinical interpretation guidelines
# 24. Document limitations and appropriate use cases
#
# TODO: Create final report:
# 25. Executive summary: one-page overview
# 26. Performance metrics across all splits
# 27. Risk stratification validation results
# 28. Feature importance and interpretation
# 29. Deployment guidelines
# 30. Monitoring and maintenance recommendations
#
# Expected output: Validated production-ready model with complete deployment package

# Write your code below:
# print("FINAL MODEL VALIDATION AND DEPLOYMENT")
# print("="*60)
# 
# import joblib
# import json
# 
# # Test set evaluation
# test_pred = final_model.predict_proba(X_test)[:, 1]
# test_c_index = roc_auc_score(y_test['RFS_STATUS'], test_pred)
# test_precision_low, test_precision_high = calculate_precision_in_risk_groups(
#     y_test['RFS_STATUS'].values, test_pred
# )
# 
# print(f"\nFinal Test Set Performance:")
# print(f"  C-index: {test_c_index:.3f}")
# print(f"  Precision (Low-Risk): {test_precision_low:.3f}")
# print(f"  Precision (High-Risk): {test_precision_high:.3f}")
# ...

## Coding Hints and Templates

Need help getting started? Here are code templates for model development:

### 📋 **Template 1: Training Classification Models**
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train['RFS_STATUS'])
val_pred = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val['RFS_STATUS'], val_pred)
```
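
Activity 2 repeats this pattern for several models and adds cross-validation. The loop can be sketched as follows. Synthetic data from `make_classification` stands in for the course data (an assumption, made only so the block runs on its own), and only two of the seven models are shown. The `val_predictions` entry keeps the validation scores so later steps can reuse them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for X_train / X_val (assumption, for illustration only)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    'Logistic Regression (L2)': LogisticRegression(C=1.0, max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model_results = {}
for name, model in models.items():
    # cross_val_score fits clones, so the explicit fit below is independent of it
    cv_scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring='roc_auc')
    model.fit(X_tr, y_tr)
    val_pred = model.predict_proba(X_va)[:, 1]
    model_results[name] = {
        'train_auc': roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]),
        'val_auc': roc_auc_score(y_va, val_pred),
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'val_predictions': val_pred,
    }

# Rank models by validation AUC, best first
ranked = sorted(model_results.items(), key=lambda kv: kv[1]['val_auc'], reverse=True)
for name, res in ranked:
    print(f"{name}: val AUC={res['val_auc']:.3f}, "
          f"CV AUC={res['cv_mean']:.3f} +/- {res['cv_std']:.3f}")
```

A large gap between `train_auc` and `cv_mean` is usually the first sign of overfitting, which is why both are stored.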

### 📋 **Template 2: Cox Regression**
```python
import pandas as pd
from lifelines import CoxPHFitter

cph = CoxPHFitter()
survival_df = pd.concat([X_train, y_train[['RFS_MONTHS', 'RFS_STATUS']]], axis=1)
cph.fit(survival_df, duration_col='RFS_MONTHS', event_col='RFS_STATUS')
print(f"C-index: {cph.concordance_index_:.3f}")
```
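
The value printed by Template 2 is the training C-index. To make the metric itself concrete, here is a plain-NumPy sketch of Harrell's C. The function name `harrell_c_index` is hypothetical and pairs with tied times are simply skipped, so for real analyses `lifelines.utils.concordance_index` remains the better choice. Note the orientation: this sketch treats a higher *risk* score as concordant with a shorter time, whereas lifelines expects scores where higher means longer survival. That is why Activity 3 negates the partial hazard before calling it.

```python
import numpy as np

def harrell_c_index(durations, risk_scores, events):
    """Fraction of comparable patient pairs that the risk scores rank correctly."""
    durations = np.asarray(durations, dtype=float)
    risk_scores = np.asarray(risk_scores, dtype=float)
    events = np.asarray(events, dtype=bool)

    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event had the higher risk score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("No comparable pairs: C-index is undefined")
    return concordant / comparable

# Toy example (made-up numbers): the third patient is censored at 15 months
durations = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.6, 0.7, 0.1]
c_index = harrell_c_index(durations, risk, events)
print(f"C-index: {c_index:.2f}")
```

Here five pairs are comparable and four are ranked correctly, so a C-index reads as "the probability that, of two comparable patients, the one who recurred earlier was given the higher risk score".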

### 📋 **Template 3: Precision in Low-Risk**
```python
import numpy as np

def calculate_low_risk_precision(y_true, risk_scores, percentile=25):
    threshold = np.percentile(risk_scores, percentile)
    low_risk_mask = risk_scores <= threshold
    true_negatives = np.sum((y_true == 0) & low_risk_mask)
    predicted_low = np.sum(low_risk_mask)
    return true_negatives / predicted_low if predicted_low > 0 else 0
```
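
Template 3 derives its threshold from whichever scores it is given. Activity 7 instead asks for thresholds learned on the validation set to be reused on the test set. One way to separate those two steps is sketched below. The helper names `fit_risk_thresholds` and `assign_risk_groups` are hypothetical, and the scores are made-up numbers.

```python
import numpy as np

def fit_risk_thresholds(reference_scores, percentiles=(100 / 3, 200 / 3)):
    """Learn tertile cut-points once, on the validation scores."""
    low_cut, high_cut = np.percentile(reference_scores, percentiles)
    # Plain floats keep the dictionary JSON-serializable for export
    return {'low': float(low_cut), 'high': float(high_cut)}

def assign_risk_groups(scores, thresholds):
    """Apply previously learned cut-points to any set of scores."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores <= thresholds['low'], 'Low',
                    np.where(scores <= thresholds['high'], 'Medium', 'High'))

# Validation scores define the cut-points ...
val_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
risk_thresholds = fit_risk_thresholds(val_scores)

# ... and the same cut-points are applied unchanged to test scores
test_scores = np.array([0.05, 0.50, 0.95])
test_groups = assign_risk_groups(test_scores, risk_thresholds)
print(risk_thresholds)
print(test_groups)
```

Freezing the cut-points this way means a patient's group depends only on their own score, not on who else happens to be in the test set. A deployed model needs that property.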

### 📋 **Template 4: Kaplan-Meier Curves**
```python
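import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Draw all three curves on one shared axes
fig, ax = plt.subplots()
for risk_group in ['Low', 'Medium', 'High']:
    mask = risk_groups == risk_group
    kmf = KaplanMeierFitter()
    kmf.fit(y_val['RFS_MONTHS'][mask], y_val['RFS_STATUS'][mask], label=risk_group)
    kmf.plot_survival_function(ax=ax)
ax.set_xlabel('Months')
ax.set_ylabel('Recurrence-Free Survival Probability')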
```
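
Before plotting curves, Activity 5 asks for event rates per risk group, a chi-square test, and a check that the rates rise from Low to High. A minimal sketch with made-up group labels and outcomes is shown below. Note that `chi2_contingency` returns four values, not two.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy groups and outcomes (assumption, for illustration only)
risk_groups = np.array(['Low'] * 20 + ['Medium'] * 20 + ['High'] * 20)
events = np.array([0] * 18 + [1] * 2 + [0] * 12 + [1] * 8 + [0] * 6 + [1] * 14)

order = ['Low', 'Medium', 'High']
summary = (pd.DataFrame({'group': risk_groups, 'event': events})
           .groupby('group')['event']
           .agg(n='count', event_rate='mean')
           .reindex(order))
print(summary)

# Rows are risk groups, columns are event status (0 / 1)
contingency_table = pd.crosstab(risk_groups, events).reindex(order)
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")

# Event rates should increase from Low to High
is_monotonic = bool(summary['event_rate'].is_monotonic_increasing)
print(f"Monotonic event rates: {is_monotonic}")
```

The chi-square test ignores follow-up time, so it complements the log-rank test on the Kaplan-Meier curves and does not replace it.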

### 📋 **Template 5: GridSearchCV**
```python
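from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, None]}
# Fix random_state so the search is reproducible
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='roc_auc')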
grid_search.fit(X_train, y_train['RFS_STATUS'])
best_model = grid_search.best_estimator_
```
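
Once a tuned model is in hand, Activity 6 checks its calibration and optionally wraps it in `CalibratedClassifierCV`. The comparison can be sketched as below. It uses synthetic data and an untuned Random Forest as stand-ins (assumptions made so the block is self-contained). The "mean calibration gap" is an informal summary introduced here for illustration and is not a standard named metric.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course data (assumption, for illustration only)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

base_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
base_model.fit(X_tr, y_tr)

# CalibratedClassifierCV refits clones of the estimator inside its own CV folds
calibrated_model = CalibratedClassifierCV(base_model, method='sigmoid', cv=5)
calibrated_model.fit(X_tr, y_tr)

calibration_report = {}
for label, model in [('uncalibrated', base_model), ('calibrated', calibrated_model)]:
    prob = model.predict_proba(X_va)[:, 1]
    # Observed event fraction vs mean predicted probability, per non-empty bin
    frac_pos, mean_pred = calibration_curve(y_va, prob, n_bins=10)
    calibration_report[label] = {
        'brier': brier_score_loss(y_va, prob),
        'mean_gap': float(np.mean(np.abs(frac_pos - mean_pred))),
    }
    print(f"{label}: Brier={calibration_report[label]['brier']:.4f}, "
          f"mean calibration gap={calibration_report[label]['mean_gap']:.4f}")
```

Calibration does not always help, so the two rows should be compared on the validation set before the calibrated version is adopted.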

---

## 🎯 Learning Assessment

### ✅ **Self-Check Questions**

1. **Classification Models**: How do different algorithms compare on your data? Why might one perform better?
2. **C-index**: What does a C-index of 0.75 mean clinically? How is it different from accuracy?
3. **Precision in Low-Risk**: Why is this metric important for clinical decisions about treatment de-escalation?
4. **Cox Regression**: How do you interpret a hazard ratio of 2.0 for a gene? What about 0.5?
5. **Risk Stratification**: Are your risk groups statistically separated? What's the p-value from log-rank test?
6. **Model Selection**: What criteria did you use to select your final model? How did you balance different metrics?
7. **Calibration**: Is your model well-calibrated? What would you do if it wasn't?

### 🏆 **Success Criteria**

You have successfully completed this notebook if you can:
- ✅ Train and compare multiple classification algorithms
- ✅ Train survival models using Cox regression
- ✅ Calculate and interpret C-index and precision metrics
- ✅ Create statistically validated risk groups
- ✅ Generate Kaplan-Meier curves showing risk separation
- ✅ Select and optimize a final model
- ✅ Validate model on test set
- ✅ Export production-ready model with documentation

---

## 📚 Model Development Summary

### ✅ **Completed Activities:**
1. **Data Loading**: Loaded selected features from notebook 3 and preprocessed data from notebook 2
2. **Classification Models**: Trained 7 model configurations across 5 algorithm families (Logistic, SVM, RF, GBM, NN)
3. **Survival Models**: Trained Cox regression using RFS_MONTHS and RFS_STATUS
4. **Clinical Metrics**: Calculated C-index and precision in low/high risk groups
5. **Risk Stratification**: Created and validated low/medium/high risk groups with Kaplan-Meier curves
6. **Model Optimization**: Performed hyperparameter tuning on best model
7. **Final Validation**: Tested on held-out test set and created deployment package

### 🎯 **Key Achievements:**
- **Comprehensive Modeling**: Both time-independent and time-dependent approaches
- **Clinical Focus**: Emphasis on C-index and precision in low-risk identification
- **Risk Groups**: Statistically validated patient stratification
- **Production Ready**: Final model with complete documentation

### 📊 **Pipeline Completion:**
✅ **Notebook 1**: Data Exploration  
✅ **Notebook 2**: Data Preprocessing  
✅ **Notebook 3**: Feature Selection  
✅ **Notebook 4**: Model Development ← **YOU ARE HERE**

**Congratulations on completing the entire machine learning pipeline! 🎉**