In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, roc_curve, confusion_matrix, 
                             classification_report)
import pickle
import os

# Load data
X_test = pd.read_csv('data/X_test_scaled.csv')
y_test = pd.read_csv('data/y_test.csv').values.ravel()

# Load tuned model
with open('models/best_model_tuned.pkl', 'rb') as f:
    model = pickle.load(f)

print("="*80)
print("✓ PHASE 7: MODEL EVALUATION")
print("="*80)

# Predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Metrics
print(f"\n1. PERFORMANCE METRICS")
print(f"   Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"   Precision: {precision_score(y_test, y_pred):.4f}")
print(f"   Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"   F1-Score:  {f1_score(y_test, y_pred):.4f}")
print(f"   ROC-AUC:   {roc_auc_score(y_test, y_pred_proba):.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\n2. CONFUSION MATRIX")
print(f"   TN: {cm[0,0]}  FP: {cm[0,1]}")
print(f"   FN: {cm[1,0]}  TP: {cm[1,1]}")

# Classification Report
print(f"\n3. CLASSIFICATION REPORT")
print(classification_report(y_test, y_pred))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve (left subplot)
axes[0].plot(fpr, tpr, label=f'ROC (AUC={auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC-AUC Curve')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Confusion Matrix (right subplot)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Confusion Matrix')

plt.tight_layout()
os.makedirs('visualizations', exist_ok=True)
plt.savefig('visualizations/14_evaluation.png', dpi=300)
plt.show()

# Feature Importance
try:
    importance_df = pd.DataFrame({
        'Feature': X_test.columns,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False)

    print(f"\n4. TOP 10 FEATURES")
    print(importance_df.head(10).to_string(index=False))

    # Visualize top features
    plt.figure(figsize=(10, 6))
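    # seaborn has no barh(); use matplotlib's horizontal bar chart instead
    plt.barh(importance_df['Feature'][:15], importance_df['Importance'][:15])
    plt.gca().invert_yaxis()  # show the most important feature at the top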
    plt.xlabel('Importance')
    plt.title('Top 15 Most Important Features')
    plt.tight_layout()
    os.makedirs('visualizations', exist_ok=True)
    plt.savefig('visualizations/15_feature_importance.png', dpi=300)
    plt.show()
except AttributeError:
    print('\nModel does not expose feature_importances_. Skipping feature importance.')

print(f"\n✓ PHASE 7 COMPLETE!")


## 📋 EVALUATION METRICS REFERENCE

| Metric | Formula | Meaning |
|--------|---------|---------|
| **Accuracy** | (TP+TN)/Total | % of correct predictions overall |
| **Precision** | TP/(TP+FP) | % of predicted disease cases that are correct |
| **Recall** | TP/(TP+FN) | % of actual disease cases that are detected |
| **F1-Score** | 2(P×R)/(P+R) | Harmonic mean balancing precision & recall |
| **ROC-AUC** | Area under curve | Model's discrimination ability (0.5=random, 1.0=perfect) |

**Key Definitions:**
- **TP (True Positive):** Correctly predicted disease
- **TN (True Negative):** Correctly predicted no disease
- **FP (False Positive):** Incorrectly predicted disease (Type I error)
- **FN (False Negative):** Incorrectly predicted no disease (Type II error)

**Goal: all five metrics above 0.90 (90%)**
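
To make the formulas in the table concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 directly from the four confusion-matrix counts. The helper name `metrics_from_counts` and the counts passed to it are hypothetical, chosen only for illustration; they are not results from this project.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute the table's metrics directly from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not this project's results)
example = metrics_from_counts(tp=40, tn=45, fp=5, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, recall (0.80) is lower than precision (about 0.89), so the model misses more disease cases than it raises false alarms.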


## 🎯 CONFUSION MATRIX GUIDE

| | **Predicted No** | **Predicted Yes** |
|---|---|---|
| **Actual No** | TN (Correct) ✓ | FP (Wrong) ✗ |
| **Actual Yes** | FN (Wrong) ✗ | TP (Correct) ✓ |

**Interpretation:**
- **TN (True Negative):** Model correctly predicts no disease when patient is healthy
- **TP (True Positive):** Model correctly predicts disease when patient has disease
- **FP (False Positive):** Model incorrectly predicts disease (Type I error): the patient is healthy but flagged as diseased
- **FN (False Negative):** Model incorrectly predicts no disease (Type II error): the patient has disease but it is not detected

**Clinical Impact:**
- High FP: Unnecessary patient stress and follow-up tests
- High FN: **Dangerous**, because disease goes undiagnosed
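
The FP/FN trade-off above depends on the decision threshold applied to the predicted probabilities. The notebook itself only calls `model.predict`, so the sketch below is an illustration rather than part of the pipeline: it counts false positives and false negatives at two thresholds. The helper `fp_fn_at_threshold` and the demo labels and probabilities are made up for this example.

```python
import numpy as np


def fp_fn_at_threshold(y_true, y_proba, threshold):
    """Count false positives and false negatives for a given probability cut-off."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_proba) >= threshold).astype(int)
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # healthy, flagged as diseased
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # diseased, not detected
    return fp, fn


# Made-up labels and probabilities, for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_proba_demo = [0.10, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90]

for threshold in (0.5, 0.25):
    fp, fn = fp_fn_at_threshold(y_true_demo, y_proba_demo, threshold)
    print(f"threshold={threshold:.2f}  FP={fp}  FN={fn}")
```

On this toy data, lowering the threshold from 0.5 to 0.25 removes the one missed case but adds two false alarms. Whether that exchange is acceptable is a clinical decision, not a modelling one.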


In [None]:
# Get one test patient
new_patient = X_test.iloc[0:1]

# Predict
disease_prob = model.predict_proba(new_patient)[0, 1]
prediction = model.predict(new_patient)

# Output
print(f"Disease Probability: {disease_prob:.2%}")
print(f"Risk Level: {'HIGH' if disease_prob > 0.7 else 'MODERATE' if disease_prob > 0.3 else 'LOW'}")
# model.predict returns an array; take its single element explicitly
print(f"Prediction: {'Has Disease' if prediction[0] == 1 else 'No Disease'}")


## Phase 7: Model Evaluation & Results

This document summarizes the model evaluation and interpretation performed in `Notebooks/Phase7.ipynb`, lists the artifacts produced, and provides quick run and troubleshooting instructions.

- **Purpose:** Evaluate the tuned model on the test set using comprehensive metrics and visualizations. Generate evaluation reports, confusion matrices, ROC curves, and feature importance analysis.
- **Notebook:** `Notebooks/Phase7.ipynb`

**Produced Artifacts**
- `visualizations/14_evaluation.png`: Side-by-side ROC curve and confusion matrix heatmap.
- `visualizations/15_feature_importance.png`: Bar chart of top 15 most important features (if model supports it).
- Console output: Performance metrics (Accuracy, Precision, Recall, F1-Score, ROC-AUC), confusion matrix, classification report, and top 10 features table.

**Main Steps (high level)**
- Load preprocessed test data and the tuned model from Phase 6 (`models/best_model_tuned.pkl`).
- Generate predictions on the test set.
- Calculate key metrics:
  - **Accuracy:** Overall correctness of predictions.
  - **Precision:** % of predicted disease cases that are correct.
  - **Recall:** % of actual disease cases detected.
  - **F1-Score:** Harmonic mean of precision and recall.
  - **ROC-AUC:** Model's discrimination ability across all thresholds.
- Generate confusion matrix and classification report.
- Visualize ROC curve and confusion matrix side-by-side.
- Extract and visualize top feature importances (if available).

**How to run (PowerShell)**
1. From the project root, execute the notebook headless (example):

```powershell
python -m nbconvert --to notebook --execute "Notebooks\Phase7.ipynb" --output "Notebooks\Phase7_executed.ipynb"
```

2. Or run interactively in VS Code / Jupyter and execute cells in order.

**Metrics Reference**

| Metric | Formula | Meaning |
|--------|---------|---------|
| **Accuracy** | (TP+TN)/Total | % of correct predictions overall |
| **Precision** | TP/(TP+FP) | % of predicted disease cases that are correct |
| **Recall** | TP/(TP+FN) | % of actual disease cases that are detected |
| **F1-Score** | 2(P×R)/(P+R) | Harmonic mean balancing precision & recall |
| **ROC-AUC** | Area under curve | Model's discrimination ability (0.5=random, 1.0=perfect) |

**Confusion Matrix Guide**

| | **Predicted No** | **Predicted Yes** |
|---|---|---|
| **Actual No** | TN (Correct) ✓ | FP (Wrong) ✗ |
| **Actual Yes** | FN (Wrong) ✗ | TP (Correct) ✓ |

- **TN:** Model correctly predicts no disease when patient is healthy.
- **TP:** Model correctly predicts disease when patient has disease.
- **FP:** Model incorrectly predicts disease (Type I error).
- **FN:** Model incorrectly predicts no disease (Type II error). **Dangerous in medical context**.

**Notes & Troubleshooting**
- Missing `models/best_model_tuned.pkl`: Phase 7 loads the tuned model from Phase 6. Ensure Phase 6 has been executed successfully.
- Missing test data: Phase 7 requires `data/X_test_scaled.csv` and `data/y_test.csv` (created in Phase 4). Run Phase 4 first if these files are absent.
- `FileNotFoundError` when saving visualizations: The notebook creates the `visualizations/` directory automatically via `os.makedirs('visualizations', exist_ok=True)` before saving. If this fails, check file system permissions.
- Model doesn't have `feature_importances_`: Some models (e.g., SVM, KNN) don't expose feature importances. The notebook gracefully skips this section with a try/except block and prints a message.
- Interpretation note: In medical contexts, **minimizing False Negatives (FN)** is typically critical, because a missed disease diagnosis is usually more harmful than a false alarm.
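
For models without `feature_importances_`, one possible alternative (not used in the notebook) is scikit-learn's model-agnostic `sklearn.inspection.permutation_importance`. The sketch below is a hedged illustration: the synthetic dataset, the KNN model, and the `feature_{idx}` labels are stand-ins. In the real notebook you would pass the tuned model with `X_test` and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: replace with X_test / y_test in practice)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

# KNN is an example of a model with no feature_importances_ attribute
knn = KNeighborsClassifier().fit(X_demo, y_demo)

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(knn, X_demo, y_demo, n_repeats=5, random_state=0)

# Print features from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f}")
```

Permutation importance measures how much the score degrades when a feature is shuffled, so its values are not directly comparable to tree-based `feature_importances_`.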

**Model Performance Interpretation**
- **High Accuracy + High F1:** Good overall balance and performance.
- **High Recall, Low Precision:** Model detects most diseases but has many false alarms.
- **High Precision, Low Recall:** Model is conservative; few false alarms but misses some disease cases.
- **ROC-AUC near 1.0:** Excellent discrimination; near 0.5: random classifier.
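
One way to read the ROC-AUC value is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes that quantity by brute force on a tiny made-up example. The function name `pairwise_auc` and the demo scores are illustrative assumptions; in practice `roc_auc_score` is the tool to use.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]
print(f"AUC: {pairwise_auc(y_demo, scores_demo):.2f}")
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.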

**Next steps**
- Review the evaluation metrics and visualizations to understand model strengths/weaknesses.
- If performance is unsatisfactory, consider:
  - Adjusting hyperparameters further (Phase 6).
  - Collecting more data or engineering additional features (Phase 4).
  - Trying alternative models (Phase 5).
- For deployment, consider the trade-off between precision and recall based on clinical requirements.

**Quick validation**
After running, confirm these files exist:
- `visualizations/14_evaluation.png` ✓
- `visualizations/15_feature_importance.png` ✓ (if model supports it)
- Console output shows all 5 metrics and classification report ✓
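
The file checks above can also be scripted. This is a minimal sketch that assumes it is run from the project root. The helper `missing_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def missing_files(paths, root="."):
    """Return the subset of paths that do not exist as files under root."""
    return [p for p in paths if not (Path(root) / p).is_file()]


# Artifact paths as listed in this document
required = ["visualizations/14_evaluation.png"]
optional = ["visualizations/15_feature_importance.png"]  # only if the model supports it

print("Missing required:", missing_files(required) or "none")
print("Missing optional:", missing_files(optional) or "none")
```

The console output (metrics and classification report) still needs to be checked by eye or in the executed notebook.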

