# Phase 5.2: Custom Metrics and Comprehensive Model Evaluation

This notebook demonstrates:
1. **Logging Comprehensive Metrics** - Beyond accuracy
2. **Per-Class Metrics** - Evaluate each class separately
3. **Cross-Validation Scores** - Robust evaluation
4. **Visualization Artifacts** - Confusion matrices, feature importance
5. **Evaluation Reports** - JSON summaries

## Why Go Beyond Accuracy?

Accuracy alone can be misleading! Consider:
- **Imbalanced classes**: A model that always predicts the majority class scores 95% accuracy when 95% of the data belongs to that class
- **Cost of errors**: False negatives might be worse than false positives
- **Per-class performance**: Some classes might be harder to predict
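
The imbalanced-class case can be checked with a few lines of arithmetic. The sketch below uses made-up labels (95 negatives, 5 positives) and a hypothetical baseline that always predicts the majority class; none of it comes from the Iris data used later in this notebook.

```python
# Illustrative labels (assumed, not from the notebook's dataset):
# 95 samples of class 0 and 5 samples of class 1.
y_true = [0] * 95 + [1] * 5

# Hypothetical baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class 1: fraction of actual positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall_positive = true_positives / actual_positives

print(f"accuracy: {accuracy:.2f}")                 # accuracy: 0.95
print(f"recall (class 1): {recall_positive:.2f}")  # recall (class 1): 0.00
```

The baseline looks strong on accuracy while never finding a single positive sample, which is the gap the additional metrics are meant to expose.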

## Metrics We'll Log

| Metric | What It Measures |
|--------|------------------|
| Accuracy | Overall correctness |
| Precision | Quality of positive predictions |
| Recall | Coverage of actual positives |
| F1 Score | Balance of precision & recall |
| AUC-ROC | How well predicted scores rank positives above negatives |
| CV Scores | Robustness across data splits |
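
As a worked example of the precision, recall, and F1 rows above, the sketch below computes all three from raw confusion counts. The helper name `precision_recall_f1` and the counts are assumptions for illustration; undefined ratios fall back to 0.0, mirroring the `zero_division=0` setting used later in this notebook.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts; 0.0 when undefined."""
    # Precision: of everything predicted positive, how much was correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```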

## Learning Goals
- Log comprehensive metrics manually
- Create per-class metrics
- Generate visualization artifacts
- Build evaluation reports

## Step 1: Import Libraries

In [None]:
# MLflow imports
import mlflow
import mlflow.sklearn

# sklearn imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score
)
from sklearn.preprocessing import label_binarize

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data handling
import pandas as pd
import numpy as np
import json
import os
import shutil

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")
print("Ready to learn about custom metrics!")

## Step 2: Connect to MLflow

In [None]:
# Get MLflow tracking server URL
TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")

# Connect to MLflow
mlflow.set_tracking_uri(TRACKING_URI)
mlflow.set_experiment("phase5-custom-metrics")

# Disable autolog for manual control
mlflow.sklearn.autolog(disable=True)

print(f"Connected to MLflow at: {TRACKING_URI}")
print("Experiment: phase5-custom-metrics")
print("Autolog disabled (we'll log manually for full control)")

## Step 3: Load Data and Create Temp Directory

In [None]:
# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

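# Split data; stratify so every class keeps its share of the small test set,
# which keeps the per-class metrics below comparable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)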

# Create temp directory for artifacts
TEMP_DIR = "temp_eval"
os.makedirs(TEMP_DIR, exist_ok=True)

print("Data loaded!")
print(f"Training: {len(X_train)} samples")
print(f"Testing: {len(X_test)} samples")
print(f"Classes: {list(iris.target_names)}")

## Step 4: Comprehensive Model Evaluation

Now let's train a model and log EVERYTHING we might want to track.

In [None]:
print("="*60)
print("Comprehensive Model Evaluation")
print("="*60)

# Start MLflow run
with mlflow.start_run(run_name="comprehensive-evaluation"):
    
    # =============================================
    # SECTION 1: Train Model
    # =============================================
    print("\n[1] Training model...")
    
    model = RandomForestClassifier(
        n_estimators=100, 
        max_depth=10, 
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Get predictions
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)  # Probability predictions
    
    print("    Model trained!")
    
    # =============================================
    # SECTION 2: Log Parameters
    # =============================================
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("random_state", 42)
    mlflow.log_param("train_size", len(X_train))
    mlflow.log_param("test_size", len(X_test))
    
    print("\n[2] Logging basic metrics...")

In [None]:
    # =============================================
    # SECTION 3: Log Basic Metrics
    # =============================================
    
    # Calculate metrics
    metrics = {
        # Basic accuracy
        "accuracy": accuracy_score(y_test, y_pred),
        
        # Precision - weighted and macro averages
        "precision_weighted": precision_score(y_test, y_pred, average="weighted"),
        "precision_macro": precision_score(y_test, y_pred, average="macro"),
        
        # Recall - weighted and macro averages
        "recall_weighted": recall_score(y_test, y_pred, average="weighted"),
        "recall_macro": recall_score(y_test, y_pred, average="macro"),
        
        # F1 Score - weighted and macro averages
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
    }
    
    # Log each metric
    for name, value in metrics.items():
        mlflow.log_metric(name, value)
        print(f"    {name}: {value:.4f}")

In [None]:
    # =============================================
    # SECTION 4: Log Per-Class Metrics
    # =============================================
    print("\n[3] Logging per-class metrics...")
    
    for i, class_name in enumerate(iris.target_names):
        # Create binary labels for this class (one-vs-rest)
        y_true_binary = (y_test == i).astype(int)
        y_pred_binary = (y_pred == i).astype(int)
        
        # Calculate per-class metrics
        precision = precision_score(y_true_binary, y_pred_binary, zero_division=0)
        recall = recall_score(y_true_binary, y_pred_binary, zero_division=0)
        f1 = f1_score(y_true_binary, y_pred_binary, zero_division=0)
        
        # Log with class name in metric name
        mlflow.log_metric(f"precision_{class_name}", precision)
        mlflow.log_metric(f"recall_{class_name}", recall)
        mlflow.log_metric(f"f1_{class_name}", f1)
        
        print(f"    {class_name}: P={precision:.3f}, R={recall:.3f}, F1={f1:.3f}")

In [None]:
    # =============================================
    # SECTION 5: Cross-Validation Scores
    # =============================================
    print("\n[4] Cross-validation scores...")
    
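    # 5-fold cross-validation on ALL data. cross_val_score clones the estimator
    # and refits it on each fold, so the fitted `model` above is left unchanged.
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")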
    
    # Log CV statistics
    mlflow.log_metric("cv_mean", cv_scores.mean())
    mlflow.log_metric("cv_std", cv_scores.std())
    mlflow.log_metric("cv_min", cv_scores.min())
    mlflow.log_metric("cv_max", cv_scores.max())
    
    print(f"    CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    print(f"    CV Range: [{cv_scores.min():.4f}, {cv_scores.max():.4f}]")

In [None]:
    # =============================================
    # SECTION 6: AUC-ROC Score (Multiclass)
    # =============================================
    print("\n[5] AUC-ROC scores...")
    
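    # One-hot encode the labels so each column is a one-vs-rest target;
    # model.classes_ keeps the column order aligned with predict_proba
    y_test_bin = label_binarize(y_test, classes=model.classes_)
    
    # Calculate AUC-ROC: one AUC per class column, then averaged
    auc_weighted = roc_auc_score(y_test_bin, y_proba, average="weighted", multi_class="ovr")
    auc_macro = roc_auc_score(y_test_bin, y_proba, average="macro", multi_class="ovr")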
    
    mlflow.log_metric("auc_roc_weighted", auc_weighted)
    mlflow.log_metric("auc_roc_macro", auc_macro)
    
    print(f"    AUC-ROC (weighted): {auc_weighted:.4f}")
    print(f"    AUC-ROC (macro): {auc_macro:.4f}")

In [None]:
    # =============================================
    # SECTION 7: Create Visualization Artifacts
    # =============================================
    print("\n[6] Creating visualization artifacts...")
    
    # --- Confusion Matrix ---
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(
        cm, annot=True, fmt="d", cmap="Blues",
        xticklabels=iris.target_names,
        yticklabels=iris.target_names,
        ax=ax
    )
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title("Confusion Matrix")
    
    # Save and log
    cm_path = f"{TEMP_DIR}/confusion_matrix.png"
    fig.savefig(cm_path, dpi=100, bbox_inches="tight")
    plt.close()
    mlflow.log_artifact(cm_path)
    print("    Logged: confusion_matrix.png")

In [None]:
    # --- Feature Importance Plot ---
    importance = pd.Series(
        model.feature_importances_,
        index=iris.feature_names
    ).sort_values(ascending=True)
    
    fig, ax = plt.subplots(figsize=(10, 6))
    importance.plot(kind="barh", ax=ax, color="steelblue")
    ax.set_xlabel("Importance")
    ax.set_title("Feature Importance")
    ax.grid(axis="x", alpha=0.3)
    
    # Save and log
    fi_path = f"{TEMP_DIR}/feature_importance.png"
    fig.savefig(fi_path, dpi=100, bbox_inches="tight")
    plt.close()
    mlflow.log_artifact(fi_path)
    print("    Logged: feature_importance.png")

In [None]:
    # --- Prediction Distribution Plot ---
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Count every class, even one that never appears in a series, so the
    # three tick labels always line up with three bars
    class_ids = range(len(iris.target_names))
    
    # Actual distribution
    pd.Series(y_test).value_counts().reindex(class_ids, fill_value=0).plot(
        kind="bar", ax=axes[0], color="steelblue"
    )
    axes[0].set_xticklabels(iris.target_names, rotation=45)
    axes[0].set_title("Actual Distribution")
    axes[0].set_ylabel("Count")
    
    # Predicted distribution
    pd.Series(y_pred).value_counts().reindex(class_ids, fill_value=0).plot(
        kind="bar", ax=axes[1], color="coral"
    )
    axes[1].set_xticklabels(iris.target_names, rotation=45)
    axes[1].set_title("Predicted Distribution")
    axes[1].set_ylabel("Count")
    
    fig.tight_layout()
    
    # Save and log
    dist_path = f"{TEMP_DIR}/distribution.png"
    fig.savefig(dist_path, dpi=100, bbox_inches="tight")
    plt.close()
    mlflow.log_artifact(dist_path)
    print("    Logged: distribution.png")

In [None]:
    # =============================================
    # SECTION 8: Create JSON Reports
    # =============================================
    
    # --- Classification Report ---
    report = classification_report(
        y_test, y_pred,
        target_names=iris.target_names,
        output_dict=True
    )
    
    report_path = f"{TEMP_DIR}/classification_report.json"
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)
    mlflow.log_artifact(report_path)
    print("    Logged: classification_report.json")
    
    # --- Model Summary ---
    summary = {
        "model_type": "RandomForestClassifier",
        "n_estimators": 100,
        "max_depth": 10,
        "accuracy": float(metrics["accuracy"]),
        "f1_weighted": float(metrics["f1_weighted"]),
        "cv_mean": float(cv_scores.mean()),
        "cv_std": float(cv_scores.std()),
        "feature_importance": importance.to_dict(),
        "classes": list(iris.target_names),
    }
    
    summary_path = f"{TEMP_DIR}/model_summary.json"
    with open(summary_path, "w") as f:
        json.dump(summary, f, indent=2)
    mlflow.log_artifact(summary_path)
    print("    Logged: model_summary.json")

In [None]:
    # =============================================
    # SECTION 9: Log Model
    # =============================================
    print("\n[7] Logging model...")
    
    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature)
    print("    Logged: model")

print("\n" + "="*60)
print("Comprehensive evaluation complete!")
print("="*60)

## Step 5: Clean Up

In [None]:
# Clean up temp directory (ignore_errors lets this cell be re-run safely)
shutil.rmtree(TEMP_DIR, ignore_errors=True)
print(f"Cleaned up: {TEMP_DIR}/ directory removed")
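
An alternative to manual cleanup, sketched under the assumption that artifact files only need to exist until they have been logged: `tempfile.TemporaryDirectory` removes the directory when its block exits, even if an earlier step raises. The helper `write_report` and the payload are hypothetical placeholders, not part of this notebook.

```python
import json
import os
import tempfile


def write_report(directory: str, name: str, payload: dict) -> str:
    """Write `payload` as JSON into `directory` and return the file path."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return path


with tempfile.TemporaryDirectory(prefix="eval_") as tmp_dir:
    # Placeholder payload; a real run would pass the computed summary here.
    report_path = write_report(tmp_dir, "model_summary.json", {"accuracy": 0.9})
    existed_inside = os.path.exists(report_path)
    # A call such as mlflow.log_artifact(report_path) would go here,
    # while the file still exists.

# The directory and its contents are gone once the block exits.
print(existed_inside, os.path.exists(tmp_dir))  # True False
```

The trade-off is that every logging call has to sit inside the `with` block, so this pattern suits a single evaluation cell better than artifacts spread across many cells.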

## Summary: What We Logged

### Metrics Logged

| Category | Metrics |
|----------|----------|
| Basic | `accuracy` |
| Precision | `precision_weighted`, `precision_macro`, `precision_<class>` |
| Recall | `recall_weighted`, `recall_macro`, `recall_<class>` |
| F1 | `f1_weighted`, `f1_macro`, `f1_<class>` |
| AUC-ROC | `auc_roc_weighted`, `auc_roc_macro` |
| Cross-Val | `cv_mean`, `cv_std`, `cv_min`, `cv_max` |
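
The `_macro` and `_weighted` variants in the table above differ only in how per-class values are combined. The sketch below shows both averages for recall, using assumed per-class values and deliberately uneven class counts so that the two results diverge; the numbers are illustrative, not results from this notebook.

```python
# Assumed per-class recall and support (number of true samples per class).
per_class_recall = {"setosa": 1.0, "versicolor": 0.5, "virginica": 0.9}
support = {"setosa": 10, "versicolor": 2, "virginica": 18}

# Macro average: every class counts equally, regardless of size.
macro = sum(per_class_recall.values()) / len(per_class_recall)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted = sum(per_class_recall[c] * support[c] for c in per_class_recall) / total

print(f"macro recall:    {macro:.3f}")     # macro recall:    0.800
print(f"weighted recall: {weighted:.3f}")  # weighted recall: 0.907
```

The small, poorly predicted class pulls the macro average down much more than the weighted one. For recall specifically, the weighted average equals overall accuracy, so `recall_weighted` and `accuracy` carry the same value; the macro variants are the ones that reveal a weak minority class.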

### Artifacts Logged

| Artifact | Purpose |
|----------|----------|
| confusion_matrix.png | Visualize predictions vs actuals |
| feature_importance.png | Understand model decisions |
| distribution.png | Compare actual vs predicted |
| classification_report.json | Detailed per-class metrics |
| model_summary.json | Overall model information |
| model | The trained model itself |
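
A sketch of how a JSON report like `classification_report.json` could be consumed once the file has been retrieved from a run. The inline JSON is a stand-in with made-up numbers whose layout follows `classification_report(..., output_dict=True)`; the helper `weakest_class` is a hypothetical name introduced for illustration.

```python
import json

# Stand-in for the contents of classification_report.json (numbers are made up).
report_json = """
{
  "setosa":       {"precision": 1.0, "recall": 1.0, "f1-score": 1.0,  "support": 10},
  "versicolor":   {"precision": 0.9, "recall": 0.8, "f1-score": 0.85, "support": 10},
  "virginica":    {"precision": 0.8, "recall": 0.9, "f1-score": 0.85, "support": 10},
  "accuracy": 0.9,
  "macro avg":    {"precision": 0.9, "recall": 0.9, "f1-score": 0.9,  "support": 30},
  "weighted avg": {"precision": 0.9, "recall": 0.9, "f1-score": 0.9,  "support": 30}
}
"""


def weakest_class(report: dict, metric: str = "recall") -> tuple[str, float]:
    """Return the (class name, value) pair with the lowest value of `metric`."""
    # These keys hold aggregate rows rather than individual classes.
    summary_keys = {"accuracy", "macro avg", "weighted avg"}
    per_class = {
        name: row[metric]
        for name, row in report.items()
        if name not in summary_keys
    }
    name = min(per_class, key=per_class.get)
    return name, per_class[name]


report = json.loads(report_json)
name, value = weakest_class(report)
print(f"weakest class by recall: {name} ({value:.2f})")
# weakest class by recall: versicolor (0.80)
```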

### Best Practices

1. **Log multiple metric types** - Accuracy isn't enough
2. **Include per-class metrics** - Identify weak classes
3. **Use cross-validation** - More robust than a single train/test split
4. **Create visualizations** - Easier to interpret
5. **Save JSON reports** - Programmatic access to results
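
To make the cross-validation point concrete, the sketch below reduces a list of fold scores to the four summary statistics this notebook logs. The fold accuracies are assumed values for illustration, and the standard library stands in for the NumPy calls used earlier.

```python
import statistics

# Assumed fold accuracies from a 5-fold run (illustrative values only).
fold_scores = [0.97, 1.00, 0.93, 0.97, 0.90]

cv_summary = {
    "cv_mean": statistics.fmean(fold_scores),
    # pstdev is the population standard deviation (divides by n),
    # which matches the default behavior of NumPy's .std().
    "cv_std": statistics.pstdev(fold_scores),
    "cv_min": min(fold_scores),
    "cv_max": max(fold_scores),
}

for key, value in cv_summary.items():
    print(f"{key}: {value:.4f}")
```

A single split would have reported just one of these five numbers, anywhere between 0.90 and 1.00 in this made-up run; the mean together with the spread gives a steadier picture.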

In [None]:
print("="*60)
print("Custom Metrics Tutorial Complete!")
print("="*60)
print(f"\nView at: {TRACKING_URI}")
print("\nWhat you learned:")
print("  1. How to log comprehensive metrics")
print("  2. How to create per-class metrics")
print("  3. How to log cross-validation scores")
print("  4. How to create visualization artifacts")
print("  5. How to create JSON evaluation reports")
print("\nCheck the MLflow UI to see all the logged metrics and artifacts!")