# Comprehensive Model Evaluation

In this notebook, we will conduct a thorough evaluation of our trained Convolutional Neural Network (CNN) on the unseen test dataset. The goal is to move beyond a single accuracy score and gain a deeper understanding of the model's performance, its strengths, and its weaknesses.

We will cover:
1.  **Overall Performance:** Calculating test loss and accuracy.
2.  **Class-level Metrics:** Generating a detailed classification report with precision, recall, and F1-score.
3.  **Advanced Metrics:** Computing Top-K accuracy.
4.  **Visualizations:** Plotting a confusion matrix and ROC curves for an intuitive understanding of performance.
5.  **Qualitative Analysis:** Visualizing individual predictions and performing an error analysis to identify common misclassifications.

## 1. Setup and Imports

First, we'll import all the necessary libraries and helper functions. This includes `torch` and `torchvision` for data handling and modeling, `sklearn` for evaluation metrics, `matplotlib` for plotting, and our custom scripts for model architecture and evaluation utilities.

In [None]:
# PyTorch and Torchvision
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data science and plotting libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

# Custom helper scripts
from scripts.model_architectures import SimpleCNN
from scripts.evaluation_metrics import (
    evaluate_model,
    plot_confusion_matrix,
    plot_roc_curves,
    visualize_predictions,
    top_k_accuracy,
    plot_precision_recall_curves,  # not called below; kept for optional follow-up analysis
    plot_calibration_curve,        # not called below; kept for optional follow-up analysis
)


## 2. Data and Model Loading

Next, we will prepare the test dataset and load our best-performing model checkpoint that was saved during the training phase.

In [None]:
# --- 2.1. Load Test Dataset ---

# Define the device for computation (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define data transformations for the test set
# These should match the validation transformations to ensure consistency
test_transforms = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load the test dataset from the specified directory
test_data_dir = "data/raw/test"
test_dataset = datasets.ImageFolder(root=test_data_dir, transform=test_transforms)

# Create a DataLoader for the test set
batch_size = 64
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=2)

# Print dataset information
print(f"Test dataset loaded from {test_data_dir}")
print(f"Test set size: {len(test_dataset)}")
class_names = test_dataset.classes
print(f"Classes: {class_names}")


In [None]:
# --- 2.2. Load Trained Model ---

# Initialize the model architecture
model = SimpleCNN(num_classes=len(class_names)).to(device)

# Load the saved weights from the best model checkpoint
checkpoint = torch.load("models/best_model.pth", map_location=device)
model.load_state_dict(checkpoint["state_dict"])

# Set the model to evaluation mode
# This is crucial: it disables Dropout and makes BatchNorm use its running statistics
model.eval()

print("Best model loaded successfully and set to evaluation mode.")


## 3. Overall Performance Evaluation

We'll start by getting a high-level view of the model's performance on the entire test set using overall loss and accuracy.
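
The internals of `evaluate_model` are not shown in this notebook, so the following is only a rough NumPy sketch of the two quantities it reports. It assumes the probabilities are softmax outputs of shape `(n_samples, n_classes)` and that accuracy is expressed as a percentage; the name `loss_and_accuracy` is hypothetical. Note that `nn.CrossEntropyLoss` operates on logits, so the real helper may compute the loss differently.

```python
import numpy as np

def loss_and_accuracy(labels, probs, eps=1e-12):
    """Hypothetical helper: mean cross-entropy and accuracy (%) from class probabilities."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    # Probability the model assigned to the true class of each sample
    true_class_probs = probs[np.arange(labels.size), labels]
    # Clip to avoid log(0) when a true class received zero probability
    loss = float(-np.mean(np.log(np.clip(true_class_probs, eps, 1.0))))
    accuracy = 100.0 * float(np.mean(probs.argmax(axis=1) == labels))
    return loss, accuracy

# Toy example: three samples, two classes
toy_probs = [[0.8, 0.2], [0.4, 0.6], [0.9, 0.1]]
toy_labels = [0, 1, 1]
toy_loss, toy_acc = loss_and_accuracy(toy_labels, toy_probs)
print(f"loss={toy_loss:.4f}, accuracy={toy_acc:.2f}%")
```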

In [None]:
# Evaluate the model on the test loader
# This function returns metrics and raw predictions for further analysis
test_loss, test_accuracy, all_preds, all_labels, all_probs = evaluate_model(
    model, test_loader, nn.CrossEntropyLoss(), device
)

print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.2f}%")


## 4. Detailed Performance Analysis

Now, let's dive deeper into the model's performance with more detailed metrics and visualizations.

### 4.1. Classification Report

The classification report lists precision, recall, and F1-score for each class. This helps us identify whether the model is biased towards specific categories or struggles with them.
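
To make these metrics concrete, here is a minimal sketch of how they can be derived from a confusion matrix, assuming scikit-learn's convention that rows are true labels and columns are predictions. The function name `per_class_prf` is hypothetical; `classification_report` computes the same quantities internally.

```python
import numpy as np

def per_class_prf(cm):
    """Hypothetical helper: per-class precision, recall and F1 from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)              # correct predictions per class
    predicted = cm.sum(axis=0)    # column sums: how often each class was predicted
    actual = cm.sum(axis=1)       # row sums: how often each class truly occurs
    # Classes that were never predicted (or never occur) get 0 instead of a division error
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy two-class confusion matrix
toy_precision, toy_recall, toy_f1 = per_class_prf([[8, 2], [1, 9]])
print(toy_precision, toy_recall, toy_f1)
```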

In [None]:
print("Classification Report:")
print(classification_report(all_labels, all_preds, target_names=class_names))


### 4.2. Top-K Accuracy

Top-K accuracy measures whether the true label is among the model's `K` highest-scoring predictions. This is useful in scenarios where the second or third guess might still be contextually relevant.
- **Top-1 Accuracy:** The standard accuracy (the top prediction must be correct).
- **Top-5 Accuracy:** The true label must be in the top 5 predictions. With five or fewer classes this is trivially 100%, so it is only informative for larger label sets.
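
The custom `top_k_accuracy` helper is not shown here, so the following is only a sketch of one way the metric can be computed with NumPy. It assumes probabilities of shape `(n_samples, n_classes)` and a percentage return value; the name `top_k_accuracy_sketch` and the clipping of `k` to the number of classes are assumptions, not the helper's documented behavior.

```python
import numpy as np

def top_k_accuracy_sketch(labels, probs, k=1):
    """Hypothetical helper: percentage of samples whose true label is in the top-k scores."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    k = min(k, probs.shape[1])  # k cannot exceed the number of classes
    # Indices of the k largest scores per row (order within the top-k is irrelevant)
    top_k = np.argpartition(probs, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 100.0 * float(hits.mean())

# Toy example: three samples, three classes
toy_probs = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
print(top_k_accuracy_sketch(toy_labels, toy_probs, k=1))
print(top_k_accuracy_sketch(toy_labels, toy_probs, k=2))
```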

In [None]:
top1_acc = top_k_accuracy(all_labels, all_probs, k=1)
top5_acc = top_k_accuracy(all_labels, all_probs, k=5)

print(f"Top-1 Accuracy (Exact Match): {top1_acc:.2f}%")
print(f"Top-5 Accuracy (Correct label in top 5): {top5_acc:.2f}%")


### 4.3. Confusion Matrix

The confusion matrix provides a visual representation of the model's predictions versus the actual labels. The diagonal elements show the number of correct predictions for each class, while off-diagonal elements reveal where the model is making mistakes.
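
As a small illustration of what the matrix contains, the sketch below counts (true, predicted) pairs by hand. It follows scikit-learn's layout (rows are true labels, columns are predictions); the function name `confusion_counts` is hypothetical and integer class indices are assumed.

```python
import numpy as np

def confusion_counts(labels, preds, num_classes):
    """Hypothetical helper: cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (i, j) pair occurs several times
    np.add.at(cm, (np.asarray(labels), np.asarray(preds)), 1)
    return cm

toy_cm = confusion_counts([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], num_classes=3)
print(toy_cm)
```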

In [None]:
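import os

os.makedirs("results", exist_ok=True)  # make sure the output directory exists before saving

plt.figure(figsize=(12, 10))
plot_confusion_matrix(all_labels, all_preds, class_names)
plt.title("Confusion Matrix")
plt.savefig("results/confusion_matrix.png")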
plt.show()


### 4.4. ROC Curves (Receiver Operating Characteristic)

ROC curves plot the true positive rate against the false positive rate as the classifier's decision threshold is varied. For a multi-class problem, we plot one curve per class (one-vs-rest). A curve that bows towards the top-left corner indicates a better-performing classifier. The Area Under the Curve (AUC) summarizes this in a single number: 1.0 is a perfect ranking and 0.5 is no better than chance.
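
The one-vs-rest construction can be sketched for a single class as follows. This is an illustrative NumPy version, not the implementation behind `plot_roc_curves`; the name `roc_one_vs_rest` is hypothetical, and the sketch assumes the chosen class has at least one positive and one negative sample.

```python
import numpy as np

def roc_one_vs_rest(labels, probs, class_idx):
    """Hypothetical helper: ROC points and AUC for one class against all others."""
    y = (np.asarray(labels) == class_idx).astype(int)   # 1 for the class, 0 for the rest
    scores = np.asarray(probs, dtype=float)[:, class_idx]
    order = np.argsort(-scores, kind="stable")           # highest score first
    y, scores = y[order], scores[order]
    # Keep only the last index of each run of tied scores, so ties form one threshold
    last = np.r_[np.where(np.diff(scores))[0], y.size - 1]
    tps = np.cumsum(y)[last]          # true positives at each threshold
    fps = (last + 1) - tps            # false positives at each threshold
    tpr = np.r_[0.0, tps / y.sum()]
    fpr = np.r_[0.0, fps / (y.size - y.sum())]
    # Trapezoidal rule for the area under the curve
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

# Toy example: four samples, two classes
toy_probs = [[0.9, 0.1], [0.6, 0.4], [0.65, 0.35], [0.2, 0.8]]
toy_labels = [0, 0, 1, 1]
toy_fpr, toy_tpr, toy_auc = roc_one_vs_rest(toy_labels, toy_probs, class_idx=1)
print(f"AUC for class 1: {toy_auc:.2f}")
```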

In [None]:
plt.figure(figsize=(12, 10))
plot_roc_curves(all_labels, all_probs, class_names)
plt.title("ROC Curves (One-vs-Rest)")
plt.savefig("results/roc_curves.png")
plt.show()


## 5. Qualitative Analysis

Beyond metrics, it's insightful to look at individual examples to understand the model's behavior.

### 5.1. Visualizing Individual Predictions

Let's visualize a few sample images from the test set along with their true labels and the model's predictions. This helps us build intuition about the kinds of images the model handles well and where it fails.

In [None]:
visualize_predictions(model, test_loader, device, class_names, num_samples=10)


### 5.2. Error Analysis: Most Common Misclassifications

By analyzing the confusion matrix numerically, we can programmatically identify the most frequent errors. This can reveal systematic issues, such as confusion between visually similar classes (e.g., 'cat' vs. 'dog', or 'car' vs. 'truck').

In [None]:
# Get the raw confusion matrix from sklearn
cm = confusion_matrix(all_labels, all_preds)

# Set diagonal to zero to focus only on misclassifications
np.fill_diagonal(cm, 0)

# Sort all (true, predicted) index pairs by error count, ascending,
# so the largest errors end up at the end of the array
indices = np.column_stack(np.unravel_index(np.argsort(cm, axis=None), cm.shape))

print("Top 10 Most Common Misclassifications:")
print("=======================================")
for i, j in reversed(indices[-10:]):
    count = cm[i, j]
    if count == 0:
        continue
    print(f"'{class_names[i]}' misclassified as '{class_names[j]}': {count} times")


## 6. Saving Results and Conclusion

Finally, we save the collected evaluation data (metrics, predictions, labels, and probabilities) to a file. This allows for easy reloading and comparison with other models in the future without needing to re-run the evaluation.

In [None]:
# Consolidate results into a dictionary
results = {
    "test_loss": test_loss,
    "test_accuracy": test_accuracy,
    "predictions": all_preds,
    "labels": all_labels,
    "probabilities": all_probs,
    "class_names": class_names,
    "classification_report": classification_report(
        all_labels, all_preds, target_names=class_names, output_dict=True
    ),
}

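# Save the results dictionary to a .npy file.
# NumPy pickles the dict into a 0-d object array, so reload it with
# np.load(path, allow_pickle=True).item()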
np.save("results/evaluation_results.npy", results, allow_pickle=True)

print("Evaluation complete. Results and visualizations have been saved to the 'results/' directory.")
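
For later comparison with other models, the saved file can be read back as a dictionary. The round trip below is a self-contained sketch that uses a temporary directory and a toy stand-in for `results` (its values are placeholders, not real evaluation output).

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the `results` dictionary (placeholder values only)
toy_results = {
    "test_accuracy": 90.0,
    "class_names": ["cat", "dog"],
    "predictions": np.array([0, 1, 1]),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "evaluation_results.npy")
    np.save(path, toy_results, allow_pickle=True)
    # allow_pickle=True is required because the dict is stored as a pickled object;
    # .item() unwraps the 0-d object array back into a plain dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded["test_accuracy"], loaded["class_names"])
```

Because loading pickled data can execute arbitrary code, only load files from sources you trust.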
