# 5. LeNet-5 Model Evaluation on Test Data

After training our LeNet-5 model, the next step is to evaluate it on the held-out test set, which estimates how well the model generalizes to data it has never seen. We will apply the model, store its predictions in FiftyOne, and then use FiftyOne's evaluation tools to analyze the results in detail.

**Key concepts covered:**
*   Applying a PyTorch model to a FiftyOne dataset
*   Storing predictions, confidence, and logits
*   Evaluating classification performance
*   Analyzing prediction confidence distributions
*   Computing sample hardness and mistakenness

## Setup

We need to reload our datasets and redefine our model architecture and helper classes to apply the trained model.

In [None]:
import os
from PIL import Image
import numpy as np
from tqdm import tqdm
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as Fun
import torchvision.transforms.v2 as transforms
from torch.utils.data import Dataset

import fiftyone as fo
import fiftyone.brain as fob
from fiftyone import ViewField as F

# Redefine the model architecture so we can load the weights
class ModernLeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.conv3 = nn.Conv2d(16, 120, kernel_size=4)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(120, 84)
        self.fc2 = nn.Linear(84, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = self.pool(Fun.relu(self.conv1(x)))
        x = self.pool(Fun.relu(self.conv2(x)))
        x = Fun.relu(self.conv3(x))
        x = x.view(x.size(0), -1)
        x = Fun.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Redefine the custom dataset class
class CustomTorchImageDataset(Dataset):
    def __init__(self, fiftyone_dataset, image_transforms=None, label_map=None, gt_field="ground_truth"):
        self.fiftyone_dataset = fiftyone_dataset
        self.image_paths = self.fiftyone_dataset.values("filepath")
        self.str_labels = self.fiftyone_dataset.values(f"{gt_field}.label")
        self.image_transforms = image_transforms
        self.label_map = label_map if label_map is not None else {str(i): i for i in range(10)}

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image_path = self.image_paths[idx]
        image = Image.open(image_path).convert('L')
        if self.image_transforms:
            image = self.image_transforms(image)
        label_str = self.str_labels[idx]
        label_idx = self.label_map.get(label_str, -1)
        return image, torch.tensor(label_idx, dtype=torch.long)
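
The layer sizes in `ModernLeNet5` only line up for one input resolution. The sketch below traces the spatial size through the network in plain Python, assuming 28x28 MNIST images, no padding, and stride-1 convolutions (the `nn.Conv2d` defaults). The helper names `conv_out` and `lenet_flat_features` are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square conv or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet_flat_features(input_side=28):
    """Trace the side length through ModernLeNet5 and return the flattened size."""
    side = conv_out(input_side, kernel=5)    # conv1
    side = conv_out(side, kernel=2, stride=2)  # max pool
    side = conv_out(side, kernel=5)          # conv2
    side = conv_out(side, kernel=2, stride=2)  # max pool
    side = conv_out(side, kernel=4)          # conv3
    return 120 * side * side                 # conv3 has 120 output channels


print(lenet_flat_features(28))  # 120, matching nn.Linear(120, 84)
```

With a 28x28 input the feature map shrinks to 1x1 after `conv3`, which is why `fc1` expects exactly 120 inputs. A different input size would change the flattened length and break the `x.view(...)` / `fc1` pairing.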

### Reload the Best LeNet Model

We load the saved model weights that achieved the best validation performance during training. This ensures we are evaluating the most generalizable version of our model.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model_save_path = Path(os.getcwd()) / 'best_lenet.pth'

loaded_model = ModernLeNet5().to(device)
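# weights_only=True restricts unpickling to tensors and plain containers, which is all a state_dict needs
loaded_model.load_state_dict(torch.load(model_save_path, map_location=device, weights_only=True))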
loaded_model.eval()

print(f"Model loaded successfully from {model_save_path}")

## Apply the Model to the Test Set

We'll now run inference on the entire test set. We'll collect the predictions, confidence scores, and raw logits for each sample and store them back into our FiftyOne dataset. Storing predictions as structured `fo.Classification` objects allows for rich, interactive analysis.

In [None]:
# Load datasets
test_dataset = fo.load_dataset("mnist-test-set")
train_dataset = fo.load_dataset("mnist-training-set")

# Normalize with the MNIST training-set mean and std (hard-coded here rather than recomputed)
mean_intensity, std_intensity = 0.1307, 0.3081
image_transforms = transforms.Compose([
    transforms.ToImage(),
    transforms.ToDtype(torch.float32, scale=True),
    transforms.Normalize((mean_intensity,), (std_intensity,))
])
dataset_classes = sorted(test_dataset.distinct("ground_truth.label"))
label_map = {label: i for i, label in enumerate(dataset_classes)}

# Create DataLoader for the test set
torch_test_set = CustomTorchImageDataset(test_dataset, image_transforms=image_transforms, label_map=label_map)
# Keep shuffle=False so predictions stay aligned with the dataset's sample order
test_loader = torch.utils.data.DataLoader(torch_test_set, batch_size=64, shuffle=False, num_workers=os.cpu_count())

# Run inference
predictions, all_logits = [], []
with torch.inference_mode():
    for images, _ in tqdm(test_loader, desc="Applying model to test set"):
        images = images.to(device)
        logits = loaded_model(images)
        all_logits.append(logits.cpu().numpy())
        predicted = logits.argmax(dim=1)
        predictions.extend(predicted.cpu().numpy())

all_logits = np.concatenate(all_logits, axis=0)

# Store predictions back in FiftyOne
with fo.ProgressBar(total=len(test_dataset), desc="Storing predictions") as pb:
    for i, sample in enumerate(test_dataset):
        pred_idx = predictions[i]
        sample_logits = all_logits[i]
        conf = float(Fun.softmax(torch.tensor(sample_logits), dim=0).numpy()[pred_idx])
        sample["lenet_classification"] = fo.Classification(
            label=dataset_classes[pred_idx],
            confidence=conf,
            logits=sample_logits.tolist()
        )
        sample.save()
        pb.update()
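
The confidence stored above is the softmax probability of the predicted class. As a dependency-free illustration of that step, the sketch below converts one sample's logits into a predicted index and a confidence value. The helper names `softmax` and `predict_with_confidence` and the example logits are hypothetical, not part of the notebook.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits):
    """Return (index of the largest probability, that probability)."""
    probs = softmax(logits)
    pred_idx = max(range(len(probs)), key=probs.__getitem__)
    return pred_idx, probs[pred_idx]


# Made-up logits for a three-class problem
pred_idx, conf = predict_with_confidence([2.0, 1.0, 0.1])
print(pred_idx, round(conf, 3))  # 0 0.659
```

Because softmax is monotonic, the argmax of the probabilities is the same as the argmax of the logits, so the predicted label does not depend on whether it is taken before or after normalization.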

## Evaluating LeNet's Performance

With predictions stored, we can use `evaluate_classifications()` again to get a performance report and confusion matrix for our custom LeNet model. We expect a significant improvement over CLIP's zero-shot performance.

In [None]:
session = fo.launch_app(test_dataset)
lenet_evaluation_results = test_dataset.evaluate_classifications(
    "lenet_classification",
    gt_field="ground_truth",
    eval_key="lenet_eval")

session.refresh()

In [None]:
lenet_evaluation_results.print_report(digits=3)
lenet_evaluation_results.plot_confusion_matrix()

The accuracy should exceed 99%, a large improvement over CLIP's roughly 88% zero-shot accuracy. The confusion matrix is also much cleaner, with nearly all counts on the diagonal.
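
To make the two headline numbers concrete, here is a minimal sketch of how accuracy and confusion counts follow from paired label lists. It does not reproduce FiftyOne's implementation; the helper names and the toy labels are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair; diagonal entries are correct predictions."""
    return Counter(zip(y_true, y_pred))


# Toy labels, not real results
y_true = ["3", "5", "8", "3", "9"]
y_pred = ["3", "5", "3", "3", "4"]

print(accuracy(y_true, y_pred))                    # 0.6
print(confusion_counts(y_true, y_pred)[("8", "3")])  # 1: one "8" was predicted as "3"
```

In the notebook itself, the label lists could presumably be pulled with `test_dataset.values("ground_truth.label")` and `test_dataset.values("lenet_classification.label")`, though `print_report()` and `plot_confusion_matrix()` already cover this.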

## Hardness and Mistakenness

Beyond accuracy, we can analyze the model's logits to understand sample-level difficulties.

- **Hardness**: Measures the model's prediction uncertainty. High hardness indicates samples the model found difficult, which are often edge cases.
- **Mistakenness**: Estimates how likely a ground-truth label is to be wrong, based on cases where the model confidently disagrees with it. High mistakenness often points to labeling errors in the dataset.

We use `fiftyone.brain` to compute these values.

In [None]:
fob.compute_hardness(test_dataset, label_field='lenet_classification')

fob.compute_mistakenness(test_dataset, 
                         pred_field="lenet_classification",
                         label_field="ground_truth")

session.refresh()
print("Hardness and mistakenness computed.")
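
For intuition about what an uncertainty score over logits looks like, the sketch below uses the Shannon entropy of the softmax distribution. This is an assumption about the general idea, and FiftyOne's exact hardness formula may differ. The name `entropy_hardness` and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_hardness(logits):
    """Shannon entropy (in nats) of the softmax distribution; higher means more uncertain."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = entropy_hardness([10.0, 0.0, 0.0])  # one class dominates
uncertain = entropy_hardness([1.0, 1.0, 1.0])   # all classes tied
print(confident < uncertain)  # True
```

A sample whose logits are nearly tied gets the maximum entropy, `log(num_classes)`, while a sample with one dominant logit scores close to zero.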

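Let's view the samples with the highest mistakenness scores. The view below keeps only samples above the 99th percentile of mistakenness, sorted from highest to lowest; these are the prime candidates for mislabeled data.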

In [None]:
mistakenness_quantiles = test_dataset.quantiles("mistakenness", [0.99])

suspicious_test_samples_view = test_dataset.match(
                             F("mistakenness") > mistakenness_quantiles[-1]
                             ).sort_by("mistakenness", reverse=True)

session.view = suspicious_test_samples_view
print(f"Displaying {len(suspicious_test_samples_view)} most mistaken samples in the App: {session.url}")
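
The filter above boils down to "compute a high quantile, keep everything strictly above it, sort descending". Here is a plain-Python sketch of that logic using a nearest-rank quantile. FiftyOne's `quantiles()` may compute the cutoff differently, and the function names and scores are made up for illustration.

```python
import math


def nearest_rank_quantile(values, q):
    """Nearest-rank quantile: the smallest value with at least a fraction q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def top_suspects(scores, q):
    """Return (index, score) pairs strictly above the q-quantile, highest score first."""
    threshold = nearest_rank_quantile(scores, q)
    above = [(i, s) for i, s in enumerate(scores) if s > threshold]
    return sorted(above, key=lambda pair: pair[1], reverse=True)


# Made-up mistakenness scores for ten samples
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
print(top_suspects(scores, 0.9))  # [(9, 0.95)]
```

Because the comparison is strict, samples tied with the cutoff are excluded, so the view can contain fewer than 1% of the samples when many scores are identical.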

## Next Steps

We've thoroughly evaluated our LeNet model on the test set. But what did the model actually *learn*? 

In the next notebook, we'll dive into the model's internal representations by extracting embeddings from its hidden layers on the *training data*. This will help us understand the features it learned and analyze the training set for quality issues.

Proceed to `6_lenet_feature_analysis.ipynb`.