# OOD Detection
The purpose of this lab project is to deepen our understanding of OOD detection. After completing the lab project, you should be able to:
- Code different OOD score functions and use them for OOD detection.
- Perform benchmarking experiments involving different OOD score functions and different metrics.
- Visualize OOD detection results and check for common mistakes in OOD detection experiments.

As usual, we start by importing the necessary libraries.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
from sklearn.metrics import roc_auc_score, precision_recall_curve
import numpy as np
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## 1. Data
The ultimate purpose of this notebook is to perform a benchmarking experiment in order to compare multiple OOD scores and OOD detection algorithms. To that end, we will use three different data sets:
1. The **Cifar-10 train** dataset in order to train a simple convolutional neural network for the task of image classification.
2. The **Cifar-10 test** set as the *in-distribution* dataset (i.e. the dataset of normal examples), for evaluating the different OOD scores.
3. (A subset of) The **SVHN test** set as the *out-of-distribution* dataset (i.e. the dataset of anomalous examples), for evaluating the different OOD scores.



In [None]:
# Data loading and preprocessing
batch_size = 128

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

cifar_train = datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
cifar_test = datasets.CIFAR10(root='./data', train=False, transform=transform, download=True)
svhn_test = datasets.SVHN(root='./data', split='test', transform=transform, download=True)

# Extract 10_000 random images from the svhn_test set
svhn_test, _ = torch.utils.data.random_split(svhn_test, [10_000, len(svhn_test) - 10_000])

train_loader = DataLoader(cifar_train, batch_size=batch_size, shuffle=True)
cifar_test_loader = DataLoader(cifar_test, batch_size=batch_size, shuffle=False)
svhn_test_loader = DataLoader(svhn_test, batch_size=batch_size, shuffle=False)

Files already downloaded and verified
Files already downloaded and verified
Using downloaded and verified file: ./data/test_32x32.mat


In [None]:
print(f"Number of training samples: {len(cifar_train)}")
print(f"Number of test samples: {len(cifar_test)}")
print(f"Number of SVHN test samples: {len(svhn_test)}")

Number of training samples: 50000
Number of test samples: 10000
Number of SVHN test samples: 10000


## 2. CNN Classifier
We will first train a CNN Classifier on the Cifar-10 training data, for the task of classifying the Cifar-10 images.

The architecture of the CNN should be:
- A convolutional layer with 32 filters, kernel size 3, stride 1 and padding 1.
- A ReLU activation
- A max pooling layer with kernel size 2.
- A convolutional layer with 64 filters, kernel size 3, stride 1 and padding 1.
- A ReLU activation
- A max pooling layer with kernel size 2.
- A fully connected layer with 128 neurons.
- A ReLU activation (the activations after this layer will be called the "features of the penultimate layer").
- A fully connected layer with 10 neurons.

This CNN will output the logit values.
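
To size the first fully connected layer, it helps to track the spatial resolution through the network. The sketch below does that arithmetic, assuming 32x32 inputs and a max pooling stride equal to its kernel size (the default of `nn.MaxPool2d`). The helper name `conv_out` is hypothetical and not part of the lab code.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                               # CIFAR-10 images are 32x32
size = conv_out(size, kernel=3, stride=1, padding=1)    # first convolution  -> 32
size = conv_out(size, kernel=2, stride=2)               # first max pooling  -> 16
size = conv_out(size, kernel=3, stride=1, padding=1)    # second convolution -> 16
size = conv_out(size, kernel=2, stride=2)               # second max pooling -> 8

flat_features = 64 * size * size    # 64 channels after the second convolution
print(size, flat_features)          # 8 4096
```

Under these assumptions, the first fully connected layer maps 4096 flattened inputs to 128 features.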

**Exercise.** Define a CNN with the above architecture by implementing the `__init__` and `forward` methods below. Bear in mind that some of the OOD scores we will define require access to the features of the penultimate layer.
- Add a `return_features` argument to the `forward` method, defaulting to `False`. If `return_features` is set to `True`, the `forward` method should return the features of the penultimate layer instead of the logit values.

In [None]:
# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        # TODO: define the necessary layers

    def forward(self, x, return_features=False):
        # TODO: apply the different layers to x in the correct order

## 3. Training
We will train the above CNN on the Cifar-10 training set.

**Exercise.** Train the CNN:
- For 5 epochs
- Using a learning rate of 0.001
- Choose an appropriate loss function
- Using the Adam optimizer
- Print the mean loss of the epoch at the end of each epoch.
- *Optional.* You can choose to monitor the training by printing the train/test accuracy too.

In [None]:
model = SimpleCNN().to(device)

In [None]:
# Hyper-parameters
# TODO: Set the number of epochs and the learning rate

# Loss and optimizer
# TODO: Set the loss function and the optimizer

# Training loop
def train_model():
    # TODO: Implement the training loop

In [None]:
train_model()

Epoch [1/5], Loss: 1.4152
Epoch [2/5], Loss: 1.0395


**Exercise.** Print the test loss and the test accuracy after training.

In [None]:
# TODO: print test loss and accuracy

## 4. OOD Metrics
The objective of this section is to define the different OOD metrics studied during the lectures. Recall that we have seen two kinds of metrics:
1. Fixed-threshold metrics.
2. Threshold-independent metrics.

### 4.1. Fixed-threshold metrics
We start by defining the metrics for OOD detectors with a fixed threshold. The inputs to all of our metrics below will be:
- The `scores_negatives` numpy array: an array containing the scores for the ground truth negative images (i.e. the Cifar-10 test images).
- The `scores_positives` numpy array: an array containing the scores for the ground truth positive images (i.e. the SVHN test images).
- The `threshold` floating point number: the threshold value $\tau$ such that our OOD detector classifies examples according to their score $s$ as follows:
$$\begin{cases}
s \leq \tau\quad &\Rightarrow\quad \text{negative}\\
s > \tau\quad &\Rightarrow\quad \text{positive}
\end{cases}$$
- Any other parameters necessary for the metric in question.
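
As a quick illustration of the decision rule above, here is a minimal sketch on made-up scores (the numbers are not real results). It only shows how the strict inequality $s > \tau$ translates into boolean masks. The metric functions of the exercise remain to be written.

```python
import numpy as np

# Made-up scores, for illustration only
scores_negatives = np.array([0.1, 0.4, 0.7])   # in-distribution examples
scores_positives = np.array([0.3, 0.8, 0.9])   # OOD examples
threshold = 0.5

# An example is flagged as positive (OOD) when its score is strictly above the threshold
flagged_negatives = scores_negatives > threshold   # [False, False, True]
flagged_positives = scores_positives > threshold   # [False, True, True]

print(int(flagged_negatives.sum()), "negative(s) wrongly flagged")    # 1
print(int(flagged_positives.sum()), "positive(s) correctly flagged")  # 2
```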

**Exercise.** Define the functions below:
1. A `confusion_matrix` function that outputs the number of *false positives*, *true positives*, *true negatives* and *false negatives*.
2. A `tpr_fpr` function that outputs the *true positive rate* and *false positive rate*.
3. An `accuracy` function that outputs the accuracy.
4. A `precision_recall` function that outputs the *precision* and the *recall*.
5. A `f_beta` function that takes an additional input argument `beta` and returns the corresponding $F_\beta$ score.
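
For reference, writing $TP$, $FP$, $TN$ and $FN$ for the four entries of the confusion matrix, the usual textbook definitions are given below. The notation used in the lectures may differ slightly.

$$\begin{aligned}
\text{TPR} = \text{recall} &= \frac{TP}{TP + FN}, & \text{FPR} &= \frac{FP}{FP + TN},\\
\text{precision} &= \frac{TP}{TP + FP}, & \text{accuracy} &= \frac{TP + TN}{TP + FP + TN + FN},\\
F_\beta &= \frac{(1 + \beta^2)\,\text{precision}\cdot\text{recall}}{\beta^2\,\text{precision} + \text{recall}}.
\end{aligned}$$

Note that the precision is undefined when no example is flagged as positive ($TP + FP = 0$), so an implementation has to decide how to handle that case.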

In [None]:
# Fixed-threshold metrics

def confusion_matrix(scores_negatives, scores_positives, threshold):
    # TODO: Compute and return the confusion matrix

def tpr_fpr(scores_negatives, scores_positives, threshold):
    # TODO: Compute and return the tpr and fpr

def accuracy(scores_negatives, scores_positives, threshold):
    # TODO: Compute and return the accuracy

def precision_recall(scores_negatives, scores_positives, threshold):
    # TODO: Compute and return the precision and recall

def f_beta(scores_negatives, scores_positives, threshold, beta):
    # TODO: Compute and return the f_beta score

### 4.2. Threshold-independent metrics
**Exercise.** Define the function `roc_auc` that:
- Takes as input the `scores_negatives` and `scores_positives` numpy arrays.
- Plots the *ROC curve*.
- Returns the value of the *AUROC* as the area under the *ROC curve*.
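
A useful sanity check for this function: the AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that probability by brute force on made-up scores. The name `pairwise_auroc` is hypothetical, and the approach only suits small arrays because it builds an `n_pos x n_neg` comparison matrix.

```python
import numpy as np

def pairwise_auroc(scores_negatives, scores_positives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = np.asarray(scores_positives)[:, None]   # shape (n_pos, 1)
    neg = np.asarray(scores_negatives)[None, :]   # shape (1, n_neg)
    return float(np.mean(pos > neg) + 0.5 * np.mean(pos == neg))

# Made-up scores, for illustration only
neg = np.array([0.10, 0.40, 0.35])
pos = np.array([0.80, 0.30, 0.90])
print(pairwise_auroc(neg, pos))   # 7 of the 9 pairs are ranked correctly
```

On a random subsample of the real scores, the trapezoidal AUROC should agree with this value, and with `roc_auc_score` imported at the top of the notebook, up to the handling of tied scores.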

In [None]:
def roc_auc(scores_negatives, scores_positives):
    # TODO: Combine scores and create labels
    scores = np.concatenate((scores_negatives, scores_positives))
    labels = ... # TODO Give the label 0 to negative data and the label 1 to positive data

    # Sort scores and labels
    sorted_indices = np.argsort(scores)
    scores = scores[sorted_indices]
    labels = labels[sorted_indices]

    # Initialize TPR and FPR
    tpr = []
    fpr = []
    n_pos = np.sum(labels)
    n_neg = len(labels) - n_pos

    tp = n_pos
    fp = n_neg

    # TODO: loop through all possible thresholds (i.e. all possible scores)
    # and update the number of true positives and false positives for each threshold.
    # Compute the respective tpr and fpr and append them to the tpr and fpr lists.

    # Convert the tpr and fpr lists to numpy arrays
    tpr = np.array(tpr)
    fpr = np.array(fpr)

    # Compute AUROC (Area Under the Curve)
    auroc = ...  # TODO: Compute the AUC using the np.trapz function (np.trapezoid in NumPy 2.0 and later)

    # TODO: Plot ROC curve

    return auroc



## 5. OOD Scores
In this section, we will implement the different OOD scores seen during the lecture. Recall that we can split the different OOD scores into two score families:
1. Logit-based scores.
2. Feature-based scores.

### 5.1. Logit-based scores
Logit-based scores are simpler to implement than feature-based scores. We will implement each of the logit-based scores as a function that takes as input the `logits` array of the different test points and returns the array of test point scores.
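
One practical point when coding these scores is numerical stability: exponentiating large logits overflows. The NumPy sketch below shows the usual log-sum-exp trick, applied to the common definition of the energy, $E = -T \log \sum_i \exp(z_i / T)$. The sign and scaling used in the lecture may differ, and the names `logsumexp` and `energy_sketch` are illustrative. In PyTorch, `torch.logsumexp` provides the same stable computation.

```python
import numpy as np

def logsumexp(z, axis=-1):
    """Numerically stable log(sum(exp(z))) along `axis`."""
    z_max = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(z_max, axis=axis) + np.log(np.sum(np.exp(z - z_max), axis=axis))

def energy_sketch(logits, temp=1.0):
    """Energy E = -T * log(sum_i exp(z_i / T)), one value per row of `logits`."""
    return -temp * logsumexp(logits / temp, axis=-1)

logits = np.array([[1000.0, 1000.0],    # a naive exp() would overflow here
                   [0.0, 0.0]])
print(energy_sketch(logits))            # approximately [-1000.693, -0.693]
```

With this sign convention, confidently classified examples (large logits) get a low energy, which matches the convention that negatives should have lower scores than positives.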

**Exercise.** Complete the functions below with the formulas seen during the lecture.

In [None]:
# MLS Score
def mls(logits):
    # TODO: Compute and return the MLS score

# MSP Score
def msp(logits):
    # TODO: Compute and return the MSP score

# Energy Score
def energy(logits, temp=1):
    # TODO: Compute and return the Energy score

# Entropy Score
def entropy(logits):
    # TODO: Compute and return the Entropy score

### 5.2. DKNN
In this section we define a class `DKNN` to compute the Deep $K$-nearest neighbor score. This score is more involved than the previous ones for two main reasons:
- It employs the activations of the penultimate layer of the CNN rather than the logit or softmax values.
- It requires a fitting dataset in order to compute distances of the test images with respect to the images in the fitting dataset. We will be using the Cifar-10 training set as fitting dataset.

**Exercise.** Complete the following methods in the class `DKNN` below:
1. The `_l2_normalization` method that normalizes a batch of feature vectors by dividing each feature vector by its $\ell_2$ norm.
2. The `compute_scores` method that computes the distance from each of the test points to its $k$-th nearest neighbor in the fit dataset. The distances are computed between the normalized feature representations. The test points are processed in batches to avoid memory issues.
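
To make the score concrete, here is a small NumPy sketch of the $k$-th nearest neighbor distance on toy 2D vectors. It is unbatched and uses NumPy rather than PyTorch, so it is an illustration of the idea rather than the class to be completed. The function name `kth_neighbor_distance` is hypothetical.

```python
import numpy as np

def kth_neighbor_distance(test, fit, k):
    """Distance from each L2-normalized test vector to its k-th nearest fit vector."""
    eps = 1e-10                                     # avoids division by zero
    test = test / (np.linalg.norm(test, axis=1, keepdims=True) + eps)
    fit = fit / (np.linalg.norm(fit, axis=1, keepdims=True) + eps)
    # Pairwise Euclidean distances, shape (n_test, n_fit)
    dists = np.linalg.norm(test[:, None, :] - fit[None, :, :], axis=2)
    return np.sort(dists, axis=1)[:, k - 1]         # k is 1-based

fit = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
test = np.array([[2.0, 0.0]])                       # normalizes to [1, 0]
print(kth_neighbor_distance(test, fit, k=2))        # approximately [1.4142]
```

Here the sorted distances from the test point are roughly 0, $\sqrt{2}$ and 2, so the second nearest neighbor lies at distance $\sqrt{2}$.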

In [None]:
class DKNN:
    def __init__(self, k=50, batch_size=256):
        self.k = k
        self.batch_size = batch_size
        self.fit_features = None

    def _l2_normalization(self, feat):
        norms = ... # TODO: Compute the norm of each feature vector, and add a small constant to it to avoid dividing by zero
        return feat / norms

    def fit(self, fit_dataset):
        self.fit_features = ...  # TODO: Apply the l2 normalization to the fit dataset.

    def compute_scores(self, test_features):
        test_features = ... # TODO: Apply the l2 normalization to the test dataset.
        scores = []

        # Process test features in batches
        for i in range(0, test_features.size(0), self.batch_size):
            batch = test_features[i:i + self.batch_size]
            # Compute pairwise distances for the batch
            distances = torch.cdist(batch, self.fit_features, p=2)  # (batch_size, num_fit_samples)
            # TODO: Sort distances and extract the k-th nearest
            # Append the results to the list of scores.


        # Concatenate scores from all batches
        return torch.cat(scores, dim=0).cpu().numpy()

### 5.3. Mahalanobis
In this section we define a class `Mahalanobis` to compute the Mahalanobis score. Like `DKNN`, this score is more involved than the logit-based ones, for the same two reasons:
- It employs the activations of the penultimate layer of the CNN rather than the logit or softmax values.
- It requires a fitting dataset in order to compute distances of the test images with respect to the images in the fitting dataset. We will be using the Cifar-10 training set as fitting dataset.

**Exercise.** Complete the following methods in the class `Mahalanobis` below:
1. The `fit` method that fits per-class mean vectors and a common covariance matrix to the fitting dataset.
2. The `_mahalanobis_distance` method that computes the Mahalanobis distance of a given vector with respect to the Gaussian distribution parametrized by its mean vector and covariance matrix.
3. The `compute_scores` method that uses the two previous methods to compute the Mahalanobis score of all test points by taking the minimum of the Mahalanobis distances over the set of classes/labels (i.e. the distance to the closest class mean), so that higher scores indicate more anomalous examples.
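
As a reminder of what the Mahalanobis distance measures, the NumPy sketch below evaluates $\sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$ for a toy 2D Gaussian. Some references work with the squared distance instead, so check which variant the lecture uses. The function name `mahalanobis_distance` and the toy values are illustrative.

```python
import numpy as np

def mahalanobis_distance(x, mu, inv_cov):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) for a single vector x."""
    diff = x - mu
    return float(np.sqrt(diff @ inv_cov @ diff))

mu = np.zeros(2)
cov = np.diag([4.0, 1.0])        # standard deviation 2 along the first axis, 1 along the second
inv_cov = np.linalg.pinv(cov)    # pseudo-inverse of the covariance matrix

print(mahalanobis_distance(np.array([2.0, 0.0]), mu, inv_cov))   # approximately 1.0
print(mahalanobis_distance(np.array([0.0, 2.0]), mu, inv_cov))   # approximately 2.0
```

Both points are at Euclidean distance 2 from the mean, but the first one lies along the high-variance axis and is therefore only one standard deviation away.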

In [None]:
class Mahalanobis():
    def __init__(self):
        self.mus = None
        self.inv_cov = None
        self.labels = None

    def fit(self, features, labels):
        self.labels = ...  # TODO: extract the set of unique labels
        self.mus = {}
        covs = {}
        for label in self.labels:
            # TODO: fit the mean vector corresponding to the label
            # and the RESCALED covariance matrix

        cov = ...  # TODO: Compute the common covariance matrix for all labels
        self.inv_cov = ...  # TODO: Compute the (pseudo-)inverse of the covariance matrix

    def _mahalanobis_distance(self, x, mu, inv_cov):
        # TODO: Compute and return the Mahalanobis distance for the given mean and inverse covariance

    def compute_scores(self, test_features):
        scores = []
        for test_feature in test_features:
            distances = ...  # TODO: Compute the vector of per-label Mahalanobis distances
            # TODO: Compute the mahalanobis score of the current test example and append it to the list of scores.
        return torch.stack(scores).cpu().numpy()

## 6. Score Comparison
The objective of this section is to compare the different OOD scores that we have just defined. Note that in order to use the *threshold-dependent metrics*, we need to pick a threshold for each of the scores.

Picking the same threshold for all scores *is not* a proper way to compare the different scores, since they are scaled differently. A common way to perform a fairer comparison is the following:
1. Fix a target TPR, e.g. 0.9.
2. Compute the threshold $\tau$ such that the TPR on the SVHN test dataset is equal to the target TPR 0.9.
3. Compute the remaining fixed-threshold metrics for such $\tau$.

**Exercise.** Define the function `compute_threshold` that:
- Takes as inputs `scores`, a numpy array of scores and a `target_tpr`, a value between 0 and 1 defaulting to 0.95.
- Assuming that the array of `scores` contains the scores of the positive examples, the function computes and returns the value of the threshold $\tau$ that achieves the desired `target_tpr`.
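
One possible way to think about this threshold, sketched under the decision rule $s > \tau \Rightarrow$ positive: a TPR of `target_tpr` means that a fraction `1 - target_tpr` of the positive scores lies at or below $\tau$, which is a quantile of the positive scores. The function name `threshold_for_tpr` and the toy scores are made up for illustration. With few samples or tied scores, the achieved TPR only approximates the target.

```python
import numpy as np

def threshold_for_tpr(scores_positives, target_tpr=0.95):
    """Threshold at or below which a fraction (1 - target_tpr) of the positive scores falls."""
    return float(np.quantile(scores_positives, 1.0 - target_tpr))

scores_positives = np.arange(1.0, 11.0)          # made-up scores 1, 2, ..., 10
tau = threshold_for_tpr(scores_positives, target_tpr=0.9)
tpr = float(np.mean(scores_positives > tau))     # fraction flagged as positive
print(tau, tpr)                                  # tau is close to 1.9, tpr is 0.9
```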

In [None]:
def compute_threshold(scores, target_tpr=0.95):
    # TODO: Compute and return the desired threshold.

In order to compare the different OOD scores that we have defined, we set the variable `target_tpr` equal to 0.9 and we initialise an empty dictionary to store the different metrics for the different OOD scores.

In [None]:
target_tpr = 0.9
metrics_dict = {}

### 6.1. Metrics for logit-based scores

**Exercise.** Next we compute the different evaluation metrics for each of the scores above, starting with the *logit-based scores*:
1. Extract the logits of the Cifar-10 test set and the SVHN test set.
2. For each of the *MLS*, *MSP*, *Energy (T=1)* and *Entropy* OOD score functions:
  - Compute the scores on the Cifar-10 test set and the SVHN test set.
  - Plot the histogram of the scores and check that the negative samples have, on average, lower scores than the positive samples.
  - Use the `roc_auc` function to plot the ROC curves and compute the AUROCs.
  - Compute the threshold that achieves the target TPR of 0.9 and compute the fixed-threshold metrics associated with it: accuracy, TPR, FPR, precision, recall and $F_1$.
  - Store all the metrics in the `metrics_dict` dictionary for future comparison.

In [None]:
# Compute logits directly from the dataset
def compute_logits(dataset, model, device):
    # TODO: Compute and return the logits of the elements in the dataset as a torch tensor.

# Apply the function to CIFAR-10 and SVHN datasets
test_logits_negatives = compute_logits(cifar_test, model, device)
test_logits_positives = compute_logits(svhn_test, model, device)

In [None]:
scoring_functions = {
    'MLS': mls,
    'MSP': msp,
    'Energy': energy,
    'Entropy': entropy
}

for method, scoring_function in scoring_functions.items():

    # TODO: Compute scores
    scores_negatives = ...
    scores_positives = ...

    # TODO: Plot histogram of scores

    # Initialize empty dict for metrics
    metrics_dict[method] = {}

    # TODO: Plot ROC curve and compute AUROC
    auroc = ...
    metrics_dict[method]['auroc'] = auroc

    # TODO: Compute threshold for the given target_tpr
    threshold = ...

    # TODO: Compute and store remaining metrics

### 6.2. Metrics for feature-based scores

**Exercise.** Extract the representations in the feature space given by the penultimate layer of the CNN of the three datasets: Cifar-10 training dataset, Cifar-10 test set and SVHN test set.

In [None]:
# Compute features directly from the dataset
def compute_features(dataset, model, device):
    # TODO: Compute and return the feature representations of the elements in the dataset as a torch tensor.


# TODO: Extract the features of the CIFAR-10 train, test, and SVHN test datasets
train_features = ...
test_features_negatives = ...
test_features_positives = ...

**Exercise.**
1. Compute the *DKNN scores* for the Cifar-10 test set and the SVHN test set using the 5th nearest neighbor.
2. Plot the histogram of the scores and check that the negative samples have, on average, lower scores than the positive samples.
3. Use the `roc_auc` function to plot the ROC curve and compute the AUROC.
4. Compute the threshold that achieves the target TPR of 0.9 and compute the fixed-threshold metrics associated with it: accuracy, TPR, FPR, precision, recall and $F_1$.
5. Store all the metrics in the `metrics_dict` dictionary for future comparison.


In [None]:
metrics_dict['DKNN'] = {}

# TODO: initialize and fit the DKNN model

# TODO: Compute the scores of the negative and positive data from their feature representations

# TODO: Plot the histogram of the scores

# TODO: Plot ROC curve and compute AUROC
auroc = ...
metrics_dict['DKNN']['auroc'] = auroc

# TODO: Compute threshold for the given target_tpr
threshold = ...

# TODO: Compute and store remaining metrics

**Exercise.**
1. Compute the *Mahalanobis scores* for the Cifar-10 test set and the SVHN test set.
2. Plot the histogram of the scores and check that the negative samples have, on average, lower scores than the positive samples.
3. Use the `roc_auc` function to plot the ROC curve and compute the AUROC.
4. Compute the threshold that achieves the target TPR of 0.9 and compute the fixed-threshold metrics associated with it: accuracy, TPR, FPR, precision, recall and $F_1$.
5. Store all the metrics in the `metrics_dict` dictionary for future comparison.

In [None]:
metrics_dict['Mahalanobis'] = {}

# TODO: initialize and fit the Mahalanobis model

# TODO: Compute the scores of the negative and positive data from their feature representations

# TODO: Plot the histogram of the scores

# TODO: Plot ROC curve and compute AUROC
auroc = ...
metrics_dict['Mahalanobis']['auroc'] = auroc

# TODO: Compute threshold for the given target_tpr
threshold = ...

# TODO: Compute and store remaining metrics

## 7. Results Table
**Exercise.** Display the results stored in the dictionary `metrics_dict` as a table, highlighting the method that achieves the best value for each of the different metrics.
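
The selection logic behind such a table can be sketched in plain Python. The metric values below are made up purely to illustrate the logic and are not real results. The key names (`auroc`, `fpr`, `f1`) and the set `lower_is_better` are assumptions about how the metrics were stored.

```python
# Made-up numbers, only to illustrate the selection logic (not real results)
toy_metrics = {
    "score_A": {"auroc": 0.80, "fpr": 0.45, "f1": 0.70},
    "score_B": {"auroc": 0.85, "fpr": 0.50, "f1": 0.75},
}

# Assumption: FPR is the only stored metric for which lower is better
lower_is_better = {"fpr"}

best = {}
for metric in ["auroc", "fpr", "f1"]:
    pick = min if metric in lower_is_better else max
    best[metric] = pick(toy_metrics, key=lambda name: toy_metrics[name][metric])

print(best)   # {'auroc': 'score_B', 'fpr': 'score_A', 'f1': 'score_B'}
```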

In [None]:
# TODO: Display a table with the best results highlighted.
# Careful! The best result is not always the maximum value!

**Bonus Exercises.** If you still have time, you can try and do the following:
1. Play with different temperature parameters in the *Energy* score to see how they affect the different metrics.
2. Play with different $k$ parameters in the *DKNN* algorithm to see how they affect the different metrics.
3. Write docstrings for the above functions (in the future, you will be grateful to your current self if you find yourself checking out this notebook and the docstrings are there).
4. Download a better model (e.g. a pre-trained VGG model fine-tuned on Cifar-10) and check whether you get better results with it.
5. Check out the OODEEL library, where a benchmark like the one we have just carried out is much easier to perform ;)