<a id="title"></a>
# Anomaly Detection using LOF, iForest, and OC-SVM

This notebook assumes you are familiar with basic machine learning vocabulary.

---

## Table of Contents
[Introduction](#intro) <br>
[0. Imports](#imports) <br>
[1. MNIST dataset and scaling](#mnist) <br>
[2. Reduce MNIST using UMAP](#umap) <br>
[3. Local Outlier Factor (LOF)](#lof) <br>
- [3a. Fit, predict, and visualize using training set](#lof_train) <br>
- [3b. Predict test set and visualize boundaries](#lof_test) <br>
- [3c. LOF Distributions](#lof_adv) <br>

[4. Isolation Forest (iForest)](#if) <br>
- [4a. Fit, predict, and visualize using training set](#if_train) <br>
- [4b. Predict test set and visualize boundaries](#if_test) <br>
- [4c. iForest Distributions and Trees](#if_adv) <br>

[5. One Class Support Vector Machine (OC-SVM)](#svm) <br>
- [5a. Fit, predict, and visualize using training set](#svm_train) <br>
- [5b. Predict test set and visualize boundaries](#svm_test) <br>
- [5c. OC-SVM Distributions](#svm_adv) <br>

[6. Conclusions](#con) <br>
[Additional Resources](#add) <br>
[About this Notebook](#about) <br>
[Citations](#cite) <br>

<a id="intro"></a>
## Introduction

Finding anomalies, or outliers, in data is a difficult, yet crucial task for analysis, modeling, and science. Anomaly detection is a subset of machine learning techniques that determines outliers within a data set. We can detect anomalies to better understand the data set's complexity, and the extrema of its distribution. After detection, anomalies can be followed up for further investigation, which may provide insight into the data collection/processing procedures, or cleaned from the dataset to enhance modeling and prediction.

**The purpose of this notebook is to demonstrate Local Outlier Factor (LOF), Isolation Forest (iForest), and One Class Support Vector Machine (OC-SVM) as anomaly detection techniques on the MNIST dataset.**

<a id="imports"></a>
## 0. Imports

We use `numpy` for arrays, `matplotlib` for plotting, `tensorflow` for loading MNIST, and `umap` for reducing MNIST from a high dimensional space to a low dimensional space. In addition, we use `sklearn` for our anomaly detection algorithms.

If you do not have some of the packages, please follow the installation guides:
- [Numpy](https://numpy.org/install/)
- [Matplotlib](https://matplotlib.org/stable/users/installing/index.html)
- Tensorflow ([conda](https://docs.anaconda.com/anaconda/user-guide/tasks/tensorflow/) and [pip](https://www.tensorflow.org/install))
- [UMAP](https://umap-learn.readthedocs.io/en/latest/)
- [Scikit-learn](https://scikit-learn.org/stable/install.html)

In [None]:
from time import time

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

import tensorflow as tf
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.ensemble._iforest import _average_path_length
from sklearn.tree import plot_tree
from sklearn.svm import OneClassSVM
from umap import UMAP

<a id="mnist"></a>
## 1. MNIST dataset and scaling

MNIST is a popular image dataset of handwritten digits from 0 to 9. We use it to showcase the different anomaly detection techniques. Here are some qualities of the dataset:
- 60k training samples
- 10k testing samples
- 10 classifications (digits 0-9)
- 28x28 images
- 8-bit gray scaled (0-255 pixel values)

Why is MNIST such a good dataset to use for learning ML? 
- Relatively small images (784 features)
- Relatively large dataset (70k samples)
- 10 unique well defined labels (all the digits are clearly different from each other)
- Very clean dataset (little noise)
    - Background pixels are 0 and signal pixels are nearly 255, so the pixel values are approximately binary (close to a Bernoulli distribution)
    - The digits are well centered, meaning pixels for similar parts of a digit should consistently be in the same vicinity
    
We retrieve our data using `tensorflow`.

In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

We define some global variables and min-max normalize the images so pixels range between 0 and 1 (normalizing data is a common practice in machine learning). We also flatten our 28x28 images into 784-feature arrays, in which each pixel is a feature.

In [None]:
# Global variables
x_train_size = x_train.shape[0]
x_test_size = x_test.shape[0]
x_length = x_train.shape[1]
norm = x_train.max()
rs = 42

# Scale images
x_train_scale = x_train / norm
x_test_scale = x_test / norm

# Flatten arrays
x_train_scale_flat = x_train_scale.reshape(x_train_size, x_length ** 2)
x_test_scale_flat = x_test_scale.reshape(x_test_size, x_length ** 2)

Here are the first 16 samples in the training set.

In [None]:
fig, axs = plt.subplots(4,4,figsize=[10,10])
for i in range (4):
    for j in range (4):
        axs[i,j].imshow(x_train_scale[i*4+j])
plt.tight_layout()

<a id="umap"></a>
## 2. Reduce MNIST using UMAP

Before detecting anomalies in MNIST, we use dimensionality reduction to reduce our data from a 784D space to a 2D space. In a high dimensional space, the distances between most pairs of samples are too similar (i.e. the curse of dimensionality). Since most anomaly detection techniques use a distance metric for determining anomalies, this quality of our data is a clear weakness. However, dimensionality reduction embeds similar samples near one another, and dissimilar samples away from one another, making the distance metric more viable. Here, we use Uniform Manifold Approximation and Projection (UMAP) as our dimensionality reduction technique for its quality in embedding data, and its speed. For more information on this technique and dimensionality reduction in general, see `scikit_tutorial_mnist_dimred.ipynb`.
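
The distance concentration mentioned above can be illustrated with a small experiment. This is a minimal sketch on uniformly random points (not MNIST); the helper name `relative_contrast` and the sample sizes are assumptions chosen for illustration. It compares how much the largest pairwise distance differs from the smallest in 2 dimensions versus 784 dimensions.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over all pairwise Euclidean distances."""
    rng = random.Random(seed)
    # Uniform random points in the unit hypercube (illustrative data only)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(n_points=100, n_dims=2)
high_dim = relative_contrast(n_points=100, n_dims=784)
print(f'Relative contrast in 2D:   {low_dim:.2f}')
print(f'Relative contrast in 784D: {high_dim:.2f}')
```

The contrast in 784 dimensions should come out far smaller than in 2 dimensions, meaning the nearest and farthest pairs are almost equally far apart, which is why distance-based detectors tend to struggle on raw pixels.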

The two main hyperparameters for UMAP are `n_neighbors` and `min_dist`. The former determines the number of neighbors used for the local approximation, and larger values conserve more global structure. The latter determines how tight the data is in the reduced space, and smaller values conserve more local structure. Tuning these hyperparameters smoothly changes the overall embedding. We reduce to two components and choose the default hyperparameter values.

In [None]:
umap = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=rs)

We fit and transform the scaled MNIST training data using UMAP.

**Note: This should take no longer than a few minutes on a standard Mac laptop**

In [None]:
umap_mnist_train = umap.fit_transform(x_train_scale_flat)

We confirm the size of the reduced data.

In [None]:
umap_mnist_train.shape

We plot the data with labels, where each data point represents one image.

In [None]:
fig, axs = plt.subplots(1,2,figsize=[20,10])
axs[0].grid()
axs[0].set_title('UMAP MNIST (Train; Unlabeled)')
axs[0].scatter(umap_mnist_train[:, 0], umap_mnist_train[:, 1], s=1, alpha=0.5)
axs[0].set_xlabel('UMAP1')
axs[0].set_ylabel('UMAP2')

axs[1].grid()
axs[1].set_title('UMAP MNIST (Train; Labeled)')
for i in range (10):
    mask = y_train == i
    axs[1].scatter(umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
                   s=1, alpha=0.5, label=i, color='C{}'.format(i))
axs[1].set_xlabel('UMAP1')
axs[1].set_ylabel('UMAP2')
axs[1].legend()

UMAP does an excellent job reducing MNIST to separate digits with little noise. 

We also transform the test set to UMAP space, check its size, and plot.

In [None]:
umap_mnist_test = umap.transform(x_test_scale_flat)

In [None]:
umap_mnist_test.shape

In [None]:
fig, axs = plt.subplots(1,2,figsize=[20,10])
axs[0].grid()
axs[0].set_title('UMAP MNIST (Test; Unlabeled)')
axs[0].scatter(umap_mnist_test[:, 0], umap_mnist_test[:, 1], s=1, alpha=0.5)
axs[0].set_xlabel('UMAP1')
axs[0].set_ylabel('UMAP2')

axs[1].grid()
axs[1].set_title('UMAP MNIST (Test; Labeled)')
for i in range (10):
    mask = y_test == i
    axs[1].scatter(umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
                   s=1, alpha=0.5, label=i, color='C{}'.format(i))
axs[1].set_xlabel('UMAP1')
axs[1].set_ylabel('UMAP2')
axs[1].legend()

Since the test set samples fall directly on the training set samples, we confirm UMAP has generalized to new data. Now with the UMAP MNIST data, we can detect anomalies as if the data were unlabeled.

<a id="lof"></a>
## 3. Local Outlier Factor (LOF)

Local outlier factor [(LOF; detailed explanation with equations can be found under "Formal")](https://en.wikipedia.org/wiki/Local_outlier_factor) is a density-based anomaly detection algorithm that identifies outliers with respect to local data instead of global data. The objective of LOF is to find samples in low density regions with respect to the samples' k-nearest neighbors. First, we find the reachability distance from a sample to each of its k-nearest neighbors (similar to DBSCAN). Then, we find the sample's local reachability density, which is the inverse of the mean reachability distance to its k-nearest neighbors. Lastly, we find the sample's local outlier factor by taking the ratio of the arithmetic mean of its k-nearest neighbors' local reachability densities to its own local reachability density. After finding each sample's local outlier factor, a threshold is applied to determine which samples are outliers. Inliers have values close to 1, while outliers have values substantially larger than 1.
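
The three steps above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not scikit-learn's implementation: the function name `local_outlier_factors` and the toy data are hypothetical, each neighborhood contains exactly `k` points (ties at the k-distance are cut off), and the brute-force neighbor search is only suitable for tiny data sets.

```python
import math


def local_outlier_factors(points, k):
    """Return the LOF of every point using brute-force neighbor search."""
    n = len(points)

    # k nearest neighbors (excluding the point itself) and the k-distance
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbors.append(order[:k])
        k_distance.append(math.dist(points[i], points[order[k - 1]]))

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(a, b) = max(k_distance(b), dist(a, b))
    lrd = []
    for i in range(n):
        reach = [
            max(k_distance[j], math.dist(points[i], points[j]))
            for j in neighbors[i]
        ]
        lrd.append(1.0 / (sum(reach) / k))

    # LOF: mean density of the neighbors divided by the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / k / lrd[i] for i in range(n)]


# Toy data: a 3x3 grid of inliers plus one distant point
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
toy_points = grid + [(10.0, 10.0)]
toy_lof = local_outlier_factors(toy_points, k=3)
for point, value in zip(toy_points, toy_lof):
    print(point, f'{value:.2f}')
```

On this toy data the grid points should score close to 1, while the distant point scores far above the 1.5 threshold used later in this notebook.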

LOF is relatively fast, and effective at modeling imbalanced data and data of varying densities. However, a naive implementation of LOF scales quadratically with the number of samples, and its distances become less informative as the dimensionality grows. In addition, the same threshold may not work on different data sets. Domain knowledge and analyzing how different thresholds detect outliers are necessary for optimal results. Many improvements to LOF have been developed, but are beyond the scope of this tutorial.

Here are some complementary resources:

- [LOF Overview Video (5 minutes, simple language)](https://www.youtube.com/watch?v=Ymvq6JHjoBY)
- [LOF Step by Step Video (6 minutes, simple language)](https://www.youtube.com/watch?v=7L23sCOZjns)
- [LOF Detailed Video (16 minutes, intermediate language)](https://www.youtube.com/watch?v=9-PHBzI_rDk)
- [LOF Original Paper](https://dl.acm.org/doi/pdf/10.1145/335191.335388)

<a id="lof_train"></a>
### 3a. Fit, predict, and visualize using training set

We use [scikit-learn for LOF](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html). The main parameter for LOF is `n_neighbors`, which defines the number of nearest neighbors to consider for outlier detection. We use the `LocalOutlierFactor` default values, which are 20 nearest neighbors, and an offset (i.e. outlier score threshold) of 1.5. Because scikit-learn works with the negative outlier factor, this threshold is stored as `offset_ = -1.5`.

In [None]:
lof = LocalOutlierFactor(n_neighbors=20)

As a baseline, we fit and predict outliers using the pixels as features. 

**Note: this may take a couple of minutes to execute.**

In [None]:
lof_mnist_train = lof.fit_predict(x_train_scale_flat)

Let's check the shape to make sure we have a vector of predictions.

In [None]:
lof_mnist_train.shape

Now, we plot the UMAP training set with the training labels and detected outliers from pixel features.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; LOF (Pixels); {lof.n_neighbors} Neighbors)')
mask = lof_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

Using the scaled pixels, LOF biased a lot of outliers towards the 1's (orange cluster). However, a good portion of outliers were on the borders of the other clusters.

We plot the 10 most anomalous train samples.

In [None]:
inds = np.argsort(lof.negative_outlier_factor_)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        lof_nof = lof.negative_outlier_factor_[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nNegative Outlier Factor: {lof_nof:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

A few slanted 8s and 2s were detected along with a 1 and 7.

Using a more efficient data representation will yield better results. Thus, we fit and predict anomalies using the UMAP MNIST data.

In [None]:
lof = LocalOutlierFactor(n_neighbors=20)

t0_lof = time()
lof_mnist_train = lof.fit_predict(umap_mnist_train)
t1_lof = time()

t_lof = t1_lof - t0_lof
print (f'Time spent fitting model: {t_lof:.4f} seconds')

LOF fit exceptionally fast. Now, we plot the UMAP training set with the training labels and detected outliers from the UMAP embedding.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; LOF; {lof.n_neighbors} Neighbors)')
mask = lof_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

Nearly all of the anomalies are on the borders of the clusters, which matches our intuition. 

We plot the 10 most anomalous train samples.

In [None]:
inds = np.argsort(lof.negative_outlier_factor_)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        lof_nof = lof.negative_outlier_factor_[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nNegative Outlier Factor: {lof_nof:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

A few 2s with a slanted base were detected along with two 8s with missing pixels.

Next, we fit using 100 nearest neighbors, and produce similar plots as above. We also set the parameter `novelty=True` to predict new samples.

In [None]:
# Fit using 100 nearest neighbors (fit_predict is not available when novelty=True)
lof = LocalOutlierFactor(n_neighbors=100, novelty=True)
lof.fit(umap_mnist_train)
lof_mnist_train_nof = lof.negative_outlier_factor_
lof_mnist_train = (lof_mnist_train_nof > lof.offset_).astype(int)
lof_mnist_train[lof_mnist_train==0] = -1

# Plot
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; LOF; {lof.n_neighbors} Neighbors)')
mask = lof_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

Now all of the anomalies are on the borders of the clusters. 

Again, we plot the 10 most anomalous train samples.

In [None]:
inds = np.argsort(lof_mnist_train_nof)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        lof_nof = lof_mnist_train_nof[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nNegative Outlier Factor: {lof_nof:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

Several uncharacteristic 7s were detected along with a 0 and an 8.

Finally, we print the number of anomalies detected.

In [None]:
print (f'Number of Anomalies Detected: {(lof_mnist_train == -1).sum()}')

Fewer than half a percent of the samples were found to be anomalies.

<a id="lof_test"></a>
### 3b. Predict test set and visualize boundaries

Now that LOF is trained, we predict outliers in the test set.

In [None]:
lof_mnist_test_nof = lof.score_samples(umap_mnist_test)
lof_mnist_test = (lof_mnist_test_nof > lof.offset_).astype(int)
lof_mnist_test[lof_mnist_test==0] = -1

We plot the test set with the test labels and the detected outliers.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Test; LOF; {lof.n_neighbors} Neighbors)')
mask = lof_mnist_test == -1
plt.scatter(
    umap_mnist_test[:, 0][~mask], umap_mnist_test[:, 1][~mask], 
    s=1, alpha=0.5, c=y_test[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

LOF generalizes to the test set as well.

We also produce a similar plot of the detected outliers with rings proportional to the respective outlier factors (i.e. larger rings are more anomalous).

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Test; LOF; {lof.n_neighbors} Neighbors)')
mask = lof_mnist_test == -1
plt.scatter(
    umap_mnist_test[:, 0][~mask], umap_mnist_test[:, 1][~mask], 
    s=1, alpha=0.5, c=y_test[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=1, alpha=1, label=f'Anomalies', color='k'
)
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=-lof_mnist_test_nof[mask]*100, facecolor='none', edgecolor='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

Outliers near clusters have smaller rings, and outliers far from clusters have larger rings.

Similarly, we plot the 10 most anomalous test samples.

In [None]:
inds = np.argsort(lof_mnist_test_nof)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        lof_nof = lof_mnist_test_nof[ind]
        axs[i, j].set_title(f'Label: {y_test[ind]}\nNegative Outlier Factor: {lof_nof:.3f}')
        axs[i, j].imshow(x_test_scale[ind])
plt.tight_layout()

We detect several digits that are out of distribution.

To visualize the outlier boundaries, we can predict the outlier scores on a grid of points in our manifold. We define a function that creates a grid of points based on our features, plots the labeled and anomalous data, and plots the outlier detection grid with high transparency to act as a background.

In [None]:
def plot_outlier_boundaries(X, Y, Y_pred, model, name, num=200):
    """Plot outlier boundaries.

    Parameters
    ----------
    X : np.array
        Features as a 2D array.
    Y : np.array
        Labels as a 1D array.
    Y_pred : np.array
        Predictions as a 1D array.
    model : sklearn.model
        Anomaly detection model to predict outliers.
    name : str
        Name for the plot. Allowed values are 'LOF', 'iForest', and 'OC-SVM'.
    num : int, default=200
        Number of x/y coordinates. The number of grid points is num**2.
        
    Returns
    -------
    Z_grid : np.array
        Outlier labels for grid of points as a 2D array (num, num).
    """
    # Make grid points
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xs = np.linspace(x_min, x_max, num)
    ys = np.linspace(y_min, y_max, num)
    x_grid, y_grid = np.meshgrid(xs, ys)
    grid_coords = np.c_[x_grid.ravel(), y_grid.ravel()].astype(np.float32)

    # Predict outlier labels of grid points
    if name in ['LOF', 'iForest', 'OC-SVM']:
        Z = model.predict(grid_coords)
    else:
        raise ValueError(f'`name` is {name}. Use "LOF", "iForest", or "OC-SVM".')

    # Plot
    plt.figure(figsize=[10,10])
    plt.title(f'{name} Outlier Boundaries')
    for i in range (Y.max() + 1):
        # Data points
        mask_xy = Y == i
        plt.scatter(
            X[:, 0][mask_xy], X[:, 1][mask_xy], 
            color=f'C{i}', s=1, alpha=0.5, label=i
        )
    # Outliers
    mask_z = Z == -1
    plt.scatter(
        grid_coords[:, 0][mask_z], grid_coords[:, 1][mask_z], 
        color='k', alpha=.025, marker='s'
    )
    mask_xy = Y_pred == -1
    plt.scatter(
        X[:, 0][mask_xy], X[:, 1][mask_xy], 
        color='k', s=10, alpha=1, label=f'Outliers'
    )
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xlabel('UMAP1')
    plt.ylabel('UMAP2')
    plt.legend(loc='lower right')
    plt.show()

    Z_grid = Z.reshape(x_grid.shape)
    return Z_grid

In [None]:
Z_grid_lof = plot_outlier_boundaries(umap_mnist_train, y_train, lof_mnist_train, lof, 'LOF')

The LOF outlier boundaries are relatively smooth around each cluster. Any point outside of a cluster is detected as an anomaly.

<a id="lof_adv"></a>
### 3c. LOF Distributions

As previously mentioned, LOF assigns an outlier score to each sample, and detects an anomaly as having a score greater than 1.5 (or a negative outlier factor less than -1.5). We plot the distributions of the outlier scores for the training and test sets.

In [None]:
plt.grid()
plt.title('LOF Outlier Score Histogram')
plt.hist(lof_mnist_train_nof, bins=200, alpha=0.5, label='train')
plt.hist(lof_mnist_test_nof, bins=200, alpha=0.5, label='test')
plt.vlines(lof.offset_, 0, 10**4, color='k', label='offset')
plt.yscale('log')
plt.legend()

The outlier scores are exponentially distributed with rarer samples having extreme negative scores.
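
The thresholding rule behind the offset line in the histogram can be sketched as follows. The scores here are hypothetical values made up for illustration, and the rule (inlier when the negative outlier factor is at or above the offset) is our reading of how scikit-learn applies `offset_`; treat it as a sketch rather than a reference.

```python
# Hypothetical negative outlier factors (the opposite of the LOF)
negative_outlier_factors = [-0.98, -1.02, -1.20, -1.49, -1.51, -3.20]
offset = -1.5  # default LOF threshold, expressed on the negated scale

# +1 marks an inlier, -1 marks an outlier
labels = [1 if score >= offset else -1 for score in negative_outlier_factors]
print(labels)  # [1, 1, 1, 1, -1, -1]
```

Only the two samples whose LOF exceeds 1.5 (negative outlier factor below -1.5) are labeled as outliers.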

In addition, we can calculate the outlier score across the manifold using LOF. Samples that are more likely to be observed have a negative outlier factor close to -1 (i.e. a LOF close to 1). By visualizing the outlier score distribution on the manifold, we can determine where scores are similar. We define a function to plot contour lines showing the outlier score of samples across the UMAP manifold.

In [None]:
def plot_outlier_contour(X, Y, Y_pred, model, name, num=200, levels=[], tree=None):
    """Plot outlier score contour lines.

    Parameters
    ----------
    X : np.array
        Features as a 2D array.
    Y : np.array
        Labels as a 1D array.
    Y_pred : np.array
        Predictions as a 1D array.
    model : sklearn.model
        Anomaly detection model to predict outliers.
    name : str
        Name for the plot. Allowed values are 'LOF', 'iForest', and 'OC-SVM'.
    num : int, default=200
        Number of x/y coordinates. The number of grid points is num**2.
    levels : array-like, default=[]
        The levels for the contour lines.
    tree : int, default=None
        If the model is an isolation forest, choose a tree to visualize.
        
    Returns
    -------
    Z_grid_scores : np.array
        Outlier scores for grid of points as a 2D array (num, num).
    """
    # Make grid points
    length = Y.max() + 1
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xs = np.linspace(x_min, x_max, num)
    ys = np.linspace(y_min, y_max, num)
    x_grid, y_grid = np.meshgrid(xs, ys)
    grid_coords = np.c_[x_grid.ravel(), y_grid.ravel()].astype(np.float32)

    # Raise error if incorrect name
    if name not in ['LOF', 'iForest', 'OC-SVM']:
        raise ValueError(f'`name` is {name}. Use "LOF", "iForest", or "OC-SVM".')

    # Plot
    plt.figure(figsize=[12,10])
    plt.grid()
    tree_str = ''
    if name == 'iForest' and isinstance(tree, int):
        tree_str = f'(Tree {tree})'
    plt.title(f'{name} Contours {tree_str}')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xlabel('UMAP1')
    plt.ylabel('UMAP2')

    # Plot data points
    for i in range (length):
        mask_xy = Y == i
        plt.scatter(
            X[:, 0][mask_xy], X[:, 1][mask_xy], 
            s=1, alpha=0.5, label=i
        )
    if name == 'iForest' and isinstance(tree, int):
        Y_pred = Y_pred[tree]
    mask_xy = Y_pred == -1
    plt.scatter(
        X[:, 0][mask_xy], X[:, 1][mask_xy], 
        color='k', s=10, alpha=1, label=f'Outliers'
    )
    
    # Calculate scores and plot contours
    if name == 'iForest' and isinstance(tree, int):
        Z_grid_scores = if_score_samples_tree(grid_coords, model)[tree].reshape(x_grid.shape)
    else:
        Z_grid_scores = model.score_samples(grid_coords).reshape(x_grid.shape)
    plt.contour(x_grid, y_grid, Z_grid_scores, levels=levels)
        
    plt.colorbar()
    plt.legend(loc='lower right')
    plt.show()

    return Z_grid_scores

In [None]:
Z_grid_lof_scores = plot_outlier_contour(umap_mnist_test, y_test, lof_mnist_test, lof, 'LOF', levels=np.linspace(-10,0,5))

The darker contour lines represent less likely regions, and brighter contour lines represent more likely regions. The outlier score contour map produced from LOF is smooth with intuitive local maxima, such as in the centers of the clusters. As expected, the farther a sample is from the distribution, the higher the outlier score.

Generally, LOF performed well for anomaly detection. The next algorithm uses an ensemble of decision trees to detect anomalies, which has some advantages over this density-based algorithm.

<a id="if"></a>
## 4. Isolation Forest (iForest)

An isolation forest [(iForest)](https://en.wikipedia.org/wiki/Isolation_forest) is an ensemble tree-based anomaly detection algorithm that identifies outliers as shallow leaves on decision trees. iForest is an ensemble of decision trees, each trained on a random subset of data and a random subset of features. The objective of iForest is to randomly partition the data, and generate various binary search trees. Trees are split until all samples are isolated, or until a max tree depth is reached. Samples that travel deeper into a tree (i.e. require more partitions) are more normal, whereas samples that stop at shallow depths (i.e. require fewer partitions) are more anomalous. With a single tree, we calculate the depth at which a sample is isolated. With the forest, we calculate the average depth of the sample, and normalize by the expected depth for a tree built on the same number of samples. We then raise 2 to the power of the negative of that quotient, which defines the anomaly score. Similar to LOF, a threshold is applied to determine which samples are outliers. Inliers have scores well below 0.5, while outliers have scores close to 1.
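
As a worked example, the anomaly score from the original iForest paper is $s(x, n) = 2^{-E[h(x)] / c(n)}$, where $E[h(x)]$ is the sample's average depth across trees and $c(n)$ is the expected depth for a tree built on $n$ samples. The sketch below is a minimal illustration; the function names are our own, and the harmonic number is approximated with the Euler-Mascheroni constant as in the paper. Note that scikit-learn's `score_samples` returns the negative of this score.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected depth c(n) of an unsuccessful search in a binary search tree."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """Anomaly score in (0, 1]; higher means more anomalous."""
    return 2.0 ** (-mean_depth / average_path_length(n))


n = 256  # samples per tree
c = average_path_length(n)
print(f'c({n}) = {c:.2f}')
print(f'Isolated after 1 split:      {anomaly_score(1, n):.3f}')
print(f'Isolated at the typical c(n): {anomaly_score(c, n):.3f}')
print(f'Isolated at twice c(n):       {anomaly_score(2 * c, n):.3f}')
```

A sample isolated at exactly the expected depth scores 0.5, which is why 0.5 is the natural default threshold.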

iForests are extremely fast, and can efficiently span large training sets and feature spaces. However, a priori knowledge of the percentage of outliers present in the data set is necessary for choosing an optimal threshold. In addition, since iForest is an ensemble of many weak decision trees, the learned decision boundaries tend to be rougher and more rigid. An improved algorithm called [extended isolation forest](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8888179) has been developed with [code](https://github.com/sahandha/eif) from the first author available, but we leave exploring that algorithm as an exercise for the reader.
 
Here are some complementary resources:

- [iForest Overview Video (5 minutes, simple language)](https://www.youtube.com/watch?v=Y1x51i1936M)
- [Six Sigma Pro Smart Video (11 minutes, simple language)](https://www.youtube.com/watch?v=kN--TRv1UDY)
- [PyData iForest Walkthrough Video (24 minutes, intermediate language)](https://www.youtube.com/watch?v=RyFQXQf4w4w)
- [iForest Original Paper](https://www.outspokenmarket.com/uploads/8/8/2/3/88233040/isolation_forest.pdf)
- [Extended iForest Code from Maintained Repository](https://github.com/h2oai/h2o-3/blob/master/h2o-py/h2o/estimators/extended_isolation_forest.py)

<a id="if_train"></a>
### 4a. Fit, predict, and visualize using training set

We use [scikit-learn for iForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html). The main parameters for iForest are `n_estimators`, which is the number of trees to fit, `max_samples`, which is the number of samples used to train each tree, and `max_features`, which is the number (or fraction) of features used to train each tree. We start with all the default values: 100, 256, and every feature (indicated by 1.0), respectively. We also use the default offset (i.e. outlier score threshold) of 0.5. Because scikit-learn negates the score, this threshold is stored as `offset_ = -0.5`.

In [None]:
iso_forest = IsolationForest(n_estimators=100, max_samples=256, max_features=1.0, random_state=rs)

Again as a baseline, we fit and predict outliers using the pixels as features.

In [None]:
if_mnist_train = iso_forest.fit_predict(x_train_scale_flat)

iForest is significantly faster at fitting the full set of features in MNIST than LOF. Let's check the shape to make sure we have a vector of predictions.

In [None]:
if_mnist_train.shape

Now, we plot the UMAP training set with the training labels and detected outliers from pixel features.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; iForest (Pixels))\n{iso_forest.n_estimators} Estimators; {iso_forest.max_samples} Samples')
mask = if_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

When using pixels as features, iForest performs poorly on the manifold, labeling most 0s (blue) and 2s (green) as outliers. However, the outliers for 4s (purple), 7s (gray), and 9s (cyan) are mostly on the borders of the clusters. Lastly, only a handful of 1s (orange) were detected as outliers.

We plot the 10 most anomalous train samples.

In [None]:
if_negative_outlier_scores = iso_forest.score_samples(x_train_scale_flat)
inds = np.argsort(if_negative_outlier_scores)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        if_nos = if_negative_outlier_scores[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nNegative Outlier Score: {if_nos:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

When using the pixels as features, iForest detects thick 0s and 8s as outliers, and one 4 whose pixels are binary (i.e. only pixel values of 0 and 1).

Although iForest detected a particular subset of outliers, we fit and predict anomalies using the UMAP MNIST data for better results.

In [None]:
iso_forest = IsolationForest(n_estimators=100, max_samples=256, max_features=1.0, random_state=rs)

t0_if = time()
if_mnist_train = iso_forest.fit_predict(umap_mnist_train)
t1_if = time()

t_if = t1_if - t0_if
print (f'Time spent fitting model: {t_if:.4f} seconds')

Now, we plot the UMAP training set with the training labels and detected outliers from the UMAP embedding.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; iForest)\n{iso_forest.n_estimators} Estimators; {iso_forest.max_samples} Samples')
mask = if_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

When using UMAP embedding as features, iForest performs even worse on the manifold. The iForest labeled the 3s (red), 5s (brown) and 8s (yellow) as the central cluster, while most 0s (blue), 1s (orange), and 6s (pink) were labeled as outliers. Similarly, the outliers for 4s (purple), 7s (gray), and 9s (cyan) are mostly on the borders of the clusters. 

We plot the 10 most anomalous train samples.

In [None]:
if_negative_outlier_scores = iso_forest.score_samples(umap_mnist_train)
inds = np.argsort(if_negative_outlier_scores)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        if_nos = if_negative_outlier_scores[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nNegative Outlier Score: {if_nos:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

When using the UMAP embedding as features, the iForest trivially detects slanted 1s as outliers.

The poor fit is primarily caused by the number of samples used to fit each tree. By using 256 random samples to fit each of 100 random trees, even if each tree used a unique set of samples, we would use at most 25,600 samples, which is less than half of our data. Here, we fit the iForest using 20000 samples per tree so there is near 100% certainty every sample is used. In addition, we set `contamination=0.05` to adjust the offset such that only 5% of samples are detected as outliers.
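
The coverage argument above can be checked with a few lines of arithmetic. This sketch assumes each tree draws its subsample independently and without replacement (scikit-learn's behavior when `bootstrap=False`, the default); the helper name `prob_never_drawn` is our own.

```python
n_samples = 60_000  # MNIST training set size
n_trees = 100

# Upper bound on distinct samples seen with the default max_samples=256
max_distinct_default = 256 * n_trees
print(f'At most {max_distinct_default / n_samples:.1%} of the data is used')


def prob_never_drawn(max_samples):
    """Probability that one given sample is missing from every tree."""
    return (1 - max_samples / n_samples) ** n_trees


print(f'P(sample unused), max_samples=256:   {prob_never_drawn(256):.3f}')
print(f'P(sample unused), max_samples=20000: {prob_never_drawn(20_000):.2e}')
```

With 256 samples per tree, a given sample is more likely than not to be missing from every tree; with 20000 samples per tree, that probability becomes negligible.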

In [None]:
# Fit using optimal parameters
iso_forest = IsolationForest(n_estimators=100, max_samples=20000, max_features=1.0, contamination=0.05, random_state=rs)
if_mnist_train = iso_forest.fit_predict(umap_mnist_train)

# Plot
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; iForest)\n{iso_forest.n_estimators} Estimators; {iso_forest.max_samples} Samples')
mask = if_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

All of the outliers are on the borders of each cluster, indicating a good fit. 

Again, we plot the 10 most anomalous train samples.

In [None]:
if_mnist_train_nos = iso_forest.score_samples(umap_mnist_train)
inds = np.argsort(if_mnist_train_nos)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        if_nos = if_mnist_train_nos[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nNegative Outlier Score: {if_nos:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

Several uncharacteristic 2s were detected in addition to some other out of distribution digits. Contrasting with LOF, this subset of outliers is slightly more diverse by class.

Finally, we print the number of anomalies detected.

In [None]:
print (f'Number of Anomalies Detected: {(if_mnist_train == -1).sum()}')

As expected, 5% of samples were found to be anomalies.
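To see why the detected fraction matches `contamination`, the following minimal sketch fits a small iForest on synthetic 2D data (an assumption standing in for the UMAP embedding; the `*_demo` names are hypothetical). It compares `offset_` with the corresponding percentile of the training scores, which is how scikit-learn appears to place the threshold when `contamination` is numeric.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the 2D UMAP embedding (not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))

contamination = 0.05
forest_demo = IsolationForest(n_estimators=50, contamination=contamination, random_state=0)
labels_demo = forest_demo.fit_predict(X_demo)
scores_demo = forest_demo.score_samples(X_demo)

# The offset should sit at the `contamination` percentile of the training scores
percentile_demo = np.percentile(scores_demo, 100 * contamination)
frac_flagged = (labels_demo == -1).mean()
# Samples are flagged when their negative outlier score falls below the offset
flagged_below_offset = np.array_equal(labels_demo == -1, scores_demo < forest_demo.offset_)

print(f'offset_: {forest_demo.offset_:.4f}; percentile of scores: {percentile_demo:.4f}')
print(f'Fraction flagged: {frac_flagged:.3f}')
print(f'Flagged samples are exactly those below the offset: {flagged_below_offset}')
```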

<a id="if_test"></a>
### 4b. Predict test set and visualize boundaries

Now that the iForest is trained, we predict outliers in the test set and their scores.

In [None]:
if_mnist_test = iso_forest.predict(umap_mnist_test)
if_mnist_test_nos = iso_forest.score_samples(umap_mnist_test)

We plot the test set with the test labels and the detected outliers.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Test; iForest)\n{iso_forest.n_estimators} Estimators; {iso_forest.max_samples} Samples')
mask = if_mnist_test == -1
plt.scatter(
    umap_mnist_test[:, 0][~mask], umap_mnist_test[:, 1][~mask], 
    s=1, alpha=0.5, c=y_test[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

iForest generalizes to the test set as well.

We also produce a similar plot of the detected outliers with rings proportional to the respective outlier factors (i.e. larger rings are more anomalous).

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Test; iForest)\n{iso_forest.n_estimators} Estimators; {iso_forest.max_samples} Samples')
mask = if_mnist_test == -1
plt.scatter(
    umap_mnist_test[:, 0][~mask], umap_mnist_test[:, 1][~mask], 
    s=1, alpha=0.5, c=y_test[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=1, alpha=1, label=f'Anomalies', color='k'
)
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=-if_mnist_test_nos[mask]*100, facecolor='none', edgecolor='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

Most of the outliers have a similar score. There also appears to be a slight bias for anomalies on the borders of the 0s (blue), 1s (orange), and 2s (green) clusters.

Similarly, we plot the 10 most anomalous test samples.

In [None]:
if_mnist_test_nos = iso_forest.score_samples(umap_mnist_test)
inds = np.argsort(if_mnist_test_nos)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        if_nos = if_mnist_test_nos[ind]
        axs[i, j].set_title(f'Label: {y_test[ind]}\nNegative Outlier Score: {if_nos:.3f}')
        axs[i, j].imshow(x_test_scale[ind])
plt.tight_layout()

We detect a few 0s, 1s, and 2s with out of distribution properties.

To visualize the outlier boundaries, we predict the outlier scores on a grid of points in our manifold.

In [None]:
Z_grid_if = plot_outlier_boundaries(umap_mnist_train, y_train, if_mnist_train, iso_forest, 'iForest')

Similarly to LOF, the iForest outlier boundaries surround the clusters.

<a id="if_adv"></a>
### 4c. iForest Distributions and Trees

Similar to LOF, iForest assigns an outlier score to each sample. Because scikit-learn reports the *negative* outlier score, a sample is detected as an anomaly when its score is less than the offset. We plot the distributions of the outlier scores for the training and test sets.

In [None]:
plt.grid()
plt.title('iForest Outlier Score Histogram')
plt.hist(if_mnist_train_nos, bins=200, alpha=0.5, label='train')
plt.hist(if_mnist_test_nos, bins=200, alpha=0.5, label='test')
plt.vlines(iso_forest.offset_, 0, 10**3, color='k', label='offset')
plt.yscale('log')
plt.legend()

The outlier scores are skewed towards the left with rarer samples having extreme negative scores.

In addition, we can calculate the outlier score of the manifold using the iForest. Samples that are more likely to be observed have an outlier score closer to 0. By visualizing the outlier score distribution on the manifold, we can determine where scores are similar.

In [None]:
Z_grid_if_scores = plot_outlier_contour(umap_mnist_test, y_test, if_mnist_test, iso_forest, 'iForest', levels=np.linspace(-0.7,-0.4,5))

The darker contour lines represent less likely regions, and brighter contour lines represent more likely regions. The outlier score contour map produced from the iForest is much rougher than LOF. However, they both share similar properties (e.g. local maxima in the clusters' centers, higher outlier scores for more isolated samples, etc.).

Since the iForest is an ensemble of decision trees, we can analyze each tree to gain further insight into the trained model. The attribute `estimators_features_` returns the features that were used to train each tree. Since `max_features=1.0` uses all available features, each tree used both UMAP dimensions. If the following cell returns only `True`, then all trees use both UMAP dimensions as features.

In [None]:
np.unique(np.array(iso_forest.estimators_features_) == np.arange(2))

The attribute `estimators_samples_` returns the unique set of sample indices that were used to train each tree. If the following execution is `True`, then all samples were used at least once.

In [None]:
np.unique(iso_forest.estimators_samples_).shape[0] == x_train_size

We can mean stack the set of samples used by each tree to determine if each set has similar image properties.

In [None]:
fig, axs = plt.subplots(10,10,figsize=[20,20])
for i in range (10):
    for j in range (10):
        ind = i*10+j
        if_ind = iso_forest.estimators_samples_[ind]
        mean_tree = x_train_scale[if_ind].mean(0)
        axs[i,j].set_title(ind)
        axs[i,j].imshow(mean_tree)
fig.tight_layout()

All mean stacks look nearly identical, indicating each set of samples has similar image properties.

We can also access each tree individually using the attribute `estimators_` to further understand how they detect outliers. First, we investigate each tree's feature importances.

In [None]:
if_feat_imp = np.array([i.feature_importances_ for i in iso_forest.estimators_])

plt.figure(figsize=[5,5])
plt.grid()
plt.title(f'iForest Feature Importance (UMAP1)')
plt.plot(if_feat_imp[:, 0])
plt.xlabel('Tree Number')
plt.ylabel('Feature Importance')
print (f'Feature importances of UMAP1 and UMAP2: {if_feat_imp.mean(0)}')

The feature importances of UMAP1 and UMAP2 are essentially equal. However, they do oscillate about 0.5, meaning the trees are capturing different qualities of MNIST.

We can also compute the negative outlier score from each tree for a sample. This metric gives us more insight on how each tree contributes to the forest.

In [None]:
def if_score_samples_tree(X, iso_forest, subsample_features=False):
    """Compute the negative outlier score from each tree.

    This code was modified from the following function:
    https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/ensemble/_iforest.py#L574

    which computes the negative outlier score for the entire isolation forest.
    Instead of using the summed depths of the forest scaled by the number of trees in the forest
    to calculate the score, we calculate each depth individually and scale by unity (i.e. one tree).

    Parameters
    ----------
    X : np.array
        Features as a 2D array.
    iso_forest : sklearn.model
        Isolation forest model.
    subsample_features : bool, default=False
        If True, use a subsample of the features.
        
    Returns
    -------
    scores : np.array
        Negative outlier scores as a 2D array (trees, samples).
    """
    # Prepare variables before loop
    n_samples = X.shape[0]
    depths = np.zeros((iso_forest.n_estimators, n_samples))
    average_path_length_max_samples = _average_path_length([iso_forest._max_samples])
    
    # Find the depth of each sample for each tree
    for tree_idx, (tree, features) in enumerate(
        zip(iso_forest.estimators_, iso_forest.estimators_features_)
    ):
        # Use subset of feature if necessary
        X_subset = X[:, features] if subsample_features else X
        # Find the index of the leaf each sample is predicted as
        leaves_index = tree.apply(X_subset, check_input=False)
        # Calculate the depth
        depths[tree_idx] = (
            iso_forest._decision_path_lengths[tree_idx][leaves_index]
            + iso_forest._average_path_length_per_tree[tree_idx][leaves_index]
            - 1.0
        )
    # Scale the depth by the average path length of the maximum samples
    denominator = average_path_length_max_samples
    # Calculate the negative outlier scores
    scores = -2 ** (
        -np.divide(
            depths, denominator, out=np.ones_like(depths), where=denominator != 0
        )
    )
    return scores

In [None]:
if_mnist_test_nos_tree = if_score_samples_tree(umap_mnist_test, iso_forest)
if_mnist_test_tree = (if_mnist_test_nos_tree > iso_forest.offset_).astype(int)
if_mnist_test_tree[if_mnist_test_tree==0] = -1

With the scores of each tree, we plot some histograms to understand their distributions.

In [None]:
# Calculate ensemble metrics
if_mnist_test_nos_mean = if_mnist_test_nos_tree.mean(0)
# Percentage of trees whose score exceeds the offset (valid for any number of trees)
if_mnist_test_nos_per = (if_mnist_test_nos_tree > iso_forest.offset_).mean(0) * 100

# Plot
fig, axs = plt.subplots(1,2,figsize=[10,5])
axs[0].grid()
axs[0].set_title('Mean Negative Outlier Score')
axs[0].hist(if_mnist_test_nos_mean[~mask].flatten(), alpha=0.5, bins=100, label='Nominal')
axs[0].hist(if_mnist_test_nos_mean[mask].flatten(), alpha=0.5, bins=100, label='Anomaly', color='k')
axs[0].vlines(iso_forest.offset_, 0, 160, color='k', label=f'offset={iso_forest.offset_:.3f}')
axs[0].set_xlabel('Mean NOS')
axs[0].set_ylabel('Frequency')
axs[0].set_yscale('log')
axs[0].legend()
axs[1].grid()
axs[1].set_title('Percentage of NOS Greater than Offset')
axs[1].hist(if_mnist_test_nos_per[~mask].flatten(), alpha=0.5, bins=100, label='Nominal', range=[0,100])
axs[1].hist(if_mnist_test_nos_per[mask].flatten(), alpha=0.5, bins=100, label='Anomaly', color='k', range=[0,100])
axs[1].vlines(50, 0, 300, color='k', label='50%')
axs[1].set_xlabel('% NOS > Offset')
axs[1].set_ylabel('Frequency')
axs[1].set_yscale('log')
axs[1].legend()

The left histogram shows, for each sample in the MNIST test set, the mean score over all trees, separated by anomaly class. The right histogram shows, for each sample in the MNIST test set, the percentage of trees whose score is greater than the offset. In each histogram, the nominal distribution is separate from the anomalous distribution, with minimal overlap near the offset and 50%. The overlap shows that some nominal samples near the threshold may have exceptionally large scores in a few trees, which is enough to make them nominal even if less than 50% of the trees predict the sample to be nominal.
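The mean of the per-tree scores is not the same quantity as the forest's own score: the forest averages the depths first and then exponentiates, while the per-tree scores exponentiate each depth separately. The sketch below illustrates the difference using the average path length $c(n)$ from the iForest paper. The depths are made-up values for a single hypothetical sample, and `average_path_length` is an illustrative re-implementation rather than the scikit-learn helper.

```python
import numpy as np

def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary search tree."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1.0) + np.euler_gamma) - 2.0 * (n - 1.0) / n

# Normalization constant for a subsample of 256 points (illustrative choice)
c_psi = average_path_length(256)

# Hypothetical depths of one sample in five trees (not real data)
depths_demo = np.array([4.0, 6.0, 9.0, 12.0, 15.0])

# Forest-level score: average the depths, then exponentiate
forest_score = -2 ** (-depths_demo.mean() / c_psi)
# Per-tree scores: exponentiate each depth individually
tree_scores = -2 ** (-depths_demo / c_psi)

print(f'c(256) = {c_psi:.4f}')
print(f'Forest score: {forest_score:.4f}')
print(f'Mean of per-tree scores: {tree_scores.mean():.4f}')
```

Because $2^{-x}$ is convex, the mean of the per-tree scores can never exceed the forest score, which is one reason the two histograms need not agree exactly near the threshold.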

Similarly to the whole iForest, we can plot contour maps for individual trees.

In [None]:
Z_grid_tree_scores = plot_outlier_contour(umap_mnist_test, y_test, if_mnist_test_tree, iso_forest, 'iForest', levels=np.linspace(-0.7,-0.4,5), tree=0)

A single tree's decision boundaries are very rigid, and perform weakly against MNIST. This plot demonstrates how powerful ensemble methods are, as the corresponding plot for the full iForest was smoother and performed well.

Lastly, we can plot the decision tree itself. However, we restrict the max depth to 3 for faster rendering.

In [None]:
plt.figure(figsize=[20,10])
plot_tree(iso_forest.estimators_[0], max_depth=3, feature_names=['UMAP1', 'UMAP2'], proportion=True, fontsize=10)
plt.show()

Each node contains four quantities:
- The decision condition for splitting
- The squared error of the training samples in the node
- The percentage of training samples in the node
- A spurious "mean value" of the training samples in the node

We can gain even deeper insight into the iForest by analyzing the decision conditions for each node, and why that set of decision conditions may be useful for outlier detection.
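As a starting point for that analysis, the decision conditions can be read directly from a fitted tree's low-level `tree_` structure. This is a minimal sketch on synthetic 2D data (an assumption standing in for the UMAP embedding); the `*_demo` names and the small forest settings are chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the 2D UMAP embedding (not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
forest_demo = IsolationForest(n_estimators=10, max_samples=64, random_state=0).fit(X_demo)

feature_names = ['UMAP1', 'UMAP2']
tree = forest_demo.estimators_[0].tree_  # low-level structure of the first tree

# Print the decision condition of the first few nodes
for node in range(min(5, tree.node_count)):
    if tree.children_left[node] == -1:  # -1 marks a leaf node
        print(f'node {node}: leaf')
    else:
        name = feature_names[tree.feature[node]]
        print(f'node {node}: {name} <= {tree.threshold[node]:.3f}')

root_feature = tree.feature[0]
root_threshold = tree.threshold[0]
```

Since iForest splits are drawn at random, thresholds near the edge of a feature's range isolate few samples quickly, which is what produces short paths for outliers.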

Generally, iForest also performed well for anomaly detection, with its own pros and cons in comparison to LOF. The last algorithm uses a kernel to detect anomalies, and has its own preferred use cases compared to the previous two algorithms.

<a id="svm"></a>
## 5. One Class Support Vector Machine (OC-SVM)

A one-class support vector machine [(OC-SVM; detailed explanation with equations can be found under "Introduction")](https://en.wikipedia.org/wiki/One-class_classification) is a kernel-based anomaly detection algorithm that identifies outliers as samples outside of a decision boundary. A vanilla support vector machine (SVM) is a supervised learning technique that predicts a regression or classification value. The objective of an SVM is to find the decision boundary that separates dissimilar predictions with the widest margin, where the margin is measured with respect to the support vectors (i.e. training samples nearest the boundary). SVMs use kernels, which are functions that implicitly map low dimensional data into a higher dimensional space. In the higher dimensional space, a linear boundary is found to separate the data; viewed in the original space, this boundary is generally nonlinear.

In the case of OC-SVMs, the task is "binary classification" with the training data being one class, and the origin being a pseudo-class. In addition, the parameter nu is introduced, which is an upper bound on the fraction of training samples allowed outside the boundary (and a lower bound on the fraction of support vectors). This parameter is similar to iForest's `contamination` parameter, and controls the proportion of outliers allowed in the training set. By maximizing the margin between the training data and the origin while allowing at most a nu proportion of outliers, we learn a boundary that separates the two classes.

Some advantages of OC-SVM are few hyperparameters and a smooth margin over the learned manifold. OC-SVM is also useful when only normal samples are in the training set. However, it is relatively slow to train/predict compared to other algorithms, and scales poorly with sample size. A [linear approximation using stochastic gradient descent (SGD)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDOneClassSVM.html#sklearn.linear_model.SGDOneClassSVM) exists, but we leave exploring that option as an exercise for the reader.
 
Here are some complementary resources:

- [SVM Overview Video (2 minutes, simple language)](https://www.youtube.com/watch?v=_YPScrckx28)
- [StatQuest SVM Main Ideas Video (21 minutes, simple language)](https://www.youtube.com/watch?v=efR1C6CvhmE)
- [OC-SVM Overview Video (6 minutes, simple language)](https://www.youtube.com/watch?v=5YlhdWzltM4)
- [OC-SVM Detailed Video (12 minutes, advanced language)](https://www.youtube.com/watch?v=7HroYexkvXs)
- [OC-SVM Paper 1](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-99-87.pdf)
- [OC-SVM Paper 2](https://proceedings.neurips.cc/paper_files/paper/1999/file/8725fb777f25776ffa9076e44fcfd776-Paper.pdf)

<a id="svm_train"></a>
### 5a. Fit, predict, and visualize using training set

We use [scikit-learn for OC-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html). The main parameters for OC-SVM are `kernel`, which is the kernel used for the SVM, `gamma`, which is the scale for the kernel, and `nu`, which is the proportion of outliers allowed. We use the default parameters `kernel='rbf'`, which is the [radial basis function kernel](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html) (or the squared exponential kernel), `gamma='scale'`, which scales the kernel by the inverse of the product of the number of features and the data's variance, and `nu=0.5`, which allows up to roughly 50% of the training samples to be treated as outliers.
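To make the `gamma` options concrete, the sketch below computes the value implied by `gamma='scale'` and by the alternative `gamma='auto'` (the inverse of the number of features) on synthetic 2D data, then evaluates the RBF kernel $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ for one pair of points. The data and the helper `rbf_kernel_value` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic stand-in for the 2D UMAP embedding (not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))

n_features = X_demo.shape[1]
gamma_scale = 1.0 / (n_features * X_demo.var())  # gamma='scale'
gamma_auto = 1.0 / n_features                    # gamma='auto'

def rbf_kernel_value(x, y, gamma):
    """RBF kernel between two points: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x0, x1 = X_demo[0], X_demo[1]
k_scale = rbf_kernel_value(x0, x1, gamma_scale)
k_auto = rbf_kernel_value(x0, x1, gamma_auto)
# Cross-check against scikit-learn's pairwise implementation
k_scale_sklearn = rbf_kernel(X_demo[:1], X_demo[1:2], gamma=gamma_scale)[0, 0]

print(f"gamma='scale': {gamma_scale:.4f}; gamma='auto': {gamma_auto:.4f}")
print(f'Kernel value (scale): {k_scale:.4f}; kernel value (auto): {k_auto:.4f}')
```

A smaller `gamma` gives a wider kernel, so distant points still look similar; a larger `gamma` makes the boundary follow the data more tightly.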

In [None]:
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.5)

Since the pixel space is too high dimensional to fit in a reasonable time (i.e. it would take around half an hour), we directly fit and detect outliers using the UMAP MNIST data.

**Note: this may take a few minutes to execute.**

In [None]:
t0_oc_svm = time()
oc_svm_mnist_train = oc_svm.fit_predict(umap_mnist_train)
t1_oc_svm = time()

t_oc_svm = t1_oc_svm - t0_oc_svm
print (f'Time spent fitting model: {t_oc_svm:.4f} seconds')

For this 2D data, fitting OC-SVM is significantly slower than the other models. Let's check the shape to make sure we have a vector of predictions.

In [None]:
oc_svm_mnist_train.shape

Now, we plot the UMAP training set with the training labels and detected outliers from the UMAP embedding.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; OC-SVM)\nKernel={oc_svm.kernel}; Gamma={oc_svm.gamma}; Nu={oc_svm.nu}')
mask = oc_svm_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

OC-SVM overfits the data, detecting a majority of clusters as outliers besides 3 (red), 5 (brown), 8 (yellow), and 9 (cyan).

We plot the 10 most anomalous train samples. **Note: this may also take a few minutes to execute.**

In [None]:
oc_svm_mnist_train_os = oc_svm.score_samples(umap_mnist_train)

In [None]:
inds = np.argsort(oc_svm_mnist_train_os)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        oc_svm_os = oc_svm_mnist_train_os[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nOutlier Score: {oc_svm_os:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

Mostly 0s were detected along with one 5. 

Now, we fit using optimal parameters found a priori. We use `gamma='auto'`, which scales the kernel by the inverse of the number of features, and `nu=0.05`, which significantly decreases the proportion of outliers to 5%.

In [None]:
oc_svm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.05)

t0_oc_svm = time()
oc_svm_mnist_train = oc_svm.fit_predict(umap_mnist_train)
t1_oc_svm = time()

t_oc_svm = t1_oc_svm - t0_oc_svm
print (f'Time spent fitting model: {t_oc_svm:.4f} seconds')

Although these parameters fit ten times faster than the default parameters and finish within one minute, OC-SVM is still significantly slower than the other models.

We plot the UMAP training set with the training labels and detected outliers from the UMAP embedding.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Train; OC-SVM)\nKernel={oc_svm.kernel}; Gamma={oc_svm.gamma}; Nu={oc_svm.nu}')
mask = oc_svm_mnist_train == -1
plt.scatter(
    umap_mnist_train[:, 0][~mask], umap_mnist_train[:, 1][~mask], 
    s=1, alpha=0.5, c=y_train[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

OC-SVM performs well with most outliers on the clusters' borders. However, some outliers were found within the clusters, such as for 4 (purple), 5 (brown), and 7 (gray).

Again, we plot the 10 most anomalous train samples.

In [None]:
oc_svm_mnist_train_os = oc_svm.score_samples(umap_mnist_train)
inds = np.argsort(oc_svm_mnist_train_os)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        oc_svm_os = oc_svm_mnist_train_os[ind]
        axs[i, j].set_title(f'Label: {y_train[ind]}\nOutlier Score: {oc_svm_os:.3f}')
        axs[i, j].imshow(x_train_scale[ind])
plt.tight_layout()

Similar to LOF, several uncharacteristic 7s were detected along with a 0.

Finally, we print the number of anomalies detected.

In [None]:
print (f'Number of Anomalies Detected: {(oc_svm_mnist_train == -1).sum()}')

As expected, around 5% of samples were outliers.
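The match between `nu` and the detected fraction can be illustrated on a small synthetic example. In this sketch (synthetic data and `*_demo` names are assumptions), the fraction of training samples flagged as outliers should be close to `nu`, and the fraction of support vectors should be at least about `nu`.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the 2D UMAP embedding (not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))

nu = 0.05
svm_demo = OneClassSVM(kernel='rbf', gamma='auto', nu=nu).fit(X_demo)

# Fraction of training samples outside the learned boundary
frac_outliers = (svm_demo.predict(X_demo) == -1).mean()
# Fraction of training samples kept as support vectors
frac_support = svm_demo.support_vectors_.shape[0] / X_demo.shape[0]

print(f'nu: {nu}; fraction flagged: {frac_outliers:.3f}; fraction of support vectors: {frac_support:.3f}')
```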

<a id="svm_test"></a>
### 5b. Predict test set and visualize boundaries

Now that the OC-SVM is trained, we predict outliers in the test set and their scores.

In [None]:
oc_svm_mnist_test = oc_svm.predict(umap_mnist_test)
oc_svm_mnist_test_os = oc_svm.score_samples(umap_mnist_test)

We plot the test set with the test labels and the detected outliers.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Test; OC-SVM)\nKernel={oc_svm.kernel}; Gamma={oc_svm.gamma}; Nu={oc_svm.nu}')
mask = oc_svm_mnist_test == -1
plt.scatter(
    umap_mnist_test[:, 0][~mask], umap_mnist_test[:, 1][~mask], 
    s=1, alpha=0.5, c=y_test[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=10, alpha=1, label=f'Anomalies', color='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

OC-SVM generalizes to the test set as well.

We also produce a similar plot of the detected outliers with rings proportional to the respective outlier scores. Since lower OC-SVM scores are more anomalous, smaller rings are more anomalous.

In [None]:
plt.figure(figsize=[10,10])
plt.grid()
plt.title(f'UMAP MNIST (Test; OC-SVM)\nKernel={oc_svm.kernel}; Gamma={oc_svm.gamma}; Nu={oc_svm.nu}')
mask = oc_svm_mnist_test == -1
plt.scatter(
    umap_mnist_test[:, 0][~mask], umap_mnist_test[:, 1][~mask], 
    s=1, alpha=0.5, c=y_test[~mask], cmap='tab10'
)
plt.colorbar()
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=1, alpha=1, label=f'Anomalies', color='k'
)
plt.scatter(
    umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
    s=oc_svm_mnist_test_os[mask], facecolor='none', edgecolor='k'
)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

Most of the outliers have a similar score.

Similarly, we plot the 10 most anomalous test samples.

In [None]:
oc_svm_mnist_test_os = oc_svm.score_samples(umap_mnist_test)
inds = np.argsort(oc_svm_mnist_test_os)
fig, axs = plt.subplots(2,5,figsize=[20,5])
for i in range (2):
    for j in range (5):
        ind = inds[i*5+j]
        oc_svm_os = oc_svm_mnist_test_os[ind]
        axs[i, j].set_title(f'Label: {y_test[ind]}\nOutlier Score: {oc_svm_os:.3f}')
        axs[i, j].imshow(x_test_scale[ind])
plt.tight_layout()

We detect a handful of digits that are out of distribution.

To visualize the outlier boundaries, we predict the outlier scores on a grid of points in our manifold.

In [None]:
Z_grid_oc_svm = plot_outlier_boundaries(umap_mnist_train, y_train, oc_svm_mnist_train, oc_svm, 'OC-SVM')

The OC-SVM outlier boundaries are similar to LOF and iForest.

<a id="svm_adv"></a>
### 5c. OC-SVM Distributions

Similar to LOF and iForest, OC-SVM assigns an outlier score to each sample. A sample is detected as an anomaly when its score is *less* than the offset. We plot the distributions of the outlier scores for the training and test sets.

In [None]:
plt.grid()
plt.title('OC-SVM Outlier Score Histogram')
plt.hist(oc_svm_mnist_train_os, bins=200, alpha=0.5, label='train')
plt.hist(oc_svm_mnist_test_os, bins=200, alpha=0.5, label='test')
plt.vlines(oc_svm.offset_, 0, 10**4, color='k', label='offset')
plt.yscale('log')
plt.legend()

The outlier scores are skewed towards the left with rarer samples having low scores.
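The relationship between the score, the offset, and the predicted label can be verified directly. This is a minimal sketch on synthetic data (an assumption; the `*_demo` names are hypothetical) showing that `decision_function` is the score shifted by `offset_`, and that flagged samples are those whose score falls below the offset.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the 2D UMAP embedding (not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))

svm_demo = OneClassSVM(kernel='rbf', gamma='auto', nu=0.05).fit(X_demo)
scores_demo = svm_demo.score_samples(X_demo)
decision_demo = svm_demo.decision_function(X_demo)
labels_demo = svm_demo.predict(X_demo)

# decision_function is the raw score shifted so that the threshold sits at 0
shift_matches = np.allclose(decision_demo, scores_demo - svm_demo.offset_)
# Fraction of samples where "flagged" agrees with "score below the offset"
agreement = np.mean((labels_demo == -1) == (scores_demo < svm_demo.offset_))

print(f'decision_function == score_samples - offset_: {shift_matches}')
print(f'Agreement between labels and thresholded scores: {agreement:.3f}')
```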

Similarly, we can calculate the outlier score of the manifold using the OC-SVM. Samples that are *less* likely to be observed have an outlier score closer to 0. By visualizing the outlier score distribution on the manifold, we can determine where scores are similar.

In [None]:
Z_grid_oc_svm_scores = plot_outlier_contour(umap_mnist_test, y_test, oc_svm_mnist_test, oc_svm, 'OC-SVM', levels=np.linspace(30, 100, 5))

The darker contour lines represent less likely regions, and brighter contour lines represent more likely regions. The outlier score contour map produced from the OC-SVM is also similar to LOF with slightly smoother regions.

Generally, OC-SVM also performed well for anomaly detection.

<a id="con"></a>
## 6. Conclusions

Anomaly detection is a crucial task for exploring data and understanding its complexity. Local Outlier Factor (LOF) is a density-based algorithm that scores low density regions as anomalous. Isolation Forest (iForest) is an ensemble tree-based algorithm that trains multiple decision trees and scores short tree paths (i.e. isolated samples) as anomalous. One Class Support Vector Machine (OC-SVM) is a kernel-based algorithm that maximizes the inlier/outlier margin, and scores rare and deviant samples as anomalous. There are many anomaly detection algorithms, but using these three in unison is a solid start to any machine learning based exploratory data analysis.
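One simple way to use the three detectors in unison is a majority vote. The sketch below is not part of the analysis above: it uses synthetic data with a few injected uniform points, arbitrary hyperparameters, and a hypothetical "at least two of three" rule, purely to illustrate the idea.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic data: a Gaussian blob plus a few scattered points (not the notebook's data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(480, 2)), rng.uniform(-6, 6, size=(20, 2))])

detectors = {
    'LOF': LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    'iForest': IsolationForest(contamination=0.05, random_state=0),
    'OC-SVM': OneClassSVM(kernel='rbf', gamma='auto', nu=0.05),
}

# One boolean column per detector: True where the sample is flagged as an outlier
flags = np.column_stack([det.fit_predict(X_demo) == -1 for det in detectors.values()])
votes = flags.sum(axis=1)
consensus = votes >= 2  # hypothetical rule: flagged by at least two detectors

for name, column in zip(detectors, flags.T):
    print(f'{name}: {column.sum()} flagged')
print(f'Flagged by at least two detectors: {consensus.sum()}')
```

Samples flagged by all three detectors are natural first candidates for follow-up, while samples flagged by only one highlight where the algorithms' assumptions differ.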

**Thank you and congratulations for completing the notebook!**

In [None]:
# Time spent for each model
print (f'Time spent fitting LOF: {t_lof:.4f} seconds')
print (f'Time spent fitting iForest: {t_if:.4f} seconds')
print (f'Time spent fitting OC-SVM: {t_oc_svm:.4f} seconds')

In [None]:
# Outlier Boundaries
fig, axs = plt.subplots(1,3,figsize=[15,5])
axs[0].set_title('LOF Outlier Boundary')
axs[0].imshow(Z_grid_lof, origin='lower')
axs[1].set_title('iForest Outlier Boundary')
axs[1].imshow(Z_grid_if, origin='lower')
axs[2].set_title('OC-SVM Outlier Boundary')
axs[2].imshow(Z_grid_oc_svm, origin='lower')

In [None]:
# Outlier Contours
fig, axs = plt.subplots(1,3,figsize=[15,5])
axs[0].set_title('LOF Outlier Score Contour')
axs[0].imshow(Z_grid_lof_scores, origin='lower', vmin=-4, vmax=-1)
axs[1].set_title('iForest Outlier Score Contour')
axs[1].imshow(Z_grid_if_scores, origin='lower', vmin=-0.7, vmax=-0.5)
axs[2].set_title('OC-SVM Outlier Score Contour')
axs[2].imshow(Z_grid_oc_svm_scores, origin='lower', vmin=50, vmax=100)

In [None]:
# Outlier Scores
fig, axs = plt.subplots(1,3,figsize=[15,5])
axs[0].grid()
axs[0].set_title('LOF Outlier Score (Colored)')
axs0 = axs[0].scatter(umap_mnist_train[:, 0], umap_mnist_train[:, 1], s=1, alpha=1, c=lof_mnist_train_nof)
cbar0 = fig.colorbar(axs0, ax=axs[0])
axs[1].grid()
axs[1].set_title('iForest Outlier Score (Colored)')
axs1 = axs[1].scatter(umap_mnist_train[:, 0], umap_mnist_train[:, 1], s=1, alpha=1, c=if_mnist_train_nos)
cbar1 = fig.colorbar(axs1, ax=axs[1])
axs[2].grid()
axs[2].set_title('OC-SVM Outlier Score (Colored)')
axs2 = axs[2].scatter(umap_mnist_train[:, 0], umap_mnist_train[:, 1], s=1, alpha=1, c=oc_svm_mnist_train_os)
cbar2 = fig.colorbar(axs2, ax=axs[2])
plt.tight_layout()

<a id="add"></a>
## Additional Resources

Machine learning is a dense and rapidly evolving field of study. Becoming an expert takes years of practice and patience, but hopefully this notebook brought you closer in that direction. Here are some of the author's favorite resources for learning about machine learning and data science:

- [Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)
- [scikit-learn Python Library](https://scikit-learn.org/stable/index.html) (go-to for most ML algorithms besides neural networks)
  - [Novelty and Outlier Detection](https://scikit-learn.org/stable/modules/outlier_detection.html)
- [StatQuest YouTube Channel](https://www.youtube.com/c/joshstarmer)
- [DeepLearningAI YouTube Channel](https://www.youtube.com/c/Deeplearningai/videos)
- [Towards Data Science](https://towardsdatascience.com/) (articles about data science and machine learning, some involving example blocks of code)
- Advanced search on [arxiv](https://arxiv.org/search/advanced) (e.g. search term "machine learning" in Abstract for Subject astro-ph) to see what others are doing currently
- Google, YouTube, Wikipedia and ChatGPT (confirm results with external sources) in general
  - [SVM Kernel Trick Video (3 minutes, simple language)](https://www.youtube.com/watch?v=Q7vT0--5VII)
- [PyOD: Python library for outlier detection](https://pyod.readthedocs.io/en/latest/)

<a id="about"></a>
## About this Notebook

**Author:** Fred Dauphin, DeepWFC3

**Updated on:** 2025-01-16

<a id="cite"></a>
## Citations

If you use `numpy`, `matplotlib`, `sklearn`, or `umap` for published research, please cite the authors. Follow these links for more information about citing `numpy`, `matplotlib`, `sklearn`, and `umap`:

* [Citing `numpy`](https://numpy.org/doc/stable/license.html)
* [Citing `matplotlib`](https://matplotlib.org/stable/users/project/license.html#:~:text=Matplotlib%20only%20uses%20BSD%20compatible,are%20acceptable%20in%20matplotlib%20toolkits.)
* [Citing `sklearn`](https://scikit-learn.org/stable/about.html#citing-scikit-learn)
* [Citing `umap`](https://github.com/lmcinnes/umap/blob/master/LICENSE.txt)

***
[Top of Page](#title)