# Benchmark of various outlier detection models

**[PyOD](https://github.com/yzhao062/Pyod)** is a comprehensive **Python toolkit** to **identify outlying objects** in 
multivariate data with both unsupervised and supervised approaches.

The following models are used for comparison:

  1. Linear Models for Outlier Detection:
     1. **PCA: Principal Component Analysis** (use the sum of
       weighted projected distances to the eigenvector hyperplane 
       as the outlier score) [10]
     2. **One-Class Support Vector Machines** [3]
     
  2. Proximity-Based Outlier Detection Models:
     1. **LOF: Local Outlier Factor** [1]
     2. **kNN: k Nearest Neighbors** (use the distance to the kth nearest 
     neighbor as the outlier score)
     3. **Average kNN** Outlier Detection (use the average distance to k 
     nearest neighbors as the outlier score)
     4. **Median kNN** Outlier Detection (use the median distance to k nearest 
     neighbors as the outlier score)
     5. **HBOS: Histogram-based Outlier Score** [5]
     
  3. Probabilistic Models for Outlier Detection:
     1. **ABOD: Angle-Based Outlier Detection** [7]
     2. **FastABOD: Fast Angle-Based Outlier Detection using approximation** [7]
  
  4. Outlier Ensembles and Combination Frameworks:
     1. **Isolation Forest** [2]
     2. **Feature Bagging** [9]
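
The three kNN variants listed above differ only in how the distances to the k nearest neighbors are reduced to a single score. The sketch below illustrates that idea with plain NumPy; it is not PyOD's implementation, and the function name `knn_outlier_scores` and the toy data are assumptions made for illustration. It uses brute-force pairwise distances, so it only suits small inputs.

```python
import numpy as np


def knn_outlier_scores(X, k, method="largest"):
    """Score each row of X by its distances to its k nearest neighbors.

    Hypothetical helper for illustration only (not part of PyOD).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n_samples, n_samples)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point must not count as its own neighbor
    np.fill_diagonal(dist, np.inf)
    # Distances to the k nearest neighbors, sorted ascending per row
    knn_dist = np.sort(dist, axis=1)[:, :k]

    if method == "largest":   # kNN: distance to the kth nearest neighbor
        return knn_dist[:, -1]
    if method == "mean":      # Average kNN
        return knn_dist.mean(axis=1)
    if method == "median":    # Median kNN
        return np.median(knn_dist, axis=1)
    raise ValueError("method must be 'largest', 'mean' or 'median'")


# Toy data: four points on a unit square plus one far-away point
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
scores = knn_outlier_scores(X_demo, k=2, method="largest")
print(scores)
```

In this toy example the isolated point at (5, 5) receives the highest score under all three reductions, since every one of its neighbors is far away.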

In [1]:
from __future__ import division
from __future__ import print_function

import os
import sys

# temporary solution for relative imports in case pyod is not installed
# if pyod is installed, no need to use the following line
sys.path.append(os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))

import numpy as np
from sklearn.model_selection import train_test_split
from scipy.io import loadmat

from pyod.models.lof import LOF
from pyod.models.knn import KNN
from pyod.models.abod import ABOD
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA
from pyod.models.iforest import IForest

from pyod.utils.utility import standardizer
from pyod.utils.data import evaluate_print

In [2]:
# Define the data files and read X and y from each one.
# The loop overwrites its variables on every pass, so the later cells
# operate on the last dataset in the list (wbc.mat).

mat_file_list = ['annthyroid.mat',
                 'arrhythmia.mat',
                 'breastw.mat',
                 'cardio.mat',
                 'glass.mat',
                 'ionosphere.mat',
                 'letter.mat',
                 'lympho.mat',
                 'mammography.mat',
                 'mnist.mat',
                 'musk.mat',
                 'optdigits.mat',
                 'pendigits.mat',
                 'pima.mat',
                 'satellite.mat',
                 'satimage-2.mat',
                 'shuttle.mat',
                 'thyroid.mat',
                 'vertebral.mat',
                 'vowels.mat',
                 'wbc.mat']


for mat_file in mat_file_list:
    print("Processing", mat_file)
    mat = loadmat(os.path.join('data', mat_file))

    X = mat['X']
    y = mat['y'].ravel()

    # 60% data for training and 40% for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

    # standardizing data for processing
    X_train_norm, X_test_norm = standardizer(X_train, X_test)

Processing annthyroid.mat
Processing arrhythmia.mat
Processing breastw.mat
Processing cardio.mat
Processing glass.mat
Processing ionosphere.mat
Processing letter.mat
Processing lympho.mat
Processing mammography.mat
Processing mnist.mat
Processing musk.mat
Processing optdigits.mat
Processing pendigits.mat
Processing pima.mat
Processing satellite.mat
Processing satimage-2.mat
Processing shuttle.mat
Processing thyroid.mat
Processing vertebral.mat
Processing vowels.mat
Processing wbc.mat
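
The standardization step in the loop above scales the test split with statistics taken from the training split, so no information from the test data leaks into preprocessing. The sketch below shows that idea with plain NumPy, assuming ordinary z-score scaling; the helper name `zscore_train_test` is hypothetical and this is not PyOD's `standardizer` itself.

```python
import numpy as np


def zscore_train_test(X_train, X_test):
    """Z-score both splits using the training mean and standard deviation.

    Hypothetical helper for illustration only (not part of PyOD).
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Constant features have zero spread; leave them unscaled
    std[std == 0] = 1.0
    return (X_train - mean) / std, (X_test - mean) / std


# Toy data: the second feature is constant in the training split
X_tr = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_te = np.array([[3.0, 12.0]])
X_tr_norm, X_te_norm = zscore_train_test(X_tr, X_te)
print(X_tr_norm)
print(X_te_norm)
```

After scaling, each non-constant training feature has zero mean and unit variance, while the test features are only approximately standardized because they reuse the training statistics.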


In [3]:
print("Training data:", X_train.shape, y_train.shape)
print("Test data:", X_test.shape, y_test.shape)

Training data: (226, 30) (226,)
Test data: (152, 30) (152,)


In [4]:
n_clf = 20  # number of base detectors

# Initialize 20 base detectors for combination
k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
          150, 160, 170, 180, 190, 200]

train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])

print('Initializing {n_clf} kNN detectors'.format(n_clf=n_clf))

for i in range(n_clf):
    k = k_list[i]

    clf = KNN(n_neighbors=k, method='largest')
    clf.fit(X_train_norm)

    train_scores[:, i] = clf.decision_scores_
    test_scores[:, i] = clf.decision_function(X_test_norm)
    print('Base detector %i is fitted for prediction' % i)

Initializing 20 kNN detectors
Base detector 0 is fitted for prediction
Base detector 1 is fitted for prediction
Base detector 2 is fitted for prediction
Base detector 3 is fitted for prediction
Base detector 4 is fitted for prediction
Base detector 5 is fitted for prediction
Base detector 6 is fitted for prediction
Base detector 7 is fitted for prediction
Base detector 8 is fitted for prediction
Base detector 9 is fitted for prediction
Base detector 10 is fitted for prediction
Base detector 11 is fitted for prediction
Base detector 12 is fitted for prediction
Base detector 13 is fitted for prediction
Base detector 14 is fitted for prediction
Base detector 15 is fitted for prediction
Base detector 16 is fitted for prediction
Base detector 17 is fitted for prediction
Base detector 18 is fitted for prediction
Base detector 19 is fitted for prediction


In [7]:
# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores,
                                                   test_scores)

# Normalized scores from all base detectors on the training and test data
# are stored in train_scores_norm and test_scores_norm
print('Decision score matrix on training data', train_scores_norm.shape)
print('Decision score matrix on test data', test_scores_norm.shape)

Decision score matrix on training data (226, 20)
Decision score matrix on test data (152, 20)
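
To see why the scores are normalized before combination, consider detectors whose raw scores live on very different scales, as kNN distances do for small and large k. The sketch below uses made-up scores from two hypothetical detectors and compares a simple average of raw scores with an average of z-scored columns. It illustrates the motivation only and assumes averaging as the combination rule; other rules, such as taking the maximum, are equally possible.

```python
import numpy as np

# Made-up raw scores: rows are samples, columns are two hypothetical detectors.
# Detector 0 scores in [0, 1]; detector 1 scores in the hundreds.
raw = np.array([[0.1, 100.0],
                [0.2, 400.0],
                [0.9, 110.0],
                [0.2, 120.0]])

# Averaging raw scores lets the large-scale detector dominate
avg_raw = raw.mean(axis=1)

# Z-score each detector's column so both contribute on a comparable scale
norm = (raw - raw.mean(axis=0)) / raw.std(axis=0)
avg_norm = norm.mean(axis=1)

# Sample indices ordered from most to least outlying
rank_raw = np.argsort(-avg_raw)
rank_norm = np.argsort(-avg_norm)
print("Ranking from raw average:       ", rank_raw)
print("Ranking from normalized average:", rank_norm)
```

With raw averaging, sample 2, which detector 0 flags strongly, ranks below sample 3 because detector 1's larger numbers swamp it. After normalization, sample 2 moves up to second place, reflecting both detectors' opinions.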
