# Demonstration of Deep Outlier Detection Models
1. [Introduction](#1-introduction)
2. [Demonstration on Classical Dataset](#2-demonstration-on-classical-dataset)
    1. [Load Data](#21-load-data)
    2. [Model Setting](#22-model-setting)
    3. [Performance Comparison](#23-performance-comparison)
3. [Reference](#reference)

## 1. Introduction

This demonstration shows the performance of deep outlier detection models on several classical datasets. The models covered in this demonstration include:

1. **Deep SVDD** Deep One-Class Classification. (ICML'18)
2. **REPEN** Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection. (KDD'18)
3. **RDP** Unsupervised Representation Learning by Predicting Random Distances. (IJCAI'20)
4. **RCA** A Deep Collaborative Autoencoder Approach for Anomaly Detection. (IJCAI'21)
5. **GOAD** Classification-Based Anomaly Detection for General Data. (ICLR'20)
6. **NeuTraL** Neural Transformation Learning for Deep Anomaly Detection Beyond Images. (ICML'21)
7. **ICL** Anomaly Detection for Tabular Data with Internal Contrastive Learning. (ICLR'22)
8. **DIF** Deep Isolation Forest for Anomaly Detection. (TKDE'23)


```python
import numpy as np
from numpy import percentile
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings("ignore")  # silence library warnings to keep the output readable
```

## 2. Demonstration on Classical Dataset

```python
from pyod.utils.utility import standardizer
from pyod.utils.utility import precision_n_scores
from sklearn.metrics import roc_auc_score
from scipy.io import loadmat
from time import time
import os
```

## 2.1. Load Data
All datasets below are taken from Outlier Detection DataSets (ODDS): http://odds.cs.stonybrook.edu/#table1

The benchmark loop in Section 2.3 expects the `.mat` files in a local `datasets` directory.

```python
mat_file_list = [
    'arrhythmia.mat',
    'cardio.mat',
    'ionosphere.mat',
    'letter.mat',
    'lympho.mat',
    'mnist.mat',
    'musk.mat',
    'optdigits.mat',
    'pendigits.mat',
    'pima.mat',
    'satellite.mat',
    'satimage-2.mat',
    'vertebral.mat',
    'vowels.mat',
    'wbc.mat',
]
```
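
Each file is later read with `loadmat`, taking the feature matrix from the key `X` and the binary labels from `y`, where non-zero entries count as outliers. The sketch below shows how the sample count, dimensionality and outlier percentage in the result table are derived. `summarize_dataset` is a hypothetical helper introduced here for illustration, and a small synthetic array stands in for a real `.mat` file so the snippet runs without any download.

```python
import numpy as np

def summarize_dataset(X, y):
    """Return (n_samples, n_features, outlier percentage).

    Hypothetical helper mirroring the statistics computed in the benchmark loop.
    """
    y = np.asarray(y).ravel()                     # labels may arrive as a column vector
    outlier_fraction = np.count_nonzero(y) / len(y)
    return X.shape[0], X.shape[1], round(outlier_fraction * 100, 4)

# Synthetic stand-in for: data = loadmat(path); X = data['X']; y = data['y']
X_demo = np.zeros((200, 8))
y_demo = np.zeros((200, 1), dtype=int)
y_demo[:10] = 1                                   # mark 10 of 200 samples as outliers

summary = summarize_dataset(X_demo, y_demo)
print(summary)
```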

## 2.2. Model Setting
All models below are provided by DeepOD (Deep learning-based Outlier Detection): https://github.com/xuhongzuo/DeepOD

```python
from deepod.models.dsvdd import DeepSVDD
from deepod.models.rdp import RDP
from deepod.models.repen import REPEN
from deepod.models.rca import RCA
from deepod.models.goad import GOAD
from deepod.models.dif import DeepIsolationForest
from deepod.models.neutral import NeuTraL
from deepod.models.icl import ICL
```

### General Parameter settings

#### **epochs**: int, optional (default=100)
        Number of training epochs

#### **batch_size**: int, optional (default=64 for all algorithms except Deep Isolation Forest, which uses 1000)
        Number of samples in a mini-batch

#### **lr**: float, optional (default=1e-3)
        Learning rate

#### **hidden_dims**: list, str or int, optional (default='100,50')
        Number of neural units in hidden layers
            - If list, each item is a layer
            - If str, neural units of hidden layers are split by comma
            - If int, number of neural units of single hidden layer

#### **act**: str, optional (default='LeakyReLU' for GOAD, NeuTraL and REPEN; 'ReLU' for all other algorithms)
        Activation layer name
        choice = ['ReLU', 'LeakyReLU', 'Sigmoid', 'Tanh']

#### **bias**: bool, optional (default=False)
        Additive bias in linear layer

#### **epoch_steps**: int, optional (default=-1)
        Maximum steps in an epoch
            - If -1, all the batches will be processed

#### **verbose**: int, optional (default=1)
        Verbosity mode

#### **random_state**: int, optional (default=42)
        Seed used by the random number generator
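
The three accepted forms of `hidden_dims` all describe the same thing: a sequence of layer widths. The following sketch shows one way such an argument could be normalized to a list of integers. `parse_hidden_dims` is a hypothetical function written for this illustration and is not taken from DeepOD's source.

```python
def parse_hidden_dims(hidden_dims):
    """Normalize `hidden_dims` (int, comma-separated str, or list) to a list of ints.

    Illustrative only; DeepOD's internal handling may differ.
    """
    if isinstance(hidden_dims, int):
        return [hidden_dims]                      # single hidden layer
    if isinstance(hidden_dims, str):
        # '100,50' -> [100, 50]; empty fragments are skipped
        return [int(part) for part in hidden_dims.split(',') if part.strip()]
    if isinstance(hidden_dims, (list, tuple)):
        return [int(units) for units in hidden_dims]
    raise TypeError(f"unsupported hidden_dims type: {type(hidden_dims).__name__}")

print(parse_hidden_dims('100,50'))
print(parse_hidden_dims([100, 50]))
print(parse_hidden_dims(64))
```

With this reading, the default `'100,50'` corresponds to two hidden layers of 100 and 50 units.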

#### **Deep Isolation Forest**
1. **n_estimators**: int, optional (default=6)
        The number of base estimators in the ensemble.

2. **max_samples**: int or float, optional (default=256)
        The number of samples to draw from X to train each base estimator.
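
As a rough sketch, the defaults listed above can be collected into keyword arguments and passed to the model constructors. This assumes the constructors accept exactly these names as keyword arguments, which the parameter list suggests but which may vary between DeepOD versions, so the instantiation is wrapped in a guard and skipped if DeepOD is missing or its signature differs.

```python
# Defaults as listed in the parameter section, gathered in one place.
general_params = dict(
    epochs=100,
    batch_size=64,
    lr=1e-3,
    hidden_dims='100,50',
    act='ReLU',
    bias=False,
    epoch_steps=-1,
    verbose=1,
    random_state=42,
)

# Deep Isolation Forest: larger batch size plus its two ensemble parameters.
dif_params = {**general_params, 'batch_size': 1000, 'n_estimators': 6, 'max_samples': 256}

try:
    # Assumption: these import paths and keyword names match the installed DeepOD version.
    from deepod.models.dsvdd import DeepSVDD
    from deepod.models.dif import DeepIsolationForest
    svdd = DeepSVDD(**general_params)
    dif = DeepIsolationForest(**dif_params)
except Exception as exc:  # DeepOD not installed, or the signature differs
    print("Skipping model construction:", exc)
```

Keeping the settings in a dictionary makes it easy to override a single value, for example `{**general_params, 'epochs': 20}` for a quick trial run.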


## 2.3. Performance Comparison

```python
# Models to compare; the order fixes the column order of the result table
classifiers = ['DeepSVDD', 'RDP', 'REPEN', 'RCA', 'GOAD', 'Neutral', 'ICL', 'DIF']

df_columns = ['Data', '# Samples', '# Dimensions', 'Outlier Perc'] + classifiers

rows = []  # one result row per dataset

for mat_file in tqdm(mat_file_list):
    print("\n... Processing", mat_file, '...')

    data = loadmat(os.path.join('datasets', mat_file))
    X = data['X']
    y = data['y'].ravel()
    outliers_fraction = np.count_nonzero(y) / len(y)
    outliers_percentage = round(outliers_fraction * 100, ndigits=4)

    # Re-create the models for every dataset so that no fitted state is carried over
    classifiers_dict = {
        'DeepSVDD': DeepSVDD(verbose=0),
        'RDP': RDP(verbose=0),
        'REPEN': REPEN(verbose=0),
        'RCA': RCA(verbose=0),
        'GOAD': GOAD(verbose=0),
        'Neutral': NeuTraL(verbose=0),
        'ICL': ICL(verbose=0),
        'DIF': DeepIsolationForest(verbose=0),
    }

    X_norm = standardizer(X)

    # Dataset name, size, dimensionality and outlier percentage
    row = [os.path.splitext(mat_file)[0], X.shape[0], X.shape[1], outliers_percentage]

    for clf_name in classifiers:
        clf = classifiers_dict[clf_name]
        # Unsupervised setting: fit and score on the same data;
        # the labels are used only for evaluation
        clf.fit(X_norm)
        test_scores = clf.decision_function(X_norm)

        roc = round(roc_auc_score(y, test_scores), ndigits=4)
        # prn = round(precision_n_scores(y, test_scores), ndigits=4)
        row.append(roc)

    rows.append(row)

# Build the table once at the end, which also gives it a clean 0..n-1 index
roc_df = pd.DataFrame(rows, columns=df_columns)
```
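
The loop above repeats the same pattern for every dataset: standardize the features, fit a detector, score the same samples, and compare the scores with the labels through ROC AUC. The sketch below reproduces that pattern on synthetic data with NumPy only. It rests on several assumptions: `zscore` is a simplified stand-in for `standardizer`, `CentroidDistance` is a toy detector that merely imitates the `fit` / `decision_function` interface, and `pairwise_auc` computes ROC AUC through its pairwise-ranking interpretation instead of calling `roc_auc_score`.

```python
import numpy as np

def zscore(X):
    """Column-wise standardization to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0                 # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std

class CentroidDistance:
    """Toy detector (hypothetical): anomaly score = distance to the data centroid."""

    def fit(self, X, y=None):
        self.center_ = X.mean(axis=0)
        return self

    def decision_function(self, X):
        return np.linalg.norm(X - self.center_, axis=1)

def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (outlier, inlier) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]  # every outlier score against every inlier score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Synthetic data: 200 inliers around 0 and 10 outliers around 6
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(6, 1, (10, 5))])
y = np.r_[np.zeros(200, dtype=int), np.ones(10, dtype=int)]

X_norm = zscore(X)
clf = CentroidDistance().fit(X_norm)    # labels are not used for fitting
auc = pairwise_auc(y, clf.decision_function(X_norm))
print(round(auc, 4))
```

An AUC of 0.5 corresponds to random ranking, and values below 0.5 mean that inliers tend to receive higher scores than outliers.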


```python
roc_df
```

ROC AUC of each model per dataset (`Outlier Perc` is the percentage of outliers in the dataset):

| Data | # Samples | # Dimensions | Outlier Perc | DeepSVDD | RDP | REPEN | RCA | GOAD | Neutral | ICL | DIF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cardio | 1831 | 21 | 9.6122 | 0.5575 | 0.7227 | 0.7463 | 0.8647 | 0.1549 | 0.4477 | 0.1944 | 0.9122 |
| ionosphere | 351 | 33 | 35.8974 | 0.3623 | 0.7542 | 0.8787 | 0.9036 | 0.8528 | 0.8789 | 0.7403 | 0.8995 |
| letter | 1600 | 32 | 6.25 | 0.5951 | 0.7743 | 0.6695 | 0.7507 | 0.7843 | 0.8825 | 0.7996 | 0.658 |
| lympho | 148 | 18 | 4.0541 | 0.3286 | 0.5023 | 0.919 | 0.9718 | 0.1338 | 0.6866 | 0.5235 | 0.9847 |
| mnist | 7603 | 100 | 9.2069 | 0.5344 | 0.7715 | 0.7516 | 0.8529 | 0.4846 | 0.4825 | 0.728 | 0.8579 |
| musk | 3062 | 166 | 3.1679 | 0.9282 | 0.9648 | 1.0 | 0.9588 | 1.0 | 0.9778 | 0.9845 | 1.0 |
| optdigits | 5216 | 64 | 2.8758 | 0.4485 | 0.5283 | 0.5595 | 0.4484 | 0.6884 | 0.5609 | 0.5199 | 0.496 |
| pendigits | 6870 | 16 | 2.2707 | 0.6831 | 0.6637 | 0.9344 | 0.9007 | 0.1407 | 0.4686 | 0.5196 | 0.9497 |
| pima | 768 | 8 | 34.8958 | 0.4819 | 0.681 | 0.6461 | 0.6722 | 0.4549 | 0.6074 | 0.452 | 0.6826 |
| satellite | 6435 | 36 | 31.6395 | 0.4742 | 0.6824 | 0.7541 | 0.6732 | 0.5925 | 0.6162 | 0.5548 | 0.7145 |
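
A per-dataset table is easier to compare once it is condensed into one number per model. The sketch below copies the AUC values from the table above and derives the mean AUC, the average rank and the best model per dataset with pandas. The variable names are chosen here for illustration, and which summary statistic is most appropriate is left open.

```python
import pandas as pd

models = ['DeepSVDD', 'RDP', 'REPEN', 'RCA', 'GOAD', 'Neutral', 'ICL', 'DIF']

# AUC values copied from the result table above (rows = datasets)
auc_rows = {
    'cardio':     [0.5575, 0.7227, 0.7463, 0.8647, 0.1549, 0.4477, 0.1944, 0.9122],
    'ionosphere': [0.3623, 0.7542, 0.8787, 0.9036, 0.8528, 0.8789, 0.7403, 0.8995],
    'letter':     [0.5951, 0.7743, 0.6695, 0.7507, 0.7843, 0.8825, 0.7996, 0.658],
    'lympho':     [0.3286, 0.5023, 0.919, 0.9718, 0.1338, 0.6866, 0.5235, 0.9847],
    'mnist':      [0.5344, 0.7715, 0.7516, 0.8529, 0.4846, 0.4825, 0.728, 0.8579],
    'musk':       [0.9282, 0.9648, 1.0, 0.9588, 1.0, 0.9778, 0.9845, 1.0],
    'optdigits':  [0.4485, 0.5283, 0.5595, 0.4484, 0.6884, 0.5609, 0.5199, 0.496],
    'pendigits':  [0.6831, 0.6637, 0.9344, 0.9007, 0.1407, 0.4686, 0.5196, 0.9497],
    'pima':       [0.4819, 0.681, 0.6461, 0.6722, 0.4549, 0.6074, 0.452, 0.6826],
    'satellite':  [0.4742, 0.6824, 0.7541, 0.6732, 0.5925, 0.6162, 0.5548, 0.7145],
}
auc_df = pd.DataFrame.from_dict(auc_rows, orient='index', columns=models)

# Mean AUC per model, best first
mean_auc = auc_df.mean().sort_values(ascending=False)

# Average rank per model across datasets (1 = best; tied values share the mean rank)
avg_rank = auc_df.rank(axis=1, ascending=False).mean().sort_values()

# Best model per dataset; on ties idxmax returns the first column with the maximum
best_per_dataset = auc_df.idxmax(axis=1)

print(mean_auc.round(4))
print(avg_rank.round(2))
print(best_per_dataset)
```

Average ranks are less sensitive than mean AUC to a single dataset on which a model fails badly, so the two summaries can be read side by side.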


## Reference

1. Ting et al. [**Isolation Distributional Kernel: A New Tool for Point & Group Anomaly Detection**](https://ieeexplore.ieee.org/abstract/document/9573389) *IEEE Transactions on Knowledge and Data Engineering*, 2021.
2. Bandaragoda et al. **Isolation-based anomaly detection using nearest-neighbor ensembles.** *Computational Intelligence*, 2018.
3. Han et al. [**ADBench: Anomaly Detection Benchmark**](https://proceedings.neurips.cc/paper_files/paper/2022/file/cf93972b116ca5268827d575f2cc226b-Paper-Datasets_and_Benchmarks.pdf) *Advances in Neural Information Processing Systems*, 2022.
4. [**DeepOD** (github.com/xuhongzuo/DeepOD)](https://github.com/xuhongzuo/DeepOD/tree/main)

There are some other useful demonstrations:
* https://github.com/dennishnf/unsupervised-anomaly-detection
* https://github.com/tvhahn/anomaly-pyconca/blob/master/anom-detection-milling.ipynb
* https://github.com/Apress/beginning-anomaly-detection-using-python-based-dl



