# Anomaly detection example

*BLUF: `empdens.cade` with all-default settings outperforms sklearn's IsolationForest in an anomaly detection task (AUROC of about 0.83 versus 0.75), but the best published results (AUC approaching 0.95) substantially outperform both methods.*

The [Japanese vowels data](http://odds.cs.stonybrook.edu/japanese-vowels-data/) consist of repeated observations from 9 individuals. To create a labeled anomaly detection dataset, the data curators \[who?\] downsampled the observations of one of the individuals, which artificially makes that individual's observations anomalous in the context of the larger dataset.
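
The downsampling idea can be illustrated with a minimal sketch. This is not the curators' actual procedure: the function name `downsample_to_anomalies`, the group IDs, and the counts are all hypothetical, chosen only to show how keeping a small sample of one group makes that group rare, and therefore a natural stand-in for anomalies.

```python
import random


def downsample_to_anomalies(records, anomaly_group, n_keep, seed=0):
    """Label one group as anomalous after shrinking it to n_keep records.

    Hypothetical helper: records from every other group are kept in full
    with label 0; a random sample of n_keep records from anomaly_group is
    kept with label 1.
    """
    rng = random.Random(seed)
    rare = [r for r in records if r["group"] == anomaly_group]
    common = [r for r in records if r["group"] != anomaly_group]
    kept = rng.sample(rare, n_keep)
    return [dict(r, label=0) for r in common] + [dict(r, label=1) for r in kept]


# Made-up toy data: 3 groups with 100 observations each.
toy = [{"group": g, "value": i} for g in range(3) for i in range(100)]
labeled = downsample_to_anomalies(toy, anomaly_group=0, n_keep=5)
anomaly_rate = sum(r["label"] for r in labeled) / len(labeled)
print(len(labeled), anomaly_rate)
```

After downsampling, the chosen group makes up only a small fraction of the rows, so a density estimator fit to the whole dataset should assign its region of feature space relatively low density.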

[Sathe and Aggarwal (2016)](http://saketsathe.net/downloads/lodes.pdf) present an outlier detection method that uses an iterative spectral embedding to model the graph structure of the data. Their Table 3 shows comparably sophisticated outlier detection methods achieving AUC scores approaching 0.95 on the vowels data.

How does `empdens.cade` (with all-default settings) measure up? 

In [1]:
import pandas as pd

import empdens

vowels = empdens.load_Japanese_vowels_data()
vowels.head()

| Unnamed: 0 | feature0 | feature1 | feature2 | feature3 | feature4 | feature5 | feature6 | feature7 | feature8 | feature9 | feature10 | feature11 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.580469 | -0.902534 | 0.617899 | -0.997942 | -2.463799 | -0.846455 | 2.349849 | 0.3754 | -0.649334 | 1.604637 | -0.62306 | -0.383125 | 0.0 |
| 1 | 0.784375 | -1.077366 | 0.615781 | -0.921911 | -2.388553 | -0.638047 | 2.106684 | 0.361018 | -0.714317 | 1.260236 | -0.423339 | -0.287791 | 0.0 |
| 2 | 0.791292 | -1.086242 | 0.669773 | -0.806112 | -2.260781 | -0.538491 | 2.053282 | 0.266492 | -0.842815 | 1.081797 | -0.267201 | -0.172203 | 0.0 |
| 3 | 1.217306 | -1.083425 | 0.855483 | -0.724879 | -2.155552 | -0.101879 | 1.768597 | 0.303151 | -1.04471 | 0.65529 | 0.214298 | -0.34184 | 0.0 |
| 4 | 1.065352 | -1.030178 | 0.773297 | -0.452289 | -1.955907 | 0.248205 | 1.530474 | 0.25374 | -0.968961 | -0.208287 | 0.331578 | 0.007288 | 0.0 |


Fit CADE:

In [2]:
labels = vowels.label
X = vowels.drop("label", axis=1)
cade = empdens.cade.Cade()
cade.train(X)

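Compute the AUROC of the estimated density as a predictor of anomalousness. A low estimated density should indicate an anomaly, so the anomaly score is one minus the prediction: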

In [3]:
eval_df = pd.DataFrame({"anomaly_score": 1 - cade.predict(X), "label": labels})
metrics = empdens.evaluation.Binary(truth=eval_df.label.values, pred=eval_df.anomaly_score.values)
metrics.AUROC()

0.8278378378378378
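
To make the metric concrete, here is a minimal pure-Python sketch of AUROC using the rank-sum (Mann-Whitney U) identity. It is not the implementation behind `empdens.evaluation.Binary`; the function `auroc` and the toy densities and labels are invented for illustration. AUROC is the probability that a randomly chosen anomaly receives a higher anomaly score than a randomly chosen normal point, so it depends only on how the scores are ranked.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum identity; tied scores share their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the run of tied scores that starts at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one anomaly and one normal point")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Made-up density estimates and true labels (1 = anomaly).
toy_density = [0.90, 0.80, 0.15, 0.60, 0.20, 0.70, 0.10]
toy_labels = [0, 0, 0, 0, 1, 0, 1]
toy_scores = [1 - d for d in toy_density]  # low density -> high anomaly score
toy_auroc = auroc(toy_labels, toy_scores)
print(toy_auroc)
```

Because only the ranking matters, any strictly decreasing transform of the density (for example `-density` instead of `1 - density`) would give the same AUROC.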

As an additional reference point, let's see what an isolation forest does with this data:

In [6]:
isof = empdens.wrappers.SklearnIsolationForest()
isof.train(X)
metrics = empdens.evaluation.Binary(truth=eval_df.label.values, pred=1 - isof.predict(X))
metrics.AUROC()

0.7459886201991466