Anomaly Detection with Isolation Forest Algorithm
The goal of this project was to implement the Isolation Forest algorithm as defined in this paper from scratch.
- Anomalies are few and different. We want to explicity isolate anomalies rather than construct a profile of normal instances.
- With many iterations, we can start to see which data points are consistently different than others to isolate the anomalies.
- This algorithm has a linear time complexity with a low constant, and it requires less memory.
About the Algorithm
- For a dataset, we construct a binary tree with it by splitting at random variables and values.
- Anomalies are more susceptible to isolation, so they end up closer to the root of the tree. Normal points are isolated deeper in the tree.
- By creating an ensemble of these random trees, we can average the height of each instance. Anomalies will have a shorter average path length than normal points.
Improving the Algorithm
If the dataset is very large, the trees could grow wider and deeper which may make the anomalies not distinct from the normal points. This affects speed and increases noise sensitivity.
To improve the algorithm, here are some ideas:
- I chose to simply limit the number of instances that could be in a leaf node to prune each tree.
- The authors suggest computing and sorting by feature kurtosis then using a subset of those features during training.
- You can try using dask or Python 3's multiprocessing.
Code & Usage
- Kaggle credit card fraud competition data set
- Get cancer data into
cancer.csvby executing savecancer.csv
- http.zip; download, unzip to get
score.py file with
This script was given by Professor Terence Parr of USF, and it tests for True Positive Rate (TPR), False Positive Rate (FPR), and fit time across the three labeled datasets.
Visualizations of Anomaly Separation
|http.csv, 200 trees, 99% desired TPR|
|creditcard.csv, 200 trees, 80% desired TPR||creditcard.csv, 200 trees, 90% desired TPR|
|cancer, 300 trees, 70% desired TPR||cancer, 300 trees, 80% desired TPR|
Algorithms extracted from the Liu et al paper:
Please use this version of average path length c(), not the one in the original paper:
where "H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler’s constant)."
- MSDS689 Class Project given by Professor Terence Parr. Thank you for the materials!
- Authors of the Isolation Forest paper: Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou