Anomaly Detection with Isolation Forest Algorithm

The goal of this project was to implement the Isolation Forest algorithm, as defined in the Liu et al. paper (IsolationForestPaper.pdf), from scratch.

Main Idea

  • Anomalies are few and different. We want to explicitly isolate anomalies rather than construct a profile of normal instances.
  • Over many iterations, we can see which data points are consistently different from the others, and flag those as anomalies.
  • The algorithm has linear time complexity with a low constant and a low memory requirement.

About the Algorithm

  • Given a dataset, we build a binary tree by recursively splitting on a randomly chosen feature at a randomly chosen split value.
  • Anomalies are more susceptible to isolation, so they end up closer to the root of the tree, while normal points are isolated deeper in the tree.
  • By building an ensemble of these random trees and averaging each instance's path length over them, anomalies stand out with shorter average path lengths than normal points (see the sketch after this list).
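
Here is a minimal sketch of those two steps, assuming a NumPy 2-D array X; the names and signatures are illustrative, not the exact API of iforest.py:

```python
import numpy as np

class Node:
    def __init__(self, feature=None, split=None, left=None, right=None, size=0):
        self.feature, self.split = feature, split
        self.left, self.right = left, right
        self.size = size  # number of instances that reached this node

def build_tree(X, height=0, max_height=10):
    # Grow until the node cannot be split or the height limit is reached.
    if height >= max_height or len(X) <= 1:
        return Node(size=len(X))
    q = np.random.randint(X.shape[1])        # random feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                             # constant feature: make a leaf
        return Node(size=len(X))
    p = np.random.uniform(lo, hi)            # random split value
    mask = X[:, q] < p
    return Node(q, p,
                build_tree(X[mask], height + 1, max_height),
                build_tree(X[~mask], height + 1, max_height),
                size=len(X))

def path_length(x, node, height=0):
    # Anomalies reach a leaf after few splits, so their path is short.
    # (The full algorithm adds c(leaf size) for the unbuilt subtree;
    # see the Algorithm section below.)
    if node.feature is None:
        return height
    child = node.left if x[node.feature] < node.split else node.right
    return path_length(x, child, height + 1)
```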

Improving the Algorithm

If the dataset is very large, the trees grow wider and deeper, which can make the anomalies harder to distinguish from the normal points; it also slows fitting and increases sensitivity to noise.

To improve the algorithm, here are some ideas:

  • I chose to prune each tree by simply limiting the number of instances that can fall into a leaf node.
  • The authors suggest computing each feature's kurtosis, sorting by it, and training on only a subset of the highest-kurtosis features (see the sketch after this list).
  • Tree construction can be parallelized, for example with dask or Python 3's multiprocessing.
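
A hedged sketch of the kurtosis idea, using pandas (the helper name and the choice of k are illustrative):

```python
import pandas as pd

def top_kurtosis_features(df: pd.DataFrame, k: int = 5) -> list:
    # Heavy-tailed (high-kurtosis) features tend to expose anomalies, so
    # restrict training to the k features with the highest kurtosis.
    kurt = df.kurtosis(numeric_only=True)
    return kurt.sort_values(ascending=False).head(k).index.tolist()

# Usage: X = df[top_kurtosis_features(df, k=5)].values
```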

Code & Usage

Datasets

The three labeled datasets used for evaluation are http.csv, creditcard.csv, and cancer.

Requirements

  • numpy
  • pandas
  • sklearn

Usage

Run the score.py script with the -noise and -improved flags.

This script was given by Professor Terence Parr of USF, and it tests for True Positive Rate (TPR), False Positive Rate (FPR), and fit time across the three labeled datasets.
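
For reference, TPR and FPR can be computed from a confusion matrix as below (a sketch using sklearn; score.py's exact internals may differ):

```python
from sklearn.metrics import confusion_matrix

def tpr_fpr(y_true, y_pred):
    # Assumes the label encoding 0 = normal, 1 = anomaly.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)
```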

Visualizations of Anomaly Separation

  • http.csv, 200 trees, 99% desired TPR
  • creditcard.csv, 200 trees, 80% desired TPR
  • creditcard.csv, 200 trees, 90% desired TPR
  • cancer, 300 trees, 70% desired TPR
  • cancer, 300 trees, 80% desired TPR

Algorithm

The tree-building and scoring algorithms (iForest, iTree, and PathLength) are taken from the Liu et al. paper.

Please use this version of the average path length c(n), not the one in the original paper:

    c(n) = 2H(n-1) - 2(n-1)/n    if n > 2
    c(n) = 1                     if n = 2
    c(n) = 0                     otherwise

Then, finally, here's the scoring formula, where E(h(x)) is the average of h(x), the path length of instance x, over the trees, and n is the subsampling size:

    s(x, n) = 2^(-E(h(x)) / c(n))

where "H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler’s constant)."

Credits

  • MSDS689 Class Project given by Professor Terence Parr. Thank you for the materials!
  • Authors of the Isolation Forest paper: Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou