An isolation forest is an ensemble of a special kind of tree. At each node of a tree, a randomly chosen feature is split at a random value (for example, if the values of a feature range between 0 and 100, a random split might fall at 34). Splitting continues until every data point is isolated in its own branch. The whole process is repeated to build many trees, and the end result is a forest.

The intuition is that a non-outlier data point requires many splits to isolate because it is very similar to other data points. On the flip side, an outlier data point requires few splits to isolate because it is very dissimilar to the other data points.
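
To make the intuition concrete, here is a minimal sketch that counts how many random splits it takes to isolate a single point. It is not scikit-learn's implementation (which, among other things, grows each tree on a subsample and caps its depth); the helper `splits_to_isolate`, the toy data `X_demo`, and the seed are all assumptions made for illustration.

```python
import numpy as np


def splits_to_isolate(X, target_index, rng):
    """Count the random axis-aligned splits needed to isolate one row of X."""
    target = X[target_index]
    remaining = X
    n_splits = 0
    while len(remaining) > 1:
        # Only features that still vary among the remaining points can be split
        spans = remaining.max(axis=0) - remaining.min(axis=0)
        candidates = np.flatnonzero(spans > 0)
        if candidates.size == 0:
            break  # all remaining points are identical
        feature = rng.choice(candidates)
        threshold = rng.uniform(remaining[:, feature].min(),
                                remaining[:, feature].max())
        # Keep only the side of the split that contains the target point
        if target[feature] < threshold:
            remaining = remaining[remaining[:, feature] < threshold]
        else:
            remaining = remaining[remaining[:, feature] >= threshold]
        n_splits += 1
    return n_splits


rng = np.random.default_rng(0)

# Toy data (an assumption): one tight cluster plus a single far-away point
cluster = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
X_demo = np.vstack([[10.0, 10.0], cluster])  # row 0 is the outlier

# Use the cluster point closest to the origin as a typical inlier
inlier_index = 1 + np.argmin(np.linalg.norm(cluster, axis=1))

# Average over many random "trees" to smooth out the randomness
n_trials = 200
outlier_mean = np.mean([splits_to_isolate(X_demo, 0, rng)
                        for _ in range(n_trials)])
inlier_mean = np.mean([splits_to_isolate(X_demo, inlier_index, rng)
                       for _ in range(n_trials)])

print(f"average splits to isolate the outlier: {outlier_mean:.1f}")
print(f"average splits to isolate the inlier:  {inlier_mean:.1f}")
```

Averaged over many random trees, the far-away point should need noticeably fewer splits than the point in the middle of the cluster. That gap is the signal an isolation forest turns into an anomaly score.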

## Preliminaries

In [1]:
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
import numpy as np

## Create Data

`make_blobs` with a single center will create a single cluster of data.

In [2]:
# Make the features (X) with 300 samples,
X, _ = make_blobs(n_samples = 300,
                  # two feature variables,
                  n_features = 2,
                  # one cluster,
                  centers = 1,
                  # with .5 cluster standard deviation,
                  cluster_std = 0.5,
                  # shuffled,
                  shuffle = True)

# View the two features
X

array([[4.73967471, 7.72630304],
       [3.37828574, 8.01554621],
       [3.51365219, 8.78544516],
       [4.85330559, 9.33483891],
       [3.73981318, 8.52568663],
       [2.73191512, 8.71551403],
       [4.51654455, 8.99794452],
       [3.75681895, 9.79288515],
       [3.72099469, 8.21238069],
       [4.13375696, 8.82387981],
       [4.23132321, 9.13911734],
       [3.63499843, 7.52097065],
       [4.65571557, 8.81447377],
       [2.75505626, 7.89688485],
       [3.48769846, 8.39237723],
       [3.48109108, 7.84481044],
       [3.50481859, 8.96866248],
       [3.26077879, 8.85692956],
       [3.85754571, 8.98532986],
       [3.77492493, 8.78112782],
       [3.91109002, 8.26971091],
       [4.53992567, 8.34021788],
       [3.57704548, 8.60060707],
       [4.44272122, 8.21874564],
       [2.39049763, 8.63694227],
       [2.98074699, 7.56211626],
       [3.61656443, 8.7923086 ],
       [3.20115279, 8.3506277 ],
       [3.49554796, 8.87866493],
       [3.43283   , 8.10576015],
       [2.

## Add Outlier To Data

In [3]:
# Create an outlier data point
outlier = [[100,100]]

# Concatenate the outlier with the original data
X_with_outlier = np.concatenate((outlier, X))

## Train Isolation Forest

In [4]:
# Setup the isolation forest
clf = IsolationForest(# Randomly sample observations for each tree with replacement
                      bootstrap=True, 
                      # Number of trees
                      n_estimators=100,
                      # Unnecessary but added to avoid a deprecation warning
                      contamination='auto', 
                      # Unnecessary but added to avoid a deprecation warning
                      # (later scikit-learn releases removed this parameter, so omit it there)
                      behaviour='new')

# Train the isolation forest
clf.fit(X_with_outlier)

IsolationForest(behaviour='new', bootstrap=True, contamination='auto',
        max_features=1.0, max_samples='auto', n_estimators=100,
        n_jobs=None, random_state=None, verbose=0)

## Predict If New Data Point Is Outlier

In [5]:
# Create a new data point that is an outlier
new_data_point = [[200,20]]

In [6]:
# Predict if the new data point is an outlier (1 is not an outlier, -1 is an outlier)
clf.predict(new_data_point)

array([-1])

In [7]:
# Display the anomaly score for the new data point (lower is more abnormal)
clf.score_samples(new_data_point)

array([-0.7477598])
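
The label from `predict` and the score from `score_samples` are linked through `decision_function`, which shifts the raw score by the fitted `offset_` (this is `-0.5` when `contamination='auto'`). The sketch below checks that relationship on a rebuilt data set. It assumes a recent scikit-learn release, so `behaviour` is omitted, and the `random_state` values, `X_demo`, and `forest` are names and settings added here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# Rebuild a data set like the one above; random_state is an added assumption
X_demo, _ = make_blobs(n_samples=300, n_features=2, centers=1,
                       cluster_std=0.5, random_state=0)
X_demo = np.concatenate(([[100, 100]], X_demo))

# Same settings as above, minus `behaviour`, plus a fixed seed
forest = IsolationForest(n_estimators=100, bootstrap=True,
                         contamination='auto', random_state=0)
forest.fit(X_demo)

# A far-away point and one of the training points from the cluster
points = np.vstack([[200, 20], X_demo[1]])

raw_scores = forest.score_samples(points)    # lower is more abnormal
shifted = forest.decision_function(points)   # raw score minus offset_
labels = forest.predict(points)              # -1 where shifted < 0, else 1

print("offset_:           ", forest.offset_)
print("score_samples:     ", raw_scores)
print("decision_function: ", shifted)
print("predict:           ", labels)
```

Reading the three outputs side by side shows why a raw score below `-0.5` is labeled `-1` under the `'auto'` setting: the shifted score drops below zero.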