# Isolation Forests

Here we use isolation forests on contaminated and uncontaminated training data.

Findings:

1) The isolation forest performs well on contaminated data with 5% outliers.
2) The model continues to perform well when trained on uncontaminated data, with lower nominal contamination rates giving better results, e.g. rates of 0.005 and 0.001 gave a higher outlier precision (0.96) than 0.01 (0.68).
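
One way to see why a lower nominal rate helps on clean training data: as we understand PyOD's behaviour, the detector places its threshold at the (1 - contamination) percentile of the training scores, so roughly a fraction `contamination` of the training points is always flagged, even when none of them are true outliers. The sketch below illustrates this with NumPy alone. The scores are synthetic stand-ins (an assumption, not values from this notebook) and `threshold_for` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic stand-in for anomaly scores of 475 clean training points
# (higher = more anomalous). These are NOT scores from the notebook.
train_scores = rng.normal(size=475)

def threshold_for(scores, contamination):
    """Hypothetical helper: cut-off at the (1 - contamination) percentile."""
    return np.percentile(scores, 100 * (1 - contamination))

thresholds, flagged_counts = [], []
for c in [0.01, 0.005, 0.001]:
    thr = threshold_for(train_scores, c)
    n_flagged = int((train_scores > thr).sum())  # clean points above the cut-off
    thresholds.append(thr)
    flagged_counts.append(n_flagged)
    print(f"c = {c}: threshold = {thr:.3f}, clean points flagged = {n_flagged}")
```

As the rate shrinks, the threshold moves further into the tail of the inlier scores, so fewer inliers end up above it. This fits the higher outlier precision seen at the lower rates.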

The data used is identical to the previous notebooks. Please see notebook 00 for more detailed plotting of the data.

Work is based on: https://towardsdatascience.com/use-the-isolated-forest-with-pyod-3818eea68f08

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data
contamination = 0.05 # proportion of outliers (0.05 = 5%)
n_train = 500       # number of training points
n_test = 500        # number of testing points
n_features = 6      # number of features
X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train, 
    n_test=n_test, 
    n_features= n_features, 
    contamination=contamination, 
    random_state=123)

# show the first 5 rows of the training set
X_train_pd = pd.DataFrame(X_train)
X_train_pd.head()

          0         1         2         3         4         5
0  2.39609   2.092611  2.073392  1.988262  1.953473  2.450997
1  1.63104   1.746182  1.89805   2.380148  1.967332  1.858916
2  1.824683  2.131412  2.028829  1.703454  2.502966  2.119108
3  2.106098  2.165173  2.340826  2.170109  1.749139  1.678661
4  1.829647  1.775596  1.829438  2.054768  1.57719   1.594549


In [12]:

from pyod.models.iforest import IForest
isft = IForest(contamination=0.05, max_samples=40, behaviour='new', random_state=123) 
isft.fit(X_train)

# Training data
y_train_scores = isft.decision_function(X_train)
y_train_pred = isft.predict(X_train)

# Test data
y_test_scores = isft.decision_function(X_test)
y_test_pred = isft.predict(X_test) # outlier labels (0 or 1)

# Threshold for the defined contamination rate
print("The threshold for the defined contamination rate:", isft.threshold_)

from sklearn.metrics import classification_report
print('train metrics:')
print(classification_report(y_train, y_train_pred))
print('test metrics:')
print(classification_report(y_test, y_test_pred))

The threshold for the defined contamination rate: -4.292573241304609e-15
train metrics:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       475
         1.0       1.00      1.00      1.00        25

    accuracy                           1.00       500
   macro avg       1.00      1.00      1.00       500
weighted avg       1.00      1.00      1.00       500

test metrics:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       475
         1.0       0.93      1.00      0.96        25

    accuracy                           1.00       500
   macro avg       0.96      1.00      0.98       500
weighted avg       1.00      1.00      1.00       500
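
The rounded report above is enough to recover the underlying counts. As a small worked example (our own arithmetic, not output from the notebook): an outlier recall of 1.00 with support 25 means all 25 outliers were caught, and a precision of 0.93 then pins down the number of false positives.

```python
# Worked arithmetic: infer false positives from the rounded test report.
# Assumption: recall of 1.00 on 25 outliers means exactly 25 true positives.
true_positives = 25
reported_precision = 0.93

# Try small false-positive counts and keep those whose precision
# rounds to the reported value.
candidates = [
    fp for fp in range(0, 10)
    if round(true_positives / (true_positives + fp), 2) == reported_precision
]
print("candidate false-positive counts:", candidates)

fp = candidates[0]
precision = true_positives / (true_positives + fp)
recall = 1.0
f1 = 2 * precision * recall / (precision + recall)
print(f"false positives = {fp}, precision = {precision:.4f}, f1 = {f1:.2f}")
```

Only one count fits, and the f1-score it implies agrees with the value in the report, so the gap between train and test performance comes down to a handful of inliers out of 475 being flagged.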



In [17]:
# what if we don't have outliers in the train set
X_train_inliers = X_train[y_train == 0]
y_train_inliers = y_train[y_train == 0]

# the training data is uncontaminated, but the model still requires a contamination rate, so let's try a few values

contam_vals = [0.01, 0.005, 0.001] # 1/100, 1/200, 1/1000

for c in contam_vals:
    isft = IForest(contamination=c, max_samples=40, behaviour='new', random_state=123) 
    isft.fit(X_train_inliers)

    # Test data
    y_test_scores = isft.decision_function(X_test)
    y_test_pred = isft.predict(X_test) # outlier labels (0 or 1)

    # Test metrics for this contamination rate
    print(f'test metrics for uncontaminated data with c = {c}:')
    print(classification_report(y_test, y_test_pred))

test metrics for uncontaminated data with c = 0.01:
              precision    recall  f1-score   support

         0.0       1.00      0.97      0.99       475
         1.0       0.68      1.00      0.81        25

    accuracy                           0.98       500
   macro avg       0.84      0.99      0.90       500
weighted avg       0.98      0.98      0.98       500

test metrics for uncontaminated data with c = 0.005:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       475
         1.0       0.96      1.00      0.98        25

    accuracy                           1.00       500
   macro avg       0.98      1.00      0.99       500
weighted avg       1.00      1.00      1.00       500

test metrics for uncontaminated data with c = 0.001:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       475
         1.0       0.96      1.00      0.98        25

    accuracy            