# GMM Clustering Outlier Detection

Here we use the probability of a point belonging to a fitted mixture component to locate anomalies: points with low likelihood under the model are flagged as outliers.
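
As a minimal sketch of this idea, assuming scikit-learn's `GaussianMixture` as a stand-in for the detector used below, we can fit a mixture and use the negative log-likelihood of each point as its anomaly score. The toy data and variable names here are hypothetical and only illustrate the scoring step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical toy data: a tight inlier cloud plus a few far-away points.
inliers = rng.normal(loc=2.0, scale=0.3, size=(200, 2))
outliers = rng.uniform(low=-4.0, high=-2.0, size=(5, 2))
X = np.vstack([inliers, outliers])

# Fit the mixture on the inliers only, then score every point.
gmm = GaussianMixture(n_components=2, random_state=0).fit(inliers)

# score_samples returns the log-likelihood of each point under the mixture;
# negating it gives an anomaly score where larger means more unusual.
anomaly_scores = -gmm.score_samples(X)

print("mean inlier score: ", anomaly_scores[:200].mean())
print("mean outlier score:", anomaly_scores[200:].mean())
```

Points far from every component get a very low likelihood and therefore a high score, which is what the threshold later acts on.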

The model requires two parameters: the number of components and the contamination rate.
Ensembling models with varying numbers of components can improve stability.
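
The cells in this notebook fit a single model with `n_components=4`, so the following is only a sketch of what such an ensemble could look like. It assumes scikit-learn's `GaussianMixture`; the helper `ensemble_gmm_scores`, the component counts and the toy data are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_gmm_scores(X_fit, X_score, component_counts=(2, 3, 4, 5), seed=0):
    """Average standardised GMM anomaly scores over several component counts."""
    all_scores = []
    for k in component_counts:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X_fit)
        fit_scores = -gmm.score_samples(X_fit)   # scores on the fitting data
        scores = -gmm.score_samples(X_score)     # scores on the data to rank
        # Standardise with the fitting-data statistics so that each model
        # contributes on a comparable scale before averaging.
        all_scores.append((scores - fit_scores.mean()) / fit_scores.std())
    return np.mean(all_scores, axis=0)

# Hypothetical toy data: 50 inlier-like points followed by 5 distant points.
rng = np.random.default_rng(1)
X_fit = rng.normal(2.0, 0.3, size=(300, 3))
X_new = np.vstack([
    rng.normal(2.0, 0.3, size=(50, 3)),
    rng.uniform(-4.0, -2.0, size=(5, 3)),
])

ensemble_scores = ensemble_gmm_scores(X_fit, X_new)
print("largest inlier score: ", ensemble_scores[:50].max())
print("smallest outlier score:", ensemble_scores[-5:].min())
```

Averaging over several component counts reduces the dependence on any single choice of `n_components`, at the cost of fitting several models.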

The model performs strongly on the contaminated data set, with outlier-class F1 scores of 1.00 on the training set and 0.98 on the test set.

The contamination rate must be set even when the training data is uncontaminated. A few small values are tried for the uncontaminated training data; of these, 0.005 gives the best outlier-class F1 score (0.83). In this setting the model performed quite poorly compared to previous models here, with F1 scores of around 80%.
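
The outlier-class F1 scores for the uncontaminated runs can be checked from the precision and recall printed in the classification reports further down. This is a small sketch of that arithmetic; the helper name `f1_from` is ours, and the inputs are the rounded values from the reports, so the last digit may differ from the printed F1.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for the outlier class, as printed in the reports below.
reported = {0.01: (0.66, 1.00), 0.005: (0.71, 1.00), 0.001: (0.58, 1.00)}

for c, (p, r) in reported.items():
    print(f"c = {c}: F1 is roughly {f1_from(p, r):.2f}")
```

Recall is perfect in all three runs, so the drop in F1 comes entirely from precision: the model flags every true outlier but also a number of inliers.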

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data
contamination = 0.05  # fraction of outliers
n_train = 500         # number of training points
n_test = 500          # number of testing points
n_features = 6        # number of features
X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train,
    n_test=n_test,
    n_features=n_features,
    contamination=contamination,
    random_state=123)

# show the first 5 rows of the training set
X_train_pd = pd.DataFrame(X_train)
X_train_pd.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2.39609 | 2.092611 | 2.073392 | 1.988262 | 1.953473 | 2.450997 |
| 1 | 1.63104 | 1.746182 | 1.89805 | 2.380148 | 1.967332 | 1.858916 |
| 2 | 1.824683 | 2.131412 | 2.028829 | 1.703454 | 2.502966 | 2.119108 |
| 3 | 2.106098 | 2.165173 | 2.340826 | 2.170109 | 1.749139 | 1.678661 |
| 4 | 1.829647 | 1.775596 | 1.829438 | 2.054768 | 1.57719 | 1.594549 |


In [4]:

from pyod.models.gmm import GMM
model = GMM(n_components=4, contamination=contamination)  # reuse the rate used to generate the data
model.fit(X_train)

# Training data
y_train_scores = model.decision_function(X_train)
y_train_pred = model.predict(X_train)

# Test data
y_test_scores = model.decision_function(X_test)
y_test_pred = model.predict(X_test) # outlier labels (0 or 1)

# Threshold for the defined contamination rate
print("The threshold for the defined contamination rate:", model.threshold_)

from sklearn.metrics import classification_report
print('train metrics:')
print(classification_report(y_train, y_train_pred))
print('test metrics:')
print(classification_report(y_test, y_test_pred))

The threshold for the defined contamination rate: 7.097087758314963
train metrics:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       475
         1.0       1.00      1.00      1.00        25

    accuracy                           1.00       500
   macro avg       1.00      1.00      1.00       500
weighted avg       1.00      1.00      1.00       500

test metrics:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       475
         1.0       0.96      1.00      0.98        25

    accuracy                           1.00       500
   macro avg       0.98      1.00      0.99       500
weighted avg       1.00      1.00      1.00       500
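
The threshold printed above is tied to the contamination rate. As a sketch, assuming the detector sets its threshold at the `100 * (1 - contamination)` percentile of the training scores (our understanding of how pyod detectors behave), the link can be reproduced with NumPy alone. The scores below are random stand-ins, not the model's real scores, and the helper name is hypothetical.

```python
import numpy as np

def threshold_from_contamination(train_scores, contamination):
    """Score above which a fraction `contamination` of training points falls."""
    return np.percentile(train_scores, 100 * (1 - contamination))

# Hypothetical stand-in for the training anomaly scores (500 points).
rng = np.random.default_rng(42)
train_scores = rng.normal(size=500)

contamination = 0.05
threshold = threshold_from_contamination(train_scores, contamination)

# Points scoring strictly above the threshold are labelled as outliers (1).
labels = (train_scores > threshold).astype(int)

print("threshold:", threshold)
print("flagged:", labels.sum(), "of", labels.size)
```

Under this assumption the model always flags roughly `contamination * n_train` training points, whether or not they are true outliers, which is why the rate has to be chosen with care when the training data is clean.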



In [6]:
# what if we don't have outliers in the train set
X_train_inliers = X_train[y_train == 0]
y_train_inliers = y_train[y_train == 0]

# the training data is uncontaminated, but the model still requires a contamination rate, so let's try a few values

contam_vals = [0.01, 0.005, 0.001] # 1/100, 1/200, 1/1000

for c in contam_vals:
    model = GMM(n_components=4, contamination=c)
    model.fit(X_train_inliers)

    # Test data
    y_test_scores = model.decision_function(X_test)
    y_test_pred = model.predict(X_test) # outlier labels (0 or 1)

    # Report test metrics for this contamination rate
    print(f'test metrics for uncontaminated data with c = {c}:')
    print(classification_report(y_test, y_test_pred))

test metrics for uncontaminated data with c = 0.01:
              precision    recall  f1-score   support

         0.0       1.00      0.97      0.99       475
         1.0       0.66      1.00      0.79        25

    accuracy                           0.97       500
   macro avg       0.83      0.99      0.89       500
weighted avg       0.98      0.97      0.98       500

test metrics for uncontaminated data with c = 0.005:
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99       475
         1.0       0.71      1.00      0.83        25

    accuracy                           0.98       500
   macro avg       0.86      0.99      0.91       500
weighted avg       0.99      0.98      0.98       500

test metrics for uncontaminated data with c = 0.001:
              precision    recall  f1-score   support

         0.0       1.00      0.96      0.98       475
         1.0       0.58      1.00      0.74        25

    accuracy            