## Special Topics in Machine Learning: Project Assignment

* After you run the code in Section 1, the data will be stored in the `df` variable.
* The dataset was collected from sensors attached to a vessel's main engine (for confidentiality, every column name and sensor value has been normalized).
* The objective is to train an anomaly detector using Isolation Forest.
* The dataset includes a label (column `class`), where 0 means a normal data point and 1 means an anomalous data point.
* You will need to preprocess the data, split it into training and testing sets, and then apply the Isolation Forest algorithm to detect anomalies.
* Finally, evaluate the performance of your model using appropriate metrics and visualize the results to understand the model's effectiveness.
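
For intuition about why Isolation Forest suits this task, here is a deliberately simplified, pure-Python sketch of the core idea: points that can be separated from the rest with only a few random splits are likely anomalies. The helper names (`isolation_depth`, `mean_depth`) and the synthetic data are illustrative assumptions; this is not PyOD's implementation, which also subsamples the data and normalizes the path lengths into scores.

```python
import random

def isolation_depth(point, data, depth=0, max_depth=12):
    """Count the random splits needed to separate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(point))        # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                              # no split possible on this feature
        return depth
    split = random.uniform(lo, hi)            # pick a random threshold
    # Keep only the rows that fall on the same side of the split as `point`
    same_side = [row for row in data
                 if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, depth + 1, max_depth)

def mean_depth(point, data, n_trees=100):
    """Average isolation depth over many random trees."""
    return sum(isolation_depth(point, data) for _ in range(n_trees)) / n_trees

random.seed(0)
# Synthetic stand-in for normal sensor readings, plus one far-away point
normal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = normal + [outlier]

# Compare the outlier against the most central normal point
center = min(normal, key=lambda p: p[0] ** 2 + p[1] ** 2)
inlier_depth = mean_depth(center, data)
outlier_depth = mean_depth(outlier, data)
print(f"central point: {inlier_depth:.2f} splits, outlier: {outlier_depth:.2f} splits")
```

The outlier should need noticeably fewer splits on average, which is the signal an Isolation Forest turns into an anomaly score.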

## 1. Importing dataset and module

In [2]:
# You must run this cell to do the assignment
# Running it installs PyOD, an anomaly detection module whose interface mirrors scikit-learn's
!pip install pyod

Collecting pyod
  Downloading pyod-2.0.0.tar.gz (164 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyod
  Building wheel for pyod (setup.py) ... done
  Created wheel for pyod: filename=pyod-2.0.0-py3-none-any.whl size=196324 sha256=9c6d7baf59d1d3ffd3e36999bcfcc8aa4fd2e1329f196b0a798fd56f41d5a895
  Stored in directory: /root/.cache/pip/wheels/15/0e/91/96b270e6741d4eece88727489411330226ff47ac1cb9ea0097
Successfully built pyod
Installing collected packages: pyod
Successfully installed pyod-2.0.0


In [3]:
# Step 0: Import necessary modules
import numpy as np
import pandas as pd
from pyod.models.iforest import IForest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

# Step 1: Load the dataset from GitHub and keep only the columns used in this assignment
df = pd.read_csv('https://raw.githubusercontent.com/ralbu85/STML/main/assignment_data.csv.csv')
df = df[['Dim_0','Dim_16','Dim_17','Dim_18','Dim_19','Dim_20','class']]

In [None]:
# Display the dataset to inspect its shape and contents
df

## Problem 1
* Complete the following cell to calculate the number of normal and abnormal data points.
* Save the number of normal data points in the variable 'n_normal' and the number of abnormal data points in the variable 'n_anormal'.

In [5]:
## Complete the following code in the cell
n_normal = df[df['class'] == 0].shape[0]
n_anormal = df[df['class'] == 1].shape[0]

In [6]:
## Answer check (if your answer is correct, running this cell prints the following)
print(f"Number of normal data points: {n_normal}")
print(f"Number of abnormal data points: {n_anormal}")

Number of normal data points: 6666
Number of abnormal data points: 534
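
As a quick sanity check on these counts, the overall share of anomalies can be worked out directly. This matters later because the `contamination` hyperparameter is the assumed proportion of outliers, and PyOD only accepts values in the range (0, 0.5]. The variable name `anomaly_fraction` is introduced here for illustration only.

```python
# Counts printed by the answer-check cell above
n_normal, n_anormal = 6666, 534

# Share of anomalous rows in the full dataset
anomaly_fraction = n_anormal / (n_normal + n_anormal)
print(f"Anomaly fraction: {anomaly_fraction:.4f}")  # about 0.0742
```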


## Problem 2
* Next, we split the dataset into training, validation, and testing sets.
* First, 75% of the normal data is used for training.
* Second, the remaining 25% of the normal data and all of the abnormal data are split evenly between the validation and testing sets.
* Complete the following cell to split the dataset.

In [7]:
## You don't need to write any code here, but you must run this cell

## Split the data into normal and abnormal rows
df_normal = df[df['class']==0]
df_anormal = df[df['class']==1]

## Separate features and labels for the normal data
y_normal = df_normal['class']
X_normal = df_normal.drop(columns=['class'])

## Separate features and labels for the abnormal data
y_anormal = df_anormal['class']
X_anormal = df_anormal.drop(columns=['class'])

In [8]:
## Complete the following code in the cell

# Split normal data into training (75%) and the remaining (25%)
X_normal_train, X_normal_temp, y_normal_train, y_normal_temp = train_test_split(X_normal, y_normal , test_size=0.25, random_state=42) # Complete the code
X_normal_val, X_normal_test, y_normal_val, y_normal_test = train_test_split(X_normal_temp , y_normal_temp, test_size=0.5, random_state=42) # Complete the code

# Use all the abnormal data for validation and testing
X_anormal_val, X_anormal_test, y_anormal_val, y_anormal_test = train_test_split(X_anormal, y_anormal , test_size=0.5, random_state=42) # Complete the code

# Combine normal and abnormal data for validation and testing
X_val = np.vstack((X_normal_val, X_anormal_val))
y_val = np.hstack((y_normal_val, y_anormal_val))

X_test = np.vstack((X_normal_test, X_anormal_test))
y_test = np.hstack((y_normal_test, y_anormal_test))

In [9]:
## Answer check (if your answer is correct, running this cell prints the following)
print(f"Shape of X_train: {X_normal_train.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of X_test: {X_test.shape}")

Shape of X_train: (4999, 6)
Shape of X_val: (1100, 6)
Shape of X_test: (1101, 6)
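
The slightly uneven sizes (1100 versus 1101) follow from how `train_test_split` rounds a fractional `test_size`: the test part gets the ceiling of `test_size * n_samples` and the other part gets the remainder. The sketch below reproduces the arithmetic with the standard library only; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (first_part, second_part) sizes, rounding the second part up."""
    n_second = math.ceil(test_fraction * n_samples)
    return n_samples - n_second, n_second

# 6666 normal rows: 75% for training, 25% held out
n_train, n_temp = split_sizes(6666, 0.25)
# Held-out normal rows split evenly into validation and testing
n_normal_val, n_normal_test = split_sizes(n_temp, 0.5)
# 534 abnormal rows split evenly into validation and testing
n_anormal_val, n_anormal_test = split_sizes(534, 0.5)

n_val = n_normal_val + n_anormal_val
n_test = n_normal_test + n_anormal_test
print(n_train, n_val, n_test)  # 4999 1100 1101
```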


## Problem 3.
* Next, we tune the hyperparameters of Isolation Forest and select the best model using the validation dataset.
* Note that the IForest model uses the same API as scikit-learn.
* However, we cannot run GridSearchCV directly in this setting.
* So we iterate over the entire hyperparameter space manually.

In [12]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_samples': ['auto'],
    'contamination': [0.1, 0.2, 0.3, 0.5],
    'max_features': [1.0, 0.5, 0.8]
}

# Function to train and evaluate Isolation Forest with the given hyperparameters
# We use the F1 score to select the best hyperparameters
def train_and_evaluate(params):
    iso_forest = IForest(
        n_estimators=params['n_estimators'],
        max_samples=params['max_samples'],
        contamination=params['contamination'],
        max_features=params['max_features'],
        random_state=42
    )
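    # Note: the model is fit on X_val itself here, whereas Problem 4 fits on X_normal_train
    iso_forest.fit(X_val)
    y_val_pred = iso_forest.predict(X_val)
    # PyOD's predict already returns 0 (inlier) / 1 (outlier); this only enforces binary labels
    y_val_pred = [1 if x == 1 else 0 for x in y_val_pred]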

    metric = f1_score(y_val, y_val_pred)
    return metric

# Perform manual hyperparameter tuning
best_params = None
best_metric = -np.inf

for n_estimators in param_grid['n_estimators']:
    for max_samples in param_grid['max_samples']:
        for contamination in param_grid['contamination']:
            for max_features in param_grid['max_features']:
                params = {
                    'n_estimators': n_estimators,
                    'max_samples': max_samples,
                    'contamination': contamination,
                    'max_features': max_features
                }
                metric = train_and_evaluate(params)
                if metric > best_metric:
                    best_metric = metric
                    best_params = params

print(f'Best Hyperparameters: {best_params}')
print(f'Best Validation F1-Score: {best_metric}')

Best Hyperparameters: {'n_estimators': 100, 'max_samples': 'auto', 'contamination': 0.5, 'max_features': 0.8}
Best Validation F1-Score: 0.5263157894736842
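
The four nested loops above can also be written more compactly with `itertools.product`, which yields every combination in the same order. This is a minimal sketch using only the standard library; `iter_param_grid` is a hypothetical helper name, and scikit-learn's `ParameterGrid` offers similar functionality if you prefer a library class.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_samples': ['auto'],
    'contamination': [0.1, 0.2, 0.3, 0.5],
    'max_features': [1.0, 0.5, 0.8],
}

def iter_param_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

combos = list(iter_param_grid(param_grid))
print(len(combos))  # 3 * 1 * 4 * 3 = 36 combinations
print(combos[0])
```

Each yielded dict has the same shape as `params` in the loop above, so it could be passed to `train_and_evaluate` unchanged.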


## Problem 4.
* The best parameters are now stored in 'best_params'.
* You will now retrain the Isolation Forest model with the best hyperparameters on the training data and evaluate it on the TEST dataset.
* We will also check the performance of our best model using various metrics.

In [15]:
# Train the best model on the training data
best_iso_forest = IForest(
    n_estimators=best_params['n_estimators'],
    max_samples=best_params['max_samples'],
    contamination=best_params['contamination'],
    max_features=best_params['max_features'],
    random_state=42
)
best_iso_forest.fit(X_normal_train)

# Predicting anomalies on testing set
y_test_pred = best_iso_forest.predict(X_test)

# Evaluating on testing set
precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)
roc_auc_test = roc_auc_score(y_test, y_test_pred)

print(f'Test Precision: {precision_test}')
print(f'Test Recall: {recall_test}')
print(f'Test F1-Score: {f1_test}')
print(f'Test AUC-ROC: {roc_auc_test}')

Test Precision: 0.38088445078459343
Test Recall: 1.0
Test F1-Score: 0.5516528925619835
Test AUC-ROC: 0.7398081534772183
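
Note that the AUC-ROC above is computed from thresholded 0/1 predictions, so the ROC curve has only a single operating point. AUC is usually computed from continuous anomaly scores, which preserve the ranking of the samples. The toy sketch below illustrates the difference with a hand-written pairwise AUC; the function `pairwise_auc` and the data are illustrative assumptions, not part of the assignment.

```python
def pairwise_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: the scores rank both anomalies above every normal point
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.7, 0.8, 0.9]
# Thresholding at 0.5 turns two normal points into false alarms
labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_scores)  # 1.0
print(auc_from_labels)  # 0.75
```

In this notebook, the score-based variant would presumably pass the output of `best_iso_forest.decision_function(X_test)` to `roc_auc_score` instead of `y_test_pred`; that variant was not run here, so no result is reported for it.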


## Problem 5.
* We also want to examine the confusion matrix of our result.
* Try to extract the True Positive, False Positive, True Negative, and False Negative counts.
* You will need to look up the confusion_matrix API of scikit-learn.

In [16]:
from sklearn.metrics import confusion_matrix
# Rows are true labels and columns are predicted labels (0 = normal, 1 = anomaly)
cm = confusion_matrix(y_test, y_test_pred)

tp = cm[1, 1]
fp = cm[0, 1]
tn = cm[0, 0]
fn = cm[1, 0]


In [17]:
## Answer check (if your answer is correct, running this cell prints the following)
print(f"True Positive: {tp}")
print(f"False Positive: {fp}")
print(f"True Negative: {tn}")
print(f"False Negative: {fn}")

True Positive: 267
False Positive: 434
True Negative: 400
False Negative: 0
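
These four counts are enough to reproduce the metrics reported in Problem 4 by hand, which is a useful consistency check. The sketch below uses the standard formulas. Because the AUC-ROC in Problem 4 was computed from hard 0/1 predictions, it coincides with the mean of the true positive rate and the true negative rate.

```python
# Counts from the confusion matrix above
tp, fp, tn, fn = 267, 434, 400, 0

precision = tp / (tp + fp)          # share of flagged points that are real anomalies
recall = tp / (tp + fn)             # share of real anomalies that were flagged
f1 = 2 * precision * recall / (precision + recall)
true_negative_rate = tn / (tn + fp)
auc_hard_labels = 0.5 * (recall + true_negative_rate)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(auc_hard_labels, 4))
# 0.3809 1.0 0.5517 0.7398
```

The perfect recall comes at the cost of 434 false alarms: more than half of the 834 normal test points are flagged as anomalies.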


## Problem 6 (Extra Credit)
I am using the CBLOF (Cluster-Based Local Outlier Factor) algorithm from the PyOD library for anomaly detection.


In [21]:
from pyod.models.cblof import CBLOF

In [22]:
# Hyperparameter grid for CBLOF
param_grid_cblof = {
    'n_clusters': [5, 10, 15],
    'contamination': [0.1, 0.2, 0.3, 0.5]
}

In [23]:
# Function to train and evaluate CBLOF
def train_and_evaluate_cblof(params):
    cblof = CBLOF(
        n_clusters=params['n_clusters'],
        contamination=params['contamination']
    )
    cblof.fit(X_val)
    y_val_pred = cblof.predict(X_val)
    y_val_pred = [1 if x == 1 else 0 for x in y_val_pred]
    metric = f1_score(y_val, y_val_pred)
    return metric

In [24]:
# Perform manual hyperparameter tuning for CBLOF
best_params_cblof = None
best_metric_cblof = -np.inf

for n_clusters in param_grid_cblof['n_clusters']:
    for contamination in param_grid_cblof['contamination']:
        params = {
            'n_clusters': n_clusters,
            'contamination': contamination
        }
        metric = train_and_evaluate_cblof(params)
        if metric > best_metric_cblof:
            best_metric_cblof = metric
            best_params_cblof = params

print(f'Best Hyperparameters for CBLOF: {best_params_cblof}')
print(f'Best Validation F1-Score for CBLOF: {best_metric_cblof}')



Best Hyperparameters for CBLOF: {'n_clusters': 10, 'contamination': 0.3}
Best Validation F1-Score for CBLOF: 0.4489112227805695


In [25]:
# Train and evaluate the best CBLOF model on the test data
best_cblof = CBLOF(
    n_clusters=best_params_cblof['n_clusters'],
    contamination=best_params_cblof['contamination']
)
best_cblof.fit(X_normal_train)

y_test_pred_cblof = best_cblof.predict(X_test)
y_test_pred_cblof = [1 if x == 1 else 0 for x in y_test_pred_cblof]

precision_test_cblof = precision_score(y_test, y_test_pred_cblof)
recall_test_cblof = recall_score(y_test, y_test_pred_cblof)
f1_test_cblof = f1_score(y_test, y_test_pred_cblof)
roc_auc_test_cblof = roc_auc_score(y_test, y_test_pred_cblof)

print(f'CBLOF Test Precision: {precision_test_cblof}')
print(f'CBLOF Test Recall: {recall_test_cblof}')
print(f'CBLOF Test F1-Score: {f1_test_cblof}')
print(f'CBLOF Test AUC-ROC: {roc_auc_test_cblof}')



CBLOF Test Precision: 0.34615384615384615
CBLOF Test Recall: 0.5393258426966292
CBLOF Test F1-Score: 0.4216691068814056
CBLOF Test AUC-ROC: 0.6065933769838062


In [26]:
# Confusion matrix for CBLOF
cm_cblof = confusion_matrix(y_test, y_test_pred_cblof)
tp_cblof = cm_cblof[1, 1]
fp_cblof = cm_cblof[0, 1]
tn_cblof = cm_cblof[0, 0]
fn_cblof = cm_cblof[1, 0]

print(f'CBLOF True Positives (TP): {tp_cblof}')
print(f'CBLOF False Positives (FP): {fp_cblof}')
print(f'CBLOF True Negatives (TN): {tn_cblof}')
print(f'CBLOF False Negatives (FN): {fn_cblof}')

CBLOF True Positives (TP): 144
CBLOF False Positives (FP): 272
CBLOF True Negatives (TN): 562
CBLOF False Negatives (FN): 123
