## Special Topics in Machine Learning: Project Assignment

* After you run the setup cells in Section 1, the data is stored in the variable `df`.
* The dataset was collected from sensors attached to a vessel's main engine. For confidentiality, every column name is anonymized and every sensor value is normalized.
* The objective is to train an anomaly detector using Isolation Forest.
* There is a label in the dataset (column class), where 0 means a normal datapoint and 1 means an anomalous datapoint.
* You will need to preprocess the data, split it into training and testing sets, and then apply the Isolation Forest algorithm to detect anomalies.
* Finally, evaluate the performance of your model using appropriate metrics and visualize the results to understand the model's effectiveness.

## 1. Importing the dataset and modules

In [1]:
# Run this cell before starting the assignment.
# It installs PyOD, an anomaly detection library whose interface follows scikit-learn's, in your Colab environment.
!pip install pyod

Collecting pyod
  Downloading pyod-2.0.0.tar.gz (164 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 165.0/165.0 kB 1.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyod
  Building wheel for pyod (setup.py) ... done
  Created wheel for pyod: filename=pyod-2.0.0-py3-none-any.whl size=196324 sha256=61de9cd3d4351a3b7f8babdfaefb349ff5d24324022c2c7c0a7205e6805ccb5b
  Stored in directory: /root/.cache/pip/wheels/15/0e/91/96b270e6741d4eece88727489411330226ff47ac1cb9ea0097
Successfully built pyod
Installing collected packages: pyod
Successfully installed pyod-2.0.0


In [2]:
# Step 0: Import the necessary modules
import numpy as np
import pandas as pd
from pyod.models.iforest import IForest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

# Step 1: Load the dataset and keep the selected sensor columns and the label
df = pd.read_csv('https://raw.githubusercontent.com/ralbu85/STML/main/assignment_data.csv.csv')
df = df[['Dim_0','Dim_16','Dim_17','Dim_18','Dim_19','Dim_20','class']]

In [3]:
# Preview the dataset
df

| | Dim_0 | Dim_16 | Dim_17 | Dim_18 | Dim_19 | Dim_20 | class |
|---|---|---|---|---|---|---|---|
| 0 | 0.750000 | 0.001132 | 0.080780 | 0.197324 | 0.300926 | 0.225000 | 0 |
| 1 | 0.239583 | 0.000472 | 0.164345 | 0.235786 | 0.537037 | 0.165625 | 0 |
| 2 | 0.479167 | 0.003585 | 0.130919 | 0.167224 | 0.527778 | 0.118750 | 0 |
| 3 | 0.656250 | 0.001698 | 0.091922 | 0.125418 | 0.337963 | 0.129688 | 0 |
| 4 | 0.229167 | 0.000472 | 0.142061 | 0.229097 | 0.337963 | 0.235938 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7195 | 0.604167 | 0.004717 | 0.113092 | 0.128763 | 0.379630 | 0.121875 | 0 |
| 7196 | 0.520833 | 0.200000 | 0.030641 | 0.005017 | 0.333333 | 0.005469 | 1 |
| 7197 | 0.520833 | 0.001434 | 0.109192 | 0.147157 | 0.231481 | 0.206250 | 0 |
| 7198 | 0.354167 | 0.005283 | 0.109192 | 0.147157 | 0.333333 | 0.154688 | 0 |


## Problem 1
* Complete the following cell to calculate the number of normal and abnormal data points.
* Save the number of normal data points in the variable 'n_normal' and the number of abnormal data points in the variable 'n_anormal'.
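
Counting rows per label value can be sketched on a small made-up DataFrame. The frame `toy` and the variable names below are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

# Toy stand-in for the assignment data: 0 = normal, 1 = anomalous.
toy = pd.DataFrame({"class": [0, 0, 1, 0, 1, 0]})

# value_counts() returns a Series indexed by label value.
counts = toy["class"].value_counts()
toy_n_normal = int(counts.get(0, 0))
toy_n_anomalous = int(counts.get(1, 0))

# Summing a boolean mask gives the same count.
assert toy_n_anomalous == int((toy["class"] == 1).sum())

print(toy_n_normal, toy_n_anomalous)
```

Either approach (`value_counts()` or a summed boolean mask) should work on the `class` column of `df`.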

In [None]:
## Complete the following code in the cell
n_normal = # Complete the code
n_anormal = # Complete the code

In [None]:
## Answer check: if your code is correct, running this cell prints the output shown below.
print(f"Number of normal data points: {n_normal}")
print(f"Number of abnormal data points: {n_anormal}")

Number of normal data points: 6666
Number of abnormal data points: 534


## Problem 2
* Next, we split the dataset into training, validation, and test sets.
* First, 75% of the normal data is used for training.
* Second, the remaining 25% of the normal data and all of the anomalous data are split evenly between the validation and test sets.
* Complete the following cell to split the dataset.
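
The splitting scheme described above can be sketched on synthetic arrays. All names and sizes below are made up for illustration; only `train_test_split`, `np.vstack`, and `np.hstack` come from the notebook itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 "normal" rows and 6 "anomalous" rows with 3 features each.
rng = np.random.default_rng(0)
X_norm, y_norm = rng.random((20, 3)), np.zeros(20, dtype=int)
X_anom, y_anom = rng.random((6, 3)), np.ones(6, dtype=int)

# 75% of the normal rows go to training; the rest is held out.
Xn_train, Xn_rest, yn_train, yn_rest = train_test_split(
    X_norm, y_norm, test_size=0.25, random_state=42)

# The held-out normal rows and all anomalous rows are each split in half.
Xn_val, Xn_test, yn_val, yn_test = train_test_split(
    Xn_rest, yn_rest, test_size=0.5, random_state=42)
Xa_val, Xa_test, ya_val, ya_test = train_test_split(
    X_anom, y_anom, test_size=0.5, random_state=42)

# Stack normal and anomalous parts into one validation set.
X_val_toy = np.vstack((Xn_val, Xa_val))
y_val_toy = np.hstack((yn_val, ya_val))
print(Xn_train.shape, X_val_toy.shape, int(y_val_toy.sum()))
```

Note that the training set contains normal rows only, so the detector never sees an anomaly during fitting.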

In [None]:
## You don't need to write any code here, but you must run this cell

## Separate the normal and anomalous rows
df_normal = df[df['class']==0]
df_anormal = df[df['class']==1]

## Features and labels for the normal data
y_normal = df_normal['class']
X_normal = df_normal.drop(columns=['class'])

## Features and labels for the anomalous data
y_anormal = df_anormal['class']
X_anormal = df_anormal.drop(columns=['class'])

In [None]:
## Complete the following code in the cell

# Split normal data into training (75%) and the remaining (25%)
X_normal_train, X_normal_temp, y_normal_train, y_normal_temp = train_test_split( , , , random_state=42) # Complete the code
X_normal_val, X_normal_test, y_normal_val, y_normal_test = train_test_split( , , test_size=0.5, random_state=42) # Complete the code

# Split all the anomalous data evenly between validation and testing
X_anormal_val, X_anormal_test, y_anormal_val, y_anormal_test = train_test_split( , , test_size=0.5, random_state=42) # Complete the code

# Combine normal and anomalous data for validation and testing
X_val = np.vstack((X_normal_val, X_anormal_val))
y_val = np.hstack((y_normal_val, y_anormal_val))

X_test = np.vstack((X_normal_test, X_anormal_test))
y_test = np.hstack((y_normal_test, y_anormal_test))

In [None]:
## Answer check: if your code is correct, running this cell prints the output shown below.
print(f"Shape of X_train: {X_normal_train.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of X_test: {X_test.shape}")

Shape of X_train: (4999, 6)
Shape of X_val: (1100, 6)
Shape of X_test: (1101, 6)
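
The uneven validation and test sizes follow from rounding. The sketch below reproduces the row counts with plain arithmetic, assuming that a fractional `test_size` is rounded up and the training part receives the remainder. The helper `split_sizes` is hypothetical and exists only for this illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_normal_total, n_anomalous_total = 6666, 534  # counts from Problem 1

n_train, n_rest = split_sizes(n_normal_total, 0.25)
n_norm_val, n_norm_test = split_sizes(n_rest, 0.5)
n_anom_val, n_anom_test = split_sizes(n_anomalous_total, 0.5)

print(n_train, n_norm_val + n_anom_val, n_norm_test + n_anom_test)
# → 4999 1100 1101
```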


## Problem 3
* Next, we tune the hyperparameters of Isolation Forest, using the validation set to select the best model.
* Note that the IForest model uses the same API as scikit-learn estimators.
* However, we cannot use GridSearchCV directly in this setting, because the model is trained on normal data only and scored on a separate, fixed validation set.
* Instead, we iterate over the entire hyperparameter grid manually.
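
A manual search over a hyperparameter grid can be written compactly with `itertools.product`. This is a minimal sketch: `grid_search` and `toy_score` are hypothetical helpers, and `toy_score` merely stands in for "fit on the training data, predict on the validation data, return the F1 score".

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the highest score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function; it peaks at contamination=0.2
# and slightly prefers fewer estimators.
def toy_score(params):
    return 1.0 - abs(params["contamination"] - 0.2) - 0.001 * params["n_estimators"] / 100

toy_grid = {"n_estimators": [100, 200], "contamination": [0.1, 0.2, 0.3]}
best_params_toy, best_score_toy = grid_search(toy_grid, toy_score)
print(best_params_toy, best_score_toy)
```

The nested `for` loops in the cell below do the same thing; `product` only avoids one level of nesting per hyperparameter.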

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_samples': ['auto'],
    'contamination': [0.1, 0.2, 0.3, 0.5],
    'max_features': [1.0, 0.5, 0.8]
}

# Train an Isolation Forest with the given hyperparameters and evaluate it on the validation set.
# The F1 score is the metric used to select the best hyperparameters.
def train_and_evaluate(params):
    iso_forest = IForest(
        n_estimators=params['n_estimators'],
        max_samples=params['max_samples'],
        contamination=params['contamination'],
        max_features=params['max_features'],
        random_state=42
    )
    iso_forest.fit() # Complete the code
    y_val_pred = iso_forest.predict() # Complete the code
    metric = f1_score(, ) # Complete the code
    return metric

# Perform manual hyperparameter tuning
best_params = None
best_metric = -np.inf

for n_estimators in param_grid['n_estimators']:
    for max_samples in param_grid['max_samples']:
        for contamination in param_grid['contamination']:
            for max_features in param_grid['max_features']:
                params = {
                    'n_estimators': n_estimators,
                    'max_samples': max_samples,
                    'contamination': contamination,
                    'max_features': max_features
                }
                metric = train_and_evaluate(params)
                if metric > best_metric:
                    best_metric = # complete the code
                    best_params = # complete the code

print(f'Best Hyperparameters: {best_params}')
print(f'Best Validation F1-Score: {best_metric}')

Best Hyperparameters: {'n_estimators': 100, 'max_samples': 'auto', 'contamination': 0.2, 'max_features': 0.8}
Best Validation F1-Score: 0.7519260400616332


## Problem 4
* The best hyperparameters are now stored in `best_params`.
* Retrain the Isolation Forest model on the training data with these hyperparameters, then evaluate it on the test set.
* We also check the performance of the best model using several metrics.

In [None]:
# Train the best model on the training data
best_iso_forest = IForest(
    n_estimators=# Complete the code,
    max_samples=# Complete the code,
    contamination=# Complete the code,
    max_features=# Complete the code,
    random_state=42
)
best_iso_forest.fit() ## Complete the code

# Predicting anomalies on testing set
y_test_pred = best_iso_forest.predict() # Complete the code

# Evaluating on testing set
precision_test = precision_score( , ) # Complete the code
recall_test = recall_score( , ) # Complete the code
f1_test = f1_score( , ) # Complete the code
roc_auc_test = roc_auc_score( , ) # Complete the code

print(f'Test Precision: {precision_test}')
print(f'Test Recall: {recall_test}')
print(f'Test F1-Score: {f1_test}')
print(f'Test AUC-ROC: {roc_auc_test}')

Test Precision: 0.5785714285714286
Test Recall: 0.9101123595505618
Test F1-Score: 0.7074235807860262
Test AUC-ROC: 0.8489410718616118
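
To see how these four scores relate, they can be recomputed by hand from confusion-matrix counts. The counts below are assumptions chosen so that the formulas reproduce the scores printed above; they are not read from the model.

```python
# Illustrative counts (assumed, chosen to be consistent with the scores above).
tp, fp, tn, fn = 243, 177, 657, 24

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)       # true negative rate

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area equals the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(round(precision, 4), round(recall, 4), round(f1, 4), round(auc_from_labels, 4))
# → 0.5786 0.9101 0.7074 0.8489
```

Because `roc_auc_score` receives hard labels here, the value is effectively a balanced accuracy. Passing continuous anomaly scores instead would generally give a different AUC.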


## Problem 5
* We also want to examine the confusion matrix of our results.
* Extract the true positive, false positive, true negative, and false negative counts.
* You will need to look up the `confusion_matrix` API in the scikit-learn documentation.
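
As a hint about the layout, the following sketch runs `confusion_matrix` on a made-up pair of label lists. For binary labels ordered `[0, 1]`, rows are true labels and columns are predicted labels, so flattening the matrix yields the counts in the order TN, FP, FN, TP. The `_toy` names are illustrative only.

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 0, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 0]

# Rows are true labels, columns are predicted labels, in the given label order.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# ravel() flattens row by row: [[tn, fp], [fn, tp]] -> tn, fp, fn, tp.
tn_toy, fp_toy, fn_toy, tp_toy = cm_toy.ravel()
print(cm_toy)
```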

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_test_pred)

tp = # Complete the code
fp = # Complete the code
tn = # Complete the code
fn = # Complete the code


In [None]:
## Answer check: if your code is correct, running this cell prints the output shown below.
print(f"True Positive: {tp}")
print(f"False Positive: {fp}")
print(f"True Negative: {tn}")
print(f"False Negative: {fn}")

True Positive: 243
False Positive: 177
True Negative: 657
False Negative: 24


## Problem 6 (Extra Credit)
* Apply the same procedure using other PyOD anomaly detection algorithms, such as LOF or CBLOF.
* Record your experimental results in the following cells.
* Search for the PyOD documentation online to find the available models and their parameters.
* The hyperparameter settings are up to you (avoid overly large hyperparameter grids).
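
As a starting point, here is a minimal sketch of the same fit-on-normal, predict-on-new pattern with a Local Outlier Factor detector. It deliberately uses scikit-learn's `LocalOutlierFactor` rather than PyOD's LOF wrapper, and the data is synthetic. PyOD detectors follow a similar `fit`/`predict` pattern but, as far as I know, return 0/1 labels directly.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X_train_toy = rng.normal(0.0, 1.0, size=(200, 2))   # "normal" data only
X_new_toy = np.array([[0.0, 0.0], [8.0, 8.0]])      # one typical point, one far away

# novelty=True allows predict() on unseen data after fitting on normal data.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train_toy)

# scikit-learn returns +1 for inliers and -1 for outliers;
# map this to the 0 (normal) / 1 (anomalous) labels used in this assignment.
y_new_pred = (lof.predict(X_new_toy) == -1).astype(int)
print(y_new_pred)
```

The validation-based tuning loop from Problem 3 can then be reused with `n_neighbors` (and, for PyOD models, `contamination`) as the grid.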


In [None]:
# complete the following code