# Lab 4, Exercise 4

In [1]:
import numpy as np
import pandas as pd

## Load data 


In [2]:
# Load the data in the following two CSVs:
# data/exercise4/lab4_normal_data.csv
# data/exercise4/lab4_malicious_data.csv
# The first consists completely of normal data, while the second consists completely of malicious data
# Note: Both sets of data contain the same features used in Exercise 1; the data has already been preprocessed
# (i.e., you can keep all the features and there are no labels in the CSVs)

# CODE HERE
norm_data = pd.read_csv('data/exercise4/lab4_normal_data.csv')
mal_data = pd.read_csv('data/exercise4/lab4_malicious_data.csv')

In [3]:
# Create 15 datasets, where the ith dataset consists of:
# - all normal data
# - only the ith malicious datapoint

# CODE HERE
datasets = []
for i in range(mal_data.shape[0]):
    datasets.append(pd.concat([norm_data,mal_data.iloc[i:i+1,:]],ignore_index=True))

## Anomaly detection

In [4]:
# For each dataset, run isolation forests
#
# Use the following evaluation metric:
# - rank the anomalousness of each datapoint using the isolation forest
# - record the list index of each attack datapoint when sorting from most to least unusual
#     - e.g., if the attack datapoint is at index 0 in the list, we want to record the value 0
#
# Note: don't worry about ties in ranking
# Hint: What is the difference between isolation forest's 'decision_function' and 'predict' methods? 

# CODE HERE
from sklearn.ensemble import IsolationForest

list_idxs = []
for dataset in datasets:
    model = IsolationForest(contamination=0.1, random_state=42).fit(dataset)
    # decision_function returns a continuous score; lower means more anomalous
    anom_scores = model.decision_function(dataset)
    # Row indices ordered from most to least anomalous
    ranking = np.argsort(anom_scores, kind='stable')
    # The attack datapoint is the last row of each dataset
    attack_row = dataset.shape[0] - 1
    list_idxs.append(int(np.where(ranking == attack_row)[0][0]))
print('List index of each attack datapoint:\n{}'.format(list_idxs))

List index of each attack datapoint:
[161, 18579, 1562, 1412, 0, 28441, 15273, 25003, 722, 17906, 139, 364, 364, 1130, 172]


## Questions:
1) Why is there no separate training and test set?

There is no separate training and test set because an isolation forest is an unsupervised model. Unlike supervised regression or classification, it uses no labels, so there is no learned mapping to validate on held-out data. The model is fit on the full dataset and then scores those same datapoints by how easily each one can be isolated from the rest.

2) What is the metric measuring?  What would be a perfect score?  Bonus: What is the expected performance of an outlier detector that assigns a random score to each datapoint?

The anomaly score measures how anomalous the model considers a single datapoint. The index metric measures where the attack datapoint falls when all datapoints are sorted from most to least anomalous, that is, how many normal datapoints the model considers more unusual than the attack. A perfect score is 0, meaning the attack datapoint is ranked as the most anomalous of all the datapoints.
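
For the bonus question, a detector that assigns random scores places the attack at a uniformly random position in the sorted list, so its expected list index is (N - 1) / 2 for N datapoints. The following is a minimal simulation sketch of that claim; the dataset size of 1001 and the helper names are hypothetical, chosen only for illustration.

```python
import random


def random_detector_rank(n_points, rng):
    """List index (0 = most anomalous) of the last datapoint under random scores."""
    scores = [rng.random() for _ in range(n_points)]
    attack_score = scores[-1]  # treat the last row as the attack, as in the lab
    # Lower score = more anomalous, so the attack's list index is the
    # number of datapoints that score strictly below it.
    return sum(score < attack_score for score in scores)


def mean_random_rank(n_points, n_trials=2000, seed=0):
    """Average list index of the attack over many random scorings."""
    rng = random.Random(seed)
    ranks = [random_detector_rank(n_points, rng) for _ in range(n_trials)]
    return sum(ranks) / n_trials


n_points = 1001  # hypothetical dataset size, not the lab's
print(mean_random_rank(n_points), (n_points - 1) / 2)
```

The simulated mean should land close to the midpoint of the list, which is the baseline any useful detector has to beat.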

3) How well does the isolation forest perform compared to a perfect score? Bonus: How well does the isolation forest perform compared to a random detector?

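The isolation forest falls well short of a perfect score. It ranks the attack datapoint at index 0 in only one of the 15 datasets. Ten of the attacks land at index 1562 or better, but the remaining five are ranked between 15273 and 28441, very far from the top of the list.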
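
A quick summary of the printed ranks makes the comparison with a perfect score concrete. This is only a sketch: the cutoff of 2000 is arbitrary and chosen purely for illustration.

```python
from statistics import median

# Ranks printed by the isolation forest cell above (0 = most anomalous).
list_idxs = [161, 18579, 1562, 1412, 0, 28441, 15273, 25003,
             722, 17906, 139, 364, 364, 1130, 172]

median_rank = median(list_idxs)
perfect = sum(rank == 0 for rank in list_idxs)
within_2000 = sum(rank < 2000 for rank in list_idxs)  # arbitrary cutoff

print(f"median rank: {median_rank}")
print(f"perfect scores: {perfect} of {len(list_idxs)}")
print(f"ranked in the first 2000: {within_2000} of {len(list_idxs)}")
```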

4) What are some issues that would prevent this model from being practically deployed?

One issue that would prevent this model from being practically deployed is that it often scores the malicious datapoint as not very anomalous compared to the normal data. As a result, either the malicious datapoint goes undetected, or an analyst must review a large amount of normal data to find it (for example, the top 20000 most anomalous datapoints). In addition, if there are many malicious datapoints, they may resemble each other enough that they no longer look anomalous and become harder to detect.

5) What might happen if we inject five attack datapoints at a time?  What might happen if we inject 100 attack datapoints at a time?

With more attack datapoints injected at once, each attack might look less anomalous to the model, since there are more datapoints similar to it and it takes more splits to isolate any one of them. With just 5 attack datapoints there might be very little difference, but with 100 it would likely be much harder to tell whether a datapoint is malicious based on its anomaly score.
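
One way to test this hypothesis is to inject k attack rows at once and record the list index of each. The sketch below uses synthetic Gaussian data as a stand-in for the lab CSVs; the data, the `attack_ranks` helper, and the cluster locations are all assumptions for illustration, and results on the real data may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def attack_ranks(normal, attacks, random_state=42):
    """Return the list index (0 = most anomalous) of each injected attack row."""
    data = np.vstack([normal, attacks])
    model = IsolationForest(random_state=random_state).fit(data)
    scores = model.decision_function(data)  # lower = more anomalous
    order = np.argsort(scores, kind="stable")  # row ids, most anomalous first
    position = np.empty_like(order)
    position[order] = np.arange(len(order))  # position[row] = list index of row
    return position[len(normal):].tolist()


# Synthetic stand-in data (an assumption; not the lab CSVs).
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(1000, 5))
attack_pool = rng.normal(4.0, 0.3, size=(100, 5))

for k in (1, 5, 100):
    ranks = attack_ranks(normal, attack_pool[:k])
    print(f"k={k}: worst attack rank = {max(ranks)}")
```

Comparing the worst rank across values of k gives a rough sense of whether a tight cluster of attacks starts to blend in.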

6) What is the effect of the parameters max_features and max_samples?  What other parameters could you adjust to change performance?

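`max_features` specifies the number of features (columns) drawn to train each base tree estimator, while `max_samples` specifies the number of samples (rows) drawn to train each base tree estimator. Other parameters that could be adjusted are `n_estimators` (the number of trees) and `bootstrap` (whether rows are sampled with replacement). `contamination` only shifts the threshold used by `predict`, so it changes which datapoints are flagged but not their ranking.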
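
The effect of these parameters on the index metric can be explored with a small sweep. This is a hedged sketch on synthetic data; the `rank_of_last_row` helper, the data, and the parameter values are assumptions for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def rank_of_last_row(data, **forest_params):
    """List index (0 = most anomalous) of the last row, ignoring ties."""
    model = IsolationForest(random_state=42, **forest_params).fit(data)
    scores = model.decision_function(data)  # lower = more anomalous
    return int(np.sum(scores < scores[-1]))


# Synthetic stand-in data (an assumption; not the lab CSVs).
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 4))
attack = np.array([[3.0, 0.0, 0.0, 0.0]])  # unusual in one feature only
data = np.vstack([normal, attack])

for params in (
    {"max_samples": 64, "max_features": 1.0},
    {"max_samples": 256, "max_features": 1.0},
    {"max_samples": 256, "max_features": 0.5},
    {"max_samples": 256, "max_features": 1.0, "n_estimators": 300},
):
    print(params, rank_of_last_row(data, **params))
```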

Optional: What are some alternative anomaly detection models one could use instead of an isolation forest? Bonus: Try one of these alternatives and compare performance.