## Hypothesis

Anomaly detection techniques are commonly employed in predictive maintenance strategies to identify abnormal patterns or behaviors in machine data that may indicate impending failures.

By monitoring sensor readings, operational parameters, or other relevant machine data, anomaly detection algorithms can learn the expected patterns of a machine during normal operation. A detected deviation from that normal behavior may indicate a potential failure or malfunction. The hypothesis is that the sensor readings of a pump will show abnormal values in the case of an (upcoming) failure, and that these values can be identified with anomaly detection. Several algorithms will be used to evaluate the hypothesis.

## Assignment

#### Analyze the end result plot to evaluate the algorithm's performance. Look for anomalies identified by the algorithm and compare them to known anomalies or instances of abnormal behavior in the data. Assess whether the algorithm successfully captures these anomalies and if it shows promising results in detecting abnormal patterns. Based on the plot analysis, provide argumentation for the validity of the anomaly detection algorithm hypothesis (see above). Discuss how the algorithm effectively captures anomalies in the time series data and why it is a suitable approach for the use case. Support your argument with references to relevant literature that discuss the effectiveness of the chosen algorithm or similar algorithms in detecting anomalies in time series data.

The Isolation Forest algorithm works by isolating anomalies in the data by constructing isolation trees.

These trees are binary partitioning trees that randomly select features and split the data along those features until each data point is isolated in its own leaf node.

Anomalies are expected to require fewer splits to be isolated, making them stand out from the normal data points.
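The idea that anomalies need fewer random splits can be illustrated with a deliberately simplified sketch. This is not scikit-learn's implementation: there is no subsampling, no ensemble of trees, and no score normalization. The helper names `isolation_depth` and `mean_depth` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def isolation_depth(X, idx, rng, max_depth=50):
    """Count the random axis-aligned splits needed to isolate row `idx` of X."""
    subset = X
    point = X[idx]
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])            # pick a random feature
        lo = subset[:, feature].min()
        hi = subset[:, feature].max()
        split = rng.uniform(lo, hi)                   # random split between min and max
        # Keep only the side of the split that contains the point of interest
        if point[feature] < split:
            subset = subset[subset[:, feature] < split]
        else:
            subset = subset[subset[:, feature] >= split]
        depth += 1
    return depth


def mean_depth(X, idx, rng, n_trials=200):
    """Average the isolation depth over many random partitionings."""
    return np.mean([isolation_depth(X, idx, rng) for _ in range(n_trials)])


rng = np.random.default_rng(0)
normal_points = rng.normal(0, 1, size=(200, 2))       # synthetic "normal" data
X = np.vstack([normal_points, [[6.0, 6.0]]])          # one obvious outlier, appended last

outlier_idx = len(X) - 1
inlier_idx = int(np.argmin(np.linalg.norm(normal_points, axis=1)))  # point nearest the centre

outlier_depth = mean_depth(X, outlier_idx, rng)
inlier_depth = mean_depth(X, inlier_idx, rng)
print(f"mean depth, outlier: {outlier_depth:.1f}")
print(f"mean depth, inlier:  {inlier_depth:.1f}")
```

On data like this, the outlier should be isolated after noticeably fewer splits than the point in the dense centre, which is the property the algorithm turns into an anomaly score.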

One of the key advantages of the Isolation Forest algorithm is its ability to handle high-dimensional datasets and outliers without requiring complex pre-processing or assumptions about the underlying distribution of the data.

This makes it particularly suitable for time series data, where anomalies can occur in various forms and patterns. 

Efficiency is another advantage of the Isolation Forest algorithm for anomaly detection in time series data.

The algorithm has linear time complexity, so it scales to large datasets. This is beneficial for time series analysis, which often involves large volumes of data points.

Several studies have demonstrated the effectiveness of the Isolation Forest algorithm and similar algorithms in detecting anomalies in time series data. For instance, in the survey "Anomaly Detection in Univariate Time-Series: A Survey on the State-of-the-Art" by Mohammad Braei and Dr.-Ing. Sebastian Wagner (2020), the authors evaluated the performance of several unsupervised anomaly detection algorithms on univariate time series data.

They found that the Isolation Forest algorithm achieves a "very high AUC-value", indicating its ability to effectively discriminate between normal and anomalous instances.

Reference:
Braei, M., & Wagner, S. (2020). Anomaly Detection in Univariate Time-Series: A Survey on the State-of-the-Art. arXiv preprint arXiv:2011.01958. Retrieved from https://arxiv.org/abs/2011.01958
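To make the AUC criterion mentioned above concrete, here is a minimal sketch of how an AUC value can be computed from Isolation Forest scores when ground-truth labels are available. The data is synthetic and univariate (an assumption for illustration, not the data used in the survey), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
normal_values = rng.normal(0, 1, size=(500, 1))        # synthetic normal readings
anomalous_values = rng.uniform(6, 12, size=(10, 1))    # synthetic anomalies, far from the bulk
X = np.vstack([normal_values, anomalous_values])
y_true = np.r_[np.zeros(500), np.ones(10)]             # 1 = anomaly

model = IsolationForest(random_state=42).fit(X)

# score_samples returns lower values for more abnormal points,
# so negate it to obtain a score where higher means more anomalous.
anomaly_score = -model.score_samples(X)

auc_value = roc_auc_score(y_true, anomaly_score)
print(f"AUC: {auc_value:.3f}")
```

An AUC close to 1 means the scores rank almost all anomalies above the normal points, independent of any particular decision threshold.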

In [None]:
# Done together with Fateme Rakhshanifar

-----

In this assignment on credit card fraud detection using anomaly detection techniques, various predictive models were employed to distinguish normal transactions from fraudulent ones. The dataset used for this task was sourced from Kaggle. The primary goal of fraud detection is to prevent customers from being charged for fraudulent transactions. The following techniques were explored:

### Outlier Detection:
Outlier detection approaches treat transactions as points in a multi-dimensional space and aim to detect statistically significant deviations from regular transactions. Unsupervised anomaly models map each data point to a one-dimensional anomaly score, from which a label (normal or anomalous) is derived.

### Isolation Forest Algorithm:
The Isolation Forest Algorithm is a technique used to detect anomalies by isolating points. It works by randomly selecting features and splitting on values between the maximum and minimum values of the feature to isolate points. Anomalies are easier to isolate due to their distinct behavior, resulting in a shorter path length to separate them from normal observations. This algorithm has low computational complexity and memory usage.
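As a small illustration of the shorter path lengths described above, the sketch below fits scikit-learn's `IsolationForest` on synthetic two-dimensional data with two injected outliers. The data and the `contamination` value are assumptions chosen for the example, not settings from this assignment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(300, 2))            # synthetic normal observations
X_outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])      # two injected outliers
X_demo = np.vstack([X_normal, X_outliers])

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(X_demo)

labels = forest.predict(X_demo)         # 1 = normal, -1 = anomaly
scores = forest.score_samples(X_demo)   # lower score = shorter average path = more anomalous

print("labels of injected outliers:", labels[-2:])
print("scores of injected outliers:", scores[-2:].round(3))
print("median score of normal data:", np.median(scores[:300]).round(3))
```

The injected points should receive clearly lower scores than the bulk of the data, which is what allows a threshold (here set through `contamination`) to separate them.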

### Local Outlier Factor Algorithm:
The Local Outlier Factor (LOF) algorithm is an unsupervised method that computes the local density deviation of a data point with respect to its neighbors. It identifies points with substantially lower density compared to their neighbors as outliers. The choice of the number of neighbors affects the sensitivity of this algorithm.
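The effect of the neighborhood size can be explored with a sketch like the following. It uses synthetic data (one dense cluster plus a single distant point) and a few arbitrary values of `n_neighbors`; both are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
dense_cluster = rng.normal(0, 0.3, size=(200, 2))   # synthetic dense region
isolated_point = np.array([[3.0, 3.0]])             # far from the cluster
X_lof = np.vstack([dense_cluster, isolated_point])

lof_results = {}
for k in (5, 20, 50):
    lof = LocalOutlierFactor(n_neighbors=k)
    labels_k = lof.fit_predict(X_lof)                # 1 = inlier, -1 = outlier
    factor_k = -lof.negative_outlier_factor_         # LOF value; about 1 for inliers
    lof_results[k] = (labels_k[-1], factor_k[-1])
    print(f"n_neighbors={k:>2}: label={labels_k[-1]}, LOF={factor_k[-1]:.2f}")
```

A LOF value well above 1 means the point lies in a much sparser region than its neighbors. Comparing the values across `n_neighbors` shows how the choice of neighborhood changes the sensitivity.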

### One Class SVM:
One-Class SVM is an unsupervised algorithm designed for outlier detection. It learns from normal transactions to create a model representing the data. When introduced to observations far away from the normal behavior, it labels them as outliers with a negative score. Observations close to the normal behavior receive positive scores.
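The sign convention described above can be checked with a minimal sketch. The model is trained on synthetic "normal" data only and then scored on one typical and one distant observation; the data and the `nu` value are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, size=(300, 2))        # synthetic normal behavior only
X_new = np.array([[0.1, -0.2],                   # close to the normal behavior
                  [8.0, 8.0]])                   # far away from it

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_train)

pred = ocsvm.predict(X_new)               # 1 = normal, -1 = outlier
score = ocsvm.decision_function(X_new)    # positive inside the learned region, negative outside

for point, p, s in zip(X_new, pred, score):
    print(point, "label:", p, "score:", round(float(s), 3))
```

Here `nu` acts as an upper bound on the fraction of training points treated as outliers, which is why the notebook sets it to the expected fraud fraction.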

### Autoencoders:
Autoencoders, a type of neural network used for unsupervised learning, were explored. They learn patterns in the data by compressing the input through hidden layers that retain only the essential information and then reconstructing it. Because they use non-linear transformations, autoencoders can capture more complex patterns than a linear technique such as Principal Component Analysis (PCA).
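A common way to use an autoencoder for anomaly detection is to flag points with a high reconstruction error. The sketch below is only a stand-in for a full deep-learning setup: it uses scikit-learn's `MLPRegressor` trained to reproduce its own input through a two-unit bottleneck. The synthetic data, the network size, and the helper `reconstruction_error` are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Synthetic normal data: 5 observed features driven by 2 latent factors plus noise
latent = rng.normal(0, 0.5, size=(500, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0, 0.0]])
X_normal = latent @ mixing + rng.normal(0, 0.05, size=(500, 5))

# A point that breaks the learned structure (large value in the fifth feature)
x_outlier = np.array([[0.0, 0.0, 0.0, 0.0, 5.0]])

# Input and target are the same matrix, so the network learns to reconstruct its input
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                           solver="lbfgs", max_iter=2000, random_state=0)
autoencoder.fit(X_normal, X_normal)


def reconstruction_error(model, X):
    """Mean squared reconstruction error per sample."""
    return np.mean((model.predict(X) - X) ** 2, axis=1)


err_normal = reconstruction_error(autoencoder, X_normal)
err_outlier = reconstruction_error(autoencoder, x_outlier)[0]
print(f"99th percentile error on normal data: {np.percentile(err_normal, 99):.4f}")
print(f"error on the outlier:                 {err_outlier:.4f}")
```

Points that do not follow the structure learned from normal data cannot be reconstructed well through the bottleneck, so their error stands out and can be thresholded.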

## Credit Card Fraud Detection (Anomaly Detection)

In [None]:

import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import warnings
import numpy as np
from sklearn.metrics import classification_report, accuracy_score, roc_curve, auc, silhouette_score, adjusted_rand_score
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from scipy.stats.mstats import winsorize
warnings.filterwarnings('ignore')
import yaml

In [None]:
# Load the configuration from the YAML file
with open("config.yml", "r") as file:
    config = yaml.safe_load(file)

# Get the dataset path from the configuration
dataset = config["anomaly_data"]["path"]

# Load the dataset
df = pd.read_csv(dataset)
df.head()



In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
corr_matrix = df.corr()
sns.heatmap(corr_matrix,cmap="coolwarm")
plt.show()

In [None]:
normal = df[df['Class']==0]
fraud = df[df['Class']==1]
print("normal shape:", normal.shape)
print("fraud shape:", fraud.shape)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

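features = df.columns[:-1]  # every column except the last one ('Class')

plt.figure(figsize=(20, 50))
for i, feature in enumerate(features, start=1):
    plt.subplot(11, 3, i)
    # fill=True replaces the deprecated shade=True in recent seaborn releases
    sns.kdeplot(normal[feature], fill=True, label='Normal', color='pink')
    sns.kdeplot(fraud[feature], fill=True, label='Fraud', color='blue')
    plt.legend()
plt.tight_layout()  # adjust the layout once, after all subplots exist
plt.show()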


It is clear that only a few of the features are correlated with each other, so the rest can be neglected.

Fraud is also difficult to detect because the fraudulent transactions are hidden in a lower-dimensional subspace.

### Normalization
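Min-max scaling maps each column to the range [0, 1] using `(x - min) / (max - min)`. A minimal sketch on a toy frame (the values are hypothetical) shows that the manual formula and `MinMaxScaler` agree:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values, only to illustrate the formula
toy = pd.DataFrame({"Amount": [10.0, 20.0, 50.0], "V1": [-2.0, 0.0, 2.0]})

# Manual min-max scaling, applied column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(manual)
print(scaled)
```

Because the minimum and maximum are taken per column, a single extreme value compresses all other values of that column towards 0, which is worth keeping in mind for a skewed feature such as a transaction amount.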

In [None]:
#scale between (0,1)
df_norm = MinMaxScaler().fit_transform(df)
df_norm = pd.DataFrame(df_norm, columns=df.columns)
df_norm['Class'] = df['Class']

In [None]:
#Assigning the transaction class "0 = NORMAL  & 1 = FRAUD"
normal = df_norm[df_norm['Class']==0]
fraud = df_norm[df_norm['Class']==1]
print("normal shape:", normal.shape)
print("fraud shape:", fraud.shape)

Here we reduce the dimensionality by keeping only the features that are correlated with at least one other column (absolute correlation above 0.3).

In [None]:
col_list = []

for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        corr_value = corr_matrix.iloc[i, j]
        feature1 = corr_matrix.columns[i]
        feature2 = corr_matrix.columns[j]
        
        if abs(corr_value) > 0.3:
            col_list.extend([feature1, feature2])

# Remove duplicates by converting the list to a set
col_list = set(col_list)

# Drop 'Class', 'Time' and 'Amount'; discard() does not fail if a name is absent
for col in ('Class', 'Time', 'Amount'):
    col_list.discard(col)
print("List of columns with correlation > 0.3 or < -0.3:")
print(col_list)


In [None]:
X = df_norm[list(col_list)]
y = df_norm['Class']
print(X.shape)
print(y.shape)

## Classification

In [None]:
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

outliers_fraction = 1 - (len(normal)/(len(df))) 

anomaly_algorithms = [
    ("Isolation Forest",IsolationForest(contamination=outliers_fraction, n_jobs = -1)),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction)),
    ("Local Outlier Factor",LocalOutlierFactor(contamination=outliers_fraction, n_jobs = -1)),
    ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction))]

# Note: fitting the models below takes a long time.
for name, algorithm in anomaly_algorithms:
    print(algorithm)

    if name == "Local Outlier Factor":
        y_pred = algorithm.fit_predict(X)
    else:
        y_pred = algorithm.fit(X).predict(X)
    
    df[f'{name}'] = y_pred
    print('-'*100)
    print('Prediction counts (-1 = anomaly, 1 = normal):')
    print(df[f'{name}'].value_counts())
    print('-'*100)


In [None]:
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

outliers_fraction = 1 - (len(normal) / len(df))

anomaly_algorithms = [
    ("Isolation Forest", IsolationForest(contamination=outliers_fraction, n_jobs=-1)),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction)),
    ("Local Outlier Factor", LocalOutlierFactor(contamination=outliers_fraction, n_jobs=-1)),
    ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)),
]


for name, clf in anomaly_algorithms:
    print(clf)

    if name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_prediction = clf.negative_outlier_factor_
    elif name == 'One-Class SVM':
        clf.fit(X)
        y_pred = clf.predict(X)
    else:
        clf.fit(X)
        scores_prediction = clf.decision_function(X)
        y_pred = clf.predict(X)

    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != y).sum()
    df[f'{name}'] = y_pred
    
    print("{} number of errors: {}".format(name, n_errors))
    print(df[f'{name}'].value_counts())
    print(pd.crosstab(y, y_pred))
    # Use the model features X; df also contains the label and earlier prediction columns
    print('silhouette coefficient:', round(metrics.silhouette_score(X, y_pred, metric='euclidean'), 3))
    print('Adjusted Rand index   :', round(metrics.adjusted_rand_score(y, y_pred), 3))
    print("Classification Report :")
    print(classification_report(y, y_pred))

    # Share of known fraud cases flagged as anomalies (recall on the fraud class).
    # Predictions were remapped above, so a detected fraud is labelled 1, not -1.
    accuracy = ((y == 1) & (y_pred == 1)).sum() / (y == 1).sum()
    print(f'Accuracy using {name}: {accuracy:.2%}')
    print('-' * 20)



In [None]:
#save results
filename = 'outcome.csv'
df.to_csv(filename, index=False)

The output shows the evaluation results of the different anomaly detection algorithms:

| Algorithm | Number of anomalies detected | Accuracy |
|---|---|---|
| Isolation Forest | 492 | 22.97% |
| One-Class SVM | 493 | 34.15% |
| Local Outlier Factor | 492 | 0.00% |
| Robust Covariance | 492 | 22.97% |
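According to the comment in the evaluation cell, the "Accuracy" reported here is the percentage of correctly detected anomalies, that is, the share of known fraud cases that a model flags. This quantity is usually called recall. The sketch below works through that calculation on a handful of hypothetical labels (not taken from the dataset), including the remapping from scikit-learn's `-1`/`1` convention to `1`/`0`.

```python
import numpy as np

# Hypothetical ground truth: 1 = fraud, 0 = normal
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
# Hypothetical raw model output in scikit-learn's convention: -1 = anomaly, 1 = normal
raw_pred = np.array([1, 1, -1, -1, 1, 1, -1, 1])

# Remap so that predictions use the same coding as the labels
y_pred_demo = np.where(raw_pred == -1, 1, 0)

true_pos = int(np.sum((y_true == 1) & (y_pred_demo == 1)))
detection_rate = true_pos / np.sum(y_true == 1)    # share of fraud cases that were flagged
precision = true_pos / np.sum(y_pred_demo == 1)    # share of flagged cases that are fraud

print(f"detected fraud cases: {true_pos}")
print(f"detection rate (recall): {detection_rate:.2%}")
print(f"precision: {precision:.2%}")
```

Looking at precision alongside the detection rate helps when the number of flagged points is fixed by `contamination`, because every missed fraud case then corresponds to a normal transaction that was flagged instead.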



One-Class SVM achieved the highest accuracy of the four algorithms (34.15%). Even so, it flagged only about a third of the known fraudulent transactions.



