# Capstone Project

# Unsupervised Anomaly Detection Modeling


---
Unsupervised learning is a family of machine learning techniques in which the model is not given labelled examples. Instead, the model works on its own to discover patterns and information that were previously undetected in unlabelled data. When we don't have a Y variable to predict, we are in the realm of **unsupervised learning**. Since there is no Y variable, unsupervised learning has no single measurable "goal" to optimise against. Unsupervised learning can be applied to a number of important tasks, such as manufacturing defect detection, labelling unlabelled samples, catching outliers in a dataset, and detecting fraud in bank transactions.



## 1) K-Means Algorithm to classify Fraud/Non-Fraud Credit Card Transactions
Clustering is one of the most popular concepts in unsupervised learning. The assumption is that similar data points tend to belong to the same group or cluster, as determined by their distance from local centroids.<br>
K-Means aims to partition the observations into k sets so as to minimize the within-cluster sum of squares. It starts with a group of randomly initialized centroids and then iteratively updates their positions until the centroids stabilize or the defined number of iterations is reached.<br>
Steps involved are:
1. Pick a value for 𝑘 (the number of clusters to create).
2. Initialize 𝑘 'centroids' (starting points). These do not have to be actual data points.
3. Create clusters by assigning each data point to its nearest centroid.
4. Make your clusters better. Reassign each centroid to the center of its cluster.
5. Repeat steps 3-4 until the centroids converge and do not change across iterations.
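
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later in the notebook: the function name `kmeans`, the toy points, and the choice to initialise centroids from randomly sampled data points are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 2: initialise centroids (here: k distinct data points chosen at random).
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Step 4: move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated toy groups (made-up data).
toy = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(toy, k=2)
print(centroids)
print([len(c) for c in clusters])
```

Note that K-Means only returns cluster indices; which cluster corresponds to "fraud" has to be decided afterwards, which is why the K-Means cell later in this notebook compares the clusters against the labels and flips them if needed.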

## 2) Local Outlier Factor
The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.

The number of neighbors considered (parameter `n_neighbors`) is typically chosen 1) greater than the minimum number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2) smaller than the maximum number of close-by objects that can potentially be local outliers. In practice, such information is generally not available, and taking `n_neighbors=20` appears to work well in general.
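
To make "local density deviation" concrete, here is a brute-force sketch of the LOF score in plain Python. It is a simplified illustration (ties at the k-th neighbor are broken arbitrarily, and the cost is quadratic in the number of points); the function name `local_outlier_factors` and the toy points are assumptions for the example, not part of the notebook's pipeline.

```python
import math

def local_outlier_factors(points, k):
    """Return the LOF score of every point (brute force, illustrative only)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point (excluding itself) and its k-distance.
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        neighbours.append(order)
        k_distance.append(dist[i][order[-1]])

    # Local reachability density: inverse of the mean reachability distance
    # from the point to each of its neighbours.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

# A tight group of five points plus one isolated point (made-up data).
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = local_outlier_factors(toy, k=3)
for point, score in zip(toy, scores):
    print(point, round(score, 2))
```

Scores close to 1 mean a point is about as dense as its neighbors; scores well above 1 mark points in a much sparser region than their neighbors, which is what LOF reports as outliers.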

## 3) Isolation Forest
Isolation Forest is a tree-based technique for detecting anomalies. The algorithm is based on the fact that anomalies are data points that are few and different. As a result of these properties, anomalies are susceptible to a mechanism called isolation.
<br>
This method differs from the commonly used distance- and density-based approaches: it uses isolation itself as a more effective and efficient means of detecting anomalies. Moreover, the algorithm has linear time complexity with a low constant and a small memory requirement. It builds a well-performing model with a small number of trees using small sub-samples of fixed size, regardless of the size of the data set.
<br>
Typical machine learning methods tend to work better when the patterns they try to learn are balanced, meaning that similar amounts of good and bad behavior are present in the dataset. Isolation Forest does not depend on this balance, since it relies on anomalies being rare.
<br>
How do Isolation Forests work? The Isolation Forest algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. The argument goes: isolating anomalous observations is easier because only a few conditions are needed to separate those cases from the normal observations. Isolating normal observations, on the other hand, requires more conditions. Therefore, an anomaly score can be calculated from the number of conditions required to separate a given observation.
<br>
The algorithm constructs the separation by first creating isolation trees, or random decision trees. The score is then calculated from the path length needed to isolate the observation, averaged over the trees.
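
The path-length idea can be illustrated with a small sketch in plain Python. It follows a single random path per "tree" instead of building full trees, uses the whole toy data set rather than sub-samples, and skips the normalisation that turns path lengths into a score, so it is a simplification under those assumptions. The names `isolation_path_length` and `mean_path_length` and the toy data are hypothetical.

```python
import random

def isolation_path_length(x, data, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate x within data (one random path)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(x))                 # random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:                                    # cannot split on this feature
        return depth
    split = rng.uniform(lo, hi)                     # random split value
    # Keep only the side of the split that x falls on.
    same_side = [row for row in data if (row[feature] < split) == (x[feature] < split)]
    return isolation_path_length(x, same_side, rng, depth + 1, max_depth)

def mean_path_length(x, data, n_trees=200, seed=0):
    """Average path length of x over many random paths."""
    rng = random.Random(seed)
    return sum(isolation_path_length(x, data, rng) for _ in range(n_trees)) / n_trees

# 200 points around the origin plus one far-away point (made-up data).
rng = random.Random(42)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = normal + [outlier]
typical = min(normal, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point nearest the centre

outlier_depth = mean_path_length(outlier, data)
typical_depth = mean_path_length(typical, data)
print("outlier:", outlier_depth, "typical point:", typical_depth)
```

The isolated point needs far fewer random splits than a point in the middle of the cloud, which is the signal Isolation Forest turns into an anomaly score.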

## 4) One Class SVM
One-class classification techniques can be used for binary (two-class) imbalanced classification problems where the negative case (class 0) is taken as “normal” and the positive case (class 1) is taken as an outlier or anomaly.

- Negative Case: Normal or inlier.
- Positive Case: Anomaly or outlier.

One-class SVM is a variation of the SVM that can be used in an unsupervised setting for anomaly detection. In a one-class problem, all training data are assumed to belong to a single class. The algorithm is trained to learn what is “normal”, so that when new data is shown it can identify whether the data belongs to that group or not.
A regular SVM for classification finds a max-margin hyperplane that separates the positive points from the negative ones.
The one-class SVM instead finds a hyperplane that separates the given dataset from the origin with the largest possible margin, which places the hyperplane as close to the data points as possible.
One-class SVM uses a hyperparameter (`nu` in scikit-learn) that sets an upper bound on the fraction of training points classified as outliers.
The predictions take the value 1 or -1, where -1 marks the outliers detected by the algorithm.
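
For reference, the hyperplane described above can be written as an optimisation problem. The formulation below is the standard ν-parameterised one-class SVM; the symbols $w$, $\rho$, $\xi_i$, $\phi$ and $n$ are notation introduced here for illustration and are not used elsewhere in this notebook.

$$
\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\lVert w\rVert^{2} + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{subject to}\quad
\langle w, \phi(x_i)\rangle \ge \rho - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
$$

Here $\phi$ is the feature map induced by the kernel, $\rho$ is the offset of the hyperplane from the origin, and the slack variables $\xi_i$ allow some training points to fall on the wrong side. A point $x$ is labelled $+1$ (inlier) when $\langle w, \phi(x)\rangle \ge \rho$ and $-1$ (outlier) otherwise. The parameter $\nu \in (0, 1]$ is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors.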



****

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import IsolationForest
from sklearn.svm import SVC, OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import confusion_matrix, classification_report, auc, \
                            accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve


from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler



---
### Read the data

In [2]:
pd.set_option('display.max_columns', None)
df = pd.read_csv('../data/creditcard 2.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Define X and y

In [31]:
X = df.drop(columns = 'Class')
y = df['Class']

In [32]:
# Class distribution (the majority class defines the baseline)

y.value_counts()

0    284315
1       492
Name: Class, dtype: int64
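
The class counts above already define a baseline. As a rough sketch, assuming a trivial "model" that labels every transaction as non-fraud, the arithmetic below shows the fraud rate and the accuracy such a model would reach; the variable names are made up for the example.

```python
# Counts taken from the y.value_counts() output above.
n_non_fraud, n_fraud = 284315, 492
n_total = n_non_fraud + n_fraud

fraud_rate = n_fraud / n_total
baseline_accuracy = n_non_fraud / n_total  # always predict class 0

print(f"fraud rate: {fraud_rate:.4%}")
print(f"majority-class accuracy: {baseline_accuracy:.4%}")
```

Because a model that never flags anything is already about 99.83% accurate, accuracy alone says little on this data set; precision and recall on the fraud class are more informative.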

### Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [None]:
y_train.value_counts()

### Scale the Data

In [34]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

In [35]:
# Scale your X data without splitting
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# Unsupervised Learning Models
This is an unsupervised approach to fraud detection: the labels are used only to evaluate the algorithms. One of the biggest challenges of this problem is that the target is highly imbalanced, as only 0.17% of the transactions are fraudulent. The advantage of the unsupervised approach is that it can still handle this kind of imbalance.<br>


## K-Means Clustering

In [41]:
%%time
# Instantiate and fit Kmeans algorithm

kmeans = KMeans(n_clusters=2, max_iter=10000)
kmeans.fit(X_scaled)
kmeans_predicted_labels=kmeans.predict(X_scaled)

print("tn : true negatives")
print("fp : false positives")
print("fn : false negatives")
print("tp : true positives")
tn,fp,fn,tp=confusion_matrix(y,kmeans_predicted_labels).ravel()
reassignflag=False
if tn+tp<fn+fp:
    # clustering is opposite of original classification
    reassignflag=True

# Predict cluster labels for the full scaled data set (X_scaled)

preds_kmean=kmeans.predict(X_scaled)


if reassignflag:
    preds_kmean=1-preds_kmean
#calculating confusion matrix for kmeans
tn,fp,fn,tp=confusion_matrix(y,preds_kmean).ravel()
# plot_confusion_matrix(kmeans, Xs_test, y_test, cmap='plasma', values_format='d');
#scoring kmeans
kmeans_accuracy_score=accuracy_score(y,preds_kmean)
kmeans_precison_score=precision_score(y,preds_kmean)
kmeans_recall_score=recall_score(y,preds_kmean)
kmeans_f1_score=f1_score(y,preds_kmean)

errors_kmean = (preds_kmean != y).sum() # Total number of errors is calculated.
print('=========================================================================')
print('Total Errros = ' ,errors_kmean)
#printing
print("")
print("K-Means")
print("Confusion Matrix")
print("tn =",tn,"fp =",fp)
print("fn =",fn,"tp =",tp)
print("Scores")
print("Accuracy: ",kmeans_accuracy_score)
print("Precison: ",kmeans_precison_score)
print("Recall: ",kmeans_recall_score)
print("F1: ",kmeans_f1_score)

print('=========================================================================')
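# Note: the arguments are passed as (y_pred, y_true), so in this report the
# precision and recall columns are swapped and "support" counts predicted labels.
print(classification_report(preds_kmean,y))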
print('=========================================================================')

tn : true negatives
fp : false positives
fn : false negatives
tp : true positives
Total Errros =  130255

K-Means
Confusion Matrix
tn = 154358 fp = 129957
fn = 298 tp = 194
Scores
Accuracy:  0.5426552015926575
Precison:  0.0014905763305698766
Recall:  0.3943089430894309
F1:  0.002969925675313641
              precision    recall  f1-score   support

           0       0.54      1.00      0.70    154656
           1       0.39      0.00      0.00    130151

    accuracy                           0.54    284807
   macro avg       0.47      0.50      0.35    284807
weighted avg       0.48      0.54      0.38    284807

CPU times: user 8.65 s, sys: 607 ms, total: 9.26 s
Wall time: 3.57 s


---
## Local Outlier Factor


In [7]:
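# Outlier fraction calculation

Non_fraud = df[df['Class'] == 0]
Fraud = df[df['Class'] == 1]
# Print the shape of each class
print('class 0 = Non_fraud:', Non_fraud.shape)
print('class 1 = Fraud:', Fraud.shape)
# Ratio of fraud to non-fraud rows; with so few frauds this is almost identical
# to the fraction of frauds in the whole data set (len(Fraud) / len(df)).
outlier_fraction = len(Fraud)/float(len(Non_fraud))
print("Outlier_fraction: ", outlier_fraction)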

class 0 = Non_fraud: (284315, 31)
class 1 = Fraud: (492, 31)
Outlier_fraction:  0.0017304750013189597


In [47]:
%%time

# Instantiate LOF
#  contamination is the proportion of outliers in the data set or outlier fraction
lof = LocalOutlierFactor(n_neighbors = 20, contamination = outlier_fraction)

# Fitting the model.(fit_predict when novelty is false)
y_pred_lof = lof.fit_predict(X_scaled) 

# # Prediction using trained model.(When novelty is true)
# y_pred_lof = lof.predict(X_scaled)

y_pred_lof[y_pred_lof == 1] = 0 # Valid transactions are labelled as 0.
y_pred_lof[y_pred_lof == -1] = 1 # Fraudulent transactions are labelled as 1.
errors_lof = (y_pred_lof != y).sum() # Total number of errors is calculated.
print('=========================================================================')
print(errors_lof)
print("Accuracy = ", accuracy_score(y_pred_lof,y))
print("Precision = ", precision_score(y,y_pred_lof))
print("Recall = ", recall_score(y,y_pred_lof))
print("F1 Score = ", f1_score(y,y_pred_lof))
print('=========================================================================')
# Note: arguments are passed as (y_pred, y_true), so "support" counts predicted labels.
print(classification_report(y_pred_lof,y))
print('=========================================================================')

985
Accuracy =  0.9965415175891043
Precision =  0.0
Recall =  0.0
F1 Score =  0.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    284314
           1       0.00      0.00      0.00       493

    accuracy                           1.00    284807
   macro avg       0.50      0.50      0.50    284807
weighted avg       1.00      1.00      1.00    284807

CPU times: user 26min 59s, sys: 9min 1s, total: 36min
Wall time: 32min 50s


The classification report for the Local Outlier Factor model shows that class 0 (non-fraud) has much higher precision and recall than class 1 (fraud): none of the 493 transactions flagged by LOF was an actual fraud, so precision, recall and F1 for class 1 are all 0.
- Recall: of all the elements that truly belong to a class, the fraction that the model finds.
- Precision: of all the elements the model assigns to a class, the fraction that truly belong to it.
- The F1-score is the harmonic mean of precision and recall.
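
The definitions above can be checked on a handful of labels. The sketch below uses made-up labels and a hypothetical helper `precision_recall`; it also shows why argument order matters. scikit-learn's metric functions expect `(y_true, y_pred)`, and passing them the other way round exchanges precision and recall.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class, computed from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: 2 real frauds, 4 transactions flagged, 1 fraud caught.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

print(precision_recall(y_true, y_pred))  # (0.25, 0.5)
print(precision_recall(y_pred, y_true))  # arguments swapped: (0.5, 0.25)
```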

---
## Isolation Forest

In [38]:
%%time
# Instantiate Isolation Forest
iso_for = IsolationForest(max_samples = len(X),contamination = outlier_fraction)

# Fitting the model.
iso_for.fit(X_scaled)

# Prediction using trained model.
y_pred_if = iso_for.predict(X_scaled) 
y_pred_if[y_pred_if == 1] = 0 # Valid transactions are labelled as 0.
y_pred_if[y_pred_if == -1] = 1 # Fraudulent transactions are labelled as 1.
errors_if = (y_pred_if != y).sum() # Total number of errors is calculated.
print('=========================================================================')
print("Total errors", errors_if)
print('Accuracy:', accuracy_score(y_pred_if,y))
print('=========================================================================')
# Note: arguments are passed as (y_pred, y_true), so "support" counts predicted labels.
print(classification_report(y_pred_if,y))

Total errors 643
Accuracy: 0.9977423307713644
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    284314
           1       0.35      0.35      0.35       493

    accuracy                           1.00    284807
   macro avg       0.67      0.67      0.67    284807
weighted avg       1.00      1.00      1.00    284807

CPU times: user 39.1 s, sys: 5.15 s, total: 44.2 s
Wall time: 45.2 s


In [45]:
# scikit-learn metrics take (y_true, y_pred) in that order
print("Accuracy = ", accuracy_score(y, y_pred_if))
print("Precision = ", precision_score(y, y_pred_if))
print("Recall = ", recall_score(y, y_pred_if))
print("F1 Score = ", f1_score(y, y_pred_if))


Accuracy =  0.9977423307713644
Precision =  0.34685598377281945
Recall =  0.3475609756097561
F1 Score =  0.3472081218274112


---
## One Class SVM

In [43]:
%%time
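# nu is an upper bound on the fraction of training points treated as outliers;
# degree is omitted because it only applies to the polynomial kernel.
one_svm = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05)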

# Fitting the model.
one_svm.fit(X_scaled)

# Prediction using trained model.
y_pred_svm = one_svm.predict(X_scaled) 
y_pred_svm[y_pred_svm == 1] = 0 # Valid transactions are labelled as 0.
y_pred_svm[y_pred_svm == -1] = 1 # Fraudulent transactions are labelled as 1.
errors_svm = (y_pred_svm != y).sum() # Total number of errors is calculated.

print('=========================================================================')
print("Total errors", errors_svm)
# scikit-learn metrics take (y_true, y_pred) in that order
print('Accuracy:', accuracy_score(y, y_pred_svm))
print("Precision = ", precision_score(y, y_pred_svm))
print("Recall = ", recall_score(y, y_pred_svm))
print("F1 Score = ", f1_score(y, y_pred_svm))
print('=========================================================================')
# Note: here the arguments are passed as (y_pred, y_true), so in this report the
# precision and recall columns are swapped and "support" counts predicted labels.
print(classification_report(y_pred_svm,y))

Total errors 13951
Accuracy: 0.951015951152886
Precision =  0.02924099335431969
Recall =  0.8495934959349594
F1 Score =  0.05653614661527017
              precision    recall  f1-score   support

           0       0.95      1.00      0.97    270512
           1       0.85      0.03      0.06     14295

    accuracy                           0.95    284807
   macro avg       0.90      0.51      0.52    284807
weighted avg       0.95      0.95      0.93    284807

CPU times: user 32min 51s, sys: 13.5 s, total: 33min 5s
Wall time: 37min 50s


---
## Results

The results are presented using three metrics: precision, recall and F-measure.

Precision describes how trustworthy the model's positive predictions are; that is, it tells us how many of the predicted frauds are actually frauds. High precision means that when the model classifies a point as positive, the classification is very likely correct. It is defined as: Precision = True Positive/(True Positive + False Positive).

Recall indicates the ability of the model to detect the positive class. A high recall means that the majority of positive data points are correctly identified. It is defined as: Recall = True Positive/(True Positive + False Negative).

Precision and recall often pull in opposite directions, so improving one typically worsens the other. To gain a more comprehensive view of the performance of the model, we can use the F-measure, defined as: F-Measure = 2(Precision ∗ Recall)/(Precision + Recall). These metrics are calculated for each of the models.
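
As a quick check of these formulas, the sketch below recomputes the three metrics from the confusion-matrix counts printed by the K-Means cell (tp = 194, fp = 129957, fn = 298). The helper name `precision_recall_f1` is made up for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts printed by the K-Means cell: tn = 154358, fp = 129957, fn = 298, tp = 194.
precision, recall, f1 = precision_recall_f1(tp=194, fp=129957, fn=298)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The values agree with the K-Means scores reported earlier (precision of about 0.0015, recall of about 0.39, F1 of about 0.003).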

- Isolation Forest made 643 errors, compared with 985 for Local Outlier Factor, 13,951 for One-Class SVM and 130,255 for K-Means (which classified a large share of non-fraud transactions as fraud).
- Isolation Forest has the highest accuracy at 99.77%, versus 99.65% for LOF, 95.1% for One-Class SVM and 54.3% for K-Means.
- On the fraud class, Isolation Forest gives the best balance: it catches about 35% of the fraud cases, and about 35% of the transactions it flags are real frauds (F1 of about 0.35). LOF catches none. One-Class SVM catches 418 of the 492 frauds (about 85%) but flags 14,295 transactions in total, so only about 3% of its flags are real frauds (F1 of about 0.06). K-Means catches about 39% of the frauds, but fewer than 0.2% of its flags are real frauds (F1 of about 0.003).
- Isolation Forest ran in about 45 seconds, far less than LOF (about 33 minutes) and One-Class SVM (about 38 minutes); only K-Means was faster (about 4 seconds).
- Overall, Isolation Forest was the most effective method at identifying fraud cases, with an F1 score of about 0.35 on the fraud class.
- We can conclude that Isolation Forest is a better anomaly detection algorithm than the others for the given data set.


***