# **Content**

**Anomaly Detection**
  
  * Outlier Detection
    <ul>
    <li>Local Outlier Factor</li>
    </ul>
  
  * Novelty Detection
    <ul>
    <li>Elliptic Envelope</li>
    <li>Isolation Forest</li>
    <li>Local Outlier Factor with Novelties</li>
    <li>One Class Support Vector Machine</li>
    <li>Stochastic Gradient Descent One Class Support Vector Machine</li>
    </ul>

# Using Google Colab with Tutorial

If you are using Google Colab, run the following code before running any tutorial code. If you are running the code locally, skip this section.
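One way to tell whether the setup cells apply is to check if the `google.colab` module has been loaded. This is a minimal sketch that assumes a Colab kernel preloads that module; the variable name `in_colab` is illustrative and not part of the tutorial.

```python
import sys

# Assumption: a Colab kernel has `google.colab` loaded at startup,
# so its presence in sys.modules signals that we are running in Colab.
in_colab = "google.colab" in sys.modules

if in_colab:
    print("Running in Google Colab: run the setup cells in this section.")
else:
    print("Running locally: the setup cells in this section can be skipped.")
```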

In [None]:
# Install the required packages
!pip install numpy
!pip install EMD_signal
!pip install ewtpy
!pip install matplotlib
!pip install padasip==1.2.2
!pip install pandas
!pip install pytz
!pip install PyWavelets
!pip install scikit_learn
!pip install sktime
!pip install scipy
!pip install seaborn
!pip install statsmodels==0.13.5
!pip install tslearn==0.6.1
!pip install vmdpy==0.2
!pip install tftb
!pip install tqdm
!pip install ssqueezepy
!pip install numba
!pip install jupyter
!pip install fastsst --no-deps
!pip install chart_studio
!pip install plotly

Collecting EMD_signal
  Downloading EMD_signal-1.6.4-py3-none-any.whl.metadata (8.9 kB)
Collecting pathos>=0.2.1 (from EMD_signal)
  Downloading pathos-0.3.3-py3-none-any.whl.metadata (11 kB)
Collecting ppft>=1.7.6.9 (from pathos>=0.2.1->EMD_signal)
  Downloading ppft-1.7.6.9-py3-none-any.whl.metadata (12 kB)
Collecting dill>=0.3.9 (from pathos>=0.2.1->EMD_signal)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting pox>=0.3.5 (from pathos>=0.2.1->EMD_signal)
  Downloading pox-0.3.5-py3-none-any.whl.metadata (8.0 kB)
Collecting multiprocess>=0.70.17 (from pathos>=0.2.1->EMD_signal)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
Downloading EMD_signal-1.6.4-py3-none-any.whl (75 kB)
Downloading pathos-0.3.3-py3-none-any.whl (82 kB)

In [None]:
# This code clones the SensorAI github repository to colab
  # Note: rerunning this code segment will give an error if the repository currently exists in your colab
!git clone https://github.com/wsonguga/SensorAI.git

# Once this code is executed, click the file icon to the left to verify all files have been cloned

Cloning into 'SensorAI'...
remote: Enumerating objects: 1024, done.
remote: Counting objects: 100% (359/359), done.
remote: Compressing objects: 100% (175/175), done.
remote: Total 1024 (delta 224), reused 314 (delta 184), pack-reused 665 (from 1)
Receiving objects: 100% (1024/1024), 308.05 MiB | 46.71 MiB/s, done.
Resolving deltas: 100% (556/556), done.


In [None]:
# Set the root path to github repository
import os

root_path = "/content/SensorAI"

repo_root = root_path  # same directory; os.path.join with a single argument returns it unchanged

!ls

sample_data  SensorAI


In [None]:
# Change to the tutorial repository
import os

os.chdir('SensorAI')

%ls             # display directory content

anomaly_detection/  install.sh*             temp.py
classification.py   instance/               tutorial_anomaly_detection.ipynb
clustering.py       PyEMD/                  tutorial_classification.ipynb
data/               pytorch_weight_norm.py  tutorial_clustering.ipynb
deeplearning.py     README.md               tutorial_dsp.ipynb
detection_old.py    regression.py           tutorial_regression.ipynb
detection.py        requirements.txt        utils.py
dsp.py              sk_grid_builder.py


In [None]:
# This command pulls any updated files from the repository
# This code segment may be re-executed at any point if there have been updates to the repository
!git pull https://github.com/wsonguga/SensorAI.git

From https://github.com/wsonguga/SensorAI
 * branch            HEAD       -> FETCH_HEAD
Already up to date.


# Anomaly Detection

"**Anomaly Detection** is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations." <a href="https://www.geeksforgeeks.org/machine-learning-for-anomaly-detection/">[1]</a>

Anomaly detection can be supervised, unsupervised, or semi-supervised.

There are two broad categories of anomaly detection, **outlier detection** and **novelty detection**.

<br>

**References**

1.   https://www.geeksforgeeks.org/machine-learning-for-anomaly-detection/

## Outlier Detection

In outlier detection, you are looking for anomalies that may lie within your current dataset [2]. Outlier detection is an unsupervised approach. It is used when you suspect there are anomalous data points that you wish to remove before using the data.
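The removal step described above can be sketched with Sklearn's `LocalOutlierFactor`, which is covered in the next subsection. The synthetic data and the variable names (`inliers`, `outliers`, `X_clean`) are assumptions made for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): a tight cluster plus three far-away points
inliers = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
outliers = np.array([[6.0, 6.0], [-6.0, 5.0], [5.0, -6.0]])
X = np.vstack([inliers, outliers])

# fit_predict labels every sample of the same dataset: +1 = inlier, -1 = outlier
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

# Keep only the samples labeled as inliers before any downstream use
X_clean = X[labels == 1]
print("samples before:", len(X), "after:", len(X_clean))
```

Because no labels are needed to fit the detector, this matches the unsupervised setting: the detector is fit on, and applied to, the same data.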

### Local Outlier Factor

Sklearn LOF documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

Local Outlier Factor (LOF) is an unsupervised method of outlier detection. An anomaly score, called the local outlier factor, is generated for each input sample. The LOF measures the deviation of a sample's local density with respect to that of its neighbors [3]. The lower a sample's density relative to its neighbors, the more likely it is that the sample is an outlier. Neighbors are loosely defined as the samples that are closest to the current one. A "nearest neighbors" algorithm is used to generate neighborhoods. For more information on the tree structures used to find neighbors, see the "Nearest Neighbors" subsection under "Classification."
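For reference, the commonly used definition of the score can be written as follows. The notation is introduced here for illustration: $d(a, b)$ is the distance between samples $a$ and $b$, $N_k(a)$ is the set of the $k$ nearest neighbors of $a$, and $k\text{-distance}(b)$ is the distance from $b$ to its $k$-th nearest neighbor.

$$
\begin{aligned}
\text{reach-dist}_k(a, b) &= \max\{\, k\text{-distance}(b),\; d(a, b) \,\} \\
\text{lrd}_k(a) &= \left( \frac{1}{|N_k(a)|} \sum_{b \in N_k(a)} \text{reach-dist}_k(a, b) \right)^{-1} \\
\text{LOF}_k(a) &= \frac{1}{|N_k(a)|} \sum_{b \in N_k(a)} \frac{\text{lrd}_k(b)}{\text{lrd}_k(a)}
\end{aligned}
$$

A value near 1 means the sample is about as dense as its neighbors, while values well above 1 indicate an outlier. Sklearn stores the negated value in `negative_outlier_factor_`, so there the more negative scores are the more anomalous ones.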

In Sklearn, the LOF algorithm has two modes, one for outliers and one for novelties. Here we only discuss the outlier mode. For more details on the novelty mode, see the *Local Outlier Factor with Novelties* section below.


**Sklearn parameters of note:**

**n_neighbors** - the number of neighbors used when comparing a sample's local density with that of its neighborhood.

**algorithm** - the algorithm used to find the nearest neighbors. The options are 'auto', 'ball_tree', 'kd_tree', and 'brute'. Note that 'auto' attempts to select the most appropriate method based on the values passed to the `fit` method.

**metric** - the distance metric to use. 'euclidean' is one of the most common, and the default, 'minkowski' with p=2, is equivalent to it. Sklearn offers a large number of options for distance metrics. They are all listed here: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
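As a minimal sketch of how these parameters are passed, the example below fits `LocalOutlierFactor` directly (not through the tutorial's `pipeBuild_LocalOutlierFactor` helper) with each explicit `algorithm` option. The random data and the `scores` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))  # synthetic data, assumed for illustration

scores = {}
for algorithm in ("ball_tree", "kd_tree", "brute"):
    lof = LocalOutlierFactor(n_neighbors=20, algorithm=algorithm, metric="euclidean")
    lof.fit(X)
    # negative_outlier_factor_ holds the negated LOF of each training sample
    scores[algorithm] = lof.negative_outlier_factor_

# All three options perform an exact neighbor search, so the scores should agree
for algorithm in ("kd_tree", "brute"):
    print(algorithm, np.allclose(scores["ball_tree"], scores[algorithm]))
```

The choice of `algorithm` mainly affects run time and memory use rather than the resulting scores, which is why a grid search over it tends to return near-identical candidates.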


In [None]:
# GET TIME SERIES DATASET & PRINT SAMPLE DATA
import numpy as np
import random
from dsp import sine_wave, triangle_wave, square_wave, generate_anomaly_data
from pathlib import Path # pathlib is OS agnostic
from sklearn.model_selection import train_test_split

# import the anomaly detector builders from the cloned python files
import detection as skn
#import sk_grid_builder as sgb

x, y = generate_anomaly_data(wave_number=50)
print("Length ",len(x[0]))

n_classes = int(np.amax(y)+1)
print("number of classes is ",n_classes)

lof = skn.pipeBuild_LocalOutlierFactor(algorithm=['ball_tree','kd_tree'],novelty=[False])

names=['LOF']
pipes=[lof]

# Build and run a grid search for anomaly detectors.  Outputs best model and heat map of each type.
skn.gridsearch_outlier(names=names,pipes=pipes,X=x,y=y,plot_number=3)
#skn.gridsearch_clustering(names=names,pipes=pipes,X=x,y=y,plot_number=3)

Length  1000
number of classes is  2
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Best parameter (CV score=nan):
{'lof__algorithm': 'ball_tree', 'lof__contamination': 'auto', 'lof__leaf_size': 30, 'lof__metric': 'minkowski', 'lof__metric_params': None, 'lof__n_jobs': None, 'lof__n_neighbors': 20, 'lof__novelty': False, 'lof__p': 2}


## Novelty Detection

In novelty detection, you plan to identify anomalies that may exist in new data. Novelty detection is either supervised or semi-supervised.
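A minimal sketch of this workflow uses Sklearn's `LocalOutlierFactor` in its novelty mode: the detector is fit on training data and then applied to unseen samples. The synthetic data and the names `X_train`, `X_new`, and `detector` are assumptions made for illustration, and the training data is assumed to be free of anomalies.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)

# Training data assumed to be free of anomalies (synthetic, for illustration)
X_train = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# New, unseen data: two typical points followed by two far-away points
X_new = np.array([[0.1, -0.2], [0.0, 0.3], [5.0, 5.0], [-6.0, 4.0]])

# novelty=True enables predict and decision_function on unseen data
detector = LocalOutlierFactor(n_neighbors=20, novelty=True)
detector.fit(X_train)

pred = detector.predict(X_new)               # +1 = normal, -1 = novelty
margin = detector.decision_function(X_new)   # negative values indicate novelties
print(pred)
print(margin)
```

In novelty mode, `predict` is meant for new data only; applying it to the training samples themselves is not the intended use.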

In [None]:
# GET TIME SERIES DATASET & PRINT SAMPLE DATA
import numpy as np
import random
from dsp import sine_wave, triangle_wave, square_wave, generate_anomaly_data
from pathlib import Path # pathlib is OS agnostic
from sklearn.model_selection import train_test_split

# import the anomaly detector builders from the cloned python files
import detection as skn
import sk_grid_builder as sgb

x, y = generate_anomaly_data(wave_number=50)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=42)

n_classes = int(np.amax(y_train)+1)
print("number of classes is ",n_classes)

lofn = skn.pipeBuild_LocalOutlierFactor(algorithm=['ball_tree','kd_tree'],novelty=[True])

names=['LOF Novelty']
pipes=[lofn]

# Build and run a grid search for anomaly detectors.  Outputs best model and heat map of each type.
skn.gridsearch_outlier(names=names,pipes=pipes,X=x,y=y,plot_number=3)

number of classes is  2
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Best parameter (CV score=-0.620):
{'lof__algorithm': 'ball_tree', 'lof__contamination': 'auto', 'lof__leaf_size': 30, 'lof__metric': 'minkowski', 'lof__metric_params': None, 'lof__n_jobs': None, 'lof__n_neighbors': 20, 'lof__novelty': True, 'lof__p': 2}


In [None]:
# GET TIME SERIES DATASET & PRINT SAMPLE DATA
import numpy as np
import random
from dsp import sine_wave, triangle_wave, square_wave, generate_anomaly_data
from pathlib import Path # pathlib is OS agnostic
from sklearn.model_selection import train_test_split

# import the anomaly detector builders from the cloned python files
import detection as skn
import classification as skc

x, y = generate_anomaly_data(wave_number=50)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=42)

n_classes = int(np.amax(y_train)+1)
print("number of classes is ",n_classes)

sst = skn.pipeBuild_SstDetector(y=y,win_length=20)

names=['SST']
pipes=[sst]

# Build and run a grid search for anomaly detectors.  Outputs best model and heat map of each type.
#skn.gridsearch_outlier(names=names,pipes=pipes,X=x,y=y,plot_number=3)
#skc.gridsearch_classifier(names=names,pipes=pipes,X_train=X_train,X_test=X_test,y_train=y_train,y_test=y_test,plot_number=3)
skn.gridsearch_clustering(names=names,pipes=pipes,X=x,y=y,plot_number=3)

number of classes is  2
Fitting 5 folds for each of 1 candidates, totalling 5 fits


BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

In [None]:
# GET TIME SERIES DATASET & PRINT SAMPLE DATA
import numpy as np
import random
from dsp import sine_wave, triangle_wave, square_wave, generate_anomaly_data
from pathlib import Path # pathlib is OS agnostic
from sklearn.model_selection import train_test_split

# import the anomaly detector builders from the cloned python files
import detection as skn
#import sk_grid_builder as sgb

x, y = generate_anomaly_data(wave_number=50)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=42)

n_classes = int(np.amax(y_train)+1)
print("number of classes is ",n_classes)

iso = skn.pipeBuild_IsolationForest(n_estimators=[50,100])

names=['Isolation Forest']
pipes=[iso]

# Build and run a grid search for anomaly detectors.  Outputs best model and heat map of each type.
skn.gridsearch_outlier(names=names,pipes=pipes,X=x,y=y,plot_number=3)

number of classes is  2
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Best parameter (CV score=-1.860):
{'isofrst__bootstrap': False, 'isofrst__contamination': 'auto', 'isofrst__max_features': 1.0, 'isofrst__max_samples': 'auto', 'isofrst__n_estimators': 50, 'isofrst__n_jobs': None, 'isofrst__verbose': 0, 'isofrst__warm_start': False}
