# Anomaly Detection Tutorial

In this tutorial we will learn:

- Getting Data: How to import data from PyCaret repository?
- Setting up Environment: How to setup experiment in PyCaret to get started with building anomaly models?
- Create Model: How to create a model and assign anomaly labels to original dataset for analysis?
- Plot Model: How to analyze model performance using various plots?
- Predict Model: How to assign anomaly labels to new and unseen dataset based on trained model?
- Save / Load Model: How to save / load model for future use?

In [None]:
# Logging setup
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(level=logging.ERROR)
logging.getLogger('sintel').setLevel(level=logging.INFO)

import warnings
warnings.simplefilter("ignore")

import os

## Creating an instance of the DBExplorer

Sintel requires the use of MongoDB to store data.
In order to connect to the database, all you need to do is import and create an instance of the class.

To create the `DBExplorer` instance you will need to pass:

* `user`: An identifier of the user that is running Sintel.
* `database`: The name of the MongoDB database to use. This is optional and defaults to `sintel`.

In [2]:
from sintel.db import DBExplorer

db_name = 'sintel'
dbex = DBExplorer(user='dyu', database=db_name)

This will directly create a connection to the database named `'sintel'` at the default
MongoDB host, `localhost`, and port, `27017`.

If you want to connect to a different database, host or port, or if user authentication
is enabled in your MongoDB instance, you can pass a dictionary or a path to a JSON file containing
any of the following additional arguments:

* `host`: Hostname or IP address of the MongoDB Instance. Defaults to `'localhost'`.
* `port`: Port to which MongoDB is listening. Defaults to `27017`.
* `username`: username to authenticate with.
* `password`: password to authenticate with.
* `authentication_source`: database to authenticate against.
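
As a minimal sketch of the JSON-file option, the snippet below writes the settings listed above to a file and reads them back. The values and the file name `mongodb_config.json` are placeholders, and the name of the `DBExplorer` argument that receives the dictionary or path is not stated in this tutorial, so check the class signature before passing it in.

```python
import json
import tempfile
from pathlib import Path

# Connection settings using the keys listed above; all values are placeholders.
mongodb_config = {
    "host": "localhost",
    "port": 27017,
    "username": "my_user",             # placeholder
    "password": "my_password",         # placeholder
    "authentication_source": "admin",  # placeholder
}

# Write the settings to a JSON file (in a temporary directory for this sketch).
config_path = Path(tempfile.mkdtemp()) / "mongodb_config.json"
config_path.write_text(json.dumps(mongodb_config, indent=4))

# Read the file back to confirm that it holds what we expect.
loaded = json.loads(config_path.read_text())
print(loaded["host"], loaded["port"])
```

Keeping credentials in a separate file rather than in the notebook makes it easier to avoid committing them by accident.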

Once we have created the `DBExplorer` instance, and to be sure that we are ready to follow
the tutorial, let's do the following two set-up steps:

1. Drop the currently existing `sintel` database

**WARNING**: This will remove all the data that exists in this database!

In [3]:
dbex.drop_database()

2. Make sure to have downloaded some demo data using the `sintel.data.download_demo()` function

In [4]:
from sintel.data import download_demo

download_demo(path='data')

INFO:sintel.data:Downloading Sintel Demo Data to folder nasa


This will download the 3 CSV files that we will use later on into the folder given by `path`
(here, `data`), creating it in your current directory.

## Add information to the database


### 1. Add a Dataset

In order to add a dataset you can use the `add_dataset` method, which has the following arguments:

* `name (str)`: Name of the dataset
* `entity (str)`: Name or Id of the entity which this dataset is associated to

Let's create the `NASA` dataset that we will use for our demo.

In [5]:
dataset = dbex.add_dataset(
    name='NASA',
    entity='NASA',
)

This call will try to create a new _Dataset_ object in the database and return it.

We can now see the _Dataset_ that we just created using the `get_datasets` method:

In [6]:
dbex.get_datasets()

Unnamed: 0,dataset_id,created_by,entity,insert_time,name
0,620dec235c92715ecc1f88a6,dyu,NASA,2022-02-17 06:33:06.839,NASA


### 2. Add a Signal

The next step is to add Signals. This can be done with the `add_signal` method, which expects:

* `name (str)`: Name of the signal
* `dataset (Dataset or ObjectID)`: Dataset Object or Dataset Id.
* `start_time (int)`: (Optional) minimum timestamp to be used for this signal. If not given, it
  defaults to the minimum timestamp found in the data.
* `stop_time (int)`: (Optional) maximum timestamp to be used for this signal. If not given, it
  defaults to the maximum timestamp found in the data.
* `data_location (str)`: URI of the dataset
* `timestamp_column (int)`: (Optional) index of the timestamp column. Defaults to 0.
* `value_column (int)`: (Optional) index of the value column. Defaults to 1.

For example, adding the `S-1` signal to the dataset that we just created could be done like
this:

In [7]:
dbex.add_signal(
    name='S-1',
    dataset=dataset,
    data_location=os.path.join('data', 'S-1.csv')
)

<Signal: Signal object>
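
To make the optional arguments of `add_signal` more concrete, here is a small standalone sketch of how the column indices and time bounds could be applied to a CSV file. The `read_signal` helper is hypothetical and not part of Sintel. It assumes a header row, integer timestamps and inclusive bounds, none of which is confirmed by this tutorial.

```python
import csv
import io


def read_signal(csv_file, timestamp_column=0, value_column=1,
                start_time=None, stop_time=None):
    """Hypothetical helper: return (timestamp, value) pairs within the time bounds."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row (assumed to be present)
    rows = [
        (int(float(row[timestamp_column])), float(row[value_column]))
        for row in reader
    ]
    timestamps = [timestamp for timestamp, _ in rows]
    # Missing bounds default to the extremes found in the data.
    if start_time is None:
        start_time = min(timestamps)
    if stop_time is None:
        stop_time = max(timestamps)
    # Bounds are treated as inclusive here (an assumption).
    return [(t, v) for t, v in rows if start_time <= t <= stop_time]


demo_csv = io.StringIO(
    "timestamp,value\n"
    "1222819200,0.1\n"
    "1222840800,0.5\n"
    "1222862400,-0.3\n"
)
print(read_signal(demo_csv, start_time=1222840800))
# [(1222840800, 0.5), (1222862400, -0.3)]
```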

Additionally, we can also add all the signals that exist inside a folder by using the `add_signals`
method, passing a `signals_path`:

In [8]:
dbex.add_signals(
    dataset=dataset,
    signals_path='data'
)

After this is done, we can see that one signal has been created for each one of the CSV
files that we downloaded before.

In [9]:
dbex.get_signals(dataset=dataset)

Unnamed: 0,signal_id,created_by,data_location,dataset,insert_time,name,start_time,stop_time
0,620dec235c92715ecc1f88a7,dyu,nasa/S-1.csv,620dec235c92715ecc1f88a6,2022-02-17 06:33:07.391,S-1,1222819200,1442016000
1,620dec235c92715ecc1f88a9,dyu,nasa/P-1.csv,620dec235c92715ecc1f88a6,2022-02-17 06:33:07.865,P-1,1222819200,1468540800
2,620dec235c92715ecc1f88aa,dyu,nasa/E-1.csv,620dec235c92715ecc1f88a6,2022-02-17 06:33:07.877,E-1,1222819200,1468951200
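
The signal names in the table above match the CSV file names without their extension. A rough sketch of that folder scan is shown below. `find_signal_files` is a hypothetical helper written for illustration, not the code that `add_signals` actually runs.

```python
import tempfile
from pathlib import Path


def find_signal_files(signals_path):
    """Hypothetical helper: map each CSV file stem in a folder to its path."""
    folder = Path(signals_path)
    return {path.stem: path for path in sorted(folder.glob("*.csv"))}


# Build a throwaway folder with three empty signal files and one unrelated file.
demo_dir = Path(tempfile.mkdtemp())
for name in ("S-1", "P-1", "E-1"):
    (demo_dir / f"{name}.csv").write_text("timestamp,value\n")
(demo_dir / "notes.txt").write_text("not a signal")

print(list(find_signal_files(demo_dir)))
# ['E-1', 'P-1', 'S-1']
```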


### 3. Add a Template

The next thing we need to add is a _Template_ to the Database using the `add_template` method.

This method expects:

* `name (str)`: Name of the template.
* `template (dict or str)`: Optional. Specification of the template to use, which can be one of:
    * An MLPipeline instance
    * The name of a registered template
    * A dict containing the MLPipeline details
    * The path to a pipeline JSON file.

In [10]:
template = dbex.add_template(
    name='lstmdt',
    template='./pipelines/orion_lstmdt.json',
)

Using TensorFlow backend.


We can now see the _Template_ that we just created:

In [11]:
dbex.get_templates()

Unnamed: 0,template_id,created_by,insert_time,name
0,620dec285c92715ecc1f88ab,dyu,2022-02-17 06:33:11.788,lstmdt


During this step, apart from the _Template_ object, a _Pipeline_ object has also been
registered with the same name as the _Template_, using the default hyperparameter values.

In [12]:
dbex.get_pipelines()

Unnamed: 0,pipeline_id,created_by,insert_time,name,template
0,620dec285c92715ecc1f88ac,dyu,2022-02-17 06:33:12.275,lstmdt,620dec285c92715ecc1f88ab


However, if we want to use a configuration different from the default, we might want to
create another _Pipeline_ with custom hyperparameter values.

In order to do this we will need to call the `add_pipeline` method passing:

* `name (str)`: Name given to this pipeline
* `template (Template or ObjectID)`: Template or the corresponding id.
* `hyperparameters (dict or str)`: dict containing the hyperparameter details or path to the
  corresponding JSON file. Optional.

For example, if we want to specify a different number of epochs for the LSTM primitive of the
pipeline that we just created we will run:

In [13]:
new_hyperparameters = {
   'keras.Sequential.LSTMTimeSeriesRegressor#1': {
       'epochs': 1,
       'verbose': True
   }
}
pipeline = dbex.add_pipeline(
   name='lstmdt_1_epoch',
   template=template,
   hyperparameters=new_hyperparameters,
)

And we can see how a new _Pipeline_ was created in the Database.

In [14]:
dbex.get_pipelines()

Unnamed: 0,pipeline_id,created_by,insert_time,name,template
0,620dec285c92715ecc1f88ac,dyu,2022-02-17 06:33:12.275,lstmdt,620dec285c92715ecc1f88ab
1,620dec285c92715ecc1f88ad,dyu,2022-02-17 06:33:12.857,lstmdt_1_epoch,620dec285c92715ecc1f88ab
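
Conceptually, the custom values override the defaults primitive by primitive, while everything not mentioned keeps its default. The sketch below illustrates that idea only. `apply_hyperparameters` is a hypothetical helper, and the default values shown are made up for the example rather than taken from the `lstmdt` template.

```python
import copy


def apply_hyperparameters(defaults, overrides):
    """Hypothetical helper: return the defaults updated with per-primitive overrides."""
    merged = copy.deepcopy(defaults)  # leave the defaults untouched
    for primitive, values in overrides.items():
        merged.setdefault(primitive, {}).update(values)
    return merged


# Made-up defaults, used only to illustrate the merge.
default_hyperparameters = {
    'keras.Sequential.LSTMTimeSeriesRegressor#1': {
        'epochs': 35,
        'verbose': False,
        'batch_size': 64,
    }
}
new_hyperparameters = {
    'keras.Sequential.LSTMTimeSeriesRegressor#1': {
        'epochs': 1,
        'verbose': True
    }
}

merged = apply_hyperparameters(default_hyperparameters, new_hyperparameters)
print(merged['keras.Sequential.LSTMTimeSeriesRegressor#1'])
# {'epochs': 1, 'verbose': True, 'batch_size': 64}
```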


### 4. Add an Experiment

Once we have a _Dataset_ with _Signals_ and a _Template_, we are ready to add an
_Experiment_.

In order to run an _Experiment_ we will need to:

1. Get the _Dataset_ and the list of _Signals_ that we want to run the _Experiment_ on.
2. Get the _Template_ which we want to use for the _Experiment_
3. Call the `add_experiment` method, passing all of these together with an experiment name and a
   project name (the user was already set when creating the `DBExplorer`).

For example, if we want to create an experiment using the _Dataset_, the _Signals_ and the
_Template_ that we just created, we will use:

In [15]:
experiment = dbex.add_experiment(
    name='Demo Experiment',
    project='Demo Project',
    template=template,
    dataset=dataset,
)

This will create an _Experiment_ object in the database using the indicated _Template_
and all the _Signals_ from the given _Dataset_.

In [16]:
dbex.get_experiments()

Unnamed: 0,experiment_id,created_by,dataset,insert_time,name,project,signals,template
0,620dec295c92715ecc1f88ae,dyu,620dec235c92715ecc1f88a6,2022-02-17 06:33:12.899,Demo Experiment,Demo Project,"[620dec235c92715ecc1f88a7, 620dec235c92715ecc1...",620dec285c92715ecc1f88ab


## Starting a Datarun

Once we have created our _Experiment_ object we are ready to start executing _Pipelines_ on our
_Signals_.

For this we will need to use the `sintel.runners.anomaly_detection.start_datarun` function, which expects:

* `orex (DBExplorer)`: The `DBExplorer` instance.
* `experiment (Experiment or ObjectID)`: Experiment object or the corresponding ID.
* `pipeline (Pipeline or ObjectID)`: Pipeline object or the corresponding ID.

This will create a _Datarun_ object for this _Experiment_ and _Pipeline_ in the database,
and then it will start creating and executing _Signalruns_, one for each _Signal_ in the _Experiment_.

Let's trigger a _Datarun_ using the `lstmdt_1_epoch` _Pipeline_ that we created.

In [17]:
from sintel.runners.anomaly_detection import start_datarun

start_datarun(dbex, experiment, pipeline)

INFO:sintel.runners.anomaly_detection:Datarun 620dec295c92715ecc1f88af started
INFO:sintel.runners.anomaly_detection:Signalrun 620dec295c92715ecc1f88b0 started
INFO:sintel.runners.anomaly_detection:Running pipeline lstmdt_1_epoch on signal S-1
2022-02-17 01:33:17.749422: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-02-17 01:33:17.762961: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f9d52186d30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-02-17 01:33:17.762973: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version


Train on 7919 samples, validate on 1980 samples
Epoch 1/1


INFO:sintel.runners.anomaly_detection:Processing pipeline lstmdt_1_epoch predictions on signal S-1
INFO:sintel.runners.anomaly_detection:Signalrun 620dec595c92715ecc1f88c0 started
INFO:sintel.runners.anomaly_detection:Running pipeline lstmdt_1_epoch on signal P-1


Train on 8901 samples, validate on 2226 samples
Epoch 1/1


INFO:sintel.runners.anomaly_detection:Processing pipeline lstmdt_1_epoch predictions on signal P-1
INFO:sintel.runners.anomaly_detection:Signalrun 620decb75c92715ecc1f88cf started
INFO:sintel.runners.anomaly_detection:Running pipeline lstmdt_1_epoch on signal E-1


Train on 8916 samples, validate on 2230 samples
Epoch 1/1


INFO:sintel.runners.anomaly_detection:Processing pipeline lstmdt_1_epoch predictions on signal E-1
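
The log above shows the relationship between the objects: one _Datarun_ is started, followed by one _Signalrun_ per _Signal_. The following toy model mirrors that structure with plain Python objects. The class names, status strings and error handling are all assumptions made for illustration and do not reflect how Sintel implements `start_datarun`.

```python
from dataclasses import dataclass, field


@dataclass
class Signalrun:
    signal: str
    status: str = "PENDING"  # status names are made up for this sketch


@dataclass
class Datarun:
    pipeline: str
    signalruns: list = field(default_factory=list)


def run_datarun(pipeline, signals, detect):
    """Toy model: create one Signalrun per signal and run `detect` on each."""
    datarun = Datarun(pipeline)
    for signal in signals:
        signalrun = Signalrun(signal)
        datarun.signalruns.append(signalrun)
        try:
            detect(signal)
            signalrun.status = "SUCCESS"
        except Exception:
            # A failing signal does not stop the remaining ones (an assumption).
            signalrun.status = "ERRORED"
    return datarun


datarun = run_datarun("lstmdt_1_epoch", ["S-1", "P-1", "E-1"], detect=lambda s: None)
print([(run.signal, run.status) for run in datarun.signalruns])
# [('S-1', 'SUCCESS'), ('P-1', 'SUCCESS'), ('E-1', 'SUCCESS')]
```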


## Add anomaly detection specific information to the database 

The following collections will be added:
- **signal_raw**: For each signal, saves the raw CSV data at a given interval.
- **prediction**: For each signalrun, saves the prediction results.
- **period**: For each signalrun, saves the preprocessed X organized hierarchically by period (year -> month -> day -> hour).

In [18]:
from sintel.db.utils import update_db

update_db(dbex._fs)

INFO:sintel.db.utils:1/3: Processing signal S-1
INFO:sintel.db.utils:2/3: Processing signal P-1
INFO:sintel.db.utils:3/3: Processing signal E-1
INFO:sintel.db.utils:1/3: Processing signalrun 620dec295c92715ecc1f88b0
INFO:sintel.db.utils:Pipeline name lstmdt_1_epoch
INFO:sintel.db.utils:2/3: Processing signalrun 620dec595c92715ecc1f88c0
INFO:sintel.db.utils:Pipeline name lstmdt_1_epoch
INFO:sintel.db.utils:3/3: Processing signalrun 620decb75c92715ecc1f88cf
INFO:sintel.db.utils:Pipeline name lstmdt_1_epoch
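
To give a feel for the year -> month -> day -> hour organization of the **period** collection, the sketch below groups Unix timestamps into such a hierarchy. It is only an illustration: the `group_by_period` helper is hypothetical, timestamps are assumed to be in UTC, and the actual document layout that `update_db` stores may differ.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_by_period(timestamps):
    """Hypothetical helper: nest timestamps as year -> month -> day -> [hours]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for timestamp in timestamps:
        # Interpret each timestamp as seconds since the epoch, in UTC (an assumption).
        moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        tree[moment.year][moment.month][moment.day].append(moment.hour)
    return tree


periods = group_by_period([1222819200, 1222840800, 1222905600])
print(periods[2008][10][1])  # hours seen on 2008-10-01
# [0, 6]
```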


## Use RESTful APIs to explore results

Please follow the steps below to start your exploration:
1. Open `./sintel/config.yml` and make sure that `db` is set to the `db_name` used in this tutorial (`sintel`).
2. Launch the Sintel backend with the command:
```bash
sintel run -v
```
3. Open a new tab in your browser and access http://localhost:3000/apidocs/
4. Start your exploration
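
Before opening the browser, it can be handy to check from Python that the backend is answering. The sketch below uses only the standard library and the `/apidocs/` URL mentioned above. `is_reachable` is a hypothetical helper, and no specific API endpoint is assumed.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def is_reachable(url, timeout=2.0):
    """Hypothetical helper: return True if an HTTP server answers at `url`."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        # The server answered, although with an error status code.
        return True
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar problems.
        return False


# Prints True only if the Sintel backend is running locally on port 3000.
print(is_reachable("http://localhost:3000/apidocs/"))
```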

## Use MTV — the visual interface to explore results

1. Download MTV
```bash
git clone https://github.com/sintel-dev/MTV mtv
```
2. Follow the instructions at https://github.com/sintel-dev/MTV to launch the MTV client.
3. Go to http://localhost:4200 and start your exploration.