# CSV Processing Example

This demo notebook shows how to use the `lstm_dynamic_threshold.json` pipeline to analyze a collection of signal CSV files and then retrieve the list of Events that were found.

## 1. Create an OrionExplorer Instance

In this first step, we set up the environment, import `OrionExplorer`, and create
an instance, passing the name of the database that we want to connect to.

In [1]:
import logging

logging.basicConfig(level=logging.ERROR)
logging.getLogger().setLevel(level=logging.ERROR)

import warnings
warnings.simplefilter("ignore")

In [2]:
from orion.explorer import OrionExplorer

In [3]:
explorer = OrionExplorer(database='orion-process-csvs')

In this case, we drop the database before starting, to make sure that we are working
in a clean environment.

**WARNING**: This will remove all the data that exists in this database!

In [4]:
explorer.drop_database()

## 2. Add the pipeline that we will be using

The second step is to register the pipeline that we are going to use.

For this, we provide:

* a pipeline name.
* the path to the `lstm_dynamic_threshold` JSON file.
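The path is relative to the notebook's working directory, so it may be worth checking that it resolves before registering the pipeline. The following is a minimal sketch using only the standard library; `load_pipeline_spec` is a hypothetical helper and not part of Orion.

```python
import json
import os


def load_pipeline_spec(path):
    """Load a pipeline JSON file, failing early with a clear message if it is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            'Pipeline file not found: {} (resolved to {})'.format(
                path, os.path.abspath(path)
            )
        )

    with open(path) as json_file:
        return json.load(json_file)
```

For example, `load_pipeline_spec('../orion/pipelines/lstm_dynamic_threshold.json')` would return the parsed JSON as a Python object, which can be inspected before calling `add_pipeline`. The exact layout of that file is not shown in this notebook, so any key looked up in the result is an assumption.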

In [5]:
pipeline = explorer.add_pipeline(
    'lstm_dynamic_threshold',
    '../orion/pipelines/lstm_dynamic_threshold.json'
)

Afterwards, we can list the registered pipelines to check that it was added correctly.

In [6]:
explorer.get_pipelines()

| | pipeline_id | insert_time | mlpipeline | name |
|---|---|---|---|---|
| 0 | 5ca78d476c1cea5a4c0e5eb9 | 2019-04-05 17:15:51.913 | {'primitives': ['mlprimitives.custom.timeserie... | lstm_dynamic_threshold |


## 3. Get the list of CSV files

In this example we will use the `os` module to find the list of CSV files that exist inside the directory
`data` that we have created inside this `notebooks` folder.

Alternatively, we could provide an explicit list of filenames.

In [7]:
import os

CSVS_FOLDER = './data'

csvs = os.listdir(CSVS_FOLDER)
csvs

['S-1.csv', 'S-2.csv', 'P-1.csv', 'E-1.csv']
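Note that `os.listdir` returns every entry in the folder, not only CSV files, and in arbitrary order. If the `data` directory could contain other files, a filter may be useful. The following is a minimal sketch; `list_csv_files` is a hypothetical helper and not part of Orion.

```python
import os


def list_csv_files(folder):
    """Return the sorted names of the `.csv` files located directly inside `folder`."""
    return sorted(
        entry
        for entry in os.listdir(folder)
        if entry.lower().endswith('.csv')
        and os.path.isfile(os.path.join(folder, entry))
    )
```

With this helper, `csvs = list_csv_files(CSVS_FOLDER)` could replace the plain `os.listdir` call; the only visible difference for the folder used here would be the ordering of the names.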

## 3. Register the new datasets

We will execute a loop in which, for each CSV file, we will register a new Dataset in the Database.

For each CSV, the name that we will use for both the dataset and the signal is the name of the file without the `.csv` extension, and we will leave the `satellite_id` blank.

In this case we need no additional arguments, such as `timestamp_column` or `value_column`, but if they were required
we would add them to the `add_dataset` call.
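One way to find out whether these arguments are needed is to look at the header row of each CSV. The sketch below assumes that the files have a header row; `read_header` is a hypothetical helper, and the column names that Orion expects by default are not stated in this notebook.

```python
import csv


def read_header(csv_path):
    """Return the column names found on the first row of a CSV file.

    An empty list is returned if the file has no rows at all.
    """
    with open(csv_path, newline='') as csv_file:
        return next(csv.reader(csv_file), [])
```

If the header of a file shows different column names than the ones expected, those names could be passed as `timestamp_column` and `value_column` in the `add_dataset` call.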

We will also capture the output of the `add_dataset` call in a list, so we can use these datasets later on.

In [8]:
datasets = []

for path in csvs:
    # Dataset and signal name: the file name without its extension
    name = os.path.splitext(os.path.basename(path))[0]
    location = os.path.join(CSVS_FOLDER, path)
    print('Adding dataset {} for CSV {}'.format(name, location))
    dataset = explorer.add_dataset(
        name,
        name,
        location=location,
        timestamp_column=None,    # Replace if needed
        value_column=None,        # Replace if needed
    )
    datasets.append(dataset)

Adding dataset S-1 for CSV ./data/S-1.csv
Adding dataset S-2 for CSV ./data/S-2.csv
Adding dataset P-1 for CSV ./data/P-1.csv
Adding dataset E-1 for CSV ./data/E-1.csv


Afterwards, we can check that the datasets were properly registered.

In [9]:
explorer.get_datasets()

| | dataset_id | data_location | insert_time | name | signal_set | start_time | stop_time |
|---|---|---|---|---|---|---|---|
| 0 | 5ca78d4a6c1cea5a4c0e5eba | ./data/S-1.csv | 2019-04-05 17:15:54.728 | S-1 | S-1 | 1222819200 | 1442016000 |
| 1 | 5ca78d4a6c1cea5a4c0e5ebb | ./data/S-2.csv | 2019-04-05 17:15:54.767 | S-2 | S-2 | 1222819200 | 1282262400 |
| 2 | 5ca78d4a6c1cea5a4c0e5ebc | ./data/P-1.csv | 2019-04-05 17:15:54.775 | P-1 | P-1 | 1222819200 | 1468540800 |
| 3 | 5ca78d4a6c1cea5a4c0e5ebd | ./data/E-1.csv | 2019-04-05 17:15:54.783 | E-1 | E-1 | 1222819200 | 1468951200 |


## 4. Run the pipeline on the datasets

Once the pipeline and the datasets are registered, we can start the processing loop.

In [10]:
for dataset in datasets:
    print('Analyzing dataset {}'.format(dataset.name))
    explorer.analyze(dataset.name, pipeline.name)

Analyzing dataset S-1


Using TensorFlow backend.


Analyzing dataset S-2
Analyzing dataset P-1
Analyzing dataset E-1
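In the loop above, an exception raised while analyzing one dataset would stop the processing of the remaining ones. When many CSV files are involved, a more defensive loop may be preferable. The following is a minimal sketch; `run_all` is a hypothetical helper, not part of Orion, and it assumes that the items are hashable (for example, dataset names).

```python
def run_all(items, process):
    """Call `process(item)` for every item, collecting failures instead of stopping.

    Returns a dict that maps each failed item to the exception it raised.
    """
    failures = {}
    for item in items:
        try:
            process(item)
        except Exception as error:  # broad on purpose: one bad item must not stop the rest
            failures[item] = error

    return failures


# Hypothetical usage with the objects created in this notebook:
# failures = run_all(
#     [dataset.name for dataset in datasets],
#     lambda name: explorer.analyze(name, pipeline.name),
# )
```

An empty `failures` dict would mean that every dataset was processed without raising an exception.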


## 5. Analyze the results

Once the execution has finished, we can explore the Dataruns and the detected Events.

In [11]:
explorer.get_dataruns()

| | datarun_id | dataset | end_time | events | insert_time | pipeline | start_time | status |
|---|---|---|---|---|---|---|---|---|
| 0 | 5ca78d4e6c1cea5a4c0e5ebe | 5ca78d4a6c1cea5a4c0e5eba | 2019-04-05 17:17:42.250 | 3 | 2019-04-05 17:15:58.517 | 5ca78d476c1cea5a4c0e5eb9 | 2019-04-05 17:15:58.516 | done |
| 1 | 5ca78db66c1cea5a4c0e5ec2 | 5ca78d4a6c1cea5a4c0e5ebb | 2019-04-05 17:18:11.119 | 1 | 2019-04-05 17:17:42.348 | 5ca78d476c1cea5a4c0e5eb9 | 2019-04-05 17:17:42.347 | done |
| 2 | 5ca78dd36c1cea5a4c0e5ec4 | 5ca78d4a6c1cea5a4c0e5ebc | 2019-04-05 17:20:10.398 | 1 | 2019-04-05 17:18:11.218 | 5ca78d476c1cea5a4c0e5eb9 | 2019-04-05 17:18:11.217 | done |
| 3 | 5ca78e4a6c1cea5a4c0e5ec6 | 5ca78d4a6c1cea5a4c0e5ebd | 2019-04-05 17:22:09.128 | 1 | 2019-04-05 17:20:10.504 | 5ca78d476c1cea5a4c0e5eb9 | 2019-04-05 17:20:10.503 | done |


In [12]:
explorer.get_events()

| | event_id | datarun | insert_time | score | start_time | stop_time | comments |
|---|---|---|---|---|---|---|---|
| 0 | 5ca78db66c1cea5a4c0e5ebf | 5ca78d4e6c1cea5a4c0e5ebe | 2019-04-05 17:17:42.198 | 0.212172 | 1222840800 | 1222840800 | 0 |
| 1 | 5ca78db66c1cea5a4c0e5ec0 | 5ca78d4e6c1cea5a4c0e5ebe | 2019-04-05 17:17:42.248 | 0.188293 | 1222884000 | 1222884000 | 0 |
| 2 | 5ca78db66c1cea5a4c0e5ec1 | 5ca78d4e6c1cea5a4c0e5ebe | 2019-04-05 17:17:42.249 | 0.223725 | 1398060000 | 1399464000 | 0 |
| 3 | 5ca78dd36c1cea5a4c0e5ec3 | 5ca78db66c1cea5a4c0e5ec2 | 2019-04-05 17:18:11.118 | 4.906747 | 1256990400 | 1257120000 | 0 |
| 4 | 5ca78e4a6c1cea5a4c0e5ec5 | 5ca78dd36c1cea5a4c0e5ec4 | 2019-04-05 17:20:10.397 | 0.014155 | 1351728000 | 1351749600 | 0 |
| 5 | 5ca78ec16c1cea5a4c0e5ec7 | 5ca78e4a6c1cea5a4c0e5ec6 | 2019-04-05 17:22:09.127 | 0.032443 | 1406095200 | 1406138400 | 0 |
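
The `start_time` and `stop_time` values of the events are integers that look like Unix timestamps expressed in seconds. Under that assumption, which this notebook does not confirm, they can be converted into readable dates with the standard library. `describe_event` is a hypothetical helper used only for illustration.

```python
from datetime import datetime, timezone


def describe_event(start_time, stop_time):
    """Convert epoch-second bounds into UTC datetimes plus the event duration."""
    start = datetime.fromtimestamp(start_time, tz=timezone.utc)
    stop = datetime.fromtimestamp(stop_time, tz=timezone.utc)
    return start, stop, stop - start


# Bounds of the third event listed above (index 2)
start, stop, duration = describe_event(1398060000, 1399464000)
print(start.isoformat(), stop.isoformat(), duration)
```

The same conversion could be applied to the `start_time` and `stop_time` columns of the datasets listing.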
