# GreenGuard Quickstart

This notebook shows how to use GreenGuard to:

- Load demo data
- Find available pipelines and load one as a template
- Tune the template arguments to generate the optimal pipeline
- Fit the pipeline to our data
- Make predictions using the pipeline
- Evaluate the goodness-of-fit

## 0. Set up the logging

This step sets up logging in our environment to increase our visibility over
the steps that GreenGuard performs.

In [1]:
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(level=logging.INFO)

import warnings
warnings.simplefilter("ignore")
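If the INFO output turns out to be too verbose, the level can also be adjusted per logger rather than globally. The following is a minimal sketch that uses only the standard `logging` module. The logger names `greenguard.pipeline` and `btb` are taken from the log lines printed later in this notebook, and the chosen levels are only an example.

```python
import logging

# Keep INFO as the default level, as in the cell above.
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)

# Hide the INFO messages of the tuner but keep its warnings and errors.
# 'btb' is the logger name that appears in the tuning output of this notebook.
logging.getLogger('btb').setLevel(logging.WARNING)

pipeline_logger = logging.getLogger('greenguard.pipeline')
btb_logger = logging.getLogger('btb')

# Check which loggers would still emit INFO messages.
print(pipeline_logger.isEnabledFor(logging.INFO))
print(btb_logger.isEnabledFor(logging.INFO))
```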

## 1. Load the Data

The first step is to load the data that we are going to use.

To use the demo data included in GreenGuard, call the `greenguard.demo.load_demo` function.

In [2]:
from greenguard.demo import load_demo

target_times, readings = load_demo()

This will download some demo data from the
[GreenGuard S3 demo bucket](https://d3-ai-greenguard.s3.amazonaws.com/index.html)
and load it as the necessary `target_times` and `readings` tables.

The exact format of these tables is described in the GreenGuard README and docs:

In [3]:
target_times.head()

|   | turbine_id | cutoff_time | target |
|---|------------|-------------|--------|
| 0 | T001       | 2013-01-12  | 0      |
| 1 | T001       | 2013-01-13  | 0      |
| 2 | T001       | 2013-01-14  | 0      |
| 3 | T001       | 2013-01-15  | 1      |
| 4 | T001       | 2013-01-16  | 0      |


In [4]:
target_times.shape

(353, 3)

In [5]:
target_times.dtypes

turbine_id             object
cutoff_time    datetime64[ns]
target                  int64
dtype: object

In [6]:
readings.head()

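|   | turbine_id | signal_id | timestamp  | value |
|---|------------|-----------|------------|-------|
| 0 | T001       | S01       | 2013-01-10 | 323.0 |
| 1 | T001       | S02       | 2013-01-10 | 320.0 |
| 2 | T001       | S03       | 2013-01-10 | 284.0 |
| 3 | T001       | S04       | 2013-01-10 | 348.0 |
| 4 | T001       | S05       | 2013-01-10 | 273.0 |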


In [7]:
readings.shape

(1313540, 4)

In [8]:
readings.dtypes

turbine_id            object
signal_id             object
timestamp     datetime64[ns]
value                float64
dtype: object

### Load your own Dataset

Alternatively, if you want to load your own dataset, all you have to do is load the
`target_times` and `readings` tables as `pandas.DataFrame` objects.

Make sure to parse the corresponding datetime fields!

```python
import pandas as pd

target_times = pd.read_csv('path/to/your/target_times.csv', parse_dates=['cutoff_time'])
readings = pd.read_csv('path/to/your/readings.csv', parse_dates=['timestamp'])
```
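Before passing your own tables to GreenGuard, it may help to check that they look like the demo tables shown above. The sketch below is one possible check, not part of GreenGuard: `check_table` is a hypothetical helper, and the two small in-memory CSV snippets stand in for your own files.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the two CSV files (hypothetical values).
target_times_csv = io.StringIO(
    'turbine_id,cutoff_time,target\n'
    'T001,2013-01-12,0\n'
    'T001,2013-01-15,1\n'
)
readings_csv = io.StringIO(
    'turbine_id,signal_id,timestamp,value\n'
    'T001,S01,2013-01-10,323.0\n'
    'T001,S02,2013-01-10,320.0\n'
)

target_times = pd.read_csv(target_times_csv, parse_dates=['cutoff_time'])
readings = pd.read_csv(readings_csv, parse_dates=['timestamp'])


def check_table(table, columns, datetime_column):
    """Raise ValueError if columns are missing or the datetime was not parsed."""
    missing = [column for column in columns if column not in table.columns]
    if missing:
        raise ValueError('Missing columns: {}'.format(missing))

    if not pd.api.types.is_datetime64_any_dtype(table[datetime_column]):
        raise ValueError('{} is not a datetime column'.format(datetime_column))


check_table(target_times, ['turbine_id', 'cutoff_time', 'target'], 'cutoff_time')
check_table(readings, ['turbine_id', 'signal_id', 'timestamp', 'value'], 'timestamp')
```

If `parse_dates` is forgotten, the datetime column is loaded as plain strings and the helper raises a `ValueError`.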

## 2. Split the data

Once we have loaded the `target_times`, and before training any Machine Learning
pipeline, we have to split them into 2 partitions: one for training and one for testing.

In this case, we will split them using the [train_test_split function from scikit-learn](
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html),
but it can be done with any other suitable tool.

In [9]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(target_times, test_size=0.25, random_state=0)
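When one of the target classes is much rarer than the other, a plain random split can leave very few positive examples in one of the partitions. As a hedged variation, the sketch below uses the `stratify` argument of `train_test_split` to keep the class proportions equal in both partitions. The `target_times` table here is synthetic and only reuses the column names of the demo data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target_times table: 40 rows, 8 of them positive (20%).
target_times = pd.DataFrame({
    'turbine_id': ['T001'] * 40,
    'cutoff_time': pd.date_range('2013-01-01', periods=40, freq='D'),
    'target': [1] * 8 + [0] * 32,
})

# stratify keeps the proportion of each target value in both partitions.
train, test = train_test_split(
    target_times,
    test_size=0.25,
    random_state=0,
    stratify=target_times['target'],
)

print(train['target'].mean(), test['target'].mean())
```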

## 3. Finding a Template

The next step will be to select a template from the ones available in
GreenGuard.

For this, we can use the `greenguard.get_pipelines` function, which
returns the list of all the available MLBlocks pipelines found in the
GreenGuard system.

In [10]:
from greenguard import get_pipelines

get_pipelines()

['resample_600s_normalize_dfs_1d_xgb_classifier',
 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier',
 'resample_600s_unstack_double_144_lstm_timeseries_classifier',
 'resample_3600s_unstack_24_lstm_timeseries_classifier',
 'resample_3600s_unstack_double_24_lstm_timeseries_classifier',
 'resample_600s_unstack_dfs_1d_xgb_classifier',
 'resample_600s_unstack_144_lstm_timeseries_classifier']

Optionally, we can pass a string to select the pipelines that contain it:

In [11]:
get_pipelines('dfs')

['resample_600s_normalize_dfs_1d_xgb_classifier',
 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier',
 'resample_600s_unstack_dfs_1d_xgb_classifier']
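The effect of the string argument can be reproduced with a plain substring filter over the names listed above. This is only an illustration of the observed behavior, not GreenGuard's implementation, and `filter_names` is a hypothetical helper.

```python
# Pipeline names copied from the get_pipelines() output above.
names = [
    'resample_600s_normalize_dfs_1d_xgb_classifier',
    'resample_600s_unstack_normalize_dfs_1d_xgb_classifier',
    'resample_600s_unstack_double_144_lstm_timeseries_classifier',
    'resample_3600s_unstack_24_lstm_timeseries_classifier',
    'resample_3600s_unstack_double_24_lstm_timeseries_classifier',
    'resample_600s_unstack_dfs_1d_xgb_classifier',
    'resample_600s_unstack_144_lstm_timeseries_classifier',
]


def filter_names(names, pattern=None):
    """Return the names that contain pattern, or all of them if pattern is None."""
    if pattern is None:
        return list(names)

    return [name for name in names if pattern in name]


dfs_names = filter_names(names, 'dfs')
lstm_names = filter_names(names, 'lstm')
print(dfs_names)
```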

Additionally, we can pass the keyword `path=True` to obtain a dictionary that maps
each pipeline name to the path of its JSON file, instead of only the list of names.

In [12]:
get_pipelines('dfs', path=True)

{'resample_600s_normalize_dfs_1d_xgb_classifier': '/app/greenguard/pipelines/resample_600s_normalize_dfs_1d_xgb_classifier.json',
 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier': '/app/greenguard/pipelines/resample_600s_unstack_normalize_dfs_1d_xgb_classifier.json',
 'resample_600s_unstack_dfs_1d_xgb_classifier': '/app/greenguard/pipelines/resample_600s_unstack_dfs_1d_xgb_classifier.json'}

For the rest of this tutorial, we will select and use the pipeline
`resample_600s_unstack_normalize_dfs_1d_xgb_classifier` as our template.

This template contains the following steps:

- Resample the data using a 10 minute average aggregation
- Unstack the data by signal, so each signal is in a different column
- Normalize the Turbine IDs into a new table to assist DFS aggregations
- Use DFS on the readings based on the target_times cutoff times using a 1d window size
- Apply an XGBoost Classifier
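To make the first two steps more concrete, here is a rough sketch of a 10 minute average resampling followed by an unstack by signal, written in plain pandas. It is not the code that the template runs internally; the toy `readings` values are made up and only the column names match the demo data.

```python
import pandas as pd

# Toy readings: two signals of one turbine, sampled every 5 minutes.
readings = pd.DataFrame({
    'turbine_id': ['T001'] * 6,
    'signal_id': ['S01', 'S02'] * 3,
    'timestamp': pd.to_datetime([
        '2013-01-10 00:00', '2013-01-10 00:00',
        '2013-01-10 00:05', '2013-01-10 00:05',
        '2013-01-10 00:10', '2013-01-10 00:10',
    ]),
    'value': [1.0, 10.0, 3.0, 30.0, 5.0, 50.0],
})

# Resample: average the values of each signal within 600 second bins.
grouper = pd.Grouper(key='timestamp', freq='600s')
resampled = (
    readings
    .groupby(['turbine_id', 'signal_id', grouper])['value']
    .mean()
    .reset_index()
)

# Unstack: move each signal to its own column.
unstacked = (
    resampled
    .set_index(['turbine_id', 'timestamp', 'signal_id'])['value']
    .unstack('signal_id')
    .reset_index()
)

print(unstacked)
```

After these two operations, each row holds all the signal values of one turbine for one 10 minute interval.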

In [13]:
template = 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier'
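The DFS step of this template relies on the `cutoff_time` of each target: only readings within the 1 day window that ends at the cutoff are used to build the features. The sketch below illustrates that idea with a single hand-written aggregation in pandas. It does not use DFS, the feature name `mean_value_1d` is hypothetical, and whether the window boundaries are inclusive in the real pipeline is an assumption.

```python
import pandas as pd

# Toy data: one turbine, four readings around a single cutoff time.
readings = pd.DataFrame({
    'turbine_id': ['T001'] * 4,
    'timestamp': pd.to_datetime([
        '2013-01-10 12:00',  # before the window
        '2013-01-11 06:00',  # inside the window
        '2013-01-11 18:00',  # inside the window
        '2013-01-12 06:00',  # after the cutoff, must not be used
    ]),
    'value': [100.0, 200.0, 300.0, 400.0],
})
target_times = pd.DataFrame({
    'turbine_id': ['T001'],
    'cutoff_time': pd.to_datetime(['2013-01-12']),
    'target': [0],
})

window = pd.Timedelta('1d')


def window_mean(row):
    """Average the readings of the turbine in (cutoff - window, cutoff]."""
    mask = (
        (readings['turbine_id'] == row['turbine_id'])
        & (readings['timestamp'] > row['cutoff_time'] - window)
        & (readings['timestamp'] <= row['cutoff_time'])
    )
    return readings.loc[mask, 'value'].mean()


features = target_times.assign(
    mean_value_1d=target_times.apply(window_mean, axis=1)
)
print(features)
```

Ignoring everything after the cutoff is what keeps information from the future out of the features.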

## 3. Finding the best Pipeline

Once we have loaded the data, we create a **GreenGuardPipeline** instance by passing:

* `template (string)`: the name of a template or the path to a template json file.
* `metric (string or function)`: The name of the metric to use or a metric function to use.
* `cost (bool)`: Whether the metric is a cost function to be minimized or a score to be maximized.

Optionally, we can also pass details about the cross validation configuration:

* `stratify`
* `cv_splits`
* `shuffle`
* `random_state`

In [14]:
from greenguard.pipeline import GreenGuardPipeline

pipeline = GreenGuardPipeline(template, metric='f1', cv_splits=3)
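To give an intuition of what a 3-split cross validation does, the sketch below builds stratified folds with scikit-learn's `StratifiedKFold`. How GreenGuard maps `stratify`, `cv_splits`, `shuffle` and `random_state` to a splitter internally is an assumption here; the labels are synthetic.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 9 negative and 3 positive examples.
y = np.array([0] * 9 + [1] * 3)
X = np.arange(len(y)).reshape(-1, 1)

# 3 folds, shuffled with a fixed seed so the split is reproducible.
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_summary = []
for train_index, validation_index in splitter.split(X, y):
    # Each fold trains on 2/3 of the data and validates on the remaining 1/3.
    positives = int(y[validation_index].sum())
    fold_summary.append((len(train_index), len(validation_index), positives))

print(fold_summary)
```

The cross validation score is then typically the average of the metric over the validation folds.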

Once we have created the pipeline, we can call its `tune` method to find the best possible
hyperparameters for our data, passing the `target_times` and `readings` variables,
as well as an indication of the number of tuning iterations that we want to perform.

In [15]:
pipeline.tune(target_times, readings, iterations=5)

INFO:greenguard.pipeline:Scoring the default pipeline
INFO:greenguard.pipeline:Running static steps before cross validation


Built 165 features
Elapsed: 00:47 | Progress: 100%|██████████
Elapsed: 00:24 | Progress: 100%|██████████
Built 165 features
Elapsed: 00:50 | Progress: 100%|██████████
Elapsed: 00:23 | Progress: 100%|██████████
Built 165 features
Elapsed: 00:46 | Progress: 100%|██████████
Elapsed: 00:23 | Progress: 100%|██████████


INFO:greenguard.pipeline:Default Pipeline score: 0.605187908496732
INFO:greenguard.pipeline:Scoring pipeline 1
INFO:btb:Using Uniform sampler as user specified r_minimum threshold is not met to start the GP based learning
INFO:greenguard.pipeline:Pipeline 1 score: 0.6188131761825791
INFO:greenguard.pipeline:Scoring pipeline 2
INFO:greenguard.pipeline:Pipeline 2 score: 0.6271095502877767
INFO:greenguard.pipeline:Scoring pipeline 3
INFO:greenguard.pipeline:Pipeline 3 score: 0.6305597783858653
INFO:greenguard.pipeline:Scoring pipeline 4
INFO:greenguard.pipeline:Pipeline 4 score: 0.6024864024864024
INFO:greenguard.pipeline:Scoring pipeline 5
INFO:greenguard.pipeline:Pipeline 5 score: 0.6141217155301661


After the tuning process has finished, the best hyperparameters found have already been set in the pipeline.

We can inspect them by calling the `get_hyperparameters` method,
which returns a dictionary with the best hyperparameters found so far:

In [16]:
pipeline.get_hyperparameters()

{'mlprimitives.custom.feature_extraction.CategoricalEncoder#1': {'max_labels': 82},
 'xgboost.XGBClassifier#1': {'n_estimators': 785,
  'max_depth': 7,
  'learning_rate': 0.12220259756122442,
  'gamma': 0.07359343182340616,
  'min_child_weight': 9}}
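Since the returned value is a plain nested dictionary, it can be stored for later reference with the standard `json` module. This is only a sketch of how such a record could be kept: the file name is arbitrary, the dictionary is copied from the output above, and loading the values back into a pipeline is not covered here.

```python
import json
import os
import tempfile

# Hyperparameters copied from the get_hyperparameters() output above.
hyperparameters = {
    'mlprimitives.custom.feature_extraction.CategoricalEncoder#1': {
        'max_labels': 82
    },
    'xgboost.XGBClassifier#1': {
        'n_estimators': 785,
        'max_depth': 7,
        'learning_rate': 0.12220259756122442,
        'gamma': 0.07359343182340616,
        'min_child_weight': 9
    }
}

# Write the dictionary to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'hyperparameters.json')
with open(path, 'w') as json_file:
    json.dump(hyperparameters, json_file, indent=4)

# Read it back to confirm that nothing was lost.
with open(path) as json_file:
    restored = json.load(json_file)

print(restored == hyperparameters)
```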

We can also see the obtained cross validation score by looking at the `cv_score` attribute of the
`pipeline` object:

In [17]:
pipeline.cv_score

0.6305597783858653
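Conceptually, tuning is a loop that proposes hyperparameters, scores them, and remembers the best candidate seen so far. The toy sketch below shows that loop with uniform random sampling only. It is not how GreenGuard or BTB are implemented: `ToyTuner` and `toy_cv_score` are hypothetical, and the score function is an arbitrary formula rather than a real cross validation.

```python
import random


def toy_cv_score(params):
    """Hypothetical stand-in for a cross validation score (higher is better)."""
    return (
        1.0
        - abs(params['learning_rate'] - 0.1)
        - 0.01 * abs(params['max_depth'] - 5)
    )


class ToyTuner:
    """Random search that keeps its best result between calls to tune."""

    def __init__(self, seed=0):
        self._random = random.Random(seed)
        self.best_params = None
        self.cv_score = float('-inf')

    def tune(self, iterations):
        for _ in range(iterations):
            # Propose a candidate by sampling each hyperparameter uniformly.
            params = {
                'learning_rate': self._random.uniform(0.01, 0.3),
                'max_depth': self._random.randint(3, 10),
            }
            score = toy_cv_score(params)

            # Keep the candidate only if it improves the best score so far.
            if score > self.cv_score:
                self.best_params = params
                self.cv_score = score


tuner = ToyTuner()
tuner.tune(5)
first_score = tuner.cv_score

# A second call continues from the previous best instead of starting over.
tuner.tune(10)
second_score = tuner.cv_score

print(first_score, second_score)
```

Because the best result is kept between calls, the score can only stay the same or improve as more iterations are run.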

**NOTE**: If the score is not good enough, we can call the `tune` method again as many times
as needed and the pipeline will continue its tuning process every time based on the previous
results!

In [18]:
pipeline.tune(target_times, readings, iterations=10)

INFO:greenguard.pipeline:Scoring pipeline 1
INFO:greenguard.pipeline:Pipeline 1 score: 0.6635006784260514
INFO:greenguard.pipeline:Scoring pipeline 2
INFO:greenguard.pipeline:Pipeline 2 score: 0.6845139382452815
INFO:greenguard.pipeline:Scoring pipeline 3
INFO:greenguard.pipeline:Pipeline 3 score: 0.6424425247954658
INFO:greenguard.pipeline:Scoring pipeline 4
INFO:greenguard.pipeline:Pipeline 4 score: 0.6146558553876801
INFO:greenguard.pipeline:Scoring pipeline 5
INFO:greenguard.pipeline:Pipeline 5 score: 0.6188226349516671
INFO:greenguard.pipeline:Scoring pipeline 6
INFO:greenguard.pipeline:Pipeline 6 score: 0.6213326748609891
INFO:greenguard.pipeline:Scoring pipeline 7
INFO:greenguard.pipeline:Pipeline 7 score: 0.6431577681577682
INFO:greenguard.pipeline:Scoring pipeline 8
INFO:greenguard.pipeline:Pipeline 8 score: 0.6119918008302174
INFO:greenguard.pipeline:Scoring pipeline 9
INFO:greenguard.pipeline:Pipeline 9 score: 0.670814479638009
INFO:greenguard.pipeline:Scoring pipeline 10
IN

In [19]:
pipeline.cv_score

0.6845139382452815

In [20]:
pipeline.get_hyperparameters()

{'mlprimitives.custom.feature_extraction.CategoricalEncoder#1': {'max_labels': 84},
 'xgboost.XGBClassifier#1': {'n_estimators': 788,
  'max_depth': 4,
  'learning_rate': 0.13866846579555614,
  'gamma': 0.652732260680545,
  'min_child_weight': 10}}

## 4. Fitting the pipeline

Once we are satisfied with the obtained cross validation score, we can proceed to call
the `fit` method, passing the `train` partition of the target times and the `readings` table.

This will fit the pipeline with all the training data available using the best hyperparameters
found during the tuning process:

In [21]:
pipeline.fit(train, readings)

Built 165 features
Elapsed: 00:52 | Progress: 100%|██████████


## 5. Use the fitted pipeline

After fitting the pipeline, we are ready to make predictions on new data:

In [22]:
predictions = pipeline.predict(test, readings)

Elapsed: 00:17 | Progress: 100%|██████████


And evaluate its prediction performance:

In [23]:
from sklearn.metrics import f1_score

f1_score(test['target'], predictions)

0.76
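A single F1 value hides which kind of errors the pipeline makes. As a complement, the following sketch computes the confusion matrix, precision and recall with scikit-learn. The `y_true` and `y_pred` arrays are made-up examples, not the predictions obtained above.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up labels and predictions, only to illustrate the metrics.
y_true = [0, 0, 0, 1, 0, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]

# Rows are the true classes and columns are the predicted classes.
matrix = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of both

print(matrix)
print(precision, recall, f1)
```

With the real data, `y_true` would be `test['target']` and `y_pred` the `predictions` array.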

## 6. Save and load the pipeline

Since the tuning and fitting process takes time to execute and requires a lot of data, you
will probably want to save a fitted instance and load it later to analyze new signals
instead of fitting pipelines over and over again.

This can be done by using the `save` and `load` methods from the `GreenGuardPipeline`.

In order to save an instance, call its `save` method passing it the path and filename
where the model should be saved.

In [24]:
path = 'my_pipeline.pkl'

pipeline.save(path)

Once the pipeline is saved, it can be loaded back as a new `GreenGuardPipeline` by using the
`GreenGuardPipeline.load` method:

In [25]:
new_pipeline = GreenGuardPipeline.load(path)

Once loaded, it can be directly used to make predictions on new data.

In [26]:
predictions = new_pipeline.predict(test, readings)
predictions[0:5]

Elapsed: 00:17 | Progress: 100%|██████████


array([0, 0, 0, 1, 0])
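The general save and load round trip can be sketched with the standard `pickle` module. Whether `GreenGuardPipeline.save` uses pickle internally is an assumption suggested only by the `.pkl` file name used above. The `model` dictionary and the `predict` function are hypothetical stand-ins for a fitted pipeline.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the state of a fitted model.
model = {'threshold': 300.0}


def predict(model, values):
    """Flag the values that exceed the threshold stored in the model."""
    return [int(value > model['threshold']) for value in values]


# Save the model to disk.
path = os.path.join(tempfile.mkdtemp(), 'my_model.pkl')
with open(path, 'wb') as pickle_file:
    pickle.dump(model, pickle_file)

# Load it back as a new object.
with open(path, 'rb') as pickle_file:
    restored = pickle.load(pickle_file)

values = [323.0, 284.0, 348.0]
print(predict(restored, values) == predict(model, values))
```

Keep in mind that pickle files can execute arbitrary code when loaded, so only load files that come from a trusted source.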