# GreenGuard Quickstart

This notebook shows how to use GreenGuard to:

- Load demo data
- Find available pipelines and load two of them as templates
- Tune the templates to find the best template for the given data and its hyperparameters
- Fit the found pipeline to our data
- Make predictions using the pipeline
- Evaluate the goodness-of-fit

## 0. Set up logging

This step sets up logging in our environment to increase our visibility over
the steps that GreenGuard performs.

In [1]:
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(level=logging.INFO)

import warnings
warnings.simplefilter("ignore")
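
If the `INFO` output turns out to be too noisy, the level can be lowered for individual loggers instead of globally. The snippet below is a minimal sketch using only the standard `logging` module; the logger name `btb.session` is taken from the log lines shown later in this notebook, and silencing it is just one possible choice.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Keep INFO messages in general, but only show warnings and errors
# coming from the tuning session logger (name taken from the log output).
logging.getLogger('btb.session').setLevel(logging.WARNING)

# Loggers that were not touched still inherit the root INFO level.
pipeline_level = logging.getLogger('greenguard.pipeline').getEffectiveLevel()
```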

## 1. Load the Data

The first step is to load the data that we are going to use.

To use the demo data included in GreenGuard, call the `greenguard.demo.load_demo` function.

In [2]:
from greenguard.demo import load_demo

target_times, readings = load_demo()

This will download some demo data from [GreenGuard S3 demo Bucket](
https://d3-ai-greenguard.s3.amazonaws.com/index.html) and load it as
the necessary `target_times` and `readings` tables.

The exact format of these tables is described in the GreenGuard README and docs. The following cells show their first rows, shapes and data types:

In [3]:
target_times.head()

|   | turbine_id | cutoff_time | target |
|---|------------|-------------|--------|
| 0 | T001 | 2013-01-12 | 0 |
| 1 | T001 | 2013-01-13 | 0 |
| 2 | T001 | 2013-01-14 | 0 |
| 3 | T001 | 2013-01-15 | 1 |
| 4 | T001 | 2013-01-16 | 0 |


In [4]:
target_times.shape

(353, 3)

In [5]:
target_times.dtypes

turbine_id             object
cutoff_time    datetime64[ns]
target                  int64
dtype: object

In [6]:
readings.head()

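|   | turbine_id | signal_id | timestamp | value |
|---|------------|-----------|-----------|-------|
| 0 | T001 | S01 | 2013-01-10 | 323.0 |
| 1 | T001 | S02 | 2013-01-10 | 320.0 |
| 2 | T001 | S03 | 2013-01-10 | 284.0 |
| 3 | T001 | S04 | 2013-01-10 | 348.0 |
| 4 | T001 | S05 | 2013-01-10 | 273.0 |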


In [7]:
readings.shape

(1313540, 4)

In [8]:
readings.dtypes

turbine_id            object
signal_id             object
timestamp     datetime64[ns]
value                float64
dtype: object

### Load your own Dataset

Alternatively, if you want to load your own dataset, all you have to do is load the
`target_times` and `readings` tables as `pandas.DataFrame` objects.

Make sure to parse the corresponding datetime fields!

```python
import pandas as pd

target_times = pd.read_csv('path/to/your/target_times.csv', parse_dates=['cutoff_time'])
readings = pd.read_csv('path/to/your/readings.csv', parse_dates=['timestamp'])
```
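
Before passing your own tables on, it can help to verify that they have the columns and datetime types shown for the demo data. The following is a minimal sketch: `check_tables` is a hypothetical helper (not part of GreenGuard), and the in-memory CSVs only stand in for real files.

```python
import io

import pandas as pd

# Tiny in-memory CSVs standing in for real files (values are illustrative only).
target_times_csv = io.StringIO(
    "turbine_id,cutoff_time,target\n"
    "T001,2013-01-12,0\n"
    "T001,2013-01-15,1\n"
)
readings_csv = io.StringIO(
    "turbine_id,signal_id,timestamp,value\n"
    "T001,S01,2013-01-10,323.0\n"
    "T001,S02,2013-01-10,320.0\n"
)

target_times = pd.read_csv(target_times_csv, parse_dates=['cutoff_time'])
readings = pd.read_csv(readings_csv, parse_dates=['timestamp'])


def check_tables(target_times, readings):
    """Hypothetical helper: raise if columns are missing or datetimes are unparsed."""
    expected = {
        'target_times': (target_times, ['turbine_id', 'cutoff_time', 'target'], 'cutoff_time'),
        'readings': (readings, ['turbine_id', 'signal_id', 'timestamp', 'value'], 'timestamp'),
    }
    for name, (table, columns, datetime_column) in expected.items():
        missing = set(columns) - set(table.columns)
        if missing:
            raise ValueError(f'{name} is missing columns: {sorted(missing)}')
        if not pd.api.types.is_datetime64_any_dtype(table[datetime_column]):
            raise TypeError(f'{name}.{datetime_column} was not parsed as a datetime')


check_tables(target_times, readings)
```

A `TypeError` from this check usually means that `parse_dates` was omitted when reading the CSV.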

## 2. Split the data

Once we have loaded the `target_times`, and before training any Machine Learning
pipeline, we have to split them into two partitions, one for training and one for testing.

In this case, we will split them using the [train_test_split function from scikit-learn](
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html),
but it can be done with any other suitable tool.

In [9]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(target_times, test_size=0.25, random_state=0)
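
When positive targets are rare, a purely random split can leave the two partitions with different class proportions. As a hedged variation on the cell above, the sketch below uses the `stratify` argument of `train_test_split` on a small made-up table; the toy data is an assumption for illustration and is not the demo dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target_times table: 16 rows, 4 of them positive (25%). Values are illustrative.
target_times = pd.DataFrame({
    'turbine_id': ['T001'] * 16,
    'cutoff_time': pd.date_range('2013-01-12', periods=16, freq='D'),
    'target': [0, 0, 0, 1] * 4,
})

# stratify keeps the share of positive targets the same in both partitions.
train, test = train_test_split(
    target_times,
    test_size=0.25,
    random_state=0,
    stratify=target_times['target'],
)

print(train['target'].mean(), test['target'].mean())
```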

## 3. Finding the Templates

The next step will be to select a collection of templates from the ones
available in GreenGuard.

For this, we can use the `greenguard.get_pipelines` function, which
returns the list of all the MLBlocks pipelines available in GreenGuard.

In [10]:
from greenguard import get_pipelines

get_pipelines()

['resample_600s_unstack_144_lstm_timeseries_classifier',
 'resample_3600s_unstack_24_lstm_timeseries_classifier',
 'resample_600s_unstack_dfs_1d_xgb_classifier',
 'resample_600s_normalize_dfs_1d_xgb_classifier',
 'resample_3600s_unstack_double_24_lstm_timeseries_classifier',
 'resample_600s_unstack_double_144_lstm_timeseries_classifier',
 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier']

Optionally, we can pass a string to select the pipelines that contain it:

In [11]:
get_pipelines('dfs')

['resample_600s_unstack_dfs_1d_xgb_classifier',
 'resample_600s_normalize_dfs_1d_xgb_classifier',
 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier']

Additionally, we can pass the keyword argument `path=True` to obtain a dictionary that maps
each pipeline name to the path of its JSON file, instead of only the list of names.

In [12]:
get_pipelines('dfs', path=True)

{'resample_600s_unstack_dfs_1d_xgb_classifier': '/app/greenguard/pipelines/resample_600s_unstack_dfs_1d_xgb_classifier.json',
 'resample_600s_normalize_dfs_1d_xgb_classifier': '/app/greenguard/pipelines/resample_600s_normalize_dfs_1d_xgb_classifier.json',
 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier': '/app/greenguard/pipelines/resample_600s_unstack_normalize_dfs_1d_xgb_classifier.json'}

For the rest of this tutorial, we will select and use the templates
`resample_600s_unstack_normalize_dfs_1d_xgb_classifier` and
`resample_600s_normalize_dfs_1d_xgb_classifier`.

The `resample_600s_unstack_normalize_dfs_1d_xgb_classifier` template contains the following steps:

- Resample the data using a 10 minute average aggregation
- Unstack the data by signal, so each signal is in a different column
- Normalize the Turbine IDs into a new table to assist DFS aggregations
- Use DFS on the readings based on the target_times cutoff times using a 1d window size
- Apply an XGBoost Classifier

And the `resample_600s_normalize_dfs_1d_xgb_classifier` template contains the above steps but without
unstacking the data by signal.

In [13]:
templates = [
    'resample_600s_unstack_normalize_dfs_1d_xgb_classifier', 
    'resample_600s_normalize_dfs_1d_xgb_classifier'
]
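
To make the first two template steps more concrete, here is a rough pandas sketch of what "resample with a 10 minute average" and "unstack by signal" mean on a toy `readings` table. It only illustrates the idea with made-up values; it is not the code that the GreenGuard primitives run.

```python
import pandas as pd

# Toy readings: one turbine, two signals, one reading every 5 minutes.
readings = pd.DataFrame({
    'turbine_id': ['T001'] * 6,
    'signal_id': ['S01', 'S01', 'S01', 'S02', 'S02', 'S02'],
    'timestamp': pd.to_datetime([
        '2013-01-10 00:00', '2013-01-10 00:05', '2013-01-10 00:10',
        '2013-01-10 00:00', '2013-01-10 00:05', '2013-01-10 00:10',
    ]),
    'value': [1.0, 3.0, 5.0, 10.0, 30.0, 50.0],
})

# Step 1: average each signal over 10 minute (600 second) windows.
resampled = (
    readings
    .groupby(['turbine_id', 'signal_id', pd.Grouper(key='timestamp', freq='10min')])['value']
    .mean()
)

# Step 2: unstack by signal so that each signal becomes its own column.
unstacked = resampled.unstack('signal_id').reset_index()

print(unstacked)
```

After unstacking there is one row per turbine and time window and one column per signal, which is the layout that the later steps of the `unstack` template work on.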

## 4. Finding the best Pipeline

Once we have loaded the data, we create a **GreenGuardPipeline** instance by passing:

* `templates (string or list)`: the name of a template, the path to a template json file or
a list that can combine both of them.
* `metric (string or function)`: The name of the metric to use or a metric function to use.
* `cost (bool)`: Whether the metric is a cost function to be minimized or a score to be maximized.

Optionally, we can also pass details about the cross validation configuration:

* `stratify`
* `cv_splits`
* `shuffle`
* `random_state`

In [14]:
from greenguard.pipeline import GreenGuardPipeline

pipeline = GreenGuardPipeline(templates, metric='f1', cv_splits=3)
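
Since `metric` can also be a function, a custom score can be plugged in. The sketch below assumes that such a function receives the true and predicted labels in scikit-learn order, `(y_true, y_pred)`; that signature is an assumption, and `f2_metric` is a hypothetical name. It weights recall higher than precision, which may suit cases where missing a failure is costlier than a false alarm.

```python
from sklearn.metrics import fbeta_score


def f2_metric(y_true, y_pred):
    """Hypothetical custom metric: F-beta with beta=2, favouring recall."""
    return fbeta_score(y_true, y_pred, beta=2)


# Quick check on toy labels: one of the two positives is found, no false alarms.
example_score = f2_metric([0, 1, 1, 0], [0, 1, 0, 0])

# Assumed usage, following the arguments described above (not executed here):
# pipeline = GreenGuardPipeline(templates, metric=f2_metric, cost=False, cv_splits=3)
```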

Once we have created the pipeline, we can find which template and which combination of hyperparameters works best for our data by calling the `tune` method of our pipeline, passing the `target_times` and `readings` tables.

This method returns a `BTBSession` that will:

- Select and tune templates.
- Automatically update our pipeline whenever a template or a set of hyperparameters scores better than the previous best, so that it uses that template with those hyperparameters.
- Remove templates that don't work with the given data and focus on tuning only the ones that do.

In [15]:
session = pipeline.tune(target_times, readings)

Once we have our `session`, we can call its `run` method with the number of
tuning iterations that we want to perform:

In [16]:
session.run(5)

INFO:btb.session:Obtaining default configuration for resample_600s_unstack_normalize_dfs_1d_xgb_classifier


Built 165 features
Elapsed: 00:41 | Progress: 100%|██████████
Elapsed: 00:18 | Progress: 100%|██████████
Built 165 features
Elapsed: 00:37 | Progress: 100%|██████████
Elapsed: 00:18 | Progress: 100%|██████████
Built 165 features
Elapsed: 00:37 | Progress: 100%|██████████
Elapsed: 00:18 | Progress: 100%|██████████


INFO:greenguard.pipeline:New configuration found:
  Template: resample_600s_unstack_normalize_dfs_1d_xgb_classifier 
    Hyperparameters: 
      ('mlprimitives.custom.feature_extraction.CategoricalEncoder#1', 'max_labels'): 0
      ('xgboost.XGBClassifier#1', 'n_estimators'): 100
      ('xgboost.XGBClassifier#1', 'max_depth'): 3
      ('xgboost.XGBClassifier#1', 'learning_rate'): 0.1
      ('xgboost.XGBClassifier#1', 'gamma'): 0.0
      ('xgboost.XGBClassifier#1', 'min_child_weight'): 1
INFO:btb.session:New optimal found: resample_600s_unstack_normalize_dfs_1d_xgb_classifier - 0.6079987550575785
INFO:btb.session:Obtaining default configuration for resample_600s_normalize_dfs_1d_xgb_classifier


Built 99 features
Elapsed: 02:06 | Progress: 100%|██████████
Elapsed: 01:02 | Progress: 100%|██████████
Built 99 features
Elapsed: 01:53 | Progress: 100%|██████████
Elapsed: 00:54 | Progress: 100%|██████████
Built 99 features
Elapsed: 01:55 | Progress: 100%|██████████
Elapsed: 01:10 | Progress: 100%|██████████


INFO:btb.session:Generating new proposal configuration for resample_600s_unstack_normalize_dfs_1d_xgb_classifier
INFO:greenguard.pipeline:New configuration found:
  Template: resample_600s_unstack_normalize_dfs_1d_xgb_classifier 
    Hyperparameters: 
      ('mlprimitives.custom.feature_extraction.CategoricalEncoder#1', 'max_labels'): 9
      ('xgboost.XGBClassifier#1', 'n_estimators'): 28
      ('xgboost.XGBClassifier#1', 'max_depth'): 4
      ('xgboost.XGBClassifier#1', 'learning_rate'): 0.3977560491030686
      ('xgboost.XGBClassifier#1', 'gamma'): 0.19143248884807773
      ('xgboost.XGBClassifier#1', 'min_child_weight'): 8
INFO:btb.session:New optimal found: resample_600s_unstack_normalize_dfs_1d_xgb_classifier - 0.6418782052584869
INFO:btb.session:Generating new proposal configuration for resample_600s_normalize_dfs_1d_xgb_classifier
INFO:btb.session:Generating new proposal configuration for resample_600s_unstack_normalize_dfs_1d_xgb_classifier
INFO:greenguard.pipeline:New configu

{'id': '2a494af25e2d986c9178fd47820d4b00',
 'name': 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 14,
  ('xgboost.XGBClassifier#1', 'n_estimators'): 18,
  ('xgboost.XGBClassifier#1', 'max_depth'): 5,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.39294364912150626,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.3393295330438333,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 9},
 'score': 0.6671775409915827}

When this is done, the `best_proposal` will be printed out. We can access it anytime
using `session.best_proposal`:

In [17]:
session.best_proposal

{'id': '2a494af25e2d986c9178fd47820d4b00',
 'name': 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 14,
  ('xgboost.XGBClassifier#1', 'n_estimators'): 18,
  ('xgboost.XGBClassifier#1', 'max_depth'): 5,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.39294364912150626,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.3393295330438333,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 9},
 'score': 0.6671775409915827}
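
The keys of the `config` dictionary are `(block name, hyperparameter name)` tuples. As a small sketch of how to work with that structure, the snippet below regroups a shortened copy of the proposal shown above by block; the literal dictionary is copied from the output and abbreviated for brevity.

```python
# Shortened copy of the proposal printed above.
best_proposal = {
    'name': 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier',
    'config': {
        ('mlprimitives.custom.feature_extraction.CategoricalEncoder#1', 'max_labels'): 14,
        ('xgboost.XGBClassifier#1', 'n_estimators'): 18,
        ('xgboost.XGBClassifier#1', 'max_depth'): 5,
    },
    'score': 0.6671775409915827,
}

# Group the hyperparameters by the pipeline block they belong to.
by_block = {}
for (block, hyperparameter), value in best_proposal['config'].items():
    by_block.setdefault(block, {})[hyperparameter] = value

for block, hyperparameters in by_block.items():
    print(block, hyperparameters)
```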

You can check that the new hyperparameters are already set by calling the `get_hyperparameters` method:

In [18]:
pipeline.get_hyperparameters()

{('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
  'max_labels'): 14,
 ('xgboost.XGBClassifier#1', 'n_estimators'): 18,
 ('xgboost.XGBClassifier#1', 'max_depth'): 5,
 ('xgboost.XGBClassifier#1', 'learning_rate'): 0.39294364912150626,
 ('xgboost.XGBClassifier#1', 'gamma'): 0.3393295330438333,
 ('xgboost.XGBClassifier#1', 'min_child_weight'): 9}

We can also check the template name that is used to generate the pipeline:

In [19]:
pipeline.template_name

'resample_600s_unstack_normalize_dfs_1d_xgb_classifier'

We can also see the obtained cross validation score by looking at the `cv_score` attribute of the
`pipeline` object:

In [20]:
pipeline.cv_score

0.6671775409915827
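
For readers unfamiliar with cross validation scores, the sketch below shows the general idea with plain scikit-learn: the data is split into three stratified folds, the model is scored with F1 on each held-out fold, and the fold scores are averaged. The synthetic data and the `LogisticRegression` model are stand-ins, and averaging over folds is the usual convention rather than a confirmed description of how GreenGuard computes `cv_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary data standing in for the extracted features.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0)

# Three stratified, shuffled folds, scored with F1 on each held-out fold.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='f1', cv=folds)

# The cross validation score is the mean of the per-fold scores.
cv_score = scores.mean()
print(scores, cv_score)
```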

**NOTE**: If the score is not good enough, we can call the `run` method of the `session` again,
specifying the number of iterations, and the tuning process will resume from
the previous results!

In [21]:
session.run(iterations=10)

INFO:btb.session:Generating new proposal configuration for resample_600s_normalize_dfs_1d_xgb_classifier
INFO:btb.session:Generating new proposal configuration for resample_600s_unstack_normalize_dfs_1d_xgb_classifier
INFO:greenguard.pipeline:New configuration found:
  Template: resample_600s_unstack_normalize_dfs_1d_xgb_classifier 
    Hyperparameters: 
      ('mlprimitives.custom.feature_extraction.CategoricalEncoder#1', 'max_labels'): 99
      ('xgboost.XGBClassifier#1', 'n_estimators'): 143
      ('xgboost.XGBClassifier#1', 'max_depth'): 9
      ('xgboost.XGBClassifier#1', 'learning_rate'): 0.06337107325877978
      ('xgboost.XGBClassifier#1', 'gamma'): 0.932864412690726
      ('xgboost.XGBClassifier#1', 'min_child_weight'): 10
INFO:btb.session:New optimal found: resample_600s_unstack_normalize_dfs_1d_xgb_classifier - 0.6854149434794596
INFO:btb.session:Generating new proposal configuration for resample_600s_normalize_dfs_1d_xgb_classifier
INFO:btb.session:Generating new proposal c

{'id': '9999fcb9fdc53cf7bf8f1398cea07fab',
 'name': 'resample_600s_unstack_normalize_dfs_1d_xgb_classifier',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 99,
  ('xgboost.XGBClassifier#1', 'n_estimators'): 143,
  ('xgboost.XGBClassifier#1', 'max_depth'): 9,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.06337107325877978,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.932864412690726,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 10},
 'score': 0.6854149434794596}

In [22]:
pipeline.cv_score

0.6854149434794596

In [23]:
pipeline.get_hyperparameters()

{('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
  'max_labels'): 99,
 ('xgboost.XGBClassifier#1', 'n_estimators'): 143,
 ('xgboost.XGBClassifier#1', 'max_depth'): 9,
 ('xgboost.XGBClassifier#1', 'learning_rate'): 0.06337107325877978,
 ('xgboost.XGBClassifier#1', 'gamma'): 0.932864412690726,
 ('xgboost.XGBClassifier#1', 'min_child_weight'): 10}
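
Because the session resumes from its previous results, tuning can also be run in small batches until a target score is reached. The following is a sketch only: `tune_until` is a hypothetical helper that relies on nothing but `session.run` and `session.best_proposal` as used above, it assumes that a higher score is better, and `FakeSession` is a stand-in so that the example runs on its own.

```python
def tune_until(session, target_score, step=5, max_iterations=50):
    """Hypothetical helper: tune in batches until the best score reaches a target."""
    performed = 0
    while performed < max_iterations:
        session.run(step)
        performed += step
        if session.best_proposal['score'] >= target_score:
            break
    return performed


class FakeSession:
    """Stand-in for a real session so that the sketch can run on its own."""

    def __init__(self):
        self.best_proposal = {'score': 0.60}

    def run(self, iterations):
        # Pretend that every batch of iterations improves the score a little.
        self.best_proposal = {'score': self.best_proposal['score'] + 0.03}


fake = FakeSession()
iterations_used = tune_until(fake, target_score=0.68)
print(iterations_used, fake.best_proposal['score'])
```

The `max_iterations` cap keeps the loop from running indefinitely when the target score is out of reach.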

## 5. Fitting the pipeline

Once we are satisfied with the obtained cross validation score, we can proceed to call
the `fit` method, this time passing the `train` partition of the `target_times` together with the `readings`.

This will fit the pipeline with all the training data available using the best hyperparameters
found during the tuning process:

In [24]:
pipeline.fit(train, readings)

Built 165 features
Elapsed: 00:48 | Progress: 100%|██████████


## 6. Use the fitted pipeline

After fitting the pipeline, we are ready to make predictions on new data:

In [25]:
predictions = pipeline.predict(test, readings)

Elapsed: 00:17 | Progress: 100%|██████████


And evaluate its prediction performance:

In [26]:
from sklearn.metrics import f1_score

f1_score(test['target'], predictions)

0.7346938775510203
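
A single F1 value hides whether the errors are false alarms or missed events. As a complementary sketch, the snippet below computes a confusion matrix, precision and recall with scikit-learn on made-up labels; with the real data, `test['target']` and `predictions` would take the place of the toy lists.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)

# Precision: share of predicted positives that are correct.
precision = precision_score(y_true, y_pred)

# Recall: share of actual positives that were detected.
recall = recall_score(y_true, y_pred)

print(matrix, precision, recall)
```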

## 7. Save and load the pipeline

Since the tuning and fitting process takes time to execute and requires a lot of data, you
will probably want to save a fitted instance and load it later to analyze new signals
instead of fitting pipelines over and over again.

This can be done by using the `save` and `load` methods from the `GreenGuardPipeline`.

In order to save an instance, call its `save` method passing it the path and filename
where the model should be saved.

In [27]:
path = 'my_pipeline.pkl'

pipeline.save(path)

Once the pipeline is saved, it can be loaded back as a new `GreenGuardPipeline` by using the
`GreenGuardPipeline.load` method:

In [28]:
new_pipeline = GreenGuardPipeline.load(path)

Once loaded, it can be directly used to make predictions on new data.

In [29]:
predictions = new_pipeline.predict(test, readings)
predictions[0:5]

Elapsed: 00:19 | Progress: 100%|██████████


array([0, 0, 0, 1, 0])
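
A useful sanity check after any save and load round trip is that the restored object reproduces the original predictions. The sketch below shows the idea with the standard `pickle` module and a small scikit-learn model standing in for the pipeline. That GreenGuard's `save` and `load` are pickle based is only an assumption suggested by the `.pkl` extension, so treat this as an illustration of the check, not of the library's internals.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A small scikit-learn model stands in for the fitted pipeline.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'my_model.pkl')
    with open(path, 'wb') as pickle_file:
        pickle.dump(model, pickle_file)
    with open(path, 'rb') as pickle_file:
        restored = pickle.load(pickle_file)

# The restored object should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only load files that come from a trusted source.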