# GreenGuard usage example

This notebook shows how to use GreenGuard to fit a pipeline, use it to make predictions on new data, and evaluate its performance.

In [None]:
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(level=logging.INFO)

import warnings
warnings.simplefilter("ignore")

## 1. Load the Data

The first step is to load the data that we are going to use.

To use the demo data included in GreenGuard, call the `greenguard.loader.load_demo`
function.

In [1]:
from greenguard.loader import load_demo

X, y, tables = load_demo()

Alternatively, if you want to load your own dataset, the `GreenGuardLoader` class can be used.

For example, to load the data from the folder where we just downloaded the demo data,
we can use these commands:

In [2]:
from greenguard.loader import GreenGuardLoader

loader = GreenGuardLoader('../greenguard/demo', gzip=True)

X, y, tables = loader.load()

For further details about the `GreenGuardLoader` options, please check the corresponding
[API Reference page in the docs](https://d3-ai.github.io/GreenGuard/api/greenguard.loader.html#greenguard.loader.GreenGuardLoader).

The output of either of the previous commands is:

* `X`: A pandas.DataFrame with the contents of the
  target table.
* `y`: A pandas.Series with the contents of
  the target column.
* `tables`: A dictionary containing the readings, turbines and
  signals tables as pandas.DataFrames.

In [3]:
X.head()

Unnamed: 0,target_id,turbine_id,timestamp
0,1,1,2013-01-01
1,2,1,2013-01-02
2,3,1,2013-01-03
3,4,1,2013-01-04
4,5,1,2013-01-05


In [4]:
y.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: target, dtype: float64

In [5]:
tables.keys()

dict_keys(['readings', 'signals', 'turbines'])

In [6]:
tables['readings'].head()

Unnamed: 0,reading_id,turbine_id,signal_id,timestamp,value
0,1,1,1,2013-01-01,817.0
1,2,1,2,2013-01-01,805.0
2,3,1,3,2013-01-01,786.0
3,4,1,4,2013-01-01,809.0
4,5,1,5,2013-01-01,755.0


In [7]:
tables['signals'].head()

Unnamed: 0,signal_id,name
0,1,WTG01_Grid Production PossiblePower Avg. (1)
1,2,WTG02_Grid Production PossiblePower Avg. (2)
2,3,WTG03_Grid Production PossiblePower Avg. (3)
3,4,WTG04_Grid Production PossiblePower Avg. (4)
4,5,WTG05_Grid Production PossiblePower Avg. (5)


In [8]:
tables['turbines'].head()

Unnamed: 0,turbine_id,name
0,1,Turbine 1
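
To get a feel for how these tables relate to each other, the following sketch builds toy versions of `readings` and `signals` with the same column names shown above and reshapes them into one column per signal. The values and the signal names (`signal_a`, `signal_b`) are made up for illustration, and the reshaping is plain pandas, not a GreenGuard function.

```python
import pandas as pd

# Toy data with made-up values, reusing the column names of the demo tables.
readings = pd.DataFrame({
    'reading_id': [1, 2, 3, 4],
    'turbine_id': [1, 1, 1, 1],
    'signal_id': [1, 2, 1, 2],
    'timestamp': pd.to_datetime(
        ['2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02']
    ),
    'value': [817.0, 805.0, 790.0, 801.0],
})
signals = pd.DataFrame({
    'signal_id': [1, 2],
    'name': ['signal_a', 'signal_b'],  # hypothetical names
})

# Attach the signal name to each reading through the shared signal_id key.
named = readings.merge(signals, on='signal_id', how='left')

# One row per turbine and timestamp, one column per signal.
wide = named.pivot_table(
    index=['turbine_id', 'timestamp'],
    columns='name',
    values='value',
    aggfunc='mean',
)
print(wide)
```

Each row of the reshaped frame lines up with a `(turbine_id, timestamp)` pair, which are the same keys that appear in the target table `X`.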


## 2. Split the data

If we want to split the data into train and test subsets, we can do so by splitting
the `X` and `y` variables with any suitable tool.

In this case, we will do it using the `train_test_split` function from scikit-learn.

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
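
If the target classes are imbalanced, a stratified split may be preferable. The sketch below uses synthetic stand-ins for `X` and `y` (the `X_demo` and `y_demo` names and their values are assumptions made for this example) to show that `test_size=0.25` keeps a quarter of the rows for testing and that `stratify` preserves the class ratio in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 rows, 20 of them positive.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the 20% positive rate in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))    # 75 25
print(y_tr.sum(), y_te.sum())  # 15 5
```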

## 3. Finding the best Pipeline

Once we have loaded the data, we create a **GreenGuardPipeline** instance by passing:

* `template (string)`: The name of a template or the path to a template JSON file.
* `metric (string or function)`: The name of the metric to use, or a metric function.
* `cost (bool)`: Whether the metric is a cost function to be minimized or a score to be maximized.

Optionally, we can also pass details about the cross validation configuration:

* `stratify`
* `cv_splits`
* `shuffle`
* `random_state`

In this case, we will be loading the `greenguard_classification` pipeline, using
the `accuracy` metric, and using only 2 cross validation splits:

In [10]:
from greenguard.pipeline import GreenGuardPipeline

pipeline = GreenGuardPipeline(template='greenguard_classification', metric='accuracy', cv_splits=2)
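
To illustrate the difference between a score and a cost, here is a minimal sketch of a custom metric function. It assumes that a metric function follows the scikit-learn `(y_true, y_pred)` convention, which this notebook does not spell out; the `error_rate` name is hypothetical.

```python
from sklearn.metrics import accuracy_score


def error_rate(y_true, y_pred):
    """Hypothetical cost metric: fraction of wrong predictions (lower is better)."""
    return 1.0 - accuracy_score(y_true, y_pred)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

print(accuracy_score(y_true, y_pred))  # 0.75, a score to be maximized
print(error_rate(y_true, y_pred))      # 0.25, a cost to be minimized
```

Under that assumption, a function like this would presumably be passed as `metric=error_rate` together with `cost=True`, while `accuracy` is a score and needs no `cost` flag.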

Once we have created the pipeline, we can call its `tune` method to find the best possible
hyperparameters for our data, passing the training subsets `X_train` and `y_train`, the `tables`
returned by the loader, and the number of tuning iterations that we want to perform.

In [11]:
pipeline.tune(X_train, y_train, tables, iterations=1)

2019-04-18 18:57:09,190 - INFO - pipeline - Scoring the default pipeline
2019-04-18 18:59:40,757 - INFO - pipeline - Default Pipeline score: 0.6447509660798626
2019-04-18 18:59:40,758 - INFO - pipeline - Scoring pipeline 1
2019-04-18 18:59:40,759 - INFO - gp - Using Uniform sampler as user specified r_minimum threshold is not met to start the GP based learning
2019-04-18 19:02:26,829 - INFO - pipeline - Pipeline 1 score: 0.659349506225848


After the tuning process has finished, the best hyperparameters found have already been set in the pipeline.

We can see the found hyperparameters by calling the `get_hyperparameters` method,
which will return a dictionary with the best hyperparameters found so far:

In [12]:
pipeline.get_hyperparameters()

{'pandas.DataFrame.resample#1': {'rule': '1D',
  'time_index': 'timestamp',
  'groupby': ['turbine_id', 'signal_id'],
  'aggregation': 'mean'},
 'pandas.DataFrame.unstack#1': {'level': 'signal_id', 'reset_index': True},
 'featuretools.EntitySet.entity_from_dataframe#1': {'entityset_id': 'entityset',
  'entity_id': 'readings',
  'index': 'index',
  'variable_types': None,
  'make_index': True,
  'time_index': 'timestamp',
  'secondary_time_index': None,
  'already_sorted': False},
 'featuretools.EntitySet.entity_from_dataframe#2': {'entityset_id': 'entityset',
  'entity_id': 'turbines',
  'index': 'turbine_id',
  'variable_types': None,
  'make_index': False,
  'time_index': None,
  'secondary_time_index': None,
  'already_sorted': False},
 'featuretools.EntitySet.entity_from_dataframe#3': {'entityset_id': 'entityset',
  'entity_id': 'signals',
  'index': 'signal_id',
  'variable_types': None,
  'make_index': False,
  'time_index': None,
  'secondary_time_index': None,
  'already_sorted

We can also see the obtained cross validation score by looking at the `score` attribute of the
`pipeline` object:

In [13]:
pipeline.score

0.659349506225848

**NOTE**: If the score is not good enough, we can call the `tune` method again as many times
as needed and the pipeline will continue its tuning process every time based on the previous
results!
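
For intuition about what a cross validation score with 2 splits means, the sketch below computes one with scikit-learn on synthetic data. It is not GreenGuard's internal implementation; the dataset and the `DecisionTreeClassifier` are stand-ins chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two stratified, shuffled folds: each half is used once for validation.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring='accuracy',
)

# One accuracy value per fold; their mean is the cross validation score.
print(scores, scores.mean())
```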

## 4. Fitting the pipeline

Once we are satisfied with the obtained cross validation score, we can proceed to call
the `fit` method passing again the same data elements.

This will fit the pipeline with all the training data available using the best hyperparameters
found during the tuning process:

In [14]:
pipeline.fit(X_train, y_train, tables)

## 5. Use the fitted pipeline

After fitting the pipeline, we are ready to make predictions on new data:

In [15]:
predictions = pipeline.predict(X_test, tables)

And evaluate its prediction performance:

In [18]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

0.6413043478260869
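
Accuracy alone can hide how the errors are distributed between classes. As a small sketch with hand-written labels (the `y_true` and `y_pred` values below are made up and are not the notebook's results), a confusion matrix and F1 score can be computed alongside it:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
f1 = f1_score(y_true, y_pred)

print(acc)  # 0.75
print(cm)   # [[4 1]
            #  [1 2]]
print(f1)   # 2/3: precision and recall are both 2/3
```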

## 6. Save and load the pipeline

Since the tuning and fitting process takes time to execute and requires a lot of data, you
will probably want to save a fitted instance and load it later to analyze new signals
instead of fitting pipelines over and over again.

This can be done by using the `save` and `load` methods from the `GreenGuardPipeline`.

In order to save an instance, call its `save` method passing it the path and filename
where the model should be saved.

In [19]:
path = 'my_pipeline.pkl'

pipeline.save(path)

Once the pipeline is saved, it can be loaded back as a new `GreenGuardPipeline` by using the
`GreenGuardPipeline.load` method:

In [20]:
new_pipeline = GreenGuardPipeline.load(path)

Once loaded, it can be directly used to make predictions on new data.

In [21]:
predictions = new_pipeline.predict(X_test, tables)
predictions[0:5]

array([1., 0., 0., 0., 0.])
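
The save and load round trip can be pictured with Python's `pickle` module. This is a sketch under the assumption that the pipeline is serialized in a pickle-like way, which the `.pkl` extension suggests but this notebook does not confirm; the `LogisticRegression` model and the toy data are stand-ins.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for a fitted pipeline.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.pkl')

    # Save the fitted object to disk.
    with open(path, 'wb') as f:
        pickle.dump(model, f)

    # Load it back as a new, independent object.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored object reproduces the original predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

As with any pickle file, a saved model should only be loaded from a trusted source, since unpickling can execute arbitrary code.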