# Draco Machine Learning

In this tutorial we will show you how to use Draco to solve a Machine Learning problem
defined via a Target Times table.

During the next steps we will:

- Load demo target times and readings
- Find available pipelines and load two of them as templates
- Use Draco AutoML to select the best template and hyperparameters for our problem
- Build and fit a Machine Learning pipeline based on the found template and hyperparameters
- Make predictions using the fitted pipeline
- Evaluate how good the predictions are

## 0. Set up logging

This step sets up logging in our environment to increase our visibility over
the steps that Draco performs.

In [1]:
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(level=logging.INFO)

import warnings
warnings.simplefilter("ignore")
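If the `INFO` output turns out to be too verbose, the level can also be adjusted per logger instead of globally. The sketch below assumes the logger names `draco.pipeline` and `btb.session`, taken from the `INFO:<name>:` prefixes that appear in the tuning logs of this tutorial; the levels chosen here are only an example.

```python
import logging

# Keep the root logger at INFO so progress messages are still shown.
logging.basicConfig(level=logging.INFO)

# Assumed logger names, taken from the "INFO:<name>:" prefixes in the tuning logs.
logging.getLogger('draco.pipeline').setLevel(logging.INFO)

# Hide the per-proposal INFO messages but keep warnings and errors.
logging.getLogger('btb.session').setLevel(logging.WARNING)
```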

## 1. Load the Data

The first step is to load the data that we are going to use.

In order to use the demo data included in Draco, the `draco.demo.load_demo` function can be used.

In [2]:
from draco.demo import load_demo

target_times, readings = load_demo()

This will download some demo data from the [Draco S3 demo bucket](https://d3-ai-draco.s3.amazonaws.com/index.html)
and load it as the necessary `target_times` and `readings` tables.

The exact format of these tables is described in the Draco README and docs:

In [3]:
target_times.head()

| | turbine_id | cutoff_time | target |
|---|---|---|---|
| 0 | T001 | 2013-01-12 | 0 |
| 1 | T001 | 2013-01-13 | 0 |
| 2 | T001 | 2013-01-14 | 0 |
| 3 | T001 | 2013-01-15 | 1 |
| 4 | T001 | 2013-01-16 | 0 |


In [4]:
target_times.shape

(353, 3)

In [5]:
target_times.dtypes

turbine_id             object
cutoff_time    datetime64[ns]
target                  int64
dtype: object

In [6]:
readings.head()

| | turbine_id | signal_id | timestamp | value |
|---|---|---|---|---|
| 0 | T001 | S01 | 2013-01-10 | 323.0 |
| 1 | T001 | S02 | 2013-01-10 | 320.0 |
| 2 | T001 | S03 | 2013-01-10 | 284.0 |
| 3 | T001 | S04 | 2013-01-10 | 348.0 |
| 4 | T001 | S05 | 2013-01-10 | 273.0 |


In [7]:
readings.shape

(1313540, 4)

In [8]:
readings.dtypes

turbine_id            object
signal_id             object
timestamp     datetime64[ns]
value                float64
dtype: object

### Load your own Dataset

Alternatively, if you want to load your own dataset, all you have to do is load the
`target_times` and `readings` tables as `pandas.DataFrame` objects.

Make sure to parse the corresponding datetime fields!

```python
import pandas as pd

target_times = pd.read_csv('path/to/your/target_times.csv', parse_dates=['cutoff_time'])
readings = pd.read_csv('path/to/your/readings.csv', parse_dates=['timestamp'])
```
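Before passing your own tables to Draco, it can help to confirm that they have the expected columns and that the datetime columns were actually parsed. The following is a minimal sketch: `check_table` is a hypothetical helper (not part of Draco), the column names are the ones shown for the demo data, and the in-memory CSVs only stand in for your own files.

```python
import io

import pandas as pd

# Tiny in-memory CSVs standing in for your own files (illustrative data only).
target_times_csv = io.StringIO(
    "turbine_id,cutoff_time,target\n"
    "T001,2013-01-12,0\n"
    "T001,2013-01-13,1\n"
)
readings_csv = io.StringIO(
    "turbine_id,signal_id,timestamp,value\n"
    "T001,S01,2013-01-10,323.0\n"
    "T001,S02,2013-01-10,320.0\n"
)

target_times = pd.read_csv(target_times_csv, parse_dates=['cutoff_time'])
readings = pd.read_csv(readings_csv, parse_dates=['timestamp'])


def check_table(table, datetime_column, expected_columns):
    """Raise ValueError if columns are missing or the datetime column was not parsed."""
    missing = set(expected_columns) - set(table.columns)
    if missing:
        raise ValueError(f'Missing columns: {sorted(missing)}')
    if not pd.api.types.is_datetime64_any_dtype(table[datetime_column]):
        raise ValueError(f'{datetime_column} is not a datetime column')


check_table(target_times, 'cutoff_time', ['turbine_id', 'cutoff_time', 'target'])
check_table(readings, 'timestamp', ['turbine_id', 'signal_id', 'timestamp', 'value'])
```

If `parse_dates` is forgotten, the datetime column is loaded as text and `check_table` raises a `ValueError`.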

## 2. Split the data

Once we have loaded the `target_times`, and before training any Machine Learning
pipeline, we have to split them into two partitions, one for training and one for testing.

In this case, we will split them using the [train_test_split function from scikit-learn](
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html),
but it can be done with any other suitable tool.

In [9]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(target_times, test_size=0.25, random_state=0)
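
As an example of another tool for this step, the target times can be split chronologically, so that every test cutoff time is later than the training ones. This is only a sketch on made-up data: `chronological_split` is a hypothetical helper, not part of Draco or scikit-learn, and whether a chronological split suits your problem is an assumption you should check.

```python
import pandas as pd


def chronological_split(target_times, test_size=0.25):
    """Put the latest `test_size` fraction of rows (by cutoff_time) in the test set."""
    ordered = target_times.sort_values('cutoff_time')
    n_test = int(round(len(ordered) * test_size))
    split_at = len(ordered) - n_test
    return ordered.iloc[:split_at], ordered.iloc[split_at:]


# Made-up example data with the same columns as the demo target_times.
example = pd.DataFrame({
    'turbine_id': ['T001'] * 8,
    'cutoff_time': pd.date_range('2013-01-12', periods=8, freq='D'),
    'target': [0, 0, 0, 1, 0, 0, 1, 0],
})

train_part, test_part = chronological_split(example, test_size=0.25)
```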

## 3. Finding the available Templates

The next step will be to select a collection of templates from the ones
available in Draco.

For this, we can use the `draco.get_pipelines` function, which returns
the list of all the MLBlocks pipelines available in Draco.

In [10]:
from draco import get_pipelines

get_pipelines()

['dummy',
 'dfs_xgb_prob_with_unstack',
 'dfs_xgb_with_normalization',
 'dfs_xgb',
 'dfs_xgb_with_unstack',
 'dfs_xgb_prob_with_unstack_normalization',
 'dfs_xgb_with_unstack_normalization',
 'dfs_xgb_prob_with_double_normalization',
 'dfs_xgb_with_double_normalization',
 'lstm_regressor_with_unstack',
 'lstm_regressor',
 'double_lstm_prob_with_unstack',
 'double_lstm_prob',
 'double_lstm',
 'double_lstm_with_unstack',
 'lstm_prob_with_unstack',
 'lstm_with_unstack',
 'lstm_prob',
 'lstm']

Optionally, we can pass a string to select the pipelines that contain it:

In [11]:
get_pipelines('lstm')

['lstm_regressor_with_unstack',
 'lstm_regressor',
 'double_lstm_prob_with_unstack',
 'double_lstm_prob',
 'double_lstm',
 'double_lstm_with_unstack',
 'lstm_prob_with_unstack',
 'lstm_with_unstack',
 'lstm_prob',
 'lstm']
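
Since the result is a plain list of names, it can be filtered further with ordinary Python. The sketch below uses a hard-coded copy of the names returned by `get_pipelines('lstm')` instead of calling Draco, and it assumes that `prob` and `regressor` in a name mark probability and regression variants, which is inferred from the names only.

```python
# Names returned by get_pipelines('lstm'), hard-coded here for illustration.
lstm_pipelines = [
    'lstm_regressor_with_unstack',
    'lstm_regressor',
    'double_lstm_prob_with_unstack',
    'double_lstm_prob',
    'double_lstm',
    'double_lstm_with_unstack',
    'lstm_prob_with_unstack',
    'lstm_with_unstack',
    'lstm_prob',
    'lstm',
]

# Keep the pipelines that unstack the signals and whose names contain
# neither 'prob' nor 'regressor'.
selected = [
    name for name in lstm_pipelines
    if name.endswith('_with_unstack')
    and 'prob' not in name
    and 'regressor' not in name
]
print(selected)
```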

Additionally, we can pass the keyword `path=True` to obtain a dictionary that maps each
pipeline name to the path of its JSON file, instead of only the list of names.

In [12]:
get_pipelines('lstm', path=True)

{'lstm_regressor_with_unstack': '/Users/sarah/opt/anaconda3/envs/draco-mlstars/lib/python3.7/site-packages/draco/pipelines/lstm_regressor/lstm_regressor_with_unstack.json',
 'lstm_regressor': '/Users/sarah/opt/anaconda3/envs/draco-mlstars/lib/python3.7/site-packages/draco/pipelines/lstm_regressor/lstm_regressor.json',
 'double_lstm_prob_with_unstack': '/Users/sarah/opt/anaconda3/envs/draco-mlstars/lib/python3.7/site-packages/draco/pipelines/double_lstm/double_lstm_prob_with_unstack.json',
 'double_lstm_prob': '/Users/sarah/opt/anaconda3/envs/draco-mlstars/lib/python3.7/site-packages/draco/pipelines/double_lstm/double_lstm_prob.json',
 'double_lstm': '/Users/sarah/opt/anaconda3/envs/draco-mlstars/lib/python3.7/site-packages/draco/pipelines/double_lstm/double_lstm.json',
 'double_lstm_with_unstack': '/Users/sarah/opt/anaconda3/envs/draco-mlstars/lib/python3.7/site-packages/draco/pipelines/double_lstm/double_lstm_with_unstack.json',
 'lstm_prob_with_unstack': '/Users/sarah/opt/anaconda3/e

For the rest of this tutorial, we will select and use the templates
`lstm_with_unstack` and
`double_lstm_with_unstack`.

The `lstm_with_unstack` template contains the following steps:

- Resample the data using a 10 minute average aggregation
- Unstack the data by signal, so each signal is in a different column
- Scale the data to the range `[-1, 1]` using MinMaxScaler
- Create training windows from readings based on the target_times cutoff times
- Apply an LSTM Classifier

The `double_lstm_with_unstack` template contains the same steps, but uses two LSTM layers instead of one.
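
To make the unstack and scaling steps more concrete, here is a small pure-Python illustration on made-up readings. It is not how the templates implement these steps (they use MLBlocks primitives such as MinMaxScaler); the data and the `scale` helper are hypothetical.

```python
from collections import defaultdict

# Made-up long-format readings: (timestamp, signal_id, value).
long_readings = [
    ('2013-01-10 00:00', 'S01', 323.0),
    ('2013-01-10 00:00', 'S02', 320.0),
    ('2013-01-10 00:10', 'S01', 346.0),
    ('2013-01-10 00:10', 'S02', 300.0),
    ('2013-01-10 00:20', 'S01', 300.0),
    ('2013-01-10 00:20', 'S02', 310.0),
]

# Unstack: one row per timestamp, one column per signal.
wide = defaultdict(dict)
for timestamp, signal_id, value in long_readings:
    wide[timestamp][signal_id] = value

signals = sorted({signal_id for _, signal_id, _ in long_readings})
timestamps = sorted(wide)


def scale(values, low=-1.0, high=1.0):
    """Min-max scale a list of numbers into the range [low, high]."""
    minimum, maximum = min(values), max(values)
    span = (maximum - minimum) or 1.0  # avoid division by zero for constant signals
    return [low + (high - low) * (v - minimum) / span for v in values]


# Scale each signal column independently.
scaled = {
    signal: scale([wide[ts][signal] for ts in timestamps])
    for signal in signals
}
```

Each signal ends up as its own column with values between -1 and 1, which is the shape of data that the windowing step then slices up to each cutoff time.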

In [13]:
templates = [
    'lstm_with_unstack',
    'double_lstm_with_unstack'
]

## 4. Finding the best Pipeline

Once we have loaded the data, we create a **DracoPipeline** instance by passing:

* `templates (string or list)`: the name of a template, the path to a template JSON file, or
  a list that combines both.
* `metric (string or function)`: The name of the metric to use or a metric function to use.
* `cost (bool)`: Whether the metric is a cost function to be minimized or a score to be maximized.
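
As an illustration of passing a function as the metric, the sketch below defines an F2 score (an F-beta score that weights recall more than precision). It assumes that a metric function is called as `metric(y_true, y_pred)` and returns a number, following the scikit-learn convention; that signature is an assumption and is not confirmed by this tutorial.

```python
def f2_score(y_true, y_pred):
    """F-beta score with beta=2 for binary labels (1 is the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    beta_squared = 4.0
    denominator = (1 + beta_squared) * tp + beta_squared * fn + fp
    return (1 + beta_squared) * tp / denominator if denominator else 0.0


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
example_score = f2_score([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])

# Hypothetical usage, assuming the (y_true, y_pred) signature is accepted:
# pipeline = DracoPipeline(templates, metric=f2_score, cost=False)
```

Because a higher F2 is better, `cost` would be left as `False`, meaning the value is a score to be maximized.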

Optionally, we can also pass details about the cross validation configuration:

* `stratify`
* `cv_splits`
* `shuffle`
* `random_state`

In [14]:
from draco.pipeline import DracoPipeline

pipeline = DracoPipeline(templates, metric='f1', cv_splits=3)

Once we have created the pipeline, we can find which template and which combination of hyperparameters work best for our data by calling the `tune` method of our pipeline, passing the `target_times` and `readings` tables.

This method returns a `BTBSession` that will:

- Select and tune templates.
- Automatically update our pipeline whenever it finds a template or a set of hyperparameters that scores higher than the previous best, so that the pipeline uses that template with those hyperparameters.
- Remove templates that don't work with the given data and focus on tuning only the ones that do.
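
The behaviour described in the list above can be pictured with a toy loop. This is not how `BTBSession` is implemented (the real session uses proper template selection and hyperparameter tuners); `toy_tune`, `fake_score` and the single `units` hyperparameter are hypothetical and only illustrate the idea of keeping the best proposal and dropping failing templates.

```python
import random


def toy_tune(templates, score_fn, iterations, seed=0):
    """Toy search: try random settings per template and keep the best-scoring one."""
    rng = random.Random(seed)
    active = list(templates)
    best = None
    for i in range(iterations):
        if not active:
            break
        name = active[i % len(active)]  # alternate between the remaining templates
        config = {'units': rng.randint(10, 500)}
        try:
            score = score_fn(name, config)
        except Exception:
            active.remove(name)  # drop templates that fail on the data
            continue
        if best is None or score > best['score']:
            best = {'name': name, 'config': config, 'score': score}
    return best


def fake_score(name, config):
    """Made-up scoring function standing in for cross validation."""
    if name == 'broken_template':
        raise ValueError('template cannot handle this data')
    return 1.0 - abs(config['units'] - 300) / 500.0


best = toy_tune(['lstm_with_unstack', 'broken_template'], fake_score, iterations=10)
```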

In [15]:
session = pipeline.tune(target_times, readings)

Once we have our `session`, we can call its `run` method with the number of
tuning iterations that we want to perform:

In [16]:
session.run(5)

INFO:btb.session:Obtaining default configuration for lstm_with_unstack
2023-02-27 13:24:14.528911: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-02-27 13:24:14.610510: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe9ac63ca30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-02-27 13:24:14.610531: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:draco.pipeline:New configuration found:
  Template: lstm_with_unstack 
    Hyperparameters: 
      ('sklearn.impute.SimpleImputer#1', 'strategy'): mean
      ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'lstm_1_units'): 80
      ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dropout_1_rate'): 0.3
      ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dense_1_units'): 80
INFO:btb.session:New optimal found: lstm_with_unst

{'id': 'b67b5786315b71a0031050a4a65e588b',
 'name': 'lstm_with_unstack',
 'config': {('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
  ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'lstm_1_units'): 343,
  ('keras.Sequential.LSTMTimeSeriesClassifier#1',
   'dropout_1_rate'): 0.01904463205949069,
  ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dense_1_units'): 218},
 'score': 0.6144588856166879}

When this is done, the `best_proposal` will be printed out. We can access it anytime
using `session.best_proposal`:

In [17]:
session.best_proposal

{'id': 'b67b5786315b71a0031050a4a65e588b',
 'name': 'lstm_with_unstack',
 'config': {('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
  ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'lstm_1_units'): 343,
  ('keras.Sequential.LSTMTimeSeriesClassifier#1',
   'dropout_1_rate'): 0.01904463205949069,
  ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dense_1_units'): 218},
 'score': 0.6144588856166879}

You can check that the new hyperparameters are already set by calling the `get_hyperparameters` method:

In [18]:
pipeline.get_hyperparameters()

{('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
 ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'lstm_1_units'): 343,
 ('keras.Sequential.LSTMTimeSeriesClassifier#1',
  'dropout_1_rate'): 0.01904463205949069,
 ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dense_1_units'): 218}

We can also check the template name that is used to generate the pipeline:

In [19]:
pipeline.template_name

'lstm_with_unstack'

We can also see the obtained cross validation score by looking at the `cv_score` attribute of the
`pipeline` object:

In [20]:
pipeline.cv_score

0.6144588856166879

**NOTE**: If the score is not good enough, we can call the `run` method of the `session` again,
specifying the number of iterations, and the tuning process will resume from
the previous results!

In [21]:
session.run(iterations=10)

INFO:btb.session:Generating new proposal configuration for double_lstm_with_unstack
INFO:btb.session:Generating new proposal configuration for lstm_with_unstack
INFO:draco.pipeline:New configuration found:
  Template: lstm_with_unstack 
    Hyperparameters: 
      ('sklearn.impute.SimpleImputer#1', 'strategy'): median
      ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'lstm_1_units'): 369
      ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dropout_1_rate'): 0.40028004355436897
      ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dense_1_units'): 317
INFO:btb.session:New optimal found: lstm_with_unstack - 0.6236073368594607
INFO:btb.session:Generating new proposal configuration for double_lstm_with_unstack
INFO:btb.session:Generating new proposal configuration for lstm_with_unstack
INFO:draco.pipeline:New configuration found:
  Template: lstm_with_unstack 
    Hyperparameters: 
      ('sklearn.impute.SimpleImputer#1', 'strategy'): constant
      ('keras.Sequential.LSTMTimeSeries

{'id': 'be1a837b1ae453865f0cfb43dc358d40',
 'name': 'lstm_with_unstack',
 'config': {('sklearn.impute.SimpleImputer#1', 'strategy'): 'constant',
  ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'lstm_1_units'): 102,
  ('keras.Sequential.LSTMTimeSeriesClassifier#1',
   'dropout_1_rate'): 0.29143210730762764,
  ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dense_1_units'): 425},
 'score': 0.6567251461988304}

In [22]:
pipeline.cv_score

0.6567251461988304

In [23]:
pipeline.get_hyperparameters()

{('sklearn.impute.SimpleImputer#1', 'strategy'): 'constant',
 ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'lstm_1_units'): 102,
 ('keras.Sequential.LSTMTimeSeriesClassifier#1',
  'dropout_1_rate'): 0.29143210730762764,
 ('keras.Sequential.LSTMTimeSeriesClassifier#1', 'dense_1_units'): 425}
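
Because `run` resumes from the previous results, tuning can be repeated in small batches until a target score is reached. The sketch below relies only on `session.run(...)` and `session.best_proposal['score']` as used in this tutorial; `tune_until` is a hypothetical helper, and `FakeSession` is a stand-in so that the example runs without Draco.

```python
def tune_until(session, target_score, step=5, max_iterations=50):
    """Call session.run(step) repeatedly until the best score reaches target_score."""
    done = 0
    while done < max_iterations:
        session.run(step)
        done += step
        if session.best_proposal['score'] >= target_score:
            break
    return session.best_proposal


class FakeSession:
    """Stand-in exposing the two members used above; not the real BTBSession."""

    def __init__(self):
        self.best_proposal = {'score': 0.0}
        self.iterations = 0

    def run(self, iterations):
        # Pretend that the score improves a little with every iteration.
        self.iterations += iterations
        self.best_proposal = {'score': min(1.0, 0.05 * self.iterations)}
        return self.best_proposal


fake_session = FakeSession()
result = tune_until(fake_session, target_score=0.7, step=5)
```

With a real session, the call would look like `tune_until(session, target_score=0.7)`; the `max_iterations` cap keeps the loop from running forever if the target is never reached.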

## 5. Fitting the pipeline

Once we are satisfied with the obtained cross validation score, we can proceed to call
the `fit` method, passing the training partition of the target times together with the readings.

This will fit the pipeline on all the training data using the best template and hyperparameters
found during the tuning process:

In [24]:
pipeline.fit(train, readings)

## 6. Use the fitted pipeline

After fitting the pipeline, we are ready to make predictions on new data:

In [25]:
predictions = pipeline.predict(test, readings)

And evaluate its prediction performance:

In [26]:
from sklearn.metrics import f1_score

f1_score(test['target'], predictions)

0.76
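
The F1 score alone does not show which kind of errors the pipeline makes. As a complement, the sketch below derives precision, recall and F1 from the confusion counts of a binary problem. `binary_report` is a hypothetical helper and the labels are made up; they are not the predictions obtained above.

```python
def binary_report(y_true, y_pred):
    """Return confusion counts, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
            'precision': precision, 'recall': recall, 'f1': f1}


# Made-up labels for illustration only.
report = binary_report(
    [0, 0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 1],
)
```

With the variables of this tutorial, the call would be `binary_report(list(test['target']), list(predictions))`.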

## 7. Save and load the pipeline

Since the tuning and fitting process takes time to execute and requires a lot of data, you
will probably want to save a fitted instance and load it later to analyze new signals
instead of fitting pipelines over and over again.

This can be done by using the `save` and `load` methods from the `DracoPipeline`.

In order to save an instance, call its `save` method passing it the path and filename
where the model should be saved.

In [27]:
path = 'my_pipeline.pkl'

pipeline.save(path)

Once the pipeline is saved, it can be loaded back as a new `DracoPipeline` by using the
`DracoPipeline.load` method:

In [28]:
new_pipeline = DracoPipeline.load(path)

Once loaded, it can be directly used to make predictions on new data.

In [29]:
predictions = new_pipeline.predict(test, readings)
predictions[0:5]

array([0, 0, 0, 1, 0])
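
A common way to combine saving and loading is to fit only when no saved file exists yet. The following is a generic sketch of that pattern: `load_or_fit` is a hypothetical helper, and `pickle` is used here only as a stand-in so that the example runs without Draco. It does not describe how `DracoPipeline.save` stores the pipeline.

```python
import os
import pickle
import tempfile


def load_or_fit(path, fit_fn, save_fn, load_fn):
    """Return load_fn(path) if the file exists; otherwise fit, save and return."""
    if os.path.exists(path):
        return load_fn(path)
    fitted = fit_fn()
    save_fn(fitted, path)
    return fitted


# Stand-ins for the expensive fit and for the save/load calls.
calls = []


def fake_fit():
    calls.append('fit')
    return {'template': 'lstm_with_unstack'}


def pickle_save(obj, path):
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle)


def pickle_load(path):
    with open(path, 'rb') as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'my_pipeline.pkl')
    first = load_or_fit(demo_path, fake_fit, pickle_save, pickle_load)   # fits and saves
    second = load_or_fit(demo_path, fake_fit, pickle_save, pickle_load)  # loads from disk
```

With Draco, `fit_fn` would be a function that fits and returns the pipeline, `save_fn` would call the pipeline's `save` method, and `load_fn` would be `DracoPipeline.load`.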