# Draco Regression Pipeline

In this tutorial we will show you how to use Draco Regression pipelines to solve a Machine Learning problem
defined via a Target Times table.

During the next steps we will:

- Load a demo Remaining Useful Life (RUL) dataset with training and testing target times and readings
- Find available pipelines and load one of them
- Build and fit a Machine Learning pipeline
- Make predictions using the fitted pipeline
- Evaluate how good the predictions are

## 0. Set up logging

This step configures logging in our environment so that we can see
the steps that Draco performs.

In [1]:
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(level=logging.INFO)

import warnings
warnings.simplefilter("ignore")

## 1. Load the Data

The first step is to load the data that we are going to use.

To use the demo data included in Draco, call the `draco.demo.load_demo` function.

In [2]:
from draco.demo import load_demo

train_target_times, test_target_times, readings = load_demo(name='rul')

This will download some demo data from the [Draco S3 demo bucket](https://d3-ai-draco.s3.amazonaws.com/index.html)
and load it as the necessary `target_times` and `readings` tables.

The exact format of these tables is described in the Draco README and docs. The cells below preview each table:

In [3]:
train_target_times.head()

|   | turbine_id | cutoff_time | target |
|---|------------|-------------|--------|
| 0 | 1 | 2013-01-12 04:20:00 | 166 |
| 1 | 1 | 2013-01-12 04:30:00 | 165 |
| 2 | 1 | 2013-01-12 04:40:00 | 164 |
| 3 | 1 | 2013-01-12 04:50:00 | 163 |
| 4 | 1 | 2013-01-12 05:00:00 | 162 |


In [4]:
train_target_times.shape

(18131, 3)

In [5]:
train_target_times.dtypes

turbine_id              int64
cutoff_time    datetime64[ns]
target                  int64
dtype: object

In [6]:
test_target_times.head()

|   | turbine_id | cutoff_time | target |
|---|------------|-------------|--------|
| 0 | 1 | 2013-01-13 13:10:00 | 112.0 |
| 1 | 2 | 2013-01-14 08:00:00 | 98.0 |
| 2 | 3 | 2013-01-14 02:50:00 | 69.0 |
| 3 | 4 | 2013-01-14 01:10:00 | 82.0 |
| 4 | 5 | 2013-01-14 13:10:00 | 91.0 |


In [7]:
test_target_times.shape

(100, 3)

In [8]:
test_target_times.dtypes

turbine_id              int64
cutoff_time    datetime64[ns]
target                float64
dtype: object

In [9]:
readings.head()

|   | turbine_id | timestamp | signal_id | value |
|---|------------|-----------|-----------|-------|
| 0 | 1 | 2013-01-12 00:10:00 | operational setting 1 | -0.0007 |
| 1 | 1 | 2013-01-12 00:20:00 | operational setting 1 | 0.0019 |
| 2 | 1 | 2013-01-12 00:30:00 | operational setting 1 | -0.0043 |
| 3 | 1 | 2013-01-12 00:40:00 | operational setting 1 | 0.0007 |
| 4 | 1 | 2013-01-12 00:50:00 | operational setting 1 | -0.0019 |


In [10]:
readings.shape

(809448, 4)

In [11]:
readings.dtypes

turbine_id             int64
timestamp     datetime64[ns]
signal_id             object
value                float64
dtype: object

### Load your own Dataset

Alternatively, if you want to load your own dataset, all you have to do is load the
`target_times` and `readings` tables as `pandas.DataFrame` objects.

Make sure to parse the corresponding datetime fields!

```python
import pandas as pd

target_times = pd.read_csv('path/to/your/target_times.csv', parse_dates=['cutoff_time'])
readings = pd.read_csv('path/to/your/readings.csv', parse_dates=['timestamp'])
```
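Before fitting on your own data, it can help to confirm that both tables have the expected columns and that the datetime fields were actually parsed. The following is a minimal sketch, not part of Draco: `check_table` is a hypothetical helper, the column names are the ones shown in the demo tables above, and the two small DataFrames are made-up stand-ins for your CSV files.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_table(df, required_columns, datetime_column):
    """Hypothetical helper: fail early if a table does not look as expected."""
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not is_datetime64_any_dtype(df[datetime_column]):
        raise TypeError(f"column '{datetime_column}' was not parsed as datetime")


# Stand-in tables that mirror the demo format (values copied from the previews above).
target_times = pd.DataFrame({
    'turbine_id': [1, 1],
    'cutoff_time': pd.to_datetime(['2013-01-12 04:20:00', '2013-01-12 04:30:00']),
    'target': [166, 165],
})
readings = pd.DataFrame({
    'turbine_id': [1, 1],
    'timestamp': pd.to_datetime(['2013-01-12 00:10:00', '2013-01-12 00:20:00']),
    'signal_id': ['operational setting 1', 'operational setting 1'],
    'value': [-0.0007, 0.0019],
})

check_table(target_times, ['turbine_id', 'cutoff_time', 'target'], 'cutoff_time')
check_table(readings, ['turbine_id', 'timestamp', 'signal_id', 'value'], 'timestamp')
```

If a CSV is read without `parse_dates`, the datetime column stays as text and the second check raises a `TypeError`.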

## 2. Finding the available Pipelines

The next step is to select one of the pipeline templates available in Draco.

For this, we can use the `draco.get_pipelines` function, which returns
the list of all the MLBlocks pipelines available in Draco.

In [12]:
from draco import get_pipelines

get_pipelines()

['dfs_xgb_prob_with_unstack',
 'dfs_xgb_with_normalization',
 'dfs_xgb',
 'dfs_xgb_with_unstack',
 'dfs_xgb_prob_with_unstack_normalization',
 'dfs_xgb_with_unstack_normalization',
 'dfs_xgb_prob_with_double_normalization',
 'dfs_xgb_with_double_normalization',
 'lstm_regressor_with_unstack',
 'lstm_regressor',
 'double_lstm_prob_with_unstack',
 'double_lstm_prob',
 'double_lstm',
 'double_lstm_with_unstack',
 'lstm_prob_with_unstack',
 'lstm_with_unstack',
 'lstm_prob',
 'lstm']

Optionally, we can pass a string to select the pipelines that contain it:

In [13]:
get_pipelines('regressor')

['lstm_regressor_with_unstack', 'lstm_regressor']
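The filtered output above is consistent with a plain substring match on the pipeline names. As an illustration only, and assuming that is how the filter behaves (this is not taken from Draco's source), the same selection can be reproduced on the list printed earlier. The helper `filter_by_substring` is hypothetical.

```python
# Pipeline names copied from the get_pipelines() output above.
pipelines = [
    'dfs_xgb_prob_with_unstack',
    'dfs_xgb_with_normalization',
    'dfs_xgb',
    'dfs_xgb_with_unstack',
    'dfs_xgb_prob_with_unstack_normalization',
    'dfs_xgb_with_unstack_normalization',
    'dfs_xgb_prob_with_double_normalization',
    'dfs_xgb_with_double_normalization',
    'lstm_regressor_with_unstack',
    'lstm_regressor',
    'double_lstm_prob_with_unstack',
    'double_lstm_prob',
    'double_lstm',
    'double_lstm_with_unstack',
    'lstm_prob_with_unstack',
    'lstm_with_unstack',
    'lstm_prob',
    'lstm',
]


def filter_by_substring(names, pattern):
    """Hypothetical helper: keep the names that contain `pattern`."""
    return [name for name in names if pattern in name]


regressors = filter_by_substring(pipelines, 'regressor')
print(regressors)  # ['lstm_regressor_with_unstack', 'lstm_regressor']
```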

We will use the regression pipeline `lstm_regressor_with_unstack`.

The `lstm_regressor_with_unstack` pipeline contains the following steps:

- Resample the data using a 10-minute average aggregation
- Unstack the data by signal, so each signal is in a different column
- Impute missing values in the readings table
- Normalize (scale) the data to the range [-1, 1]
- Create window sequences using target times
- Apply an LSTM Regressor
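
To make the preprocessing steps listed above more concrete, here is a rough pandas sketch of what each one does to a toy readings table. It is not the code that the pipeline runs: the toy data, the fill strategy (forward then backward fill), the window size and the rule for placing a window relative to its cutoff time are all assumptions chosen for illustration, and a single turbine is used for brevity.

```python
import numpy as np
import pandas as pd

# Toy long-format readings: one turbine, two made-up signals, one value every 5 minutes.
timestamps = pd.date_range('2013-01-12 00:00:00', periods=12, freq='5min')
readings = pd.DataFrame({
    'turbine_id': 1,
    'timestamp': list(timestamps) * 2,
    'signal_id': ['sensor_a'] * 12 + ['sensor_b'] * 12,
    'value': np.arange(24, dtype=float),
})

# 1. Resample each signal to 10-minute averages.
resampled = (
    readings.set_index('timestamp')
    .groupby('signal_id')['value']
    .resample('10min')
    .mean()
)

# 2. Unstack by signal so that each signal becomes its own column.
wide = resampled.unstack('signal_id')

# 3. Impute missing values (assumed strategy: forward fill, then backward fill).
wide = wide.ffill().bfill()

# 4. Scale each column linearly to the range [-1, 1].
scaled = 2 * (wide - wide.min()) / (wide.max() - wide.min()) - 1

# 5. Build one window per cutoff time: the last `window_size` rows up to the cutoff.
window_size = 3  # assumed value
cutoff_times = pd.to_datetime(['2013-01-12 00:30:00', '2013-01-12 00:50:00'])
windows = np.stack([
    scaled.loc[:cutoff].tail(window_size).to_numpy()
    for cutoff in cutoff_times
])

print(windows.shape)  # (number of cutoffs, window_size, number of signals)
```

The resulting three-dimensional array, with one window of scaled signal values per target time, is the kind of input that a recurrent model such as an LSTM regressor typically consumes in the final step.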

In [14]:
pipeline_name = 'lstm_regressor_with_unstack'

## 3. Fitting a Draco Pipeline

Once we have loaded the data, we create a **DracoPipeline** instance by passing it `pipeline_name`. This argument can be the name of a pipeline, the path to a template JSON file, or a list that combines both.

In [15]:
from draco.pipeline import DracoPipeline

pipeline = DracoPipeline(pipeline_name)

To train the pipeline, we call its `fit` method, passing the `target_times` and `readings` tables:

In [16]:
pipeline.fit(train_target_times, readings)

2022-02-01 15:05:13.365367: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-01 15:05:13.379993: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe6a0ec50a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-02-01 15:05:13.380010: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version


## 4. Use the fitted pipeline

After fitting the pipeline, we are ready to make predictions on new data:

In [17]:
predictions = pipeline.predict(test_target_times, readings)

We can then evaluate the predictions against the true targets using the R² score:

In [18]:
from sklearn.metrics import r2_score

r2_score(test_target_times['target'], predictions)

0.6362969806460871
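
For reference, the R² score (coefficient of determination) compares the squared prediction errors with the variance of the true targets: 1.0 means perfect predictions, and 0.0 means no better than always predicting the mean. The sketch below computes it by hand on made-up numbers, alongside the mean absolute error, which is easier to read in the units of the target. The helper names `r2` and `mean_abs_error` are hypothetical and the values are not taken from the demo dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(r2(y_true, y_pred))              # about 0.9486
print(mean_abs_error(y_true, y_pred))  # 0.5
```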

## 5. Save and load the pipeline

Because tuning and fitting a pipeline takes time and requires a lot of data, you
will probably want to save a fitted instance and load it later to analyze new signals,
instead of fitting the pipeline again each time.

This can be done with the `save` and `load` methods of `DracoPipeline`.

In order to save an instance, call its `save` method passing it the path and filename
where the model should be saved.

In [19]:
path = 'my_pipeline.pkl'

pipeline.save(path)

Once the pipeline is saved, it can be loaded back as a new `DracoPipeline` by using the
`DracoPipeline.load` method:

In [20]:
new_pipeline = DracoPipeline.load(path)

Once loaded, it can be directly used to make predictions on new data.

In [21]:
predictions = new_pipeline.predict(test_target_times, readings)
predictions[0:5]

array([[129.89064 ],
       [139.89001 ],
       [ 39.425865],
       [110.67838 ],
       [ 98.52903 ]], dtype=float32)
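
Note that the predictions come back as a two-dimensional array with one column. To compare them with the targets row by row, it is convenient to flatten them first. The sketch below uses the five predictions printed above and the first five targets shown earlier in `test_target_times.head()`, assuming the predictions are returned in the same order as the rows of the target times table.

```python
import numpy as np

# The five predictions printed above, as an (n, 1) float32 array.
predictions = np.array(
    [[129.89064], [139.89001], [39.425865], [110.67838], [98.52903]],
    dtype=np.float32,
)

# First five targets from test_target_times.head().
targets = np.array([112.0, 98.0, 69.0, 82.0, 91.0])

flat = predictions.ravel()           # shape (5,) instead of (5, 1)
abs_errors = np.abs(flat - targets)  # per-row absolute error

for target, pred, err in zip(targets, flat, abs_errors):
    print(f"target={target:6.1f}  predicted={pred:8.2f}  abs_error={err:6.2f}")
```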