# Feature Engineering
In this tutorial, we will show you how to use zephyr_ml to create EntitySets, generate label times, and do automated feature engineering. This tutorial assumes you have a folder with mostly pre-processed data in separate CSV files. If necessary, please update the steps and paths below.

## 1) Create EntitySet
zephyr_ml has strict assumptions about the data passed into its `create_pidata_entityset` and `create_scada_entityset` functions. It's the user's responsibility to apply the necessary pre-processing steps to get data into a format acceptable for zephyr_ml. 

For example, the demo PI data needs to be converted from a long `tag`/`val` format (one row per reading) to a tabular format with one column per tag. The `turbine` column also needs to be renamed to `COD_ELEMENT` to match the rest of the data.

In [1]:
import pandas as pd
from os import path

data_path = 'data'

pidata_df = pd.read_csv(path.join(data_path, 'pidata.csv'))
pidata_df

Unnamed: 0,time,turbine,tag,val
0,2022-01-02 13:21:01,0,T0.val1,9872.0
1,2022-01-02 13:21:01,0,T0.val2,10.0
2,2022-03-08 13:21:01,0,T0.val1,559.0
3,2022-03-08 13:21:01,0,T0.val2,-7.0


In [2]:
pidata_df['tag'] = pidata_df['tag'].apply(lambda x: '.'.join(x.split('.')[1:]))
pidata_df = pd.pivot_table(pidata_df, index=['time', 'turbine'],
                            columns=['tag']).droplevel(0, axis=1).reset_index()
pidata_df.rename(columns={'turbine': 'COD_ELEMENT'}, inplace=True)
pidata_df

tag,time,COD_ELEMENT,val1,val2
0,2022-01-02 13:21:01,0,9872.0,10.0
1,2022-03-08 13:21:01,0,559.0,-7.0
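
To make the reshaping in the cell above concrete, here is a plain-Python sketch of the same two steps on the demo rows: dropping the turbine prefix from each tag, then collecting one record per `(time, turbine)` pair. The helper names `strip_prefix` and `long_to_wide` are illustrative only and are not part of zephyr_ml or pandas.

```python
def strip_prefix(tag):
    """Drop the first dot-separated segment, e.g. 'T0.val1' -> 'val1'."""
    return '.'.join(tag.split('.')[1:])


def long_to_wide(rows):
    """Collect (time, turbine, tag, val) rows into one dict per (time, turbine)."""
    wide = {}
    for time, turbine, tag, val in rows:
        record = wide.setdefault(
            (time, turbine), {'time': time, 'COD_ELEMENT': turbine}
        )
        # If the same tag appears twice for a key, the last value wins.
        record[strip_prefix(tag)] = val
    return list(wide.values())


# The four demo rows shown in the first output above.
rows = [
    ('2022-01-02 13:21:01', 0, 'T0.val1', 9872.0),
    ('2022-01-02 13:21:01', 0, 'T0.val2', 10.0),
    ('2022-03-08 13:21:01', 0, 'T0.val1', 559.0),
    ('2022-03-08 13:21:01', 0, 'T0.val2', -7.0),
]

wide_rows = long_to_wide(rows)
for record in wide_rows:
    print(record)
```

One difference worth keeping in mind: `pd.pivot_table` averages duplicate entries by default, whereas this sketch simply keeps the last value seen.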


Once the necessary preprocessing steps have been done, the dataframes can be passed to the respective create EntitySet function. The keys used for the data dictionary are significant and must match the ones used in this example. Default column names and entity keyword arguments can be overridden by passing in a dictionary that maps entity names to the keyword arguments used when adding the dataframe to the EntitySet.

In [3]:
from zephyr_ml import create_pidata_entityset

data = {
    'turbines': pd.read_csv(path.join(data_path, 'turbines.csv')),
    'alarms': pd.read_csv(path.join(data_path, 'alarms.csv')),
    'stoppages': pd.read_csv(path.join(data_path, 'stoppages.csv')),
    'work_orders': pd.read_csv(path.join(data_path, 'work_orders.csv')),
    'notifications': pd.read_csv(path.join(data_path, 'notifications.csv')),
    'pidata': pidata_df
}

pidata_es = create_pidata_entityset(data)
pidata_es

Entityset: PI data
  DataFrames:
    turbines [Rows: 1, Columns: 10]
    alarms [Rows: 2, Columns: 10]
    stoppages [Rows: 2, Columns: 16]
    work_orders [Rows: 2, Columns: 20]
    notifications [Rows: 2, Columns: 15]
    pidata [Rows: 2, Columns: 5]
  Relationships:
    alarms.COD_ELEMENT -> turbines.COD_ELEMENT
    stoppages.COD_ELEMENT -> turbines.COD_ELEMENT
    work_orders.COD_ELEMENT -> turbines.COD_ELEMENT
    pidata.COD_ELEMENT -> turbines.COD_ELEMENT
    notifications.COD_ORDER -> work_orders.COD_ORDER
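
Because the dictionary keys matter, a quick check before building the EntitySet can catch a misspelled key early. The sketch below is a hypothetical helper, not a zephyr_ml function, and it assumes the six keys used in this tutorial are the complete expected set for PI data.

```python
# Keys used in this tutorial's PI data example. Treating them as the full
# set expected by create_pidata_entityset is an assumption of this sketch.
EXPECTED_PIDATA_KEYS = (
    'turbines', 'alarms', 'stoppages', 'work_orders', 'notifications', 'pidata',
)


def check_data_keys(data, expected=EXPECTED_PIDATA_KEYS):
    """Return (missing, unexpected) key lists for a data dictionary."""
    missing = [key for key in expected if key not in data]
    unexpected = [key for key in data if key not in expected]
    return missing, unexpected


# Placeholder values stand in for the dataframes loaded in the tutorial;
# 'pi_data' is a deliberate misspelling of 'pidata'.
example = {'turbines': None, 'alarms': None, 'stoppages': None, 'pi_data': None}
missing, unexpected = check_data_keys(example)
print('missing:', missing)
print('unexpected:', unexpected)
```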

## 2) Generating Labels and Cutoff Times
The `DataLabeler` is used to generate labels and label times for an EntitySet. It is instantiated with a labeling function, and labels can be generated by calling the `generate_label_times` method. The list of available labeling functions can be found using `zephyr_ml.labeling.get_labeling_functions()`. Custom labeling functions can also be created, provided they follow the expected format of returning the deserialized dataframe, the actual labeling function to use for the dataslice, and additional metadata.

In [4]:
from zephyr_ml import DataLabeler, labeling

data_labeler = DataLabeler(labeling.total_power_loss)

label_times, _ = data_labeler.generate_label_times(pidata_es)
label_times

Unnamed: 0,COD_ELEMENT,time,label
0,0,2022-01-01,45801.0


## 3) Feature Engineering with Featuretools
Using EntitySets and LabelTimes allows us to easily use Featuretools for automatic feature generation. For example, we can set interesting categorical values in our EntitySet and use them to generate aggregation features grouped by those interesting values. We can also set which primitives we want to use and control which columns and entities those primitives can be applied to. Featuretools can also use label times as cutoff times, ensuring that data after the label times is not used in feature generation. 

For additional help using Featuretools, please see the documentation: https://featuretools.alteryx.com/en/stable/index.html

In [5]:
import featuretools as ft

interesting_alarms = ['Alarm1', 'Alarm2']
pidata_es.add_interesting_values(dataframe_name='alarms', values={'DES_NAME': interesting_alarms})

feature_matrix, features = ft.dfs(
    entityset=pidata_es,
    target_dataframe_name='turbines',
    cutoff_time_in_index=True,
    cutoff_time=label_times,
    where_primitives=['count', 'sum'],
    agg_primitives=['count', 'min', 'max', 'sum'],
    trans_primitives=['num_words'],
    ignore_dataframes=['notifications', 'work_orders']    
)

features

[<Feature: TURBINE_PI_ID>,
 <Feature: TURBINE_LOCAL_ID>,
 <Feature: TURBINE_SAP_COD>,
 <Feature: DES_CORE_ELEMENT>,
 <Feature: SITE>,
 <Feature: DES_CORE_PLANT>,
 <Feature: COD_PLANT_SAP>,
 <Feature: PI_COLLECTOR_SITE_NAME>,
 <Feature: PI_LOCAL_SITE_NAME>,
 <Feature: COUNT(alarms)>,
 <Feature: MAX(alarms.IND_DURATION)>,
 <Feature: MIN(alarms.IND_DURATION)>,
 <Feature: SUM(alarms.IND_DURATION)>,
 <Feature: COUNT(stoppages)>,
 <Feature: MAX(stoppages.COD_WO)>,
 <Feature: MAX(stoppages.IND_DURATION)>,
 <Feature: MAX(stoppages.IND_LOST_GEN)>,
 <Feature: MIN(stoppages.COD_WO)>,
 <Feature: MIN(stoppages.IND_DURATION)>,
 <Feature: MIN(stoppages.IND_LOST_GEN)>,
 <Feature: SUM(stoppages.COD_WO)>,
 <Feature: SUM(stoppages.IND_DURATION)>,
 <Feature: SUM(stoppages.IND_LOST_GEN)>,
 <Feature: COUNT(pidata)>,
 <Feature: MAX(pidata.val1)>,
 <Feature: MAX(pidata.val2)>,
 <Feature: MIN(pidata.val1)>,
 <Feature: MIN(pidata.val2)>,
 <Feature: SUM(pidata.val1)>,
 <Feature: SUM(pidata.val2)>,
 <Feature: COU

In [6]:
feature_matrix

COD_ELEMENT,time,TURBINE_PI_ID,TURBINE_LOCAL_ID,TURBINE_SAP_COD,DES_CORE_ELEMENT,SITE,DES_CORE_PLANT,COD_PLANT_SAP,PI_COLLECTOR_SITE_NAME,PI_LOCAL_SITE_NAME,COUNT(alarms),...,MAX(stoppages.NUM_WORDS(DES_COMMENTS)),MAX(stoppages.NUM_WORDS(DES_DESCRIPTION)),MAX(stoppages.NUM_WORDS(DES_WO_NAME)),MIN(stoppages.NUM_WORDS(DES_COMMENTS)),MIN(stoppages.NUM_WORDS(DES_DESCRIPTION)),MIN(stoppages.NUM_WORDS(DES_WO_NAME)),SUM(stoppages.NUM_WORDS(DES_COMMENTS)),SUM(stoppages.NUM_WORDS(DES_DESCRIPTION)),SUM(stoppages.NUM_WORDS(DES_WO_NAME)),label
0,2022-01-01,TA00,A0,LOC000,T00,LOCATION,LOC,ABC,LOC0,LOC0,1,...,4.0,2.0,3.0,4.0,2.0,3.0,4.0,2.0,3.0,45801.0
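
The role of cutoff times in the `ft.dfs` call above can be pictured with a small plain-Python sketch: only events at or before a turbine's cutoff contribute to its aggregate. The events and the helper `count_before_cutoff` are hypothetical, and treating a row stamped exactly at the cutoff as included (`<=`) is an assumption here; the Featuretools documentation describes the actual rule.

```python
from datetime import datetime


def count_before_cutoff(events, turbine, cutoff):
    """Count a turbine's events whose timestamp is at or before the cutoff."""
    return sum(
        1 for cod_element, time in events
        if cod_element == turbine and time <= cutoff
    )


# Hypothetical alarm events as (COD_ELEMENT, time) pairs.
alarms = [
    (0, datetime(2021, 12, 20)),
    (0, datetime(2022, 1, 15)),   # after the cutoff below, so it is ignored
    (1, datetime(2021, 11, 5)),   # belongs to a different turbine
]

cutoff = datetime(2022, 1, 1)
print(count_before_cutoff(alarms, turbine=0, cutoff=cutoff))
```

Leaving out everything after the cutoff is what keeps information from the labeled period from leaking into the features.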


## 4) Feature Engineering with SigPro

Use [SigPro](https://github.com/sintel-dev/SigPro) to process PI or SCADA signals.

Signals are processed by specifying the `transformations` and `aggregations` we wish to apply to the data. To see some of the readily available primitives, we use the `get_primitives` function from `SigPro`.

In [7]:
from sigpro import get_primitives

get_primitives()

['sigpro.SigPro',
 'sigpro.aggregations.amplitude.statistical.crest_factor',
 'sigpro.aggregations.amplitude.statistical.kurtosis',
 'sigpro.aggregations.amplitude.statistical.mean',
 'sigpro.aggregations.amplitude.statistical.rms',
 'sigpro.aggregations.amplitude.statistical.skew',
 'sigpro.aggregations.amplitude.statistical.std',
 'sigpro.aggregations.amplitude.statistical.var',
 'sigpro.aggregations.frequency.band.band_mean',
 'sigpro.transformations.amplitude.identity.identity',
 'sigpro.transformations.amplitude.spectrum.power_spectrum',
 'sigpro.transformations.frequency.band.frequency_band',
 'sigpro.transformations.frequency.fft.fft',
 'sigpro.transformations.frequency.fft.fft_real',
 'sigpro.transformations.frequency_time.stft.stft',
 'sigpro.transformations.frequency_time.stft.stft_real']

Suppose we are interested in finding the amplitude mean for each month of readings in the signal. We first specify the `name` and respective `primitive` we want to apply for both `transformations` and `aggregations`.

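In this case, we are interested in an identity transformation and a mean aggregation. Note that the transformation below is named `fft` even though it applies the identity primitive; the name is only a label, and it appears in the output column name (`fft.mean.mean_value`).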

In [8]:
aggregations = [{
    "name":"mean",
    "primitive":"sigpro.aggregations.amplitude.statistical.mean"
}]

transformations = [{
    "name":"fft",
    "primitive":"sigpro.transformations.amplitude.identity.identity"
}]
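
Since each entry is just a `name` plus a primitive path, a typo in the path is easy to make. The sketch below is a hypothetical helper, not part of SigPro, that checks a list of specifications against a list of known primitive names. The `available` list here is copied by hand from the `get_primitives()` output above; in practice the result of `get_primitives()` could be passed instead.

```python
def unknown_primitives(specs, available):
    """Return the primitive paths in specs that are not in the available list."""
    return [
        spec["primitive"] for spec in specs
        if spec["primitive"] not in available
    ]


# A subset of the names printed by get_primitives() earlier in this tutorial.
available = [
    "sigpro.aggregations.amplitude.statistical.mean",
    "sigpro.aggregations.amplitude.statistical.std",
    "sigpro.transformations.amplitude.identity.identity",
    "sigpro.transformations.frequency.fft.fft",
]

example_specs = [
    {"name": "mean", "primitive": "sigpro.aggregations.amplitude.statistical.mean"},
    # Deliberately not in the list above, to show what the check reports.
    {"name": "median", "primitive": "sigpro.aggregations.amplitude.statistical.median"},
]

print(unknown_primitives(example_specs, available))
```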

We use the `process_signals` function to accomplish our goal. We pass the following:
- `es`: the EntitySet we are working with.
- `signal_dataframe_name`: the name of the signal dataframe, either `pidata` or `scada`.
- `signal_column`: the name of the signal column in the dataframe.
- `transformations`: the list of transformations defined above.
- `aggregations`: the list of aggregations defined above.
- `window_size`: the size of the bin over which to process the signals, e.g. one month.
- `replace_dataframe`: whether to replace the current dataframe or add the result as a new one.

In [9]:
from zephyr_ml.feature_engineering import process_signals

process_signals(es=pidata_es, 
                signal_dataframe_name='pidata', 
                signal_column='val1', 
                transformations=transformations, 
                aggregations=aggregations,
                window_size='1m', 
                replace_dataframe=False)

pidata_es['pidata_processed']



Unnamed: 0,_index,COD_ELEMENT,time,fft.mean.mean_value
0,0,0,2022-01-31,9872.0
1,1,0,2022-02-28,
2,2,0,2022-03-31,559.0


Based on our original observations of `val1`, we now have `pidata_processed` with an entry for each month and the respective mean value of observations we see in that month.

**Note**: for months with no observations, such as February above, the value is null.
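
To illustrate what the monthly windowing amounts to, here is a plain-Python sketch that bins the two demo `val1` readings by calendar month and averages each bin. It is not how SigPro or zephyr_ml implement the computation; `monthly_means` is a hypothetical helper, and months are keyed by `(year, month)` rather than by month-end date.

```python
from datetime import datetime


def monthly_means(readings, start, end):
    """Mean value per calendar month from start to end (inclusive).

    start and end are (year, month) tuples. Months without readings map to None.
    """
    buckets = {}
    year, month = start
    while (year, month) <= end:
        buckets[(year, month)] = []
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

    for time, value in readings:
        key = (time.year, time.month)
        if key in buckets:
            buckets[key].append(value)

    return {
        key: (sum(values) / len(values) if values else None)
        for key, values in buckets.items()
    }


# The two val1 observations from the demo PI data.
readings = [
    (datetime(2022, 1, 2, 13, 21, 1), 9872.0),
    (datetime(2022, 3, 8, 13, 21, 1), 559.0),
]

result = monthly_means(readings, start=(2022, 1), end=(2022, 3))
for month, mean in result.items():
    print(month, mean)
```

January and March each hold a single reading, so their means equal that reading, while February has none and maps to `None`, mirroring the null in the table above.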