# Feature Engineering
In this tutorial, we show how to use `zephyr_ml`'s `Zephyr` class to create EntitySets, generate label times, and perform automated feature engineering. The tutorial assumes you have a folder containing mostly pre-processed data in separate CSV files. If necessary, update the steps and paths below.

## 1) Create EntitySet
`zephyr_ml` makes strict assumptions about the data passed into its `generate_entityset` method. It is the user's responsibility to apply the pre-processing steps needed to get the data into a format that `zephyr_ml` accepts.

For example, the demo PI data needs to be converted from a long `tag`/`val` format to a tabular format with one column per signal. The `turbine` column also needs to be renamed to `COD_ELEMENT` to match the rest of the data.

In [1]:
import pandas as pd
from os import path

data_path = 'data'

pidata_df = pd.read_csv(path.join(data_path, 'pidata.csv'))
pidata_df

Unnamed: 0,time,turbine,tag,val
0,2022-01-02 13:21:01,0,T0.val1,9872.0
1,2022-01-02 13:21:01,0,T0.val2,10.0
2,2022-03-08 13:21:01,0,T0.val1,559.0
3,2022-03-08 13:21:01,0,T0.val2,-7.0


In [2]:
pidata_df['tag'] = pidata_df['tag'].apply(lambda x: '.'.join(x.split('.')[1:]))
pidata_df = pd.pivot_table(pidata_df, index=['time', 'turbine'],
                            columns=['tag']).droplevel(0, axis=1).reset_index()
pidata_df.rename(columns={'turbine': 'COD_ELEMENT'}, inplace=True)
pidata_df

tag,time,COD_ELEMENT,val1,val2
0,2022-01-02 13:21:01,0,9872.0,10.0
1,2022-03-08 13:21:01,0,559.0,-7.0
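
To make the reshaping step concrete without the CSV files, here is a minimal pure-Python sketch of what the two operations in the cell above do: strip the turbine prefix from each tag, then group the long rows into one wide row per `(time, turbine)` pair. The rows are copied from the demo output above; `strip_turbine_prefix` and `wide_rows` are illustrative names, not part of `zephyr_ml` or pandas.

```python
from collections import defaultdict

# Long-format rows from the demo pidata: (time, turbine, tag, val).
long_rows = [
    ("2022-01-02 13:21:01", 0, "T0.val1", 9872.0),
    ("2022-01-02 13:21:01", 0, "T0.val2", 10.0),
    ("2022-03-08 13:21:01", 0, "T0.val1", 559.0),
    ("2022-03-08 13:21:01", 0, "T0.val2", -7.0),
]


def strip_turbine_prefix(tag):
    """Drop everything up to the first dot, e.g. 'T0.val1' -> 'val1'."""
    return ".".join(tag.split(".")[1:])


# Collect the signals of each (time, turbine) pair into one dictionary.
wide = defaultdict(dict)
for time, turbine, tag, val in long_rows:
    wide[(time, turbine)][strip_turbine_prefix(tag)] = val

# One wide row per pair, with the turbine column renamed to COD_ELEMENT.
wide_rows = [
    {"time": time, "COD_ELEMENT": turbine, **signals}
    for (time, turbine), signals in sorted(wide.items())
]

for row in wide_rows:
    print(row)
```

The pandas version in the cell above does the same thing in two calls and is what the rest of the tutorial relies on.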


Once the necessary preprocessing steps are done, the dataframes can be passed to `generate_entityset`. The keys of the data dictionary are significant and must match the ones used in this example. Default column names and entity keyword arguments can be overridden by passing in a dictionary that maps entity names to the keyword arguments used when adding each dataframe to the EntitySet.

In [3]:
from zephyr_ml import Zephyr

zephyr = Zephyr()
data = {
    'turbines': pd.read_csv(path.join(data_path, 'turbines.csv')),
    'alarms': pd.read_csv(path.join(data_path, 'alarms.csv')),
    'stoppages': pd.read_csv(path.join(data_path, 'stoppages.csv')),
    'work_orders': pd.read_csv(path.join(data_path, 'work_orders.csv')),
    'notifications': pd.read_csv(path.join(data_path, 'notifications.csv')),
    'pidata': pidata_df
}

pidata_es = zephyr.generate_entityset(dfs=data, es_type="pidata")
pidata_es

[GUIDE] Successfully performed generate_entityset.
	You can perform the next step by calling generate_label_times.


Entityset: pidata
  DataFrames:
    turbines [Rows: 1, Columns: 10]
    alarms [Rows: 2, Columns: 10]
    stoppages [Rows: 2, Columns: 16]
    work_orders [Rows: 2, Columns: 20]
    notifications [Rows: 2, Columns: 15]
    pidata [Rows: 2, Columns: 5]
  Relationships:
    alarms.COD_ELEMENT -> turbines.COD_ELEMENT
    stoppages.COD_ELEMENT -> turbines.COD_ELEMENT
    work_orders.COD_ELEMENT -> turbines.COD_ELEMENT
    pidata.COD_ELEMENT -> turbines.COD_ELEMENT
    notifications.COD_ORDER -> work_orders.COD_ORDER
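
Because the dictionary keys are significant, a small check before building the EntitySet can catch a misspelled key early. This is a sketch only: the expected key set is copied from the example above, and `check_entity_keys` is a hypothetical helper, not a `zephyr_ml` function.

```python
# Keys used for the pidata EntitySet in the example above.
EXPECTED_PIDATA_KEYS = {
    "turbines", "alarms", "stoppages", "work_orders", "notifications", "pidata",
}


def check_entity_keys(data, expected=EXPECTED_PIDATA_KEYS):
    """Return (missing, unexpected) key lists for a data dictionary."""
    provided = set(data)
    return sorted(expected - provided), sorted(provided - expected)


# Example with a misspelled key; the values would normally be dataframes.
candidate = dict.fromkeys(
    ["turbines", "alarms", "stoppages", "workorders", "notifications", "pidata"]
)
missing, unexpected = check_entity_keys(candidate)
print("missing:", missing)
print("unexpected:", unexpected)
```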

To visualize the EntitySet and its relationships, use its `.plot` method.

In [4]:
# pidata_es.plot('viz.png')

## 2) Generating Labels and Cutoff Times
Labels and label times for an EntitySet are generated with the `generate_label_times` method, which is given the labeling function to use (by name in the cells below). The available labeling functions can be listed with `zephyr.GET_LABELING_FUNCTIONS()`. Custom labeling functions can also be created, provided they follow the expected format: each returns the deserialized dataframe, the actual labeling function to apply to a data slice, and additional metadata.

In [5]:
zephyr.GET_LABELING_FUNCTIONS()

{'brake_pad_presence': {'obj': <function zephyr_ml.labeling.labeling_functions.brake_pad_presence.brake_pad_presence(es, column_map={})>,
  'desc': 'Determines if brake pad present in stoppages.'},
 'converter_replacement_presence': {'obj': <function zephyr_ml.labeling.labeling_functions.converter_replacement_presence.converter_replacement_presence(es, column_map={})>,
  'desc': 'Calculates the converter replacement presence.'},
 'gearbox_replace_presence': {'obj': <function zephyr_ml.labeling.labeling_functions.planet_bearing.gearbox_replace_presence(es, column_map={})>,
  'desc': 'Determines if gearbox replacement/exchange is present in stoppages.'},
 'total_power_loss': {'obj': <function zephyr_ml.labeling.labeling_functions.total_power_loss.total_power_loss(es, column_map={})>,
  'desc': 'Calculates the total power loss over the data slice.'}}

In [6]:
label_times, _ = zephyr.generate_label_times("total_power_loss")
label_times

[GUIDE] Successfully performed generate_label_times.
	You can perform the next step by calling generate_feature_matrix.


Unnamed: 0,COD_ELEMENT,time,label
0,0,2022-01-01,45801.0
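
For the custom labeling functions mentioned at the start of this section, the sketch below shows one possible shape. It mirrors the `(es, column_map)` signature of the built-in functions listed above and returns the three items described in the text. Everything else is an assumption made for illustration: the function name `long_stoppage_count`, the order of the returned items, the metadata keys, the `DAT_START` column, and the use of a plain dictionary as a stand-in for the EntitySet. Check the built-in labeling functions for the exact contract before relying on it.

```python
import pandas as pd


def long_stoppage_count(es, column_map=None):
    """Hypothetical labeling function: count stoppages above a duration threshold."""
    column_map = column_map or {}
    # Column name taken from the stoppages features shown in this tutorial.
    duration = column_map.get("duration", "IND_DURATION")

    def label(data_slice, threshold=1.0, **kwargs):
        # The labeling function proper: maps one data slice to one label value.
        return int((data_slice[duration] > threshold).sum())

    # Assumption: the built-in functions build a combined frame from the
    # EntitySet; here the stoppages dataframe is used directly.
    df = es["stoppages"]

    # Assumption: these metadata keys and the DAT_START column are placeholders.
    meta = {"target_dataframe_index": "COD_ELEMENT", "time_index": "DAT_START"}
    return df, label, meta


# Stand-in for an EntitySet: anything that supports es["stoppages"].
fake_es = {
    "stoppages": pd.DataFrame({
        "COD_ELEMENT": [0, 0],
        "DAT_START": pd.to_datetime(["2022-01-01", "2022-02-01"]),
        "IND_DURATION": [0.5, 4.0],
    })
}

df, label, meta = long_stoppage_count(fake_es)
print(label(df))
```

Keeping the per-slice logic in the inner `label` function lets the same code be applied to each data slice independently.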


## 3) Feature Engineering with SigPro and Featuretools

The feature engineering process in zephyr_ml combines signal processing with SigPro and automated feature generation with Featuretools into a single method, `generate_feature_matrix`. This unified approach allows for efficient processing of both time series signals and relational data.

### Signal Processing with SigPro
To perform signal processing in the `generate_feature_matrix` method, we pass in the following parameters:
- `signal_aggregations`: the specifications of the aggregation primitives.
- `signal_transformations`: the specifications of the transformation primitives.
- `signal_dataframe_name`: the name of the dataframe that holds the signal, either `pidata` or `scada`.
- `signal_column`: the name of the signal column in the dataframe.
- `signal_window_size`: the size of the bin we want to process the signals over, e.g. each month.
- `signal_replace_dataframe`: whether to replace the current dataframe or add the processed signal as a new one.

To look at some of the readily available primitives, we use the `get_primitives` function from `SigPro`.

In [7]:
from sigpro import get_primitives

get_primitives()

['sigpro.SigPro',
 'sigpro.aggregations.amplitude.statistical.crest_factor',
 'sigpro.aggregations.amplitude.statistical.kurtosis',
 'sigpro.aggregations.amplitude.statistical.mean',
 'sigpro.aggregations.amplitude.statistical.rms',
 'sigpro.aggregations.amplitude.statistical.skew',
 'sigpro.aggregations.amplitude.statistical.std',
 'sigpro.aggregations.amplitude.statistical.var',
 'sigpro.aggregations.frequency.band.band_mean',
 'sigpro.transformations.amplitude.identity.identity',
 'sigpro.transformations.amplitude.spectrum.power_spectrum',
 'sigpro.transformations.frequency.band.frequency_band',
 'sigpro.transformations.frequency.fft.fft',
 'sigpro.transformations.frequency.fft.fft_real',
 'sigpro.transformations.frequency.fftfreq.fft_freq',
 'sigpro.transformations.frequency_time.stft.stft',
 'sigpro.transformations.frequency_time.stft.stft_real']

Suppose we are interested in finding the amplitude mean for each month of readings in the signal. We first specify the `name` and respective `primitive` we want to apply for both `transformations` and `aggregations`.

In this case, we are interested in an identity transformation and mean aggregation.

In [8]:
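# The "name" values are labels that show up in the output column name
# (here "fft.mean.mean_value"); the transformation applied is the identity.
signal_aggregations = [{
    "name": "mean",
    "primitive": "sigpro.aggregations.amplitude.statistical.mean"
}]

signal_transformations = [{
    "name": "fft",
    "primitive": "sigpro.transformations.amplitude.identity.identity"
}]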
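
When building longer specification lists, it is easy to put a transformation primitive into the aggregations list or to mistype a primitive path. The sketch below shows one way to guard against that. `make_specs` is a hypothetical helper, not part of SigPro, and the hard-coded list is a subset of the `get_primitives()` output shown above. In practice the list could come from `get_primitives()` directly.

```python
# Subset of the primitive names returned by get_primitives() above.
AVAILABLE_PRIMITIVES = [
    "sigpro.aggregations.amplitude.statistical.mean",
    "sigpro.aggregations.amplitude.statistical.std",
    "sigpro.transformations.amplitude.identity.identity",
    "sigpro.transformations.frequency.fft.fft",
]


def make_specs(kind, names_to_primitives):
    """Build a list of {"name", "primitive"} specs, checking each primitive's kind.

    kind is either "aggregations" or "transformations".
    """
    prefix = f"sigpro.{kind}."
    specs = []
    for name, primitive in names_to_primitives.items():
        if primitive not in AVAILABLE_PRIMITIVES:
            raise ValueError(f"unknown primitive: {primitive}")
        if not primitive.startswith(prefix):
            raise ValueError(f"{primitive} is not one of the {kind}")
        specs.append({"name": name, "primitive": primitive})
    return specs


# Separate variable names so the specs used in this tutorial are not overwritten.
example_transformations = make_specs(
    "transformations",
    {"identity": "sigpro.transformations.amplitude.identity.identity"},
)
example_aggregations = make_specs(
    "aggregations",
    {
        "mean": "sigpro.aggregations.amplitude.statistical.mean",
        "std": "sigpro.aggregations.amplitude.statistical.std",
    },
)
print(example_transformations)
print(example_aggregations)
```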

### Automated Feature Generation with Featuretools
The `generate_feature_matrix` method also leverages Featuretools to automatically generate features from the previously generated EntitySet, using the label times as cutoff times to ensure temporal validity. For example, we can set interesting categorical values in our EntitySet and use them to generate aggregation features grouped by those values. We can also choose which primitives to use and control which columns and entities those primitives can be applied to.
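
As a rough illustration of what cutoff times and interesting values mean for the generated features, the sketch below recomputes a few aggregations by hand in plain Python. The alarm rows, the `time` column name, and the feature-name strings are made up for illustration and only modeled on Featuretools' naming. This is not how Featuretools computes them internally.

```python
from datetime import datetime

# Hypothetical alarm records for one turbine.
alarms = [
    {"COD_ELEMENT": 0, "time": datetime(2021, 12, 20), "DES_NAME": "Alarm1", "IND_DURATION": 3.0},
    {"COD_ELEMENT": 0, "time": datetime(2021, 12, 28), "DES_NAME": "Alarm2", "IND_DURATION": 5.0},
    {"COD_ELEMENT": 0, "time": datetime(2022, 1, 15), "DES_NAME": "Alarm1", "IND_DURATION": 7.0},
]

cutoff = datetime(2022, 1, 1)          # label time used as the cutoff time
interesting = ["Alarm1", "Alarm2"]     # interesting values of DES_NAME

# Temporal validity: only rows at or before the cutoff may contribute.
visible = [a for a in alarms if a["time"] <= cutoff]

features = {"COUNT(alarms)": len(visible)}
for name in interesting:
    # "Where" aggregations: restrict to rows matching one interesting value.
    matching = [a for a in visible if a["DES_NAME"] == name]
    features[f"COUNT(alarms WHERE DES_NAME = {name})"] = len(matching)
    features[f"SUM(alarms.IND_DURATION WHERE DES_NAME = {name})"] = sum(
        a["IND_DURATION"] for a in matching
    )

for key, value in features.items():
    print(key, "=", value)
```

The alarm recorded on 2022-01-15 is excluded from every feature because it occurs after the cutoff, which is what keeps information from the future out of the training data.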

In [9]:
feature_matrix, features, processed_es = zephyr.generate_feature_matrix(
    # signal processing parameters
    signal_dataframe_name = "pidata",
    signal_column = "val1",
    signal_transformations = signal_transformations,
    signal_aggregations = signal_aggregations,
    signal_window_size = "1m",
    signal_replace_dataframe = False,
    
    # feature generation parameters
    target_dataframe_name = "turbines", 
    cutoff_time_in_index=True,
    where_primitives=['count', 'sum'],
    agg_primitives=['count', 'min', 'max', 'sum'],
    trans_primitives=['num_words'],
    ignore_dataframes=['notifications', 'work_orders'],
    add_interesting_values = True,
    interesting_dataframe_name = "alarms",
    interesting_values = {'DES_NAME': ['Alarm1', 'Alarm2']}
)

[GUIDE] Successfully performed generate_feature_matrix.
	You can perform the next step by calling generate_train_test_split.


`generate_feature_matrix` returns three outputs: `feature_matrix`, `features`, and `processed_es`. `feature_matrix` is the generated feature matrix, and `features` is the list of generated feature definitions. `processed_es` is a deep copy of the EntitySet originally generated by our Zephyr instance, extended with the signal processing results and the interesting values.

Based on our original observations of `val1`, we now have `pidata_processed`, with one entry per month holding the mean of the observations in that month.

**Note**: for months with no observations, the value is null.

In [10]:
processed_es["pidata_processed"]

Unnamed: 0,_index,COD_ELEMENT,time,fft.mean.mean_value
0,0,0,2022-01-31,9872.0
1,1,0,2022-02-28,
2,2,0,2022-03-31,559.0
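
To see where the null for February comes from, the monthly binning can be approximated with plain pandas. This sketch only mimics the result for the two `val1` readings in the demo data. It is not how SigPro computes the values, and it labels months by period rather than by month-end date.

```python
import pandas as pd

# The two val1 readings from the demo pidata above.
readings = pd.DataFrame({
    "time": pd.to_datetime(["2022-01-02 13:21:01", "2022-03-08 13:21:01"]),
    "val1": [9872.0, 559.0],
})

# Assign each reading to its calendar month and average within the month.
months = readings["time"].dt.to_period("M")
monthly_mean = readings.groupby(months)["val1"].mean()

# Reindex over every month in the observed range so empty months show up as NaN.
all_months = pd.period_range(months.min(), months.max(), freq="M")
monthly_mean = monthly_mean.reindex(all_months)
print(monthly_mean)
```

February has no readings, so its mean is undefined and appears as NaN, matching the empty value in the output above.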


In [11]:
features

[<Feature: TURBINE_PI_ID>,
 <Feature: TURBINE_LOCAL_ID>,
 <Feature: TURBINE_SAP_COD>,
 <Feature: DES_CORE_ELEMENT>,
 <Feature: SITE>,
 <Feature: DES_CORE_PLANT>,
 <Feature: COD_PLANT_SAP>,
 <Feature: PI_COLLECTOR_SITE_NAME>,
 <Feature: PI_LOCAL_SITE_NAME>,
 <Feature: COUNT(alarms)>,
 <Feature: MAX(alarms.IND_DURATION)>,
 <Feature: MIN(alarms.IND_DURATION)>,
 <Feature: SUM(alarms.IND_DURATION)>,
 <Feature: COUNT(stoppages)>,
 <Feature: MAX(stoppages.COD_WO)>,
 <Feature: MAX(stoppages.IND_DURATION)>,
 <Feature: MAX(stoppages.IND_LOST_GEN)>,
 <Feature: MIN(stoppages.COD_WO)>,
 <Feature: MIN(stoppages.IND_DURATION)>,
 <Feature: MIN(stoppages.IND_LOST_GEN)>,
 <Feature: SUM(stoppages.COD_WO)>,
 <Feature: SUM(stoppages.IND_DURATION)>,
 <Feature: SUM(stoppages.IND_LOST_GEN)>,
 <Feature: COUNT(pidata)>,
 <Feature: MAX(pidata.val1)>,
 <Feature: MAX(pidata.val2)>,
 <Feature: MIN(pidata.val1)>,
 <Feature: MIN(pidata.val2)>,
 <Feature: SUM(pidata.val1)>,
 <Feature: SUM(pidata.val2)>,
 <Feature: COU

In [12]:
feature_matrix

COD_ELEMENT,time,COUNT(alarms),MAX(alarms.IND_DURATION),MIN(alarms.IND_DURATION),SUM(alarms.IND_DURATION),COUNT(stoppages),MAX(stoppages.COD_WO),MAX(stoppages.IND_DURATION),MAX(stoppages.IND_LOST_GEN),MIN(stoppages.COD_WO),MIN(stoppages.IND_DURATION),...,TURBINE_PI_ID_TA00,TURBINE_LOCAL_ID_A0,TURBINE_SAP_COD_LOC000,DES_CORE_ELEMENT_T00,SITE_LOCATION,DES_CORE_PLANT_LOC,COD_PLANT_SAP_ABC,PI_COLLECTOR_SITE_NAME_LOC0,PI_LOCAL_SITE_NAME_LOC0,label
0,2022-01-01,1,,,0.0,1,12345.0,,,12345.0,,...,1,1,1,1,1,1,1,1,1,45801.0
