In [1]:
# %load_ext autoreload
# %autoreload 2

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as ss

In [3]:
import warnings
# Silence all warnings to keep the notebook output readable
# (this blanket filter already covers np.RankWarning).
warnings.filterwarnings('ignore')

import logging, sys
logging.disable(sys.maxsize)

In [4]:
try: 
    from tsad.base.pipeline import Pipeline
    from tsad.base.datasets import load_skab
    from tsad.tasks.eda import HighLevelDatasetAnalysisTask, TimeDiscretizationTask
    # from tsad.tasks.preprocess import FeatureProcessingTask, SplitByNaNTask, TrainTestSplitTask
    from tsad.tasks.feature_generation import FeatureGenerationTask
except ImportError:  # tsad is not installed; fall back to the repository checkout
    import sys
    sys.path.append('../')
    from tsad.base.pipeline import Pipeline
    from tsad.base.datasets import load_skab
    from tsad.tasks.eda import HighLevelDatasetAnalysisTask, TimeDiscretizationTask
    # from tsad.tasks.preprocess import FeatureProcessingTask, SplitByNaNTask, TrainTestSplitTask
    from tsad.tasks.feature_generation import FeatureGenerationTask




# Load Data

In [5]:
dataset = load_skab()
df = dataset.frame
df = df.reset_index(level=[0])
df = df[df['experiment']=='valve1/6']
df = df.drop(columns='experiment')
df.shape

(1154, 10)

In [6]:
# TODO: use a pipeline task to resample the DataFrame
# Downsample to 10-second bins (mean per bin), forward-filling empty bins.
df = df.resample('10s').mean().ffill()
df.shape

(121, 10)

# Feature generation pipeline

The `FeatureGenerationTask` class is a key component of a time series feature engineering pipeline. It is responsible for generating features from time series data according to a user-defined or default configuration.

## Default configuration

In [7]:
features = dataset.feature_names
target = dataset.target_names[0]

__Default feature generation functions__:

By default, the task uses the tsfresh [EfficientFCParameters](https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#tsfresh.feature_extraction.settings.EfficientFCParameters) settings for feature generation. This preset covers the tsfresh feature calculators except those flagged as computationally expensive.


__Default Windows__:

The default window sizes for feature generation are derived from the index frequency of the input DataFrame (_freq_df_), that is, the time step between consecutive rows. The following window sizes are used:

- Window 1: 4 times the index frequency (4 * _freq_df_)
- Window 2: 10 times the index frequency (10 * _freq_df_)

Using one shorter and one longer window lets the features describe both short-term and longer-term behavior of the series.
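As a rough illustration of the rule above, the sketch below derives the two default windows from a 10-second index using plain pandas. It does not call `FeatureGenerationTask`; taking the median spacing between consecutive timestamps as `freq_df` is an assumption about how the index frequency is determined, and the names `freq_df` and `default_windows` are only illustrative.

```python
import pandas as pd

# Synthetic index with the same 10-second step as the resampled DataFrame above.
index = pd.date_range("2020-01-01", periods=121, freq=pd.Timedelta(seconds=10))

# Assumption: the index frequency is the typical spacing between timestamps.
freq_df = index.to_series().diff().median()

# Default windows as described: 4x and 10x the index frequency.
default_windows = [4 * freq_df, 10 * freq_df]
print(default_windows)
```

Under these assumptions, a 10-second index gives windows of 40 seconds and 100 seconds.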

In [8]:
inference_pipeline = Pipeline([
    FeatureGenerationTask(features=features, config=None),
], show=True)
df_fit = inference_pipeline.fit(df)

'Total features generated: 11088'

In [9]:
df_predict = inference_pipeline.predict(df)

'Total features generated: 11088'

## Custom configuration

When performing feature generation on time series data, you have the flexibility to define a custom configuration tailored to your specific needs. This custom configuration allows you to select a set of feature extraction functions, specify the series (columns) to which these functions will be applied, and define the windows for calculating these features.

### Custom Configuration Example

In [10]:
import scipy.stats as ss

from tsflex.chunking import chunk_data
from tsflex.features import FeatureCollection, MultipleFeatureDescriptors, FuncWrapper
from tsflex.features.utils import make_robust

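def slope(x):
    # Relative change between the last and the first sample (0 if the first sample is 0).
    return (x[-1] - x[0]) / x[0] if x[0] else 0

def abs_diff_mean(x):
    # Mean absolute difference between consecutive samples.
    return np.mean(np.abs(x[1:] - x[:-1])) if len(x) > 1 else 0

def diff_std(x):
    # Standard deviation of the differences between consecutive samples.
    return np.std(x[1:] - x[:-1]) if len(x) > 1 else 0

funcs = [make_robust(f) for f in [np.min, np.max, np.std, np.mean, slope, ss.skew, abs_diff_mean, diff_std, sum, len]]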

custom_config = [
    {"functions": funcs,
     'series_names': ['Pressure', 'Temperature'],
     "windows": ["1s", "60s"],
    }
]

- `functions`: This is a list of feature extraction functions that will be applied to the selected series. These functions are defined in the `funcs` list, which includes functions like minimum, maximum, standard deviation, mean, slope, skewness, and more. You can customize this list to include the specific functions that are relevant to your analysis.

- `series_names`: This is a list of column names in your DataFrame to which the feature extraction functions will be applied. In this example, the functions will be applied to the `Pressure` and `Temperature` series. You can modify this list to include the names of the series you want to analyze.

- `windows`: This is a list of window sizes for feature calculation. In this example, two window sizes are specified: "1s" (1 second) and "60s" (60 seconds). These window sizes determine how the time series data will be segmented for feature extraction. Adjust these window sizes based on your analysis requirements.
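Each entry in `functions` is called on the samples of one window and returns a single value. The sketch below applies the three custom functions from the example to a small hand-made window, outside the pipeline and without `make_robust`. It assumes the window arrives as a NumPy array; the values in `window` are invented for illustration.

```python
import numpy as np

def slope(x):
    # Relative change between the last and the first sample.
    return (x[-1] - x[0]) / x[0] if x[0] else 0

def abs_diff_mean(x):
    # Mean absolute difference between consecutive samples.
    return np.mean(np.abs(x[1:] - x[:-1])) if len(x) > 1 else 0

def diff_std(x):
    # Standard deviation of the differences between consecutive samples.
    return np.std(x[1:] - x[:-1]) if len(x) > 1 else 0

window = np.array([1.0, 2.0, 4.0])  # illustrative samples of a single window

print(slope(window))          # (4 - 1) / 1 = 3.0
print(abs_diff_mean(window))  # mean of |1| and |2| = 1.5
print(diff_std(window))       # std of [1, 2] = 0.5
```

Note that `slope` as defined here is a relative change rather than a fitted regression slope, and that a window with a single sample falls back to 0 for the two difference-based features.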

### Feature Extraction Functions

Feature extraction functions compute statistical, temporal, spectral, and other characteristics of time series data. In your feature generation task, you can use functions from libraries such as tsfresh, tsfel, NumPy, and SciPy, as well as your own custom functions.


__Feature Extraction Categories__:

1. _Statistical Features_:
These features capture statistical properties of the time series data.
Common statistical features include mean, median, standard deviation, skewness, kurtosis, variance, and more.
Example: `np.mean`, `tsfresh.feature_extraction.feature_calculators.median`

2. _Temporal Features_:
Temporal features describe patterns over time within the time series.
Examples include autocorrelation, mean absolute difference, mean difference, distance, absolute energy, and more.
Example: `tsfresh.feature_extraction.feature_calculators.autocorrelation`, `tsfel.feature_extraction.features.mean_abs_diff`


3. _Spectral Features_:
Spectral features provide insights into the frequency domain characteristics of the time series.
These features include wavelet entropy, spectral entropy, power spectral density, and more.
Example: `tsfel.feature_extraction.features.wavelet_entropy`, `tsfel.feature_extraction.features.spectral_entropy`

4. _Custom Functions_:
You can define custom feature extraction functions tailored to your specific analysis requirements.
These functions can capture domain-specific insights or unique patterns in the data.
Example: Custom functions like `slope(x)`, `abs_diff_mean(x)`, and `diff_std(x)` defined in code.

5. _External Libraries_:
You can leverage external libraries like tsfresh and tsfel for a wide range of pre-defined feature extraction functions.
These libraries offer functions for calculating advanced features such as entropy, time-domain, and frequency-domain features.
Example: `tsfresh.feature_extraction.feature_calculators.sample_entropy`, `tsfel.feature_extraction.features.abs_energy`
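To make the categories more concrete, the sketch below computes one statistical, two temporal, and one spectral feature on a sine wave and on white noise. These are hand-written NumPy stand-ins for illustration, not the tsfresh or tsfel implementations; the names `lag1_autocorr` and `spectral_entropy_sketch` are hypothetical.

```python
import numpy as np

def mean_abs_diff(x):
    """Temporal: mean absolute difference between consecutive samples."""
    return float(np.mean(np.abs(np.diff(x))))

def lag1_autocorr(x):
    """Temporal: lag-1 autocorrelation of the mean-centred signal."""
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return float(np.dot(xc[:-1], xc[1:]) / denom) if denom else 0.0

def spectral_entropy_sketch(x):
    """Spectral: normalised Shannon entropy of the FFT power spectrum."""
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    total = power.sum()
    if total == 0:
        return 0.0
    p = power / total
    p = p[p > 0]  # skip empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum() / np.log2(len(power)))

t = np.arange(128)
sine = np.sin(2 * np.pi * 8 * t / 128)             # 8 full cycles, one dominant frequency
noise = np.random.default_rng(0).normal(size=128)  # energy spread over all frequencies

for name, signal in [("sine", sine), ("noise", noise)]:
    print(
        name,
        "std:", float(np.std(signal)),             # statistical
        "mean_abs_diff:", mean_abs_diff(signal),   # temporal
        "lag1_autocorr:", lag1_autocorr(signal),   # temporal
        "spectral_entropy:", spectral_entropy_sketch(signal),  # spectral
    )
```

The periodic signal has a high lag-1 autocorrelation and a spectral entropy close to zero, while the noise spreads its energy across many frequencies. This is the kind of contrast that temporal and spectral features are meant to expose.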

In [11]:
# tsfel
from tsfel.feature_extraction.features import (
    # Some temporal features
    autocorr, mean_abs_diff, mean_diff, distance, zero_cross,
    abs_energy, pk_pk_distance, entropy, neighbourhood_peaks,
    # Some statistical features
    interq_range, kurtosis, skewness, calc_max, calc_median, 
    median_abs_deviation, rms, 
    # Some spectral features
    #  -> Almost all are "advanced" features
    wavelet_entropy
)

tsfel_funcs = [
    # Temporal
    autocorr, mean_abs_diff, mean_diff, distance,
    abs_energy, pk_pk_distance, neighbourhood_peaks,
    FuncWrapper(entropy, prob="kde", output_names="entropy_kde"),
    FuncWrapper(entropy, prob="gauss", output_names="entropy_gauss"),
    # Statistical
    interq_range, kurtosis, skewness, calc_max, calc_median, 
    median_abs_deviation, rms,
    # Spectral
    wavelet_entropy,  
]

# tsfresh
from tsfresh.feature_extraction.feature_calculators import (
    cid_ce,
    variance_larger_than_standard_deviation,
)

tsfresh_funcs = [
    variance_larger_than_standard_deviation,
    FuncWrapper(cid_ce, normalize=True),
]

__Choosing Feature Extraction Functions__

When choosing feature extraction functions for your analysis, consider the following factors:

- Relevance: Select functions that are relevant to your analysis goals. For instance, if you're interested in detecting periodicity, consider using autocorrelation or spectral features.

- Computational Efficiency: Consider the computational cost of the functions, especially when dealing with large datasets. Some functions may be computationally expensive.
  
- Domain Knowledge: Leverage your domain knowledge to identify features that have interpretability and meaning in your specific domain.
  
- Customization: Don't hesitate to define custom functions if the standard functions do not capture the patterns you're interested in.

### Applying Custom Configuration

To apply the custom configuration, pass it to the `FeatureGenerationTask` class through the `config` argument. Here's an example that combines the three function lists defined above:

In [12]:
# Define your custom configuration
custom_config = [
    {"functions": funcs,
     'series_names': ['Pressure', 'Temperature'],
     "windows": ["1s", "60s"],
    },
    {"functions": tsfel_funcs,
     'series_names': ['Pressure', 'Temperature', 'Thermocouple', 'Voltage'],
     "windows": ["30s", "60s"],
    },
    {"functions": tsfresh_funcs,
     'series_names': ['Pressure', 'Temperature'],
     "windows": ["15s", "60s"],
    },
]

# Create a FeatureGenerationTask with your custom configuration
inference_pipeline = Pipeline([
    FeatureGenerationTask(features=features, config=custom_config),
], show=True)

# Fit the pipeline to your DataFrame
df_fit = inference_pipeline.fit(df)

'Total features generated: 184'
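The reported total can be checked by hand: each config entry contributes (number of functions) × (number of series) × (number of windows) columns. The sketch below redoes that arithmetic with the counts taken from the configuration above, assuming every function produces exactly one output per series and window; `config_shape` is an illustrative helper structure, not part of the library.

```python
# Counts taken from the three entries of custom_config.
config_shape = [
    {"name": "funcs",         "n_functions": 10, "n_series": 2, "n_windows": 2},
    {"name": "tsfel_funcs",   "n_functions": 17, "n_series": 4, "n_windows": 2},
    {"name": "tsfresh_funcs", "n_functions": 2,  "n_series": 2, "n_windows": 2},
]

# Assumption: one output column per (function, series, window) combination.
per_entry = {
    e["name"]: e["n_functions"] * e["n_series"] * e["n_windows"]
    for e in config_shape
}
total = sum(per_entry.values())

print(per_entry)  # {'funcs': 40, 'tsfel_funcs': 136, 'tsfresh_funcs': 8}
print(total)      # 184
```

Added to the 10 original columns, this gives the 194 columns of the fitted DataFrame.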

In [13]:
df_fit.columns

Index(['Accelerometer1RMS', 'Accelerometer2RMS', 'Current', 'Pressure',
       'Temperature', 'Thermocouple', 'Voltage', 'Volume Flow RateRMS',
       'anomaly', 'changepoint',
       ...
       'Voltage__neighbourhood_peaks__w=1m',
       'Voltage__neighbourhood_peaks__w=30s', 'Voltage__pk_pk_distance__w=1m',
       'Voltage__pk_pk_distance__w=30s', 'Voltage__rms__w=1m',
       'Voltage__rms__w=30s', 'Voltage__skewness__w=1m',
       'Voltage__skewness__w=30s', 'Voltage__wavelet_entropy__w=1m',
       'Voltage__wavelet_entropy__w=30s'],
      dtype='object', length=194)
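The generated columns in this output follow the pattern `<series>__<feature>__w=<window>`, which makes it straightforward to select features by series, function, or window. The sketch below parses a few names copied from the output above; the pattern is inferred from this output rather than from documented behavior, and `parse_feature_name` is a hypothetical helper.

```python
# A few column names copied from the output above.
columns = [
    "Pressure",
    "Voltage",
    "Voltage__rms__w=1m",
    "Voltage__rms__w=30s",
    "Voltage__skewness__w=1m",
    "Voltage__wavelet_entropy__w=30s",
]

def parse_feature_name(column):
    """Split 'series__feature__w=window' into its parts; return None for raw columns."""
    parts = column.split("__")
    if len(parts) != 3 or not parts[2].startswith("w="):
        return None
    series, feature, window = parts
    return series, feature, window[len("w="):]

parsed = {c: parse_feature_name(c) for c in columns}

# Keep only the generated features computed over the 30-second window.
generated_30s = [c for c, p in parsed.items() if p is not None and p[2] == "30s"]
print(generated_30s)
```

Note that the `60s` window from the configuration appears as `w=1m` in the column names, so filters should match the normalized window label.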