# tsfresh - "Time Series Feature Extraction based on Scalable Hypothesis Tests"


## Feature Extraction and Selection

This example shows how to use **tsfresh** to extract useful features from time series and use them to improve classification performance.

We use the robot execution failures data set as an example.

In [None]:
!pip install tsfresh

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt

from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## Load and visualize data

The data set documents 88 robot executions (each has a unique `id` between 1 and 88) and is a subset of the [Robot Execution Failures Data Set](https://archive.ics.uci.edu/dataset/138/robot+execution+failures). For simplicity, we only differentiate between successful and failed executions (`y`).

For each execution, 15 force (F) and torque (T) samples are given, measured at regular time intervals for the spatial dimensions x, y, and z. Each row of the data frame therefore references a specific execution (`id`) and a time index (`time`), and documents the respective measurements of the 6 sensors (`F_x`, `F_y`, `F_z`, `T_x`, `T_y`, `T_z`).

In [4]:
from tsfresh.examples import robot_execution_failures

robot_execution_failures.download_robot_execution_failures()
df, y = robot_execution_failures.load_robot_execution_failures()
df.head(20)



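|   | id | time | F_x | F_y | F_z | T_x | T_y | T_z |
|---|----|------|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 0 | -1 | -1 | 63 | -3 | -1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 62 | -3 | -1 | 0 |
| 2 | 1 | 2 | -1 | -1 | 61 | -3 | 0 | 0 |
| 3 | 1 | 3 | -1 | -1 | 63 | -2 | -1 | 0 |
| 4 | 1 | 4 | -1 | -1 | 63 | -3 | -1 | 0 |
| 5 | 1 | 5 | -1 | -1 | 63 | -3 | -1 | 0 |
| 6 | 1 | 6 | -1 | -1 | 63 | -3 | 0 | 0 |
| 7 | 1 | 7 | -1 | -1 | 63 | -3 | -1 | 0 |
| 8 | 1 | 8 | -1 | -1 | 63 | -3 | -1 | 0 |
| 9 | 1 | 9 | -1 | -1 | 61 | -3 | 0 | 0 |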


In [12]:
df['id'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
       86, 87, 88])

In [7]:
# y holds the labels: True means a successful execution, False a failed one.
y

1      True
2      True
3      True
4      True
5      True
      ...  
84    False
85    False
86    False
87    False
88    False
Length: 88, dtype: bool

## Extract Features

We can use the data to extract time series features using `tsfresh`. We want to extract features for each time series, that means for each robot execution (which is our id) and for each of the measured sensor values (`F_*` and `T_*`).

For machine learning, we need to condense the values of each robot execution into a single row. `tsfresh` produces one row per id and calculates the features for each column separately. It uses methods from

* statistics
* time series analysis
* signal processing

For example, it might take the average of all 15 `F_x` values for id=1.

The `time` column is our sorting column. For an overview on the data formats of `tsfresh`, please have a look at the [documentation](https://tsfresh.readthedocs.io/en/latest/text/data_formats.html).
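To make the "one row per id, one feature per column" idea concrete, here is a minimal sketch in plain pandas. It is not how `tsfresh` works internally; the toy values and the two statistics are made up for illustration, and the column names only imitate the `<sensor>__<feature>` style that `tsfresh` uses.

```python
import pandas as pd

# Toy stand-in for the robot data: two executions, three samples each.
# All values are invented for this illustration.
toy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "time": [0, 1, 2, 0, 1, 2],
    "F_x":  [-1, 0, -1, 2, 4, 6],
    "T_x":  [-3, -3, -2, 0, 1, 5],
})

sensor_columns = ["F_x", "T_x"]

# Sort by the time column, then condense every (id, sensor) series
# into a handful of numbers.
features = (
    toy.sort_values(["id", "time"])
       .groupby("id")[sensor_columns]
       .agg(["mean", "max"])
)

# Flatten the (sensor, statistic) column MultiIndex into names like "F_x__mean".
features.columns = [f"{sensor}__{stat}" for sensor, stat in features.columns]
print(features)
```

The result has one row per `id` and one column per (sensor, statistic) pair, which is the same shape of output that `extract_features` returns, only with far fewer features.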

In [8]:
# We are very explicit here and specify the `default_fc_parameters`. If you remove this argument,
# the ComprehensiveFCParameters (= all feature calculators) will also be used as default.
# Have a look into the documentation (https://tsfresh.readthedocs.io/en/latest/text/feature_extraction_settings.html) to learn more about this.
extraction_settings = ComprehensiveFCParameters()

X = extract_features(df, column_id='id', column_sort='time',
                     default_fc_parameters=extraction_settings,
                     # impute replaces NaN and infinite feature values, so the result has no missing values
                     impute_function=impute)

Feature Extraction: 100%|██████████| 528/528 [00:28<00:00, 18.43it/s]


The `default_fc_parameters` parameter defines which features to extract and potentially any specific parameters for those features. Each feature extractor can have parameters that influence its behavior. By setting the `default_fc_parameters`, you can control which features are extracted and how they are calculated.

**`ComprehensiveFCParameters`** is a dictionary-like class from the tsfresh library; instantiating it yields a comprehensive set of feature calculators together with their default parameters. When you set `default_fc_parameters` to `ComprehensiveFCParameters()`, you instruct tsfresh to extract a wide variety of features from the time series data, using those default parameters for each feature calculator.

* This comprehensive set includes a wide variety of features, including basic statistics, linearity measures, periodicity measures, and more.

* However, it's worth noting that because it is comprehensive, the extraction process can be time-consuming, especially for large datasets.

There are also other predefined parameter sets in tsfresh, such as:

* **`MinimalFCParameters()`:** a small set of features for quick extraction.
* **`EfficientFCParameters()`:** a set of features that balances comprehensiveness and efficiency.
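A settings object of this kind behaves like a dictionary that maps a calculator name to either `None` (no parameters) or a list of parameter sets. The sketch below imitates that mechanism with two hand-written calculators, to show how one settings entry can expand into several feature columns. The `CALCULATORS` registry and the `extract` helper are hypothetical names invented for this example, and the naming rule is only an approximation of what `tsfresh` does.

```python
import numpy as np

# Hypothetical mini-calculators; tsfresh ships its own, much larger collection.
def mean(x):
    return float(np.mean(x))

def quantile(x, q):
    return float(np.quantile(x, q))

CALCULATORS = {"mean": mean, "quantile": quantile}

# Same shape as a tsfresh settings dictionary:
# calculator name -> None, or a list of keyword-argument dictionaries.
settings = {
    "mean": None,
    "quantile": [{"q": 0.1}, {"q": 0.9}],
}

def extract(series, kind, settings):
    """Apply every configured calculator to one series and name the results."""
    features = {}
    for name, parameter_sets in settings.items():
        func = CALCULATORS[name]
        if parameter_sets is None:
            features[f"{kind}__{name}"] = func(series)
            continue
        for params in parameter_sets:
            # Encode the parameters in the feature name, e.g. "quantile__q_0.1".
            suffix = "__".join(f"{key}_{value}" for key, value in sorted(params.items()))
            features[f"{kind}__{name}__{suffix}"] = func(series, **params)
    return features

row = extract(np.array([-1, 0, -1, -1, -1]), "F_x", settings)
print(sorted(row))
```

Trimming the dictionary to the calculators you actually need is the main lever for reducing extraction time.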

`X` now contains a single row for each robot execution (= id), with all the features tsfresh calculated from the measured time series values for this id.

In [9]:
X.head()

Unnamed: 0,F_x__variance_larger_than_standard_deviation,F_x__has_duplicate_max,F_x__has_duplicate_min,F_x__has_duplicate,F_x__sum_values,F_x__abs_energy,F_x__mean_abs_change,F_x__mean_change,F_x__mean_second_derivative_central,F_x__median,...,T_z__fourier_entropy__bins_5,T_z__fourier_entropy__bins_10,T_z__fourier_entropy__bins_100,T_z__permutation_entropy__dimension_3__tau_1,T_z__permutation_entropy__dimension_4__tau_1,T_z__permutation_entropy__dimension_5__tau_1,T_z__permutation_entropy__dimension_6__tau_1,T_z__permutation_entropy__dimension_7__tau_1,T_z__query_similarity_count__query_None__threshold_0.0,T_z__mean_n_absolute_max__number_of_maxima_7
1,0.0,0.0,1.0,1.0,-14.0,14.0,0.142857,0.0,-0.038462,-1.0,...,0.974315,1.288185,1.906155,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0
2,0.0,1.0,1.0,1.0,-13.0,25.0,1.0,0.0,-0.038462,-1.0,...,1.073543,1.494175,2.079442,0.937156,1.234268,1.540306,1.748067,1.83102,0.0,0.571429
3,0.0,0.0,1.0,1.0,-10.0,12.0,0.714286,0.0,-0.038462,-1.0,...,1.386294,1.732868,2.079442,1.265857,1.704551,2.019815,2.163956,2.197225,0.0,0.571429
4,0.0,1.0,1.0,1.0,-6.0,16.0,1.214286,-0.071429,-0.038462,0.0,...,1.073543,1.494175,2.079442,1.156988,1.907284,2.397895,2.302585,2.197225,0.0,1.0
5,0.0,0.0,0.0,1.0,-9.0,17.0,0.928571,-0.071429,0.038462,-1.0,...,0.900256,1.320888,2.079442,1.156988,1.86368,2.271869,2.302585,2.197225,0.0,0.857143


In [14]:
# Each of the 88 rows now holds the features of one robot execution
X.shape

(88, 4698)

4698 is a huge number of features for only 88 samples. We have to do some feature selection to keep only the relevant ones.

## Select Features

Using the hypothesis tests implemented in tsfresh (see [here](https://tsfresh.readthedocs.io/en/latest/text/feature_filtering.html) for more information) it is now possible to select only the relevant features out of this large dataset.

`tsfresh` performs a hypothesis test for each feature to check whether it is relevant for the given target.
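The idea of testing each feature on its own can be sketched as follows. This is a simplified illustration, not the `tsfresh` implementation: it assumes a binary target and real-valued features, uses SciPy's Mann-Whitney U test for every column, and then applies a hand-written Benjamini-Yekutieli step to control the false discovery rate. The helper name `select_by_hypothesis_test`, the toy data, and the `fdr_level` default are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

def select_by_hypothesis_test(X, y, fdr_level=0.05):
    """Keep columns whose values differ significantly between the two classes of y."""
    # One p-value per feature: do the two classes have different distributions?
    p_values = pd.Series({
        col: mannwhitneyu(X.loc[y, col], X.loc[~y, col], alternative="two-sided").pvalue
        for col in X.columns
    }).sort_values()

    # Benjamini-Yekutieli: the i-th smallest p-value is compared with
    # fdr_level * i / (m * (1 + 1/2 + ... + 1/m)).
    m = len(p_values)
    harmonic = np.sum(1.0 / np.arange(1, m + 1))
    thresholds = fdr_level * np.arange(1, m + 1) / (m * harmonic)

    # Keep everything up to the largest rank whose p-value is below its threshold.
    passed = np.nonzero(p_values.to_numpy() <= thresholds)[0]
    n_keep = passed.max() + 1 if passed.size else 0
    return list(p_values.index[:n_keep])

# Invented toy data: "signal" separates the classes perfectly, "noise" does not.
toy_X = pd.DataFrame({
    "signal": [0, 1, 2, 3, 4, 10, 11, 12, 13, 14],
    "noise":  [0, 2, 4, 6, 8, 1, 3, 5, 7, 9],
}, index=range(1, 11))
toy_y = pd.Series([False] * 5 + [True] * 5, index=toy_X.index)

selected = select_by_hypothesis_test(toy_X, toy_y)
print(selected)
```

Because every feature is tested in isolation, this kind of filter is fast, but it cannot detect features that are only useful in combination with others.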

In [15]:
X_filtered = select_features(X, y)


In [16]:
X_filtered.shape

(88, 682)

The number of features drops from 4698 to 682.

## Model Training

In [17]:
# Split the full feature matrix (4698 features) into training and test sets
X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.2)

# Reuse the same split, restricted to the 682 selected features
X_filtered_train, X_filtered_test = X_full_train[X_filtered.columns], X_full_test[X_filtered.columns]

In [18]:
# All features
classifier_full = DecisionTreeClassifier()
classifier_full.fit(X_full_train, y_train)
print(classification_report(y_test, classifier_full.predict(X_full_test)))

              precision    recall  f1-score   support

       False       0.94      1.00      0.97        15
        True       1.00      0.67      0.80         3

    accuracy                           0.94        18
   macro avg       0.97      0.83      0.88        18
weighted avg       0.95      0.94      0.94        18



In [19]:
# Selected features
classifier_filtered = DecisionTreeClassifier()
classifier_filtered.fit(X_filtered_train, y_train)
print(classification_report(y_test, classifier_filtered.predict(X_filtered_test)))

              precision    recall  f1-score   support

       False       0.94      1.00      0.97        15
        True       1.00      0.67      0.80         3

    accuracy                           0.94        18
   macro avg       0.97      0.83      0.88        18
weighted avg       0.95      0.94      0.94        18



Compared to using all features (`classifier_full`), using only the relevant features (`classifier_filtered`) achieves the same classification performance in this run, but with far fewer features.

Please remember that the hypothesis test in `tsfresh` is a statistical test. You might get better performance with other feature selection methods (e.g. training a classifier with all but one feature to find its importance) - but in general the feature selection implemented in `tsfresh` will give you a very reasonable set of selected features.
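The alternative mentioned in parentheses, retraining with all but one feature, can be sketched like this. It is a minimal illustration on invented toy data, not part of `tsfresh`; the helper name `drop_column_importance` is hypothetical, and with thousands of features this approach becomes expensive because it trains one model per feature.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def drop_column_importance(X_train, y_train, X_test, y_test):
    """Test accuracy lost when each feature is left out in turn."""
    def accuracy(columns):
        model = DecisionTreeClassifier(random_state=0)
        model.fit(X_train[columns], y_train)
        return model.score(X_test[columns], y_test)

    baseline = accuracy(list(X_train.columns))
    return pd.Series({
        col: baseline - accuracy([c for c in X_train.columns if c != col])
        for col in X_train.columns
    })

# Invented toy data: "signal" equals the label, "noise" just alternates 0/1.
toy_y = pd.Series([False] * 10 + [True] * 10)
toy_X = pd.DataFrame({"signal": toy_y.astype(float), "noise": [0, 1] * 10})

# Fixed hold-out split so the example is reproducible.
test_idx = [0, 5, 10, 15]
train_idx = [i for i in toy_X.index if i not in test_idx]

importance = drop_column_importance(
    toy_X.loc[train_idx], toy_y.loc[train_idx],
    toy_X.loc[test_idx], toy_y.loc[test_idx],
)
print(importance)
```

A feature whose removal costs accuracy gets a positive score, while a feature the model can do without scores around zero.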

## Extracting and Selecting Features in one step

We performed the feature extraction and selection independently. If you are only interested in the list of selected features, you can run this in one step.

In [20]:
X_filtered_2 = extract_relevant_features(df, y, column_id='id', column_sort='time',
                                         default_fc_parameters=extraction_settings)

Feature Extraction: 100%|██████████| 528/528 [00:36<00:00, 14.37it/s]


In [21]:
# Check that both approaches selected exactly the same feature columns
(X_filtered.columns == X_filtered_2.columns).all()

True