# Feature extraction with tsfresh transformer

This tutorial shows how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that the resulting tabular features can be passed to any scikit-learn estimator.

## Preliminaries
Install tsfresh if you have not already. To do so, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

Unnamed: 0,dim_0
47,0 -1.4233 1 -1.3883 2 -1.3233 3 ...
168,0 -1.5317 1 -1.5413 2 -1.5150 3 ...
32,0 -1.6737 1 -1.6715 2 -1.6602 3 ...
97,0 -2.1468 1 -2.1483 2 -2.1332 3 ...
137,0 -1.6130 1 -1.6113 2 -1.5963 3 ...
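
The output above reflects sktime's nested data format: each row is one instance, and each cell of `dim_0` holds an entire time series as a `pandas.Series`. The following is a minimal sketch of building such a frame by hand, assuming only pandas; the toy values and the name `X_toy` are made up for illustration and are not part of the data set.

```python
import pandas as pd

# Three toy instances, each a univariate series of length 4 (made-up values).
series = [
    pd.Series([0.1, 0.2, 0.3, 0.4]),
    pd.Series([1.0, 0.5, 0.0, -0.5]),
    pd.Series([2.0, 2.0, 2.0, 2.0]),
]

# One column per dimension; every cell stores a whole pd.Series.
X_toy = pd.Series(series).to_frame(name="dim_0")

print(X_toy.shape)                  # number of instances, number of dimensions
print(type(X_toy.iloc[0, 0]))       # type of a single cell
print(X_toy.iloc[1, 0].to_numpy())  # values of the second instance
```

A multivariate data set would simply add further columns (`dim_1`, `dim_2`, and so on), each again holding one series per cell.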


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)
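
Besides listing the class labels, it can be useful to check how many instances each class has. A small sketch using `np.unique` with `return_counts=True`; the array `y_example` is a made-up stand-in for `y_train`, so the counts are illustrative only.

```python
import numpy as np

# Stand-in labels; in the notebook this would be y_train.
y_example = np.array(["0", "1", "2", "0", "1", "0"], dtype=object)

# return_counts=True also returns the number of occurrences per label.
classes, counts = np.unique(y_example, return_counts=True)
for label, count in zip(classes, counts):
    print(f"class {label}: {count} instances")
```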

## Using tsfresh to extract features

In [6]:
# extract the "efficient" tsfresh feature set from each time series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.44s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,250.001804,80.006922,0.268863,0.30957,0.11221,0.247175,-0.174947,-1.079735,0.207072,1.028496,...,1.0,0.026014,0.009221,0.001828,0.0,0.0,0.0,0.996023,0.0,711648.9
1,250.000137,74.267442,0.377671,0.444606,0.092169,-0.226793,-0.669387,-1.186474,0.095314,0.709578,...,1.0,0.027791,0.00408,-0.010069,0.0,0.0,0.0,0.996016,0.0,1278059.0
2,249.998742,78.042719,0.350495,0.393683,0.078272,-0.345638,-0.696779,-1.236708,0.108092,0.657413,...,1.0,0.036067,0.004697,-0.014438,0.0,0.0,0.0,0.996011,0.0,-703648.5
3,249.999331,81.465244,0.294482,0.297854,0.057714,-0.325546,-0.633322,-1.040047,0.079833,0.533229,...,1.0,0.050728,-0.004668,-0.049672,0.0,0.0,0.0,0.996013,0.0,800316.8
4,249.99957,75.740292,0.370202,0.429531,0.085676,-0.29049,-0.68024,-1.199385,0.099466,0.663122,...,1.0,0.031119,0.005069,-0.012406,0.0,0.0,0.0,0.996014,0.0,-4554532.0
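
Each column in the table above is one scalar summary of a whole series, named `<dimension>__<feature>`. To make a few of these names concrete, the sketch below recomputes three simple features with NumPy. It assumes the usual definitions (sum of squares, sum of absolute consecutive differences, population variance) and is not tsfresh's own implementation; the toy series `x` is made up.

```python
import numpy as np


def abs_energy(x):
    """Sum of squared values of the series."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(np.diff(x))))


def variance(x):
    """Population variance (NumPy default, ddof=0)."""
    return float(np.var(np.asarray(x, dtype=float)))


x = np.array([1.0, 3.0, 2.0, 5.0])  # made-up toy series

features = {
    "dim_0__abs_energy": abs_energy(x),                            # 1 + 9 + 4 + 25 = 39.0
    "dim_0__absolute_sum_of_changes": absolute_sum_of_changes(x),  # 2 + 1 + 3 = 6.0
    "dim_0__variance": variance(x),                                # 8.75 / 4 = 2.1875
}
print(features)
```

Applying functions like these to every series in a column yields one feature row per instance, which is the tabular shape scikit-learn estimators expect.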


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.49s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:07<00:00,  1.50s/it]




0.8867924528301887
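
The pipeline above follows a general pattern: a transformer turns each series into a fixed-length feature vector, and a tabular classifier is fitted on the result. The sketch below illustrates the same pattern using scikit-learn only. The hand-written `summary_features` function is a hypothetical stand-in for the tsfresh extractor, and the data are synthetic, so the score says nothing about the data sets in this tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summary_features(X):
    """Map each row (one equal-length series) to [mean, std, min, max]."""
    X = np.asarray(X, dtype=float)
    return np.column_stack(
        [X.mean(axis=1), X.std(axis=1), X.min(axis=1), X.max(axis=1)]
    )


rng = np.random.default_rng(0)
# Synthetic data: class 0 is noise around 0, class 1 is noise around 2.
X_syn = np.vstack(
    [rng.normal(0.0, 1.0, size=(50, 30)), rng.normal(2.0, 1.0, size=(50, 30))]
)
y_syn = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, random_state=0, stratify=y_syn
)

# Feature extraction and classification chained into a single estimator.
clf = make_pipeline(
    FunctionTransformer(summary_features),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(accuracy)
```

Because the extractor is part of the pipeline, it is applied again to the test data when scoring, which is why the feature-extraction step runs twice in the output above.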

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
1,0 -0.247409 1 -0.247409 2 -0.771290 3...,0 -0.060459 1 -0.060459 2 -0.047618 3...,0 -0.608565 1 -0.608565 2 -0.294411 3...,0 -0.023970 1 -0.023970 2 -0.269001 3...,0 0.101208 1 0.101208 2 0.111862 3...,0 0.071911 1 0.071911 2 0.135832 3...
30,0 -0.623875 1 -0.623875 2 -1.081529 3...,0 -2.123436 1 -2.123436 2 -0.121519 3...,0 -0.513654 1 -0.513654 2 0.809464 3...,0 -0.143822 1 -0.143822 2 -1.081329 3...,0 0.058594 1 0.058594 2 -0.127842 3...,0 1.086656 1 1.086656 2 0.066584 3...
26,0 -0.761604 1 -0.761604 2 0.121078 3...,0 0.260125 1 0.260125 2 -1.423255 3...,0 -0.064487 1 -0.064487 2 0.075600 3...,0 0.069248 1 0.069248 2 -0.282318 3...,0 0.242367 1 0.242367 2 -0.332922 3...,0 -0.007990 1 -0.007990 2 0.239704 3...
36,0 -1.801504 1 -1.801504 2 -0.480725 3...,0 2.344990 1 2.344990 2 -0.994385 3...,0 0.281253 1 0.281253 2 0.378807 3...,0 0.716447 1 0.716447 2 -0.870923 3...,0 0.162466 1 0.162466 2 0.095881 3...,0 0.921527 1 0.921527 2 -0.474080 3...
27,0 -0.255266 1 -0.255266 2 -0.792226 3...,0 -0.154748 1 -0.154748 2 -1.176848 3...,0 -0.273293 1 -0.273293 2 -0.709993 3...,0 -0.050604 1 -0.050604 2 -0.237040 3...,0 0.015980 1 0.015980 2 -0.314278 3...,0 0.013317 1 0.013317 2 0.170456 3...


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:39<00:00,  7.90s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,10.812641,11.473969,-0.012083,-0.005872,0.007912,-0.119671,-0.323769,-0.578856,0.014234,-0.020851,...,1.0,-0.000157,-0.000877,-0.001491,0.0,0.0,0.0,0.015948,0.0,12.814993
1,4402.264342,368.61827,-0.02578,-0.068173,0.027653,14.404859,3.550912,-2.012951,33.151709,28.820866,...,1.0,-81.172944,21.164744,10.776898,0.0,0.0,0.0,15.749446,1.0,-112.371986
2,220.949429,104.677565,-0.009585,-0.050995,0.059427,2.753245,0.62732,-0.862156,1.634518,3.490647,...,1.0,0.029599,0.139377,0.368081,0.0,0.0,0.0,1.671577,1.0,62.315369
3,5716.535296,375.788586,-0.010798,-0.08417,0.054152,19.190912,5.136329,-1.663644,38.880657,28.832684,...,1.0,-22.050941,-18.937862,3.202976,0.0,0.0,0.0,12.342621,1.0,-42.197112
4,176.508713,96.948338,-0.002837,-0.05035,0.059774,2.667797,0.628195,-1.039077,1.588622,4.086922,...,1.0,0.101057,0.271404,0.256932,0.0,0.0,0.0,1.357756,1.0,-69.665962
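
For multivariate input, the extracted columns are prefixed with the dimension they were computed from (`dim_0__...` through `dim_5__...`). The sketch below shows one way to group features by dimension by splitting the names on the double underscore. It uses a handful of column names copied from the output above and assumes that dimension names do not themselves contain `__`; the helper `split_feature_name` is hypothetical.

```python
# A few column names copied from the output above.
columns = [
    "dim_0__abs_energy",
    "dim_0__absolute_sum_of_changes",
    'dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40',
    "dim_5__variance",
    "dim_5__variation_coefficient",
]


def split_feature_name(column):
    """Split '<dimension>__<feature>__<params...>' into its parts."""
    dimension, feature, *params = column.split("__")
    return dimension, feature, params


# Collect the feature names found for each dimension.
by_dimension = {}
for col in columns:
    dim, feature, _ = split_feature_name(col)
    by_dimension.setdefault(dim, []).append(feature)

print(by_dimension)
print(split_feature_name(columns[2]))
```

With the real output, the same idea could be applied to `Xt.columns`, for example to keep only the features of a single dimension.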
