# Feature extraction with tsfresh transformer

In this tutorial, we show how to use `sktime` with [`tsfresh`](https://tsfresh.readthedocs.io) to extract features from time series, so that the resulting tabular features can be passed to any scikit-learn estimator.

## Preliminaries
You need to install tsfresh if you haven't already. To do so, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_gunpoint
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/01_classification_univariate.ipynb).

In [3]:
X, y = load_gunpoint(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(150, 1) (150,) (50, 1) (50,)


In [4]:
X_train.head()

|  | dim_0 |
|---|---|
| 138 | 0 -1.0925 1 -1.0926 2 -1.0908 3 ... |
| 25 | 0 -0.96939 1 -0.97211 2 -0.97262 3... |
| 6 | 0 -1.2612 1 -1.2949 2 -1.3101 3 ... |
| 4 | 0 -0.59954 1 -0.59742 2 -0.59927 3... |
| 73 | 0 -1.0511 1 -1.0516 2 -1.0529 3 ... |


In [5]:
# binary classification task
np.unique(y_train)

array(['1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract tsfresh's "efficient" feature set from each series, with tsfresh warnings suppressed
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.67s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 148.999228 | 36.72962 | 0.551482 | 0.601332 | 0.041841 | -0.871806 | -1.188182 | -1.301264 | 0.028721 | 0.588908 | ... | 1.0 | 0.00319 | 0.003291 | -0.001642 | 0.0 | 0.0 | 0.0 | 0.993328 | 0.0 | -917170.4 |
| 1 | 149.000073 | 33.65817 | 0.473236 | 0.497919 | 0.051479 | -0.967036 | -0.976869 | -0.987975 | 0.010249 | -0.540492 | ... | 1.0 | -0.000756 | 0.004565 | 0.011839 | 0.0 | 0.0 | 0.0 | 0.993334 | 0.0 | 970774.0 |
| 2 | 148.999845 | 41.401452 | 0.340617 | 0.332036 | 0.066636 | -0.339935 | -0.799107 | -1.5327 | 0.324976 | 0.863113 | ... | 1.0 | 1.1e-05 | -0.007809 | -0.023125 | 0.0 | 0.0 | 0.0 | 0.993332 | 0.0 | 4671846.0 |
| 3 | 148.999167 | 24.23109 | 0.346763 | 0.35825 | 0.12687 | -1.038005 | -0.931154 | -0.610837 | -0.097404 | -0.489313 | ... | 1.0 | -0.002373 | -0.003888 | -0.004423 | 0.0 | 0.0 | 0.0 | 0.993328 | 0.0 | -695343.0 |
| 4 | 149.000186 | 34.89025 | 0.545607 | 0.573108 | 0.039417 | -0.988734 | -1.216355 | -1.212862 | 0.032217 | 0.078117 | ... | 1.0 | 0.008262 | 0.009546 | 0.005033 | 0.0 | 0.0 | 0.0 | 0.993335 | 0.0 | 402963.0 |
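
To make a few of the column names above more concrete, here is a minimal standard-library sketch of four of the simpler features. The formulas reflect our understanding of the tsfresh definitions (sum of squares, sum of absolute successive differences, population variance, and standard deviation divided by mean); the helper functions and the `raw` series are made up for illustration and are not part of tsfresh or sktime. The values in the table (`abs_energy` close to 149, `variance` close to 149/150) would be consistent with series of length 150 that were z-normalised using the sample standard deviation, but that is an assumption about the data set, not something this notebook states.

```python
import math
import statistics


def abs_energy(x):
    # sum of squared values
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    # sum of absolute differences between consecutive values
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    # population variance (divides by n, not n - 1)
    return statistics.pvariance(x)


def variation_coefficient(x):
    # standard deviation relative to the mean; undefined for a zero mean
    mean = statistics.mean(x)
    if mean == 0:
        return float("nan")
    return math.sqrt(statistics.pvariance(x)) / mean


# made-up series, z-normalised with the sample standard deviation
raw = [0.0, 1.0, 3.0, 2.0, 4.0]
mean, std = statistics.mean(raw), statistics.stdev(raw)
z = [(v - mean) / std for v in raw]

print(abs_energy(z))                 # close to n - 1 = 4
print(variance(z))                   # close to (n - 1) / n = 0.8
print(absolute_sum_of_changes(raw))  # 1 + 2 + 1 + 2 = 6
print(variation_coefficient(raw))    # sqrt(2) / 2
```

Under the same assumption, the very large magnitudes in the `variation_coefficient` column are unsurprising: dividing by a mean that is almost, but not exactly, zero inflates the ratio.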


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.69s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:06<00:00,  1.25s/it]




0.96
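
The output above shows two feature-extraction passes. That is what one would expect if the pipeline extracts features from the training data during `fit` and again from the test data during `score`. The sketch below imitates that control flow with toy classes. `ToyExtractor`, `ToyMajorityClassifier`, and `ToyPipeline` are hypothetical stand-ins written for this illustration, not scikit-learn or sktime classes, and the data is made up.

```python
from collections import Counter


class ToyExtractor:
    """Hypothetical stand-in for a feature extractor; logs each extraction pass."""

    def __init__(self):
        self.passes = []  # number of series seen in each transform call

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy

    def transform(self, X):
        self.passes.append(len(X))
        # two simple features per series: mean and range
        return [[sum(s) / len(s), max(s) - min(s)] for s in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class ToyMajorityClassifier:
    """Hypothetical classifier that always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    """Hypothetical two-step pipeline: transformer followed by estimator."""

    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, X, y):
        Xt = self.transformer.fit_transform(X, y)  # first extraction pass
        self.estimator.fit(Xt, y)
        return self

    def score(self, X, y):
        Xt = self.transformer.transform(X)  # second extraction pass
        predictions = self.estimator.predict(Xt)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# made-up data: each series is a plain list of floats
X_toy_train = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [5.0, 3.0, 4.0]]
y_toy_train = ["1", "1", "2"]
X_toy_test = [[0.5, 1.5, 2.5], [4.0, 6.0, 5.0]]
y_toy_test = ["1", "2"]

extractor = ToyExtractor()
pipeline = ToyPipeline(extractor, ToyMajorityClassifier())
pipeline.fit(X_toy_train, y_toy_train)
toy_score = pipeline.score(X_toy_test, y_toy_test)

print(extractor.passes)  # one pass over 3 training series, one over 2 test series
print(toy_score)         # fraction of test labels predicted correctly
```

The score is plain accuracy, so on a test set of 50 series a value of 0.96 corresponds to 48 correct predictions.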

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

|  | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 11 | 0 1.916536 1 1.916536 2 -4.50699... | 0 -0.465923 1 -0.465923 2 -6.69169... | 0 -0.374403 1 -0.374403 2 -1.330023 3... | 0 0.125179 1 0.125179 2 0.708457 3... | 0 -0.034624 1 -0.034624 2 -0.055931 3... | 0 -0.229050 1 -0.229050 2 2.104064 3... |
| 22 | 0 0.175924 1 0.175924 2 0.194403 3... | 0 0.548757 1 0.548757 2 -3.699192 3... | 0 -1.191314 1 -1.191314 2 -0.554051 3... | 0 0.039951 1 0.039951 2 0.042614 3... | 0 0.263674 1 0.263674 2 -0.178446 3... | 0 0.937507 1 0.937507 2 0.071911 3... |
| 26 | 0 -0.098166 1 -0.098166 2 -0.665304 3... | 0 -0.117578 1 -0.117578 2 -1.194660 3... | 0 -0.401143 1 -0.401143 2 1.228442 3... | 0 -0.061258 1 -0.061258 2 -0.567298 3... | 0 0.090555 1 0.090555 2 0.029297 3... | 0 0.018644 1 0.018644 2 -0.005327 3... |
| 15 | 0 -0.159076 1 -0.159076 2 -0.97770... | 0 0.376722 1 0.376722 2 0.38349... | 0 -0.445368 1 -0.445368 2 1.695360 3... | 0 -0.029297 1 -0.029297 2 -0.255684 3... | 0 0.029297 1 0.029297 2 0.375536 3... | 0 -0.047941 1 -0.047941 2 0.516694 3... |
| 9 | 0 0.126160 1 0.126160 2 1.771871 3... | 0 0.102733 1 0.102733 2 -3.798484 3... | 0 0.308964 1 0.308964 2 0.141369 3... | 0 0.002663 1 0.002663 2 -1.427568 3... | 0 0.000000 1 0.000000 2 -0.167792 3... | 0 -0.007990 1 -0.007990 2 -1.643301 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:40<00:00,  8.01s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10767.729354 | 687.537549 | 0.006588 | -0.123053 | 0.199384 | 14.549712 | 3.754886 | -12.12853 | 71.768286 | 17.872845 | ... | 1.0 | 10.906173 | 28.61971 | 40.284538 | 0.0 | 0.0 | 0.0 | 24.600487 | 1.0 | 63.170176 |
| 1 | 383.560959 | 127.01821 | -0.000228 | 0.001823 | 0.125313 | 3.757798 | 1.359549 | -0.84666 | 2.566076 | 4.442557 | ... | 1.0 | 0.2535 | 0.439563 | 1.125258 | 0.0 | 0.0 | 0.0 | 2.834777 | 1.0 | -56.442891 |
| 2 | 175.315925 | 107.473488 | -0.025106 | -0.071767 | 0.06377 | 3.16034 | 0.603238 | -1.260671 | 1.831095 | 4.159463 | ... | 1.0 | 0.064035 | 0.213084 | 0.33441 | 0.0 | 0.0 | 0.0 | 1.410792 | 1.0 | 94.483551 |
| 3 | 20089.782616 | 936.012458 | -0.031604 | -0.070448 | 0.144797 | 24.032611 | 6.174375 | -16.526685 | 200.755496 | 27.548164 | ... | 1.0 | 5.090285 | 19.718272 | 76.965414 | 0.0 | 1.0 | 0.0 | 34.822337 | 1.0 | -24.347572 |
| 4 | 9.735453 | 15.20245 | -0.011942 | -0.006515 | 0.005243 | 0.542816 | -0.133406 | -0.447754 | 0.122703 | 1.771871 | ... | 1.0 | -0.007607 | -0.01442 | -0.010061 | 0.0 | 0.0 | 0.0 | 0.123327 | 0.0 | 57.578839 |
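
With multivariate input, every feature column is prefixed with the dimension it was computed from (`dim_0` to `dim_5` here). The sketch below splits such column names into dimension, feature name, and parameters, which can help when grouping or filtering features per dimension. It assumes the naming pattern `<dimension>__<feature>__<param>_<value>__...` seen in the column names above. `parse_feature_name` is a hypothetical helper written for this illustration, not a tsfresh or sktime function, and parameter values are left as strings.

```python
from collections import Counter


def parse_feature_name(column):
    """Split a feature column name into (dimension, feature, parameters)."""
    dimension, feature, *param_parts = column.split("__")
    params = {}
    for part in param_parts:
        # the value follows the last underscore, e.g. 'chunk_len_10' -> ('chunk_len', '10')
        name, _, value = part.rpartition("_")
        params[name] = value.strip('"')
    return dimension, feature, params


# a few column names copied from the tables in this notebook
columns = [
    'dim_0__abs_energy',
    'dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40',
    'dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"',
    'dim_5__time_reversal_asymmetry_statistic__lag_1',
    'dim_5__value_count__value_-1',
]

for column in columns:
    print(parse_feature_name(column))

# count how many of the listed features belong to each dimension
features_per_dimension = Counter(parse_feature_name(c)[0] for c in columns)
print(features_per_dimension)
```

The same helper could be applied to `Xt.columns` to select, for example, all features computed from a single dimension.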
