# Time series classification with `sktime`

[View this notebook on GitHub](https://github.com/alan-turing-institute/sktime/blob/dev/examples/time-series-classification_walkthrough.ipynb)

## Preliminaries

In [1]:
import sktime
print(sktime.__version__)

0.1.0


Import the required classes and functions. Some of the `sktime` ones extend familiar `sklearn` methods to the time series classification setting.

In [2]:
from sktime.highlevel import TSCTask
from sktime.highlevel import TSCStrategy
from sktime.transformers.compose import RowwiseTransformer
from sktime.transformers.compose import ColumnTransformer
from sktime.transformers.compose import Tabulariser
from sktime.transformers.series_to_series import RandomIntervalSegmenter
from sktime.pipeline import Pipeline
from sktime.pipeline import FeatureUnion
from sktime.classifiers.ensemble import TimeSeriesForestClassifier
from sktime.datasets import load_gunpoint
from sktime.utils.time_series import time_series_slope

from statsmodels.tsa.stattools import acf
from statsmodels.tsa.ar_model import AR

from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

import numpy as np
import pandas as pd

## Load data
You can find more information on the dataset in the docstring of the loading function, `load_gunpoint`.

In [3]:
X_train, y_train = load_gunpoint(split='TRAIN', return_X_y=True)
X_test, y_test = load_gunpoint(split='TEST', return_X_y=True)

Throughout `sktime`, the expected data format is a pandas DataFrame in which a single cell can hold not only a primitive value, as for the classification labels, but also an entire pandas Series or numpy array, as for the time series observations.
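
To make this nested layout concrete, the following sketch builds a small DataFrame by hand in which every cell of `dim_0` holds a whole pandas Series. The values, the series length of 5, and the labels are made up for illustration; the names `dim_0` and `class_val` simply mirror those used in this walkthrough.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Three toy univariate series of length 5 (made-up values).
series = [pd.Series(rng.normal(size=5)) for _ in range(3)]

# Put each Series into one cell of an object-dtype array, so that pandas
# stores the Series themselves instead of unpacking their values.
cells = np.empty(len(series), dtype=object)
for i, s in enumerate(series):
    cells[i] = s

X = pd.DataFrame({"dim_0": cells})                # one row per time series
y = pd.Series(["1", "2", "1"], name="class_val")  # primitive labels

first = X.iloc[0, 0]  # a single cell is an entire time series
print(type(first).__name__, len(first))
print(X.shape)
```

Each row is one instance and each column one dimension of the series, so a univariate dataset such as the one loaded above has a single column.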

In [4]:
X_train.head()

| | dim_0 |
|---|---|
| 0 | 0 -0.64789 1 -0.64199 2 -0.63819 3... |
| 1 | 0 -0.64443 1 -0.64540 2 -0.64706 3... |
| 2 | 0 -0.77835 1 -0.77828 2 -0.77715 3... |
| 3 | 0 -0.75006 1 -0.74810 2 -0.74616 3... |
| 4 | 0 -0.59954 1 -0.59742 2 -0.59927 3... |


## Low level interface

### Fully modular time-series forest classifier (TSF)
We can specify the time-series tree classifier as a fully modular pipeline using series-to-primitive feature extraction transformers and a final decision tree classifier.

In [5]:
steps = [
    ('segment', RandomIntervalSegmenter(n_intervals='sqrt')),
    ('transform', FeatureUnion([
        ('mean', RowwiseTransformer(FunctionTransformer(func=np.mean, validate=False))),
        ('std', RowwiseTransformer(FunctionTransformer(func=np.std, validate=False))),
        ('slope', RowwiseTransformer(FunctionTransformer(func=time_series_slope, validate=False)))
    ])),
    ('clf', DecisionTreeClassifier())
]
base_estimator = Pipeline(steps)
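
As a rough illustration of what the segmenting and feature extraction steps compute for a single series, the sketch below draws random intervals and extracts the mean, standard deviation, and slope of each using plain numpy. The helper names `slope` and `random_interval_features`, the interval sampling scheme, and the use of a least-squares slope are assumptions made for this example. They are not taken from the `sktime` implementation.

```python
import numpy as np

def slope(y):
    """Least-squares slope of y against the time index 0, 1, ..., len(y) - 1."""
    t = np.arange(len(y))
    return np.polyfit(t, y, deg=1)[0]

def random_interval_features(x, n_intervals, rng, min_length=2):
    """Return [mean, std, slope] for each randomly drawn interval of x."""
    feats = []
    for _ in range(n_intervals):
        # Draw a start point, then an end point at least min_length later.
        start = rng.integers(0, len(x) - min_length + 1)
        end = rng.integers(start + min_length, len(x) + 1)
        seg = x[start:end]
        feats.extend([seg.mean(), seg.std(), slope(seg)])
    return np.array(feats)

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 16) ** 2   # toy increasing series of length 16
n_intervals = int(np.sqrt(len(x)))   # mirrors the idea of n_intervals='sqrt'
features = random_interval_features(x, n_intervals, rng)
print(features.shape)
```

The resulting flat feature vector (three primitives per interval) is the kind of tabular input that a standard decision tree can consume.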

We can directly fit and evaluate the single tree, which is itself simply a pipeline.

In [6]:
base_estimator.fit(X_train, y_train)
base_estimator.score(X_test, y_test)

0.76

To build a time series forest, we simply use the single tree as the base estimator in the forest ensemble.

In [7]:
tsf = TimeSeriesForestClassifier(base_estimator=base_estimator, 
                                 n_estimators=100,
                                 criterion='entropy',
                                 bootstrap=True, 
                                 oob_score=True)

Fit and obtain the out-of-bag score:

In [8]:
tsf.fit(X_train, y_train)
if tsf.oob_score:
    print(tsf.oob_score_)

1.0
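
For readers unfamiliar with out-of-bag evaluation, the sketch below shows the general bagging idea for one tree, assuming the usual definition known from `sklearn` random forests: each tree is trained on a bootstrap sample, and the training instances it never saw serve as a validation set for that tree. The sample size of 50 is a toy value and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50  # toy number of training instances

# Bootstrap: draw n_samples indices with replacement for one tree.
in_bag = rng.integers(0, n_samples, size=n_samples)

# Out-of-bag instances are those that were never drawn for this tree.
oob = np.setdiff1d(np.arange(n_samples), in_bag)

print(len(oob) / n_samples)  # fraction of instances left out for this tree
```

On average, somewhat more than a third of the instances are left out of each bootstrap sample. The out-of-bag score aggregates predictions over the trees for which an instance was left out, so it can differ from the accuracy on a separate test set.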


In [9]:
tsf.score(X_test, y_test)

0.9733333333333334

### RISE
Another popular variant of time series forest is the so-called Random Interval Spectral Ensemble (RISE), which makes use of several series-to-series feature extraction transformers, including:

* Fitted auto-regressive coefficients,  
* Estimated autocorrelation coefficients,
* Power spectrum coefficients.

In [10]:
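def ar_coefs(x, maxlag=100):
    # Coefficients of an AR model fitted without a constant term ('nc'),
    # using at most `maxlag` lags and never more than len(x) - 1.
    nlags = np.minimum(len(x) - 1, maxlag)
    model = AR(endog=x)
    return model.fit(maxlag=nlags, trend='nc').params

def acf_coefs(x, maxlag=100):
    # Estimated autocorrelations for lags 0, ..., nlags.
    nlags = np.minimum(len(x) - 1, maxlag)
    return acf(x, nlags=nlags)

def powerspectrum(x, **kwargs):
    # Squared magnitude of the discrete Fourier transform;
    # only the first half of the spectrum is kept.
    fft = np.fft.fft(x)
    ps = fft.real * fft.real + fft.imag * fft.imag
    return ps[:ps.shape[0] // 2]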
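
As a quick sanity check of the power spectrum feature, the sketch below applies the same computation to a toy sine wave. The function is repeated here without the unused `**kwargs` so that the block runs on its own; the signal length of 64 and the frequency of 4 cycles are arbitrary choices for illustration.

```python
import numpy as np

def powerspectrum(x):
    # Same computation as above: squared magnitude of the DFT, first half only.
    fft = np.fft.fft(x)
    ps = fft.real * fft.real + fft.imag * fft.imag
    return ps[:ps.shape[0] // 2]

n = 64
t = np.arange(n)
x = np.sin(2 * np.pi * 4 * t / n)  # exactly 4 full cycles over the window

ps = powerspectrum(x)
print(ps.shape)            # (32,)
print(int(np.argmax(ps)))  # 4

# The real/imag formulation equals the squared absolute value of the FFT.
assert np.allclose(ps, np.abs(np.fft.fft(x))[: n // 2] ** 2)
```

The power concentrates in the bin that matches the frequency of the sine wave, which is what makes these coefficients useful as features for a tree.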

The full pipeline of a single tree in RISE is then specified as follows:

In [11]:
steps = [
    ('segment', RandomIntervalSegmenter(n_intervals=1, min_length=5)),
    ('transform', FeatureUnion([
        ('ar', RowwiseTransformer(FunctionTransformer(func=ar_coefs, validate=False))),
        ('acf', RowwiseTransformer(FunctionTransformer(func=acf_coefs, validate=False))),
        ('ps', RowwiseTransformer(FunctionTransformer(func=powerspectrum, validate=False)))
    ])),
    ('tabularise', Tabulariser()),
    ('clf', DecisionTreeClassifier())
]
base_estimator = Pipeline(steps)

In [12]:
rise = TimeSeriesForestClassifier(base_estimator=base_estimator,
                                  n_estimators=50)

In [13]:
rise.fit(X_train, y_train)
if rise.oob_score:
    print(rise.oob_score_)

In [14]:
rise.score(X_test, y_test)

0.98

## High level interface 
The high-level interface provides a unified interface across different but related time series methods, while still following the `sklearn` estimator design wherever possible. It introduces two new classes:
* A *task*, which encapsulates the information about the learning task, for example the name of the target variable, and any additional instructions needed to run fit and predict.
* A *strategy*, which wraps a low-level estimator and takes a task and the whole dataframe as input in fit.
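
The division of labour between the two classes can be sketched with a toy version of the pattern. `ToyTask`, `ToyStrategy`, and `MajorityClassifier` are hypothetical names invented for this example and are not part of `sktime`; the real `TSCTask` and `TSCStrategy` may differ in their details.

```python
import pandas as pd

class ToyTask:
    """Hypothetical task: records the target column and derives the features."""
    def __init__(self, target, metadata):
        self.target = target
        self.features = [c for c in metadata.columns if c != target]

class MajorityClassifier:
    """Minimal sklearn-style estimator that predicts the most frequent label."""
    def fit(self, X, y):
        self.label_ = y.mode().iloc[0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)

class ToyStrategy:
    """Hypothetical strategy: wraps an estimator and splits the data via the task."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, task, data):
        self.task_ = task
        self.estimator.fit(data[task.features], data[task.target])
        return self

    def predict(self, data):
        return self.estimator.predict(data[self.task_.features])

train = pd.DataFrame({"dim_0": [0.1, 0.4, 0.3], "class_val": ["1", "2", "2"]})
task = ToyTask(target="class_val", metadata=train)
strategy = ToyStrategy(MajorityClassifier()).fit(task, train)
print(strategy.predict(train))  # ['2', '2', '2']
```

The point of the design is that the caller passes whole dataframes and never separates features from the target by hand; the task carries that information once.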

In [15]:
train = load_gunpoint(split='TRAIN')
test = load_gunpoint(split='TEST')

In [16]:
task = TSCTask(target='class_val', metadata=train)

In [17]:
clf = TimeSeriesForestClassifier(n_estimators=50)
strategy = TSCStrategy(clf)

Fit the strategy using the task and the training data:

In [18]:
strategy.fit(task, train)

TSCStrategy(TimeSeriesForestClassifier(base_estimator=Pipeline(check_input=False, memory=None, random_state=None,
     steps=[('transform', RandomIntervalFeatureExtractor(check_input=False,
                features=[<function mean at 0x114293e18>, <function std at 0x114293ea0>, <function time_series_slope at 0x11c5d7488>],
                min_length=1, n_intervals='sqrt', random_state=None)), ('clf', DecisionTreeCla...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))]),
      bootstrap=False, check_input=True, class_weight=None,
      criterion='entropy', max_depth=None, max_features=None,
      max_leaf_nodes=None, min_impurity_decrease=0.0,
      min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
      min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
      oob_score=False, random_state=None, verbose=0, warm_start=False))

Predict on the test data and evaluate the fitted strategy:

In [19]:
y_pred = strategy.predict(test)
y_test = test[task.target]
accuracy_score(y_test, y_pred)

0.9466666666666667