### Overview of this notebook

* overview of `sktime` as a framework - estimators modules, library, data
* overview of learning tasks in `sktime`
* data formats used in `sktime`, import and validity checking
* basic vignettes for learning tasks - forecasting, classification, regression, clustering & more
* searching the library for estimators, tag system
* estimator level dependency management
* creating your own estimator, for contribution to `sktime` or for third party use (closed or open) - short primer

# 2. `sktime` in a nutshell - learning tasks, modules, library, data

**A) `sktime` is a modular framework for multiple learning tasks**

Example: forecasting (predict future of ts), classification (predict label of ts)

**B) estimators/algorithms are of a scientific type = which task do they solve?**

Example: ARIMA is a forecaster; k-NN with a time series distance is a classifier

**C) all estimators of a certain scitype have the same module interface**

Example: all forecaster classes have `fit` / `predict` with the same contract

**D) `sktime` is a library which allows browsing of integrated estimators**

Example: search for all forecasters that are natively multivariate

**E) `sktime` is a mini-package manager for estimators and their dependencies**

Example: `ARIMA` class requires `pmdarima`, but `sktime` itself does not

**F) `sktime` is extensible, write your own 3rd party plugins (open or closed)**

Example: a forecaster in a 3rd party codebase plugs into `sktime` and its test framework

## 2.1 Showcase with code vignettes

The above, with code. We revisit in more detail later.

### **A) `sktime` is a modular framework for multiple learning tasks**

Vignettes for forecasting and classification:
(we'll go into data types etc later)

In [None]:
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster
import numpy as np

# step 1: data specification
y = load_airline()
# y is a pd.Series at monthly frequency

# step 2: specifying forecasting horizon
fh = np.arange(1, 37)
# this specifies a prediction 3 years ahead

# step 3: specifying the forecasting algorithm
forecaster = NaiveForecaster(strategy="last", sp=12)
# forecaster is now a forecaster object of type NaiveForecaster

# step 4: fitting the forecaster
forecaster.fit(y, fh=fh)
# forecaster changes state to "fitted"

# step 5: querying predictions
y_pred = forecaster.predict()
# y_pred is the forecasted time series, a pd.Series
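
As a rough illustration of what a seasonal naive forecast such as `NaiveForecaster(strategy="last", sp=12)` computes, the sketch below re-implements the idea in plain `numpy`: each forecast repeats the value observed at the same position in the last observed season. The helper name `seasonal_naive` and the toy data are made up for this illustration; this is not the `sktime` implementation.

```python
import numpy as np


def seasonal_naive(y, fh, sp):
    """Forecast by repeating the last observed season (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    last_season = y[-sp:]  # the most recent sp observations
    # step h ahead (h = 1, 2, ...) reuses the value from the same position
    # in the last observed season
    return np.array([last_season[(h - 1) % sp] for h in fh])


# toy monthly series: two years of made-up values 0, 1, ..., 23
y_toy = np.arange(24)
fh_toy = np.arange(1, 37)  # 36 steps = 3 years ahead at monthly frequency

y_pred_toy = seasonal_naive(y_toy, fh_toy, sp=12)
print(y_pred_toy[:12])  # first forecast year repeats the last observed year
```

Every forecast year is a copy of the last observed year, which is why this forecaster is mainly useful as a baseline.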

In [None]:
from sktime.datasets import load_osuleaf
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist

# step 1 - specify training data
X_train, y_train = load_osuleaf(split="train", return_type="numpy3D")
# X_train is 3D numpy array holding multiple instances of time series
# y_train is 1D numpy array with training labels for these instances

# step 2 - specify data to predict labels for
X_new, _ = load_osuleaf(split="test", return_type="numpy3D")
X_new = X_new[:2]
# X_new is a 3D numpy array with the instances to label

# step 3 - specify the classifier
mean_eucl_dist = AggrDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=mean_eucl_dist)
# clf is a classifier object of type KNeighborsTimeSeriesClassifier
# it consists of other sktime objects, mean_eucl_dist is a distance object

# step 4 - fitting the classifier
clf.fit(X_train, y_train)
# clf changes state to "fitted"

# step 5 - predict labels on new data
y_pred = clf.predict(X_new)
# y_pred is the predicted labels, a 1D numpy array

### **B) estimators/algorithms are of a scientific type = which task do they solve?**

`NaiveForecaster` is a forecaster; `KNeighborsTimeSeriesClassifier` is a classifier

In [None]:
from sktime.forecasting.naive import NaiveForecaster
from sktime.registry import scitype

scitype(NaiveForecaster)

In [None]:
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.registry import scitype

scitype(KNeighborsTimeSeriesClassifier)

### **C) all estimators of a certain scitype have the same module interface**

C1 - the `NaiveForecaster` can be switched out for any forecaster in the base vignette

Only step 3 - specification changes!

In [None]:
from sktime.datasets import load_airline
from sktime.forecasting.arima import ARIMA
import numpy as np

# step 1: data specification
y = load_airline()

# step 2: specifying forecasting horizon
fh = np.arange(1, 37)

# step 3: specifying the forecasting algorithm
# forecaster = NaiveForecaster(strategy="last", sp=12)
forecaster = ARIMA()

# step 4: fitting the forecaster
forecaster.fit(y)

# step 5: querying predictions
y_pred = forecaster.predict(fh)

C2 - `fit` and `predict` are the same for both!

`fit(self, y, X=None, fh=None)`

In [None]:
?ARIMA.fit

In [None]:
?NaiveForecaster.fit

for classifiers, signature is different: `fit(X, y)`

but the same for all classifiers!

In [None]:
?KNeighborsTimeSeriesClassifier.fit

### **D) `sktime` is a library which allows browsing of integrated estimators**

Example: search for all forecasters that can make probabilistic predictions

In [None]:
from sktime.registry import all_estimators

all_estimators("forecaster", filter_tags={"capability:pred_int": True}, as_dataframe=True)

all objects in `sktime` are tagged with metadata:

In [None]:
ARIMA().get_tags()

list all tags that apply to forecasters:

In [None]:
from sktime.registry import all_tags

all_tags("forecaster", as_dataframe=True)

### **E) `sktime` is a mini-package manager for estimators and their dependencies**

Example: `ARIMA` class requires `pmdarima`, but `sktime` itself does not

In [None]:
from sktime.forecasting.arima import ARIMA

ARIMA.get_class_tag("python_dependencies")
# this requires the pmdarima package
# the result is a PEP 440 compatible requirement string

by default, dependencies are checked at instantiation:

In [None]:
# from sktime.forecasting.fbprophet import Prophet

# Prophet()

this would result in an exception:

```
ModuleNotFoundError: Prophet requires package 'prophet' to be present in the python environment,
but 'prophet' was not found. 'prophet' is a soft dependency and not included in the base
sktime installation. Please run: `pip install prophet` to install the prophet package.
To install all soft dependencies, run: `pip install sktime[all_extras]`
```

### **F) `sktime` is extensible, write your own 3rd party plugins (open or closed)**

Example: a forecaster in a 3rd party codebase plugs into `sktime` and its test framework

snippet from forecaster extension template (in `extension_templates` dir):
```
How to use this implementation template to implement a new estimator:
- make a copy of the template in a suitable location, give it a descriptive name.
- work through all the "todo" comments below
- fill in code for mandatory methods, and optionally for optional methods
- do not write to reserved variables: is_fitted, _is_fitted, _X, _y, cutoff, _fh,
    _cutoff, _converter_store_y, forecasters_, _tags, _tags_dynamic, _is_vectorized
- you can add more private methods, but do not override BaseEstimator's private methods
    an easy way to be safe is to prefix your methods with "_custom"
- change docstrings for functions and the file
- ensure interface compatibility by sktime.utils.estimator_checks.check_estimator
- once complete: use as a local library, or contribute to sktime via PR
- more details:
  https://www.sktime.net/en/stable/developer_guide/add_estimators.html
```

![](./img/implementing_estimators.png)

In [None]:
from sktime.transformations.series.boxcox import BoxCoxTransformer
from sktime.utils.estimator_checks import check_estimator

res = check_estimator(BoxCoxTransformer)

In [None]:
res

## 2.2 Learning tasks in sktime

Step 1 - what is your learning task?

sktime estimator type support:

| Task | Status | Links |
|---|---|---|
| **Forecasting** | stable | [Tutorial](https://www.sktime.net/en/latest/examples/01_forecasting.html) · [API Reference](https://www.sktime.net/en/latest/api_reference/forecasting.html) · [Extension Template](https://github.com/sktime/sktime/blob/main/extension_templates/forecasting.py)  |
| **Time Series Classification** | stable | [Tutorial](https://github.com/sktime/sktime/blob/main/examples/02_classification.ipynb) · [API Reference](https://www.sktime.net/en/latest/api_reference/classification.html) · [Extension Template](https://github.com/sktime/sktime/blob/main/extension_templates/classification.py) |
| **Time Series Regression** | stable | [API Reference](https://www.sktime.net/en/latest/api_reference/regression.html) |
| **Transformations** | stable | [Tutorial](https://github.com/sktime/sktime/blob/main/examples/03_transformers.ipynb) · [API Reference](https://www.sktime.net/en/latest/api_reference/transformations.html) · [Extension Template](https://github.com/sktime/sktime/blob/main/extension_templates/transformer.py)  |
| **Parameter fitting** | maturing | [API Reference](https://www.sktime.net/en/latest/api_reference/param_est.html) · [Extension Template](https://github.com/sktime/sktime/blob/main/extension_templates/transformer.py)  |
| **Time Series Clustering** | maturing | [API Reference](https://www.sktime.net/en/latest/api_reference/clustering.html) ·  [Extension Template](https://github.com/sktime/sktime/blob/main/extension_templates/clustering.py) |
| **Time Series Distances/Kernels** | maturing | [Tutorial](https://github.com/sktime/sktime/blob/main/examples/03_transformers.ipynb) · [API Reference](https://www.sktime.net/en/latest/api_reference/dists_kernels.html) · [Extension Template](https://github.com/sktime/sktime/blob/main/extension_templates/dist_kern_panel.py) |
| **Annotation** | experimental | [Extension Template](https://github.com/sktime/sktime/blob/main/extension_templates/annotation.py) |
| **Distributions and simulation** | experimental |  |

rough overview of time series related learning tasks

![](./img/ts-tasks.jpg)

first some basic terminology on time series (required for the above)

### 2.2.1 time series - terminology

* time series
* variables, univariate, multivariate
* time index
* panel of time series, instances
* hierarchical time series

#### **time series**, **time index**, **variables**

time series = recorded observations of one object or process at different time points.

observations at different time points are of same kind/type.

observations recorded with **time index** (= recorded time stamp)

(index could be not time but ordered - for simplicity, still call this time series)

observations are of **variables** (= recording of an observable)

time series with 2 or more variables is called **multivariate**

with 1 variable is called **univariate**

**Example: airline data**

one time series recording number of airline passengers

one observable = number of passengers in given calendar month

index = which calendar month (period = span of calendar month)

In [13]:
from sktime.datasets import load_airline

y = load_airline()
y

Period
1949-01    112.0
1949-02    118.0
1949-03    132.0
1949-04    129.0
1949-05    121.0
           ...  
1960-08    606.0
1960-09    508.0
1960-10    461.0
1960-11    390.0
1960-12    432.0
Freq: M, Name: Number of airline passengers, Length: 144, dtype: float64

In [19]:
# pandas models the time index as a separate object:
y.index

PeriodIndex(['1949-01', '1949-02', '1949-03', '1949-04', '1949-05', '1949-06',
             '1949-07', '1949-08', '1949-09', '1949-10',
             ...
             '1960-03', '1960-04', '1960-05', '1960-06', '1960-07', '1960-08',
             '1960-09', '1960-10', '1960-11', '1960-12'],
            dtype='period[M]', name='Period', length=144)

**Example: macroeconomic data**

one time series recording various macroeconomic variables over time

multiple observables = GDP, unemployment, etc

index = which calendar quarter (period = span of three calendar months)

In [14]:
from sktime.datasets import load_macroeconomic

y = load_macroeconomic()
y

| Period | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1959Q1 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.980 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
| 1959Q2 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.150 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
| 1959Q3 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.350 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
| 1959Q4 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.370 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
| 1960Q1 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.540 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2008Q3 | 13324.600 | 9267.7 | 1990.693 | 991.551 | 9838.3 | 216.889 | 1474.7 | 1.17 | 6.0 | 305.270 | -3.16 | 4.33 |
| 2008Q4 | 13141.920 | 9195.3 | 1857.661 | 1007.273 | 9920.4 | 212.174 | 1576.5 | 0.12 | 6.9 | 305.952 | -8.79 | 8.91 |
| 2009Q1 | 12925.410 | 9209.2 | 1558.494 | 996.287 | 9926.4 | 212.671 | 1592.8 | 0.22 | 8.1 | 306.547 | 0.94 | -0.71 |
| 2009Q2 | 12901.504 | 9189.0 | 1456.678 | 1023.528 | 10077.5 | 214.469 | 1653.6 | 0.18 | 9.2 | 307.226 | 3.37 | -3.19 |


common abstract data model:

data frame, with row index = time index; column index = variable index
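
A minimal sketch of this data model in plain `pandas`, with made-up values: the row index is a quarterly `PeriodIndex`, and each column is one variable. The names `var_a`, `var_b` and `y_toy` are placeholders chosen for this illustration.

```python
import pandas as pd

# quarterly time index, modelled as a pandas PeriodIndex
time_index = pd.period_range(start="2000Q1", periods=4, freq="Q")

# made-up values for two variables observed at the same time points
y_toy = pd.DataFrame(
    {"var_a": [1.0, 1.5, 1.7, 2.1], "var_b": [10.0, 9.0, 9.5, 8.0]},
    index=time_index,
)

print(y_toy.shape)  # (number of time points, number of variables)
print(y_toy["var_a"])  # selecting one column gives a univariate series
```

A univariate series is then simply the special case with a single column, or a `pd.Series` as in the airline example.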

#### **panel of time series**

panel of time series is a collection of multiple time series instances

different time series in the collection = **instances**

instances usually assumed independent, or conditionally independent

**instance index** = names/tags of the different instances

**Example: basic motions data**

multiple time series, each time series (or sequence) comes from one trial

each trial involves smartwatch recording of a person while running etc

six observables = 3 accelerometer, 3 gyroscope

index = time stamp of the observable recording

instance = which trial

In [22]:
from sktime.datasets import load_basic_motions

X, _ = load_basic_motions(return_type="pd-multiindex")
X

|  | timepoints | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.079106 | 0.394032 | 0.551444 | 0.351565 | 0.023970 | 0.633883 |
| 0 | 1 | 0.079106 | 0.394032 | 0.551444 | 0.351565 | 0.023970 | 0.633883 |
| 0 | 2 | -0.903497 | -3.666397 | -0.282844 | -0.095881 | -0.319605 | 0.972131 |
| 0 | 3 | 1.116125 | -0.656101 | 0.333118 | 1.624657 | -0.569962 | 1.209171 |
| 0 | 4 | 1.638200 | 1.405135 | 0.393875 | 1.187864 | -0.271664 | 1.739182 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 79 | 95 | 28.459024 | -16.633770 | 3.631869 | 8.978229 | -3.611533 | -1.491489 |
| 79 | 96 | 10.260094 | 0.102775 | 1.269261 | -1.645964 | -3.377157 | 1.283746 |
| 79 | 97 | 4.316471 | -3.574319 | 2.063831 | -1.717875 | -1.843054 | 0.484734 |
| 79 | 98 | 0.704446 | -4.920444 | 2.851857 | -2.982977 | -0.809665 | -0.721774 |


abstract data model: one value per instance number, time stamp, variable

no common data model, so `sktime` supports multiple (will revisit later)

#### **panel vs multivariate?**

important to distinguish independent instance from variable!

instances indicate different observations; variables indicate different observables!

Example - macroeconomic data. Different *variables*, because they all observe the same thing - the economy.

Variables in the same economy are highly interdependent.

Example - basic motions data. Different *instances*, because they observe different things - different humans.

Motion/gait data of different humans is independent, they do not influence each other (causally or by confounder).

#### **hierarchical time series**

are collections of time series with nested/hierarchical instance index

example: runner & trial repetition = index; observables = motion data of that runner in repetition

example: hospital & patient = index; observables = clinical variables of that patient

example: store & product = index; observable = sales over time period in store of product

will revisit later
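
To make the nested instance index concrete, here is a small sketch of the store & product example in plain `pandas`, with made-up sales numbers. The level names `store`, `product`, `time` and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# three index levels: store and product (the hierarchy), then time
index = pd.MultiIndex.from_product(
    [
        ["store_1", "store_2"],
        ["apples", "pears"],
        pd.period_range("2024-01", periods=3, freq="M"),
    ],
    names=["store", "product", "time"],
)

# made-up sales figures, one value per (store, product, month)
rng = np.random.default_rng(0)
sales = pd.DataFrame({"sales": rng.integers(0, 100, size=len(index))}, index=index)

# one individual time series = fixing all non-time levels
pears_store_1 = sales.loc[("store_1", "pears")]
print(pears_store_1)

# aggregating over the lowest hierarchy level gives one series per store
per_store = sales.groupby(level=["store", "time"]).sum()
print(per_store)
```

The last index level plays the role of the time index; all levels before it jointly identify the instance.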

### 2.2.2 learning task guide - primer

panel tasks

deal with *collections of time series* = "panel data"

Classification = try to assign one *category* per time series, after training on time series/category examples

Example: osuleaf - circumference point distance of leaves. Predict type of tree

Regression = try to assign one *numerical value* per time series, after training on time series/value examples

Example: temperature/pressure/time profile of chemical reactor. Predict total purity (fraction of 1)

Clustering = put different time series in a small number of similarity buckets

## 2.3 `sktime` in-memory data formats

* `sktime` supports and recognizes multiple data formats for convenience and internal use, e.g., `dask`, `xarray`
* abstract data type = "scitype"; in-memory specification = "mtype"
* More information in tutorial on [in-memory data representations and data loading](https://www.sktime.net/en/latest/examples/AA_datatypes_and_datasets.html#In-memory-data-representations-and-data-loading)



Preferred format 1: `pd.DataFrame` with 2-level `MultiIndex`, (instance, time), cols= variables

Preferred format 2: 3D `np.ndarray` with index (instance, variable, time)


### 2.3.1 preferred format 1 - `pd-multiindex` specification

`pd-multiindex` = `pd.DataFrame` with 2-level `MultiIndex`, (instance, time), cols= variables
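
Before loading real data, it may help to see the layout built by hand. The sketch below constructs a small panel in this (instance, time) layout using only `pandas` and `numpy`; the level names, column names and random values are illustrative assumptions, not output of an `sktime` loader.

```python
import numpy as np
import pandas as pd

n_instances, n_timepoints = 3, 4

# two-level row index: (instance, time)
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)],
    names=["instances", "timepoints"],
)

# made-up values for two variables
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(
    rng.normal(size=(n_instances * n_timepoints, 2)),
    index=index,
    columns=["var_0", "var_1"],
)

# the time series of instance 1: rows = time points, columns = variables
print(X_toy.loc[1])
```

Selecting on the first index level returns one instance as an ordinary time series data frame.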

In [None]:
from sktime.datasets import load_osuleaf

# load an example time series panel in pd-multiindex mtype
X, _ = load_osuleaf(return_type="pd-multiindex")

The osuleaf dataset has:

* 442 individual time series instances
* one single variable per time series instance, `dim_0`
* individual time series are observed at 427 time points (the same number for all instances)

In [None]:
X

In [None]:
from sktime.datasets import load_basic_motions

# load an example time series panel in pd-multiindex mtype
X, _ = load_basic_motions(return_type="pd-multiindex")

The basic motions dataset has:

* 80 individual time series instances
* six variables per time series instance, `dim_0` to `dim_5`
* individual time series are observed at 100 time points (the same number for all instances)

In [None]:
X

### 2.3.2 preferred format 2 - `numpy3D` specification

`numpy3D` = 3D `np.ndarray` with index (instance, variable, time)

instance/time index is interpreted as integer

IMPORTANT: unlike `pd-multiindex`, this assumes:

* all individual series have the same length
* all individual series have the same index
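
The relation between the two preferred formats can be sketched with plain `numpy` and `pandas`. The conversion below is written by hand for illustration, on made-up data with assumed names; in practice the conversion utilities in `sktime.datatypes` are likely the better choice.

```python
import numpy as np
import pandas as pd

n_instances, n_variables, n_timepoints = 3, 2, 4

# a numpy3D-style panel: axis 0 = instance, axis 1 = variable, axis 2 = time
X_np = np.arange(n_instances * n_variables * n_timepoints, dtype=float).reshape(
    n_instances, n_variables, n_timepoints
)

# manual conversion to an (instance, time)-indexed DataFrame:
# move time before variable, then stack the instances on top of each other
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["instances", "timepoints"]
)
X_pd = pd.DataFrame(
    X_np.transpose(0, 2, 1).reshape(-1, n_variables),
    index=index,
    columns=[f"var_{j}" for j in range(n_variables)],
)

# and back again; this only works because all series share length and index
X_back = (
    X_pd.to_numpy()
    .reshape(n_instances, n_timepoints, n_variables)
    .transpose(0, 2, 1)
)
print(np.array_equal(X_back, X_np))
```

The round trip relies on the rectangular shape: with unequal length series, no 3D array of this kind exists.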

In [None]:
from sktime.datasets import load_osuleaf

# load an example time series panel in numpy mtype
X, _ = load_osuleaf(return_type="numpy3D")

The osuleaf dataset has:

* 442 individual time series instances
* one single variable per time series instance
* individual time series are observed at 427 time points (the same number for all instances)

In [None]:
X.shape

In [None]:
from sktime.datasets import load_basic_motions

# load an example time series panel in numpy mtype
X, _ = load_basic_motions(return_type="numpy3D")

The basic motions dataset has:

* 80 individual time series instances
* six variables per time series instance
* individual time series are observed at 100 time points (the same number for all instances)

In [None]:
X.shape

### 2.3.3 loading and validity checking

for custom data sets:

1. use `pandas` `read_csv` or similar utilities to obtain a `pd.DataFrame` or `np.ndarray`
2. try to bring the result in one of the preferred specifications
3. use the `check_is_mtype` utility to check compliance - inspect informative error messages
4. repeat 2-3 until the data format check passes
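
Steps 1 and 2 could look as follows for a csv file in "long" format. This is a sketch under assumptions: the file content, the column names `instance`, `time`, `temperature`, `pressure`, and the manual sanity checks are all made up for illustration; the `check_is_mtype` utility remains the actual compliance check.

```python
import io

import pandas as pd

# stand-in for a csv file on disk, in "long" format (made-up data)
csv_text = """instance,time,temperature,pressure
a,0,20.1,1.00
a,1,20.4,1.02
a,2,20.9,1.01
b,0,18.3,0.98
b,1,18.1,0.97
"""

# step 1: read into a pd.DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# step 2: bring into (instance, time) row index with variables as columns
X_custom = df.set_index(["instance", "time"]).sort_index()

# quick manual sanity checks before running the full format check
print(X_custom.index.nlevels)  # number of index levels: instance and time
print(X_custom.index.is_monotonic_increasing)  # whether the index is sorted
lengths = X_custom.groupby(level="instance").size()
print(lengths)  # series lengths per instance, these may differ
```

Note that the two instances have different lengths here, which this data frame layout can represent.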

In [None]:
# let's pretend we just loaded this from csv
from sktime.datasets import load_osuleaf

X_pd, _ = load_osuleaf(return_type="pd-multiindex")

let's now check whether it complies with the `pd-multiindex` specification

In [None]:
from sktime.datatypes import check_is_mtype

valid, error_msg, metadata = check_is_mtype(X_pd, "pd-multiindex", return_metadata=True)

In [None]:
# is it valid?
valid

In [None]:
# helpful metadata, check if this is as per expectations
metadata

let's see what happens if it is not in the expected format.

We have a `pd.DataFrame`, so if we check against `numpy3D`, it should complain:

In [None]:
valid, error_msg, metadata = check_is_mtype(X_pd, "numpy3D", return_metadata=True)

In [None]:
valid

In [None]:
error_msg

This tells us that we should first convert into `np.ndarray` as expected.

For further details on data formats, see the tutorial on [in-memory data representations and data loading](https://www.sktime.net/en/latest/examples/AA_datatypes_and_datasets.html#In-memory-data-representations-and-data-loading).

The "datatypes" tutorial also contains:

* full formal specifications of the mtypes (= machine representations)
* common examples for loading from csv and formatting
* utilities for loading data for commonly used benchmark problems

All supported in-memory representations are python inspectable in `sktime.datatypes.MTYPE_REGISTER`

Note that this includes "exotic", rarely used ones and representations of objects that aren't time series.
Formats for time series panels are indicated by the `Panel` scitype.


## 2.4 Time Series Classification, Regression, Clustering - Basic Vignettes

Above tasks are very similar to "tabular" classification, regression, clustering, as in `sklearn`

Main distinction:
* in "tabular" classification etc, one (feature) instance is a row vector of features
* in TSC, one (feature) instance is a full time series, possibly unequal length, distinct index set



More formally:

* "tabular" classification: training pairs $(x_1, y_1), \dots, (x_n, y_n)$, where $x_i$ are rows of a `pd.DataFrame` (same col types), and $y_i \in \mathcal{C}$ for a finite set $\mathcal{C}$. We use these to train a classifier that predicts $y_* \in \mathcal{C}$ for a `pd.DataFrame` row $x_*$
* time series classification: training pairs $(x_1, y_1), \dots, (x_n, y_n)$, where $x_i$ are time series from a certain domain, and $y_i \in \mathcal{C}$ for a finite set $\mathcal{C}$. We use these to train a classifier that predicts $y_* \in \mathcal{C}$ for time series $x_*$

very similar for time series regression, clustering - exercise left to reader :-)

`sktime` design implications:

* need representation of collections of time series (panels), see Section 2.3
    * same as in "adjacent" learning tasks, e.g., panel forecasting
    * same as for transformation estimators
* algorithms that use sequentiality, can deal with unequal length etc
* algorithms usually based on distances or kernels between time series - need to cover that in framework
* but we can use familiar `fit` / `predict` and `scikit-learn` / `scikit-base` interface!
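
To illustrate the distance-based approach mentioned above, here is a minimal `numpy` sketch of a k-nearest-neighbour time series classifier. The distance is the mean Euclidean distance over all pairs of time points, which is one possible reading of a "mean pairwise Euclidean distance"; the function names and toy data are made up, and this is not how `sktime` implements its classifiers.

```python
import numpy as np


def mean_pairwise_euclidean(a, b):
    """Mean Euclidean distance over all pairs of time points (illustrative).

    a, b : 2D arrays of shape (n_variables, n_timepoints); lengths may differ.
    """
    # diff[:, s, t] = a[:, s] - b[:, t]
    diff = a[:, :, None] - b[:, None, :]
    return np.linalg.norm(diff, axis=0).mean()


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the majority label among the k nearest training series."""
    dists = np.array([mean_pairwise_euclidean(x, x_new) for x in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]


# made-up training panel: a list of series, so unequal length is allowed
X_train_toy = [
    np.zeros((1, 5)),
    np.zeros((1, 7)) + 0.1,
    np.zeros((1, 6)) - 0.1,
    np.ones((1, 5)) * 5.0,
    np.ones((1, 4)) * 5.2,
    np.ones((1, 6)) * 4.9,
]
y_train_toy = np.array(["low", "low", "low", "high", "high", "high"])

x_new_toy = np.ones((1, 8)) * 4.5
print(knn_predict(X_train_toy, y_train_toy, x_new_toy))
```

Because the distance is defined between whole series, the classifier never needs the series to be aligned or of equal length.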

### 2.4.1 Time Series Classification - deployment vignette

Basic deployment vignette for TSC:

1. load/setup training data, `X` in a `Panel` format, `y` as 1D `np.ndarray`
2. load/setup new data for prediction (can also be done after step 3 or 4)
3. specify the classifier using `sklearn`-like syntax
4. fit classifier to training data, `fit(X, y)`
5. predict labels on new data, `predict(X_new)`

In [None]:
# steps 1, 2 - prepare osuleaf dataset (train and new)
from sktime.datasets import load_osuleaf

X_train, y_train = load_osuleaf(split="train", return_type="numpy3D")
X_new, _ = load_osuleaf(split="test", return_type="numpy3D")
X_new = X_new[:2]  # smaller dataset for faster notebook runtime

In [None]:
# this is in numpy3D format, but could also be pd-multiindex or other
X_train.shape

In [None]:
# y is a 1D np.ndarray of labels - same length as number of instances in X_train
y_train.shape

In [None]:
# step 3 - specify the classifier
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

# example 1 - 3-NN with simple dynamic time warping distance (requires numba)
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3)

# example 2:
# 3-nearest neighbour classifier with mean (over time points) pairwise Euclidean distance
# (requires scipy)
from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist

mean_eucl_dist = AggrDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=mean_eucl_dist)

we could specify any `sktime` classifier here - the rest remains the same!

In [None]:
# all classifiers are scikit-learn / scikit-base compatible!
# nested parameter interface via get_params, set_params
clf.get_params()

In [None]:
# step 4 - fit/train the classifier
clf.fit(X_train, y_train)

In [None]:
# the classifier is now fitted
clf.is_fitted

In [None]:
# and we can inspect fitted parameters if we like
clf.get_fitted_params()

In [None]:
# step 5 - predict labels on new data
y_pred = clf.predict(X_new)

In [None]:
# y_pred is a 1D np.ndarray, similar to sklearn classification output
y_pred

all together in one cell:

In [None]:
# steps 1, 2 - prepare osuleaf dataset (train and new)
from sktime.datasets import load_osuleaf

X_train, y_train = load_osuleaf(split="train", return_type="numpy3D")
X_new, _ = load_osuleaf(split="test", return_type="numpy3D")
X_new = X_new[:2]  # smaller dataset for faster notebook runtime

# step 3 - specify the classifier
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist

mean_eucl_dist = AggrDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=mean_eucl_dist)

# step 4 - fit/train the classifier
clf.fit(X_train, y_train)

# step 5 - predict labels on new data
y_pred = clf.predict(X_new)

### 2.4.2 Time Series Classification - simple evaluation vignette

Evaluation is similar to `sklearn` classifiers - we split a dataset and evaluate performance on the test set.

This includes as additional steps:

* splitting the initial, historical data, e.g., using `train_test_split`
* comparing predictions with a held out data set

In [None]:
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.datasets import load_osuleaf

# data should be split into train/test
X_train, y_train = load_osuleaf(split="train", return_type="numpy3D")
X_test, y_test = load_osuleaf(split="test", return_type="numpy3D")
X_test = X_test[:2]
y_test = y_test[:2]

# steps 3-5 are the same
from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist

mean_eucl_dist = AggrDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=mean_eucl_dist)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# for simplest evaluation, compare ground truth to predictions
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
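
For reference, accuracy is just the fraction of matching labels, which can be checked by hand. The sketch below uses made-up labels, not predictions from the classifier above.

```python
import numpy as np

# made-up ground truth and predictions for five time series
y_true_toy = np.array(["1", "2", "2", "3", "1"])
y_pred_toy = np.array(["1", "2", "3", "3", "1"])

# accuracy = fraction of instances whose predicted label matches the truth
accuracy = np.mean(y_true_toy == y_pred_toy)
print(accuracy)

# per-class breakdown: fraction of correct predictions for each true label
per_class = {
    label: np.mean(y_pred_toy[y_true_toy == label] == label)
    for label in np.unique(y_true_toy)
}
print(per_class)
```

With only two test instances, as in the cell above, the accuracy can only take the values 0, 0.5 or 1, so a larger test set is needed for a meaningful estimate.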

### 2.4.3 Time Series Regression - basic vignettes

TSR vignettes are exactly the same as TSC, except that:

* `y` in `fit` input and `predict` output should be float 1D `np.ndarray`, not categorical
* other algorithms are commonly used and/or performant

In [None]:
# steps 1, 2 - prepare dataset (train and new)
from sktime.datasets import load_covid_3month

X_train, y_train = load_covid_3month(split="train")
y_train = y_train.astype("float")
X_new, _ = load_covid_3month(split="test")
X_new = X_new.loc[:2]  # smaller dataset for faster notebook runtime

# step 3 - specify the regressor
from sktime.dists_kernels import ScipyDist
from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.regression.distance_based import KNeighborsTimeSeriesRegressor

mean_eucl_dist = AggrDist(ScipyDist())
reg = KNeighborsTimeSeriesRegressor(n_neighbors=3, distance=mean_eucl_dist)

# step 4 - fit/train the regressor
reg.fit(X_train, y_train)

# step 5 - predict target values on new data
y_pred = reg.predict(X_new)

In [None]:
y_pred  # not too interesting but float

### 2.4.4 Time Series Clustering - basic vignettes

TS clustering is similar - 1st step is also `fit`, but unsupervised

i.e., no labels `y`, and next step is inspecting clusters

In [None]:
from sktime.clustering.dbscan import TimeSeriesDBSCAN
from sktime.datasets import load_osuleaf

# step 1 - prepare dataset
X, _ = load_osuleaf(split="train", return_type="numpy3D")
X = X[:10]

# step 2 - specify the clusterer
from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist

mean_eucl_dist = AggrDist(ScipyDist())
clst = TimeSeriesDBSCAN(distance=mean_eucl_dist)

# step 3 - fit the clusterer to the data
clst.fit(X)

# step 4 - inspect the clustering
clst.get_fitted_params()

## 2.5 Searching for estimators, estimator tags

Estimators in `sktime` are tagged.

Tags starting with "capability" indicate things the estimator can or cannot do, e.g.,

* `"capability:missing_values"` - dealing with missing values
* `"capability:multivariate"` - dealing with multivariate input
* `"capability:unequal_length"` - dealing with time series panels where the individual time series have unequal length and/or unequal index

all tags for an estimator scitype (e.g., classifier, regressor) can be inspected by `sktime.registry.all_tags`:

In [None]:
from sktime.registry import all_tags

all_tags("classifier", as_dataframe=True)

valid estimator types are listed in the `all_tags` docstring, or `sktime.registry.BASE_CLASS_REGISTER`

In [None]:
from sktime.registry import BASE_CLASS_REGISTER

# get only first table column, the list of types
list(zip(*BASE_CLASS_REGISTER))[0]

to find all estimators of a certain type, use `sktime.registry.all_estimators`

In [None]:
# list all classifiers in sktime
from sktime.registry import all_estimators

all_estimators("classifier", as_dataframe=True)

for listing all estimators of a certain type with a certain capability,
use the `filter_tags` argument of `all_estimators`:

In [None]:
# list all classifiers in sktime
# that can classify panels of time series containing missing data
from sktime.registry import all_estimators

all_estimators("classifier", as_dataframe=True, filter_tags={"capability:missing_values": True})
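
Conceptually, filtering by tags amounts to keeping the entries whose tags match every requested key/value pair. The pure Python sketch below shows this on a hypothetical registry; the estimator names, tag values and the helper `filter_by_tags` are made up and do not reflect the `sktime` registry internals.

```python
# hypothetical registry: estimator name -> tags (names and values are made up)
toy_registry = {
    "ToyClassifierA": {
        "capability:missing_values": True,
        "capability:multivariate": False,
    },
    "ToyClassifierB": {
        "capability:missing_values": False,
        "capability:multivariate": True,
    },
    "ToyClassifierC": {
        "capability:missing_values": True,
        "capability:multivariate": True,
    },
}


def filter_by_tags(registry, filter_tags):
    """Return names of entries whose tags match every key/value in filter_tags."""
    return [
        name
        for name, tags in registry.items()
        if all(tags.get(tag) == value for tag, value in filter_tags.items())
    ]


handles_missing = filter_by_tags(toy_registry, {"capability:missing_values": True})
handles_both = filter_by_tags(
    toy_registry,
    {"capability:missing_values": True, "capability:multivariate": True},
)
print(handles_missing)
print(handles_both)
```

Adding more key/value pairs to the filter can only shorten the resulting list.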

side note:

don't worry about how short the list is - when in doubt, it is always possible to pipeline with `Imputer`

as in the next section :-)

## 2.6 Pipelines, Feature Extraction, Tuning, Composition


similar to `sklearn` for "tabular" classification, regression, etc,

`sktime` has a rich set of tools for:

* feature extraction via transformers
* pipeline transformers with any estimator
* tuning individual estimators or pipelines via grid search and similar
* building ensembles out of individual estimators, or other composites

`sktime` is also fully interoperable with `sklearn` interface if `numpy` based data mtypes are used

(although this loses support for unequal length time series)

### 2.6.1 Primer on `sktime` transformers for feature extraction

all `sktime` transformers work natively with panel data:

In [None]:
from sktime.datasets import load_osuleaf
from sktime.transformations.series.detrend import Detrender

# load some panel data
X, _ = load_osuleaf(return_type="pd-multiindex")

# specify a linear detrender
detrender = Detrender()

# detrend X by removing linear trend from each instance
X_detrended = detrender.fit_transform(X)
X_detrended.head()

two transformer distinctions to be aware of:

* series-to-series transformers transform individual series to series, panels to panels. E.g., instance-wise detrender above
* series-to-primitive transformers transform individual series to a set of tabular features. E.g., summary feature extractor

either type of transform can be instance-wise:

* instance-wise transforms use only the i-th series to transform the i-th series. E.g., instance-wise detrender
* non-instance-wise transforms train on all series to transform the i-th series. E.g., PCA, overall mean detrender
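
The distinctions above can be sketched on a toy panel with plain `pandas`. For simplicity the sketch removes means rather than linear trends, so it is an analogy to the detrender, not a re-implementation; all names and values are made up.

```python
import pandas as pd

# toy panel in (instance, time) layout: 2 instances, 3 time points, 1 variable
index = pd.MultiIndex.from_product(
    [[0, 1], [0, 1, 2]], names=["instances", "timepoints"]
)
X_toy = pd.DataFrame({"var_0": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=index)

# series-to-series, instance-wise: subtract each instance's own mean
X_inst = X_toy - X_toy.groupby(level="instances").transform("mean")

# series-to-series, not instance-wise: subtract the mean over the whole panel
X_all = X_toy - X_toy.mean()

# series-to-primitives, instance-wise: one row of features per instance
X_feat = X_toy.groupby(level="instances")["var_0"].agg(["mean", "min", "max"])

print(X_inst, X_all, X_feat, sep="\n\n")
```

The first two results keep the panel shape, while the third has one row per instance and can be passed to any tabular estimator.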

In [None]:
# example of a series-to-primitive transformer
from sktime.transformations.series.summarize import SummaryTransformer

# specify summary transformer
summary_trafo = SummaryTransformer()

# extract summary features - one per instance in the panel
X_summaries = summary_trafo.fit_transform(X)
X_summaries

just like classifiers, we can search for transformers of either type via the right tag:

* `"scitype:transform-input"` and `"scitype:transform-output"` define the input and output scitype, e.g., `"Series"` and `"Primitives"` for a series-to-primitive transformer (both are scitype strings)
* `"scitype:instancewise"` is boolean and tells us whether the transform is instance-wise

In [None]:
# example: looking for all series-to-primitive transformers that are instance-wise
from sktime.registry import all_estimators

all_estimators(
    "transformer",
    as_dataframe=True,
    filter_tags={
        "scitype:transform-input": "Series",
        "scitype:transform-output": "Primitives",
        "scitype:instancewise": True,
    },
)

Further details on transformations and feature extraction later.

All composition steps and syntax (e.g., chaining, column subsetting) work together with all estimator types in `sktime` - forecasting, classification, regression, clustering etc.

### 2.6.2 Pipelines primer

all panel estimators pipeline with `sktime` transformers, via the `*` dunder or `make_pipeline`.

Every pipeline does the following:

* in `fit`: runs the transformers' `fit_transform` in sequence, then `fit` of the panel estimator
* in `predict` (or other method): runs the fitted transformers' `transform` in sequence, then `predict` (or other method) of the estimator

(same logic as for `sklearn` pipelines)
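
The `fit` / `predict` logic above can be sketched in a few lines of plain Python. This is an illustration of the control flow only, not the `sktime` implementation: `ToyPipeline`, `Squarer`, and `NearestMeanClassifier` are hypothetical classes invented for this sketch.

```python
class ToyPipeline:
    """Chains transformers with a final estimator (hypothetical, for illustration)."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, X, y):
        # fit_transform of each transformer in sequence, then fit of the estimator
        for t in self.transformers:
            X = t.fit_transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # transform of each fitted transformer in sequence, then predict
        for t in self.transformers:
            X = t.transform(X)
        return self.estimator.predict(X)


class Squarer:
    """Toy stateless transformer: squares every value of every series."""

    def fit_transform(self, X):
        return self.transform(X)

    def transform(self, X):
        return [[v**2 for v in series] for series in X]


class NearestMeanClassifier:
    """Toy classifier: predicts the class whose mean level is closest."""

    def fit(self, X, y):
        levels = {}
        for series, label in zip(X, y):
            levels.setdefault(label, []).append(sum(series) / len(series))
        self.class_means_ = {c: sum(m) / len(m) for c, m in levels.items()}
        return self

    def predict(self, X):
        preds = []
        for series in X:
            level = sum(series) / len(series)
            closest = min(
                self.class_means_, key=lambda c: abs(self.class_means_[c] - level)
            )
            preds.append(closest)
        return preds


X_toy = [[1, 2], [2, 1], [9, 10], [10, 9]]
y_toy = ["low", "low", "high", "high"]

toy_pipe = ToyPipeline([Squarer()], NearestMeanClassifier()).fit(X_toy, y_toy)
toy_pred = toy_pipe.predict([[1, 1], [10, 10]])
print(toy_pred)
```

Because the composite exposes `fit` and `predict` itself, it can be used anywhere a plain classifier can, which is the point of the pipeline being "also a classifier".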

In [None]:
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.transformations.series.exponent import ExponentTransformer

pipe = ExponentTransformer() * KNeighborsTimeSeriesClassifier()

# this constructs a ClassifierPipeline, which is also a classifier
pipe

In [None]:
# alternative to construct:
from sktime.pipeline import make_pipeline

pipe = make_pipeline(ExponentTransformer(), KNeighborsTimeSeriesClassifier())

In [None]:
from sktime.datasets import load_unit_test

X_train, y_train = load_unit_test(split="TRAIN")
X_test, _ = load_unit_test(split="TEST")

# this is a classifier with the same interface as the knn-classifier
# it first applies the exponent transform, then the knn-classifier
pipe.fit(X_train, y_train)

`sktime` transformers also pipeline with `sklearn` transformers and estimators, e.g., classifiers!

This allows building "time series feature extraction, then `sklearn` classifier" pipelines:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sktime.transformations.series.summarize import SummaryTransformer

# specify a pipeline: summary feature extraction, then random forest
summary_rf = SummaryTransformer() * RandomForestClassifier()

summary_rf.fit(X_train, y_train)

### 2.6.3 Tuning primer

## 2.7 Custom estimators - extension guide

`sktime` is meant to be easily extensible, for direct contribution to `sktime` as well as for local/private extension with custom methods.

To extend `sktime` with a new local or contributed estimator, a good workflow to follow is:

0. find the right extension template for the type of estimator you want to add - e.g., classifier, regressor, clusterer, etc. The extension templates are located in the [`extension_templates`](https://github.com/sktime/sktime/blob/main/extension_templates) directory
1. read through the extension template - this is a `python` file with `todo` blocks that mark the places in which changes need to be added.
2. optionally, if you are planning any major surgeries to the interface: look at the base class - note that "ordinary" extension (e.g., new algorithm) should be easily doable without this.
3. copy the extension template to a local folder in your own repository (local/private extension), or to a suitable location in your clone of the `sktime` or affiliated repository (if contributed extension), inside `sktime.[name_of_task]`; rename the file and update the file docstring appropriately.
4. address the "todo" parts. Usually, this means: changing the name of the class, setting the tag values, specifying hyper-parameters, filling in `__init__`, `_fit`, `_predict` and/or other methods (for details see the extension template). You can add private methods as long as they do not override the default public interface.
5. to test your estimator manually: import your estimator and run it in the basic vignettes above.
6. to test your estimator automatically: call `check_estimator` from `sktime.utils.estimator_checks` on your estimator. You can call this on a class or object instance. Ensure you have specified test parameters in the `get_test_params` method, according to the extension template.
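
The pattern behind step 4 can be sketched as follows: the base class owns the public `fit` / `predict`, and the extender fills in only `__init__`, `_fit`, `_predict`. This is a simplified stand-in written in plain Python, not the actual `sktime` base class or extension template; `ToyBaseClassifier` and `MajorityClassifier` are hypothetical names, and the toy `get_test_params` only mirrors the idea of the real method.

```python
class ToyBaseClassifier:
    """Hypothetical stand-in for a base class: owns the public interface."""

    def fit(self, X, y):
        # public method: shared input checks and bookkeeping, then delegate
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of instances")
        self._fit(X, y)
        self._is_fitted = True
        return self

    def predict(self, X):
        if not getattr(self, "_is_fitted", False):
            raise RuntimeError("call fit before predict")
        return self._predict(X)


class MajorityClassifier(ToyBaseClassifier):
    """Custom estimator: only __init__, _fit, _predict are filled in."""

    def __init__(self, fallback=None):
        # hyper-parameters are stored under their own name, no other logic here
        self.fallback = fallback

    def _fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        # fitted state gets a trailing underscore, as in sklearn
        self.majority_ = max(counts, key=counts.get) if counts else self.fallback

    def _predict(self, X):
        return [self.majority_ for _ in X]

    @classmethod
    def get_test_params(cls):
        # parameter settings that automated tests can use to build an instance
        return {"fallback": 0}


clf = MajorityClassifier(**MajorityClassifier.get_test_params())
clf.fit([[1, 2], [3, 4], [5, 6]], ["a", "b", "a"])
ext_pred = clf.predict([[0, 0], [1, 1]])
print(ext_pred)
```

Keeping the public methods in the base class is what lets checks and bookkeeping stay uniform across estimators, and is why private methods should not override the public interface.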

In case of direct contribution to `sktime` or one of its affiliated packages, additionally:
* add yourself as an author to the code, and to the `CODEOWNERS` for the new estimator file(s).
* create a pull request that contains only the new estimators (and their inheritance tree, if it's not just one class), as well as the automated tests as described above.
* in the pull request, describe the estimator and ideally provide a publication or other technical reference for the strategy it implements.
* before making the pull request, ensure that you have all necessary permissions to contribute the code to a permissive license (BSD-3) open source project.

---

### Credits: notebook 2 - sktime features and overview

notebook creation: fkiraly