# Advanced extension patterns in `sktime`

### Overview of this notebook

* using the advanced extension templates

    * example: forecaster with update and proba functionality
    * a closer look at tags, internal data formats
    * hierarchical data, automated vectorization
    * example: min-max scaler, but across multiple series
    * composite estimators
    * example: MA-of-transformed-data


* automated testing

    * using `check_estimator` as part of a test suite
    * using `sktime` test classes

In [1]:
import sys
sys.path.append("..")

In [2]:
import warnings
warnings.filterwarnings('ignore')

## Advanced example 1: full extension template, optional methods

Let's look at the advanced forecaster extension template: *[extension_templates/forecasting.py](https://github.com/sktime/sktime-workshop-pydata-london-2022/blob/main/extension_templates/forecasting.py)*

The template lists the optional methods to implement at the top:
```python
    updating                    - _update(self, y, X=None, update_params=True):
    predicting quantiles        - _predict_quantiles(self, fh, X=None, alpha=None)
    OR predicting intervals     - _predict_interval(self, fh, X=None, coverage=None)
    predicting variance         - _predict_var(self, fh, X=None, cov=False)
    distribution forecast       - _predict_proba(self, fh, X=None)
    fitted parameter inspection - get_fitted_params()
```

Extender contract ("what to implement") is defined in method docstrings.

The private "underscore" methods mirror public, user-facing methods.

Let's look at the forecaster base class `BaseForecaster` ([BaseForecaster definition](https://github.com/alan-turing-institute/sktime/blob/7a660e56489bed980546155932c30185c89f5327/sktime/forecasting/base/_base.py#L73)):

```python
    updating              - update(y, X=None, update_params=True)
    forecast intervals    - predict_interval(fh=None, X=None, coverage=0.90)
    forecast quantiles    - predict_quantiles(fh=None, X=None, alpha=[0.05, 0.95])
    forecast variance     - predict_var(fh=None, X=None, cov=False)
    distribution forecast - predict_proba(fh=None, X=None, marginal=True)
```

*update* is always implemented, and defaults to "refit on all seen data".

The proba forecast methods are not necessarily implemented.

If one is implemented (by implementing one of `_predict_quantiles`, `_predict_interval`, etc.),
the other ones will be implemented via defaulting.

The `capability:pred_int` tag should be set to `True` if this is done.
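
One piece of arithmetic behind such defaulting: a symmetric prediction interval with coverage `c` is bounded by the quantiles at levels `0.5 - c/2` and `0.5 + c/2`. The sketch below only illustrates this relation, which is consistent with the public defaults `coverage=0.90` and `alpha=[0.05, 0.95]` listed above. The helper name `coverage_to_alphas` is an assumption for illustration and is not an `sktime` function.

```python
def coverage_to_alphas(coverage):
    """Map nominal coverage(s) of symmetric intervals to quantile levels.

    Hypothetical helper for illustration only, not part of sktime.
    """
    if isinstance(coverage, (int, float)):
        coverage = [coverage]  # coerce a single value to a list
    alphas = []
    for c in coverage:
        if not 0 < c < 1:
            raise ValueError(f"coverage must be in (0, 1), got {c}")
        # lower and upper quantile level of the symmetric interval
        alphas.extend([0.5 - c / 2, 0.5 + c / 2])
    return alphas


lower, upper = coverage_to_alphas(0.9)
print(round(lower, 10), round(upper, 10))
```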

In [3]:
from pydata_sktime._3_1_2_forecaster_with_extras_buggy import EasyMAWithUpdateAndProba

from sktime.utils.estimator_checks import check_estimator

results = check_estimator(EasyMAWithUpdateAndProba)

FAILED: test_pred_int_tag[EasyMAWithUpdateAndProba-0]
FAILED: test_pred_int_tag[EasyMAWithUpdateAndProba-1]
FAILED: test_pred_int_tag[EasyMAWithUpdateAndProba-2]
FAILED: test_raises_not_fitted_error[EasyMAWithUpdateAndProba-0]
FAILED: test_raises_not_fitted_error[EasyMAWithUpdateAndProba-1]
FAILED: test_raises_not_fitted_error[EasyMAWithUpdateAndProba-2]


In [4]:
# what's happening? Let's inspect the error returned
results["test_pred_int_tag[EasyMAWithUpdateAndProba-0]"]
# fortunately this is helpful:

ValueError('EasyMAWithUpdateAndProba does implement probabilistic forecasting, but "capability:pred_int" flag has been set to False incorrectly. The flag "capability:pred_int" should instead be set to True.')

clear: we need to set the `capability:pred_int` tag to True, because we have implemented a proba method.

In [5]:
results["test_raises_not_fitted_error[EasyMAWithUpdateAndProba-0]"]

AttributeError("'EasyMAWithUpdateAndProba' object has no attribute '_forecast_value'")

not very clear... need to look at traceback and test
1. what does test do?
    code inspection: "Test that calling post-fit methods before fit raises error"
    
    -> `get_fitted_params` should raise `NotFittedError` if not fitted
2. what does traceback indicate?

    -> a fitted parameter is accessed before fitting, consistent with test failure
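
The failure mode can be reproduced with a toy class. This is a minimal sketch, assuming a boolean `_is_fitted` flag and a guard method called before any fitted attribute is read; `ToyMeanForecaster` and the locally defined `NotFittedError` are stand-ins for illustration, not the `sktime` classes.

```python
class NotFittedError(Exception):
    """Stand-in exception, defined here so the example is self-contained."""


class ToyMeanForecaster:
    """Toy estimator (not an sktime class) showing the fitted-state guard."""

    def __init__(self):
        self._is_fitted = False

    def fit(self, y):
        self._forecast_value = sum(y) / len(y)  # fitted attribute, set only in fit
        self._is_fitted = True
        return self

    def check_is_fitted(self):
        if not self._is_fitted:
            raise NotFittedError(f"{type(self).__name__} has not been fitted yet")

    def get_fitted_params(self):
        # guard first: without it, the line below raises AttributeError before fit
        self.check_is_fitted()
        return {"forecast_value": self._forecast_value}


try:
    ToyMeanForecaster().get_fitted_params()
    raised = None
except NotFittedError as err:
    raised = err

params = ToyMeanForecaster().fit([1, 2, 3]).get_fitted_params()
```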

In [6]:
# obtain the full traceback by running this; return_exceptions=False raises the error
check_estimator(EasyMAWithUpdateAndProba, tests_to_run="test_raises_not_fitted_error", return_exceptions=False)

AttributeError: 'EasyMAWithUpdateAndProba' object has no attribute '_forecast_value'

let's fix this (see "forecaster with extras, complete"), and let's check again:

In [7]:
from pydata_sktime._3_1_3_forecaster_with_extras_complete import EasyMAWithUpdateAndProba

from sktime.utils.estimator_checks import check_estimator

results = check_estimator(EasyMAWithUpdateAndProba, return_exceptions=False)

All tests PASSED!


### :-)

**Caveats**:

* note the differences in signatures of public/private methods
    * private methods have simpler interface due to "plumbing" layer that replaces boilerplate
    * example: `fh` is guaranteed `ForecastingHorizon`; `alpha` is guaranteed list
    * examples: `y` and `X` are in specified mtype formats (more below)
* when implementing optional methods, check and possibly set capability tags
    * `check_estimator` will often catch an incorrectly set capability tag
* when calling `self.method` from `self.method2`, mind conversions. Often `self._method` is better for self-call
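
The public/private split from the caveats can be sketched with toy classes. This is an illustration under stated assumptions, not the `BaseForecaster` code: the class names are hypothetical, `fh` is coerced to a plain list as a stand-in for `ForecastingHorizon`, and the returned values are placeholders.

```python
class ToyBaseForecaster:
    """Toy base class (not sktime's BaseForecaster): public methods do the plumbing."""

    def predict_quantiles(self, fh=None, alpha=None):
        # public layer: defaults, checks and coercions happen once, here
        if alpha is None:
            alpha = [0.05, 0.95]
        if isinstance(alpha, float):
            alpha = [alpha]
        if fh is None:
            raise ValueError("fh must be passed")
        fh = [fh] if isinstance(fh, int) else list(fh)
        return self._predict_quantiles(fh=fh, alpha=alpha)


class ToyConstantForecaster(ToyBaseForecaster):
    """Extender code: only the private method is implemented."""

    def _predict_quantiles(self, fh, alpha):
        # private layer: may assume fh and alpha are lists, no boilerplate needed
        # placeholder values, one per (horizon step, quantile level) pair
        return {(step, a): 10.0 * a for step in fh for a in alpha}


result = ToyConstantForecaster().predict_quantiles(fh=1, alpha=0.5)
```

Calling `self._predict_quantiles` from another private method skips the coercion step, which is why the private method is often the better choice for self-calls.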

## `sktime` tags explained

all `sktime` estimators have tags, similar to `sklearn`

three types of tags:

* capability tags, e.g., "has `predict_proba`"
* property and type tags, e.g., "is a tree-based method", or "outputs series"
* behavioural tags, e.g., "instruction: convert to `numpy` for inner `_fit` method"

Tag related methods, via `BaseObject`:
* `get_tags`, `get_tag` - retrieve tag values
* `set_tags`, `clone_tags` - set tags, *developer use only* for implementing estimators

Tag values may depend on estimator parameter values!
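
A minimal sketch of how a tag value can depend on a parameter: class-level defaults are overridden per instance in the constructor. `ToyTagged`, its `impute` parameter, and the internal `_tags_dynamic` dictionary are assumptions for illustration; only the method names `get_tags`, `get_tag`, `set_tags` follow the list above.

```python
class ToyTagged:
    """Toy object with tags (not sktime's BaseObject)."""

    # class-level default tag values
    _tags = {"capability:pred_int": False, "handles-missing-data": False}

    def __init__(self, impute=False):
        self.impute = impute
        self._tags_dynamic = {}  # per-instance overrides
        if impute:
            # tag value depends on the parameter value
            self.set_tags(**{"handles-missing-data": True})

    def set_tags(self, **tag_dict):
        self._tags_dynamic.update(tag_dict)
        return self

    def get_tags(self):
        # instance overrides take precedence over class defaults
        return {**type(self)._tags, **self._tags_dynamic}

    def get_tag(self, name):
        return self.get_tags()[name]


default_value = ToyTagged().get_tag("handles-missing-data")
imputing_value = ToyTagged(impute=True).get_tag("handles-missing-data")
```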

In [8]:
# example: tags of ARIMA forecaster
from sktime.forecasting.arima import ARIMA

ARIMA().get_tags()

{'scitype:y': 'univariate',
 'ignores-exogeneous-X': False,
 'capability:pred_int': True,
 'handles-missing-data': True,
 'y_inner_mtype': 'pd.Series',
 'X_inner_mtype': 'pd.DataFrame',
 'requires-fh-in-fit': False,
 'X-y-must-have-same-index': True,
 'enforce_index_type': None,
 'fit_is_empty': False}

**Forecaster tags:**

Capability tags:
* `scitype:y`: which y are fine? univariate/multivariate/both
* `handles-missing-data`: can the estimator handle missing data? boolean, True or False
* `capability:pred_int`: does forecaster implement proba forecasts? boolean, True or False.

Property and type tags:
* `ignores-exogeneous-X`: does the estimator ignore the exogenous X? boolean, True or False
* `requires-fh-in-fit`: is the forecasting horizon already required in fit? boolean, True or False.  
* `X-y-must-have-same-index`: must X and y have the same index? boolean, True or False.

Behavioural tags:
* `y_inner_mtype`: `sktime` data format (mtype) used in internal methods `_fit`, `_predict`. Example: `"pd.Series"`
* `X_inner_mtype`: `sktime` data format (mtype) used in internal methods `_fit`, `_predict`.  Example: `"pd.DataFrame"`
* `enforce_index_type`: index type that needs to be enforced in X/y. None if index type is not enforced.
* `fit_is_empty`: is fit empty and can be skipped? boolean, True or False

**Transformer tags:**

Capability tags:
* `capability:inverse_transform`: can the transformer inverse transform? boolean, True or False
* `univariate-only`: is the transformer restricted to univariate X, i.e., unable to handle multivariate X? boolean, True or False
* `capability:unequal_length`: can the transformer handle unequal length time series (if passed Panel)? boolean, True or False
* `capability:unequal_length:removes`: is transform result always guaranteed to be equal length (and series)? not relevant for transformers that return Primitives in transform-output. boolean, True or False
* `handles-missing-data`: can estimator handle missing data? boolean, True or False
* `capability:missing_values:removes`: is transform result always guaranteed to contain no missing values? boolean, True or False

Property and type tags:
* `scitype:transform-input`: what is the scitype of X: Series, or Panel
* `scitype:transform-output`: what scitype is returned: Primitives, Series, Panel
* `scitype:transform-labels`: what is the scitype of y: None (not needed), Primitives, Series, Panel
* `scitype:instancewise`: is this an instance-wise transform? boolean, True or False  
    for example the [LogTransformer](https://github.com/alan-turing-institute/sktime/blob/4bf649b9a55861f8e7f61f017384d3e035a7d689/sktime/transformations/series/boxcox.py#L211) is applied on each time point individually.
* `requires_y`: does y need to be passed in fit? boolean, True or False
* `X-y-must-have-same-index`: must X and y have the same index? boolean, True or False
* `transform-returns-same-time-index`: does the transform output have the same time index as the input X? boolean, True or False

Behavioural tags:
* `X_inner_mtype`: `sktime` data format (mtype) used in internal methods `_fit`, `_transform`.  Example: `"pd.DataFrame"`
* `y_inner_mtype`: `sktime` data format (mtype) used in internal methods `_fit`, `_transform`. Should be `"None"` if `y` is not used.
* `enforce_index_type`: index type that needs to be enforced in X/y. None if no index type is enforced
* `fit_is_empty`: is fit empty and can be skipped? boolean, True or False
* `skip-inverse-transform`: is inverse-transform skipped when called? boolean, True or False

## `sktime` data formats - scitypes and mtypes

`sktime` supports multiple specifications for in-memory time series containers:
* **scitypes:** Short for scientific types, also: abstract data type. Type on the mathematical level.
    * example: a time series (mathematical) = `Series` scitype in `sktime`
* **mtypes:** Short for machine types. Specific representation types, for a scitype.
    * Scitype can have multiple mtypes
    * example: a time series, represented as a `pd.DataFrame` in a specific way
    * example: a time series, represented as a `np.ndarray` in a specific way

`sktime` typically accepts mtypes of the same scitype interchangeably.
E.g., can pass `numpy` or `pandas` representation to methods.

The currently supported scitypes for time series are:
* **Series:** uni- or multivariate time series
* **Panel:** panel of uni- or multivariate time series
* **Hierarchical:** hierarchical panel of time series with 3 or more levels

(there are more scitypes, for other objects)

Some of the mtypes for the `Panel` scitype:
* **pd-multiindex:** `pd.DataFrame` with multi-index (instances, timepoints)
* **nested_univ:** `pd.DataFrame` with one column per variable, `pd.Series` in cells
* **numpy3D:** 3D `np.array` of format (n_instances, n_columns, n_timepoints)
* **df-list:** `list` of `pd.DataFrame`, each `DataFrame` a series

Full list:

In [9]:
import pandas as pd
from sktime.datatypes import MTYPE_REGISTER

pd.DataFrame(MTYPE_REGISTER, columns=["mtype string", "scitype", "explanation"])

|   | mtype string | scitype | explanation |
|---|---|---|---|
| 0 | pd.Series | Series | pd.Series representation of a univariate series |
| 1 | pd.DataFrame | Series | pd.DataFrame representation of a uni- or multi... |
| 2 | np.ndarray | Series | 2D numpy.ndarray with rows=samples, cols=varia... |
| 3 | nested_univ | Panel | pd.DataFrame with one column per variable, pd.... |
| 4 | numpy3D | Panel | 3D np.array of format (n_instances, n_columns,... |
| 5 | numpyflat | Panel | WARNING: only for internal use, not a fully su... |
| 6 | pd-multiindex | Panel | pd.DataFrame with multi-index (instances, time... |
| 7 | pd-wide | Panel | pd.DataFrame in wide format, cols = (instance*... |
| 8 | pd-long | Panel | pd.DataFrame in long format, cols = (index, ti... |
| 9 | df-list | Panel | list of pd.DataFrame |


`sktime` treats different mtypes of the same scitype as interchangeable.
Internally, inputs are converted between mtypes, much as if `convert_to` were called by hand.

In [10]:
from IPython.display import display

from sktime.datatypes import convert_to
from sktime.datatypes import get_examples

example_panel = get_examples("pd-multiindex")[0]

print("pd-multiindex")
display(example_panel)
print("")

print("numpy3D")
example_panel = convert_to(obj=example_panel, to_type="numpy3D")
display(example_panel)
print("")

pd-multiindex


| instances | timepoints | var_0 | var_1 |
|---|---|---|---|
| 0 | 0 | 1 | 4 |
| 0 | 1 | 2 | 5 |
| 0 | 2 | 3 | 6 |
| 1 | 0 | 1 | 4 |
| 1 | 1 | 2 | 55 |
| 1 | 2 | 3 | 6 |
| 2 | 0 | 1 | 42 |
| 2 | 1 | 2 | 5 |
| 2 | 2 | 3 | 6 |



numpy3D


array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 1,  2,  3],
        [ 4, 55,  6]],

       [[ 1,  2,  3],
        [42,  5,  6]]])




How does this relate to tags?

`X_inner_mtype`, `y_inner_mtype` and similar tags specify "inner mtypes".

= tags take mtype strings, guarantee mtype that is seen for `X`, `y`, in `_fit`, `_predict`, etc.

No need for conversion boilerplate!
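
For comparison, this is roughly the kind of boilerplate that the inner mtype tags make unnecessary. The sketch converts the `pd-multiindex` example above to the `numpy3D` layout by hand, using only `pandas` and `numpy`. It assumes equal-length series and an index sorted by (instances, timepoints), and it is not how `convert_to` is implemented.

```python
import numpy as np
import pandas as pd

# rebuild the pd-multiindex example panel: 3 instances, 3 time points, 2 variables
index = pd.MultiIndex.from_product(
    [range(3), range(3)], names=["instances", "timepoints"]
)
panel = pd.DataFrame(
    {"var_0": [1, 2, 3] * 3, "var_1": [4, 5, 6, 4, 55, 6, 42, 5, 6]},
    index=index,
)

n_instances = panel.index.get_level_values("instances").nunique()
n_timepoints = panel.index.get_level_values("timepoints").nunique()
n_columns = panel.shape[1]

# rows are ordered (instance, timepoint): reshape to (instances, timepoints, columns),
# then swap the last two axes to get (n_instances, n_columns, n_timepoints)
arr = (
    panel.to_numpy()
    .reshape(n_instances, n_timepoints, n_columns)
    .transpose(0, 2, 1)
)
print(arr.shape)
```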

---
## Advanced example 2: hierarchical data, automated vectorization

`sktime` automatically "vectorizes" forecasts and transformations across instances in a hierarchical time series or panel.

= apply transformer on each individual series and collect the results in same structure

But what if we want to write a `MinMaxScaler` that computes min/max across the entire data?
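
The difference between the two behaviours can be sketched in plain `pandas`, without any `sktime` machinery. The example uses a small panel with the same `var_1` values as the `pd-multiindex` example earlier; the helper `min_max` is a hypothetical name for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [range(3), range(3)], names=["instances", "timepoints"]
)
X_panel = pd.DataFrame({"var_1": [4, 5, 6, 4, 55, 6, 42, 5, 6]}, index=index)


def min_max(s):
    """Scale values to [0, 1] using the min and max of the object passed in."""
    return (s - s.min()) / (s.max() - s.min())


# per-series scaling: min/max computed separately within each instance
per_series = X_panel.groupby(level="instances").transform(min_max)

# global scaling: a single min/max computed over all instances
global_scaled = min_max(X_panel)

print(per_series.join(global_scaled, lsuffix="_per_series", rsuffix="_global"))
```

Per-series scaling maps every series onto the full range [0, 1], while global scaling keeps the series comparable to each other.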

In [11]:
import pandas as pd
from sktime.datatypes import get_examples

X = get_examples("pd_multiindex_hier")[0][["var_1"]]
display(X)

from pydata_sktime._2_1_2_simple_transformer_complete import MinMaxScaler

t = MinMaxScaler()
Xt = t.fit_transform(X)
display(Xt)


| foo | bar | timepoints | var_1 |
|---|---|---|---|
| a | 0 | 0 | 4 |
| a | 0 | 1 | 5 |
| a | 0 | 2 | 6 |
| a | 1 | 0 | 4 |
| a | 1 | 1 | 55 |
| a | 1 | 2 | 6 |
| a | 2 | 0 | 42 |
| a | 2 | 1 | 5 |
| a | 2 | 2 | 6 |
| b | 0 | 0 | 4 |


| foo | bar | timepoints | var_1 |
|---|---|---|---|
| a | 0 | 0 | 0.0 |
| a | 0 | 1 | 0.5 |
| a | 0 | 2 | 1.0 |
| a | 1 | 0 | 0.0 |
| a | 1 | 1 | 1.0 |
| a | 1 | 2 | 0.039216 |
| a | 2 | 0 | 1.0 |
| a | 2 | 1 | 0.0 |
| a | 2 | 2 | 0.027027 |
| b | 0 | 0 | 0.0 |


Let's look at the simple transformer again: [pydata_sktime/_3_2_2_transformer_hierarchical_buggy.py](https://github.com/sktime/sktime-workshop-pydata-london-2022/blob/main/pydata_sktime/_3_2_2_transformer_hierarchical_buggy.py)

We change the `X_inner_mtype` to `pd_multiindex_hier`.

This allows us to work with a `pd.DataFrame` with hierarchy levels internally.

The full `X` will be passed to `_transform` and can be handled there.

In [12]:
from pydata_sktime._3_2_2_transformer_hierarchical_buggy import MinMaxScalerHierarchical

t = MinMaxScalerHierarchical()
t.fit_transform(X)

| foo | bar | timepoints | var_1 |
|---|---|---|---|
| a | 0 | 0 | 0.0 |
| a | 0 | 1 | 0.019608 |
| a | 0 | 2 | 0.039216 |
| a | 1 | 0 | 0.0 |
| a | 1 | 1 | 1.0 |
| a | 1 | 2 | 0.039216 |
| a | 2 | 0 | 0.745098 |
| a | 2 | 1 | 0.019608 |
| a | 2 | 2 | 0.039216 |
| b | 0 | 0 | 0.0 |


In [13]:
X_simple = get_examples("pd.Series")[0]
display(X_simple)

t = MinMaxScalerHierarchical()
Xt_simple = t.fit_transform(X_simple)
display(Xt_simple)


0    1.0
1    4.0
2    0.5
3   -3.0
Name: a, dtype: float64

0    0.571429
1    1.000000
2    0.500000
3    0.000000
dtype: float64

let's test this!

In [14]:
from sktime.utils.estimator_checks import check_estimator

results = check_estimator(MinMaxScalerHierarchical)

FAILED: test_fit_idempotent[MinMaxScalerHierarchical-TransformerFitTransformPanelUnivariateWithClassY-transform]
FAILED: test_methods_do_not_change_state[MinMaxScalerHierarchical-TransformerFitTransformPanelUnivariateWithClassY-transform]
FAILED: test_methods_have_no_side_effects[MinMaxScalerHierarchical-TransformerFitTransformPanelUnivariateWithClassY-transform]
FAILED: test_persistence_via_pickle[MinMaxScalerHierarchical-TransformerFitTransformPanelUnivariateWithClassY-transform]
FAILED: test_fit_transform_output[MinMaxScalerHierarchical-TransformerFitTransformPanelUnivariateWithClassY]
FAILED: test_transform_inverse_transform_equivalent[MinMaxScalerHierarchical-TransformerFitTransformPanelUnivariateWithClassY]


Why is this failing tests?

In [15]:
# what's happening? Let's inspect the error returned
results["test_fit_idempotent[MinMaxScalerHierarchical-TransformerFitTransformPanelUnivariateWithClassY-transform]"]

KeyError('Level [-2, -1] not found')

seems like attempted `pandas.MultiIndex` access in a case where the passed object has only a simple `pandas` index

probably vectorization or conversion between scitypes is failing

let's try: set `X_inner_mtype` to the list `["pd_multiindex_hier", "pd-multiindex", "pd.DataFrame"]`

= the inner functions get:
* `pd_multiindex_hier` formatted `X` if `X` is `Hierarchical`
* `pd-multiindex` formatted `X` if `X` is `Panel`
* `pd.DataFrame` formatted `X` if `X` is `Series`

the rule is:
* if an mtype of the same scitype is on the list, convert to that
* if no mtype of the same scitype is on the list, try to vectorize or coerce

the code in `_fit`, `_transform` should work for any `pandas` format,

so this way we avoid conversions between scitypes
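
The rule can be written down as a small lookup. This is a sketch of the decision logic only, not the `sktime` implementation: `resolve_inner_mtype` is a hypothetical helper, and the mtype-to-scitype mapping is a hard-coded subset of the register shown earlier plus `pd_multiindex_hier`.

```python
# scitype of each mtype string used here (subset, hard-coded for illustration)
MTYPE_TO_SCITYPE = {
    "pd.Series": "Series",
    "pd.DataFrame": "Series",
    "pd-multiindex": "Panel",
    "numpy3D": "Panel",
    "pd_multiindex_hier": "Hierarchical",
}


def resolve_inner_mtype(input_mtype, inner_mtypes):
    """Return (action, target mtype) for an input, following the rule above.

    Hypothetical helper, not part of sktime.
    """
    if isinstance(inner_mtypes, str):
        inner_mtypes = [inner_mtypes]
    input_scitype = MTYPE_TO_SCITYPE[input_mtype]
    for mtype in inner_mtypes:
        # an mtype of the same scitype is on the list: convert to it
        if MTYPE_TO_SCITYPE[mtype] == input_scitype:
            return "convert", mtype
    # no mtype of the same scitype on the list
    return "vectorize or coerce", None


inner = ["pd_multiindex_hier", "pd-multiindex", "pd.DataFrame"]
print(resolve_inner_mtype("numpy3D", inner))
print(resolve_inner_mtype("pd.Series", "pd_multiindex_hier"))
```

With the three-element list, every supported scitype finds a `pandas` mtype of its own kind, so the vectorize-or-coerce branch is never reached.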

let's have a look at the transformer after we make the change:

In [16]:
from sktime.utils.estimator_checks import check_estimator

from pydata_sktime._3_2_3_transformer_hierarchical_complete import MinMaxScalerHierarchical

t = MinMaxScalerHierarchical()
results = check_estimator(MinMaxScalerHierarchical)

All tests PASSED!


---
## Advanced example 3: a composite estimator

Let's look at the advanced forecaster extension template: *[extension_templates/forecasting.py](https://github.com/sktime/sktime-workshop-pydata-london-2022/blob/main/extension_templates/forecasting.py)*

Extra things to do for composites.

In constructor:

* pass component(s) to constructor, write to `self`
* create clone, this will be fitted. Don't fit constructor arg!
* set dynamic tags

In methods:

* usually composite `_fit` calls component `fit`, composite `_update` calls component `update` etc
* non-state-changing methods `_predict`, `_transform` should not call inner state-changing `fit`, `update`

In [17]:
from pydata_sktime._3_3_2_forecaster_composite_buggy import CompositeMovingAverage

from sktime.utils.estimator_checks import check_estimator

results = check_estimator(CompositeMovingAverage)

FAILED: test_fit_does_not_overwrite_hyper_params[CompositeMovingAverage-1-ForecasterFitPredictUnivariateWithX]


Tests fail, what's wrong?

In [18]:
results["test_fit_does_not_overwrite_hyper_params[CompositeMovingAverage-1-ForecasterFitPredictUnivariateWithX]"]

AssertionError('Estimator CompositeMovingAverage should not change or mutate  the parameter transformer from ExponentTransformer(power=3) to ExponentTransformer(power=3) during fit.')

Common error!

We have changed the component that was passed to the constructor in the course of `fit`.

See in the code of the buggy forecaster's `_fit`:

```python
if self.transformer is not None:
    self.transformer.fit(y)
```

and `transformer` was the component.

According to the `sklearn` interface specification (which `sktime` follows),

*components must never be mutated*

Instead: make a clone and use that for fitting, etc.
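
A minimal sketch of the clone-then-fit pattern, using toy classes in plain Python. `ToyScaler` and `ToyComposite` are hypothetical, and `copy.deepcopy` stands in for the library's cloning utility; the point is only that the object passed to the constructor stays unfitted.

```python
from copy import deepcopy


class ToyScaler:
    """Toy component with fitted state; stands in for a transformer."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, y):
        self.max_ = max(y)  # state written by fit
        return self

    def transform(self, y):
        return [self.factor * v / self.max_ for v in y]


class ToyComposite:
    """Toy composite: averages the transformed data."""

    def __init__(self, transformer=None):
        self.transformer = transformer  # constructor arg, stored untouched

    def fit(self, y):
        if self.transformer is not None:
            # fit a copy, never the object passed by the user
            self.transformer_ = deepcopy(self.transformer).fit(y)
            y = self.transformer_.transform(y)
        self.mean_ = sum(y) / len(y)
        return self


scaler = ToyScaler()
composite = ToyComposite(scaler).fit([1, 2, 4])
print(hasattr(scaler, "max_"), composite.transformer_.max_)
```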

Let's fix that:

In [19]:
from pydata_sktime._3_3_3_forecaster_composite_complete import CompositeMovingAverage

from sktime.utils.estimator_checks import check_estimator

results = check_estimator(CompositeMovingAverage)

All tests PASSED!


### :-)

**Caveats**:

* *never* call `fit` etc on components passed to the constructor
    * always clone and use the cloned object for fitting etc
* when calling methods of components, use the *public* ones, i.e., `fit`, not `_fit`, etc
* do not overwrite conversion tags `X_inner_mtype`, `y_inner_mtype` etc dynamically!
    * if you do, this will typically get you in trouble and break assumed interface patterns
* `_update`: composites usually need this implemented, call `update` of components (to update data memory)

---
## Test framework integration

Ways to integrate `sktime` tests with local test framework:

* import and use `sktime.utils.estimator_checks.check_estimator`
* import and extend `sktime` test classes

importing `check_estimator`: let's look at an example in the repository *[pydata_sktime/test_estimators.py](https://github.com/sktime/sktime-workshop-pydata-london-2022/blob/main/pydata_sktime/test_estimators.py)*

---
## Summary

* `sktime` is an extensible, sklearn-like framework for learning with time series
* use extension template to build `sktime`-compatible components
* use `sktime` testing integration to ensure compatibility and in CI/CD
* `sktime`-compatible components work out-of-the-box with `sktime` framework machinery

---

### Credits: notebook 3 - advanced extension

notebook creation: fkiraly, ltsaprounis

extension templates: fkiraly\
extension guide (developer docs): fkiraly\
forecaster base class: mloning, big-o, fkiraly, sveameyer13, miraep8\
transformer base class: mloning, fkiraly\
testing framework: mloning, fkiraly

---

## Join sktime!

* openly governed, approx equal academia/industry/early career split
    * 19 core developers
    * community council
* numfocus-affiliated, affiliated academic centers in UK (and expanding)

**EVERYONE CAN JOIN! EVERYONE CAN BECOME A COMMUNITY LEADER!**

* join our slack (developers) and discord (events)!
    * regular **community collaboration sessions** and stand-ups on Fridays
    * next **developer sprint**: July 11 - July 15 -> [register online!](https://www.eventbrite.com/e/dev-days-2022-tickets-366909134097?utm-campaign=social&utm-content=attendeeshare&utm-medium=discovery&utm-term=listing&utm-source=cp&aff=escb)

Opportunities:
* job opportunity, **maintainer role**, watch the jobs channel
* sktime **mentoring programme**: github.com/sktime/mentoring

**sktime developer sprint 2022**

<img src="img/devdaysQR2022.png"/>