# Package Building with `skbase`

### Overview of this notebook

This notebook walks through building an `sklearn`-like package with `skbase`:

* use of base classes
* strategy pattern, template pattern
* tags and configs
* retrieval via `all_objects`
* heterogeneous meta-estimators
* testing

In [None]:
import sys
sys.path.append("..")

In [None]:
import numpy as np
import pandas as pd

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Using ``skbase`` to implement an ``sklearn``-like package

Recommended recipe:

1. implement a few examples of algorithms
2. use that to come up with a sensible template design
3. write down the user contract as a base class sketch, with full docstrings (including what each method assumes and guarantees)
4. write down the extension contract as an extension template, with full docstrings (including what each method assumes and guarantees)
5. implement base classes using ``BaseObject``, ``BaseEstimator``, meta classes if applicable
    * public interface, boilerplate methods, base functionality
    * template methods, extension interface
    * tags, configs
6. implement concrete classes, start with reinterfacing examples
    * subject to tests
7. implement test framework for interface, e.g., `check_estimator`
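
Step 7 can be sketched as follows. This is a minimal, hypothetical `check_estimator` in plain Python, not the `skbase` or `sklearn` utility of the same name; `ToyMean` and the four checks are assumptions made up for illustration.

```python
import inspect


class ToyMean:
    """Hypothetical estimator used only to exercise the checks below."""

    def __init__(self, shift=0.0):
        self.shift = shift

    def get_params(self):
        # read constructor arguments back from attributes of the same name
        names = inspect.signature(type(self).__init__).parameters
        return {n: getattr(self, n) for n in names if n != "self"}

    def fit(self, y):
        self.mean_ = sum(y) / len(y) + self.shift
        return self

    def predict(self, n):
        return [self.mean_] * n


def check_estimator(est_class, fit_args):
    """Run a few interface checks and return the names of those that passed."""
    passed = []

    est = est_class()
    params = est.get_params()
    # contract 1: constructing from get_params reproduces the parameters
    assert est_class(**params).get_params() == params
    passed.append("params_roundtrip")

    # contract 2: __init__ stores parameters and does nothing else
    assert set(vars(est)) == set(params)
    passed.append("init_only_stores_params")

    # contract 3: fit returns self, so calls can be chained
    assert est.fit(*fit_args) is est
    passed.append("fit_returns_self")

    # contract 4: fit adds attributes, and all of them end in an underscore
    new_attrs = set(vars(est)) - set(params)
    assert new_attrs and all(a.endswith("_") for a in new_attrs)
    passed.append("fitted_attrs_underscore")

    return passed


results = check_estimator(ToyMean, fit_args=([1.0, 2.0, 3.0],))
print(results)
```

Each check encodes one clause of the user contract, so a new estimator that violates the contract fails with a pointer to the clause it broke.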

Advanced functionality:

* estimator retrieval via `all_objects`
* systematic testing via test class framework

### Iteration 0 - examples, initial design

at the start, we want to:

* implement examples of the algorithms to inform the design
* use the examples to abstract a public unified interface, extender pattern

IMPORTANT: "3 example rule"

= before abstracting a design, ensure you have at least 3 substantially distinct examples!

e.g., 3 regressors, 3 transformers, 3 pipelines, etc

(in this repository, there is only 1 example of each kind, for ease of reading, but the reader is expected to be familiar with others from `sklearn`!)

Examples in the repo:

* linear regression "naive implementation"
* preprocessing/scaling "naive implementation"

*[pydata_skbase/mini_sklearn_v0](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v0)*
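
As a sketch of what a "naive implementation" at this stage might look like, here is a standalone scaler with no base class and no input checks. It is written in plain Python for illustration and is not the code from the repository; `NaiveScaler` and its attribute names are hypothetical.

```python
class NaiveScaler:
    """Column-wise standardization on a list of rows, no base class."""

    def fit(self, X):
        n_rows = len(X)
        columns = list(zip(*X))  # transpose rows into columns
        self.mean_ = [sum(col) / n_rows for col in columns]
        # population standard deviation per column
        self.std_ = [
            (sum((v - m) ** 2 for v in col) / n_rows) ** 0.5
            for col, m in zip(columns, self.mean_)
        ]
        return self

    def transform(self, X):
        return [
            [(v - m) / s for v, m, s in zip(row, self.mean_, self.std_)]
            for row in X
        ]


X_toy = [[1.0, 10.0], [3.0, 30.0]]
scaler = NaiveScaler().fit(X_toy)
print(scaler.mean_, scaler.std_)
print(scaler.transform(X_toy))
```

Writing two or three such standalone classes side by side makes the shared surface (`fit`, a second method, fitted attributes with a trailing underscore) visible before any base class is designed.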

### Loading data used throughout

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
y = pd.DataFrame(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Iteration 1 - base class interface, strategy pattern

our 1st iteration - class design based on iteration 0

* in `base.py`, we define the public contract
    * `fit`/`predict` for regressors, `fit`/`transform` for transformers
* `base.py` also has the extender contract
    * overriding `fit`/`predict` directly
    * ensuring input and output checks are in place

*[pydata_skbase/mini_sklearn_v1](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v1)*
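
A minimal sketch of this first design, in plain Python. The names (`ToyBaseRegressor`, `_check_X`, `ToyMeanRegressor`) are hypothetical and do not mirror the repository's `base.py`; the point is only that each concrete class overrides `fit`/`predict` directly and has to remember the input checks itself.

```python
class ToyBaseRegressor:
    """Public contract: fit(X, y) returns self, predict(X) returns a list."""

    def fit(self, X, y):
        raise NotImplementedError("concrete regressors override fit")

    def predict(self, X):
        raise NotImplementedError("concrete regressors override predict")

    @staticmethod
    def _check_X(X):
        # shared helper, but every extender has to call it by hand
        if len(X) == 0 or any(len(row) != len(X[0]) for row in X):
            raise ValueError("X must be a non-empty list of equal-length rows")
        return X


class ToyMeanRegressor(ToyBaseRegressor):
    """Predicts the training mean of y, ignoring X."""

    def fit(self, X, y):
        X = self._check_X(X)  # boilerplate repeated in each estimator
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        X = self._check_X(X)  # and repeated again here
        return [self.mean_ for _ in X]


reg = ToyMeanRegressor().fit([[0.0], [1.0]], [2.0, 4.0])
print(reg.predict([[5.0], [6.0], [7.0]]))
```

Because every regressor exposes the same `fit`/`predict` pair, user code can swap one for another without changes, which is the strategy pattern on the user side.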

In [None]:
from pydata_skbase.mini_sklearn_v1.regression import LinReg
from pydata_skbase.mini_sklearn_v1.transform import Scaler

t = Scaler(strategy="std")
clf = LinReg()

In [None]:
clf.get_params()

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.get_fitted_params()

In [None]:
clf.predict(X_test)

In [None]:
t.get_params()

In [None]:
t.fit(X_train)

In [None]:
t.get_fitted_params()

In [None]:
t.transform(X_test)

### Iteration 2 - extender interface, template pattern, tags and configs

2nd iteration - improved

* input check boilerplate becomes repetitive!
    * use of template pattern for extenders, `fit`/`_fit` with boilerplate in `fit`
    * extension templates become much simpler!
* config: controlling the output type in `predict`, `transform` - `numpy` or `pandas`?
    * output conversion added in boilerplate layer (between `predict` and `_predict` etc)
    * config dict needs to be added to base classes
* tag: what kind of estimator, e.g., regressor, transformer?
    * tag dict needs to be added in base classes

*[pydata_skbase/mini_sklearn_v2](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v2)*
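
The template pattern from the first bullet can be sketched as follows. This is plain Python with hypothetical names (`ToyBaseTransformer`, `ToyMinMax`), not the repository's v2 code: the public `fit`/`transform` own the checks, and extenders only fill in `_fit`/`_transform`.

```python
class ToyBaseTransformer:
    """Public methods hold the boilerplate, private methods hold the logic."""

    def fit(self, X):
        X = self._check_X(X)
        self._fit(X)
        self._is_fitted = True
        return self

    def transform(self, X):
        if not getattr(self, "_is_fitted", False):
            raise RuntimeError("call fit before transform")
        return self._transform(self._check_X(X))

    @staticmethod
    def _check_X(X):
        if len(X) == 0:
            raise ValueError("X must be non-empty")
        return [float(v) for v in X]  # coerce once, in one place

    # extension interface: these are the only methods extenders write
    def _fit(self, X):
        raise NotImplementedError

    def _transform(self, X):
        raise NotImplementedError


class ToyMinMax(ToyBaseTransformer):
    """Concrete transformer: no checks, no state handling, only the logic."""

    def _fit(self, X):
        self.min_, self.max_ = min(X), max(X)

    def _transform(self, X):
        return [(v - self.min_) / (self.max_ - self.min_) for v in X]


scaled = ToyMinMax().fit([2, 4, 6]).transform([2, 3, 6])
print(scaled)
```

Compared with overriding `fit` directly, the concrete class shrinks to two short methods, and a change to the checks is made once in the base class.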

why separate public user (`fit`, `predict`) from private extender interface (`_fit`, `_predict`)?

* both should follow unified pattern - user = strategy pattern; extender = template pattern
* localizing the boilerplate in one place is DRY
    * DRY boilerplate is easier to maintain
    * repetitive boilerplate is a risk for errors, bugs, etc
    * boilerplate in one place also allows richer boilerplate!
* extension templates become much simpler and less error prone
    * easier to contribute new estimators!
    * easier for power users to maintain their own estimators!
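
One example of such richer boilerplate is config-driven output conversion. The sketch below is a hypothetical, dependency-free illustration: `ToyConfigurable`, the `"output"` config key, and its values `"list"` and `"dict"` are assumptions standing in for the `numpy`/`pandas` switch described above, and the `set_config`/`get_tag` methods only imitate the general shape of such an interface.

```python
class ToyConfigurable:
    """Hypothetical base class with instance configs and class tags."""

    _tags = {"object_type": "regressor"}
    _config = {"output": "list"}

    def __init__(self):
        # copy class defaults so that instances can be configured independently
        self._config_dynamic = dict(type(self)._config)

    def set_config(self, **config):
        unknown = set(config) - set(self._config_dynamic)
        if unknown:
            raise KeyError(f"unknown config keys: {sorted(unknown)}")
        self._config_dynamic.update(config)
        return self

    def get_tag(self, name):
        return type(self)._tags[name]

    def predict(self, index):
        raw = self._predict(index)  # extender returns a plain list
        # boilerplate layer: convert output according to the config
        if self._config_dynamic["output"] == "dict":
            return dict(zip(index, raw))
        return raw

    def _predict(self, index):
        raise NotImplementedError


class ToyConstant(ToyConfigurable):
    def _predict(self, index):
        return [1.5 for _ in index]


est = ToyConstant()
print(est.get_tag("object_type"), est.predict(["a", "b"]))
est.set_config(output="dict")
print(est.predict(["a", "b"]))
```

The extender's `_predict` never sees the config: conversion lives entirely between `predict` and `_predict`, so adding a new output type does not touch any concrete estimator.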

example diagram, `sktime` public interface and extender interface:

<img src="img/interfaces.jpg"/>

In [None]:
from pydata_skbase.mini_sklearn_v2.regression import LinReg
from pydata_skbase.mini_sklearn_v2.transform import Scaler

t = Scaler(strategy="std")
clf = LinReg()

In [None]:
clf.fit(X_train, y_train)
clf.predict(X_test)

In [None]:
t.fit(X_train)
t.transform(X_test)

### Iteration 3 - pipelines, meta-estimators

3rd iteration - adding a meta-estimator

* inheriting from `BaseMetaEstimator` in `skbase`
* requires setting the `named_object_parameters` and `fitted_named_object_parameters` tags
* factor out `_CommonTags` to allow the meta-estimator and estimator base class to inherit

*[pydata_skbase/mini_sklearn_v3](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v3)*
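
How a meta-estimator can expose the parameters of its components is sketched below in plain Python. This is not `BaseMetaEstimator` and does not reproduce its internals; `ToyPipeline`, `ToyStep`, the automatic step naming, and the `name__param` key format (borrowed from the `sklearn` convention) are assumptions for illustration.

```python
class ToyStep:
    """Hypothetical component: adds a constant in transform."""

    def __init__(self, offset=0.0):
        self.offset = offset

    def get_params(self):
        return {"offset": self.offset}

    def fit(self, X):
        return self

    def transform(self, X):
        return [v + self.offset for v in X]


class ToyPipeline:
    """Chains steps and reports their parameters under prefixed keys."""

    def __init__(self, steps):
        self.steps = steps

    def _named_steps(self):
        # name each step after its class, numbering repeated classes
        counts, named = {}, []
        for step in self.steps:
            base = type(step).__name__
            counts[base] = counts.get(base, 0) + 1
            named.append((f"{base}_{counts[base]}", step))
        return named

    def get_params(self, deep=True):
        params = {"steps": self.steps}
        if deep:
            for name, step in self._named_steps():
                params[name] = step
                for key, value in step.get_params().items():
                    params[f"{name}__{key}"] = value
        return params

    def fit(self, X):
        # fit each step on the output of the previous one
        for step in self.steps:
            X = step.fit(X).transform(X)
        return self

    def transform(self, X):
        for step in self.steps:
            X = step.transform(X)
        return X


pipe_toy = ToyPipeline([ToyStep(offset=1.0), ToyStep(offset=10.0)])
print(sorted(k for k in pipe_toy.get_params() if "__" in k))
print(pipe_toy.fit([0.0]).transform([0.0, 1.0]))
```

Flattening component parameters into one dictionary is what lets generic tooling, such as parameter search or cloning, treat a pipeline like any other estimator.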

In [None]:
from pydata_skbase.mini_sklearn_v3.pipeline import RegressorPipeline
from pydata_skbase.mini_sklearn_v3.regression import LinReg
from pydata_skbase.mini_sklearn_v3.transform import Scaler

t1 = Scaler(strategy="std")
t2 = Scaler(strategy="minmax")
clf = LinReg()

pipe = RegressorPipeline([t1, t2, clf])

In [None]:
pipe.get_params()

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.get_fitted_params()

In [None]:
pipe.predict(X_test)

---
## Summary

* `skbase` is a reusable workbench for creating `sklearn`-like, `sklearn`-compatible libraries
* recipe to create `sklearn`-like libraries - public interface, extension template, tags/configs, testing
* `skbase` provides out of the box: parameter handling, tag/config handling, composition, retrieval, testing

---

### Credits: notebook 3 - advanced extension

notebook creation: fkiraly

skbase: fkiraly, rnkuhns, mloning, sktime & sklearn developers (by code reuse)\
sktime style design: fkiraly, mloning, sktime developers

---

## Join sktime!

* openly governed - users, developers, early career data scientists
* world-wide contributor and user footprint

**EVERYONE CAN JOIN! EVERYONE CAN BECOME A COMMUNITY LEADER!**

* join our discord (developers and community)!
    * regular **community collaboration sessions** and stand-ups on Fridays
    * next **developer sprint**: 

Opportunities:

* sktime **mentoring programme**: github.com/sktime/mentoring

**sktime developer sprint 2023**