# Package Building with `skbase`

### Overview of this notebook

building an `sklearn`-like package, covering:

* use of base classes
* strategy pattern, template pattern
* tags and configs
* retrieval via `all_objects`
* heterogeneous meta-estimators
* testing

In [1]:
import sys
sys.path.append("..")

In [2]:
import numpy as np
import pandas as pd

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Using ``skbase`` to implement an ``sklearn``-like package

Recommended recipe:

1. implement a few examples of algorithms
2. use that to come up with a sensible template design
3. write down the user contract as a base class sketch, with full docstrings (incl. assumptions and guarantees)
4. write down the extension contract as an extension template, with full docstrings (incl. assumptions and guarantees)
5. implement base classes using ``BaseObject``, ``BaseEstimator``, and meta-estimator base classes if applicable
    * public interface, boilerplate methods, base functionality
    * template methods, extension interface
    * tags, configs
6. implement concrete classes, start with reinterfacing examples
    * subject to tests
7. implement test framework for interface, e.g., `check_estimator`

Advanced functionality:

* estimator retrieval via `all_objects`
* systematic testing via test class framework

### Iteration 0 - examples, initial design

at the start, we want to:

* implement examples of the algorithms to inform the design
* use the examples to abstract a public unified interface, extender pattern

IMPORTANT: the "3 example rule"

Before abstracting a design, ensure you have at least 3 substantially distinct examples,
e.g., 3 regressors, 3 transformers, 3 pipelines, etc.

(In this repository, there is only 1 example per estimator type, for ease of reading,
but the reader is expected to be familiar with others from `sklearn`!)

Examples in the repo:

* linear regression "naive implementation"
* preprocessing/scaling "naive implementation"

*[pydata_skbase/mini_sklearn_v0](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v0)*
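
To make "naive implementation" concrete: at this stage the code is typically just plain functions, with no classes and no shared interface. The following is a dependency-free sketch of that idea, not the code from the repository; the function names `naive_fit_scaler` and `naive_scale` are hypothetical, and a single list of numbers stands in for a data frame column.

```python
from statistics import mean, pstdev


def naive_fit_scaler(column):
    """Compute the statistics needed to standardize one column of numbers."""
    return {"mean": mean(column), "std": pstdev(column)}


def naive_scale(column, stats):
    """Standardize a column using previously computed statistics."""
    return [(x - stats["mean"]) / stats["std"] for x in column]


train = [1.0, 2.0, 3.0, 4.0]
stats = naive_fit_scaler(train)
scaled = naive_scale(train, stats)
```

Writing a few such examples side by side is what reveals the common shape (learn something from data, then apply it) that the later iterations abstract into `fit` and `transform`.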

### Loading data used throughout

In [4]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
y = pd.DataFrame(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Iteration 1 - base class interface, strategy pattern

our 1st iteration - class design based on iteration 0

* in `base.py`, we define the public contract
    * `fit`/`predict` for regressors, `fit`/`transform` for transformers
* `base.py` also has the extender contract
    * overriding `fit`/`predict` directly
    * ensuring input and output checks are in place

*[pydata_skbase/mini_sklearn_v1](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v1)*
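
The iteration 1 design can be sketched without any framework as follows. This is a simplified illustration under stated assumptions, not the contents of `base.py`: the class names `ToyBaseTransformer` and `MinMaxScaler1D` are hypothetical, and plain lists stand in for data frames. Note how the concrete class overrides `fit`/`transform` directly and has to repeat its own input checks.

```python
class ToyBaseTransformer:
    """Public contract: every transformer offers fit and transform."""

    def fit(self, X):
        raise NotImplementedError

    def transform(self, X):
        raise NotImplementedError


class MinMaxScaler1D(ToyBaseTransformer):
    """Hypothetical concrete transformer overriding fit/transform directly."""

    def fit(self, X):
        # input check, written out by hand in every method of every class
        if not isinstance(X, list) or len(X) == 0:
            raise TypeError("X must be a non-empty list of numbers")
        self.min_ = min(X)
        self.span_ = max(X) - min(X)
        return self

    def transform(self, X):
        # the same check again: this repetition motivates the next iteration
        if not isinstance(X, list) or len(X) == 0:
            raise TypeError("X must be a non-empty list of numbers")
        return [(x - self.min_) / self.span_ for x in X]


scaler = MinMaxScaler1D().fit([2.0, 4.0, 6.0])
result = scaler.transform([2.0, 5.0])
```

Because user code only relies on `fit` and `transform`, any other transformer following the same contract can be swapped in without changes, which is the strategy pattern from the user's point of view.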

In [5]:
from pydata_skbase.mini_sklearn_v1.regression import LinReg
from pydata_skbase.mini_sklearn_v1.transform import Scaler

t = Scaler(strategy="std")
clf = LinReg()

In [6]:
clf.get_params()

{'shrink': 0.0}

In [7]:
clf.fit(X_train, y_train)

In [8]:
clf.get_fitted_params()

{'beta':          target
 age  -46.556324
 sex -274.657701
 bmi  620.813306
 bp   294.572797
 s1  -254.264131
 s2   -46.015739
 s3  -105.810988
 s4    80.610654
 s5   509.663735
 s6   340.441126}

In [9]:
clf.predict(X_test)

|     | target |
|-----|--------|
| 8 | 12.954765 |
| 385 | -14.866782 |
| 46 | -25.660355 |
| 166 | -110.856014 |
| 88 | -48.006016 |
| ... | ... |
| 221 | -22.705096 |
| 206 | 2.389099 |
| 284 | -13.269047 |
| 32 | 114.198178 |


In [10]:
t.get_params()

{'strategy': 'std'}

In [11]:
t.fit(X)

In [12]:
t.get_fitted_params()

{'X_max': age    0.110727
 sex    0.050680
 bmi    0.170555
 bp     0.132044
 s1     0.153914
 s2     0.198788
 s3     0.181179
 s4     0.185234
 s5     0.133597
 s6     0.135612
 dtype: float64,
 'X_mean': age   -2.511817e-19
 sex    1.230790e-17
 bmi   -2.245564e-16
 bp    -4.797570e-17
 s1    -1.381499e-17
 s2     3.918434e-17
 s3    -5.777179e-18
 s4    -9.042540e-18
 s5     9.293722e-17
 s6     1.130318e-17
 dtype: float64,
 'X_min': age   -0.107226
 sex   -0.044642
 bmi   -0.090275
 bp    -0.112399
 s1    -0.126781
 s2    -0.115613
 s3    -0.102307
 s4    -0.076395
 s5    -0.126097
 s6    -0.137767
 dtype: float64,
 'X_span': age    0.217952
 sex    0.095322
 bmi    0.260831
 bp     0.244442
 s1     0.280694
 s2     0.314401
 s3     0.283486
 s4     0.261629
 s5     0.259694
 s6     0.273379
 dtype: float64,
 'X_std': age    0.047619
 sex    0.047619
 bmi    0.047619
 bp     0.047619
 s1     0.047619
 s2     0.047619
 s3     0.047619
 s4     0.047619
 s5     0.047619
 s6     0.04

In [13]:
t.transform(X_test)

|     | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|-----|-----|-----|-----|----|----|----|----|----|----|----|
| 8 | 0.875877 | 1.064282 | 1.295620 | -0.842078 | -0.293003 | 0.130235 | -0.602160 | -0.054438 | -0.314154 | 0.238321 |
| 385 | 0.494461 | 1.064282 | -0.401934 | 1.037719 | -1.333224 | -1.283630 | 0.093669 | -0.829361 | -0.545015 | -0.283584 |
| 46 | -1.183772 | -0.937474 | -0.243495 | -0.697478 | -0.986483 | -1.000857 | 0.093669 | -0.829361 | -0.167520 | -1.849301 |
| 166 | -1.183772 | 1.064282 | -1.262028 | -0.769778 | -1.853334 | -1.487490 | -0.292903 | -0.829361 | -1.640939 | -2.197238 |
| 88 | -1.107489 | 1.064282 | -0.854615 | -1.420477 | -0.668638 | -0.777269 | 0.789499 | -0.829361 | -0.724957 | 1.456101 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | -0.954922 | -0.937474 | -0.809347 | -0.552878 | -0.321898 | 0.018441 | -0.679475 | -0.054438 | 0.024099 | -0.805490 |
| 206 | 0.036761 | 1.064282 | 0.548697 | -0.191379 | 0.516057 | 0.807576 | -0.447531 | 0.720486 | 0.198107 | 0.064353 |
| 284 | 0.875877 | 1.064282 | -0.469836 | 0.603920 | -1.391014 | -0.948248 | -1.297990 | -0.054438 | 0.060087 | -1.153427 |
| 32 | 0.723311 | 1.064282 | 2.631029 | 0.603920 | -1.130959 | -0.270908 | -2.148448 | 2.270333 | 0.005722 | 0.586258 |


### Iteration 2 - extender interface, template pattern, tags and configs

2nd iteration - improved design, addressing pain points of iteration 1

* input check boilerplate becomes repetitive!
    * use of template pattern for extenders, `fit`/`_fit` with boilerplate in `fit`
    * extension templates become much simpler!
* config: controlling the output type in `predict`, `transform` - `numpy` or `pandas`?
    * output conversion added in boilerplate layer (between `predict` and `_predict` etc)
    * config dict needs to be added to base classes
* tag: what kind of estimator, e.g., regressor, transformer?
    * tag dict needs to be added in base classes

*[pydata_skbase/mini_sklearn_v2](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v2)*
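
The tag and config ideas above can be illustrated with a small framework-free sketch. It is a toy re-implementation written for this explanation, not how `skbase` or the `mini_sklearn_v2` package implements them: all class and method names (`ToyBaseEstimator`, `ToyBaseRegressor`, `ConstantRegressor`, `get_class_tag`, `set_config`, `get_config`) are hypothetical here, and the config switches between `list` and `tuple` output instead of `numpy` and `pandas` to avoid dependencies.

```python
class ToyBaseEstimator:
    """Toy base class with class-level tags and instance-level config."""

    # tags: static metadata describing the estimator class
    _tags = {"estimator_type": "estimator"}
    # config defaults: user-settable behaviour that is not a model parameter
    _config = {"output_type": "list"}

    def __init__(self):
        # copy the defaults so that set_config only affects this instance
        self._config_dynamic = dict(self._config)

    @classmethod
    def get_class_tag(cls, tag_name, default=None):
        # walk the inheritance chain from parent to child,
        # so that tags set in child classes override inherited ones
        collected = {}
        for klass in reversed(cls.__mro__):
            collected.update(vars(klass).get("_tags", {}))
        return collected.get(tag_name, default)

    def set_config(self, **config):
        self._config_dynamic.update(config)
        return self

    def get_config(self):
        return dict(self._config_dynamic)


class ToyBaseRegressor(ToyBaseEstimator):
    _tags = {"estimator_type": "regressor"}

    def predict(self, X):
        y_pred = self._predict(X)
        # output conversion lives in the boilerplate layer, driven by config
        if self.get_config()["output_type"] == "tuple":
            return tuple(y_pred)
        return list(y_pred)


class ConstantRegressor(ToyBaseRegressor):
    """Hypothetical regressor that always predicts the same value."""

    def __init__(self, value=0.0):
        self.value = value
        super().__init__()

    def _predict(self, X):
        return [self.value for _ in X]


reg = ConstantRegressor(value=1.5)
as_list = reg.predict([10, 20])
as_tuple = reg.set_config(output_type="tuple").predict([10, 20])
```

The design choice illustrated is the split of responsibilities: tags describe what an estimator is and belong to the class, while configs describe how an instance should behave and can be changed by the user without touching model parameters.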

why separate the public user interface (`fit`, `predict`) from the private extender interface (`_fit`, `_predict`)?

* both should follow a unified pattern - user = strategy pattern; extender = template pattern
* localizing the boilerplate in one place is DRY
    * DRY boilerplate is easier to maintain
    * repetitive boilerplate is a risk for errors, bugs, etc
    * boilerplate in one place also allows richer boilerplate!
* extension templates become much simpler and less error prone
    * easier to contribute new estimators!
    * easier for power users to maintain their own estimators!
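
A minimal sketch of the `fit`/`_fit` template pattern, assuming plain lists as data: the public methods hold all checks and state handling, and an extender only fills in the private methods. The names `TemplateRegressor` and `SlopeThroughOrigin` are hypothetical and the checks are deliberately simplistic; this is not the code of `mini_sklearn_v2`.

```python
class TemplateRegressor:
    """Toy base class: public fit/predict contain the boilerplate."""

    def fit(self, X, y):
        # boilerplate: input checks live here, once, for all regressors
        self._check_X(X)
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        self._fit(X, y)
        self._is_fitted = True
        return self

    def predict(self, X):
        if not getattr(self, "_is_fitted", False):
            raise RuntimeError("call fit before predict")
        self._check_X(X)
        return self._predict(X)

    @staticmethod
    def _check_X(X):
        if not isinstance(X, list) or len(X) == 0:
            raise TypeError("X must be a non-empty list of numbers")

    # extension interface: concrete classes override only these two methods
    def _fit(self, X, y):
        raise NotImplementedError

    def _predict(self, X):
        raise NotImplementedError


class SlopeThroughOrigin(TemplateRegressor):
    """Hypothetical extender class: least squares line through the origin."""

    def _fit(self, X, y):
        self.slope_ = sum(a * b for a, b in zip(X, y)) / sum(a * a for a in X)

    def _predict(self, X):
        return [self.slope_ * a for a in X]


model = SlopeThroughOrigin().fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
predictions = model.predict([4.0])
```

The concrete class contains only the algorithm; any improvement to the checks in `TemplateRegressor` is picked up by every regressor without editing them.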

example diagram, `sktime` public interface and extender interface:

<img src="img/interfaces.jpg"/>

In [14]:
from pydata_skbase.mini_sklearn_v2.regression import LinReg
from pydata_skbase.mini_sklearn_v2.transform import Scaler

t = Scaler(strategy="std")
clf = LinReg()

In [15]:
clf.fit(X_train, y_train)
clf.predict(X_test)

|     | target |
|-----|--------|
| 8 | 12.954765 |
| 385 | -14.866782 |
| 46 | -25.660355 |
| 166 | -110.856014 |
| 88 | -48.006016 |
| ... | ... |
| 221 | -22.705096 |
| 206 | 2.389099 |
| 284 | -13.269047 |
| 32 | 114.198178 |


In [16]:
t.fit(X_train)
t.transform(X_test)

|     | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|-----|-----|-----|-----|----|----|----|----|----|----|----|
| 8 | 0.884195 | 1.070457 | 1.234154 | -0.860408 | -0.290716 | 0.135069 | -0.590210 | -0.058876 | -0.313406 | 0.177748 |
| 385 | 0.492298 | 1.070457 | -0.422216 | 1.013884 | -1.312897 | -1.249277 | 0.096199 | -0.820119 | -0.536440 | -0.343681 |
| 46 | -1.232048 | -0.931358 | -0.267621 | -0.716232 | -0.972170 | -0.972408 | 0.096199 | -0.820119 | -0.171744 | -1.907970 |
| 166 | -1.232048 | 1.070457 | -1.261443 | -0.788320 | -1.823988 | -1.448880 | -0.285140 | -0.820119 | -1.595206 | -2.255590 |
| 88 | -1.153668 | 1.070457 | -0.863914 | -1.437114 | -0.659837 | -0.753488 | 0.782608 | -0.820119 | -0.710281 | 1.394417 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221 | -0.996909 | -0.931358 | -0.819744 | -0.572056 | -0.319110 | 0.025609 | -0.666478 | -0.058876 | 0.013379 | -0.865111 |
| 206 | 0.022022 | 1.070457 | 0.505351 | -0.211615 | 0.504314 | 0.798268 | -0.437675 | 0.702368 | 0.181487 | 0.003938 |
| 284 | 0.884195 | 1.070457 | -0.488471 | 0.581355 | -1.369685 | -0.920897 | -1.276619 | -0.058876 | 0.048147 | -1.212731 |
| 32 | 0.727436 | 1.070457 | 2.537165 | 0.581355 | -1.114140 | -0.257699 | -2.115564 | 2.224856 | -0.004375 | 0.525368 |


### Iteration 3 - pipelines, meta-estimators

3rd iteration - adding a meta-estimator

* inheriting from `BaseMetaEstimator` in `skbase`
* requires setting the `named_object_parameters` and `fitted_named_object_parameters` tags
* factor out `_CommonTags` so that the meta-estimator and the estimator base class can both inherit the shared tags

*[pydata_skbase/mini_sklearn_v3](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v3)*
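
What a heterogeneous meta-estimator has to do can be shown with a toy pipeline that does not use `BaseMetaEstimator` at all. This is a hand-rolled sketch under stated assumptions: `Shift`, `Slope` and `ToyPipeline` are hypothetical classes, plain lists stand in for data frames, and the unique step naming (`Shift_1`, `Shift_2`) and the `name__param` keys are re-implemented here for illustration only.

```python
from collections import Counter


class Shift:
    """Hypothetical transformer: adds a constant offset to every value."""

    def __init__(self, offset=0.0):
        self.offset = offset

    def get_params(self):
        return {"offset": self.offset}

    def fit(self, X):
        return self

    def transform(self, X):
        return [x + self.offset for x in X]


class Slope:
    """Hypothetical regressor: least squares line through the origin."""

    def get_params(self):
        return {}

    def fit(self, X, y):
        self.slope_ = sum(a * b for a, b in zip(X, y)) / sum(a * a for a in X)
        return self

    def predict(self, X):
        return [self.slope_ * a for a in X]


class ToyPipeline:
    """Chains transformers, with a regressor as the last step."""

    def __init__(self, steps):
        self.steps = steps

    def _named_steps(self):
        # name steps by class name, numbering names that occur more than once
        counts = Counter(type(step).__name__ for step in self.steps)
        seen = Counter()
        named = []
        for step in self.steps:
            name = type(step).__name__
            if counts[name] > 1:
                seen[name] += 1
                name = f"{name}_{seen[name]}"
            named.append((name, step))
        return named

    def get_params(self):
        # expose components and their parameters under "name__param" keys
        params = {"steps": self.steps}
        for name, step in self._named_steps():
            params[name] = step
            for key, value in step.get_params().items():
                params[f"{name}__{key}"] = value
        return params

    def fit(self, X, y):
        *transformers, regressor = self.steps
        for transformer in transformers:
            X = transformer.fit(X).transform(X)
        regressor.fit(X, y)
        return self

    def predict(self, X):
        *transformers, regressor = self.steps
        for transformer in transformers:
            X = transformer.transform(X)
        return regressor.predict(X)


toy_pipe = ToyPipeline([Shift(1.0), Shift(2.0), Slope()])
toy_params = toy_pipe.get_params()
toy_prediction = toy_pipe.fit([1.0, 2.0], [8.0, 10.0]).predict([0.0])
```

The bookkeeping in `_named_steps` and `get_params` is the repetitive part that a meta-estimator base class is meant to take over, so that a concrete pipeline only has to define how the steps are chained.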

In [17]:
from pydata_skbase.mini_sklearn_v3.pipeline import RegressorPipeline
from pydata_skbase.mini_sklearn_v3.regression import LinReg
from pydata_skbase.mini_sklearn_v3.transform import Scaler

t1 = Scaler(strategy="std")
t2 = Scaler(strategy="minmax")
clf = LinReg()

pipe = RegressorPipeline([t1, t2, clf])

In [18]:
pipe.get_params()

{'steps': [Scaler(), Scaler(strategy='minmax'), LinReg()],
 'Scaler': Scaler(strategy='minmax'),
 'LinReg': LinReg(),
 'Scaler__strategy': 'minmax',
 'LinReg__shrink': 0.0}

In [19]:
pipe.fit(X_train, y_train)

In [20]:
pipe.get_fitted_params()

{'steps': [('Scaler_1', Scaler()),
  ('Scaler_2', Scaler(strategy='minmax')),
  ('LinReg', LinReg())],
 'Scaler_1': Scaler(),
 'Scaler_2': Scaler(strategy='minmax'),
 'LinReg': LinReg(),
 'Scaler_1__X_max': age    0.096197
 sex    0.050680
 bmi    0.170555
 bp     0.132044
 s1     0.153914
 s2     0.198788
 s3     0.177497
 s4     0.185234
 s5     0.133597
 s6     0.135612
 dtype: float64,
 'Scaler_1__X_mean': age    0.000730
 sex   -0.000293
 bmi    0.001466
 bp     0.000993
 s1     0.000135
 s2    -0.000367
 s3    -0.000183
 s4     0.000262
 s5     0.000488
 s6     0.002877
 dtype: float64,
 'Scaler_1__X_min': age   -0.107226
 sex   -0.044642
 bmi   -0.089197
 bp    -0.112399
 s1    -0.126781
 s2    -0.115613
 s3    -0.098625
 s4    -0.076395
 s5    -0.126097
 s6    -0.137767
 dtype: float64,
 'Scaler_1__X_span': age    0.203422
 sex    0.095322
 bmi    0.259753
 bp     0.244442
 s1     0.280694
 s2     0.314401
 s3     0.276123
 s4     0.261629
 s5     0.259694
 s6     0.273379
 dty

In [21]:
pipe.predict(X_test)

|     | target |
|-----|--------|
| 8 | 5.115961 |
| 385 | -13.933978 |
| 46 | -5.698978 |
| 166 | -98.573159 |
| 88 | -63.658346 |
| ... | ... |
| 221 | -8.260593 |
| 206 | 8.859826 |
| 284 | 4.545835 |
| 32 | 104.909445 |


### Iteration 4 - lookup and search

4th iteration - adding estimator lookup and search utility

* direct interface to `all_objects` in `skbase`
* todo 1: point the utility to the right directory and module name
* todo 2: excluding directories, e.g., extension templates, tests
* todo 3: string/class dictionary, e.g., for searching for `"regressor"` type

*[pydata_skbase/mini_sklearn_v4](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v4)*
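
The idea behind lookup by type and by tag can be sketched in a few lines. Note the swapped-in technique: this toy version finds classes by walking `__subclasses__()` of classes that are already imported, whereas a package-level utility such as `all_objects` is pointed at a package directory and module name (see todo 1 above). All classes below, and the signature of this toy `all_estimators`, are hypothetical.

```python
class BaseEstimator:
    _tags = {"estimator_type": "estimator"}


class BaseRegressor(BaseEstimator):
    _tags = {"estimator_type": "regressor"}


class BaseTransformer(BaseEstimator):
    _tags = {"estimator_type": "transformer"}


class LinReg(BaseRegressor):
    pass


class Scaler(BaseTransformer):
    pass


# base classes are not estimators a user would want to retrieve
_BASE_CLASSES = (BaseEstimator, BaseRegressor, BaseTransformer)


def _all_subclasses(cls):
    """Recursively collect all direct and indirect subclasses of cls."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(_all_subclasses(sub))
    return found


def all_estimators(base_class=BaseEstimator, filter_tags=None):
    """Return sorted (name, class) pairs of concrete estimators.

    base_class restricts the search by inheritance,
    filter_tags restricts it by tag values.
    """
    filter_tags = filter_tags or {}
    result = []
    for cls in _all_subclasses(base_class):
        if cls in _BASE_CLASSES:
            continue
        if all(cls._tags.get(k) == v for k, v in filter_tags.items()):
            result.append((cls.__name__, cls))
    return sorted(result, key=lambda pair: pair[0])


everything = all_estimators()
regressors_by_tag = all_estimators(filter_tags={"estimator_type": "regressor"})
regressors_by_class = all_estimators(BaseRegressor)
```

Filtering by tag and filtering by base class return the same result here, which mirrors the two ways of querying for regressors shown in the cells below.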

In [22]:
from pydata_skbase.mini_sklearn_v4.lookup import all_estimators

all_estimators(as_dataframe=True)

|     | name | object |
|-----|------|--------|
| 0 | LinReg | `<class 'mini_sklearn_v4.regression.LinReg'>` |
| 1 | RegressorPipeline | `<class 'mini_sklearn_v4.pipeline.RegressorPipe...` |
| 2 | Scaler | `<class 'mini_sklearn_v4.transform.Scaler'>` |


In [23]:
all_estimators(as_dataframe=True, filter_tags={"estimator_type": "regressor"})

|     | name | object |
|-----|------|--------|
| 0 | LinReg | `<class 'mini_sklearn_v4.regression.LinReg'>` |
| 1 | RegressorPipeline | `<class 'mini_sklearn_v4.pipeline.RegressorPipe...` |


In [24]:
from mini_sklearn_v4.base import BaseRegressor

all_estimators(BaseRegressor, as_dataframe=True)

|     | name | object |
|-----|------|--------|
| 0 | LinReg | `<class 'mini_sklearn_v4.regression.LinReg'>` |
| 1 | RegressorPipeline | `<class 'mini_sklearn_v4.pipeline.RegressorPipe...` |


### Iteration 5 - test framework

5th iteration - adding test framework and extension checker

adherents of TDD may like to add this much earlier :-)

* import `TestAllObjects` from `skbase`; once imported into a test module, it is discovered by `pytest`
* `BaseFixtureGenerator` generates estimator fixtures and can be used to extend the tests
* `QuickTester.run_tests` can be used to run tests on a particular estimator
* a `check_estimator` utility can be built by wrapping `run_tests`

*[pydata_skbase/mini_sklearn_v5](https://github.com/sktime/sktime-tutorial-pydata-seattle-2023/blob/main/pydata_skbase/mini_sklearn_v5)*
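
The mechanics of a `run_tests`-style runner and a `check_estimator` wrapper can be sketched as follows. This is a toy stand-in written for illustration, not the `skbase` test classes: `ToyTestSuite`, its two tests, and the example estimators `ToyLinReg` and `BrokenEstimator` are all hypothetical, and only the shape of the result dictionary (test name with the estimator name in brackets, mapped to an outcome string) follows the outputs shown below.

```python
class ToyTestSuite:
    """Hypothetical interface tests; each test_* method gets the class under test."""

    def test_constructor(self, estimator_class):
        # contract: all constructor arguments must have defaults
        estimator_class()

    def test_get_params(self, estimator_class):
        params = estimator_class().get_params()
        assert isinstance(params, dict)

    def run_tests(self, estimator_class):
        """Run every test_* method and collect the outcomes in a dict."""
        results = {}
        for name in sorted(dir(self)):
            if not name.startswith("test_"):
                continue
            key = f"{name}[{estimator_class.__name__}]"
            try:
                getattr(self, name)(estimator_class)
                results[key] = "PASSED"
            except Exception as exc:
                results[key] = f"FAILED: {type(exc).__name__}"
        return results


def check_estimator(estimator_class):
    """Thin convenience wrapper around the test suite's run_tests."""
    return ToyTestSuite().run_tests(estimator_class)


class ToyLinReg:
    def __init__(self, shrink=0.0):
        self.shrink = shrink

    def get_params(self):
        return {"shrink": self.shrink}


class BrokenEstimator:
    def __init__(self, shrink):  # no default value: violates the contract
        self.shrink = shrink

    def get_params(self):
        return {"shrink": self.shrink}


good_report = check_estimator(ToyLinReg)
bad_report = check_estimator(BrokenEstimator)
```

Keeping the tests in a class means a downstream package can subclass it to add its own checks, and `check_estimator` automatically runs the extended set.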

In [25]:
from pydata_skbase.mini_sklearn_v5.regression import LinReg
from pydata_skbase.mini_sklearn_v5.tests.test_estimators import TestAllObjects

TestAllObjects().run_tests(LinReg)

{'test_clone[LinReg]': 'PASSED',
 'test_constructor[LinReg]': 'PASSED',
 'test_create_test_instance[LinReg]': 'PASSED',
 'test_create_test_instances_and_names[LinReg]': 'PASSED',
 'test_get_params[LinReg]': 'PASSED',
 'test_inheritance[LinReg]': 'PASSED',
 'test_no_between_test_case_side_effects[LinReg-0]': 'PASSED',
 'test_no_between_test_case_side_effects[LinReg-1]': 'PASSED',
 'test_no_cross_test_side_effects_part1[LinReg]': 'PASSED',
 'test_no_cross_test_side_effects_part2[LinReg]': 'PASSED',
 'test_object_tags[LinReg]': 'PASSED',
 'test_repr[LinReg]': 'PASSED',
 'test_set_params[LinReg]': 'PASSED',
 'test_set_params_sklearn[LinReg]': 'PASSED',
 'test_valid_object_class_tags[LinReg]': 'PASSED',
 'test_valid_object_tags[LinReg]': 'PASSED'}

In [26]:
from pydata_skbase.mini_sklearn_v5.regression import LinReg
from pydata_skbase.mini_sklearn_v5.utils import check_estimator

check_estimator(LinReg)

{'test_clone[LinReg]': 'PASSED',
 'test_constructor[LinReg]': 'PASSED',
 'test_create_test_instance[LinReg]': 'PASSED',
 'test_create_test_instances_and_names[LinReg]': 'PASSED',
 'test_get_params[LinReg]': 'PASSED',
 'test_inheritance[LinReg]': 'PASSED',
 'test_no_between_test_case_side_effects[LinReg-0]': 'PASSED',
 'test_no_between_test_case_side_effects[LinReg-1]': 'PASSED',
 'test_no_cross_test_side_effects_part1[LinReg]': 'PASSED',
 'test_no_cross_test_side_effects_part2[LinReg]': 'PASSED',
 'test_object_tags[LinReg]': 'PASSED',
 'test_repr[LinReg]': 'PASSED',
 'test_set_params[LinReg]': 'PASSED',
 'test_set_params_sklearn[LinReg]': 'PASSED',
 'test_valid_object_class_tags[LinReg]': 'PASSED',
 'test_valid_object_tags[LinReg]': 'PASSED',
 'test_input_output_contract[LinReg]': 'PASSED'}

---
## Summary

* `skbase` is a reusable workbench for creating `sklearn`-like, `sklearn`-compatible libraries
* recipe to create `sklearn`-like libraries - public interface, extension template, tags/configs, testing
* `skbase` provides out of the box: parameter handling, tag/config handling, composition, retrieval, testing

---

### Credits: notebook 3 - package building with `skbase`

notebook creation: fkiraly

demo package creation: fkiraly

skbase: fkiraly, rnkuhns, mloning, sktime & sklearn developers (by code reuse)\
sktime style design: fkiraly, mloning, sktime developers

---

## Join sktime!

* openly governed - users, developers, early career data scientists
* world-wide contributor and user footprint

**EVERYONE CAN JOIN! EVERYONE CAN BECOME A COMMUNITY LEADER!**

* join our discord (developers and community)!
    * regular **community collaboration sessions** and stand-ups on Fridays
    * next **onboarding session**: June 2023
    * next **developer sprint**: July 2023

Opportunities:

* sktime **mentoring programme**: github.com/sktime/mentoring
* sktime **paid summer internships**: github.com/sktime/mentoring