## skbase - a workbench for creating scikit-learn like parametric objects and libraries

Tutorial at PyData Seattle 2023

skbase: https://github.com/sktime/skbase

API Reference: https://skbase.readthedocs.io/en/latest/api_reference.html

### Presentation Goals:

 - Establish the need for `skbase` in the broader ecosystem of ML tools
 - Go over the main features of `skbase`, and demonstrate their essential use cases
 - Prototype how one could use the `skbase` interface to **quickly** and **effortlessly** build their own ML toolbox
 
Let us begin!

This notebook:

using `skbase` to create a template for *sorting algorithms* and *permutation algorithms*!

Sorting algorithms = permuters that end up sorting the list completely

examples:

* bubble sort
* merge sort
* random shuffle

### using `BaseObject` to write a template base class for sorting/permutation

recipe: inherit from `BaseObject` to template a base class!

We'll use it to demonstrate some universal functionality

 - setting object parameters
 - setting object configurations
 - retrieving tags associated with a particular object

#### defining the `BasePermuter`

In [1]:
# import the base object
import numpy as np
from skbase.base import BaseObject

# abstract base class that serves as a template for sorters and permuters
class BasePermuter(BaseObject):
    """Abstract Base Class for sorters and permuters"""

    _tags = {"always_sorts_completely": False}

    def fit(self, array: list) -> list:
        """Will override in inherited classes.
        
        Parameters
        ----------
        array : 1D np.ndarray

        Returns
        -------
        permuted/sorted 1D np.ndarray
        """
        raise NotImplementedError

class BubbleSort(BasePermuter):
    """Bubble sort.

    Parameters
    ----------
    ascending : bool, default=True
        whether the bubble is ascending (True), or descending
    """
    _tags = {"always_sorts_completely": True}

    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        
        return array


In [2]:
# applying the bubble sort
sorter = BubbleSort(ascending = True)
array  = np.array([5, 3, 6, 2, 1])

sorted_array = sorter.fit(array)

# our array is sorted
sorted_array

array([1, 2, 3, 5, 6])
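
One detail worth noting: `fit` as written swaps elements in place, so the input `array` is itself reordered by the call. If that side effect is unwanted, the loop can work on a copy. A minimal sketch using a plain list and a hypothetical standalone function `bubble_sort_copy` (an illustration only, not part of `skbase` or of the class above):

```python
def bubble_sort_copy(values, ascending=True):
    """Return a sorted copy of `values`, leaving the input untouched."""
    result = list(values)  # work on a copy so the caller's data is not modified
    for last in range(len(result) - 1, 0, -1):
        for i in range(last):
            if ascending:
                out_of_order = result[i] > result[i + 1]
            else:
                out_of_order = result[i] < result[i + 1]
            if out_of_order:
                result[i], result[i + 1] = result[i + 1], result[i]
    return result


data = [5, 3, 6, 2, 1]
print(bubble_sort_copy(data))                   # [1, 2, 3, 5, 6]
print(bubble_sort_copy(data, ascending=False))  # [6, 5, 3, 2, 1]
print(data)                                     # [5, 3, 6, 2, 1], input unchanged
```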

In [3]:
# get_params is available out of the box!
sorter.get_params()

{'ascending': True}

In [4]:
# set_params too
sorter.set_params(ascending = False)
sorter.get_params()

{'ascending': False}
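
`get_params` and `set_params` need no extra code here because the constructor stores each argument on `self` under the same name. The following is a simplified sketch of that convention using `inspect.signature`. The names `ParamsMixin` and `ToySorter` are hypothetical, and this is not how `skbase` is implemented internally, which handles many more cases (nested objects, cloning, validation):

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin: derives parameters from the __init__ signature."""

    @classmethod
    def _param_names(cls):
        # every __init__ argument except `self` counts as a parameter
        signature = inspect.signature(cls.__init__)
        return [name for name in signature.parameters if name != "self"]

    def get_params(self):
        # relies on __init__ writing each argument to an attribute of the same name
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"unknown parameter: {name!r}")
            setattr(self, name, value)
        return self


class ToySorter(ParamsMixin):
    def __init__(self, ascending=True):
        self.ascending = ascending


toy = ToySorter()
print(toy.get_params())                              # {'ascending': True}
print(toy.set_params(ascending=False).get_params())  # {'ascending': False}
```

This is also why the constructors above do nothing except assign their arguments: any renaming or transformation inside `__init__` would break the name-based lookup.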

the `BasePermuter` is now a template for further permuters

e.g., this class which shuffles the data randomly:

In [5]:
class Shuffler(BasePermuter):
    """Class that will shuffle your data"""

    def __init__(self, random_state = 42):
        self.random_state = random_state
        super(Shuffler, self).__init__()
        
    def fit(self, array: list) -> list:
        """Return a shuffled array according to random_state"""
        random = np.random.RandomState(self.random_state)
        return random.permutation(array)

`skbase` gives you the ability to define meta properties of an object via tags.  

Tags are a simple way to organize your codebase according to shared meta-properties

above, we have already used the `_tags` attribute for this:

In [6]:
Shuffler().get_tags()

{'always_sorts_completely': False}

In [7]:
BubbleSort().get_tags()

{'always_sorts_completely': True}

why does this return something in both cases?

* we have set `always_sorts_completely=False` in the base class `BasePermuter`, the "general case"
* we have set `always_sorts_completely=True` in `BubbleSort`
* we have set no tags in `Shuffler`
* both `Shuffler` and `BubbleSort` inherit from `BasePermuter`

--> `skbase` `get_tags` has inheritance of the `_tags` attribute!
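
To make the inheritance rule concrete, here is a plain-Python sketch of one way such a lookup could work: walk the class hierarchy from the most general class down and let each subclass's `_tags` override its parents'. The function `collect_tags`, the toy classes, and the extra tag `handles_ties` are all hypothetical; this illustrates the idea and is not the `skbase` source:

```python
def collect_tags(cls):
    """Merge `_tags` dicts along the class hierarchy; subclasses override parents."""
    tags = {}
    # walk the method resolution order from the most general class to `cls`
    for klass in reversed(cls.__mro__):
        # vars(klass) only sees tags defined directly on that class
        tags.update(vars(klass).get("_tags", {}))
    return tags


class ToyBase:
    _tags = {"always_sorts_completely": False, "handles_ties": True}


class ToyBubbleSort(ToyBase):
    _tags = {"always_sorts_completely": True}  # overrides one tag, inherits the other


class ToyShuffler(ToyBase):
    pass  # sets no tags, inherits everything


print(collect_tags(ToyBubbleSort))
print(collect_tags(ToyShuffler))
```

Note that the subclass only declares the tags it changes; everything else falls through from the base class.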

Tags also can be dynamically overridden or set, if desired:

In [8]:
# tags can also be set dynamically on an instance
sorter.set_tags(my_new_tag = True)
sorter.get_tags()

{'always_sorts_completely': True, 'my_new_tag': True}

#### Configurations vs tags

* tags can be user facing to describe an estimator, e.g., for organising code base
* tags can also be developer or system facing, to determine functionality
* tags should not change over the lifetime of an object

* configs are user facing and determine *behaviour* (not *properties* or *functionality*)

example: printing a useful log for the user, controlled by config

In [9]:
# import the base object
import numpy as np
from skbase.base import BaseObject

# abstract base class that serves as a template for sorters and permuters
class BasePermuter(BaseObject):
    """Abstract Base Class for sorters and permuters"""

    _tags = {"always_sorts_completely": False}

    def fit(self, array: list) -> list:
        """Will override in inherited classes.
        
        Parameters
        ----------
        array : 1D np.ndarray

        Returns
        -------
        permuted/sorted 1D np.ndarray
        """
        raise NotImplementedError

class BubbleSort(BasePermuter):
    """Bubble sort.

    Parameters
    ----------
    ascending : bool, default=True
        whether the bubble is ascending (True), or descending
    """
    _tags = {"always_sorts_completely": True}
    _config = {"print_useful_log": False}

    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]

        if self.get_config()["print_useful_log"]:
            print(42)

        return array


In [10]:
# and now we can see and set config variables in a similar manner
sorter = BubbleSort()
# get_config returns the built-in skbase configs together with our custom one
sorter.get_config()

{'display': 'diagram', 'print_changed_only': True, 'print_useful_log': False}

`display` and `print_changed_only` are from `skbase` directly, they control how the class is pretty printed

In [11]:
sorter

In [12]:
sorter.set_config(display="text")

BubbleSort()

using the custom "useful logging" config

In [13]:
# applying the bubble sort
sorter = BubbleSort(ascending = True)
array  = np.array([5, 3, 6, 2, 1])

sorted_array = sorter.fit(array)
# doesn't print

In [14]:
sorter.set_config(print_useful_log=True)

sorted_array = sorter.fit(array)
# now it prints

42
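
In the cells above, the `_config` dict on the class supplies the default and `set_config` changes the value on one object. A minimal plain-Python sketch of that split between class-level defaults and per-instance overrides follows. The class `ToyConfigurable` and its attribute `_config_overrides` are hypothetical stand-ins, and the per-instance behaviour shown is an assumption about the general pattern, not a copy of the `skbase` internals:

```python
class ToyConfigurable:
    """Hypothetical stand-in: class-level config defaults, per-instance overrides."""

    _config = {"print_useful_log": False}

    def __init__(self):
        self._config_overrides = {}

    def get_config(self):
        config = dict(type(self)._config)      # start from the class defaults
        config.update(self._config_overrides)  # instance settings take precedence
        return config

    def set_config(self, **config):
        self._config_overrides.update(config)
        return self


a = ToyConfigurable()
b = ToyConfigurable()
a.set_config(print_useful_log=True)
print(a.get_config())  # {'print_useful_log': True}
print(b.get_config())  # {'print_useful_log': False}
```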


### Composition & Pipeline

a common ML motif is chaining together different transformers and estimators via pipelines

works out of the box with `skbase`!

Example: composing two permuters

In [15]:
class ComposePermutations(BasePermuter):
    """Class that will first apply one permutation, then a second one"""
    
    def __init__(self, first_permutation, second_permutation):
        self.first_permutation = first_permutation
        self.second_permutation = second_permutation
        super(ComposePermutations, self).__init__()
        
    def fit(self, array: list) -> list:
        """Apply the first permutation, then the second, and return the result"""
        first_permuted = self.first_permutation.fit(array)
        both_permuted = self.second_permutation.fit(first_permuted)
        return both_permuted

shuffler = Shuffler() 
sorter = BubbleSort()
pipe = ComposePermutations(shuffler, sorter)

# compose them together - first shuffle, then sort
pipe.fit(array)

array([1, 2, 3, 5, 6])
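
The two-step composition generalizes naturally to any number of steps. The sketch below shows the idea with plain callables and `functools.reduce`; the helper `compose` and the step functions are hypothetical and independent of `skbase`:

```python
from functools import reduce


def compose(*steps):
    """Chain callables left to right: compose(f, g)(x) == g(f(x))."""

    def run(data):
        # feed the output of each step into the next one
        return reduce(lambda current, step: step(current), steps, data)

    return run


def reverse(values):
    return list(reversed(values))


def drop_first(values):
    return values[1:]


# sort ascending, then reverse, then drop the first (largest) element
pipeline = compose(sorted, reverse, drop_first)
print(pipeline([5, 3, 6, 2, 1]))  # [5, 3, 2, 1]
```

A class-based variant would store the list of steps as a single constructor argument, so that it stays visible through `get_params`.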

### `skbase` estimators

"estimators" are objects that can "fit"

we have used `fit` above, but without `BaseEstimator`

`BaseEstimator` additionally handles parameters written to `self` and `is_fitted` etc

Let's look at sorters with `BaseEstimator` class:

In [16]:
# import the base object
import numpy as np
from skbase.base import BaseEstimator

# abstract base class that serves as a template for sorters and permuters
class BasePermuter(BaseEstimator):
    """Abstract Base Class for sorters and permuters"""

    _tags = {"always_sorts_completely": False}

    def fit(self, array: list):
        """Will override in inherited classes.

        Writes the sorted list to self, as array_

        Parameters
        ----------
        array : 1D np.ndarray

        Returns
        -------
        self
        """
        raise NotImplementedError

    def sort(self) -> list:
        return self.array_

class BubbleSort(BasePermuter):
    """Bubble sort.

    Parameters
    ----------
    ascending : bool, default=True
        whether the bubble is ascending (True), or descending
    """
    _tags = {"always_sorts_completely": True}

    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list):
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]

        self.array_ = array
        self._is_fitted = True
        return self

In [17]:
# now let's take a look at how this works
sorter = BubbleSort()

sorter.is_fitted

False

In [18]:
# after fitting, is_fitted returns True
array = np.array([5, 2, 3, 4, 1])
sorter.fit(array)
sorter.is_fitted

True

`skbase` now gives access to the `array_` stored on `self` via `get_fitted_params`

recall: by default, attributes ending in an underscore `_` are treated as fitted parameters

In [19]:
# you can also lookup parameters that are available after fitting
sorter.get_fitted_params()

{'array': array([1, 2, 3, 4, 5])}
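
The trailing-underscore convention can be illustrated with a few lines of plain Python. The function `collect_fitted_params` and the class `ToyEstimator` below are hypothetical; the sketch only mirrors the naming rule (public attributes ending in `_`, reported without the underscore) and is not the `skbase` implementation:

```python
def collect_fitted_params(obj):
    """Return public attributes whose names end in a trailing underscore."""
    return {
        name[:-1]: value  # report the key without the trailing underscore
        for name, value in vars(obj).items()
        if name.endswith("_") and not name.startswith("_")
    }


class ToyEstimator:
    def __init__(self, ascending=True):
        self.ascending = ascending  # hyper-parameter, set at construction
        self._is_fitted = False     # private state, not a fitted parameter

    def fit(self, values):
        self.array_ = sorted(values, reverse=not self.ascending)
        self._is_fitted = True
        return self


est = ToyEstimator().fit([5, 2, 3, 4, 1])
print(collect_fitted_params(est))  # {'array': [1, 2, 3, 4, 5]}
```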

### Testing

Testing is one of the most important but least desired workflows when developing open-source tools.  

Over long periods of time, codebases typically suffer from inadequate test coverage that creates user issues downstream.  

Inadequate test coverage is often a primary cause of an ML library's slow drift towards obsolescence.  

How can we combat this problem?

Ideally, your codebase will include strong abstractions for making testing as painless as possible.  

Inadequate testing has a strong tendency to propagate!  It's important to nip this problem in the bud.

Let's see how you can build in testing interfaces to your custom classes to allow for easier test coverage.

In [20]:
# BubbleSort -- the last time
class BubbleSort(BasePermuter):
    """Bubble sort.

    Parameters
    ----------
    ascending : bool, default=True
        whether the bubble is ascending (True), or descending
    """
    _tags = {"always_sorts_completely": True}

    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list):
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]

        self.array_ = array
        self._is_fitted = True
        return self

    @classmethod
    def get_test_params(cls, parameter_set = "default"):
        
        return [{"ascending": True}, {"ascending": False}]

In [21]:
# test parameters are now built into the class
BubbleSort.get_test_params()

[{'ascending': True}, {'ascending': False}]

In [22]:
# you can then automatically recreate a test instance of an estimator
BubbleSort.create_test_instance(parameter_set = "default")


In [23]:
# multiple test instances with names are created by create_test_instances_and_names
BubbleSort.create_test_instances_and_names(parameter_set = "default")

([BubbleSort(), BubbleSort(ascending=False)], ['BubbleSort-0', 'BubbleSort-1'])
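
Once every class exposes its own test parameters, one generic check can cover all of them. The sketch below shows the pattern with a hypothetical stand-in class `ToyTestableSorter` and hypothetical helpers `is_ordered` and `check_sorts_completely`; it uses plain Python rather than the `skbase` test utilities:

```python
class ToyTestableSorter:
    """Hypothetical stand-in exposing a `get_test_params` hook."""

    def __init__(self, ascending=True):
        self.ascending = ascending

    def fit(self, values):
        self.array_ = sorted(values, reverse=not self.ascending)
        return self

    @classmethod
    def get_test_params(cls):
        return [{"ascending": True}, {"ascending": False}]


def is_ordered(values, ascending):
    """Check that neighbouring elements are in the requested order."""
    pairs = zip(values, values[1:])
    return all(a <= b if ascending else a >= b for a, b in pairs)


def check_sorts_completely(cls, data):
    """Build one instance per test parameter set and verify its output."""
    results = []
    for params in cls.get_test_params():
        result = cls(**params).fit(data).array_
        assert is_ordered(result, params["ascending"]), f"{cls.__name__}({params})"
        results.append(result)
    return results


print(check_sorts_completely(ToyTestableSorter, [5, 3, 6, 2, 1]))
```

The check never mentions a specific sorter, so a new class is covered as soon as it defines `get_test_params`.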

### What Have We Covered So Far?

 - `skbase` provides a coherent interface for developing an ML toolbox
 - It builds on the scikit-learn base class design and extends it to make objects easier to organize and test
 - It is meant to abstract away the tedious details of an API, giving developers a streamlined way to focus on the core logic of a particular technique
 - We'll now go into more detail about how you can use it to prototype an ML framework