## skbase - a workbench for creating scikit-learn like parametric objects and libraries

Tutorial at PyData Seattle 2023

skbase: https://github.com/sktime/skbase

API Reference: https://skbase.readthedocs.io/en/latest/api_reference.html

### Presentation Goals:

 - Establish the need for `skbase` in the broader ecosystem of ML tools
 - Go over the main features of `skbase`, and demonstrate their essential use cases
 - Prototype how one could use the `skbase` interface to **quickly** and **effortlessly** build their own ML toolbox
 
Let us begin!

This notebook uses `skbase` to create a template for *sorting algorithms* and *permutation algorithms*.

Sorting algorithms are permuters that always end up sorting the list completely.

Examples of permuters:

* bubble sort
* merge sort
* random shuffle

### using `BaseObject` to write a template base class for sorting/permutation

recipe: inherit from `BaseObject` to template a base class!

We'll use it to demonstrate some universal functionality

 - setting object parameters
 - setting object configurations
 - retrieving tags associated with a particular object

#### defining the `BasePermuter`

In [None]:
# import the base object
import numpy as np
from skbase.base import BaseObject

# base class that serves as a template for sorters and permuters
class BasePermuter(BaseObject):
    """Abstract Base Class for sorters and permuters"""

    _tags = {"always_sorts_completely": False}

    def fit(self, array: list) -> list:
        """Will override in inherited classes.
        
        Parameters
        ----------
        array : 1D np.ndarray

        Returns
        -------
        permuted/sorted 1D np.ndarray
        """
        raise NotImplementedError

class BubbleSort(BasePermuter):
    """Bubble sort.

    Parameters
    ----------
    ascending : bool, default=True
        whether the bubble is ascending (True), or descending
    """
    _tags = {"always_sorts_completely": True}

    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        """Sort the array in place with bubble sort and return it."""
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        
        return array


In [None]:
# applying the bubble sort
sorter = BubbleSort(ascending = True)
array  = np.array([5, 3, 6, 2, 1])

sorted_array = sorter.fit(array)

# our array is sorted
sorted_array

In [None]:
# get_params is available out of the box!
sorter.get_params()

In [None]:
# set_params too
sorter.set_params(ascending = False)
sorter.get_params()
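
To make it less mysterious why `get_params` and `set_params` work without any extra code, here is a minimal pure-Python sketch of one way such behaviour could be derived from the `__init__` signature. This is an illustration only, not the actual `skbase` implementation; `ParamsMixin` and `ToySorter` are hypothetical names, and the sketch assumes that every `__init__` argument is stored on the instance under the same name.

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin: derives parameters from the __init__ signature."""

    @classmethod
    def _param_names(cls):
        # every __init__ argument except `self` counts as a parameter
        sig = inspect.signature(cls.__init__)
        return [name for name in sig.parameters if name != "self"]

    def get_params(self):
        # relies on __init__ storing each argument under the same name
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"unknown parameter: {name!r}")
            setattr(self, name, value)
        return self  # returning self allows method chaining


class ToySorter(ParamsMixin):
    def __init__(self, ascending=True):
        self.ascending = ascending


toy = ToySorter()
params_before = toy.get_params()
params_after = toy.set_params(ascending=False).get_params()
print(params_before, params_after)
```

This is also why the classes in this notebook write each `__init__` argument to an attribute of the same name and do nothing else with it.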

the `BasePermuter` is now a template for further permuters

e.g., this class which shuffles the data randomly:

In [None]:
class Shuffler(BasePermuter):
    """Class that will shuffle your data"""

    def __init__(self, random_state = 42):
        self.random_state = random_state
        super(Shuffler, self).__init__()
        
    def fit(self, array: list) -> list:
        """Return a shuffled array according to random_state"""
        random = np.random.RandomState(self.random_state)
        return random.permutation(array)

`skbase` gives you the ability to define meta properties of an object via tags.  

Tags are a simple way to organize your codebase according to shared meta-properties

above, we have already used the `_tags` attribute for this:

In [None]:
Shuffler().get_tags()

In [None]:
BubbleSort().get_tags()

why does this return something in both cases?

* we have set `always_sorts_completely=False` in the base class `BasePermuter`, the "general case"
* we have set `always_sorts_completely=True` in `BubbleSort`
* we have set no tags in `Shuffler`
* both `Shuffler` and `BubbleSort` inherit from `BasePermuter`

--> `get_tags` in `skbase` respects inheritance of the `_tags` attribute: tags set in a child class override those set in its parents!
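
The inheritance behaviour described above can be sketched in plain Python by walking the method resolution order and merging each class's own `_tags` dictionary. This is a hedged illustration of the idea, not the code `skbase` actually runs; `TagMixin`, `ToyBase`, `ToySorter`, `ToyShuffler`, and the `_tags_dynamic` attribute are hypothetical names chosen for this sketch.

```python
class TagMixin:
    """Hypothetical mixin illustrating tag inheritance."""

    _tags = {}

    @classmethod
    def get_class_tags(cls):
        collected = {}
        # walk from the most general class to the most specific one,
        # so that tags of child classes override tags of their parents
        for klass in reversed(cls.__mro__):
            collected.update(vars(klass).get("_tags", {}))
        return collected

    def get_tags(self):
        tags = self.get_class_tags()
        # tags set on the instance win over tags set on any class
        tags.update(self.__dict__.get("_tags_dynamic", {}))
        return tags

    def set_tags(self, **tags):
        self.__dict__.setdefault("_tags_dynamic", {}).update(tags)
        return self


class ToyBase(TagMixin):
    _tags = {"always_sorts_completely": False}  # the "general case"


class ToySorter(ToyBase):
    _tags = {"always_sorts_completely": True}  # overrides the parent


class ToyShuffler(ToyBase):
    pass  # sets no tags, so it inherits the parent's


shuffler_tags = ToyShuffler().get_tags()
sorter_tags = ToySorter().get_tags()
dynamic_tags = ToySorter().set_tags(my_new_tag=True).get_tags()
print(shuffler_tags, sorter_tags, dynamic_tags)
```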

Tags can also be dynamically overridden or set on an individual object, if desired:

In [None]:
# tags can also be set on an existing instance
sorter.set_tags(my_new_tag = True)
sorter.get_tags()

#### Configurations vs tags

* tags can be user facing to describe an estimator, e.g., for organising code base
* tags can also be developer or system facing, to determine functionality
* tags should not change over the lifetime of an object

* configs are user facing and determine *behaviour* (not *properties* or *functionality*)

example: printing a useful log for the user, controlled by config

In [None]:
# import the base object
import numpy as np
from skbase.base import BaseObject

# base class that serves as a template for sorters and permuters
class BasePermuter(BaseObject):
    """Abstract Base Class for sorters and permuters"""

    _tags = {"always_sorts_completely": False}

    def fit(self, array: list) -> list:
        """Will override in inherited classes.
        
        Parameters
        ----------
        array : 1D np.ndarray

        Returns
        -------
        permuted/sorted 1D np.ndarray
        """
        raise NotImplementedError

class BubbleSort(BasePermuter):
    """Bubble sort.

    Parameters
    ----------
    ascending : bool, default=True
        whether the bubble is ascending (True), or descending
    """
    _tags = {"always_sorts_completely": True}
    _config = {"print_useful_log": False}

    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]

        if self.get_config()["print_useful_log"]:
            print(42)

        return array


In [None]:
# and now we can see and set config variables in a similar manner
sorter = BubbleSort()
# returns error -- best to run a development version instead?
sorter.get_config()

`display` and `print_changed_only` come from `skbase` directly; they control how the object is pretty-printed

In [None]:
sorter

In [None]:
sorter.set_config(display="text")

using the custom "useful logging" config

In [None]:
# applying the bubble sort
sorter = BubbleSort(ascending = True)
array  = np.array([5, 3, 6, 2, 1])

sorted_array = sorter.fit(array)
# doesn't print

In [None]:
sorter.set_config(print_useful_log=True)

sorted_array = sorter.fit(array)
# now it prints
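
One way to picture the config mechanism is as class-level defaults plus per-instance overrides. The sketch below is a simplified pure-Python illustration under that assumption, not the `skbase` implementation; `ConfigMixin`, `ToyConfigurable`, and `_config_dynamic` are hypothetical names.

```python
class ConfigMixin:
    """Hypothetical mixin: class-level defaults, instance-level overrides."""

    _config = {}

    def get_config(self):
        # start from a copy of the class defaults ...
        config = dict(type(self)._config)
        # ... and apply whatever this instance has overridden
        config.update(self.__dict__.get("_config_dynamic", {}))
        return config

    def set_config(self, **config):
        self.__dict__.setdefault("_config_dynamic", {}).update(config)
        return self


class ToyConfigurable(ConfigMixin):
    _config = {"print_useful_log": False}

    def fit(self, array):
        result = sorted(array)
        if self.get_config()["print_useful_log"]:
            print("sorted", len(result), "elements")
        return result


quiet = ToyConfigurable()
chatty = ToyConfigurable().set_config(print_useful_log=True)

quiet.fit([3, 1, 2])   # prints nothing
chatty.fit([3, 1, 2])  # prints the log line
```

In this sketch, setting a config on one instance leaves both the class defaults and all other instances untouched, which is the behaviour you would want from a user-facing switch.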

### Configuration Examples and Compositions

You'll often need to chain together different transformers and estimators via pipelines, and you can use the same functionality for these classes as well.  

Here's an example that combines a shuffler with our list sorter. These are the same classes we had before, but applied one after the other.

In [None]:
# simple chain of a Shuffler + Sorter, applied by hand

class Shuffler(BasePermuter):
    """Class that will shuffle your data"""
    
    def __init__(self, random_state = 42):
        self.random_state = random_state
        super(Shuffler, self).__init__()
        
    def fit(self, array: list) -> list:
        """Return a shuffled array according to random_state"""
        random = np.random.RandomState(self.random_state)
        return random.permutation(array)

shuffler = Shuffler() 
sorter   = BubbleSort()

# compose them together in a chain of events:
# the output of the shuffler becomes the input of the sorter
shuffled_array = shuffler.fit(array)
sorted_array   = sorter.fit(shuffled_array)
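
Chaining steps by hand, as above, can be wrapped in a small composite object. The following is a minimal pure-Python sketch of that idea using plain lists; `PermuterPipeline`, `ToyShuffler`, and `ToySorter` are hypothetical stand-ins written for this illustration and are not part of `skbase`.

```python
import random


class ToyShuffler:
    def __init__(self, random_state=42):
        self.random_state = random_state

    def fit(self, array):
        rng = random.Random(self.random_state)
        shuffled = list(array)  # copy, so the input is left untouched
        rng.shuffle(shuffled)
        return shuffled


class ToySorter:
    def __init__(self, ascending=True):
        self.ascending = ascending

    def fit(self, array):
        return sorted(array, reverse=not self.ascending)


class PermuterPipeline:
    """Hypothetical composite: feeds each step's output into the next step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, array):
        for step in self.steps:
            array = step.fit(array)
        return array


data = [5, 3, 6, 2, 1]
shuffled = ToyShuffler().fit(data)
piped = PermuterPipeline([ToyShuffler(), ToySorter()]).fit(data)
print(shuffled, piped)
```

Because the pipeline exposes the same `fit` interface as its steps, it can itself be used as a step in a larger pipeline.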

### Using SKBase With Estimators

Most ML libraries are built around estimators that extract patterns from your data, i.e., machine learning algorithms.  

These are the objects that expose methods such as `fit`, `score`, and `predict`.  

With `skbase`, the methods associated with an estimator become more streamlined across different classes.  

Let's look at our `BubbleSort` class, but this time with the additional inheritance from a `BaseEstimator` class.

In [None]:
# BubbleSort, redux
from sktime.base import BaseEstimator

class BasePermuter(BaseEstimator):
    """Abstract Base Class to Use for More Specialized Approaches"""
        
    def fit(self, array: list) -> list:
        """Will override in inherited classes"""
        raise NotImplementedError

class BubbleSort(BasePermuter):

    _tags = {
        "multi_dimensional": False,
        "capability:missing_values": False
    }
    
    _config = {
        "display": "diagram",
        "print_changed_only": True,
    }
    
    def __init__(self, ascending = True):
        self.ascending = ascending
        self._is_fitted = False
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        """Sort the array in place, store it as array_, and mark as fitted."""
        for index in range(len(array)-1, 0, -1):
            is_sorted = True
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

            # stop early if a full pass made no swaps
            if is_sorted:
                break

        # attributes that we're adding after fitting end in an underscore
        self.array_ = array
        self._is_fitted = True

        return array

In [None]:
# now let's take a look at how this works
sorter = BubbleSort()

sorter.is_fitted

In [None]:
# but after fitting, it now returns True
array = np.array([5, 2, 3, 4, 1])
sorter.fit(array)
sorter.is_fitted

In [None]:
# you can also look up parameters that are available after fitting
sorter.get_fitted_params()
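
The convention at work here is that attributes learned during `fit` end in an underscore, which lets a base class collect them automatically. Below is a minimal pure-Python sketch of that convention. It is an illustration rather than the library's code; `FittedMixin` and `ToySorter` are hypothetical names, and stripping the trailing underscore from the returned keys is a choice made for this sketch, so check the library documentation for the exact key format.

```python
class FittedMixin:
    """Hypothetical mixin illustrating the fitted-state convention."""

    def __init__(self):
        self._is_fitted = False

    @property
    def is_fitted(self):
        return self._is_fitted

    def get_fitted_params(self):
        if not self.is_fitted:
            raise RuntimeError("call fit before asking for fitted parameters")
        # public attributes ending in "_" are treated as fitted parameters
        return {
            name[:-1]: value
            for name, value in vars(self).items()
            if name.endswith("_") and not name.startswith("_")
        }


class ToySorter(FittedMixin):
    def __init__(self, ascending=True):
        self.ascending = ascending
        super().__init__()

    def fit(self, array):
        self.array_ = sorted(array, reverse=not self.ascending)
        self._is_fitted = True
        return self


toy = ToySorter()
fitted_before = toy.is_fitted
try:
    toy.get_fitted_params()
    raised = False
except RuntimeError:
    raised = True

fitted_params = toy.fit([5, 3, 6, 2, 1]).get_fitted_params()
print(fitted_before, raised, fitted_params)
```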

In [None]:
# this functionality also extends to pipelines as well
# will use an sktime pipeline this time -- which is built with these classes
from sktime.pipeline import make_pipeline as pipeline
from sktime.transformations.series.exponent import ExponentTransformer
from sklearn.preprocessing import StandardScaler
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

# Pipelines also have the qualities of estimators
pipe = pipeline(StandardScaler(), ExponentTransformer())
pipe.fit(array[:, None])

In [None]:
# can use the same helper methods -- but will apply to each subsequent step
print(pipe.is_fitted)
print(pipe.get_fitted_params())

In [None]:
# pipelines are 'composite' objects
# that inherit from the BaseMetaObject / BaseMetaEstimator
pipe.is_composite()

### Testing

Testing is one of the most important but least desired workflows when developing open-source tools.  

Over long periods of time, codebases typically suffer from inadequate test coverage that creates user issues downstream.  

Inadequate test coverage is often a primary cause of an ML library's slow drift towards obsolescence.  

How can we combat this problem?

Ideally, your codebase will include strong abstractions for making testing as painless as possible.  

Inadequate testing has a strong tendency to propagate!  It's important to nip this problem in the bud.

Let's see how you can build in testing interfaces to your custom classes to allow for easier test coverage.

In [None]:
# BubbleSort -- the last time
from sktime.base import BaseEstimator

class BubbleSort(BasePermuter):
    
    _tags = {
        "multi_dimensional": False,
        "capability:missing_values": False
    }
    
    _config = {
        "display": "diagram",
        "print_changed_only": True,
    }
    
    def __init__(self, ascending = True):
        self.ascending  = ascending
        self._is_fitted = False
        super(BubbleSort, self).__init__()

    def fit(self, array: list) -> list:
        """Sort the array in place, store it as array_, and mark as fitted."""
        for index in range(len(array)-1, 0, -1):
            is_sorted = True
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

            # stop early if a full pass made no swaps
            if is_sorted:
                break

        self.array_ = array
        self._is_fitted = True

        return array

            
    @classmethod
    def get_test_params(cls, parameter_set = "default"):
        
        if parameter_set == "default":
            
            return {
                'ascending': True
            }
        
        else:
            return {
                'ascending': False
            }

In [None]:
# now let's re-run again to see how these methods are built into the classes
sorter = BubbleSort(ascending = False)

# test parameters are now built into the class
sorter.get_test_params()

In [None]:
# you can then automatically recreate a test instance of an estimator
test_instance = sorter.create_test_instance(parameter_set = "default")
test_instance.get_params()
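
Once every class can describe its own test parameters, a single generic check can cover the whole family of objects. The sketch below shows the idea in plain Python, using the `always_sorts_completely` tag from earlier to decide which contract to enforce. All names here (`ToyBubbleSort`, `ToyReverser`, `check_permuter`, the `tags` attribute) are hypothetical and written for this illustration; this is not a test suite shipped with `skbase`.

```python
class ToyBubbleSort:
    tags = {"always_sorts_completely": True}

    def __init__(self, ascending=True):
        self.ascending = ascending

    def fit(self, array):
        return sorted(array, reverse=not self.ascending)

    @classmethod
    def get_test_params(cls):
        # one dict of constructor arguments per test instance
        return [{"ascending": True}, {"ascending": False}]


class ToyReverser:
    tags = {"always_sorts_completely": False}

    def fit(self, array):
        return array[::-1]

    @classmethod
    def get_test_params(cls):
        return [{}]


def check_permuter(cls):
    """Build one instance per parameter set and check the shared contract."""
    data = [5, 3, 6, 2, 1]
    n_checked = 0
    for params in cls.get_test_params():
        instance = cls(**params)
        result = list(instance.fit(list(data)))
        # every permuter must return a rearrangement of its input
        assert sorted(result) == sorted(data)
        # only classes tagged as sorters must return a sorted result
        if cls.tags["always_sorts_completely"]:
            descending = not params.get("ascending", True)
            assert result == sorted(data, reverse=descending)
        n_checked += 1
    return n_checked


checked = {cls.__name__: check_permuter(cls) for cls in (ToyBubbleSort, ToyReverser)}
print(checked)
```

Adding a new permuter then only requires writing its `get_test_params`; the generic check picks it up without new test code.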

### What Have We Covered So Far?

 - `skbase` provides a coherent interface to develop an ML toolbox
 - It builds on scikit-learn's design patterns, and extends them to make classes easier to organize and test
 - It's meant to abstract away the tedious details of an API, giving developers a streamlined way to focus on the core logic of a particular technique
 - We'll now go into more detail about how you can use it to prototype an ML framework