## skbase - a workbench for creating scikit-learn-like parametric objects and libraries

Tutorial at PyData Seattle 2023

skbase: https://github.com/sktime/skbase

API Reference: https://skbase.readthedocs.io/en/latest/api_reference.html

### Presentation Goals:

 - Establish the need for `skbase` in the broader ecosystem of ML tools
 - Go over the main features of `skbase`, and demonstrate their essential use cases
 - Prototype how one could use the skbase interface to **quickly** and **effortlessly** build their own ML toolbox
 
Let us begin!

**Problem Statement:** The implementation of Data Science and ML techniques is almost exclusively done through libraries, and it's critically important that these ML toolboxes provide seamless, consistent APIs that are user-friendly.

`SKBase` is an attempt to forge new ground on a unified way to build ML frameworks.

`SKBase` starts with the foundation of scikit-learn, which is already familiar to most ML practitioners who use Python.

`SKBase` looks to build on this core functionality and make it easier to use.

The rest of this notebook demonstrates the most essential parts of `SKBase` so that you can understand its core functionality.

### Base Objects & Core Functionality

Let's start with the most low-level object inside skbase:  `BaseObject`.  

We'll use it to demonstrate some universal functionality:

 - setting object parameters
 - setting object configurations
 - retrieving tags associated with a particular object

In [1]:
# current release version of skbase
from skbase import __version__
print(f"Current version of skbase is: {__version__}")

Current version of skbase is: 0.4.0


In [2]:
# import the base object
import numpy as np
from skbase.base import BaseObject

# sample class that implements a bubble sort algorithm
class BasePermuter(BaseObject):
    """Abstract Base Class to Use for More Specialized Approaches"""
        
    def fit(self, array: list) -> list:
        """Will override in inherited classes"""
        raise NotImplementedError

class BubbleSort(BasePermuter):
    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        
        return array
                        
# basic code for demonstration
sorter = BubbleSort(ascending = True)
array  = np.array([5, 3, 6, 2, 1])

sorted_array = sorter.fit(array)

# our array is sorted
sorted_array

array([1, 2, 3, 5, 6])

In [3]:
# we have a unified way of getting parameters
sorter.get_params()

{'ascending': True}

In [4]:
# and setting them
sorter.set_params(ascending = False)
sorter.get_params()

{'ascending': False}
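
How can `get_params` know about `ascending` without being told? One plausible mechanism, in the style of scikit-learn, is to read the argument names from the `__init__` signature and look up attributes with the same names. The sketch below illustrates that idea with a hypothetical `ParamMixin` and `ToySorter`; it is a simplified illustration under that assumption, not skbase's actual implementation.

```python
import inspect


class ParamMixin:
    """Hypothetical mixin: derives parameters from the __init__ signature."""

    @classmethod
    def _param_names(cls):
        # every __init__ argument except `self` is treated as a parameter
        signature = inspect.signature(cls.__init__)
        return [name for name in signature.parameters if name != "self"]

    def get_params(self):
        # relies on the convention that __init__ stores each argument
        # on an attribute with the same name
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(
                    f"Invalid parameter {name!r} for {type(self).__name__}"
                )
            setattr(self, name, value)
        # returning self allows chained calls
        return self


class ToySorter(ParamMixin):
    def __init__(self, ascending=True):
        self.ascending = ascending


toy = ToySorter()
params_before = toy.get_params()
params_after = toy.set_params(ascending=False).get_params()
print(params_before, params_after)
```

This is why the convention of storing each constructor argument unchanged on `self` matters: the parameter interface only works if attribute names and argument names line up.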

SKBase gives you the ability to define meta properties of an object via tags.  

Tags are a simple way to organize your codebase according to shared meta-properties.

Let's re-use the previous class, but with some additional details.

In [5]:
# BubbleSort redux
class BubbleSort(BasePermuter):
    
    # we now define tags that define
    # what sort of programmatic properties
    # this estimator has
    
    _tags = {
        "multi_dimensional": False,
        "capability:missing_values": False
    }
    
    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        
        return array

In [6]:
# and we have the new method
sorter = BubbleSort(ascending = True)
sorter.get_tags()

{'multi_dimensional': False, 'capability:missing_values': False}

In [7]:
# and you can also set
sorter.set_tags(multi_dimensional = True)
sorter.get_tags()

{'multi_dimensional': True, 'capability:missing_values': False}

Tag systems are used heavily inside sktime to quickly categorize classes across the entire codebase, which makes it easier to see how different classes relate to one another.
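
To make the "categorize classes by tag" idea concrete, here is a minimal sketch of how class-level tags could be inherited, overridden per instance, and used to filter a list of classes. All names (`TaggedBase`, `ToyBubbleSort`, `ToyNanSafeSort`, `filter_by_tag`) are hypothetical, and the mechanism is an assumption for illustration rather than a description of how skbase or sktime implement tags.

```python
class TaggedBase:
    """Hypothetical base class with class-level tags and instance overrides."""

    _tags = {}

    def __init__(self):
        # per-instance overrides, filled by set_tags
        self._tags_dynamic = {}

    @classmethod
    def get_class_tags(cls):
        # walk the inheritance chain from the most generic class down,
        # so that subclasses override the tags of their parents
        collected = {}
        for klass in reversed(cls.__mro__):
            collected.update(vars(klass).get("_tags", {}))
        return collected

    def get_tags(self):
        # instance overrides take precedence over class-level tags
        return {**self.get_class_tags(), **self._tags_dynamic}

    def set_tags(self, **tags):
        self._tags_dynamic.update(tags)
        return self


class ToyBubbleSort(TaggedBase):
    _tags = {"multi_dimensional": False, "capability:missing_values": False}


class ToyNanSafeSort(ToyBubbleSort):
    # overrides one tag and inherits the other
    _tags = {"capability:missing_values": True}


def filter_by_tag(classes, tag_name, tag_value):
    """Return the classes whose class-level tag equals tag_value."""
    return [c for c in classes if c.get_class_tags().get(tag_name) == tag_value]


nan_safe = filter_by_tag(
    [ToyBubbleSort, ToyNanSafeSort], "capability:missing_values", True
)
print([c.__name__ for c in nan_safe])
```

Because tags are plain data attached to classes, a lookup like this needs no instantiation, which is what makes tags handy for searching a large codebase.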

An additional way to organize a class is with configuration variables, which determine certain behavior outside of the fitting process.  

Let's reuse the previous class, but with these set as well:

In [8]:
# BubbleSort, redux
class BubbleSort(BasePermuter):
    
    # we now define tags that define
    # what sort of programmatic properties
    # this estimator has
    _tags = {
        "multi_dimensional": False,
        "capability:missing_values": False
    }
    
    # and with config variables as well
    _config = {
        "display": "diagram",
        "print_changed_only": True,
    }
    
    def __init__(self, ascending = True):
        self.ascending = ascending
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        
        for index in range(len(array)-1, 0, -1):
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        
        return array

In [9]:
# and now we can see and set config variables in a similar manner
sorter = BubbleSort()
# returns error -- best to run a development version instead?
sorter.get_config()

{'display': 'diagram', 'print_changed_only': True}
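
What does a config value actually change? As a rough illustration, the sketch below shows a flag named `print_changed_only` altering how an object prints itself, assuming (as in scikit-learn) that the flag hides parameters left at their default values. `ConfigurableBase` and `ToySorter` are hypothetical names, and this is not skbase's own implementation.

```python
import inspect


class ConfigurableBase:
    """Hypothetical base class whose repr depends on a config flag."""

    _config = {"print_changed_only": True}

    def __init__(self):
        # copy the class defaults so each instance can be configured separately
        self._config_dynamic = dict(self._config)

    def get_config(self):
        return dict(self._config_dynamic)

    def set_config(self, **config):
        self._config_dynamic.update(config)
        return self

    def __repr__(self):
        init_params = inspect.signature(type(self).__init__).parameters
        shown = []
        for name, param in init_params.items():
            if name == "self":
                continue
            value = getattr(self, name)
            changed = value != param.default
            # show a parameter if it was changed, or if the flag is off
            if changed or not self._config_dynamic["print_changed_only"]:
                shown.append(f"{name}={value!r}")
        return f"{type(self).__name__}({', '.join(shown)})"


class ToySorter(ConfigurableBase):
    def __init__(self, ascending=True):
        self.ascending = ascending
        super().__init__()


toy = ToySorter()
short_repr = repr(toy)  # parameters at their defaults are hidden
toy.set_config(print_changed_only=False)
full_repr = repr(toy)  # every parameter is shown
print(short_repr, full_repr)
```

Note that the config never touches `ascending` or the sorting logic: it only governs behavior outside of fitting, which is the distinction drawn above.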

### Configuration Examples and Compositions

You'll often need to chain together different transformers and estimators via pipelines, and you can use the same functionality for these classes as well.  

Here's an example of a shuffler combined with our list sorter: the same class we had before, now used as one step in a chain.

In [10]:
# simple pipeline with a Shuffler + Sorter
from sklearn.pipeline import make_pipeline

class Shuffler(BasePermuter):
    """Class that will shuffle your data"""
    
    def __init__(self, random_state = 42):
        self.random_state = random_state
        super(Shuffler, self).__init__()
        
    def fit(self, array: list) -> list:
        """Return a shuffled array according to random_state"""
        random = np.random.RandomState(self.random_state)
        return random.permutation(array)

shuffler = Shuffler() 
sorter   = BubbleSort()

# compose them together in a chain of events:
# shuffle first, then sort the shuffled result
shuffled_array = shuffler.fit(array)
sorted_array   = sorter.fit(shuffled_array)
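
Chaining steps by hand works for two objects, but a small composite object scales better. The following is a minimal sketch of that idea using only the standard library; `Chain`, `ToyShuffler`, and `ToySorter` are hypothetical stand-ins for the classes above, not skbase or scikit-learn APIs.

```python
import random


class ToyShuffler:
    """Hypothetical step: returns a shuffled copy of the input."""

    def __init__(self, random_state=42):
        self.random_state = random_state

    def fit(self, array):
        shuffled = list(array)
        # a dedicated generator keeps the shuffle reproducible
        random.Random(self.random_state).shuffle(shuffled)
        return shuffled


class ToySorter:
    """Hypothetical step: returns a sorted copy of the input."""

    def __init__(self, ascending=True):
        self.ascending = ascending

    def fit(self, array):
        return sorted(array, reverse=not self.ascending)


class Chain:
    """Hypothetical composite that feeds each step's output into the next."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, array):
        result = array
        for step in self.steps:
            result = step.fit(result)
        return result


chain = Chain([ToyShuffler(random_state=0), ToySorter(ascending=False)])
result = chain.fit([5, 3, 6, 2, 1])
print(result)
```

Because every step exposes the same `fit` signature, the composite does not need to know anything about the individual steps.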

### Using SKBase With Estimators

Most ML libraries are going to be built around estimators that extract patterns from your data, i.e., machine learning algorithms.

These are the objects that expose the `fit`, `score`, and `predict` methods.

With SKBase, methods associated with an estimator become more streamlined across different classes.  

Let's look at our `BubbleSort` class, but this time with the base class inheriting from `BaseEstimator`.

In [11]:
# BubbleSort, redux
from sktime.base import BaseEstimator

class BasePermuter(BaseEstimator):
    """Abstract Base Class to Use for More Specialized Approaches"""
        
    def fit(self, array: list) -> list:
        """Will override in inherited classes"""
        raise NotImplementedError

class BubbleSort(BasePermuter):

    _tags = {
        "multi_dimensional": False,
        "capability:missing_values": False
    }
    
    _config = {
        "display": "diagram",
        "print_changed_only": True,
    }
    
    def __init__(self, ascending = True):
        self.ascending = ascending
        self._is_fitted = False
        super(BubbleSort, self).__init__()
        
    def fit(self, array: list) -> list:
        """Sort `array` in place and record the fitted state."""
        for index in range(len(array)-1, -1, -1):
            is_sorted = True
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

            # a full pass without swaps means the array is sorted
            if is_sorted:
                break

        # attributes ending in "_" are picked up by get_fitted_params
        self._is_fitted = True
        self.array_ = array

        return array

In [12]:
# now let's take a look at how this works
sorter = BubbleSort()

sorter.is_fitted

False

In [13]:
# but after fitting, it returns True
array = np.array([5, 2, 3, 4, 1])
sorter.fit(array)
sorter.is_fitted

True

In [14]:
# you can also lookup parameters that are available after fitting
sorter.get_fitted_params()

{'array': array([1, 2, 3, 4, 5])}
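
Notice that `array_` came back under the key `array`. One way to get this behavior is the scikit-learn convention that attributes learned during `fit` end in an underscore. The sketch below shows how a base class could collect them; `FittableBase` and `ToySorter` are hypothetical, and the details are an assumption for illustration, not skbase's implementation.

```python
class FittableBase:
    """Hypothetical base class tracking fitted state."""

    def __init__(self):
        self._is_fitted = False

    @property
    def is_fitted(self):
        return self._is_fitted

    def get_fitted_params(self):
        if not self._is_fitted:
            raise RuntimeError(f"{type(self).__name__} has not been fitted yet")
        # fitted attributes end with "_" but do not start with "_";
        # the trailing underscore is stripped from the returned keys
        return {
            name[:-1]: value
            for name, value in vars(self).items()
            if name.endswith("_") and not name.startswith("_")
        }


class ToySorter(FittableBase):
    def __init__(self, ascending=True):
        self.ascending = ascending
        super().__init__()

    def fit(self, array):
        self.array_ = sorted(array, reverse=not self.ascending)
        self._is_fitted = True
        return self


toy = ToySorter()
fitted_before = toy.is_fitted
fitted_params = toy.fit([5, 2, 3, 4, 1]).get_fitted_params()
print(fitted_before, toy.is_fitted, fitted_params)
```

Hyperparameters such as `ascending` and private state such as `_is_fitted` are left out because neither matches the naming pattern.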

In [17]:
# this functionality extends to pipelines as well
# will use an sktime pipeline this time -- which is built with these classes
from sktime.pipeline import make_pipeline as pipeline
from sktime.transformations.series.exponent import ExponentTransformer
from sklearn.preprocessing import StandardScaler
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

# Pipelines also have the qualities of estimators
pipe = pipeline(StandardScaler(), ExponentTransformer())
pipe.fit(array[:, None])

In [18]:
# can use the same helper methods -- but will apply to each subsequent step
print(pipe.is_fitted)
print(pipe.get_fitted_params())

True
{'steps': [('TabularToSeriesAdaptor', TabularToSeriesAdaptor(transformer=StandardScaler())), ('ExponentTransformer', ExponentTransformer())], 'TabularToSeriesAdaptor': TabularToSeriesAdaptor(transformer=StandardScaler()), 'ExponentTransformer': ExponentTransformer(), 'TabularToSeriesAdaptor__transformer': StandardScaler(), 'TabularToSeriesAdaptor__transformer__mean': array([3.]), 'TabularToSeriesAdaptor__transformer__n_features_in': 1, 'TabularToSeriesAdaptor__transformer__n_samples_seen': 5, 'TabularToSeriesAdaptor__transformer__scale': array([1.41421356]), 'TabularToSeriesAdaptor__transformer__var': array([2.])}


In [19]:
# pipelines are 'composite' objects
# that inherit from the BaseMetaObject / BaseMetaEstimator
pipe.is_composite()

True
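
The long keys in the pipeline output above, such as `TabularToSeriesAdaptor__transformer__mean`, follow a double-underscore naming scheme: each level of nesting contributes one prefix. As a rough sketch of how such keys can be produced, the hypothetical helper `flatten_params` below flattens nested dictionaries; the component names and values are made up for illustration.

```python
def flatten_params(named_components):
    """Flatten {component_name: {param: value}} into 'name__param' keys.

    Nested dictionaries are flattened recursively.
    """
    flat = {}
    for name, params in named_components.items():
        for key, value in params.items():
            if isinstance(value, dict):
                # recurse into nested components, then prefix with the parent
                nested = flatten_params({key: value})
                for nested_key, nested_value in nested.items():
                    flat[f"{name}__{nested_key}"] = nested_value
            else:
                flat[f"{name}__{key}"] = value
    return flat


# hypothetical fitted parameters of a two-step pipeline
fitted = {
    "scaler": {"transformer": {"mean": 3.0, "var": 2.0}},
    "exponent": {"power": 0.5},
}
flat = flatten_params(fitted)
print(flat)
```

A flat dictionary like this lets a single lookup reach any parameter of any step, however deeply the composite is nested.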

### Testing

Testing is one of the most important but least desired workflows when developing open-source tools.

Over long periods of time, codebases typically suffer from inadequate test coverage that creates user issues downstream.  

Inadequate test coverage is often a primary cause of an ML library's slow drift towards obsolescence.  

How can we combat this problem?

Ideally, your codebase will include strong abstractions for making testing as painless as possible.  

Inadequate testing has a strong tendency to propagate!  It's important to nip this problem in the bud.

Let's see how you can build testing interfaces into your custom classes to allow for easier test coverage.

In [20]:
# BubbleSort -- the last time
from sktime.base import BaseEstimator

class BubbleSort(BasePermuter):
    
    _tags = {
        "multi_dimensional": False,
        "capability:missing_values": False
    }
    
    _config = {
        "display": "diagram",
        "print_changed_only": True,
    }
    
    def __init__(self, ascending = True):
        self.ascending  = ascending
        self._is_fitted = False
        super(BubbleSort, self).__init__()

    def fit(self, array: list) -> list:

        for index in range(len(array)-1, -1, -1):
            is_sorted = True
            for i in range(index):
                if self.ascending:
                    if array[i] > array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

                else:
                    if array[i] < array[i + 1]:
                        array[i], array[i + 1] = array[i + 1], array[i]
                        is_sorted = False

            if is_sorted:
                self._is_fitted = True
                self.array_ = array
                break

        return array
            
    @classmethod
    def get_test_params(cls, parameter_set = "default"):
        
        if parameter_set == "default":
            
            return {
                'ascending': True
            }
        
        else:
            return {
                'ascending': False
            }

In [21]:
# now let's re-run again to see how these methods are built into the classes
sorter = BubbleSort(ascending = False)

# test parameters are now built into the class
sorter.get_test_params()

{'ascending': True}

In [22]:
# you can then automatically recreate a test instance of an estimator
test_instance = sorter.create_test_instance(parameter_set = "default")
test_instance.get_params()

{'ascending': True}
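
The payoff of `get_test_params` is that one generic check can be written once and run against every class in a library. Here is a minimal sketch of that pattern; `BaseWithTestParams`, `ToySorter`, `check_params_roundtrip`, and the `"descending"` parameter set name are all hypothetical, and skbase's own test utilities may work differently.

```python
import inspect


class BaseWithTestParams:
    """Hypothetical base class exposing test parameter sets."""

    @classmethod
    def get_test_params(cls, parameter_set="default"):
        # by default, construct the object with no arguments
        return {}

    @classmethod
    def create_test_instance(cls, parameter_set="default"):
        return cls(**cls.get_test_params(parameter_set))

    def get_params(self):
        signature = inspect.signature(type(self).__init__)
        names = [name for name in signature.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}


class ToySorter(BaseWithTestParams):
    def __init__(self, ascending=True):
        self.ascending = ascending

    @classmethod
    def get_test_params(cls, parameter_set="default"):
        if parameter_set == "default":
            return {"ascending": True}
        return {"ascending": False}


def check_params_roundtrip(cls, parameter_sets=("default", "descending")):
    """Generic check: constructor arguments must come back from get_params."""
    for parameter_set in parameter_sets:
        expected = cls.get_test_params(parameter_set)
        params = cls.create_test_instance(parameter_set).get_params()
        for name, value in expected.items():
            assert params[name] == value, (cls.__name__, parameter_set, name)
    return True


# the same check can be run over every class in a (hypothetical) registry
results = {cls.__name__: check_params_roundtrip(cls) for cls in [ToySorter]}
print(results)
```

Adding a new class to the registry is then enough to have it covered by every generic check, which keeps test coverage from lagging behind new code.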

### What Have We Covered So Far?

 - `skbase` provides a coherent interface to develop an ML toolbox
 - It builds off of scikit-learn classes, and extends them to make them easier to organize and test
 - It's meant to abstract away the tedious details of an API, allowing developers a streamlined way to focus on the primary details of pattern recognition for a particular technique
 - We'll now go into more detail about how you can use it to prototype an ML framework