# `smlb` mini demonstration:<br>Compare different learners on multiple datasets

Scientific Machine Learning Benchmark:<br>
A benchmark of regression models in chem- and materials informatics.<br>
2019-2020, Citrine Informatics.

This demonstration shows how a panel of learning curves can visualize performance of multiple learners on multiple datasets.<br>
Such a panel can be used for assessment of learners and as a dashboard for regular automated performance tests.

In [None]:
import os.path
import warnings

import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt

import pandas as pd

import IPython

import tqdm.notebook as tqdm

import smlb

## Setup

"Random" numbers are generated deterministically by pseudo-random number generators (PRNGs).<br>
`smlb` takes reproducibility seriously: given identical software and hardware, results are deterministic for a given seed,<br>
even when running asynchronously, in parallel, or in a distributed environment.<br>
As a consequence, PRNG seeds must be created and specified deterministically.<br>
`smlb` supports this by providing a `split` method that creates new seeds for PRNGs.
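
As an illustration of the idea only, child seeds can be derived deterministically from a parent seed. The sketch below hashes the parent seed together with a child index; `split_seed` is a hypothetical helper written for this example, not `smlb`'s `split` implementation, which may work differently.

```python
import hashlib
import random


def split_seed(seed, num):
    """Derive `num` child seeds from a parent seed (hypothetical helper, not smlb code)."""
    children = []
    for index in range(num):
        # hashing (parent seed, child index) yields the same child on every run and machine
        digest = hashlib.sha256(f"{seed}:{index}".encode("utf-8")).digest()
        children.append(int.from_bytes(digest[:8], "big"))
    return children


child_seeds = split_seed(0, 5)

# each component gets its own generator, so results do not depend on execution order
generators = [random.Random(s) for s in child_seeds]
first_draws = [g.random() for g in generators]
```

Because every child seed depends only on the parent seed and its index, components that run in parallel can each be handed their own seed up front.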

In [None]:
prng = smlb.Random(rng=0)  # pseudo-random number generator
seeds = list(np.flip(prng.random.split(100)))  # for simplicity, just create a sufficiently large number of seeds

## Datasets

For this demonstration, we use six datasets that come with `smlb`, three experimental and three synthetic ones.<br>
Use tab-completion to see all available datasets.

In [None]:
from smlb.datasets.experimental.band_gaps_sc73.band_gaps_sc73 import BandGapsStrehlowCook1973Dataset
from smlb.datasets.experimental.superconductors_citrine16.superconductors_citrine16 import SuperconductorsCitrine2016Dataset
from smlb.datasets.experimental.clean_energy_project.clean_energy_project import CleanEnergyProjectDataset

from smlb.datasets.synthetic.friedman_1979.friedman_1979 import Friedman1979Data
from smlb.datasets.synthetic.friedman_silverman_1989.friedman_silverman_1989 import FriedmanSilverman1989Data
from smlb.datasets.synthetic.schwefel26_1981.schwefel26_1981 import Schwefel261981Data

Each dataset has its own characteristics.<br>
To get information about any of them, simply print its docstring:

In [None]:
# print(BandGapsStrehlowCook1973Dataset.__doc__, BandGapsStrehlowCook1973Dataset.__init__.__doc__)
# print(SuperconductorsCitrine2016Dataset.__doc__, SuperconductorsCitrine2016Dataset.__init__.__doc__)
# print(CleanEnergyProjectDataset.__doc__, CleanEnergyProjectDataset.__init__.__doc__)

# print(Friedman1979Data.__doc__, Friedman1979Data.__init__.__doc__)
# print(FriedmanSilverman1989Data.__doc__, FriedmanSilverman1989Data.__init__.__doc__)
# print(Schwefel261981Data.__doc__, Schwefel261981Data.__init__.__doc__)

We arrange the datasets in a 2 x 3 array, the same way the panel will present them.<br>
Any parametrization takes place at initialization (use shift+tab for information on arguments).

In [None]:
# change filename to where your local copy of the dataset exists
# if you don't have the dataset, download it from https://figshare.com/articles/moldata_csv/9640427 (0.56 GB)
cep_filename = os.path.join(os.path.expanduser("~"), "smlb-local/datasets/clean_energy_project_moldata.csv.zip")

In [None]:
datasets = np.asarray([
    [
        BandGapsStrehlowCook1973Dataset(filter_='bg', join=1, 
            samplef=lambda e: e['formula'], labelf=np.median),
        SuperconductorsCitrine2016Dataset(
            process=True,
            join=True,
            filter_=lambda e: not any(e["flagged_formula"]),
            samplef=lambda e: e["formula"],
            labelf=lambda tc: np.median(tc),
        ),
        CleanEnergyProjectDataset(
            source=cep_filename, 
            join=True,
            samplef=lambda e: e['formula'],  # reduce to stoichiometry
            labelf=lambda e: np.median(e['gap']),  # predict band gap
        ),
    ],
    [
        Friedman1979Data(dimensions=6),
        FriedmanSilverman1989Data(dimensions=10),
        Schwefel261981Data(dimensions=15),
    ],
])

dataset_names = [
    ['Strehlow & Cook (1973)', 'Citrine Superconductors (2016)', 'Clean Energy Project (2019)'],
    ['Friedman (1979)', 'Friedman & Silverman (1989)', 'Schwefel #26 (1981)'],
]

## Features

The experimental datasets require explicit featurization for use with learners that expect numerical arrays as inputs.<br>
The synthetic datasets are all vector spaces, and can be used as they are.

In [None]:
from smlb.features.matminer_composition import MatminerCompositionFeatures

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)

    features = np.asarray([
        [
            MatminerCompositionFeatures(ionic_fast=True),  # required for non-integer formulas
            MatminerCompositionFeatures(ionic_fast=True),  # required for non-integer formulas
            MatminerCompositionFeatures(ionic_fast=True),  # abuse for molecules...
        ],
        [
            smlb.IdentityFeatures(),
            smlb.IdentityFeatures(),
            smlb.IdentityFeatures(),
        ],
    ])

## Sampling

### Split sizes

Learning curves are based on increasing training set sizes, usually equidistant in log-space.<br>
Since these sizes depend on the number of samples in a dataset, we use individual sizes for each dataset.<br>
For validation, we use a hold-out set containing 20% of all samples, capped at `max_size` samples.

Arbitrarily many samples can be drawn from the synthetic datasets.<br>
We choose reasonably small training set sizes, motivated by our interest in small-data scenarios.
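
To make "equidistant in log-space" concrete, here is a minimal standard-library sketch that computes six sizes between 10 and 500 with a constant ratio between neighbours. `log_spaced_sizes` is an illustrative helper, not part of `smlb`; it rounds to the nearest integer, whereas `np.logspace(..., dtype=int)` truncates, so the two can differ by one.

```python
def log_spaced_sizes(num_min, num_max, num_sets):
    """Training set sizes spaced evenly in log-space, rounded to integers (illustrative helper)."""
    ratio = num_max / num_min
    # consecutive sizes differ by the constant factor ratio ** (1 / (num_sets - 1))
    return [round(num_min * ratio ** (k / (num_sets - 1))) for k in range(num_sets)]


sizes = log_spaced_sizes(10, 500, 6)
print(sizes)  # [10, 22, 48, 105, 229, 500]
```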

In [None]:
validation_fraction, num_train_min, num_training_sets, max_size = 0.2, 10, 6, 500

def calc_split_sizes(data):
    """Calculate validation and training set sizes for a dataset"""
    num_samples = data if isinstance(data, int) else data.num_samples
    num_validation = int(np.floor(validation_fraction * num_samples))
    num_validation = min(num_validation, max_size)  # cap validation set size
    num_train_max = np.floor((1 - validation_fraction) * num_samples)
    num_train_max = min(num_train_max, max_size)
    num_training = np.logspace(np.log10(num_train_min), 
        np.log10(num_train_max), num_training_sets, dtype=int)
    return num_validation, num_training

# print(calc_split_sizes(datasets[0][0]))
# print(calc_split_sizes(500))

In [None]:
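# dtype=object keeps the nested (num_validation, num_training) pairs intact;
# recent NumPy versions refuse to build ragged arrays without it
split_sizes = np.asarray([
    [calc_split_sizes(data) for data in datasets[0]],
    [calc_split_sizes(650) for _ in datasets[1]],
], dtype=object)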

split_sizes

### Samplers

Use different samplers for finite datasets and vector-space datasets:

In [None]:
validation_samplers = np.array([
    [
        smlb.RandomSubsetSampler(size=m, rng=seeds.pop())
        for m in split_sizes[0, :, 0]
    ],
    [
        smlb.RandomVectorSampler(size=m, rng=seeds.pop())
        for m in split_sizes[1, :, 0]
    ],
])

training_samplers = np.array([
    [
        [smlb.RandomSubsetSampler(size=n, rng=seeds.pop()) for n in nn]
        for nn in split_sizes[0, :, 1]
    ],
    [
        [smlb.RandomVectorSampler(size=n, rng=seeds.pop()) for n in nn]
        for nn in split_sizes[1, :, 1]
    ],
])
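
The difference between the two kinds of samplers can be sketched with the standard library alone. This assumes that a subset sampler draws distinct entries from a finite dataset and that a vector sampler draws new points uniformly from a domain; both functions are hypothetical stand-ins, not the `smlb` samplers.

```python
import random


def sample_subset(items, size, seed):
    """Draw `size` distinct items from a finite dataset (sampling without replacement)."""
    return random.Random(seed).sample(list(items), size)


def sample_vectors(domain, size, seed):
    """Draw `size` new points uniformly from a box given as (low, high) per dimension."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for low, high in domain] for _ in range(size)]


finite_data = ["NaCl", "GaAs", "Si", "ZnO", "TiO2"]  # illustrative entries only
subset = sample_subset(finite_data, size=3, seed=1)

unit_cube = [(0.0, 1.0)] * 4  # four-dimensional unit cube
vectors = sample_vectors(unit_cube, size=3, seed=2)
```

A finite dataset limits how many distinct samples can be drawn, while a vector-space dataset can supply arbitrarily many.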

## Learners

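We compare three `scikit-learn` tree ensembles (random forest, extremely randomized trees, gradient-boosted trees) with the `lolo` random forest.<br>
In this benchmark, all learners use their default parametrizations.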

In [None]:
from smlb.learners.scikit_learn.random_forest_regression_sklearn import RandomForestRegressionSklearn
from smlb.learners.scikit_learn.extremely_randomized_trees_regression_sklearn import ExtremelyRandomizedTreesRegressionSklearn
from smlb.learners.scikit_learn.gradient_boosted_trees_regression_sklearn import GradientBoostedTreesRegressionSklearn

from smlb.learners.lolo.random_forest_regression_lolo import RandomForestRegressionLolo

In [None]:
learners = [
    RandomForestRegressionSklearn(random_state=seeds.pop()), 
    ExtremelyRandomizedTreesRegressionSklearn(random_state=seeds.pop()),
    GradientBoostedTreesRegressionSklearn(random_state=seeds.pop()),
    RandomForestRegressionLolo(),  # currently no support for passing prng seed
]

learner_names = ['random forest', 'extra-trees', 'gradient-boost', 'lolo']

## Workflow

In [None]:
from smlb.workflows.learning_curve_regression import LearningCurveRegression

The `matplotlib` plotting code and the `smlb` workflow code have to share one code cell;<br>
this is a limitation of default Jupyter notebook settings and `matplotlib`.

In [None]:
fig, ax = plt.subplots(nrows=datasets.shape[0], ncols=datasets.shape[1], 
    squeeze=False, figsize=(16,9))

mpl.rcParams['axes.titleweight'] = 'bold'
cfg = smlb.PlotConfiguration(font_size=10)

plots = [
    [
        smlb.LearningCurvePlot(
            target=ax[row][col], 
            rectify=True,
            configuration=cfg,
            axes_labels=(
                'training set size' if row == 1 else None, 
                'Root Mean Squared Error' if col == 0 else None
            ),
        )
        for col in range(datasets.shape[1])
    ]
    for row in range(datasets.shape[0])
]

pb = tqdm.tqdm(total=datasets.shape[0]*datasets.shape[1], leave=False)

for row in range(datasets.shape[0]):
    for col in range(datasets.shape[1]):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=FutureWarning)

            wf = LearningCurveRegression(
                data=datasets[row][col],
                features=features[row][col],
                training=training_samplers[row][col],
                validation=validation_samplers[row][col],
                learners=learners, 
                evaluations=[plots[row][col]]
            )
            
            ax[row][col].set_title(dataset_names[row][col])
            
            wf.run()
            
            pb.update(1)

ax[-1,-1].legend(learner_names, fontsize=12, loc=(1.05,0.05))

plt.show()

### Tabular results

Visual inspection of the learning curves is most useful for single analyses,<br>
but repeated evaluation calls for more easily accessible presentations.

`smlb` allows extraction of auxiliary information such as the offset and slope of the learning curves,<br>
which can then be displayed in a table for manual inspection at a glance or automated evaluation for a dashboard.
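
As a rough sketch of what offset and slope mean, assume a learning curve follows a power law, so that it becomes a straight line in log-log space. The least-squares fit below recovers both numbers from synthetic data. Whether `smlb` uses exactly this convention (logarithm base, which points enter the fit) is an assumption; `fit_log_log` is a hypothetical helper.

```python
import math


def fit_log_log(sizes, errors):
    """Least-squares line through (log10 n, log10 error); returns (offset, slope)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    offset = y_mean - slope * x_mean
    return offset, slope


# synthetic learning curve following error = 2 * n^(-1/2), for illustration only
train_sizes = [10, 22, 48, 105, 229, 500]
train_errors = [2.0 * n ** -0.5 for n in train_sizes]
offset, slope = fit_log_log(train_sizes, train_errors)
```

Under this model, a more negative slope means the error falls faster as training data is added, and the offset is the extrapolated log-error at a training set size of one.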

In [None]:
plots[0][0].auxiliary

In [None]:
fits = np.asarray([
    [
        [
            (entry['offset'], entry['slope'])
            for entry in plot.auxiliary['asymptotic_fits']
        ]
        for plot in row
    ]
    for row in plots
], dtype=float)  # np.asfarray was removed in NumPy 2.0
# fits [ row ] [ col ] [ learner ] [ offset/slope ]

fits = np.reshape(fits, (datasets.shape[0]*datasets.shape[1], len(learners), 2))
# fits [ row/col ] [ learner ] [ offset/slope ]

In [None]:
# one row per dataset: offsets of all learners first, then their slopes
table_data = np.asarray([
    np.hstack((rowcol[:, 0], rowcol[:, 1]))
    for rowcol in fits
])

df = pd.DataFrame(
    table_data,
    columns=[name + " offset" for name in learner_names] + [name + " slope" for name in learner_names],
    index=np.asarray(dataset_names).ravel()
)

with pd.option_context('display.float_format', '{:5.2f}'.format, 'display.width', 9999):
    IPython.display.display(IPython.display.HTML(df.to_html()))

By comparing to results tables from previous runs, a performance dashboard could be used<br>
to regularly monitor integrated performance of machine-learning algorithms on multiple datasets.
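
One way such a comparison could be automated is sketched below: slopes from a previous run and the current run are compared, and entries that got noticeably worse are flagged. The function name, the tolerance, and the numbers are all made up for illustration; none of them come from `smlb`.

```python
def find_regressions(previous, current, tolerance=0.05):
    """Return (dataset, learner) keys whose learning-curve slope got worse by more than `tolerance`.

    Slopes are negative for improving learners, so a larger (less negative) slope is worse.
    """
    flagged = []
    for key, old_slope in previous.items():
        new_slope = current.get(key)
        if new_slope is not None and new_slope - old_slope > tolerance:
            flagged.append(key)
    return flagged


# made-up slopes from two hypothetical runs
previous_run = {("Friedman (1979)", "lolo"): -0.40, ("Friedman (1979)", "extra-trees"): -0.35}
current_run = {("Friedman (1979)", "lolo"): -0.10, ("Friedman (1979)", "extra-trees"): -0.36}

print(find_regressions(previous_run, current_run))  # [('Friedman (1979)', 'lolo')]
```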

In [None]:
# simulate a changed result: overwrite one entry (second dataset, last column)
df2 = df.copy()
df2.iloc[1, -1] = -0.1

def _style_cell(x):
    """Return a frame of CSS styles that marks the changed entry in red."""
    styles = pd.DataFrame('', index=x.index, columns=x.columns)
    styles.iloc[1, -1] = 'color: red;'
    return styles

df2s = df2.style.apply(_style_cell, axis=None).format('{:5.2f}')
# Styler.render was removed in pandas 2.0; Styler.to_html replaces it
IPython.display.display(IPython.display.HTML(df2s.to_html()))