# `smlb` mini demonstration:<br>Compare different learners on the same dataset

Scientific Machine Learning Benchmark:<br>
A benchmark of regression models in chem- and materials informatics.<br>
2019-2020, Citrine Informatics.

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

import smlb

## The dataset: Friedman (1979)

Load the dataset. Use tab completion to find the right import:

In [None]:
from smlb.datasets.synthetic.friedman_1979.friedman_1979 import Friedman1979Data

Print the dataset's documentation.
Note that it includes literature references.

In [None]:
print(Friedman1979Data.__doc__) # using `print` instead of `help` avoids clutter

Create the dataset:

In [None]:
dataset = Friedman1979Data(dimensions=5)  # ignore uncorrelated 6th dimension
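
For orientation, here is a minimal sketch of the test function, assuming the dataset follows the form in which the Friedman function is commonly cited. The formula and the name `friedman_1979` are assumptions for illustration and are not taken from `smlb`; check them against the docstring printed above.

```python
import math

def friedman_1979(x):
    """Assumed form of the Friedman test function on the unit hypercube.

    Only the first five inputs enter the result; further inputs are ignored.
    """
    x1, x2, x3, x4, x5 = x[:5]
    return (10 * math.sin(math.pi * x1 * x2)
            + 20 * (x3 - 0.5) ** 2
            + 10 * x4
            + 5 * x5)

center = friedman_1979([0.5] * 5)
with_extra = friedman_1979([0.5] * 5 + [0.9])  # a sixth input changes nothing
print(center, with_extra)
```

Under this assumption, dropping the sixth dimension loses no information about the labels, which is consistent with passing `dimensions=5`.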

## Sampling validation and training sets

To measure error uniformly across the domain, we sample the validation set on a regular grid in 5 dimensions.<br>
Specifying the pseudo-random number generator seed `rng` is mandatory because `smlb` takes reproducibility seriously:<br>
results are deterministic for a given seed, even when running in a distributed environment.

In [None]:
# 16807 samples on a 7 x 7 x ... x 7 grid; original hypercube domain
validation_set = smlb.GridSampler(size=7**5, domain=[0, 1], rng=0)
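
To make the grid size concrete, the following sketch builds a regular grid with 7 points per axis in 5 dimensions using only the standard library. It is not how `GridSampler` is implemented; the helper `regular_grid` and the choice to include both domain endpoints are assumptions for illustration.

```python
from itertools import product

def regular_grid(points_per_axis, dimensions, low=0.0, high=1.0):
    """Return all points of a regular grid; endpoints are included (assumed)."""
    axis = [low + (high - low) * i / (points_per_axis - 1)
            for i in range(points_per_axis)]
    # Cartesian product of the axis with itself, once per dimension
    return list(product(axis, repeat=dimensions))

grid = regular_grid(points_per_axis=7, dimensions=5)
print(len(grid))  # 7**5 = 16807
```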

The training sets increase in size, with sizes equidistant in log-space:

In [None]:
training_sizes = [10, 19, 37, 73, 143, 278, 542, 1056, 2055, 4000]
training_sets = tuple(smlb.GridSampler(size=size, domain=[0, 1], rng=0) for size in training_sizes)
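
The hard-coded sizes can be derived rather than typed in. The sketch below assumes they come from a geometric progression between 10 and 4000 with each value truncated to an integer; the helper `log_spaced_sizes` is hypothetical and the truncation rule is inferred from the numbers, not stated in the source.

```python
def log_spaced_sizes(smallest, largest, count):
    """Sizes equidistant in log-space, truncated to integers (assumed rule)."""
    ratio = largest / smallest
    return [int(smallest * ratio ** (i / (count - 1))) for i in range(count)]

sizes = log_spaced_sizes(smallest=10, largest=4000, count=10)
print(sizes)
```

With these arguments the result matches the list used above, so the same helper can generate a different range or number of training sets.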

## Learners: Gaussian process regression and random forests

We use the `scikit-learn` implementations, which `smlb` wraps:

In [None]:
from smlb.learners.scikit_learn.gaussian_process_regression_sklearn import GaussianProcessRegressionSklearn
learner_gpr_skl = GaussianProcessRegressionSklearn(random_state=0) # default is Gaussian kernel

from smlb.learners.scikit_learn.random_forest_regression_sklearn import RandomForestRegressionSklearn
learner_rf_skl = RandomForestRegressionSklearn(random_state=0)

## The workflow: bringing it all together

We choose the appropriate workflow for benchmarking several learners on a single dataset:

In [None]:
from smlb.workflows.learning_curve_regression import LearningCurveRegression
workflow = LearningCurveRegression(
    data=dataset, training=training_sets, validation=validation_set,
    learners=[learner_rf_skl, learner_gpr_skl],
)

Now, let's run the benchmark:

In [None]:
workflow.run()

<div class="alert alert-block alert-info">
    <b>What happened?</b><br>
    <tt>smlb</tt> has detected that there is overlap between the samples in the first training set and the validation set.<br>
    In fact, the whole n=10 training set is a subset of the validation set because it lies on a subgrid.<br>
    <i>Such overlap can cause arbitrarily wrong performance estimates.</i><br>
    <tt>smlb</tt> emphasizes correctness and guards against this.
</div>
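
The subgrid effect is easy to reproduce outside `smlb`. The sketch below uses two dimensions and exact rational coordinates: a grid with 4 points per axis lies entirely on a grid with 7 points per axis, because 1/3 = 2/6 and 2/3 = 4/6. The grid sizes and the helper `grid` are chosen for illustration only; this does not reproduce how `smlb` places its n=10 samples or how it detects overlap.

```python
from fractions import Fraction
from itertools import product

def grid(points_per_axis, dimensions):
    """Regular grid on the unit hypercube with exact rational coordinates."""
    axis = [Fraction(i, points_per_axis - 1) for i in range(points_per_axis)]
    return set(product(axis, repeat=dimensions))

validation = grid(points_per_axis=7, dimensions=2)  # 49 points
training = grid(points_per_axis=4, dimensions=2)    # 16 points

overlap = training & validation  # samples present in both sets
print(len(overlap), training <= validation)
```

Every training point here is also a validation point, so a learner that memorizes its training data would score perfectly on those samples regardless of how well it generalizes.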

## The workflow #2: correct sampling

Let's sample the training sets randomly, but keep the validation set on a grid as before.

In [None]:
training_sets = tuple(smlb.RandomVectorSampler(size=size, rng=0) for size in training_sizes) # dataset domain is used by default
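
The role of the seed can be illustrated with the standard library alone. This sketch draws points uniformly from the unit hypercube using a local generator; `random_vectors` is a hypothetical helper and says nothing about how `RandomVectorSampler` works internally.

```python
import random

def random_vectors(size, dimensions, seed):
    """Draw `size` points uniformly from the unit hypercube [0, 1)^d."""
    rng = random.Random(seed)  # local generator, global state is untouched
    return [tuple(rng.random() for _ in range(dimensions))
            for _ in range(size)]

first = random_vectors(size=10, dimensions=5, seed=0)
second = random_vectors(size=10, dimensions=5, seed=0)
other = random_vectors(size=10, dimensions=5, seed=1)
print(first == second, first == other)
```

The same seed reproduces the same samples, while a different seed gives a different draw. Randomly drawn coordinates are also very unlikely to land exactly on grid points, which is why this sampling avoids the overlap seen before.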

Let's run the workflow again. We also tell it to produce a learning curve that we can modify within the notebook:

In [None]:
fig, ax = plt.subplots()
learning_curve = smlb.LearningCurvePlot(target=ax, axes_labels=("training set size", "root mean-squared error"))
workflow = LearningCurveRegression(data=dataset, training=training_sets, validation=validation_set,
                                   learners=[learner_rf_skl, learner_gpr_skl], evaluations=[learning_curve])
workflow.run()
ax.legend(["Random forest", "Gaussian process"]);

This result shows<br>
* an anomaly in the Gaussian process learning curve at intermediate training set sizes, likely due to hyperparameter optimization,
* better performance of the Gaussian process than of the random forest on this *smooth* synthetic dataset.
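
For reference, the root mean-squared error on the vertical axis of the learning curve can be computed as in the following sketch. The function name and the toy numbers are illustrative assumptions; `smlb` provides its own evaluation code, which is not shown here.

```python
import math

def root_mean_squared_error(true_values, predictions):
    """Square root of the mean squared difference between two sequences."""
    if len(true_values) != len(predictions):
        raise ValueError("sequences must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values: two exact predictions and one that is off by 2
rmse = root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(rmse)
```

Because errors are squared before averaging, a single large miss raises the value more than several small ones.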