# Demonstration -- A machine learning pipeline with `MEMENTO`

The benefits of using `MEMENTO`:

- Avoid copying and pasting code when running repeated experiments.
- Run experiments in parallel.
- Focus on the workflow of a single experiment.
- Keep all configurations in one place.
- Hash each task automatically.
- Use checkpoints to keep track of progress.
- Send notifications when experiments fail or finish.

## Install

- `MEMENTO` is officially available on PyPI.
- `MEMENTO` is portable and versatile. It is not tied to any particular machine learning package, such as `sklearn` or `xgboost`.

```bash
# Using Python 3.9.x (Memento supports Python 3.7, 3.8 and 3.9)
conda create -n memento python=3.9
conda activate memento

# Install dependencies
pip install memento-ml scikit-learn jupyterlab
```


In [1]:
import functools
import logging

import numpy as np
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, StandardScaler
from sklearn.svm import SVC

from memento import Config, ConsoleNotificationProvider, Context, Memento


In [2]:
logging.basicConfig(level=logging.INFO)


## Add Missing Values

In [3]:
def add_missing_values(X, missing_rate=0.1):
    """Add missing features to n percent of samples. Remove 1 feature per sample."""
    n_samples, n_features = X.shape
    n_missing_samples = int(n_samples * missing_rate)

    idx_missing_samples = np.random.choice(
        n_samples, size=n_missing_samples, replace=True
    )
    idx_missing_features = np.random.randint(0, n_features, n_missing_samples)

    X_missing = X.copy()
    X_missing[idx_missing_samples, idx_missing_features] = np.nan
    return X_missing
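
A quick, self-contained check of this kind of masking might look like the following. It is only a sketch: it uses a seeded `np.random.default_rng` generator and samples rows without replacement (both choices are assumptions made here so that the outcome is repeatable), then confirms that exactly 10% of the rows lose exactly one value and that the input array is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the check is repeatable
X = np.arange(200, dtype=float).reshape(20, 10)
n_missing = int(X.shape[0] * 0.1)  # 10% of 20 rows gives 2 rows

# Pick distinct rows, then one random column for each of them.
rows = rng.choice(X.shape[0], size=n_missing, replace=False)
cols = rng.integers(0, X.shape[1], size=n_missing)

X_missing = X.copy()
X_missing[rows, cols] = np.nan

# Count NaNs per row in the masked copy and in the original.
nan_per_row = np.isnan(X_missing).sum(axis=1)
print(int((nan_per_row == 1).sum()), int(np.isnan(X).sum()))
```

Because the rows are distinct, the number of rows with a missing value always equals `n_missing`, regardless of the seed.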


In [4]:
# A dummy preprocessing step that returns X unchanged.
DummyPreprocessor = FunctionTransformer(lambda x: x)

# Use `partial` to bind `return_X_y=True`, so the experiment can call each loader without arguments.
load_digits = functools.partial(datasets.load_digits, return_X_y=True)
load_wine = functools.partial(datasets.load_wine, return_X_y=True)


def load_breast_cancer():
    """Add missing values to Breast Cancer dataset."""
    X, y = datasets.load_breast_cancer(return_X_y=True)
    X_missing = add_missing_values(X, missing_rate=0.1)
    return X_missing, y


Imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Put all parameters in a configuration matrix
matrix = {
    "parameters": {
        "dataset": [
            load_digits,
            load_wine,
            load_breast_cancer,
        ],
        "preprocessing1": [
            DummyPreprocessor,
            Imputer,
        ],
        "preprocessing2": [
            DummyPreprocessor,
            MinMaxScaler(),
            StandardScaler(),
        ],
        "classifier": [
            AdaBoostClassifier,
            RandomForestClassifier,
            SVC,
        ],
    },
    "settings": {  # Set global values here
        "n_fold": 5,
    },
    "exclude": [  # Only Breast Cancer dataset requires imputation.
        {"dataset": load_breast_cancer, "preprocessing1": DummyPreprocessor},
        {"dataset": load_digits, "preprocessing1": Imputer},
        {"dataset": load_wine, "preprocessing1": Imputer},
    ],
}


The `experiment` function is the building block of **Memento**.
It takes two parameters: a `Context` and a `Config`.
Memento automatically works out how many tasks it needs to create from the configuration matrix and executes them in parallel.
Each task runs the same `experiment` function with a different set of parameters (held in `Config`).

- The `Context` exposes a handle for working with checkpoints inside the `experiment` function.
- The `Config` provides one set of parameters (from the configuration matrix) to the experiment.
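
To get a feel for how a matrix like the one above turns into individual tasks, the sketch below expands it with `itertools.product` and drops every combination that matches an `exclude` rule. This is an illustration of the idea only, not Memento's actual implementation: the `expand` helper is hypothetical, and plain strings stand in for the loader functions, transformers, and classifier classes.

```python
import itertools

# String stand-ins for the objects used in the configuration matrix.
parameters = {
    "dataset": ["load_digits", "load_wine", "load_breast_cancer"],
    "preprocessing1": ["DummyPreprocessor", "Imputer"],
    "preprocessing2": ["DummyPreprocessor", "MinMaxScaler", "StandardScaler"],
    "classifier": ["AdaBoostClassifier", "RandomForestClassifier", "SVC"],
}
exclude = [
    {"dataset": "load_breast_cancer", "preprocessing1": "DummyPreprocessor"},
    {"dataset": "load_digits", "preprocessing1": "Imputer"},
    {"dataset": "load_wine", "preprocessing1": "Imputer"},
]


def expand(parameters, exclude):
    """Yield every parameter combination that matches no exclusion rule."""
    names = list(parameters)
    for values in itertools.product(*parameters.values()):
        combo = dict(zip(names, values))
        # A rule excludes a combination when all of its key/value pairs match.
        if any(all(combo[k] == v for k, v in rule.items()) for rule in exclude):
            continue
        yield combo


tasks = list(expand(parameters, exclude))
print(len(tasks))  # 54 combinations minus 27 excluded ones leaves 27
```

Each of the three `exclude` rules removes 9 of the 3 × 2 × 3 × 3 = 54 combinations, which leaves 27 tasks.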


In [5]:
def experiment(context: Context, config: Config):
    """This block contains the experiment with one set of parameters."""
    if context.checkpoint_exist():  # Based on the task's hash value.
        results = context.restore()  # Recover results from cache.
    else:  # No cached results were found, so run the experiment here:
        # Access the parameters:
        X, y = config.dataset()
        model = config.classifier()

        # Access the global constant.
        cv = config.settings["n_fold"]

        # Build and run the pipeline:
        pipeline = make_pipeline(config.preprocessing1, config.preprocessing2, model)
        results = cross_val_score(pipeline, X, y, cv=cv)

        # Save results to the checkpoint:
        # NOTE: A checkpoint can store any object; here it is an array of scores, not a single value.
        context.checkpoint(results)
    return results.mean() * 100  # Mean accuracy over the 5 folds, as a percentage.


In [6]:
notification_provider = ConsoleNotificationProvider()

# Do not actually run experiments, just log what would be run.
results = Memento(experiment, notification_provider).run(matrix, dry_run=True)


INFO:memento.memento:Running configurations:
INFO:memento.memento:  {'dataset': functools.partial(<function load_digits at 0x00000225F197F4C0>, return_X_y=True), 'preprocessing1': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'preprocessing2': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'classifier': <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>}
INFO:memento.memento:  {'dataset': functools.partial(<function load_digits at 0x00000225F197F4C0>, return_X_y=True), 'preprocessing1': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'preprocessing2': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'classifier': <class 'sklearn.ensemble._forest.RandomForestClassifier'>}
INFO:memento.memento:  {'dataset': functools.partial(<function load_digits at 0x00000225F197F4C0>, return_X_y=True), 'preprocessing1': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'preprocessing2': 

In [11]:
results = Memento(experiment, notification_provider).run(matrix)


INFO:memento.memento:Running configurations:
INFO:memento.memento:  {'dataset': functools.partial(<function load_digits at 0x00000225F197F4C0>, return_X_y=True), 'preprocessing1': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'preprocessing2': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'classifier': <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>}
INFO:memento.memento:  {'dataset': functools.partial(<function load_digits at 0x00000225F197F4C0>, return_X_y=True), 'preprocessing1': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'preprocessing2': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'classifier': <class 'sklearn.ensemble._forest.RandomForestClassifier'>}
INFO:memento.memento:  {'dataset': functools.partial(<function load_digits at 0x00000225F197F4C0>, return_X_y=True), 'preprocessing1': FunctionTransformer(func=<function <lambda> at 0x00000225F339A040>), 'preprocessing2': 

All tasks completed


If we rerun the cell above, the code completes almost instantly: no parameters have changed and all results have already been saved in the cache.
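
The general pattern behind this behaviour, keying stored results on a hash of the configuration, can be sketched in a few lines of standard-library Python. Everything here is hypothetical and only illustrates the idea: `config_key`, `run_cached`, and the in-memory `_cache` dictionary are not part of Memento, whose own hashing and storage may work differently.

```python
import hashlib
import json

_cache = {}  # hypothetical in-memory stand-in for an on-disk cache
calls = []  # records every time real work is done


def config_key(config: dict) -> str:
    """Derive a stable key from a configuration's contents."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_cached(config: dict, compute):
    """Return the cached result for `config`, computing it only on a miss."""
    key = config_key(config)
    if key not in _cache:
        _cache[key] = compute(config)
    return _cache[key]


def slow_experiment(config):
    calls.append(config)  # note that the experiment actually ran
    return len(config["classifier"])  # placeholder for a real score


config = {"dataset": "wine", "classifier": "SVC"}
first = run_cached(config, slow_experiment)
second = run_cached(config, slow_experiment)  # served from the cache
changed = run_cached({**config, "classifier": "RandomForest"}, slow_experiment)
print(first, second, changed, len(calls))
```

The second call with the same configuration does no work, while changing a single parameter produces a new key and triggers a fresh run.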


In [12]:
# Show avg. accuracy in percentage (note that we multiply by 100 in the experiment block)
avg_accs = np.round([result.inner for result in results], 2)
print(*avg_accs, sep="\n")


26.77
93.71
96.33
26.77
93.38
95.99
26.77
93.6
94.6
80.84
96.67
66.35
80.29
97.22
97.76
80.29
97.76
98.33
97.19
96.31
90.87
97.54
95.96
97.72
97.19
95.96
97.54


## Show Runtime

- The duration of each task is recorded automatically.
- It is saved as a `datetime.timedelta`.
- The runtime is preserved for cached results.
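
As a rough illustration of what recording a duration as a `datetime.timedelta` looks like, the sketch below times an arbitrary call with the standard library. The `timed` helper is hypothetical and is not part of Memento.

```python
import time
from datetime import timedelta


def timed(func, *args):
    """Run `func` and return (result, elapsed), with elapsed as a timedelta."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = timedelta(seconds=time.perf_counter() - start)
    return result, elapsed


result, runtime = timed(sum, range(1000))
print(result, runtime.total_seconds())

# total_seconds() folds days, seconds and microseconds into a single float.
print(timedelta(minutes=1, milliseconds=500).total_seconds())
```

Storing the duration as a `timedelta` keeps the unit explicit, and `total_seconds()` converts it to a plain float when a number is needed for sorting or plotting.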

In [13]:
# Convert each timedelta to seconds
time_deltas = [result.runtime.total_seconds() for result in results]
print(*time_deltas, sep='\n')


0.993093
1.316123
0.487049
0.980092
1.447137
0.370038
1.315127
1.386133
0.50805
0.390038
0.814079
0.037004
0.430046
0.780074
0.031003
0.385037
0.718068
0.083009
0.706064
0.881084
0.118013
0.770074
1.232118
0.054007
0.750072
1.213115
0.084006
