# Materials Project time splits for materials generative benchmarking

In this notebook, we install the `matbench-genmetrics` package, which provides the `mp_time_split` module, and run through the following examples:
1. accessing the cross-validation folds and final train/test split
2. "fitting" a DummyGenerator model and comparing to validation data
3. evaluating cross-validated model accuracy
4. hyperparameter optimization of generator statistic(s)

## Installation

In [None]:
%pip install matbench-genmetrics

## Access the data and the data splits

We will use the `MPTimeSplit` class as the main interface with the benchmark dataset in
each of the examples.

In [None]:
from matbench_genmetrics.mp_time_split.splitter import MPTimeSplit

We use the default `"TimeSeriesSplit"` cross-validation splitting scheme. 
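
To make the splitting scheme concrete, here is a minimal pure-Python sketch of an expanding-window split over time-ordered indices. It assumes that the `"TimeSeriesSplit"` mode behaves like scikit-learn's `TimeSeriesSplit` with default arguments. The helper name `expanding_window_splits` and the toy sizes are hypothetical and are not part of `mp_time_split`.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, val_indices) pairs over time-ordered samples.

    Assumption: mirrors the default behaviour of scikit-learn's
    TimeSeriesSplit (equal-sized validation blocks, growing training set).
    """
    if n_splits >= n_samples:
        raise ValueError("n_splits must be smaller than n_samples")
    val_size = n_samples // (n_splits + 1)
    first_val_start = n_samples - n_splits * val_size
    for start in range(first_val_start, n_samples, val_size):
        train = list(range(start))  # everything older than the validation block
        val = list(range(start, start + val_size))
        yield train, val


# Toy example: 12 entries sorted from oldest to newest.
splits = list(expanding_window_splits(12, n_splits=5))
for fold, (train, val) in enumerate(splits):
    print(f"fold {fold}: train={train} val={val}")
```

Each training set contains only entries older than its validation block, so a model never sees data "from the future" of the fold it is validated on.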

We specify `"energy_above_hull"` as the target, which is returned in the `train_outputs`,
`val_outputs`, and `test_outputs` `Series`. The target variable is excluded from the
corresponding `train_inputs`, `val_inputs`, and `test_inputs` variables
to prevent data leakage during conditional generation, regression/classification, and
hyperparameter optimization.
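
The leakage argument can be illustrated with a small, library-free sketch in which the target is removed from each record before the inputs are handed to a model. The record fields, the values, and the `split_inputs_outputs` helper are hypothetical and only mimic the idea described above; this is not how `MPTimeSplit` is implemented.

```python
def split_inputs_outputs(records, target):
    """Separate the target field from every record."""
    inputs, outputs = [], []
    for record in records:
        # Keep every field except the target so a model cannot "peek" at it.
        features = {k: v for k, v in record.items() if k != target}
        inputs.append(features)
        outputs.append(record[target])
    return inputs, outputs


# Hypothetical records; values are made up for illustration only.
records = [
    {"formula": "A2B", "energy_above_hull": 0.00},
    {"formula": "AB3", "energy_above_hull": 0.12},
]
inputs, outputs = split_inputs_outputs(records, target="energy_above_hull")
print(inputs)
print(outputs)
```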

In [None]:
mpt = MPTimeSplit(mode="TimeSeriesSplit", target="energy_above_hull")

We load the full snapshot dataset (~30 MB compressed). To load and work with a much smaller dummy
dataset (~10 kB), set `dummy=True`.

In [None]:
mpt.load(dummy=False)

Similar to Matbench, we loop through each of the folds of the train and validation
splits and can also access the final train/test split.

In [None]:
for fold in mpt.folds:
    train_inputs, val_inputs, train_outputs, val_outputs = mpt.get_train_and_val_data(
        fold
    )
final_inputs, test_inputs, final_outputs, test_outputs = mpt.get_final_test_data()


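## "Fit" a `DummyGenerator` and compare to validation data

We "fit" a `DummyGenerator` to each training fold of the small dummy dataset and generate 100 structures. The comparison against the validation inputs is left as a placeholder.

In [None]: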
from matbench_genmetrics.mp_time_split.splitter import MPTimeSplit
from matbench_genmetrics.mp_time_split.utils.gen import DummyGenerator

mpt = MPTimeSplit(target="energy_above_hull")
mpt.load(dummy=True)

for fold in mpt.folds:
    train_inputs, val_inputs, train_outputs, val_outputs = mpt.get_train_and_val_data(
        fold
    )
    dg = DummyGenerator()
    dg.fit(train_inputs)
    generated_structures = dg.gen(n=100)
    # compare generated_structures and val_inputs
    # some_code_here

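
The comparison step is left open in the cell above. As one possible direction, the sketch below computes the fraction of validation entries that also appear among the generated ones. It assumes each structure has already been reduced to a hashable fingerprint (plain strings here); the `match_rate` helper and the fingerprints are hypothetical, and a real comparison of crystal structures would need a structure-aware matcher instead of exact string equality.

```python
def match_rate(val_fingerprints, gen_fingerprints):
    """Fraction of unique validation fingerprints found among the generated ones."""
    val_set = set(val_fingerprints)
    if not val_set:
        return 0.0
    return len(val_set & set(gen_fingerprints)) / len(val_set)


# Hypothetical fingerprints standing in for validation and generated structures.
val_fingerprints = ["A2B", "AB3", "ABC", "A3C"]
gen_fingerprints = ["AB3", "A3C", "B2C", "AB3"]
rate = match_rate(val_fingerprints, gen_fingerprints)
print(rate)
```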


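## Evaluate cross-validated model accuracy

We fit a scikit-learn `DummyRegressor` that always predicts the mean training target, compute the mean absolute error (MAE) on each validation fold, and average the MAEs across folds.

In [None]: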
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

from matbench_genmetrics.mp_time_split.splitter import MPTimeSplit

model = DummyRegressor(strategy="mean")

mpt = MPTimeSplit(target="energy_above_hull")
mpt.load(dummy=False)

maes = []
for fold in mpt.folds:
    train_inputs, val_inputs, train_outputs, val_outputs = mpt.get_train_and_val_data(
        fold
    )
    model.fit(train_inputs, train_outputs)
    predictions = model.predict(val_inputs)
    mae = mean_absolute_error(val_outputs, predictions)
    maes.append(mae)

np.mean(maes)
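
For intuition about what the cell above computes, here is a library-free sketch of the same calculation: predict the training mean for every validation entry, take the MAE per fold, then average over folds. The target values and the `mean_baseline_mae` helper are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean


def mean_baseline_mae(train_targets, val_targets):
    """MAE on the validation targets when always predicting the training mean."""
    prediction = mean(train_targets)
    return mean(abs(y - prediction) for y in val_targets)


# Hypothetical (train, validation) target values for two folds.
folds = [
    ([0.0, 0.2], [0.1, 0.3]),
    ([0.0, 0.2, 0.1, 0.3], [0.25, 0.05]),
]
fold_maes = [mean_baseline_mae(train, val) for train, val in folds]
cv_mae = mean(fold_maes)
print(fold_maes, cv_mae)
```

A mean-predicting baseline like this gives a floor that any real regression model should beat.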


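## Hyperparameter optimization of generator statistic(s)

We wrap the cross-validation loop in a `fit_and_evaluate` function that takes a `parameterization` dictionary and returns the fold-averaged metric, so that a hyperparameter optimizer can call it. The `compare` function is a placeholder that returns a random number.

In [None]: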
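import numpy as np

from matbench_genmetrics.mp_time_split.splitter import MPTimeSplit
from matbench_genmetrics.mp_time_split.utils.gen import DummyGenerator

mpt = MPTimeSplit(target="energy_above_hull")
mpt.load(dummy=True)

def compare(inputs, gen_inputs):
    # Placeholder metric: ignores both arguments and returns a random score.
    # Replace with a real comparison of validation and generated structures.
    return np.random.rand()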

def fit_and_evaluate(parameterization):
    metrics = []
    for fold in mpt.folds:
        train_inputs, val_inputs, _, _ = mpt.get_train_and_val_data(
            fold
        )
        dg = DummyGenerator(**parameterization)
        dg.fit(train_inputs)
        generated_structures = dg.gen(n=100)
        # compare generated_structures and val_inputs
        metric = compare(val_inputs, generated_structures)
        metrics.append(metric)
    avg_metric = np.mean(metrics)
    return avg_metric
        
parameterization = {}

fit_and_evaluate(parameterization)

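
Once the objective is wrapped in a single function, any optimizer can drive it. The sketch below shows an exhaustive grid search as the simplest possible case. To stay self-contained, it swaps in a deterministic stand-in, `fit_and_evaluate_stub`, for the notebook's `fit_and_evaluate`. The parameter names `x` and `y` are hypothetical and are not actual `DummyGenerator` arguments, and the sketch assumes that a lower metric is better.

```python
from itertools import product


def fit_and_evaluate_stub(parameterization):
    """Stand-in for `fit_and_evaluate`: a deterministic toy score (lower is better)."""
    return (parameterization["x"] - 2) ** 2 + abs(parameterization["y"] - 0.5)


# Hypothetical search space; `x` and `y` are not real DummyGenerator arguments.
search_space = {"x": [1, 2, 3], "y": [0.0, 0.5, 1.0]}

# Enumerate every combination of parameter values as a dict.
names = list(search_space)
candidates = [dict(zip(names, values)) for values in product(*search_space.values())]

# Score each candidate and keep the one with the lowest score.
scores = [fit_and_evaluate_stub(candidate) for candidate in candidates]
best_score, best_params = min(zip(scores, candidates), key=lambda pair: pair[0])
print(best_params, best_score)
```

In practice, the stub would be replaced by `fit_and_evaluate`, and the grid by whichever optimizer suits the size of the search space.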
