# Welcome to the seventh MAST-ML tutorial notebook:

# Model predictions with guide rails in MAST-ML!

## In this notebook, we will learn how to perform simple checks on our test data:

1. [Set up MAST-ML on Colab and begin session](#task1)
2. [Fit models and check elemental spaces](#task2)
3. [Fit models and check Gaussian Process Error Bars](#task3)
4. [Fit models and check Domain with MADML](#task4)
5. [Fit models and predict Domain](#task5)

## Task 1: Set up MAST-ML on Colab and begin session <a name="task1"></a>

First, install MAST-ML and its dependencies.

In [None]:
!pip install mastml

Mount Google Drive to save output from runs.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Import the packages and functions needed to run the examples.

In [4]:
from mastml.mastml import Mastml
from mastml.datasets import LocalDatasets
from mastml.models import SklearnModel, EnsembleModel
from mastml.preprocessing import SklearnPreprocessor
from mastml.data_splitters import SklearnDataSplitter, NoSplit, LeaveOutPercent
from mastml.feature_selectors import EnsembleModelFeatureSelector
from mastml.mastml_predictor import make_prediction
from pathlib import Path
import mastml
import subprocess
import glob
import os
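# Locate the example data shipped with mastml. If mastml is loaded as a
# namespace package, __path__ is not a plain list and exposes _path instead.
try:
    data_path = os.path.join(mastml.__path__._path[0], 'data')
except AttributeError:
    data_path = os.path.join(mastml.__path__[0], 'data')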


Figshare is an optional dependency. To import data from figshare, manually install figshare via git clone of git clone https://github.com/cognoma/figshare.git
scikit-lego is an optional dependency, enabling use of the LowessRegression model. If you want to use this model, do "pip install scikit-lego"
linear-tree is an optional dependency, enabling use of Linear tree, forest, and boosting models. If you want to use this model, do "pip install linear-tree"
gplearn is an optional dependency, enabling the use of genetic programming SymbolicRegressor model. If you want to use this model, do "pip install gplearn"




CBFV is an optional dependency. To install CBFV, do pip install cbfv
DeepChem is an optional dependency used to generate molecular descriptors from RDKit. To install Deepchem, do pip install deepchem


Define the path to save data.

In [5]:
SAVEPATH = 'drive/MyDrive/MASTML_tutorial_7_ModelPredictions_with_Guide_Rails'

mastml_instance = Mastml(savepath=SAVEPATH)
savepath = mastml_instance.get_savepath

Load the standard diffusion dataset.

In [6]:
target = 'E_regression'

extra_columns = ['Material compositions 1', 'Material compositions 2', 'Hop activation barrier']
d = LocalDatasets(file_path=data_path+'/diffusion_data_allfeatures.xlsx',
                  target=target,
                  extra_columns=extra_columns,
                  group_column='Material compositions 1',
                  testdata_columns=None,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()

# Let's assign each data object to its respective name
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']
X_testdata = data_dict['X_testdata']

metrics = [
           'r2_score',
           'mean_absolute_error',
           'root_mean_squared_error',
           'rmse_over_stdev',
           ]




## Task 2: Fit models and check elemental spaces <a name="task2"></a>

Set up a machine learning run that checks whether the elements in each test case were observed in the training set. If all elements of a test case are observed in the training set, the case is marked as "in_domain" = 1. If only some of its elements are observed in the training data, the case is marked as "maybe_in_domain" = 0. If none of its elements are observed in the training data, the case is flagged as "out_of_domain" = -1.
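The three-way labeling rule can be illustrated with a small standalone sketch. This is not MAST-ML's internal code: the function name `elemental_domain` and the element lists are hypothetical, and the sketch assumes each case is described simply by the list of elements it contains.

```python
def elemental_domain(train_elements, test_elements):
    """Return 1, 0, or -1 following the in/maybe/out convention described above.

    Hypothetical helper for illustration only; not part of MAST-ML.
    """
    train = set(train_elements)
    test = set(test_elements)
    seen = test & train  # test elements that also appear in training
    if seen == test:
        return 1   # in_domain: every test element was observed in training
    if seen:
        return 0   # maybe_in_domain: only some test elements were observed
    return -1      # out_of_domain: no test element was observed


train = ["Al", "Cu", "Ni"]
print(elemental_domain(train, ["Al", "Cu"]))  # all elements seen
print(elemental_domain(train, ["Al", "Zr"]))  # only Al seen
print(elemental_domain(train, ["Zr", "W"]))   # nothing seen
```

With a random split, most host and solute elements end up on both sides of the split, which is why the first case (label 1) dominates in the run below.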

In [8]:
preprocessor = SklearnPreprocessor(
                                   preprocessor='StandardScaler',
                                   as_frame=True,
                                   )

model = SklearnModel(model='RandomForestRegressor')

splitter = SklearnDataSplitter(
                               splitter='RepeatedKFold',
                               n_repeats=1,
                               n_splits=5,
                               )
splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[('elemental', groups)],
                  )

Running splits: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:35<00:00,  4.30s/it]


If you examine domains_test.csv, you'll see that all data points are marked as in_domain (= 1). This is because the test data
were randomly held out, virtually guaranteeing that the same set of elements is present in the train and test sets. Below,
we examine the extreme case of leaving out groups, where each group is a different host element. After running this, compare
the values in domains_train and domains_test: the test data are all marked as out of domain!

In [10]:
splitter = SklearnDataSplitter(
                               splitter='LeaveOneGroupOut',
                               )
splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  groups=groups,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[('elemental', groups)],
                  )

Running splits: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [01:39<00:00,  6.64s/it]




## Task 3: Fit models and check Gaussian Process Error Bars <a name="task3"></a>

Here we use the error bars inherent to Gaussian Process Regression (GPR) to decide whether a case should be flagged as worrisome. Through 5-fold cross validation, we obtain the maximum GPR uncertainty on the training data and compare it to the uncertainty of each test case. If the test case uncertainty is greater than the maximum training uncertainty, we mark the observation as "out_of_domain"; otherwise, we mark it as "in_domain".
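The thresholding rule for the GPR check can be sketched in plain Python. This is only an illustration of the comparison described for this task, not MAST-ML's implementation: the helper name `gpr_domain_flags` and the numbers are made up, and the standard deviations are assumed to come from a fitted GPR model (for example, scikit-learn's `GaussianProcessRegressor.predict(X, return_std=True)`).

```python
def gpr_domain_flags(train_stds, test_stds):
    """Flag each test case by comparing its predicted standard deviation
    against the largest standard deviation seen during cross validation.

    Hypothetical helper for illustration only; not part of MAST-ML.
    """
    threshold = max(train_stds)
    return [
        "out_of_domain" if std > threshold else "in_domain"
        for std in test_stds
    ]


# Made-up uncertainties: values collected over the CV folds, then new cases.
cv_stds = [0.12, 0.30, 0.25, 0.18]
new_stds = [0.10, 0.30, 0.45]
print(gpr_domain_flags(cv_stds, new_stds))
```

Only the last case exceeds the maximum training uncertainty of 0.30, so it is the only one flagged as out of domain; a case exactly at the threshold stays in domain because the comparison is strict.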

In [12]:
splitter = SklearnDataSplitter(
                               splitter='RepeatedKFold',
                               n_repeats=1,
                               n_splits=5,
                               )
splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=['gpr'],
                  )

Running splits: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:40<00:00, 32.14s/it]


## Task 4: Fit models and check domain with MADML  <a name="task4"></a>

Fit the models needed for domain evaluation using the kernel density estimate (KDE) approach in the materials application domain for machine learning (MADML) package. We use the bare minimum of inputs here and mostly rely on default parameters. Keep in mind that feature selection may be needed for good results with the MADML package.
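To build intuition for why a KDE is useful for domain checks, the sketch below estimates a one-dimensional Gaussian kernel density from a handful of training values and compares a query near the training data with one far from it. It is a toy illustration under stated assumptions, not MADML's algorithm: the function name, the single feature, and the bandwidth value are all hypothetical.

```python
import math


def gaussian_kde_density(train_points, query, bandwidth=1.0):
    """Evaluate a 1-D Gaussian kernel density estimate at `query`.

    The estimate is the average of Gaussian kernels centered on each
    training point. Hypothetical helper for illustration only.
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(
        math.exp(-0.5 * ((query - point) / bandwidth) ** 2)
        for point in train_points
    )
    return norm * total / len(train_points)


train_feature = [0.0, 0.2, 0.4, 0.5, 0.7]
near = gaussian_kde_density(train_feature, 0.3, bandwidth=0.5)
far = gaussian_kde_density(train_feature, 5.0, bandwidth=0.5)
print(f"density near training data: {near:.4f}")
print(f"density far from training data: {far:.3e}")
```

A query that falls where training data are dense gets a high density, while a query far from all training points gets a density close to zero, which is the kind of signal a KDE-based domain measure can build on.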

In [13]:
splitter = NoSplit() # Want to use NoSplit because MADML internally performs nested CV methods

params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:07<00:00, 127.94s/it]


Supply a starting guess for the coefficients of the polynomial uncertainty model here.
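The coefficient list follows the form c0 + c1*x + c2*x^2 + ... + cn*x^n noted in the code comment of the next cell. As a minimal sketch of how such a list maps to a function, the hypothetical helper below evaluates the polynomial for a given x. How MADML uses or refines the starting guess is not shown here.

```python
import math


def polynomial_uncertainty(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... for coeffs = [c0, c1, c2, ...].

    Hypothetical helper for illustration only; not part of MADML.
    """
    return sum(c * x ** power for power, c in enumerate(coeffs))


coeffs = [0.0, 2.0, 0.1]  # same values as the starting guess used in this notebook
value = polynomial_uncertainty(coeffs, 2.0)
print(value)
assert math.isclose(value, 4.4)  # 0.0 + 2.0*2 + 0.1*4
```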

In [14]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['uq_coeffs'] = [0.0, 2.0, 0.1]  # Starting guess for coefficients in c0+c1*x+c2*x^2+...+cn*x^n
domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:08<00:00, 128.44s/it]


Suppose we do not want the uncertainty model coefficients to be adjusted. We can instead pass a fixed function for the uncertainty model as follows:

In [15]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['uq_function'] = lambda x: 0.7+1.05*x  # Note that this is fixed across folds
domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:14<00:00, 134.96s/it]


We can also alter the number of clusters used in agglomerative clustering to make our splits.
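To illustrate how cluster labels can be turned into splits, the sketch below holds out one cluster at a time. It assumes a leave-one-cluster-out scheme, which is an assumption on our part and not confirmed by this notebook. The helper name and labels are hypothetical; in practice the labels could come from scikit-learn's `AgglomerativeClustering(n_clusters=k).fit_predict(X)`.

```python
def leave_one_cluster_out(labels):
    """Yield (train_indices, test_indices), holding out one cluster per split.

    Hypothetical helper for illustration only; not part of MADML.
    """
    for held_out in sorted(set(labels)):
        train_idx = [i for i, lab in enumerate(labels) if lab != held_out]
        test_idx = [i for i, lab in enumerate(labels) if lab == held_out]
        yield train_idx, test_idx


# Made-up cluster assignments for six data points and three clusters.
labels = [0, 0, 1, 2, 1, 0]
for train_idx, test_idx in leave_one_cluster_out(labels):
    print("train:", train_idx, "test:", test_idx)
```

Under this scheme, each requested cluster count contributes as many splits as it has clusters, so adding entries to the list increases the run time.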

In [16]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['n_clusters'] = [2, 3, 4]  # A list of the number of clusters in each split

domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:46<00:00, 166.63s/it]


We can also change the ground truth used for the residual and uncertainty domain tests. Note that several parameters can be combined in the same dictionary.

In [17]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['gt_residual'] = 0.75  # Ground truth for residual test
params['gt_uncertainty'] = 0.5  # Ground truth for uncertainty test

domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:16<00:00, 136.54s/it]


We can also change the number of bins used for our uncertainty test.

In [18]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['bins'] = 5

domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:07<00:00, 127.99s/it]


The kernel parameters, such as the bandwidth and kernel type, can also be changed.

In [19]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['bandwidth'] = 1.5
params['kernel'] = 'gaussian'

domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:04<00:00, 124.57s/it]
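
The `bandwidth` and `kernel` entries passed to the MADML domain above presumably configure a kernel density estimate over the feature space. That reading is an assumption based on the parameter names, since the MADML internals are not shown in this notebook. As a rough, standard-library-only sketch of the idea (not MADML's implementation), the toy example below builds a Gaussian kernel density estimate with `bandwidth = 1.5` and shows that a point far from the training data receives a much lower density than a point near it. The helper `gaussian_kde_density` and the toy data are hypothetical.

```python
import math
import random


def gaussian_kde_density(point, samples, bandwidth):
    """Average of isotropic Gaussian kernels centred on each training sample."""
    dim = len(point)
    # Normalisation constant of a dim-dimensional isotropic Gaussian
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for sample in samples:
        sq_dist = sum((p - q) ** 2 for p, q in zip(point, sample))
        total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
    return total / (len(samples) * norm)


# Hypothetical 2D "training features" drawn around the origin
random.seed(0)
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]

bandwidth = 1.5  # same value as params['bandwidth'] above
density_near = gaussian_kde_density((0.0, 0.0), train, bandwidth)
density_far = gaussian_kde_density((10.0, 10.0), train, bandwidth)

print(density_near, density_far)
```

A larger bandwidth smooths the density, so distant points look less unusual. A smaller one makes the estimate more sensitive to gaps in the training data.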


## Task 5: Fit models and predict Domain <a name="task5"></a>

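Here, we fit both the GPR and MADML domain models. We use `NoSplit` so that the models are fit on the full dataset and can then be reused for prediction.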

In [20]:
splitter = NoSplit()

params = {'n_repeats': 2}  # Increase if more averaging needed for convergence
domain = ('madml', params)

splitter.evaluate(
                  savepath=savepath,
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain, 'gpr'],
                  )

Running splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [05:17<00:00, 317.04s/it]
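
The `'gpr'` entry in the `domain` list above relies on Gaussian process error bars. As a conceptual sketch only (not MAST-ML's implementation), the NumPy example below computes the posterior standard deviation of a Gaussian process with a squared-exponential kernel and flags a test point as in-domain when that standard deviation falls below a cutoff. The kernel, the noise level, the `threshold` value, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1D points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


# Hypothetical 1D training inputs covering the interval [0, 5]
x_train = np.linspace(0.0, 5.0, 20)
# One test point inside the training range, one far outside it
x_test = np.array([2.5, 12.0])

noise = 1e-2  # assumed noise variance added to the kernel diagonal
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_test, x_train)

# Posterior variance: k(x*, x*) - k_s K^{-1} k_s^T, with prior variance 1
solved = np.linalg.solve(K, K_s.T)
post_var = 1.0 - np.sum(K_s * solved.T, axis=1)
post_std = np.sqrt(np.clip(post_var, 0.0, None))

threshold = 0.5  # hypothetical cutoff on the error bar
in_domain = post_std < threshold

print(post_std, in_domain)
```

The error bar stays small where training points are dense and grows back toward the prior value of 1 far from them, which is what makes it usable as a simple domain check.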


Use the fitted models to predict on sample data. Here we predict on the same data we trained on, but `X_test` can be any other data with the same features.

In [26]:
import pandas as pd

path_fullfit = splitter.splitdir

model_path = os.path.join(path_fullfit, 'RandomForestRegressor.pkl')
preprocessor_path = os.path.join(path_fullfit, 'StandardScaler.pkl')
domain_path = list(map(str, Path(path_fullfit).rglob('domain_*.pkl')))

pred_df = make_prediction(
                          X_train=X,
                          y_train=pd.DataFrame(y),
                          X_test=X,
                          model=model_path,
                          preprocessor=preprocessor_path,
                          domain=domain_path,
                          )

print(pred_df)

       y_pred     y_err    y_true  domain_gpr    y_pred    d_pred  \
0   -0.004018  0.035633  0.000000           1 -0.008182  0.884227   
1    0.067530  0.228834 -0.090142           1  0.084015  0.251691   
2    0.298510  0.235262  0.259139           1  0.269200  0.283408   
3   -0.031400  0.053418 -0.022200           1 -0.026809  0.727743   
4    0.297986  0.181234  0.317672           1  0.257376  0.257772   
..        ...       ...       ...         ...       ...       ...   
403  0.030676  0.255480 -0.067020           1 -0.008405  0.816549   
404  0.181463  0.294506  0.153850           1  0.163948  0.820780   
405  0.214250  0.117850  0.248110           1  0.215429  0.800442   
406  0.177905  0.293322  0.204140           1  0.167494  0.919245   
407  0.166161  0.178316  0.248040           1  0.145869  0.934091   

     y_stdu_pred  y_stdc_pred rmse/std_y Domain Prediction from Max F1  \
0       0.056104     0.053439                                       ID   
1       0.251117     0.
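
The `rglob('domain_*.pkl')` call in the cell above collects every saved domain model under the split directory, including files in nested subfolders. The following self-contained sketch shows that behaviour on a temporary directory. The file and folder names are hypothetical placeholders, not the exact names MAST-ML writes.

```python
import pickle
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'nested').mkdir()  # hypothetical subfolder

    # Hypothetical pickle files: two domain models and one preprocessor
    for name in ['domain_madml.pkl', 'nested/domain_gpr.pkl', 'StandardScaler.pkl']:
        with open(root / name, 'wb') as f:
            pickle.dump({'placeholder': name}, f)

    # rglob searches recursively, so the nested domain file is found as well
    found = sorted(p.name for p in root.rglob('domain_*.pkl'))

print(found)
```

Only files whose names match the `domain_*.pkl` pattern are returned, so the model and preprocessor pickles are loaded separately through their own paths.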

We can also change the default thresholds for prediction and add our own.