# Welcome to the eighth MAST-ML tutorial notebook,

# Model predictions with guide rails using MAST-ML!

## In this notebook, we will learn how to perform simple checks on our test data:

1. [Set up MAST-ML on Colab and begin session](#task1)
2. [Fit models and check elemental spaces](#task2)
3. [Fit models and check Gaussian Process Error Bars](#task3)
4. [Fit models and check Domain with MADML](#task4)
5. [Fit models and predict Domain](#task5)

## Task 1: Set up MAST-ML on Colab and begin session <a name="task1"></a>

First, we need to install the dependencies.

In [None]:
!pip install git+https://github.com/uw-cmg/MAST-ML/@dev_lane
!pip install pyyaml==5.4.1

Mount Google Drive to save output from runs.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Import the packages and functions needed to run the example.

In [None]:
from mastml.mastml import Mastml
from mastml.datasets import LocalDatasets
from mastml.models import SklearnModel, EnsembleModel
from mastml.preprocessing import SklearnPreprocessor
from mastml.data_splitters import SklearnDataSplitter, NoSplit, LeaveOutPercent
from mastml.feature_selectors import EnsembleModelFeatureSelector
from mastml.mastml_predictor import make_prediction
from pathlib import Path
import mastml
import subprocess
import glob
import os
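# Locate the data folder shipped with MAST-ML. Namespace packages expose
# their location through __path__._path, regular packages through __path__.
try:
    data_path = os.path.join(mastml.__path__._path[0], 'data')
except AttributeError:
    data_path = os.path.join(mastml.__path__[0], 'data')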


Define the path to save data.

In [None]:
SAVEPATH = 'drive/MyDrive/MASTML_tutorial_8_ModelPredictions_with_Guide_Rails'

mastml_instance = Mastml(savepath=SAVEPATH)
savepath = mastml_instance.get_savepath

Load the standard diffusion dataset.

In [None]:
target = 'E_regression'

extra_columns = ['Material compositions 1', 'Material compositions 2', 'Hop activation barrier']
d = LocalDatasets(file_path=data_path+'/diffusion_data_allfeatures.xlsx',
                  target=target,
                  extra_columns=extra_columns,
                  group_column='Material compositions 1',
                  testdata_columns=None,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()

# Let's assign each data object to its respective name
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']
X_testdata = data_dict['X_testdata']

metrics = [
           'r2_score',
           'mean_absolute_error',
           'root_mean_squared_error',
           'rmse_over_stdev',
           ]
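
To make the last metric in this list concrete, here is a minimal sketch of an RMSE-over-standard-deviation calculation. It assumes the metric is the root mean squared error divided by the population standard deviation of the true values; the function below is a hypothetical illustration, not the MAST-ML implementation, which may normalize differently.

```python
import math
import statistics


def rmse_over_stdev(y_true, y_pred):
    """RMSE divided by the population standard deviation of y_true (assumed definition)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse / statistics.pstdev(y_true)


y_true_demo = [1.0, 2.0, 3.0, 4.0]

# A model that always predicts the mean scores exactly 1
ratio_mean_model = rmse_over_stdev(y_true_demo, [2.5, 2.5, 2.5, 2.5])

# A model with one poor prediction scores below 1, so it still beats the mean
ratio_demo = rmse_over_stdev(y_true_demo, [1.0, 2.0, 3.0, 6.0])
```

Under this definition, values well below 1 indicate a model that does better than simply predicting the mean of the data.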


## Task 2: Fit models and check elemental spaces <a name="task2"></a>

Set up a machine learning run that checks whether the elements in a test case were observed in the training set. If all elements of the test case are observed in the training set, the case is marked as "in_domain". If only some of its elements are observed in the training data, the case is marked as "maybe_in_domain". If none of its elements are observed in the training data, the case is flagged as "out_of_domain".

In [None]:
preprocessor = SklearnPreprocessor(
                                   preprocessor='StandardScaler',
                                   as_frame=True,
                                   )

model = SklearnModel(model='RandomForestRegressor')

splitter = SklearnDataSplitter(
                               splitter='RepeatedKFold',
                               n_repeats=10,
                               n_splits=5,
                               )
splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[('elemental', groups)],
                  )
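
The three labels used by the elemental check can be illustrated with a small standalone sketch. It assumes compositions are plain formula strings and extracts element symbols with a simple regular expression; `elements_in` and `elemental_domain` are hypothetical helper names, not part of MAST-ML.

```python
import re


def elements_in(formula):
    """Return the set of element symbols found in a formula string such as 'Fe2O3'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def elemental_domain(train_formulas, test_formula):
    """Label a test formula by how many of its elements appear in the training set."""
    train_elements = set().union(*(elements_in(f) for f in train_formulas))
    test_elements = elements_in(test_formula)
    seen = test_elements & train_elements
    if seen == test_elements:
        return "in_domain"        # every test element was seen in training
    if seen:
        return "maybe_in_domain"  # only some test elements were seen
    return "out_of_domain"        # no test element was seen


train_formulas = ["AgCu", "AlNi"]
labels_demo = [elemental_domain(train_formulas, f) for f in ("AgNi", "AgZr", "ZrW")]
```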

## Task 3: Fit models and check Gaussian Process Error Bars <a name="task3"></a>

Here we use the error bars inherent to Gaussian Process Regression (GPR) to decide whether a case should be flagged as worrisome. Through 5-fold cross validation, we obtain the maximum uncertainty from GPR and compare it to the test cases. If the uncertainty of a test case is greater than the maximum training uncertainty, we mark the observation as "out_of_domain"; otherwise we mark it as "in_domain".

In [None]:
splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=['gpr'],
                  )
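
The GPR check in this task can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the MAST-ML implementation: the kernel is fixed (no hyperparameter optimization) to keep the example deterministic, the data are synthetic, and `make_gpr` and `gpr_domain` are hypothetical helper names.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import KFold


def make_gpr():
    # Fixed kernel (optimizer=None) keeps the sketch deterministic and fast
    kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01)
    return GaussianProcessRegressor(kernel=kernel, optimizer=None)


def gpr_domain(X_train, y_train, X_test, n_splits=5, seed=0):
    """Flag test rows whose GPR std exceeds the largest std seen in cross validation."""
    max_std = 0.0
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, held_idx in cv.split(X_train):
        gpr = make_gpr().fit(X_train[fit_idx], y_train[fit_idx])
        _, std = gpr.predict(X_train[held_idx], return_std=True)
        max_std = max(max_std, float(std.max()))

    # Refit on all training data, then compare the test uncertainties to the threshold
    gpr = make_gpr().fit(X_train, y_train)
    _, test_std = gpr.predict(X_test, return_std=True)
    return np.where(test_std > max_std, "out_of_domain", "in_domain")


rng = np.random.default_rng(0)
X_train_demo = rng.uniform(0.0, 1.0, size=(40, 1))
y_train_demo = np.sin(6.0 * X_train_demo[:, 0])
X_test_demo = np.array([[0.5], [25.0]])  # one point inside the training range, one far outside
gpr_labels = gpr_domain(X_train_demo, y_train_demo, X_test_demo)
```

The point far from the training data receives the full prior uncertainty, which is larger than any held-out uncertainty seen during cross validation, so it is flagged as "out_of_domain".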

## Task 4: Fit models and check domain with MADML  <a name="task4"></a>

Fit the kinds of models that we need for domain evaluation. We use the bare minimum inputs here and mostly use default parameters. Keep in mind that feature selection may be needed for good results using the MADML package.

In [None]:
splitter = NoSplit()

params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

Use a specific polynomial uncertainty model by supplying starting values for its coefficients here.

In [None]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['uq_coeffs'] = [0.0, 2.0, 0.1]  # Starting guess for coefficients in c0+c1*x+c2*x^2+...+cn*x^n
domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )
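
For reference, the polynomial form noted in the `uq_coeffs` comment, c0+c1*x+c2*x^2+...+cn*x^n, can be evaluated as in the sketch below. The helper `polynomial_uncertainty` is hypothetical, and treating `x` as an uncalibrated uncertainty estimate is an assumption made for illustration.

```python
def polynomial_uncertainty(x, coeffs):
    """Evaluate c0 + c1*x + c2*x^2 + ... + cn*x^n with coefficients in increasing order."""
    return sum(c * x ** power for power, c in enumerate(coeffs))


uq_coeffs_demo = [0.0, 2.0, 0.1]  # same starting guess as in the cell above

# x is assumed here to be an uncalibrated uncertainty estimate
calibrated_demo = [polynomial_uncertainty(x, uq_coeffs_demo) for x in (0.0, 0.5, 1.0)]
```

With these coefficients, an input of 1.0 maps to 0.0 + 2.0 + 0.1 = 2.1.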

Let's say that we do not want the uncertainty model coefficients to be modified. We can pass a specific function for the model as follows:

In [None]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['uq_function'] = lambda x: 0.7+1.05*x  # Note that this is fixed across folds
domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

We can also alter the number of clusters used in agglomerative clustering to make our splits.

In [None]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['n_clusters'] = [2, 3, 4]  # A list of the number of clusters in each split

domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )
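
To show what a list of cluster counts could mean for splitting, here is a minimal sketch that uses scikit-learn's `AgglomerativeClustering` and holds out one cluster at a time. The leave-one-cluster-out scheme, the synthetic data, and the helper name `cluster_splits` are assumptions made for illustration; the MADML package may build its splits differently.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_splits(X, n_clusters_list):
    """Return (n_clusters, train_idx, test_idx) tuples, holding out one cluster at a time."""
    splits = []
    for k in n_clusters_list:
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        for held_out in range(k):
            test_idx = np.flatnonzero(labels == held_out)
            train_idx = np.flatnonzero(labels != held_out)
            splits.append((k, train_idx, test_idx))
    return splits


rng = np.random.default_rng(0)
X_cluster_demo = rng.normal(size=(30, 3))  # synthetic features for illustration
splits_demo = cluster_splits(X_cluster_demo, [2, 3, 4])
```

Under this scheme, the list `[2, 3, 4]` yields 2 + 3 + 4 = 9 train/test splits.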

We can also change the ground truth for our domain tests. Note that we can also combine parameters.

In [None]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['gt_residual'] = 0.75  # Ground truth for residual test
params['gt_uncertainty'] = 0.5  # Ground truth for uncertainty test

domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

We can also change the number of bins used for our uncertainty test.

In [None]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['bins'] = 5

domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

The kernel parameters can also be changed.

In [None]:
params = {'n_repeats': 1}  # Increase if more averaging needed for convergence
params['bandwidth'] = 1.5
params['kernel'] = 'gaussian'

domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )
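
The effect of the kernel parameters can be explored outside MAST-ML with a small kernel density estimate. The sketch below assumes that `bandwidth` and `kernel` behave like the parameters of the same name in scikit-learn's `KernelDensity`; whether MADML forwards them unchanged is not confirmed here, and the data are synthetic.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_kde_demo = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # synthetic training features

# Same parameter values as in the cell above
kde = KernelDensity(kernel='gaussian', bandwidth=1.5).fit(X_kde_demo)

# Log-density at a point near the training data and at a point far from it
X_query_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
log_density_demo = kde.score_samples(X_query_demo)
```

A larger bandwidth smooths the density estimate, so points further from the training data still receive noticeable density. A smaller bandwidth makes the estimate more local.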

Let us now clean our work space.

In [None]:
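# Remove the run directories created above (their names start with 'Ran').
# The loop variable is named run_dir to avoid overwriting the dataset object d.
to_delete = glob.glob('Ran*')
for run_dir in to_delete:
    subprocess.run(['rm', '-rf', run_dir])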

## Task 5: Fit models and predict Domain <a name="task5"></a>

Here, we fit both the GPR and MADML domain models.

In [None]:
splitter = NoSplit()

params = {'n_repeats': 2}  # Increase if more averaging needed for convergence
domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain, 'gpr'],
                  )

Use the fitted models to predict on sample data. Here we predict on the data we trained on, but any other data could be used.

In [13]:
file_to_move = glob.glob('Ran*')[0]
subprocess.run(['mv', file_to_move, 'output'])
path_fullfit = './output/split_0'

model_path = os.path.join(path_fullfit, 'RandomForestRegressor.pkl')
preprocessor_path = os.path.join(path_fullfit, 'StandardScaler.pkl')
domain_path = list(map(str, Path(path_fullfit).rglob('domain_*.pkl')))

pred_df = make_prediction(
                          X_train=X,
                          y_train=y,
                          X_test=X,
                          model=model_path,
                          preprocessor=preprocessor_path,
                          domain=domain_path,
                          )

print(pred_df)

       y_pred     y_err  domain_gpr  Residual for 0.95  Residual for Max F1   
0   -0.003191  0.051607          -1                  0                    1  \
1    0.064101  0.224702          -1                  1                    1   
2    0.264911  0.115056          -1                  1                    1   
3   -0.042670  0.069718          -1                  0                    1   
4    0.258767  0.131331          -1                  1                    1   
..        ...       ...         ...                ...                  ...   
403 -0.022099  0.145523          -1                  0                    1   
404  0.143181  0.229243          -1                  0                    1   
405  0.211635  0.152288          -1                  0                    1   
406  0.185900  0.103158          -1                  0                    1   
407  0.161902  0.169626          -1                  0                    1   

     Uncertainty for 0.95  Uncertainty for Max F1  

We can also change the default thresholds for prediction and add our own.

In [14]:
thresholds = [('residual', 0.75), ('uncertainty', 0.2)]

pred_df = make_prediction(
                          X_train=X,
                          y_train=y,
                          X_test=X,
                          model=model_path,
                          preprocessor=preprocessor_path,
                          domain=domain_path,
                          madml_thresholds=thresholds
                          )

print(pred_df)

       y_pred     y_err  domain_gpr  Residual for 0.75  Uncertainty for 0.2   
0   -0.003191  0.051607          -1                  0                    0  \
1    0.064101  0.224702          -1                  1                    0   
2    0.264911  0.115056          -1                  1                    0   
3   -0.042670  0.069718          -1                  1                    0   
4    0.258767  0.131331          -1                  1                    0   
..        ...       ...         ...                ...                  ...   
403 -0.022099  0.145523          -1                  0                    0   
404  0.143181  0.229243          -1                  0                    0   
405  0.211635  0.152288          -1                  0                    0   
406  0.185900  0.103158          -1                  0                    0   
407  0.161902  0.169626          -1                  0                    0   

         dist  
0    0.884227  
1    0.251691  
2  