# Welcome to the eighth MAST-ML tutorial notebook:

# Model predictions with guide rails in MAST-ML!

## In this notebook, we will learn how to perform simple checks on our test data:

1. [Set up MAST-ML on Colab and begin session](#task1)
2. [Fit models and check elemental spaces](#task2)
3. [Fit models and check Gaussian Process Error Bars](#task3)
4. [Fit models and check domain with MADML](#task4)

## Task 1: Set up MAST-ML on Colab and begin session <a name="task1"></a>

We first need to install the dependencies.

In [None]:
!pip install git+https://github.com/uw-cmg/MAST-ML/@dev_lane
!pip install pyyaml==5.4.1

Mount Google Drive to save output from runs.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Import the packages and functions needed to run the example.

In [1]:
import sys
# Path to a local MAST-ML checkout; only needed if MAST-ML was not installed with pip
my_path = '/home/nerve/Desktop/updates/MAST-ML'

# Add the checkout to sys.path if it is not already there
if my_path not in sys.path:
    sys.path.append(my_path)

In [13]:
from mastml.mastml import Mastml
from mastml.datasets import LocalDatasets
from mastml.models import SklearnModel, EnsembleModel
from mastml.preprocessing import SklearnPreprocessor
from mastml.data_splitters import SklearnDataSplitter, NoSplit, LeaveOutPercent
from mastml.feature_selectors import EnsembleModelFeatureSelector
from mastml.mastml_predictor import make_prediction
from pathlib import Path
import mastml
import subprocess
import glob
import os
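# Locate the example data shipped with MAST-ML. Depending on how the package
# was installed, __path__ is either a namespace path object or a plain list.
try:
    data_path = os.path.join(mastml.__path__._path[0], 'data')
except AttributeError:
    data_path = os.path.join(mastml.__path__[0], 'data')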


Define the path to save data.

In [None]:
SAVEPATH = 'drive/MyDrive/MASTML_tutorial_8_ModelPredictions_with_Guide_Rails'

mastml_instance = Mastml(savepath=SAVEPATH)
savepath = mastml_instance.get_savepath

Load the standard diffusion dataset.

In [3]:
target = 'E_regression'

extra_columns = ['Material compositions 1', 'Material compositions 2', 'Hop activation barrier']
d = LocalDatasets(file_path=data_path+'/diffusion_data_allfeatures.xlsx', 
                  target=target, 
                  extra_columns=extra_columns, 
                  group_column='Material compositions 1',
                  testdata_columns=None,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()

# Let's assign each data object to its respective name
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']
X_testdata = data_dict['X_testdata']

metrics = [
           'r2_score',
           'mean_absolute_error',
           'root_mean_squared_error',
           'rmse_over_stdev',
           ]




## Task 2: Fit models and check elemental spaces <a name="task2"></a>

Set up a machine learning run that checks whether the elements in each test case were observed in the training set. If all elements of a test case are observed in the training set, the case is marked as "in_domain". If only some of its elements are observed in the training data, the case is marked as "maybe_in_domain". If none of its elements are observed in the training data, the case is flagged as "out_of_domain".
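The three labels described above can be illustrated with a minimal sketch. This is not MAST-ML's implementation: the helper names `elements_in` and `elemental_domain`, the regular expression used to split a composition string into element symbols, and the toy compositions are all assumptions made for illustration.

```python
import re


def elements_in(formula):
    """Return the set of element symbols in a composition string.

    Assumption: symbols are one uppercase letter optionally followed by
    one lowercase letter (e.g. 'Al', 'Mg', 'W').
    """
    return set(re.findall(r'[A-Z][a-z]?', formula))


def elemental_domain(train_formulas, test_formula):
    """Label a test case by how many of its elements appear in training."""
    train_elements = set()
    for formula in train_formulas:
        train_elements |= elements_in(formula)

    test_elements = elements_in(test_formula)
    seen = test_elements & train_elements

    if test_elements and seen == test_elements:
        return 'in_domain'          # every test element was seen in training
    if seen:
        return 'maybe_in_domain'    # only some test elements were seen
    return 'out_of_domain'          # no test element was seen


# Toy data (hypothetical, not taken from the diffusion dataset)
train = ['Al', 'Cu', 'Ni']
labels = [elemental_domain(train, t) for t in ['Al', 'AlMg', 'Mg']]
print(labels)  # ['in_domain', 'maybe_in_domain', 'out_of_domain']
```

In the tutorial itself the check is requested through the `domain=('elemental', groups)` argument of `splitter.evaluate`, so no hand-written helper is needed.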

In [5]:
preprocessor = SklearnPreprocessor(
                                   preprocessor='StandardScaler',
                                   as_frame=True,
                                   )

model = SklearnModel(model='RandomForestRegressor')

splitter = SklearnDataSplitter(
                               splitter='RepeatedKFold',
                               n_repeats=10,
                               n_splits=5,
                               )
splitter.evaluate(
                  X=X,
                  y=y, 
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=('elemental', groups),
                  )

## Task 3: Fit models and check Gaussian Process Error Bars <a name="task3"></a>

Here we use the error bars inherent to Gaussian Process Regression (GPR) to decide whether a case should be flagged as worrisome. Through 5-fold cross validation, we obtain the maximum uncertainty from GPR and compare it to the uncertainty of each test case. If the test case uncertainty is greater than the maximum training uncertainty, we mark the observation as "out_of_domain"; otherwise we mark it as "in_domain".

In [None]:
splitter.evaluate(
                  X=X,
                  y=y, 
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain='gpr',
                  )
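The GPR check in this task can be sketched with scikit-learn. This is one reading of the description, not MAST-ML's code: the function names `make_gpr` and `gpr_domain`, the fixed RBF kernel, the noise level `alpha`, and the toy data are all assumptions. The kernel is held fixed (`optimizer=None`) only to keep the sketch deterministic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import KFold


def make_gpr():
    # Assumed settings: fixed RBF kernel and a small noise term
    return GaussianProcessRegressor(
        kernel=RBF(length_scale=0.2), alpha=1e-2, optimizer=None
    )


def gpr_domain(X_train, y_train, X_test, n_splits=5):
    """Flag test cases whose GPR std exceeds the maximum std seen in CV."""
    max_std = 0.0
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in cv.split(X_train):
        gpr = make_gpr().fit(X_train[train_idx], y_train[train_idx])
        _, std = gpr.predict(X_train[val_idx], return_std=True)
        max_std = max(max_std, float(std.max()))

    # Refit on all training data and compare test uncertainties to the threshold
    gpr = make_gpr().fit(X_train, y_train)
    _, test_std = gpr.predict(X_test, return_std=True)
    return np.where(test_std > max_std, 'out_of_domain', 'in_domain').tolist()


# Toy 1-D data (hypothetical): training inputs lie in [0, 1]
rng = np.random.default_rng(0)
X_train = np.linspace(0, 1, 30).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel() + 0.05 * rng.standard_normal(30)

# One test point that coincides with a training point, one far outside the range
X_test = np.array([[X_train[10, 0]], [10.0]])
gpr_labels = gpr_domain(X_train, y_train, X_test)
print(gpr_labels)  # ['in_domain', 'out_of_domain']
```

The far-away point is flagged because the GPR uncertainty reverts to the prior standard deviation where no training data is nearby, which is larger than any uncertainty seen on the held-out folds.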

## Task 4: Fit models and check domain with MADML <a name="task4"></a>

Fit the kinds of models that we need for domain evaluation.

In [6]:
splitter = NoSplit()

# Domain with MADML
params = {'n_repeats': 2}
domain = ('madml', params)

splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=[domain],
                  )

MADMl - Nested CV fit Fold: 1
100%|██████████| 30/30 [00:04<00:00,  6.91it/s]
MADMl - Nested CV fit Fold: 2
100%|██████████| 30/30 [00:03<00:00,  8.14it/s]
MADMl - Nested CV fit Fold: 3
...
Running splits: 100%|██████████| 1/1 [02:32<00:00, 152.12s/it]


Use the fitted models to predict on sample data. Here we predict on the same data we trained on, but any other data with the same features can be used.

In [16]:
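# Find the run directory written by splitter.evaluate(); its name starts with the model name
file_to_move = glob.glob('Ran*')[0]
print(file_to_move)
# Rename it to 'output' so the paths below do not depend on the timestamp (uses the Unix mv command)
subprocess.run(['mv', file_to_move, 'output'])
path_fullfit = './output/split_0'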

model_path = os.path.join(path_fullfit, 'RandomForestRegressor.pkl')
preprocessor_path = os.path.join(path_fullfit, 'StandardScaler.pkl')
domain_path = list(map(str, Path(path_fullfit).rglob('domain_*.pkl')))

pred_df = make_prediction(
                          X_train=X,
                          y_train=y,
                          X_test=X,
                          model=model_path,
                          preprocessor=preprocessor_path,
                          domain=domain_path,
                          )

print(pred_df)

RandomForestRegressor_NoSplit_SklearnPreprocessor_NoSelect_2023_11_02_11_49_16
       y_pred     y_err  Residual for 0.95  Residual for Max F1  \
0   -0.007841  0.062488                  0                    1   
1    0.056112  0.205946                  1                    1   
2    0.278220  0.141283                  1                    1   
3   -0.038416  0.052946                  0                    1   
4    0.307689  0.100968                  1                    1   
..        ...       ...                ...                  ...   
403 -0.030285  0.177594                  0                    1   
404  0.128078  0.196010                  0                    1   
405  0.213274  0.174223                  0                    1   
406  0.151207  0.161292                  0                    1   
407  0.181990  0.202337                  0                    1   

     Uncertainty for 0.95  Uncertainty for Max F1      dist  
0                       0                       0  0.8

We can also change the default thresholds for prediction and add our own.

In [17]:
thresholds = [('residual', 0.75), ('uncertainty', 0.2)]

pred_df = make_prediction(
                          X_train=X,
                          y_train=y,
                          X_test=X,
                          model=model_path,
                          preprocessor=preprocessor_path,
                          domain=domain_path,
                          madml_thresholds=thresholds
                          )

print(pred_df)

       y_pred     y_err  Residual for 0.75  Uncertainty for 0.2      dist
0   -0.007841  0.062488                  0                    0  0.884227
1    0.056112  0.205946                  1                    0  0.251691
2    0.278220  0.141283                  1                    0  0.283408
3   -0.038416  0.052946                  1                    0  0.727743
4    0.307689  0.100968                  1                    0  0.257772
..        ...       ...                ...                  ...       ...
403 -0.030285  0.177594                  0                    0  0.816549
404  0.128078  0.196010                  0                    0  0.820780
405  0.213274  0.174223                  0                    0  0.800442
406  0.151207  0.161292                  0                    0  0.919245
407  0.181990  0.202337                  0                    0  0.934091

[408 rows x 5 columns]
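To get a quick overview of the threshold columns, the returned DataFrame can be summarized with pandas. The sketch below rebuilds the ten rows visible in the printed output above as a stand-in for `pred_df`; the variable names are hypothetical, and what a value of 1 versus 0 means for each column should be checked against the MAST-ML documentation rather than inferred from this example.

```python
import pandas as pd

# Ten rows copied from the printed output above (indices 0-4 and 403-407)
sample = pd.DataFrame(
    {
        'y_pred': [-0.007841, 0.056112, 0.278220, -0.038416, 0.307689,
                   -0.030285, 0.128078, 0.213274, 0.151207, 0.181990],
        'y_err': [0.062488, 0.205946, 0.141283, 0.052946, 0.100968,
                  0.177594, 0.196010, 0.174223, 0.161292, 0.202337],
        'Residual for 0.75': [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        'Uncertainty for 0.2': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        'dist': [0.884227, 0.251691, 0.283408, 0.727743, 0.257772,
                 0.816549, 0.820780, 0.800442, 0.919245, 0.934091],
    },
    index=[0, 1, 2, 3, 4, 403, 404, 405, 406, 407],
)

# Select the threshold columns by their name prefix
flag_columns = [
    c for c in sample.columns
    if c.startswith(('Residual for', 'Uncertainty for'))
]

# Number of rows with value 1 in each threshold column
flag_totals = sample[flag_columns].sum()
print(flag_totals)

# Row indices where the residual threshold column equals 1
rows_with_one = sample.index[sample['Residual for 0.75'] == 1].tolist()
print(rows_with_one)  # [1, 2, 3, 4]
```

The same two lines work on the full `pred_df`, since the column names follow the `madml_thresholds` passed to `make_prediction`.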
