# Welcome to the eighth MAST-ML tutorial notebook,

# Model predictions with guide rails in MAST-ML!

## In this notebook, we will learn how to perform simple checks on our test data:

1. [Set up MAST-ML on Colab and begin session](#task1)
2. [Fit models and check elemental spaces](#task2)
3. [Fit models and check Gaussian Process Error Bars](#task3)

## Task 1: Set up MAST-ML on Colab and begin session <a name="task1"></a>

First, install MAST-ML and its dependencies.

In [12]:
!pip install git+https://github.com/uw-cmg/MAST-ML/@dev_lane
!pip install pyyaml==5.4.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/uw-cmg/MAST-ML/@dev_lane
  Cloning https://github.com/uw-cmg/MAST-ML/ (to revision dev_lane) to /tmp/pip-req-build-ek04wh27
  Running command git clone --filter=blob:none --quiet https://github.com/uw-cmg/MAST-ML/ /tmp/pip-req-build-ek04wh27
  Running command git checkout -b dev_lane --track origin/dev_lane
  Switched to a new branch 'dev_lane'
  Branch 'dev_lane' set up to track remote branch 'dev_lane' from 'origin'.
  Resolved https://github.com/uw-cmg/MAST-ML/ to commit 1b5d87ef5cada0eb689e7d41783a7d4e66cb0237
  Preparing metadata (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyyaml==5.4.1
  Downloading PyYAML-5.4.1-cp39-cp39-manylinux1_x86_64.whl (630 kB)

Mount Google Drive to save output from runs.

In [13]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


Import the packages and classes needed to run the example.

In [19]:
from mastml.mastml import Mastml
from mastml.datasets import LocalDatasets
from mastml.models import SklearnModel, EnsembleModel
from mastml.preprocessing import SklearnPreprocessor
from mastml.data_splitters import SklearnDataSplitter, NoSplit, LeaveOutPercent
from mastml.feature_selectors import EnsembleModelFeatureSelector
import mastml
import os
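try:
    # Namespace-style installs expose the package location through the private _path attribute
    data_path = os.path.join(mastml.__path__._path[0], 'data')
except AttributeError:
    # Regular installs expose __path__ as a plain list, which has no _path attribute
    data_path = os.path.join(mastml.__path__[0], 'data')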


Define the path where MAST-ML will save its output.

In [20]:
SAVEPATH = 'drive/MyDrive/MASTML_tutorial_8_ModelPredictions_with_Guide_Rails'

mastml_instance = Mastml(savepath=SAVEPATH)
savepath = mastml_instance.get_savepath

Load the standard diffusion dataset.

In [24]:
target = 'E_regression'

extra_columns = ['Material compositions 1', 'Material compositions 2', 'Hop activation barrier']
d = LocalDatasets(file_path=data_path+'/diffusion_data_allfeatures.xlsx', 
                  target=target, 
                  extra_columns=extra_columns, 
                  group_column='Material compositions 1',
                  testdata_columns=None,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()

# Let's assign each data object to its respective name
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']
X_testdata = data_dict['X_testdata']

metrics = [
           'r2_score',
           'mean_absolute_error',
           'root_mean_squared_error',
           'rmse_over_stdev',
           ]




## Task 2: Fit models and check elemental spaces <a name="task2"></a>

Set up a model evaluation that checks whether the elements in each test case were observed in the training set. If all elements of a test case are observed in the training set, the case is marked as "in_domain". If only some of its elements are observed in the training data, the case is marked as "maybe_in_domain". If none of its elements are observed in the training data, the case is flagged as "out_of_domain".

In [25]:
preprocessor = SklearnPreprocessor(
                                   preprocessor='StandardScaler',
                                   as_frame=True,
                                   )

model = SklearnModel(model='RandomForestRegressor')

splitter = SklearnDataSplitter(
                               splitter='RepeatedKFold',
                               n_repeats=10,
                               n_splits=5,
                               )
splitter.evaluate(
                  X=X,
                  y=y, 
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain=('elemental', groups),
                  )
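The three-way elemental labeling described above can be sketched in plain Python. This is only an illustration of the rule, not MAST-ML's internal implementation: the helper names `elements_in` and `elemental_domain` are hypothetical, and the sketch assumes compositions are formula-like strings whose element symbols can be picked out with a simple regular expression.

```python
import re


def elements_in(composition):
    """Return the set of element symbols in a formula-like string such as 'AlCu'.

    Assumption: symbols are one capital letter optionally followed by one
    lowercase letter. This is a hypothetical helper, not a MAST-ML function.
    """
    return set(re.findall(r"[A-Z][a-z]?", composition))


def elemental_domain(test_composition, train_compositions):
    """Label one test case by how many of its elements appear in the training set."""
    train_elements = set()
    for composition in train_compositions:
        train_elements |= elements_in(composition)

    test_elements = elements_in(test_composition)
    if not test_elements:
        raise ValueError("no element symbols found in test composition")

    seen = test_elements & train_elements
    if seen == test_elements:
        return "in_domain"        # every test element was seen in training
    if seen:
        return "maybe_in_domain"  # only some test elements were seen
    return "out_of_domain"        # no test element was seen


train = ["Al", "AlCu", "Ag"]
for case in ["AlAg", "AlZn", "FeNi"]:
    print(case, elemental_domain(case, train))
```

In the notebook itself, the grouping column passed as `groups` plays the role of the composition strings used here.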

## Task 3: Fit models and check Gaussian Process Error Bars <a name="task3"></a>

Here we use the error bars inherent in Gaussian Process Regression (GPR) to decide whether a case should be flagged as worrisome. Through 5-fold cross validation, we obtain the maximum uncertainty from GPR and compare it to the uncertainty of each test case. If the test case uncertainty is greater than the maximum training uncertainty, we mark the observation as "out_of_domain"; otherwise we mark it as "in_domain".

In [26]:
splitter.evaluate(
                  X=X,
                  y=y, 
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  X_extra=X_extra,
                  verbosity=3,
                  domain='gpr',
                  )
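The GPR-based check can be sketched with scikit-learn directly. This is a minimal illustration of the thresholding idea described in this task, not MAST-ML's own code: the function names, the fixed RBF kernel, the noise level `alpha`, and the toy one-dimensional data are all assumptions chosen to keep the example small and deterministic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import KFold


def make_gpr():
    # Assumed settings: fixed kernel (no hyperparameter optimization) and a
    # small noise term, so the sketch gives the same result on every run.
    return GaussianProcessRegressor(kernel=RBF(length_scale=0.2),
                                    alpha=1e-2, optimizer=None)


def max_cv_uncertainty(X_train, y_train, n_splits=5, seed=0):
    """Largest GPR predictive std observed on held-out folds of the training data."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    max_std = 0.0
    for fit_idx, held_idx in cv.split(X_train):
        gpr = make_gpr().fit(X_train[fit_idx], y_train[fit_idx])
        _, std = gpr.predict(X_train[held_idx], return_std=True)
        max_std = max(max_std, float(std.max()))
    return max_std


def gpr_domain_labels(X_train, y_train, X_test):
    """Flag test points whose GPR std exceeds the maximum cross-validation std."""
    threshold = max_cv_uncertainty(X_train, y_train)
    gpr = make_gpr().fit(X_train, y_train)
    _, test_std = gpr.predict(X_test, return_std=True)
    return np.where(test_std > threshold, "out_of_domain", "in_domain")


# Toy data: training points cover [0, 1]; one test point coincides with a
# training point and the other lies far outside the training range.
X_train = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel()
X_test = np.array([X_train[20], [10.0]])

labels = gpr_domain_labels(X_train, y_train, X_test)
print(labels)
```

Far from the training data the GPR predictive standard deviation reverts to the prior value, which is why the distant point is the one that exceeds the cross-validation threshold.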

