
# Welcome to the seventh MAST-ML tutorial notebook: Model Baseline Tests with MAST-ML!

## In this notebook, we will learn how to run baseline tests on our models. We will:

1. [Set up MAST-ML on Colab and begin session](#task1)
2. [Import dataset](#task2)
3. [Run baseline tests on our model](#task3)

## Task 1: Set up MAST-ML on Colab and begin session <a name="task1"></a>

If you are working on Google Colab and need to install MAST-ML,
begin by cloning the relevant branch of MAST-ML to the Colab session
and installing the needed dependencies:

In [1]:
# !git clone --single-branch --branch dev_Ryan_2020-12-21 https://github.com/uw-cmg/MAST-ML
# !pip install -r MAST-ML/requirements.txt

Sync your Google Drive to Colab so that we can save MAST-ML results to our Google
Drive. If we save to the Colab session, the data will be deleted when the session
ends.

In [2]:
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

We need to add the MAST-ML folder to our sys path so that Python can find the modules.

In [3]:
# import sys
# sys.path.append('MAST-ML')

Here we import the MAST-ML modules used in this tutorial.

In [11]:
import mastml
from mastml.mastml import Mastml
from mastml.datasets import LocalDatasets
from mastml.models import SklearnModel, EnsembleModel
from mastml.preprocessing import SklearnPreprocessor
from mastml.metrics import Metrics
from mastml.baseline_tests import Baseline_tests
import os
data_path = os.path.join(mastml.__path__[0], 'data')


## Task 2: Import dataset <a name="task2"></a>

In this tutorial, we will again use the diffusion dataset that we examined in the previous tutorial. Here, we use the LocalDatasets module to load in the diffusion dataset.

In [None]:
target = 'E_regression'

extra_columns = ['Material compositions 1', 'Material compositions 2']

d = LocalDatasets(file_path=os.path.join(data_path, 'diffusion_data_selectfeatures.xlsx'),
                  target=target, 
                  extra_columns=extra_columns, 
                  group_column='Material compositions 1',
                  testdata_columns=None,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()

# Let's assign each data object to its respective name
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']

In this tutorial, we will use RandomForestRegressor as our model.

In [None]:
model = SklearnModel(model='RandomForestRegressor', n_estimators=150)

## Task 3:  Run baseline tests on our model <a name="task3"></a>

We list which metrics we want to evaluate. If none are given, MAST-ML will default to evaluating just the root mean squared error. A complete list of metrics can be obtained by calling Metrics()._metric_zoo() in metrics.py.



In [None]:
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']
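
To make the four metric names concrete, here is a minimal sketch in plain Python of what each one measures. It does not use MAST-ML. The helper `regression_metrics` and the toy data are hypothetical, and normalizing `rmse_over_stdev` by the population standard deviation of the true values is an assumption, not a statement about MAST-ML's implementation.

```python
import math
from statistics import fmean, pstdev

def regression_metrics(y_true, y_pred):
    """Compute the four metrics named above for paired true/predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = fmean(abs(e) for e in errors)
    rmse = math.sqrt(fmean(e * e for e in errors))
    y_mean = fmean(y_true)
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # total sum of squares
    return {
        'r2_score': 1.0 - ss_res / ss_tot,
        'mean_absolute_error': mae,
        'root_mean_squared_error': rmse,
        # Assumption: RMSE divided by the population standard deviation of y_true
        'rmse_over_stdev': rmse / pstdev(y_true),
    }

# Toy data, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
scores = regression_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f'{name}: {value:.4f}')
```

Under this definition, an `rmse_over_stdev` near one means the model does little better than always predicting the mean of y, which is the kind of comparison the baseline tests below make explicit.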

As in the previous tutorial, we need to define our preprocessing function. We will use the basic StandardScaler in scikit-learn to normalize each column to have a mean of zero and a standard deviation of one.

In [None]:
preprocessor = SklearnPreprocessor(preprocessor='StandardScaler', as_frame=True)
preprocessor.evaluate(X,y)
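
As a rough illustration of what this normalization does to a single column, the sketch below standardizes a toy list of values in plain Python. The `standardize` helper and the data are hypothetical. It uses the population standard deviation, which matches how scikit-learn's StandardScaler scales each feature.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Shift a column to zero mean and scale it to unit standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation (ddof = 0)
    return [(value - mean) / std for value in column]

# Toy feature column, for illustration only
column = [2.0, 4.0, 6.0, 8.0]
scaled = standardize(column)

print('mean after scaling:', fmean(scaled))
print('stdev after scaling:', pstdev(scaled))
```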

In [None]:
model.fit(X=X, y=y)

The Baseline_tests methods take X, y, the model, and the metrics used to evaluate the model as arguments. Here we will go through how to use the mean test.

In [None]:
baseline_test = Baseline_tests()

test_mean will print out the score of the model tested with the actual y, compared with the naive score of the model tested with the mean value of y.

In [None]:
baseline_test.test_mean(X=X, y=y, model=model, metrics=metrics)
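
The idea behind a mean baseline can be sketched without MAST-ML. The example below is one reading of the comparison, not the library's implementation: it scores a set of predictions against the true y, then scores a naive predictor that always returns the mean of y. The `rmse` helper, the toy data, and the stand-in `model_predictions` list are all hypothetical.

```python
import math
from statistics import fmean

def rmse(y_true, y_pred):
    """Root mean squared error between paired values."""
    return math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))

# Toy targets and a hypothetical stand-in for model.predict(X)
y = [1.0, 2.0, 3.0, 4.0, 5.0]
model_predictions = [1.1, 1.9, 3.2, 3.8, 5.1]

# Naive baseline: predict the mean of y for every data point
naive_predictions = [fmean(y)] * len(y)

model_rmse = rmse(y, model_predictions)
naive_rmse = rmse(y, naive_predictions)

print(f'model RMSE: {model_rmse:.4f}')
print(f'naive (mean) RMSE: {naive_rmse:.4f}')
```

A useful model should have a clearly lower error than the naive baseline. If the two scores are close, the model has learned little beyond the average of the target.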

test_permuted will print out the score of the model tested with the actual y, compared with the naive score of the model tested with y shuffled so that it no longer corresponds to the X data.

In [None]:
baseline_test.test_permuted(X=X, y=y, model=model, metrics=metrics)
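
The permutation idea can also be sketched in plain Python. This is a minimal illustration under stated assumptions, not MAST-ML's implementation: the same predictions are scored once against the true y and once against a shuffled copy of y. The `rmse` helper, the toy data, the stand-in `model_predictions` list, and the fixed random seed are all hypothetical.

```python
import math
import random
from statistics import fmean

def rmse(y_true, y_pred):
    """Root mean squared error between paired values."""
    return math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))

# Toy targets and a hypothetical stand-in for model.predict(X)
y = [float(i) for i in range(1, 21)]
model_predictions = [value + 0.1 for value in y]

# Shuffle a copy of y so that it no longer lines up with the predictions
rng = random.Random(0)  # fixed seed so the shuffle is repeatable
y_permuted = y[:]
rng.shuffle(y_permuted)

model_rmse = rmse(y, model_predictions)
permuted_rmse = rmse(y_permuted, model_predictions)

print(f'model RMSE: {model_rmse:.4f}')
print(f'permuted RMSE: {permuted_rmse:.4f}')
```

If the error against the shuffled targets is not much worse than the error against the real targets, the model is not capturing a meaningful relationship between X and y.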