
# Welcome to the seventh MAST-ML tutorial notebook, 

# Model Baseline Tests with MAST-ML!

## In this notebook, we will learn how to run some baseline tests on our models. In this tutorial, we will:

1. [Set up MAST-ML on Colab and begin session](#task1)
2. [Import Dataset](#task2)
3. [Run baseline tests on regression model](#task3)
4. [Run baseline tests on classifier model](#task4)

## Task 1: Set up MAST-ML on Colab and begin session <a name="task1"></a>

If you are working on Google Colab and need to install MAST-ML, 
begin by pip-installing MAST-ML and its dependencies
into the Colab session:

In [1]:
!pip install mastml

Sync your Google Drive to Colab so that we can save MAST-ML results to Google
Drive. Anything saved only to the Colab session is deleted when the session 
ends.

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

We need to add the MAST-ML folder to sys.path so that Python can find the modules:

In [3]:
import sys
sys.path.append('MAST-ML')

Here we import the MAST-ML modules used in this tutorial

In [4]:
import mastml
from mastml.datasets import LocalDatasets, SklearnDatasets
from mastml.models import SklearnModel, EnsembleModel
from mastml.preprocessing import SklearnPreprocessor
from mastml.metrics import Metrics
from mastml.baseline_tests import Baseline_tests

import os
data_path = os.path.join(mastml.__path__[0], 'data')

Figshare is an optional dependency. To import data from figshare, manually install figshare via git clone of git clone https://github.com/cognoma/figshare.git
XGBoost is an optional dependency. If you want to use XGBoost models, please manually install xgboost package with pip install xgboost. If have error with finding libxgboost.dylib library, dobrew install libomp. If do not have brew on your system, first do ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" from the Terminal
forestci is an optional dependency. To install latest forestci compatabilty with scikit-learn>=0.24, run pip install git+git://github.com/scikit-learn-contrib/forest-confidence-interval.git


## Task 2: Import dataset <a name="task2"></a>

In this tutorial, we will again use the diffusion dataset that we examined in the previous tutorial. Here, we use the LocalDatasets module to load in the diffusion dataset.

In [5]:
target = 'E_regression'

extra_columns = ['Material compositions 1', 'Material compositions 2']

d = LocalDatasets(file_path=os.path.join(data_path, 'diffusion_data_selectfeatures.xlsx'), 
                  target=target, 
                  extra_columns=extra_columns, 
                  group_column='Material compositions 1',
                  testdata_columns=None,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()

# Let's assign each data object to its respective name
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']



In this tutorial, we will use RandomForestRegressor as our model.

In [6]:
model = SklearnModel(model='RandomForestRegressor', n_estimators=150)

## Task 3:  Run baseline tests on regression model <a name="task3"></a>

We list which metrics we want to evaluate. If none are given, MAST-ML will default to evaluating just the root mean squared error. A complete list of metrics can be obtained by calling Metrics()._metric_zoo() in metrics.py.



In [7]:
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']
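
To make the last two metric names concrete, the sketch below computes them by hand with NumPy on made-up numbers. It assumes that rmse_over_stdev is the RMSE divided by the population standard deviation of the true values; the helper functions are illustrative and are not MAST-ML's own implementation.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error between true and predicted values
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_over_stdev(y_true, y_pred):
    # Assumption: normalise the RMSE by the standard deviation of the true values
    return rmse(y_true, y_pred) / float(np.std(y_true))

# Made-up values, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print("RMSE:", rmse(y_true, y_pred))
print("RMSE / stdev:", rmse_over_stdev(y_true, y_pred))
```

Under this definition, a value well below 1 means the model's error is small compared with the spread of the data, while a value near 1 means the model does little better than guessing the mean.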

As in the previous tutorial, we need to define our preprocessing function. We are just going to use the basic StandardScaler in scikit-learn to normalize each column to have mean zero and standard deviation of one.

In [8]:
preprocessor = SklearnPreprocessor(preprocessor='StandardScaler', as_frame=True)
preprocessor.evaluate(X,y)

Unnamed: 0,Site2_MeltingT,BCCenergy_pa_max_value,BCCefflatcnt_difference,Site1_BCCefflatcnt,Site1_CovalentRadius,NdUnfilled_max_value,n_ws^third_max_value,Site2_Row,Site2_NdUnfilled,MendeleevNumber_composition_average,Site2_MendeleevNumber,Site2_IonizationEnergy,Site1_GSestFCClatcnt,IonicRadii_composition_average,HeatVaporization_min_value,HHIr_difference,Site2_ICSDVolume,NdUnfilled_composition_average,Polarizability_max_value,Site2_BCCvolume_padiff
0,-0.531814,0.931053,-1.254233,0.147244,0.412734,-1.388936,-1.573418,0.218361,-0.943565,0.587584,0.446102,-0.017709,0.183395,3.133860,-0.156934,-1.099257,-0.312672,-1.260505,-0.891095,0.279416
1,0.064051,0.931053,0.204143,0.147244,0.412734,-0.369717,0.477555,-0.855029,-0.009161,0.265113,0.081144,0.224198,0.183395,1.357710,-0.156934,-0.580394,-1.040516,-0.495578,-0.802413,0.442443
2,0.524584,0.931053,-0.006534,0.147244,0.412734,0.309763,0.372377,-0.855029,0.613776,-0.149493,-0.388087,-0.718343,0.183395,1.251141,-0.156934,-0.021619,-0.929489,0.014374,-0.089477,0.334594
3,-0.394504,0.931053,0.018934,0.147244,0.412734,-1.388936,-0.994938,-0.855029,-0.943565,0.541517,0.393965,0.111308,0.183395,1.641894,-0.156934,-1.059344,-0.954162,-1.260505,-0.891095,0.349643
4,0.112116,0.931053,0.119309,0.147244,0.412734,-0.029977,0.582733,-0.855029,0.302307,0.126911,-0.075266,0.235845,0.183395,1.002480,-0.156934,-1.099257,-0.966498,-0.240602,-0.645915,0.344627
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403,1.161729,-0.781716,-0.390855,1.377227,2.514173,1.328982,-0.100925,0.218361,0.925244,-1.209041,-0.492361,-0.617997,1.447955,0.220974,2.400344,1.375318,-0.176972,2.309155,1.006009,0.344627
404,1.765340,-0.781716,-0.378603,1.377227,2.514173,1.328982,-0.153514,1.291752,1.236712,-1.162974,-0.440224,0.251076,1.447955,0.220974,2.400344,-0.221182,-0.238654,2.564131,1.006009,0.344627
405,0.257430,-0.781716,-0.110537,1.377227,2.514173,1.328982,-0.731993,-0.855029,1.548180,-1.393311,-0.700908,-0.671754,1.447955,0.114405,1.156580,-0.700132,-0.238654,2.819107,1.006009,0.419870
406,0.888986,-0.781716,-1.097630,1.377227,2.514173,1.328982,-1.100116,1.291752,1.548180,-1.301176,-0.596634,-0.815106,1.447955,0.895911,2.400344,-1.099257,0.353491,2.819107,1.006009,0.179091
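
For intuition about what the scaled values above mean, here is a minimal sketch of the same standardization done by hand with NumPy on a small made-up array (X_demo is a hypothetical name, not part of the tutorial data). Each column has its mean subtracted and is divided by its population standard deviation, which is how scikit-learn's StandardScaler scales features.

```python
import numpy as np

# Small made-up feature matrix: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # population standard deviation (ddof=0)

X_scaled = (X_demo - col_mean) / col_std

print("column means after scaling:", X_scaled.mean(axis=0))
print("column stdevs after scaling:", X_scaled.std(axis=0))
```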


In [9]:
model.fit(X=X, y=y)

RandomForestRegressor(n_estimators=150)

Each Baseline_tests method takes the arguments X, y, model, and the metrics used to evaluate the model. Here we start with the mean test.

In [10]:
baseline_test = Baseline_tests()

test_mean prints the score of the model tested against the actual y (Real) alongside the naive score obtained when the mean value of y is used instead (Fake).

In [11]:
baseline_test.test_mean(X=X, y=y, model=model, metrics=metrics)

r2_score score:
Real: 0.9801730039276285
Fake: 0.0 

mean_absolute_error score:
Real: 0.044772539510946556
Fake: 0.2928962336751531 

root_mean_squared_error score:
Real: 0.06165474483807767
Fake: 0.4116233755078354 

rmse_over_stdev score:
Real: 0.14080836648570116
Fake: 8474454311104335.0 
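
The idea behind a mean baseline can be sketched without MAST-ML. The example below uses made-up targets and assumes the baseline simply predicts the mean of y for every sample; it is not the library's code. Under that assumption, R² is zero by construction and the RMSE equals the standard deviation of y.

```python
import numpy as np

# Made-up target values, for illustration only
y_demo = np.array([0.2, 0.5, 0.9, 1.4])

# Naive "model": always predict the mean of y
y_mean_pred = np.full_like(y_demo, y_demo.mean())

ss_res = np.sum((y_demo - y_mean_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_mean = 1.0 - ss_res / ss_tot

rmse_mean = np.sqrt(np.mean((y_demo - y_mean_pred) ** 2))

print("R2 of mean predictor:", r2_mean)
print("RMSE of mean predictor:", rmse_mean)
print("Standard deviation of y:", np.std(y_demo))
```

Any useful regression model should therefore reach an R² clearly above zero and an RMSE clearly below the spread of the targets.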



test_permuted prints the score of the model tested against the actual y alongside the naive score obtained when y is shuffled so that it no longer corresponds to the X data.

In [12]:
baseline_test.test_permuted(X=X, y=y, model=model, metrics=metrics)

r2_score score:
Real: 0.981176616917211
Fake: -0.3592175181662185 

mean_absolute_error score:
Real: 0.04454091088024687
Fake: 0.4271994443895473 

root_mean_squared_error score:
Real: 0.06312124659586443
Fake: 0.5363783111006497 

rmse_over_stdev score:
Real: 0.1371983348397093
Fake: 1.1658548443808168 
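
As a rough illustration of why shuffling destroys the X-to-y relationship, the sketch below permutes a made-up target vector with NumPy and scores the shuffled values against the originals. The names y_demo, y_perm, and r2 are hypothetical, and this is not how MAST-ML implements the test internally.

```python
import numpy as np

def r2(y_true, y_pred):
    # Coefficient of determination computed by hand
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)       # fixed seed so the shuffle is repeatable
y_demo = np.linspace(0.0, 1.0, 50)   # made-up targets
y_perm = rng.permutation(y_demo)     # same values, random order

print("R2 against itself:", r2(y_demo, y_demo))
print("R2 against shuffled copy:", r2(y_demo, y_perm))
```

The shuffled copy contains exactly the same values, so its mean and spread are unchanged; only the pairing with each sample is lost, which is what the score penalizes.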



test_nearest_neighbour_kdtree prints the score of the model tested against the actual y alongside the naive score obtained from the y value of each data point's nearest neighbour.

In [13]:
baseline_test.test_nearest_neighbour_kdtree(X=X, y=y, model=model, metrics=metrics)

r2_score score:
Real: 0.9889714832583125
Fake: 0.7395269289526998 

mean_absolute_error score:
Real: 0.03625086717152263
Fake: 0.17155767158419746 

root_mean_squared_error score:
Real: 0.05251680292165585
Fake: 0.2513253138804576 

rmse_over_stdev score:
Real: 0.10501674505376488
Fake: 0    0.510366
dtype: float64 
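
The nearest-neighbour idea can be sketched with a brute-force search in NumPy. This illustrative version assumes that each point borrows the y value of its closest other point under Euclidean distance; the function name is hypothetical, and MAST-ML may split the data and search with a k-d tree instead.

```python
import numpy as np

def nearest_neighbour_targets(X, y):
    # For every row of X, return the y value of the closest *other* row
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    return y[dist.argmin(axis=1)]

# Made-up data: two tight clusters with similar targets inside each cluster
X_demo = np.array([[0.0], [0.1], [1.0], [1.2]])
y_demo = np.array([1.0, 1.1, 5.0, 5.5])

y_nn = nearest_neighbour_targets(X_demo, y_demo)
print(y_nn)
```

Because nearby points often have similar targets, this baseline is usually much harder to beat than the mean or permuted baselines, which matches the higher Fake scores in the output above.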



test_nearest_neighbour_cdist prints the score of the model tested against the actual y alongside the naive score obtained from the y value of each data point's nearest neighbour. This method takes an extra argument, d_metric, that sets the metric used to calculate the distance, such as:

‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’

The default is euclidean.

In [14]:
baseline_test.test_nearest_neighbour_cdist(X=X, y=y, model=model, metrics=metrics, d_metric="cityblock")

r2_score score:
Real: 0.9876693504931018
Fake: 0.8581094999629117 

mean_absolute_error score:
Real: 0.04001719000921805
Fake: 0.15284566649432096 

root_mean_squared_error score:
Real: 0.05688773473048952
Fake: 0.1997243302544913 

rmse_over_stdev score:
Real: 0.11104345774019408
Fake: 0.37668355424293254 
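
The choice of distance metric can change which point counts as the nearest neighbour. The small sketch below, with made-up points and hand-written distance functions, shows a query whose nearest neighbour differs between the euclidean and cityblock metrics.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def cityblock(a, b):
    # Sum of absolute coordinate differences (Manhattan distance)
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

query = [0.0, 0.0]
candidates = {"A": [3.0, 0.0], "B": [2.0, 2.0]}  # made-up points

nearest_euclidean = min(candidates, key=lambda k: euclidean(query, candidates[k]))
nearest_cityblock = min(candidates, key=lambda k: cityblock(query, candidates[k]))

print("euclidean nearest:", nearest_euclidean)
print("cityblock nearest:", nearest_cityblock)
```

Point B is closer in a straight line, while point A is closer when distance is summed along each axis, so the two metrics can produce different baseline scores on the same data.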



## Task 4: Run baseline tests on classifier model <a name="task4"></a>

For classifier baseline tests, let's use KNeighborsClassifier as our model and iris as our dataset.

In [15]:
X, y = SklearnDatasets(as_frame=True).load_iris()
model = SklearnModel(model="KNeighborsClassifier")
model.fit(X,y)

KNeighborsClassifier()

test_classifier_dominant compares the score of the model with the score obtained when the test value is the dominant class (i.e., the class with the highest count).

In [16]:
baseline_test.test_classifier_dominant(X, y, model, metrics=metrics)

r2_score score:
Real: 1.0
Fake: 0.0 

mean_absolute_error score:
Real: 0.0
Fake: 12.066666666666666 

root_mean_squared_error score:
Real: 0.0
Fake: 12.094075684675811 

rmse_over_stdev score:
Real: 0.0
Fake: inf 
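
A dominant-class baseline can be sketched with the standard library alone. The example below uses made-up labels and assumes the baseline predicts the most frequent class for every sample; the variable names are illustrative.

```python
from collections import Counter

# Made-up class labels: class 0 is the most common
y_demo = [0, 0, 0, 1, 1, 2]

# Find the class with the highest count
dominant_class, dominant_count = Counter(y_demo).most_common(1)[0]

# Naive "model": always predict the dominant class
y_dominant = [dominant_class] * len(y_demo)

accuracy = sum(t == p for t, p in zip(y_demo, y_dominant)) / len(y_demo)
print("dominant class:", dominant_class)
print("baseline accuracy:", accuracy)
```

On a balanced dataset such as iris, where every class has the same number of samples, this kind of baseline is correct for only one third of the samples.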



test_classifier_random compares the score of the model with the score obtained when the test value is a random class. For the iris dataset, it randomly guesses class 0, 1, or 2.

In [17]:
baseline_test.test_classifier_random(X, y, model, metrics=metrics)

r2_score score:
Real: 0.8821218074656189
Fake: 0.0 

mean_absolute_error score:
Real: 0.06666666666666667
Fake: 0.6333333333333333 

root_mean_squared_error score:
Real: 0.2581988897471611
Fake: 0.7958224257542215 

rmse_over_stdev score:
Real: 0.34333393734727286
Fake: inf 
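
A random-class baseline can be sketched in the same spirit. The example below draws a uniformly random class for each sample from made-up balanced labels; the fixed seed and variable names are assumptions made for illustration only.

```python
import random

rng = random.Random(0)  # fixed seed so the guesses are repeatable

# Made-up balanced labels with three classes, similar in shape to iris
y_demo = [0, 1, 2] * 100
classes = sorted(set(y_demo))

# Naive "model": guess a class uniformly at random for every sample
y_random = [rng.choice(classes) for _ in y_demo]

accuracy = sum(t == p for t, p in zip(y_demo, y_random)) / len(y_demo)
print("random-guess accuracy:", accuracy)
```

With three equally likely classes, the accuracy of random guessing should land near one third, so a classifier that scores close to this level has learned very little.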



In all of the examples, the models' actual scores were significantly better than the scores from the fake tests. We can therefore conclude that our models have some reliability and are not completely useless.
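
When reading these comparisons, remember that "better" depends on the metric: higher is better for r2_score, while lower is better for the error metrics. The sketch below applies that rule to three Real/Fake pairs copied from the test_mean output earlier in this notebook; the helper beats_baseline is a hypothetical name, not a MAST-ML function.

```python
# Scores copied from the test_mean output: metric -> (real, fake)
scores = {
    "r2_score": (0.9801730039276285, 0.0),
    "mean_absolute_error": (0.044772539510946556, 0.2928962336751531),
    "root_mean_squared_error": (0.06165474483807767, 0.4116233755078354),
}

# Metrics where a larger value means a better model
higher_is_better = {"r2_score"}

def beats_baseline(metric, real, fake):
    # True when the real score is better than the baseline score
    if metric in higher_is_better:
        return real > fake
    return real < fake

results = {m: beats_baseline(m, real, fake) for m, (real, fake) in scores.items()}
for metric, ok in results.items():
    print(metric, "beats baseline:", ok)
```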