In [1]:
from download_delgado.delgado_datasets import DownloadAndConvertDelgadoDatasets
from mlaut.data import Data
from mlaut.estimators.estimators import instantiate_default_estimators
from mlaut.experiments import Orchestrator
from mlaut.analyze_results import AnalyseResults
from mlaut.analyze_results.scores import ScoreAccuracy
import pandas as pd

MLAUT is a modelling and workflow toolbox in Python, written to simplify large-scale benchmarking of machine learning strategies, e.g., validation, evaluation and comparison with respect to predictive/task-specific performance or runtime.

In this basic use case example we show the simplest MLAUT workflow.

For the purposes of this demonstration we assume that the user has already stored the data needed for the experiments in an HDF5 database. Saving the data is not part of the core MLAUT workflow and is therefore omitted from this demonstration.

Please refer to the advanced use case demos for examples of how this can be done.

The diagram below sketches the typical MLAUT workflow.

<img src="img/workflow.png?2">

### Step 1: Database

The code below creates hooks to the input and output database objects.

In [2]:
data = Data()
input_io = data.open_hdf5('data/delgado.hdf5', mode='a')
out_io = data.open_hdf5('data/classification.hdf5', mode='a')

`input_io`: hook to the input HDF5 database file <br>
`out_io`:  hook to the output HDF5 database file

### Step 2: Split datasets

After the hooks are created we can proceed to splitting the data into training and test sets.

Unless otherwise specified, we use $\dfrac{2}{3}$ of the data for training and $\dfrac{1}{3}$ for testing. We do not change or move the original data in this process. Instead, we store the train/test indices in a separate HDF5 database.
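
The idea of storing only row indices, rather than copies of the data, can be illustrated with a minimal sketch. This is not MLAUT's implementation: the function `split_indices`, its parameters and the fixed seed are hypothetical and serve only to show a 2/3 to 1/3 split expressed as two index lists.

```python
import random


def split_indices(n_samples, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) for a dataset with n_samples rows.

    Hypothetical helper for illustration only; the original data is never
    copied or reordered, only row positions are returned.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle the row positions
    n_test = round(n_samples * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


train_idx, test_idx = split_indices(n_samples=9)
print(len(train_idx), len(test_idx))
```

Because only the two index lists need to be persisted, the same original dataset can be reused for any number of experiments.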


In [3]:
dts_names_list, dts_names_list_full_path = data.list_datasets(hdf5_io=input_io, hdf5_group='delgado_datasets/')
split_dts_list = data.split_datasets(hdf5_in=input_io, hdf5_out=out_io, dataset_paths=dts_names_list_full_path, verbose=False)

`dts_names_list`: names of the datasets saved inside the HDF5 file <br>
`dts_names_list_full_path`: full path to the datasets inside the HDF5 database <br>
`split_dts_list`: path to the train/test indices of the split datasets

### Step 3: Define the estimators

For the purposes of the basic demo we show how the standard set of estimators that come with MLAUT can be used for running the experiments.

In the code example below we enumerate by name the estimators that we wish to use in the study. This provides instances of MLAUT estimators with the built-in defaults.

For more advanced use cases please refer to Advanced Usage - Example 1 and Example 2. The user can easily change the hyperparameter defaults or define a completely new estimator object.

In [4]:
est = ['RandomForestClassifier','BaggingClassifier','GradientBoostingClassifier','SVC','GaussianNaiveBayes','BernoulliNaiveBayes','NeuralNetworkDeepClassifier','PassiveAggressiveClassifier','BaselineClassifier']
estimators = instantiate_default_estimators(estimators=est)

`estimators`: array of MLAUT estimators

### Step 4: Run the experiments

The next step is to run the experiments by invoking the `run()` method.

This step could take a substantial amount of time depending on the number and size of the datasets and the number of estimators that we wish to train.

All trained estimators are saved to disk.

In [5]:
orchest = Orchestrator(hdf5_input_io=input_io, hdf5_output_io=out_io, dts_names=dts_names_list,
                 original_datasets_group_h5_path='delgado_datasets/')
orchest.run(modelling_strategies=estimators, verbose=False)

One of the key features of the package is that an experiment can be resumed after a crash or interruption. If this happens, the user simply needs to re-run the code above. Unless the `override_saved_models=True` flag was set, the orchestrator will skip all estimators that were trained successfully. This allows the user to continue from the point where the experiments were stopped.
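
The skip-if-already-trained behaviour can be sketched as follows. This is an illustration under stated assumptions, not the orchestrator's actual code: the helper `estimators_to_train`, the `.pkl` file naming and the flat directory layout are all hypothetical; only the flag name `override_saved_models` is taken from the text above.

```python
import tempfile
from pathlib import Path


def estimators_to_train(estimator_names, models_dir, override_saved_models=False):
    """Return the estimator names that still need to be trained.

    Hypothetical helper: assumes one saved file per estimator, named
    '<estimator name>.pkl', inside models_dir.
    """
    models_dir = Path(models_dir)
    pending = []
    for name in estimator_names:
        saved_model = models_dir / f"{name}.pkl"
        if saved_model.exists() and not override_saved_models:
            continue  # trained in an earlier run, so skip it
        pending.append(name)
    return pending


with tempfile.TemporaryDirectory() as tmp:
    # Pretend that SVC finished training before the interruption.
    (Path(tmp) / "SVC.pkl").touch()
    names = ["SVC", "BaggingClassifier"]
    resumed = estimators_to_train(names, tmp)
    forced = estimators_to_train(names, tmp, override_saved_models=True)

print(resumed)
print(forced)
```

In this sketch a re-run only picks up the estimators without a saved model, while setting the flag retrains everything.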

### Step 5: Make predictions on the test sets

After the estimators are trained, the user needs to use them to make predictions on the test sets. These predictions are used subsequently for performing statistical tests.

The predictions of the estimators are saved in the output HDF5 database file, a hook to which was created earlier.

Unless the `override=True` flag is set, MLAUT will not overwrite predictions that were previously stored in the database.

In [6]:
orchest.predict_all(trained_models_dir='data/trained_models', estimators=estimators, verbose=False)

### Step 6: Analyze the results

The last step in the pipeline is to analyze the results of the experiments.

The `AnalyseResults` class takes as inputs the two database files and the loss metric that will be used to compute the prediction errors.

In [7]:
analyze = AnalyseResults(hdf5_output_io=out_io, 
                         hdf5_input_io=input_io,
                         input_h5_original_datasets_group='delgado_datasets/', 
                         output_h5_predictions_group='experiments/predictions/')
score_accuracy = ScoreAccuracy()


(errors_per_estimator, 
 errors_per_dataset_per_estimator, 
 errors_per_dataset_per_estimator_df) = analyze.prediction_errors(score_accuracy, estimators)


The `prediction_errors()` method returns three objects, two of which are used here: the `errors_per_estimator` dictionary, which is used subsequently in further statistical tests, and `errors_per_dataset_per_estimator_df`, a dataframe with the loss of each estimator on each dataset that can be examined directly by the user.

Below we show the results of the various statistical tests that are supported by MLAUT.
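
To make the `loss` and `std_error` columns of such a dataframe concrete, here is a minimal sketch of one common way to compute a mean misclassification loss and its standard error from per-sample predictions. Whether MLAUT computes these quantities in exactly this way is an assumption; the function name `zero_one_loss_with_std_error` and the toy labels are hypothetical.

```python
import math


def zero_one_loss_with_std_error(y_true, y_pred):
    """Mean 0/1 loss and the standard error of that mean.

    Hypothetical helper: uses the population standard deviation of the
    per-sample errors divided by sqrt(n).
    """
    errors = [0.0 if t == p else 1.0 for t, p in zip(y_true, y_pred)]
    n = len(errors)
    loss = sum(errors) / n
    variance = sum((e - loss) ** 2 for e in errors) / n
    std_error = math.sqrt(variance / n)
    return loss, std_error


# Toy example: two of the four predictions are wrong.
loss, std_error = zero_one_loss_with_std_error([0, 1, 1, 0], [0, 1, 0, 1])
print(loss, std_error)
```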

#### t-test

In [8]:
t_test, t_test_df = analyze.t_test(errors_per_estimator)
t_test_df

Unnamed: 0_level_0,BaggingClassifier,BaggingClassifier,BaselineClassifier,BaselineClassifier,BernoulliNaiveBayes,BernoulliNaiveBayes,GaussianNaiveBayes,GaussianNaiveBayes,GradientBoostingClassifier,GradientBoostingClassifier,NeuralNetworkDeepClassifier,NeuralNetworkDeepClassifier,PassiveAggressiveClassifier,PassiveAggressiveClassifier,RandomForestClassifier,RandomForestClassifier,SVC,SVC
Unnamed: 0_level_1,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val
BaggingClassifier,0.0,1.0,17.37,0.0,5.613,0.0,6.226,0.0,1.248,0.213,6.241,0.0,2.836,0.005,-0.839,0.402,0.088,0.93
BaselineClassifier,-17.37,0.0,0.0,1.0,-12.267,0.0,-9.978,0.0,-15.541,0.0,-8.572,0.0,-14.077,0.0,-18.455,0.0,-17.435,0.0
BernoulliNaiveBayes,-5.613,0.0,12.267,0.0,0.0,1.0,1.274,0.204,-4.077,0.0,1.789,0.075,-2.485,0.014,-6.594,0.0,-5.587,0.0
GaussianNaiveBayes,-6.226,0.0,9.978,0.0,-1.274,0.204,0.0,1.0,-4.865,0.0,0.591,0.555,-3.447,0.001,-7.078,0.0,-6.202,0.0
GradientBoostingClassifier,-1.248,0.213,15.541,0.0,4.077,0.0,4.865,0.0,0.0,1.0,5.033,0.0,1.514,0.131,-2.069,0.04,-1.177,0.24
NeuralNetworkDeepClassifier,-6.241,0.0,8.572,0.0,-1.789,0.075,-0.591,0.555,-5.033,0.0,0.0,1.0,-3.749,0.0,-6.99,0.0,-6.215,0.0
PassiveAggressiveClassifier,-2.836,0.005,14.077,0.0,2.485,0.014,3.447,0.001,-1.514,0.131,3.749,0.0,0.0,1.0,-3.693,0.0,-2.782,0.006
RandomForestClassifier,0.839,0.402,18.455,0.0,6.594,0.0,7.078,0.0,2.069,0.04,6.99,0.0,3.693,0.0,0.0,1.0,0.94,0.348
SVC,-0.088,0.93,17.435,0.0,5.587,0.0,6.202,0.0,1.177,0.24,6.215,0.0,2.782,0.006,-0.94,0.348,0.0,1.0


#### sign test

In [9]:
sign_test, sign_test_df = analyze.sign_test(errors_per_estimator)
sign_test_df

Unnamed: 0_level_0,BaggingClassifier,BaggingClassifier,BaselineClassifier,BaselineClassifier,BernoulliNaiveBayes,BernoulliNaiveBayes,GaussianNaiveBayes,GaussianNaiveBayes,GradientBoostingClassifier,GradientBoostingClassifier,NeuralNetworkDeepClassifier,NeuralNetworkDeepClassifier,PassiveAggressiveClassifier,PassiveAggressiveClassifier,RandomForestClassifier,RandomForestClassifier,SVC,SVC
Unnamed: 0_level_1,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val,t_stat,p_val
BaggingClassifier,0.0,1.0,11.913,0.0,5.687,0.0,6.021,0.0,1.12,0.263,5.698,0.0,2.753,0.006,-0.879,0.38,0.152,0.88
BaselineClassifier,-11.913,0.0,0.0,1.0,-9.995,0.0,-8.812,0.0,-11.253,0.0,-7.778,0.0,-10.776,0.0,-12.224,0.0,-11.94,0.0
BernoulliNaiveBayes,-5.687,0.0,9.995,0.0,0.0,1.0,0.788,0.431,-4.26,0.0,0.856,0.392,-2.865,0.004,-6.513,0.0,-5.671,0.0
GaussianNaiveBayes,-6.021,0.0,8.812,0.0,-0.788,0.431,0.0,1.0,-4.742,0.0,0.151,0.88,-3.351,0.001,-6.788,0.0,-5.994,0.0
GradientBoostingClassifier,-1.12,0.263,11.253,0.0,4.26,0.0,4.742,0.0,0.0,1.0,4.597,0.0,1.608,0.108,-1.845,0.065,-0.912,0.362
NeuralNetworkDeepClassifier,-5.698,0.0,7.778,0.0,-0.856,0.392,-0.151,0.88,-4.597,0.0,0.0,1.0,-3.291,0.001,-6.444,0.0,-5.684,0.0
PassiveAggressiveClassifier,-2.753,0.006,10.776,0.0,2.865,0.004,3.351,0.001,-1.608,0.108,3.291,0.001,0.0,1.0,-3.61,0.0,-2.732,0.006
RandomForestClassifier,0.879,0.38,12.224,0.0,6.513,0.0,6.788,0.0,1.845,0.065,6.444,0.0,3.61,0.0,0.0,1.0,1.016,0.31
SVC,-0.152,0.88,11.94,0.0,5.671,0.0,5.994,0.0,0.912,0.362,5.684,0.0,2.732,0.006,-1.016,0.31,0.0,1.0


#### t-test with Bonferroni correction

In [10]:
t_test_bonferroni_df = analyze.t_test_with_bonferroni_correction(errors_per_estimator)
t_test_bonferroni_df

Unnamed: 0,BaggingClassifier,BaselineClassifier,BernoulliNaiveBayes,GaussianNaiveBayes,GradientBoostingClassifier,NeuralNetworkDeepClassifier,PassiveAggressiveClassifier,RandomForestClassifier,SVC
BaggingClassifier,False,True,True,True,False,True,False,False,False
BaselineClassifier,True,False,True,True,True,True,True,True,True
BernoulliNaiveBayes,True,True,False,False,True,False,False,True,True
GaussianNaiveBayes,True,True,False,False,True,False,False,True,True
GradientBoostingClassifier,False,True,True,True,False,True,False,False,False
NeuralNetworkDeepClassifier,True,True,False,False,True,False,True,True,True
PassiveAggressiveClassifier,False,True,False,False,False,True,False,True,False
RandomForestClassifier,False,True,True,True,False,True,True,False,False
SVC,False,True,True,True,False,True,False,False,False
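
For readers unfamiliar with the correction, the sketch below shows the standard Bonferroni rule: with $m$ pairwise comparisons, a difference is flagged as significant only if its p-value is below $\alpha / m$. The helper `bonferroni_significant`, the value $\alpha = 0.05$ and the toy p-values are assumptions for illustration; how MLAUT counts the comparisons internally is not shown in this notebook.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each comparison as significant after Bonferroni correction.

    p_values: dict mapping (name_a, name_b) -> uncorrected p-value.
    Hypothetical helper; the threshold is alpha divided by the number
    of comparisons supplied.
    """
    threshold = alpha / len(p_values)
    return {pair: p < threshold for pair, p in p_values.items()}


# Toy p-values for three pairwise comparisons (not taken from the tables here).
p_values = {("A", "B"): 0.001, ("A", "C"): 0.03, ("B", "C"): 0.5}
significant = bonferroni_significant(p_values)
print(significant)
```

Note that the comparison of A and C would pass an uncorrected test at 0.05 but is no longer flagged once the threshold is divided by three.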


#### Wilcoxon test

In [11]:
import warnings
warnings.filterwarnings('ignore')
wilcoxon_test, wilcoxon_test_df = analyze.wilcoxon_test(errors_per_estimator)
wilcoxon_test_df

Unnamed: 0_level_0,BaggingClassifier,BaggingClassifier,BaselineClassifier,BaselineClassifier,BernoulliNaiveBayes,BernoulliNaiveBayes,GaussianNaiveBayes,GaussianNaiveBayes,GradientBoostingClassifier,GradientBoostingClassifier,NeuralNetworkDeepClassifier,NeuralNetworkDeepClassifier,PassiveAggressiveClassifier,PassiveAggressiveClassifier,RandomForestClassifier,RandomForestClassifier,SVC,SVC
Unnamed: 0_level_1,statistic,p_val,statistic,p_val,statistic,p_val,statistic,p_val,statistic,p_val,statistic,p_val,statistic,p_val,statistic,p_val,statistic,p_val
BaggingClassifier,0.0,,10.0,0.0,621.0,0.0,512.5,0.0,1289.0,0.0,619.0,0.0,1173.0,0.0,1167.0,0.0,2762.5,0.949
BaselineClassifier,10.0,0.0,0.0,,50.0,0.0,458.0,0.0,26.5,0.0,231.0,0.0,6.5,0.0,0.0,0.0,0.0,0.0
BernoulliNaiveBayes,621.0,0.0,50.0,0.0,0.0,,2625.5,0.088,1204.5,0.0,2847.0,0.133,1454.0,0.0,165.5,0.0,392.5,0.0
GaussianNaiveBayes,512.5,0.0,458.0,0.0,2625.5,0.088,0.0,,1151.5,0.0,3085.0,0.586,1094.5,0.0,191.5,0.0,224.0,0.0
GradientBoostingClassifier,1289.0,0.0,26.5,0.0,1204.5,0.0,1151.5,0.0,0.0,,861.0,0.0,2572.5,0.033,645.5,0.0,1965.0,0.001
NeuralNetworkDeepClassifier,619.0,0.0,231.0,0.0,2847.0,0.133,3085.0,0.586,861.0,0.0,0.0,,1471.0,0.0,269.0,0.0,263.0,0.0
PassiveAggressiveClassifier,1173.0,0.0,6.5,0.0,1454.0,0.0,1094.5,0.0,2572.5,0.033,1471.0,0.0,0.0,,527.0,0.0,726.5,0.0
RandomForestClassifier,1167.0,0.0,0.0,0.0,165.5,0.0,191.5,0.0,645.5,0.0,269.0,0.0,527.0,0.0,0.0,,1877.5,0.002
SVC,2762.5,0.949,0.0,0.0,392.5,0.0,224.0,0.0,1965.0,0.001,263.0,0.0,726.5,0.0,1877.5,0.002,0.0,


#### Friedman test

In [12]:
friedman_test, friedman_test_df = analyze.friedman_test(errors_per_estimator)
friedman_test_df

Unnamed: 0,statistic,p_value
0,506.211,0.0


#### Nemenyi test

In [13]:
nemenyi_test = analyze.nemenyi(errors_per_estimator)
nemenyi_test

Unnamed: 0,BaggingClassifier,BaselineClassifier,BernoulliNaiveBayes,GaussianNaiveBayes,GradientBoostingClassifier,NeuralNetworkDeepClassifier,PassiveAggressiveClassifier,RandomForestClassifier,SVC
BaggingClassifier,-1.0,0.0,0.001,0.0,0.997,0.0,0.593,1.0,1.0
BaselineClassifier,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BernoulliNaiveBayes,0.001,0.0,-1.0,1.0,0.042,1.0,0.59,0.0,0.001
GaussianNaiveBayes,0.0,0.0,1.0,-1.0,0.007,1.0,0.271,0.0,0.0
GradientBoostingClassifier,0.997,0.0,0.042,0.007,-1.0,0.006,0.978,0.902,0.998
NeuralNetworkDeepClassifier,0.0,0.0,1.0,1.0,0.006,-1.0,0.26,0.0,0.0
PassiveAggressiveClassifier,0.593,0.0,0.59,0.271,0.978,0.26,-1.0,0.204,0.636
RandomForestClassifier,1.0,0.0,0.0,0.0,0.902,0.0,0.204,-1.0,1.0
SVC,1.0,0.0,0.001,0.0,0.998,0.0,0.636,1.0,-1.0


In [15]:
pd.set_option('display.max_rows', 5000)
errors_per_dataset_per_estimator_df

Unnamed: 0,Unnamed: 1,loss,std_error
abalone,BaggingClassifier,0.37708,0.01305
abalone,BaselineClassifier,0.6744,0.01262
abalone,BernoulliNaiveBayes,0.44888,0.01339
abalone,GaussianNaiveBayes,0.44017,0.01337
abalone,GradientBoostingClassifier,0.38869,0.01313
abalone,NeuralNetworkDeepClassifier,0.37273,0.01302
abalone,PassiveAggressiveClassifier,0.37346,0.01303
abalone,RandomForestClassifier,0.36476,0.01296
abalone,SVC,0.36186,0.01294
acute_inflammation,BaggingClassifier,0.0,0.0
