In [1]:
from download_delgado.delgado_datasets import DownloadAndConvertDelgadoDatasets
from mlaut.data import Data
from mlaut.estimators.estimators import instantiate_default_estimators
from mlaut.experiments import Orchestrator
from mlaut.analyze_results import AnalyseResults
from mlaut.analyze_results.scores import ScoreAccuracy
import pandas as pd
import numpy as np
from mlaut.estimators.generic_estimator import Generic_Estimator

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
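from mlaut.estimators.nn_estimators import Deep_NN_Classifier
# Names used by the Keras model defined in Step 4.
# Assumption: OverwrittenSequentialClassifier lives in the same MLAUT module and
# Dense/Dropout come from the standalone keras package; adjust to your installation.
from mlaut.estimators.nn_estimators import OverwrittenSequentialClassifier
from keras.layers import Dense, Dropout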

MLAUT is a modelling and workflow toolbox in Python, written with the aim of simplifying large-scale benchmarking of machine learning strategies, e.g., validation, evaluation and comparison with respect to predictive/task-specific performance or runtime.

In this example we demonstrate how users can define their own estimator objects from scratch and use them to run experiments.

The diagram below sketches the typical MLAUT workflow.

<img src="img/workflow.png?2">

### Step 1: Download the datasets

For the purposes of this demonstration we will use datasets that can be downloaded from:
https://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The datasets were used in the study <cite data-cite="delgado2014">Do we Need Hundreds of Classifiers to solve Real World Classification Problems? (Delgado, 2014)</cite> and were originally downloaded from the UCI Machine Learning Repository.

The code below uses this data and creates two arrays. The first one contains the actual datasets and the second one the metadata associated with each dataset. The metadata should be in the form of a dictionary and must contain, at a minimum:

```python
{'class_name': ..., #name of the column containing the labels
 'dataset_name': ... #name of the dataset
}
```

The data needs to be stored in an MLAUT-compatible format before the experiments can be run.

In this example we assume that all datasets fit into memory and can be saved to the database with a single command. If this is not the case, the user can break the process up dataset by dataset, or make it even more granular.

Neither this step nor the code for downloading the collection of datasets is MLAUT-specific, so they are not discussed in detail in this demo.

In [2]:
delgado = DownloadAndConvertDelgadoDatasets()
datasets, metadata = delgado.download_and_extract_datasets(verbose = False)

Error: Dataset Delgado_data/molec-biol-protein-second has a different number of arff files


`datasets`: array of pandas DataFrames, one per dataset<br>
`metadata`: array of dictionaries containing the metadata for each dataset<br>

The two arrays must be in the same order, so that the i-th entry of `metadata` describes the i-th entry of `datasets`.
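
A quick sanity check before storing the data can catch mismatched pairs early. The sketch below is not part of MLAUT: `check_dataset_metadata_pairs` is a hypothetical helper that assumes each dataset exposes a `.columns` attribute, as a pandas DataFrame does. A `SimpleNamespace` stands in for a DataFrame so that the snippet runs without extra dependencies, and the column and dataset names are made up.

```python
from types import SimpleNamespace

REQUIRED_KEYS = ("class_name", "dataset_name")


def check_dataset_metadata_pairs(datasets, metadata):
    """Return a list of problems; an empty list means the pairs look consistent."""
    problems = []
    if len(datasets) != len(metadata):
        problems.append(f"{len(datasets)} datasets but {len(metadata)} metadata entries")
    for position, (dataset, meta) in enumerate(zip(datasets, metadata)):
        missing = [key for key in REQUIRED_KEYS if key not in meta]
        if missing:
            problems.append(f"entry {position}: missing keys {missing}")
        elif meta["class_name"] not in list(dataset.columns):
            problems.append(
                f"entry {position}: label column {meta['class_name']!r} "
                f"not found in {meta['dataset_name']!r}"
            )
    return problems


# Toy stand-ins for DataFrames (illustrative names only).
toy_datasets = [SimpleNamespace(columns=["f1", "f2", "clase"]),
                SimpleNamespace(columns=["f1", "target"])]
toy_metadata = [{"class_name": "clase", "dataset_name": "toy_a"},
                {"class_name": "clase", "dataset_name": "toy_b"}]  # wrong label column

print(check_dataset_metadata_pairs(toy_datasets, toy_metadata))
```

With real data, the same function could be called on `datasets` and `metadata` directly.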

### Step 2: Store the datasets in MLAUT format

The next step is to store the datasets in a format that can be used in MLAUT.

Interaction with MLAUT's data structures is done through the `Data()` class, which serves as an interface for storing and loading data in the HDF5 database that MLAUT uses for data manipulation.

In [3]:
data = Data()
input_io = data.open_hdf5('data/delgado.hdf5', mode='a')
out_io = data.open_hdf5('data/classification.hdf5', mode='a')

We are now ready to store the data in the HDF5 database. 

`save_loc_hdf5` indicates the HDF5 group in which the datasets will be saved.<br>
`datasets` are the actual datasets in pandas format.<br>
`dts_metadata` is the metadata array attached to each dataset.<br>
`input_io` is the class object that interfaces the HDF5 file.

In [4]:
data.pandas_to_db(save_loc_hdf5='delgado_datasets/', datasets=datasets, 
                  dts_metadata=metadata, input_io=input_io)

### Step 3: Split datasets

The next step is to split the data into training and test sets.

Unless otherwise specified we use $\dfrac{2}{3}$ of the data for training and $\dfrac{1}{3}$ for testing. We do not change or move the original data in this process. Instead we store the train/test indices in a separate HDF5 database.
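
The idea of splitting by index, rather than copying rows, can be sketched in a few lines. This is an illustration only and not MLAUT's implementation: the function name `train_test_indices`, the shuffling and the fixed seed are assumptions made for the example.

```python
import random


def train_test_indices(n_rows, train_fraction=2 / 3, seed=0):
    """Return (train, test) row indices; the data itself is never touched."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)           # reproducible shuffle
    n_train = int(round(n_rows * train_fraction))  # 2/3 of the rows by default
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = train_test_indices(150)
print(len(train_idx), len(test_idx))  # 100 50
```

Storing only the two index lists keeps the original dataset unchanged and makes the split reproducible for every estimator.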


In [5]:
dts_names_list, dts_names_list_full_path = data.list_datasets(hdf5_io=input_io, hdf5_group='delgado_datasets/')
split_dts_list = data.split_datasets(hdf5_in=input_io, hdf5_out=out_io, dataset_paths=dts_names_list_full_path, verbose=False)

`dts_names_list`: names of the datasets saved inside the HDF5 file <br>
`dts_names_list_full_path`: full path to the datasets inside the HDF5 database <br>
`split_dts_list`: path to the train/test indices of the split datasets

### Step 4: Define the estimators

In this advanced use case we show how users can create their own estimator objects.

In [None]:
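prop = {'estimator_family': ['SVM'],
        'tasks': ['CLASSIFICATION'],
        'name': 'SVC'}

# Candidate values for the grid search: 13 linearly spaced values per parameter.
hyperparameters = {
    'C': np.linspace(2**(-5), 2**(15), 13),
    'gamma': np.linspace(2**(-15), 2**3, 13)
}
est = GridSearchCV(
    SVC(),
    hyperparameters,
    verbose=True,
    n_jobs=-1)

# Wrap the grid search in an MLAUT estimator object.
# Note: this rebinds the name SVC and shadows sklearn.svm.SVC imported above.
SVC = Generic_Estimator(
    properties_dict=prop,
    estimator=est)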
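
Note that `np.linspace` spaces the candidate values evenly on a linear scale. The sketch below compares that grid for `C` with a logarithmic one over the same range, using plain Python so that it runs without NumPy. Whether a logarithmic grid is preferable here is an assumption, not something this notebook states; in NumPy the logarithmic variant would be `np.logspace(-5, 15, 13, base=2)`. The helper names are hypothetical.

```python
def linear_grid(low, high, num):
    """Evenly spaced values between low and high (what np.linspace produces)."""
    step = (high - low) / (num - 1)
    return [low + i * step for i in range(num)]


def log2_grid(low_exp, high_exp, num):
    """Powers of two with evenly spaced exponents between low_exp and high_exp."""
    step = (high_exp - low_exp) / (num - 1)
    return [2.0 ** (low_exp + i * step) for i in range(num)]


c_linear = linear_grid(2 ** -5, 2 ** 15, 13)
c_log = log2_grid(-5, 15, 13)

# Count how many candidate values of C are at most 50 in each grid.
print(sum(c <= 50 for c in c_linear), sum(c <= 50 for c in c_log))
```

In the linear grid only the first candidate is small; every other value is in the thousands, so the search barely explores weak regularisation penalties.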

In [None]:
prop = {'estimator_family': ['ENSEMBLE_METHODS'],
        'tasks': ['CLASSIFICATION'],
        'name': 'RandomForestClassifier'}
hyperparameters = {
    'n_estimators': [10, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'max_depth': [5, 15, None]
}
estimator = GridSearchCV(RandomForestClassifier(),
                         hyperparameters,
                         verbose=True,
                         n_jobs=-1)
# Pass the random forest grid search defined above
# (not `est`, which holds the SVC grid search from the previous cell).
RF = Generic_Estimator(
    properties_dict=prop,
    estimator=estimator)

In [8]:
prop = {'estimator_family':['NEURAL_NETWORKS'], 
            'tasks':['CLASSIFICATION'], 
            'name':'NN-3layer'}

hyperparameters = {'epochs': [50,100], 
                    'batch_size': 0,  
                    'learning_rate':0.001,
                    'loss': 'mean_squared_error',
                    'optimizer': 'Adam',
                    'metrics' : ['accuracy']}
def keras_model(num_classes, input_dim):
    nn_deep_model = OverwrittenSequentialClassifier()
    nn_deep_model.add(Dense(288, input_dim=input_dim, activation='relu'))
    nn_deep_model.add(Dense(144, activation='relu'))
    nn_deep_model.add(Dropout(0.5))
    nn_deep_model.add(Dense(12, activation='relu'))
    nn_deep_model.add(Dense(num_classes, activation='softmax'))
    return nn_deep_model
deep_nn = Deep_NN_Classifier(hyperparameters=hyperparameters,
                            keras_model=keras_model)

Please note the different approach to defining custom estimators.

The user can either start with a blank canvas and construct an estimator object from a `Generic_Estimator` class or use an MLAUT estimator (`Deep_NN_Classifier` in this example) and customize it.

In [None]:
estimators = [SVC, RF, deep_nn]

### Step 5: Run the experiments

In this step we select the estimators to use in the study. In this example we listed the estimators explicitly, but MLAUT also supports searching for estimators by task or estimator family.
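
Selection by task or family can be pictured as a filter over properties dictionaries like those defined in Step 4. The helper below is a hypothetical illustration of that idea and not MLAUT's search API.

```python
def select_by_properties(properties_list, task=None, family=None):
    """Return the names of estimators matching the given task and/or family."""
    def matches(values, wanted):
        # No filter requested, or a case-insensitive match on any listed value.
        return wanted is None or wanted.lower() in (v.lower() for v in values)

    return [p["name"] for p in properties_list
            if matches(p["tasks"], task) and matches(p["estimator_family"], family)]


catalogue = [
    {"estimator_family": ["SVM"], "tasks": ["CLASSIFICATION"], "name": "SVC"},
    {"estimator_family": ["ENSEMBLE_METHODS"], "tasks": ["CLASSIFICATION"],
     "name": "RandomForestClassifier"},
    {"estimator_family": ["NEURAL_NETWORKS"], "tasks": ["CLASSIFICATION"],
     "name": "NN-3layer"},
]

print(select_by_properties(catalogue, family="svm"))  # ['SVC']
```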

The user also needs to instantiate the test orchestrator object by providing references to the input and output database files and the location of the datasets inside the HDF5 database.

The final step is to run the experiments by invoking the `run()` method.

This step could take a substantial amount of time depending on the number and size of datasets and the number of estimators that we wish to train.
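
A rough count of model fits helps to anticipate the runtime. The sketch below counts the parameter combinations in the two grids from Step 4 and the resulting number of fits per dataset for a cross-validated grid search that refits the best model once at the end. The number of folds is an assumption (scikit-learn's default has changed between versions), and the helper names are hypothetical.

```python
from math import prod


def grid_size(param_grid):
    """Number of parameter combinations in an exhaustive grid."""
    return prod(len(values) for values in param_grid.values())


def total_fits(n_candidates, cv_folds, refit=True):
    """Fits per dataset: one per candidate and fold, plus an optional final refit."""
    return n_candidates * cv_folds + (1 if refit else 0)


svc_grid = {"C": range(13), "gamma": range(13)}  # 13 candidate values each
rf_grid = {"n_estimators": [10, 50, 100],
           "max_features": ["auto", "sqrt", "log2", None],
           "max_depth": [5, 15, None]}

for name, grid in [("SVC", svc_grid), ("RandomForest", rf_grid)]:
    candidates = grid_size(grid)
    print(name, candidates, total_fits(candidates, cv_folds=5))
```

Multiplying these counts by the number of datasets gives a sense of why the full run can take a long time.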

In [None]:
orchest = Orchestrator(hdf5_input_io=input_io, hdf5_output_io=out_io, dts_names=dts_names_list,
                 original_datasets_group_h5_path='delgado_datasets/')
orchest.run(modelling_strategies=estimators, verbose=False)

### Step 6: Make predictions on the test sets

Once the estimators are trained, they are used to make predictions on the test sets. These predictions are then used to perform the statistical tests.

In [None]:
orchest.predict_all(trained_models_dir='data/trained_models', estimators=estimators, verbose=False)

### Step 7: Analyze the results

The last step in the pipeline is to analyze the results of the experiments.

The `AnalyseResults` class takes as inputs the two database files and the HDF5 groups holding the original datasets and the predictions. The metric used to compute the prediction errors (here `ScoreAccuracy`) is passed to the `prediction_errors()` method.

The `prediction_errors()` method returns three objects. Two of them are used below: the `errors_per_estimator` dictionary, which feeds the statistical tests, and `errors_per_dataset_per_estimator_df`, a dataframe with the loss of each estimator on each dataset that the user can examine directly.
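
For reference, accuracy is simply the fraction of test examples whose predicted label equals the true label. The function below is a minimal plain-Python sketch of that definition; it is not the `ScoreAccuracy` implementation, and the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Three of the four toy predictions are correct.
print(accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"]))  # 0.75
```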

In [None]:
analyze = AnalyseResults(hdf5_output_io=out_io, 
                         hdf5_input_io=input_io,
                         input_h5_original_datasets_group='delgado_datasets/', 
                         output_h5_predictions_group='experiments/predictions/')
score_accuracy = ScoreAccuracy()


(errors_per_estimator, 
 errors_per_dataset_per_estimator, 
 errors_per_dataset_per_estimator_df) = analyze.prediction_errors(score_accuracy, estimators)

Below we show the results of the various statistical tests that are supported by MLAUT.

#### t-test

In [None]:
t_test, t_test_df = analyze.t_test(errors_per_estimator)
t_test_df

#### sign test

In [None]:
sign_test, sign_test_df = analyze.sign_test(errors_per_estimator)
sign_test_df
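
As a reminder of what a sign test measures, the sketch below counts on how many datasets one estimator beats another and computes a two-sided p-value from the binomial distribution with success probability 0.5. It is a textbook version under the assumption that ties are discarded; MLAUT's `sign_test` may differ in detail. The accuracy values are made up for illustration.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided sign test on paired scores; ties are discarded."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0  # nothing to compare
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy accuracies of two estimators on eight datasets (illustrative only).
acc_a = [0.91, 0.84, 0.77, 0.80, 0.69, 0.95, 0.88, 0.73]
acc_b = [0.89, 0.80, 0.79, 0.75, 0.65, 0.90, 0.85, 0.70]

print(sign_test_p_value(acc_a, acc_b))
```

Here the first estimator wins on seven of the eight datasets, which gives a p-value of 18/256.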

#### t-test with Bonferroni correction

In [None]:
t_test_bonferroni_df = analyze.t_test_with_bonferroni_correction(errors_per_estimator)
t_test_bonferroni_df
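
The Bonferroni correction itself is simple: with m pairwise comparisons, each p-value is multiplied by m and capped at 1 (equivalently, each raw p-value is compared against alpha / m). The sketch below applies it to made-up p-values; the pair labels and numbers are purely illustrative and are not results from this experiment.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


# Hypothetical raw p-values for the three pairwise comparisons.
raw_p = {("SVC", "RF"): 0.01, ("SVC", "NN"): 0.04, ("RF", "NN"): 0.50}

for pair, p_adjusted in bonferroni_adjust(raw_p).items():
    print(pair, round(p_adjusted, 4), "significant" if p_adjusted < 0.05 else "not significant")
```

With three comparisons, a raw p-value of 0.04 is no longer significant at the 0.05 level after correction.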

#### Wilcoxon test

In [None]:
import warnings
warnings.filterwarnings('ignore')
wilcoxon_test, wilcoxon_test_df = analyze.wilcoxon_test(errors_per_estimator)
wilcoxon_test_df

#### Friedman test

In [None]:
friedman_test, friedman_test_df = analyze.friedman_test(errors_per_estimator)
friedman_test_df
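
The Friedman test works on ranks: on each dataset the estimators are ranked by score, and the test asks whether the mean ranks differ more than chance would allow. The sketch below computes mean ranks (ties share the average rank) and the classic Friedman chi-square statistic without tie correction, which is compared against a chi-square distribution with k - 1 degrees of freedom. It is a textbook illustration with made-up scores, not MLAUT's implementation.

```python
def mean_ranks(scores):
    """Average rank of each estimator over datasets (rank 1 = highest score)."""
    names = list(scores)
    n_datasets = len(scores[names[0]])
    rank_sums = dict.fromkeys(names, 0.0)
    for d in range(n_datasets):
        ordered = sorted(names, key=lambda name: scores[name][d], reverse=True)
        start = 0
        while start < len(ordered):
            end = start
            # Extend the group while the next estimator has the same score (a tie).
            while (end + 1 < len(ordered)
                   and scores[ordered[end + 1]][d] == scores[ordered[start]][d]):
                end += 1
            shared_rank = (start + end) / 2 + 1  # tied estimators share the average rank
            for name in ordered[start:end + 1]:
                rank_sums[name] += shared_rank
            start = end + 1
    return {name: rank_sums[name] / n_datasets for name in names}


def friedman_statistic(ranks, n_datasets):
    """Friedman chi-square statistic computed from mean ranks (no tie correction)."""
    k = len(ranks)
    sum_sq = sum(r ** 2 for r in ranks.values())
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Toy accuracies on four datasets (illustrative only).
toy_scores = {"SVC": [0.90, 0.85, 0.80, 0.95],
              "RF": [0.88, 0.80, 0.78, 0.90],
              "NN": [0.70, 0.75, 0.60, 0.85]}

ranks = mean_ranks(toy_scores)
print(ranks, friedman_statistic(ranks, n_datasets=4))
```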

#### Nemenyi test

In [None]:
nemenyi_test = analyze.nemenyi(errors_per_estimator)
nemenyi_test
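
One common way to read a Nemenyi test is through the critical difference (CD): two estimators differ significantly if their mean ranks differ by more than CD = q_alpha * sqrt(k (k + 1) / (6 N)), where k is the number of estimators and N the number of datasets. The sketch below evaluates this formula with the usual q values for alpha = 0.05. It is an illustration of the formula and not the output format of `analyze.nemenyi`; the dataset count of 100 is an arbitrary example value.

```python
from math import sqrt

# Critical values q_alpha for alpha = 0.05, indexed by the number of estimators k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def critical_difference(n_estimators, n_datasets, q_table=Q_ALPHA_005):
    """Minimum difference in mean ranks that counts as significant."""
    q_alpha = q_table[n_estimators]
    return q_alpha * sqrt(n_estimators * (n_estimators + 1) / (6 * n_datasets))


cd = critical_difference(n_estimators=3, n_datasets=100)
print(round(cd, 3))
```

Because CD shrinks with the square root of the number of datasets, a larger benchmark collection can separate estimators whose mean ranks are closer together.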


In [None]:
pd.set_option('display.max_rows', 5000)
errors_per_dataset_per_estimator_df