# General Usage

This example shows the general usage of `PyExperimenter`, from creating an experiment configuration file, through the actual execution of (dummy) experiments, to the extraction of experimental results.

To execute this notebook you need to install:
```
pip install py_experimenter
pip install scikit-learn
```

## Experiment Configuration File
This notebook shows an example execution of `PyExperimenter` based on an experiment configuration file. Further explanation about the usage of `PyExperimenter` can be found in the [documentation](https://tornede.github.io/py_experimenter/usage.html).

In [1]:
import os

content = """
[PY_EXPERIMENTER]
provider = sqlite 
database = automl_conf_2023
table = best_paper_table 

keyfields = dataset, cross_validation_splits:int, seed:int, kernel
dataset = iris
cross_validation_splits = 5
seed = 2:6:2 
kernel = linear, poly, rbf, sigmoid

resultfields = pipeline:LONGTEXT, train_f1:DECIMAL, train_accuracy:DECIMAL, test_f1:DECIMAL, test_accuracy:DECIMAL
resultfields.timestamps = false

[CUSTOM] 
path = sample_data
"""
config_folder_path = 'config'
# Create the config folder if it does not exist yet
os.makedirs(config_folder_path, exist_ok=True)
experiment_configuration_file_path = os.path.join(config_folder_path, 'example_general_usage.cfg')
with open(experiment_configuration_file_path, "w") as f: 
  f.write(content)
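
Since the configuration above follows INI syntax, it can be inspected with Python's standard `configparser` before any experiment is created. The following is only a minimal sketch for checking the file by hand; it makes no claim about how `PyExperimenter` parses the file internally, and the `keyfields` dictionary built here is a helper introduced purely for illustration.

```python
import configparser

# Same keys as the [PY_EXPERIMENTER] section above, shortened for illustration.
content = """
[PY_EXPERIMENTER]
provider = sqlite
database = automl_conf_2023
table = best_paper_table

keyfields = dataset, cross_validation_splits:int, seed:int, kernel
dataset = iris
cross_validation_splits = 5
seed = 2:6:2
kernel = linear, poly, rbf, sigmoid
"""

parser = configparser.ConfigParser()
parser.read_string(content)
section = parser["PY_EXPERIMENTER"]

# Split each "name:type" declaration; fields without a type are stored as None
# (an illustrative choice, not the library's behaviour).
keyfields = {}
for declaration in section["keyfields"].split(","):
    name, _, field_type = declaration.strip().partition(":")
    keyfields[name] = field_type or None

print(keyfields)
for name in keyfields:
    # Every keyfield is expected to have its own line listing its values.
    print(name, "->", section[name])
```

A check like this makes it easy to spot a keyfield that is declared but has no values assigned.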

## Defining the execution function

Next, the execution of a single experiment has to be defined. Note that this is a dummy example whose code is of limited practical use; it is only meant to show the core functionality of `PyExperimenter`.

The method is called with the parameters, i.e. the `keyfields`, of a database entry. The results are passed to the `result_processor`, which writes them into the database as `resultfields`.

In [2]:
import random
import numpy as np

from py_experimenter.result_processor import ResultProcessor

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

def run_ml(parameters: dict, result_processor: ResultProcessor, custom_config: dict):
  seed = parameters['seed']
  random.seed(seed)
  np.random.seed(seed)

  data = load_iris()
  # In case you want to load a file from a path
  # path = os.path.join(custom_config['path'], parameters['dataset'])
  # data = pd.read_csv(path)

  X = data.data
  y = data.target

  model = make_pipeline(StandardScaler(), SVC(kernel=parameters['kernel'], gamma='auto'))  
  result_processor.process_results({
    'pipeline': str(model)
  })

  if parameters['dataset'] != 'iris':
    raise ValueError("Example error")

  scores = cross_validate(model, X, y, 
    cv=parameters['cross_validation_splits'],
    scoring=('accuracy', 'f1_micro'),
    return_train_score=True
  )
  
  result_processor.process_results({
    'train_f1': np.mean(scores['train_f1_micro']),
    'train_accuracy': np.mean(scores['train_accuracy'])
  })

  result_processor.process_results({
    'test_f1': np.mean(scores['test_f1_micro']),
    'test_accuracy': np.mean(scores['test_accuracy'])
  })
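
Because the function only relies on `result_processor.process_results(...)` being called with dictionaries, an experiment function of the same shape can be tried out without a database. The sketch below assumes exactly that contract; `RecordingProcessor` and `toy_experiment` are hypothetical names introduced here and are not part of `PyExperimenter`.

```python
class RecordingProcessor:
    """Hypothetical stand-in that records results instead of writing to a database."""

    def __init__(self):
        self.results = {}

    def process_results(self, results: dict):
        # Merge each partial result, mimicking several consecutive writes.
        self.results.update(results)


def toy_experiment(parameters: dict, result_processor, custom_config: dict):
    # Same three-argument shape as run_ml, but without scikit-learn.
    score = 1.0 / parameters['cross_validation_splits']
    result_processor.process_results({'pipeline': f"toy({parameters['kernel']})"})
    result_processor.process_results({'test_accuracy': score})


recorder = RecordingProcessor()
toy_experiment(
    {'dataset': 'iris', 'cross_validation_splits': 5, 'seed': 2, 'kernel': 'linear'},
    recorder,
    {'path': 'sample_data'},
)
print(recorder.results)
```

Writing results in several small calls, as `run_ml` does, means that partial results such as the `pipeline` string are already recorded if a later step raises an error.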

## Executing PyExperimenter

The actual execution of the PyExperimenter is done in multiple steps. 

### Initialize PyExperimenter
The PyExperimenter is initialized with the previously created configuration file. Additionally, `PyExperimenter` is given a `name`, i.e. job id, which is especially useful for parallel executions of multiple experiments on HPC. 

In [3]:
from py_experimenter.experimenter import PyExperimenter

experimenter = PyExperimenter(experiment_configuration_file_path=experiment_configuration_file_path, name='example_notebook')

### Fill Table

The table is filled based on the configuration file created above with `fill_table_from_config()`: the cartesian product of all keyfield values makes up the content of the table. Additionally, a custom row, i.e. a custom keyfield tuple, is added with `fill_table_with_rows()`.

Note that the table can easily be obtained as a `pandas.DataFrame` via `experimenter.get_table()`.

In [4]:
experimenter.fill_table_from_config()

experimenter.fill_table_with_rows(rows=[
      {'dataset': 'iris', 'cross_validation_splits': 3, 'seed': 42, 'kernel':'linear'}])

# showing database table
experimenter.get_table()

Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
1,2,iris,5,4,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
2,3,iris,5,6,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:18,
3,4,iris,5,2,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
4,5,iris,5,4,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
5,6,iris,5,6,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:20,
6,7,iris,5,2,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
7,8,iris,5,4,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
8,9,iris,5,6,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:18,
9,10,iris,5,2,sigmoid,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.896667,0.896667,0.893333,0.893333,2023-03-24 18:24:19,
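
The cartesian product mentioned above can be sketched with `itertools.product`. This is an illustration of the idea, not the library's implementation: `expand_range` is a hypothetical helper, and treating the upper bound of `2:6:2` as inclusive is an assumption based on the seeds 2, 4 and 6 that appear in the table. The order of the generated rows is not meant to match the order in the database.

```python
from itertools import product


def expand_range(spec: str) -> list:
    """Expand 'start:stop:step', treating stop as inclusive (assumption, see above)."""
    start, stop, step = (int(part) for part in spec.split(":"))
    return list(range(start, stop + 1, step))


keyfield_values = {
    "dataset": ["iris"],
    "cross_validation_splits": [5],
    "seed": expand_range("2:6:2"),
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
}

# One row per combination of keyfield values: 1 * 1 * 3 * 4 = 12 rows.
rows = [
    dict(zip(keyfield_values, combination))
    for combination in product(*keyfield_values.values())
]

# The custom row from the cell above is appended in addition to the product.
rows.append({"dataset": "iris", "cross_validation_splits": 3, "seed": 42, "kernel": "linear"})

print(len(rows))
```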


### Execute PyExperimenter
All experiments are executed one after the other by the same `PyExperimenter` due to `max_experiments=-1`. If just a single one or a predefined number of experiments should be executed, the `-1` has to be replaced by the corresponding number. The `random_order` is especially important when multiple `PyExperimenter` instances run in parallel, e.g. on an HPC cluster, to avoid collisions when accessing the same row of the table.

The first parameter, i.e. `run_ml`, relates to the actual method that should be executed with the given keyfields of the table. 

In [5]:
experimenter.execute(run_ml, max_experiments=-1, random_order=True)

# showing database table
experimenter.get_table() 

Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
1,2,iris,5,4,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
2,3,iris,5,6,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:18,
3,4,iris,5,2,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
4,5,iris,5,4,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
5,6,iris,5,6,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:20,
6,7,iris,5,2,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
7,8,iris,5,4,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
8,9,iris,5,6,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:18,
9,10,iris,5,2,sigmoid,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.896667,0.896667,0.893333,0.893333,2023-03-24 18:24:19,
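
Why a random order helps with parallel workers can be illustrated with a toy model. The sketch below is a simplified simulation under stated assumptions: two workers pick an open experiment at the same moment, and any locking or retry logic a real setup may use is ignored. The `pick` function is hypothetical and does not reflect how `PyExperimenter` selects rows internally.

```python
import random

open_ids = list(range(1, 13))  # 12 open experiments, as created from the configuration


def pick(order_is_random: bool, rng: random.Random) -> int:
    """Toy model of one worker choosing which open experiment to start."""
    return rng.choice(open_ids) if order_is_random else open_ids[0]


rng = random.Random(0)
trials = 10_000

# Count how often two simultaneous workers target the same row.
in_order = sum(pick(False, rng) == pick(False, rng) for _ in range(trials))
shuffled = sum(pick(True, rng) == pick(True, rng) for _ in range(trials))

print(in_order / trials)  # 1.0: both workers always target the first open row
print(shuffled / trials)  # roughly 1/12 in this toy model
```

In this model a fixed order makes every simultaneous pick a collision, whereas a random pick among 12 open rows collides only about one time in twelve.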


### Restart Failed Experiments

Experiments may fail at some point. Such experiments can be reset for another try with `reset_experiments()`. The `status` argument describes which table rows should be reset. In this example all failed experiments, i.e. those having `status==error`, are reset.

In [6]:
experimenter.reset_experiments('error')

# showing database table
experimenter.get_table() 

Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
1,2,iris,5,4,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
2,3,iris,5,6,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:18,
3,4,iris,5,2,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
4,5,iris,5,4,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
5,6,iris,5,6,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:20,
6,7,iris,5,2,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
7,8,iris,5,4,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
8,9,iris,5,6,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:18,
9,10,iris,5,2,sigmoid,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.896667,0.896667,0.893333,0.893333,2023-03-24 18:24:19,


After the reset of failed experiments, they can be executed again as described above. 

In [7]:
experimenter.execute(run_ml, max_experiments=-1, random_order=True)

# showing database table
experimenter.get_table() 

ERROR:root:Traceback (most recent call last):
  File "/home/tornede/remote_development/py_experimenter/py_experimenter/experimenter.py", line 403, in _execution_wrapper
    experiment_function(keyfield_values, result_processor, custom_fields)
  File "/tmp/ipykernel_27154/1244630566.py", line 31, in run_ml
    raise ValueError("Example error")
ValueError: Example error



Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
1,2,iris,5,4,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:19,
2,3,iris,5,6,linear,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-03-24 18:24:18,
3,4,iris,5,2,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
4,5,iris,5,4,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:19,
5,6,iris,5,6,poly,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-03-24 18:24:20,
6,7,iris,5,2,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
7,8,iris,5,4,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:19,
8,9,iris,5,6,rbf,2023-03-24 18:24:18,done,2023-03-24 18:24:18,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-03-24 18:24:18,
9,10,iris,5,2,sigmoid,2023-03-24 18:24:18,done,2023-03-24 18:24:19,example_notebook,vm-tornede3,"Pipeline(steps=[('standardscaler', StandardSca...",0.896667,0.896667,0.893333,0.893333,2023-03-24 18:24:19,


### Generating Result Table


The table contains the results of single experiments. Those can be aggregated, e.g. to compute the mean over all seeds.

In [8]:
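# numeric_only=True restricts the mean to numeric columns; recent pandas versions
# no longer drop text columns (e.g. kernel, pipeline) silently
result_table_agg = experimenter.get_table().groupby(['dataset']).mean(numeric_only=True)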
result_table_agg



dataset,ID,cross_validation_splits,seed,train_f1,train_accuracy,test_f1,test_accuracy
iris,7.692308,4.846154,6.923077,0.947692,0.947692,0.943077,0.943077
new_data,23.0,3.0,42.0,NaN,NaN,NaN,NaN
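
Grouping by `dataset` alone also averages over the kernels. To obtain the mean over the seeds for each kernel, the kernel has to be part of the grouping key. The following library-free sketch shows the idea on the `test_accuracy` values of the ten rows displayed earlier; it is an illustration on copied values, not output of the notebook.

```python
from collections import defaultdict
from statistics import mean

# (kernel, seed, test_accuracy) copied from the ten rows displayed above.
rows = [
    ("linear", 2, 0.966667), ("linear", 4, 0.966667), ("linear", 6, 0.966667),
    ("poly", 2, 0.933333), ("poly", 4, 0.933333), ("poly", 6, 0.933333),
    ("rbf", 2, 0.966667), ("rbf", 4, 0.966667), ("rbf", 6, 0.966667),
    ("sigmoid", 2, 0.893333),
]

# Collect the scores of all seeds per kernel.
by_kernel = defaultdict(list)
for kernel, _seed, test_accuracy in rows:
    by_kernel[kernel].append(test_accuracy)

# Average over the seeds of each kernel.
mean_over_seeds = {kernel: round(mean(values), 6) for kernel, values in by_kernel.items()}
print(mean_over_seeds)
```

With pandas, the same aggregation can be expressed as `experimenter.get_table().groupby(['dataset', 'kernel'])[['test_f1', 'test_accuracy']].mean()`.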


### Printing LaTeX Table

As a `pandas.DataFrame` can easily be printed as a LaTeX table, here is example code for one of the above result columns.

In [9]:
print(result_table_agg.to_latex(columns=['test_f1'], index_names=['dataset']))

\begin{tabular}{lr}
\toprule
{} &   test\_f1 \\
dataset  &           \\
\midrule
iris     &  0.943077 \\
new\_data &       NaN \\
\bottomrule
\end{tabular}



