# Example: General Usage

This example shows the general usage of `PyExperimenter`, from creating an experiment configuration file, through the execution of (dummy) experiments, to the extraction of experimental results.

To execute this notebook you need to install:
```
pip install py_experimenter
pip install scikit-learn
```

## Experiment Configuration File
This notebook shows an example execution of `PyExperimenter` based on an experiment configuration file. Further explanation about the usage of `PyExperimenter` can be found in the [documentation](https://tornede.github.io/py_experimenter/usage.html).

In [1]:
import os

content = """
[PY_EXPERIMENTER]
provider = sqlite 
database = py_experimenter
table = example_general_usage 

keyfields = dataset, cross_validation_splits:int, seed:int, kernel
dataset = iris
cross_validation_splits = 5
seed = 2:6:2 
kernel = linear, poly, rbf, sigmoid

resultfields = pipeline:LONGTEXT, train_f1:DECIMAL, train_accuracy:DECIMAL, test_f1:DECIMAL, test_accuracy:DECIMAL
resultfields.timestamps = false

[CUSTOM] 
path = sample_data

[codecarbon]
offline_mode = False
measure_power_secs = 25
tracking_mode = process
log_level = error
save_to_file = True
output_dir = output/CodeCarbon
"""
# Create config directory if it does not exist
os.makedirs('config', exist_ok=True)
    
# Create config file
experiment_configuration_file_path = os.path.join('config', 'example_general_usage.cfg')
with open(experiment_configuration_file_path, "w") as f: 
  f.write(content)

## Defining the execution function

Next, the execution of a single experiment has to be defined. Note that this is a dummy example with only a minimal amount of meaningful code; it is meant to show the core functionality of `PyExperimenter`.

The function is called with the parameters, i.e. the `keyfields`, of a database entry. Its results are passed to the `result_processor`, which writes them into the database as `resultfields`.

In [2]:
import random
import numpy as np

from py_experimenter.result_processor import ResultProcessor

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

def run_ml(parameters: dict, result_processor: ResultProcessor, custom_config: dict):
  seed = parameters['seed']
  random.seed(seed)
  np.random.seed(seed)

  data = load_iris()
  # In case you want to load a file from a path
  # path = os.path.join(custom_config['path'], parameters['dataset'])
  # data = pd.read_csv(path)

  X = data.data
  y = data.target

  model = make_pipeline(StandardScaler(), SVC(kernel=parameters['kernel'], gamma='auto'))  
  result_processor.process_results({
    'pipeline': str(model)
  })

  if parameters['dataset'] != 'iris':
    raise ValueError("Example error")

  scores = cross_validate(model, X, y, 
    cv=parameters['cross_validation_splits'],
    scoring=('accuracy', 'f1_micro'),
    return_train_score=True
  )
  
  result_processor.process_results({
    'train_f1': np.mean(scores['train_f1_micro']),
    'train_accuracy': np.mean(scores['train_accuracy'])
  })

  result_processor.process_results({
    'test_f1': np.mean(scores['test_f1_micro']),
    'test_accuracy': np.mean(scores['test_accuracy'])
  })
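
To check such a function without a database, the three-argument contract described above can be exercised with a stand-in object. The following is a minimal sketch: `RecordingResultProcessor` and `run_dummy` are hypothetical names that are not part of `PyExperimenter`; the stand-in only mimics the `process_results()` call used in `run_ml` and collects the dictionaries it receives.

```python
class RecordingResultProcessor:
    """Hypothetical stand-in (not part of PyExperimenter) that records results in memory."""

    def __init__(self):
        self.results = {}

    def process_results(self, results: dict) -> None:
        # Merge each partial result, similar to several writes to the same table row.
        self.results.update(results)


def run_dummy(parameters: dict, result_processor, custom_config: dict):
    # Same three-argument signature as run_ml; the values written here are made up.
    result_processor.process_results({'pipeline': f"dummy(kernel={parameters['kernel']})"})
    result_processor.process_results({'test_accuracy': 1.0 / parameters['seed']})


recorder = RecordingResultProcessor()
run_dummy(
    {'dataset': 'iris', 'cross_validation_splits': 5, 'seed': 2, 'kernel': 'linear'},
    recorder,
    {'path': 'sample_data'},
)
print(recorder.results)
```

Calling `process_results()` several times, as `run_ml` does, lets partial results be stored before a later step of the experiment fails.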

## Executing PyExperimenter

The actual execution of the PyExperimenter is done in multiple steps. 

### Initialize PyExperimenter
The PyExperimenter is initialized with the previously created configuration file. Additionally, `PyExperimenter` is given a `name`, i.e. job id, which is especially useful for parallel executions of multiple experiments on HPC. 

In [3]:
from py_experimenter.experimenter import PyExperimenter

experimenter = PyExperimenter(experiment_configuration_file_path=experiment_configuration_file_path, name='example_notebook')

### Fill Table

The table is filled based on the configuration file created above using `fill_table_from_config()`: the Cartesian product of all keyfield values makes up the content of the table. Additionally, a custom row, i.e. a custom keyfield tuple, is added with `fill_table_with_rows()`.

Note that the table can easily be obtained as a `pandas.DataFrame` via `experimenter.get_table()`.

In [4]:
experimenter.fill_table_from_config()

experimenter.fill_table_with_rows(rows=[
      {'dataset': 'error_dataset', 'cross_validation_splits': 3, 'seed': 42, 'kernel':'linear'}])

# showing database table
experimenter.get_table()

Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-06-17 08:59:12,created,,,,,,,,,,
1,2,iris,5,4,linear,2023-06-17 08:59:12,created,,,,,,,,,,
2,3,iris,5,6,linear,2023-06-17 08:59:12,created,,,,,,,,,,
3,4,iris,5,2,poly,2023-06-17 08:59:12,created,,,,,,,,,,
4,5,iris,5,4,poly,2023-06-17 08:59:12,created,,,,,,,,,,
5,6,iris,5,6,poly,2023-06-17 08:59:12,created,,,,,,,,,,
6,7,iris,5,2,rbf,2023-06-17 08:59:12,created,,,,,,,,,,
7,8,iris,5,4,rbf,2023-06-17 08:59:12,created,,,,,,,,,,
8,9,iris,5,6,rbf,2023-06-17 08:59:12,created,,,,,,,,,,
9,10,iris,5,2,sigmoid,2023-06-17 08:59:12,created,,,,,,,,,,
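
The Cartesian product of the keyfield values can be reproduced with the standard library. This is only an illustrative sketch, not the code `PyExperimenter` runs internally: `expand_range` is a hypothetical helper, and treating the end of `2:6:2` as inclusive is an assumption inferred from the seeds 2, 4 and 6 shown in the table above.

```python
from itertools import product


def expand_range(spec: str) -> list:
    """Expand 'start:end:step' into integers (hypothetical helper, end assumed inclusive)."""
    start, end, step = (int(part) for part in spec.split(':'))
    return list(range(start, end + 1, step))


# Keyfield values as given in the configuration file above.
keyfields = {
    'dataset': ['iris'],
    'cross_validation_splits': [5],
    'seed': expand_range('2:6:2'),
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
}

# One dict per combination; the row order may differ from the database table.
rows = [dict(zip(keyfields, combination)) for combination in product(*keyfields.values())]
print(len(rows))
print(rows[0])
```

With one dataset, one split setting, three seeds and four kernels, this yields 12 keyfield combinations.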


### Execute PyExperimenter
All experiments are executed one after the other by the same `PyExperimenter` due to `max_experiments=-1`. If only a single experiment or a predefined number of experiments should be executed, replace `-1` with the corresponding number.

The first parameter, i.e. `run_ml`, relates to the actual method that should be executed with the given keyfields of the table. 

In [5]:
experimenter.execute(run_ml, max_experiments=-1)

# showing database table
experimenter.get_table() 

ERROR:root:Traceback (most recent call last):
  File "/home/lukas/development/code_projects/py_experimenter/py_experimenter/experimenter.py", line 382, in _execution_wrapper
    experiment_function(keyfield_values, result_processor, custom_fields)
  File "/tmp/ipykernel_28275/1244630566.py", line 31, in run_ml
    raise ValueError("Example error")
ValueError: Example error



Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:13,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:18,
1,2,iris,5,4,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:18,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:24,
2,3,iris,5,6,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:24,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:29,
3,4,iris,5,2,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:30,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:35,
4,5,iris,5,4,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:35,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:41,
5,6,iris,5,6,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:41,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:46,
6,7,iris,5,2,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:46,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 08:59:52,
7,8,iris,5,4,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:52,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 08:59:57,
8,9,iris,5,6,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:57,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 09:00:03,
9,10,iris,5,2,sigmoid,2023-06-17 08:59:12,done,2023-06-17 09:00:03,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.896667,0.896667,0.893333,0.893333,2023-06-17 09:00:09,


### Restart Failed Experiments

Since experiments can fail, failed experiments can be reset for another try with `reset_experiments()`. The status argument determines which table rows are reset. In this example all failed experiments, i.e. those with `status == 'error'`, are reset. Experiments can also be reset based on multiple statuses by passing several of them, e.g. `experimenter.reset_experiments('error', 'done')`. In that case, all experiments with status 'error' or 'done' are reset.

In [6]:
experimenter.reset_experiments('error')

# showing database table
experimenter.get_table() 

Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:13,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:18,
1,2,iris,5,4,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:18,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:24,
2,3,iris,5,6,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:24,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:29,
3,4,iris,5,2,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:30,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:35,
4,5,iris,5,4,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:35,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:41,
5,6,iris,5,6,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:41,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:46,
6,7,iris,5,2,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:46,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 08:59:52,
7,8,iris,5,4,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:52,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 08:59:57,
8,9,iris,5,6,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:57,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 09:00:03,
9,10,iris,5,2,sigmoid,2023-06-17 08:59:12,done,2023-06-17 09:00:03,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.896667,0.896667,0.893333,0.893333,2023-06-17 09:00:09,


After the reset of failed experiments, they can be executed again as described above. 

In [7]:
experimenter.execute(run_ml, max_experiments=-1)

# showing database table
experimenter.get_table() 

ERROR:root:Traceback (most recent call last):
  File "/home/lukas/development/code_projects/py_experimenter/py_experimenter/experimenter.py", line 382, in _execution_wrapper
    experiment_function(keyfield_values, result_processor, custom_fields)
  File "/tmp/ipykernel_28275/1244630566.py", line 31, in run_ml
    raise ValueError("Example error")
ValueError: Example error



Unnamed: 0,ID,dataset,cross_validation_splits,seed,kernel,creation_date,status,start_date,name,machine,pipeline,train_f1,train_accuracy,test_f1,test_accuracy,end_date,error
0,1,iris,5,2,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:13,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:18,
1,2,iris,5,4,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:18,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:24,
2,3,iris,5,6,linear,2023-06-17 08:59:12,done,2023-06-17 08:59:24,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.971667,0.971667,0.966667,0.966667,2023-06-17 08:59:29,
3,4,iris,5,2,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:30,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:35,
4,5,iris,5,4,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:35,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:41,
5,6,iris,5,6,poly,2023-06-17 08:59:12,done,2023-06-17 08:59:41,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.936667,0.936667,0.933333,0.933333,2023-06-17 08:59:46,
6,7,iris,5,2,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:46,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 08:59:52,
7,8,iris,5,4,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:52,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 08:59:57,
8,9,iris,5,6,rbf,2023-06-17 08:59:12,done,2023-06-17 08:59:57,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.975,0.975,0.966667,0.966667,2023-06-17 09:00:03,
9,10,iris,5,2,sigmoid,2023-06-17 08:59:12,done,2023-06-17 09:00:03,example_notebook,Worklaptop,"Pipeline(steps=[('standardscaler', StandardSca...",0.896667,0.896667,0.893333,0.893333,2023-06-17 09:00:09,


### Generating Result Table


The table contains the results of single experiments. Those can be aggregated, e.g. to compute the mean per dataset over all seeds and kernels.

In [8]:
result_table_agg = experimenter.get_table().groupby(['dataset']).mean(numeric_only = True)
result_table_agg

dataset,ID,cross_validation_splits,seed,train_f1,train_accuracy,test_f1,test_accuracy
error_dataset,14.0,3.0,42.0,,,,
iris,6.5,5.0,4.0,0.945,0.945,0.94,0.94
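
Grouping by `dataset` alone averages over kernels as well as seeds. To average over seeds only, the kernel can be added to the grouping keys. The sketch below uses a small hand-built `pandas.DataFrame` with a few values copied from the result table above instead of `experimenter.get_table()`, so it runs without a database; the variable names are illustrative.

```python
import pandas as pd

# A few rows copied from the result table above (one row per seed and kernel).
table = pd.DataFrame({
    'dataset': ['iris'] * 6,
    'kernel': ['linear', 'linear', 'linear', 'poly', 'poly', 'poly'],
    'seed': [2, 4, 6, 2, 4, 6],
    'test_f1': [0.966667, 0.966667, 0.966667, 0.933333, 0.933333, 0.933333],
})

# Mean over seeds for every (dataset, kernel) combination.
per_kernel = table.groupby(['dataset', 'kernel'])[['test_f1']].mean()
print(per_kernel)
```

The same `groupby` keys should work on the full table returned by `experimenter.get_table()`, assuming it contains the `dataset` and `kernel` columns shown above.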


### Printing LaTeX Table

As a `pandas.DataFrame` can easily be printed as a LaTeX table, here is example code for one of the above result columns.

In [9]:
print(result_table_agg[['test_f1']].style.to_latex())

\begin{tabular}{lr}
 & test_f1 \\
dataset &  \\
error_dataset & nan \\
iris & 0.940000 \\
\end{tabular}



### CodeCarbon
[CodeCarbon](https://tornede.github.io/py_experimenter/usage/experiment_configuration_file.html#codecarbon) is integrated into `PyExperimenter` to provide information about the carbon emissions of experiments. `CodeCarbon` will create a table with suffix `_codecarbon` in the database, each row containing information about the carbon emissions of a single experiment.

In [10]:
experimenter.get_codecarbon_table()

Unnamed: 0,ID,experiment_id,codecarbon_timestamp,project_name,run_id,duration_seconds,emissions_kg,emissions_rate_kg_sec,cpu_power_watt,gpu_power_watt,...,cpu_count,cpu_model,gpu_count,gpu_model,longitude,latitude,ram_total_size,tracking_mode,on_cloud,offline_mode
0,1,1,2023-06-17T08:59:18,codecarbon,451bc2e9-8c7f-416b-80f3-4ed0ef44cdff,0.121472,3.08456e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
1,2,2,2023-06-17T08:59:24,codecarbon,36f4b99e-b138-4c3f-b833-f6824feafa8f,0.147891,3.754468e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
2,3,3,2023-06-17T08:59:29,codecarbon,7ed6d96f-68b1-4343-a1c2-7b26ec4bad4b,0.147379,3.877488e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
3,4,4,2023-06-17T08:59:35,codecarbon,53826b6e-933d-4537-9477-0ff1b9afd5d1,0.125964,3.283903e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
4,5,5,2023-06-17T08:59:41,codecarbon,640bbbbf-c5a4-4706-94fa-e26bc8eb4c25,0.134753,3.522335e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
5,6,6,2023-06-17T08:59:46,codecarbon,b67e2e24-72ac-4ca6-976b-440454c79416,0.140218,3.614312e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
6,7,7,2023-06-17T08:59:52,codecarbon,b411da2a-9809-4f4f-bfbd-88d62bcde67b,0.131839,3.342969e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
7,8,8,2023-06-17T08:59:57,codecarbon,244f3cff-c8bf-42c2-a186-994aa6931e27,0.13451,3.455675e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
8,9,9,2023-06-17T09:00:03,codecarbon,7a483787-7c64-4eaa-919d-e978600ea311,0.150213,3.967545e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0
9,10,10,2023-06-17T09:00:09,codecarbon,d738e59c-83c8-4ece-b0fc-fcc8dd0907dd,0.135736,3.511814e-07,3e-06,42.5,0.0,...,16.0,12th Gen Intel(R) Core(TM) i7-1260P,,,8.8516,51.8099,15.474876,process,N,0


#### Aggregating CodeCarbon Results

The carbon emission information of `CodeCarbon` can easily be aggregated via `pandas.DataFrame`.

In [11]:
carbon_emissions = experimenter.get_codecarbon_table().groupby(['project_name']).sum(numeric_only = True)
carbon_emissions

project_name,ID,experiment_id,duration_seconds,emissions_kg,emissions_rate_kg_sec,cpu_power_watt,gpu_power_watt,ram_power_watt,cpu_energy_kw,gpu_energy_kw,ram_energy_kw,energy_consumed_kw,cpu_count,ram_total_size,offline_mode
codecarbon,105,105,1.821685,5e-06,3.6e-05,595.0,0.0,0.890505,1.5e-05,0.0,2.255914e-08,1.5e-05,224.0,216.64827,0


#### Printing CodeCarbon Results as LaTeX Table

Furthermore, the resulting `pandas.DataFrame` can easily be printed as a LaTeX table.

In [15]:
print(carbon_emissions[['energy_consumed_kw', 'emissions_kg']].style.to_latex())

\begin{tabular}{lrr}
 & energy_consumed_kw & emissions_kg \\
project_name &  &  \\
codecarbon & 0.000015 & 0.000005 \\
\end{tabular}

