# maxsmi
## Analysis of results

This notebook analyses the results of the simulations run on the Curta cluster.

## Prerequisites
This notebook requires that some simulations have already been run, e.g.
```
(maxsmi) $ python maxsmi/full_workflow.py --task ESOL --aug-strategy-train augmentation_with_duplication --aug-nb-train 10 --aug-nb-test 10

```

Have a look at the [README](https://github.com/t-kimber/maxsmi/blob/main/README.md) page for more details.
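
Since the analysis below covers a grid of augmentation numbers, one simulation per grid value is needed. The following is a minimal sketch, not part of maxsmi, that only builds and prints the corresponding command lines. It assumes the same flags as the command above; `task`, `strategy` and `augmentation_numbers` are illustrative names and values.

```python
import shlex

# Illustrative values (assumptions): adapt them to the runs you need.
task = "ESOL"
strategy = "augmentation_with_duplication"
augmentation_numbers = [1, 2, 3]  # small subset of the grid

commands = []
for number in augmentation_numbers:
    command = [
        "python", "maxsmi/full_workflow.py",
        "--task", task,
        "--aug-strategy-train", strategy,
        "--aug-nb-train", str(number),
        "--aug-nb-test", str(number),
    ]
    # Join the arguments into a single shell-safe string.
    commands.append(shlex.join(command))

# The commands are only printed here, not executed.
for line in commands:
    print(line)
```

Each printed line can then be run in the `maxsmi` environment, one simulation at a time.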

In [None]:
#  !pip install flake8 pycodestyle_magic
%load_ext pycodestyle_magic
%pycodestyle_on

In [2]:
from maxsmi.utils_analysis import retrieve_metric
import pandas as pd
from IPython.display import display
import numpy as np
import dataframe_image as dfi

In [3]:
# To show the full pandas data frame with the full grid
pd.set_option("display.max_rows", None, "display.max_columns", None)

## Dataset
We consider the following datasets:

- ESOL
- free_solv
- lipophilicity

In [4]:
# TASK = "lipophilicity"
TASK = "ESOL"
# TASK = "free_solv"

### Grid: augmentation number

The models were run on a fine augmentation grid, from 1 to 20 with a step of 1, and on a coarser grid, from 20 to 100 with a step of 10. The value 0 is also included in the grid for the runs without augmentation.

In [5]:
# Fine grid: 0 (no augmentation), then 1 to 20 with a step of 1
fine_grid = list(range(0, 21, 1))
coarse_grid = list(range(10, 110, 10))

# Coarse part of the full grid: 30 to 100 with a step of 10
temp_grid = list(range(30, 110, 10))
full_grid = fine_grid + temp_grid

In [6]:
def array_by_strategy(augmentation_strategy,
                      task="ESOL",
                      set_="test",
                      metric="rmse",
                      grid=full_grid):
    """
    Returns an array of a given metric for a given task.

    Parameters
    ----------
    augmentation_strategy : str
        The augmentation strategy to consider,
        e.g. `augmentation_with_reduced_duplication`.
    task : str, default `ESOL`
        The dataset to consider: ESOL, free_solv or lipophilicity.
    set_ : str, default `test`
        The evaluation set, train or test.
    metric : str, default `rmse`
        The performance metric to consider,
        such as the root mean squared error (rmse).
    grid : list, default `full_grid`
        The augmentation numbers to retrieve.

    Returns
    -------
    np.array
        Numeric values for `metric`, with one row per augmentation
        number in `grid` and one column per model
        (CONV1D, CONV2D, RNN). Missing results are set to NaN.
    """

    models = ["CONV1D", "CONV2D", "RNN"]

    result_array = np.zeros((len(grid), len(models)))

    if augmentation_strategy == "augmentation_maximum_estimation":
        if task == "ESOL":
            # The estimated maximum strategy was only evaluated on
            # a subset of ESOL because it is time-intensive.
            task = "ESOL_SMALL"

    for i, model in enumerate(models):
        for j, augmentation_num in enumerate(grid):
            try:
                y = retrieve_metric(
                    metric,
                    set_,
                    task,
                    augmentation_strategy,
                    augmentation_num,
                    augmentation_strategy,
                    augmentation_num,
                    model,
                )
            except FileNotFoundError:
                y = np.nan
            result_array[j, i] = y
    return result_array
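
The nested loop above fills a (grid × model) array and falls back to NaN whenever a result file is missing. Below is a minimal, self-contained sketch of that pattern. `fake_retrieve` is a hypothetical stand-in for `retrieve_metric` (it is not part of maxsmi) and the returned values are made up.

```python
import numpy as np


def fake_retrieve(augmentation_num, model):
    """Hypothetical stand-in for `retrieve_metric` (illustration only)."""
    # Pretend that only augmentation numbers 1 and 2 were simulated.
    if augmentation_num not in (1, 2):
        raise FileNotFoundError
    return 1.0 / augmentation_num


toy_grid = [1, 2, 3]
toy_models = ["CONV1D", "RNN"]
toy_array = np.zeros((len(toy_grid), len(toy_models)))

for i, model in enumerate(toy_models):
    for j, augmentation_num in enumerate(toy_grid):
        try:
            value = fake_retrieve(augmentation_num, model)
        except FileNotFoundError:
            value = np.nan  # missing run: keep a NaN placeholder
        toy_array[j, i] = value

print(toy_array)
```

Rows follow the grid and columns follow the models, so a row of NaN values marks an augmentation number for which no run was found.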

## Retrieve performance by augmentation strategy

In [7]:
results_without_dupl = array_by_strategy("augmentation_without_duplication",
                                         task=TASK)
results_with_dupl = array_by_strategy("augmentation_with_duplication",
                                      task=TASK)
results_with_red_dupl = array_by_strategy(
    "augmentation_with_reduced_duplication",
    task=TASK)
results_max_est = array_by_strategy("augmentation_maximum_estimation",
                                    task=TASK)
results_no_aug = array_by_strategy("no_augmentation",
                                   task=TASK)

_Note:_

The augmentation strategy `estimated maximum` was not evaluated on the lipophilicity dataset.

In [8]:
# Each label is paired with its result array, so that the column
# names always follow the order in which the arrays are concatenated.
strategies = [
    ('without duplication', results_without_dupl),
    ('with duplication', results_with_dupl),
    ('with reduced duplication', results_with_red_dupl),
    ('no augmentation', results_no_aug),
]

# The estimated maximum strategy was not evaluated on lipophilicity.
if TASK != "lipophilicity":
    strategies.append(('estimated maximum', results_max_est))

models = ["CONV1D", "CONV2D", "RNN"]
column_name = [(label, model)
               for label, _ in strategies
               for model in models]
full_results = np.concatenate([array for _, array in strategies],
                              axis=1)

## Table

Generate the pandas dataframe of results, with two-level column labels (strategy, model).

In [9]:
dataframe = pd.DataFrame(full_results,
                         columns=column_name,
                         index=full_grid)

In [10]:
dataframe.columns = pd.MultiIndex.from_tuples(dataframe.columns,
                                              names=['Strategy', 'Model'])
dataframe

Strategy,without duplication,without duplication,without duplication,with duplication,with duplication,with duplication,with reduced duplication,with reduced duplication,with reduced duplication,no augmentation,no augmentation,no augmentation,estimated maximum,estimated maximum,estimated maximum
Model,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN
0,,,,,,,,,,0.839284,0.895284,0.930156,,,
1,0.963811,1.00873,1.015847,0.974946,0.978425,1.019943,0.963551,0.986388,1.022108,,,,,,
2,0.784799,0.787359,0.963559,0.786002,0.768473,0.935091,0.785418,0.768814,0.943226,,,,,,
3,0.785119,0.726199,0.896327,0.784462,0.961557,0.898622,0.784557,0.72809,0.805387,,,,,,
4,0.732047,0.760507,0.881402,0.718319,0.761477,0.847465,0.738864,0.773636,0.817119,,,,,,
5,0.71614,0.74809,0.790791,0.715752,0.723403,0.788511,0.712385,0.729791,0.792996,,,,,,
6,0.666425,0.742724,0.788418,0.678882,0.69539,0.770905,0.673136,0.714351,0.760015,,,,,,
7,0.660437,0.675508,0.772992,0.667486,0.669962,0.774945,0.657508,0.683214,0.763586,,,,,,
8,0.712453,0.692367,0.742918,0.666258,0.699808,0.744323,0.672433,0.721333,0.732899,,,,,,
9,0.641594,0.761,0.727134,0.642232,0.654711,0.717657,0.640726,0.671181,0.714916,,,,,,
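
With a two-level `MultiIndex` on the columns, the results can be sliced by strategy or by model. A small sketch with made-up numbers follows. `toy_df` and its values are illustrative and are not simulation results.

```python
import numpy as np
import pandas as pd

toy_columns = pd.MultiIndex.from_tuples(
    [("with duplication", "CONV1D"), ("with duplication", "RNN"),
     ("no augmentation", "CONV1D"), ("no augmentation", "RNN")],
    names=["Strategy", "Model"],
)
toy_df = pd.DataFrame(
    [[np.nan, np.nan, 0.9, 1.0],
     [0.8, 0.7, np.nan, np.nan]],
    columns=toy_columns,
    index=[0, 1],
)

# All models for one strategy (first level of the column index).
one_strategy = toy_df["with duplication"]

# One model across all strategies (second level of the column index).
one_model = toy_df.xs("RNN", axis=1, level="Model")

print(one_strategy)
print(one_model)
```

The same two calls should apply to `dataframe` once its columns have been converted with `pd.MultiIndex.from_tuples`.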


## Cosmetics

We beautify the dataframe for readability.

In [11]:
def color_nan_white(val):
    """Color the nan text white"""
    if np.isnan(val):
        return 'color: white'

In [12]:
def color_nan_white_background(val):
    """Color the nan cell background white"""
    if np.isnan(val):
        return 'background-color: white'

In [13]:
df_styler = (
    dataframe.style
    .set_caption(f"Data: {TASK}")
    .format("{:.3f}")
    .background_gradient(cmap='Purples', axis=None)
    .applymap(color_nan_white)
    .applymap(color_nan_white_background)
    .highlight_min(color="yellow", axis=None)
)

df_styler

Strategy,without duplication,without duplication,without duplication,with duplication,with duplication,with duplication,with reduced duplication,with reduced duplication,with reduced duplication,no augmentation,no augmentation,no augmentation,estimated maximum,estimated maximum,estimated maximum
Model,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN,CONV1D,CONV2D,RNN
0,,,,,,,,,,0.839,0.895,0.93,,,
1,0.964,1.009,1.016,0.975,0.978,1.02,0.964,0.986,1.022,,,,,,
2,0.785,0.787,0.964,0.786,0.768,0.935,0.785,0.769,0.943,,,,,,
3,0.785,0.726,0.896,0.784,0.962,0.899,0.785,0.728,0.805,,,,,,
4,0.732,0.761,0.881,0.718,0.761,0.847,0.739,0.774,0.817,,,,,,
5,0.716,0.748,0.791,0.716,0.723,0.789,0.712,0.73,0.793,,,,,,
6,0.666,0.743,0.788,0.679,0.695,0.771,0.673,0.714,0.76,,,,,,
7,0.66,0.676,0.773,0.667,0.67,0.775,0.658,0.683,0.764,,,,,,
8,0.712,0.692,0.743,0.666,0.7,0.744,0.672,0.721,0.733,,,,,,
9,0.642,0.761,0.727,0.642,0.655,0.718,0.641,0.671,0.715,,,,,,
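
`highlight_min` marks the best (lowest) value visually. The same entry can also be located programmatically. The sketch below uses a toy dataframe with made-up values and ignores NaN entries through `np.nanargmin`. The names `toy_df`, `best_augmentation`, `best_strategy` and `best_model` are illustrative.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    [[np.nan, 0.9],
     [0.8, 0.7],
     [0.75, np.nan]],
    columns=pd.MultiIndex.from_tuples(
        [("with duplication", "CONV1D"), ("with duplication", "RNN")],
        names=["Strategy", "Model"]),
    index=[0, 1, 2],
)

values = toy_df.to_numpy()

# Position of the smallest non-NaN entry in the flattened array,
# converted back to (row, column) coordinates.
row, col = np.unravel_index(np.nanargmin(values), values.shape)

best_augmentation = toy_df.index[row]
best_strategy, best_model = toy_df.columns[col]
best_value = values[row, col]

print(best_augmentation, best_strategy, best_model, best_value)
```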


In [14]:
dfi.export(df_styler, f"dataframe_{TASK}.png")