# maxsmi
## Analysis of results

This notebook analyses the results of the simulations run on the Curta cluster.

## Prerequisites
This notebook requires that some simulations have already been run, e.g.
```
(maxsmi) $ python maxsmi/full_workflow.py --task ESOL --aug-strategy-train augmentation_with_duplication --aug-nb-train 10 --aug-nb-test 10
```

Have a look at the [README](https://github.com/t-kimber/maxsmi/blob/main/README.md) page for more details.
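
The analysis in this notebook loads one result per augmentation number (0, 10, ..., 100), so one simulation per number has to exist beforehand. As a minimal sketch, the commands could be generated as follows. `build_commands` is a hypothetical helper, not part of maxsmi; it only prints the commands and uses only the flags shown above. Whether every augmentation number is accepted by the script, and which further options (e.g. test strategy, model) are needed, should be checked in the README.

```python
# Hypothetical helper (not part of maxsmi): builds, but does not run,
# one command per augmentation number.
def build_commands(task, strategy, max_augmentation=100, step=10):
    commands = []
    for num_aug in range(0, max_augmentation + step, step):
        commands.append(
            f"python maxsmi/full_workflow.py --task {task} "
            f"--aug-strategy-train {strategy} "
            f"--aug-nb-train {num_aug} --aug-nb-test {num_aug}"
        )
    return commands


commands = build_commands("ESOL", "augmentation_with_duplication")
for command in commands:
    print(command)
```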

In [1]:
#!pip install flake8 pycodestyle_magic
%load_ext pycodestyle_magic
%pycodestyle_on

In [2]:
from maxsmi.utils_analysis import load_data, retrieve_metric
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np

In [3]:
def from_pkl_to_pd(
    task,
    augmentation_strategy_train,
    augmentation_strategy_test,
    ml_model,
    max_augmentation=100,
    string_encoding='smiles',
    single_value=0
):
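    """
    Load pickled simulation results and tabulate the test metrics.

    Parameters
    ----------
    task : str
        Name of the task, e.g. "ESOL".
    augmentation_strategy_train : str
        Name of the augmentation strategy used on the training set.
    augmentation_strategy_test : str
        Name of the augmentation strategy used on the test set.
    ml_model : str
        Name of the model, e.g. "CONV1D".
    max_augmentation : int, default 100
        Largest augmentation number to load, in steps of 10.
        Ignored for "augmentation_maximum_estimation",
        where only the result for 10 is loaded.
    string_encoding : str, default 'smiles'
        Accepted but not used in this function.
    single_value : int, default 0
        0: styled table with best and worst values highlighted,
        1: styled table with colour gradients,
        otherwise: print the best r2 and return the raw data frame.

    Returns
    -------
    pandas.DataFrame or pandas.io.formats.style.Styler
        r2, RMSE and training time per augmentation number.
    """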
    # Initialize the results table, one row per augmentation number
    df = pd.DataFrame(data=None,
                      index=list(range(0, max_augmentation + 10, 10)),
                      columns=['r2', 'RMSE', 'training time [min]'])
    df.index.name = f"{augmentation_strategy_test}"

    if (augmentation_strategy_train == "augmentation_maximum_estimation"):
        num_aug = 10
        df = pd.DataFrame(data=None,
                          index=[num_aug],
                          columns=['r2', 'RMSE', 'training time [min]'])
        df.index.name = f"{augmentation_strategy_test}"
        data = load_data(
            task,
            augmentation_strategy_train,
            num_aug,
            augmentation_strategy_test,
            num_aug,
            ml_model,)
        df.loc[num_aug, "r2"] = data.test[0][2]
        df.loc[num_aug, "RMSE"] = data.test[0][1]
        time = data.time_training[0]
        df.loc[num_aug, "training time [min]"] = time.seconds//60
    else:
        for num_aug in range(0, max_augmentation + 10, 10):
            data = load_data(
                task,
                augmentation_strategy_train,
                num_aug,
                augmentation_strategy_test,
                num_aug,
                ml_model,)
            df.loc[num_aug, "r2"] = data.test[0][2]
            df.loc[num_aug, "RMSE"] = data.test[0][1]
            time = data.time_training[0]
            df.loc[num_aug, "training time [min]"] = time.seconds//60
    # df.to_latex(buf=f"{task}_{augmentation_strategy_train}_{ml_model}.tex")
    df = df.apply(pd.to_numeric)
    if single_value == 0:
        df = df.style.\
            set_caption(f"Data: {task}, Model: {ml_model}").\
            format({'r2': "{:.3f}", 'RMSE': '{:.3f}'}).\
            highlight_max(subset=['r2']).\
            highlight_min(subset=['RMSE']).\
            highlight_min(subset=['r2'], color="lightblue").\
            highlight_max(subset=['RMSE'], color="lightblue")
        return df
    elif single_value == 1:
        df = df.style.\
            set_caption(f"Data: {task}, Model: {ml_model}").\
            format({'r2': "{:.3f}", 'RMSE': '{:.3f}'}).\
            background_gradient(cmap='Purples', subset=["r2"]).\
            background_gradient(cmap='Blues', subset=["RMSE"]).\
            highlight_max(subset=['r2'], color="green").\
            highlight_min(subset=['RMSE'], color="green").\
            highlight_min(subset=['r2'], color="red").\
            highlight_max(subset=['RMSE'], color="red")
        return df
    else:
        print(f"For {task} and {ml_model},\n"
              f"best value for {augmentation_strategy_test} is: \n"
              f"{df.r2.argmax()*10} with r2 value of {df.r2.max():.3f}")
        return df
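
The function above converts the training time to minutes with `time.seconds//60`. Assuming `data.time_training[0]` is a `datetime.timedelta` (the `.seconds` attribute suggests this, but it is not confirmed here), note that `.seconds` leaves out whole days. The sketch below uses a made-up duration to show the difference from `total_seconds()`; for runs shorter than one day both give the same result.

```python
from datetime import timedelta

# Made-up duration, longer than one day, to show the difference.
training_time = timedelta(days=1, hours=2, minutes=5)

# .seconds holds only the part below one day (here 2 h 5 min).
minutes_without_days = training_time.seconds // 60

# total_seconds() covers the whole duration, days included.
minutes_total = int(training_time.total_seconds() // 60)

print(minutes_without_days, minutes_total)
```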

In [4]:
df = from_pkl_to_pd(task="ESOL",
                    augmentation_strategy_train="augmentation_with"
                    "_duplication",
                    augmentation_strategy_test="augmentation_with_duplication",
                    ml_model="CONV1D",
                    max_augmentation=100,
                    string_encoding='smiles',
                    single_value=0)
df

| augmentation_with_duplication | r2 | RMSE | training time [min] |
|---|---|---|---|
| 0 | 0.836 | 0.849 | 1 |
| 10 | 0.898 | 0.669 | 12 |
| 20 | 0.914 | 0.616 | 25 |
| 30 | 0.920 | 0.592 | 45 |
| 40 | 0.920 | 0.595 | 61 |
| 50 | 0.923 | 0.583 | 76 |
| 60 | 0.922 | 0.584 | 97 |
| 70 | 0.924 | 0.577 | 109 |
| 80 | 0.925 | 0.573 | 131 |
| 90 | 0.919 | 0.598 | 140 |


In [5]:
df = from_pkl_to_pd(task="ESOL",
                    augmentation_strategy_train="augmentation_with"
                    "_duplication",
                    augmentation_strategy_test="augmentation_with_duplication",
                    ml_model="CONV1D",
                    max_augmentation=100,
                    string_encoding='smiles',
                    single_value=1)
df

| augmentation_with_duplication | r2 | RMSE | training time [min] |
|---|---|---|---|
| 0 | 0.836 | 0.849 | 1 |
| 10 | 0.898 | 0.669 | 12 |
| 20 | 0.914 | 0.616 | 25 |
| 30 | 0.920 | 0.592 | 45 |
| 40 | 0.920 | 0.595 | 61 |
| 50 | 0.923 | 0.583 | 76 |
| 60 | 0.922 | 0.584 | 97 |
| 70 | 0.924 | 0.577 | 109 |
| 80 | 0.925 | 0.573 | 131 |
| 90 | 0.919 | 0.598 | 140 |


In [6]:
df = from_pkl_to_pd(task="ESOL",
                    augmentation_strategy_train="augmentation_with"
                    "_duplication",
                    augmentation_strategy_test="augmentation_with_duplication",
                    ml_model="CONV1D",
                    max_augmentation=100,
                    string_encoding='smiles',
                    single_value=2)
df

For ESOL and CONV1D,
best value for augmentation_with_duplication is: 
80 with r2 value of 0.925


| augmentation_with_duplication | r2 | RMSE | training time [min] |
|---|---|---|---|
| 0 | 0.836114 | 0.848974 | 1 |
| 10 | 0.898368 | 0.668556 | 12 |
| 20 | 0.913707 | 0.616042 | 25 |
| 30 | 0.920183 | 0.592476 | 45 |
| 40 | 0.919547 | 0.594831 | 61 |
| 50 | 0.922622 | 0.583355 | 76 |
| 60 | 0.922357 | 0.584351 | 97 |
| 70 | 0.924268 | 0.577117 | 109 |
| 80 | 0.925435 | 0.572651 | 131 |
| 90 | 0.918774 | 0.597684 | 140 |


In [7]:
df = from_pkl_to_pd(task="ESOL_small",
                    augmentation_strategy_train="augmentation_"
                    "maximum_estimation",
                    augmentation_strategy_test="augmentation_"
                    "maximum_estimation",
                    ml_model="CONV1D",
                    max_augmentation=100,
                    string_encoding='smiles',
                    single_value=2)
df

For ESOL_small and CONV1D,
best value for augmentation_maximum_estimation is: 
0 with r2 value of 0.910


| augmentation_maximum_estimation | r2 | RMSE | training time [min] |
|---|---|---|---|
| 10 | 0.910198 | 0.576299 | 851 |
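
The message printed for this cell reports 0 as the best value, although the only row in the table is labelled 10. A likely explanation is that `argmax()` returns the position of the maximum, and multiplying it by 10 only matches the index label when the index is 0, 10, 20, ... The sketch below reproduces this on a single-row table with the r2 value shown above; `df_single` is an illustrative name, and `idxmax()` is shown as one possible way to get the label directly.

```python
import pandas as pd

# Single-row table using the r2 value from the table above.
df_single = pd.DataFrame({"r2": [0.910198]}, index=[10])

# argmax() gives the position of the maximum within the column.
position = df_single.r2.argmax()

# idxmax() gives the index label of the maximum.
label = df_single.r2.idxmax()

print(position * 10, label)
```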


In [8]:
def best_augmentation_strategy(task, model):
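    """
    Print the augmentation strategy with the highest test r2.

    Parameters
    ----------
    task : str
        Name of the task, e.g. "ESOL".
    model : str
        Name of the model, e.g. "CONV1D".

    Returns
    -------
    None
        The best strategy, its augmentation number and its r2
        are printed.
    """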
    best_values = {}
    for augmentation_strategy in ["augmentation_with_duplication",
                                  "augmentation_without_duplication",
                                  "augmentation_with_reduced_duplication"]:
        df = from_pkl_to_pd(task,
                            augmentation_strategy,
                            augmentation_strategy,
                            model,
                            single_value=2)
        # display(df)
        best_values[augmentation_strategy] = [df.r2.max(),
                                              df.r2.argmax()*10]

    print(f"\n\nFor {task} and model {model}:\n"
          # f"best r2 scores for each augmentation strategies are:\n"
          # f"{best_values}\n"
          f"OVERALL best augmentation strategies is:\t"
          f"{max(best_values, key=best_values.get)}, "
          f"{best_values[max(best_values, key=best_values.get)][1]}\n"
          f"{max([elem[0] for elem in best_values.values()]):.3f}\n\n\n")
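
The selection in `best_augmentation_strategy` relies on `max` with `key=best_values.get`, where each value is a list `[r2, augmentation number]`. As a small sketch of how that comparison behaves, the example below uses the rounded ESOL / CONV1D values reported in this notebook; the variable names `best_strategy`, `best_r2` and `best_num_aug` are illustrative.

```python
# [best r2, augmentation number] per strategy (rounded values).
best_values = {
    "augmentation_with_duplication": [0.925, 80],
    "augmentation_without_duplication": [0.924, 100],
    "augmentation_with_reduced_duplication": [0.926, 70],
}

# Lists compare element by element, so r2 decides and the
# augmentation number only breaks ties between equal r2 values.
best_strategy = max(best_values, key=best_values.get)
best_r2, best_num_aug = best_values[best_strategy]

print(best_strategy, best_num_aug, best_r2)
```

Because ties on r2 are broken by the larger augmentation number, a strategy that needs more augmentation would win a tie; whether that is the desired behaviour is a design choice.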

In [9]:
for task in ["ESOL", "free_solv"]:
    for model in ["CONV1D", "CONV2D", "RNN"]:
        best_augmentation_strategy(task, model)

For ESOL and CONV1D,
best value for augmentation_with_duplication is: 
80 with r2 value of 0.925
For ESOL and CONV1D,
best value for augmentation_without_duplication is: 
100 with r2 value of 0.924
For ESOL and CONV1D,
best value for augmentation_with_reduced_duplication is: 
70 with r2 value of 0.926


For ESOL and model CONV1D:
OVERALL best augmentation strategies is:	augmentation_with_reduced_duplication, 70
0.926



For ESOL and CONV2D,
best value for augmentation_with_duplication is: 
20 with r2 value of 0.906
For ESOL and CONV2D,
best value for augmentation_without_duplication is: 
80 with r2 value of 0.905
For ESOL and CONV2D,
best value for augmentation_with_reduced_duplication is: 
80 with r2 value of 0.904


For ESOL and model CONV2D:
OVERALL best augmentation strategies is:	augmentation_with_duplication, 20
0.906



For ESOL and RNN,
best value for augmentation_with_duplication is: 
70 with r2 value of 0.921
For ESOL and RNN,
best value for augmentation_without_duplication i