Now that there are multiple baseline models stored in `models/`, we can proceed to model selection. After the first milestone there will be more models, especially neural networks.  

In this notebook I'm going to:
1. Evaluate models
2. Generate a leaderboard

This notebook was run on a cluster because two of the models kept crashing my MacBook Pro. There is a lot of room for improvement in how `skorch` manages resources.

In [1]:
import os
import yaml
import pickle
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
%load_ext autoreload
%autoreload 2
import ivpk



Load models and settings.

Instead of parsing model file names, I've manually registered the models in a yaml file.

In [3]:
with open("models/model_registration.yaml", "r") as f:
    all_models = yaml.safe_load(f)

There are two types of model objects: plain sklearn estimators and `GridSearchCV` objects. The `GridSearchCV` objects were refit on train+val data.

Note: for `GridSearchCV` models, the train / val metrics should not be taken from the refit version, because the refit model has already seen the validation data. Use `model.best_score_` for MAE_cv instead.
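To make the note concrete, here is a minimal sketch on synthetic data. It assumes the searches were scored with `scoring="neg_mean_absolute_error"` (an assumption, the actual setting lives in the training code); in that case `best_score_` is a negated MAE averaged over the CV folds and has to be sign-flipped.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.01, 0.1, 1.0]},
    scoring="neg_mean_absolute_error",  # assumed scorer: higher is better, so MAE is negated
    cv=3,
    refit=True,  # best estimator is refit on all data passed to fit()
)
search.fit(X, y)

# Cross-validated MAE of the best parameter setting (held-out folds).
mae_cv = -search.best_score_

# MAE of the refit model on the same data it was fit on (optimistic).
mae_refit = float(np.mean(np.abs(search.predict(X) - y)))
print(f"MAE_cv={mae_cv:.3f}  MAE_refit={mae_refit:.3f}")
```

The refit number is computed on data the model has already seen, which is why only `MAE_cv` is a fair validation estimate for the grid-searched models.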

## VDss

In [6]:
target = "VDss"
eval_vdss = ivpk.evalutaion.eval_registered(target, test=True)

100%|██████████| 7/7 [00:06<00:00,  1.12it/s]


In [7]:
df_vdss = pd.DataFrame([v.evaluation for v in eval_vdss.values()], index=eval_vdss.keys())
df_vdss

| model | MAE_train | Pearsonr_train | MAE_val | Pearsonr_val | MAE_test | Pearsonr_test | MAE_cv |
| --- | --- | --- | --- | --- | --- | --- | --- |
| lasso02_morgan256 | 1.054101 | 0.754406 | 1.129897 | 0.74752 | 1.485563 | 0.5478 | |
| rfreg_morgan256 | 0.398827 | 0.97401 | 1.074936 | 0.762647 | 1.431827 | 0.520605 | |
| lasso_gridsearch | 1.073391 | 0.740603 | 1.056781 | 0.78825 | 1.459111 | 0.568618 | 1.163534 |
| rfreg_gridsearch | 0.384881 | 0.97904 | 0.386894 | 0.981479 | 1.380789 | 0.574837 | 1.06873 |
| mlp_gridsearch | 0.616318 | 0.931349 | 0.603687 | 0.938498 | 1.56469 | 0.512947 | 1.167947 |
| mlp_morgan256 | 0.901812 | 0.827388 | 0.925742 | 0.831301 | 1.483468 | 0.538599 | |
| mlp_morgan2048 | 0.883792 | 0.836584 | 0.915545 | 0.842677 | 1.397092 | 0.58004 | |
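For reference, the two metrics in the leaderboard columns can be sketched with plain NumPy. This is an illustration of the definitions only; how `ivpk` computes them internally is not shown here, and the helper names `mae` and `pearson_r` are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Toy values, not taken from the dataset.
y_true = [0.5, 1.0, 2.0, 4.0]
y_pred = [1.0, 1.0, 1.5, 3.0]
print(mae(y_true, y_pred), pearson_r(y_true, y_pred))
```

MAE measures the size of the errors, while Pearson r only measures how well predictions track the ordering and spread of the targets, so the two columns can rank models differently.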


The best model so far is rfreg_gridsearch (MAE_test 1.381). Note that the MLP regressor on morgan 2048 (MAE_test 1.397, Pearsonr_test 0.580) performed comparably to the random forest regressor on morgan 256 (MAE_test 1.432, Pearsonr_test 0.521). For ease of computation I'll submit predictions from rfreg_gridsearch, but it would be really interesting to examine mlp_morgan2048.
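The ranking can be reproduced directly from the test columns of the VDss leaderboard. The sketch below retypes those two columns by hand so that it runs on its own; in the notebook the same calls would apply to `df_vdss`.

```python
import pandas as pd

# Test-set columns copied from the VDss leaderboard.
vdss_test = pd.DataFrame(
    {
        "MAE_test": [1.485563, 1.431827, 1.459111, 1.380789,
                     1.56469, 1.483468, 1.397092],
        "Pearsonr_test": [0.5478, 0.520605, 0.568618, 0.574837,
                          0.512947, 0.538599, 0.58004],
    },
    index=["lasso02_morgan256", "rfreg_morgan256", "lasso_gridsearch",
           "rfreg_gridsearch", "mlp_gridsearch", "mlp_morgan256",
           "mlp_morgan2048"],
)

# Lower MAE is better, higher Pearson r is better.
ranked = vdss_test.sort_values("MAE_test")
best_by_mae = ranked.index[0]
best_by_r = vdss_test["Pearsonr_test"].idxmax()
print(ranked)
print(best_by_mae, best_by_r)
```

The two criteria disagree on the top spot: rfreg_gridsearch wins on MAE_test, while mlp_morgan2048 has the highest Pearsonr_test.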

A bonus task for model interpretation later: consider highlighting high-importance fingerprint bits on the structures.
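A first step toward that bonus task could look like the sketch below. It uses a synthetic bit matrix with one planted informative bit, not the real fingerprints, and ranks bits by the random forest's `feature_importances_`. All sizes and the planted bit index are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic "fingerprints": 300 molecules x 32 bits (illustrative sizes).
rng = np.random.default_rng(0)
n_bits = 32
X = rng.integers(0, 2, size=(300, n_bits))

# Planted signal: only bit 7 drives the target, the rest is noise.
y = 3.0 * X[:, 7] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Indices of the five most important bits, most important first.
top_bits = np.argsort(forest.feature_importances_)[::-1][:5]
print(top_bits)
```

Mapping a bit index back to atoms needs the bit-to-substructure record from the fingerprint generator. If the fingerprints were built with RDKit, its Morgan fingerprint functions can record this through the `bitInfo` argument; that step is not shown here.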

Now let's save the leaderboard.

In [10]:
df_vdss.to_csv("doc/VDss_leaderboard.csv")
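One thing to watch when reading the leaderboard back: the model names are stored as an unnamed index, so a plain `pd.read_csv` turns them into a column called `Unnamed: 0`. A small sketch with an in-memory buffer standing in for the CSV file:

```python
import io
import pandas as pd

# Two rows copied from the VDss leaderboard, model names as the index.
board = pd.DataFrame(
    {"MAE_test": [1.380789, 1.397092]},
    index=["rfreg_gridsearch", "mlp_morgan2048"],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
board.to_csv(buffer)

# Default read: the index comes back as a column named "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores the model names as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(naive.columns.tolist())
print(restored.index.tolist())
```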

## CL

In [8]:
target = "CL"
eval_cl = ivpk.evalutaion.eval_registered(target, test=True)

100%|██████████| 4/4 [00:03<00:00,  1.01it/s]


In [9]:
df_cl = pd.DataFrame([v.evaluation for v in eval_cl.values()], index=eval_cl.keys())
df_cl

| model | MAE_train | Pearsonr_train | MAE_val | Pearsonr_val | MAE_test | Pearsonr_test | MAE_cv |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rfreg_morgan256 | 0.516649 | 0.973467 | 1.394187 | 0.457802 | 1.544839 | 0.2333 | |
| rfreg_gridsearch | 0.505391 | 0.978585 | 0.504125 | 0.974885 | 1.51473 | 0.271944 | 1.389895 |
| mlp_gridsearch | 1.383372 | 0.590336 | 1.371065 | 0.561811 | 1.50209 | 0.276183 | 1.517285 |
| mlp_morgan256 | 1.402283 | 0.564091 | 1.380383 | 0.539837 | 1.489376 | 0.289302 | |


The best one is mlp_morgan256 (MAE_test 1.489). This model still suffers from high bias: its MAE_train (1.402) is almost as large as its test error. This suggests that properties + fingerprint might not be a good feature set for predicting CL. Since I didn't run a `GridSearchCV` for MLP on morgan2048, 

In [11]:
df_cl.to_csv("doc/CL_leaderboard.csv")
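The high-bias reading of the CL leaderboard can be checked by comparing train and test error. The sketch below uses only the two non-grid-searched models, since their train metrics are not affected by the train+val refit; the values are copied from the CL table.

```python
import pandas as pd

# Train / test MAE copied from the CL leaderboard.
cl = pd.DataFrame(
    {"MAE_train": [0.516649, 1.402283], "MAE_test": [1.544839, 1.489376]},
    index=["rfreg_morgan256", "mlp_morgan256"],
)

# Small gap with high train error points to bias (underfitting);
# large gap with low train error points to variance (overfitting).
cl["gap"] = cl["MAE_test"] - cl["MAE_train"]
print(cl)
```

The MLP barely fits the training set and generalizes about as well as it fits, while the random forest fits the training set closely and loses most of that on the test set. Both end up at a similar MAE_test.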

Later we can use an [online table converter](https://tableconvert.com/) to convert the leaderboards into markdown for the README.
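As an alternative to an online tool, the conversion can be sketched in a few lines of pandas. The helper name `to_markdown_table` and the `model` column label are hypothetical. pandas also ships `DataFrame.to_markdown`, but that relies on the optional `tabulate` package, so this version avoids the dependency.

```python
import pandas as pd

def to_markdown_table(df, index_label="model", digits=3):
    """Render a numeric DataFrame as a pipe table; NaN becomes an empty cell."""
    header = [index_label] + [str(c) for c in df.columns]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * len(header)) + " |",
    ]
    for name, row in df.iterrows():
        cells = ["" if pd.isna(v) else f"{v:.{digits}f}" for v in row]
        lines.append("| " + " | ".join([str(name)] + cells) + " |")
    return "\n".join(lines)

# Two rows copied from the CL leaderboard.
board = pd.DataFrame(
    {"MAE_test": [1.489376, 1.50209], "MAE_cv": [float("nan"), 1.517285]},
    index=["mlp_morgan256", "mlp_gridsearch"],
)
print(to_markdown_table(board))
```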