# Demo of nested cross validation with GCN Regressor

This notebook has exactly the same structure as `nested_cv_mlp.ipynb`; it showcases that the same functions and workflow also work with graph neural networks developed using PyTorch Geometric (PyG).

In [1]:
from sklearn.model_selection import KFold
from torch.utils.data import Subset

from cv import cross_validation, nested_gridsearch_cv
from data.dataset import ESOLDataset
from model.gcn import GCN
from train import train_model

## Load dataset
PyG provides a convenient interface to many widely used graph datasets. Here, ESOL, one subset of the MoleculeNet dataset, is used. It consists of 1128 molecules; the atom feature dimension is 9 and the target dimension is 1. For convenience, this dataset has already been downloaded to `ml_scripts/data`.

In [2]:
dataset = ESOLDataset()



## Demo 1 - train model

In [4]:
# use the first 80 % as training set and the remaining 20 % as test set
train_dataset = Subset(dataset, range(int(0.8 * len(dataset))))
test_dataset = Subset(dataset, range(int(0.8 * len(dataset)), len(dataset)))

# Initialize a model.
# You don't need to specify input_dim (it is auto-detected).
model = GCN(gcn_hidden_dim=60, n_gcn_layers=2, ffnn_hidden_dim=40, n_ffnn_layers=2)

# Train the model.
# The best average loss on the validation set is returned,
# and you can optionally save the best model.
# Hyperparameters such as learning rate, number of epochs, and loss function
# can be passed via `hparams`.
score = train_model(train_dataset, test_dataset, model, hparams={"lr": 1e-3, "num_epochs": 30}, save_model=False)

Epoch 1/30, Validation Loss: 0.2142
Epoch 2/30, Validation Loss: 0.1823
Epoch 3/30, Validation Loss: 0.1394
Epoch 4/30, Validation Loss: 0.0660
Epoch 5/30, Validation Loss: 0.0552
Epoch 6/30, Validation Loss: 0.0710
Epoch 7/30, Validation Loss: 0.0532
Epoch 8/30, Validation Loss: 0.0701
Epoch 9/30, Validation Loss: 0.0449
Epoch 10/30, Validation Loss: 0.0447
Epoch 11/30, Validation Loss: 0.0271
Epoch 12/30, Validation Loss: 0.0227
Epoch 13/30, Validation Loss: 0.0341
Epoch 14/30, Validation Loss: 0.0322
Epoch 15/30, Validation Loss: 0.0342
Epoch 16/30, Validation Loss: 0.0254
Epoch 17/30, Validation Loss: 0.0167
Epoch 18/30, Validation Loss: 0.0198
Epoch 19/30, Validation Loss: 0.0182
Epoch 20/30, Validation Loss: 0.0207
Epoch 21/30, Validation Loss: 0.0191
Epoch 22/30, Validation Loss: 0.0250
Epoch 23/30, Validation Loss: 0.0211
Epoch 24/30, Validation Loss: 0.0206
Epoch 25/30, Validation Loss: 0.0275
Epoch 26/30, Validation Loss: 0.0164
Epoch 27/30, Validation Loss: 0.0178
Epoch 28/3

## Demo 2 - cross validation
For simplicity, a random-split 5-fold cross validation is performed. The following
function should also support other types of cross validation, as long as `cv` has a
`split` method that generates training and validation data indices.

Note that, implementation-wise, `cross_validation` and `nested_gridsearch_cv` need no special adaptation to support GNNs.
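
To illustrate the point about `cv` only needing a `split` method, here is a minimal, dependency-free sketch. It is not the `cross_validation` function from this repository: `ContiguousKFold`, `toy_cross_validation`, and `mean_baseline_mse` are hypothetical names, and the "model" is just a mean predictor standing in for real training.

```python
from statistics import mean


class ContiguousKFold:
    """Hypothetical splitter: cuts indices 0..n-1 into k contiguous validation blocks."""

    def __init__(self, n_splits):
        self.n_splits = n_splits

    def split(self, X):
        n = len(X)
        indices = list(range(n))
        # Spread the remainder over the first folds so fold sizes differ by at most 1.
        fold_sizes = [
            n // self.n_splits + (1 if i < n % self.n_splits else 0)
            for i in range(self.n_splits)
        ]
        start = 0
        for size in fold_sizes:
            val_idx = indices[start:start + size]
            train_idx = indices[:start] + indices[start + size:]
            yield train_idx, val_idx
            start += size


def toy_cross_validation(data, cv, fit_and_score):
    """Call fit_and_score(train, val) once per fold and collect the scores."""
    scores = []
    for train_idx, val_idx in cv.split(data):
        train = [data[i] for i in train_idx]
        val = [data[i] for i in val_idx]
        scores.append(fit_and_score(train, val))
    return scores


def mean_baseline_mse(train, val):
    """Stand-in for model training: predict the training mean, return validation MSE."""
    prediction = mean(train)
    return mean((y - prediction) ** 2 for y in val)


targets = [float(i) for i in range(10)]
fold_scores = toy_cross_validation(targets, ContiguousKFold(n_splits=5), mean_baseline_mse)
print(len(fold_scores))  # one score per fold
```

Because the loop only relies on `cv.split(...)` yielding `(train_idx, val_idx)` pairs, any splitter following that convention could presumably be swapped in, including scikit-learn's `KFold` used below.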

In [5]:
# Define the 5 fold cross validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Conducting cross validation based on a given set of 
# model hyperparameters (`model_hparams`) and training
# hyperparameters (`train_hparams`).
# Returns a list of scores for each fold.
scores = cross_validation(
    dataset,
    model_class=GCN,
    cv=kfold,
    model_hparams={"n_gcn_layers": 2},
    train_hparams={"num_epochs": 10, "lr": 1e-3},
)

Epoch 1/30, Validation Loss: 0.2394
Epoch 2/30, Validation Loss: 0.1840
Epoch 3/30, Validation Loss: 0.1371
Epoch 4/30, Validation Loss: 0.0798
Epoch 5/30, Validation Loss: 0.0433
Epoch 6/30, Validation Loss: 0.0556
Epoch 7/30, Validation Loss: 0.0408
Epoch 8/30, Validation Loss: 0.0354
Epoch 9/30, Validation Loss: 0.0266
Epoch 10/30, Validation Loss: 0.0235
Epoch 11/30, Validation Loss: 0.0280
Epoch 12/30, Validation Loss: 0.0273
Epoch 13/30, Validation Loss: 0.0244
Epoch 14/30, Validation Loss: 0.0280
Epoch 15/30, Validation Loss: 0.0217
Epoch 16/30, Validation Loss: 0.0223
Epoch 17/30, Validation Loss: 0.0201
Epoch 18/30, Validation Loss: 0.0287
Epoch 19/30, Validation Loss: 0.0237
Epoch 20/30, Validation Loss: 0.0270
Epoch 21/30, Validation Loss: 0.0434
Epoch 22/30, Validation Loss: 0.0246
Epoch 23/30, Validation Loss: 0.0195
Epoch 24/30, Validation Loss: 0.0167
Epoch 25/30, Validation Loss: 0.0172
Epoch 26/30, Validation Loss: 0.0183
Epoch 27/30, Validation Loss: 0.0166
Epoch 28/3

In [6]:
# Display the score of each fold
print(scores)

[0.016554435006285135, 0.01330262317066699, 0.01414882820264428, 0.015248866611056858, 0.01784012344148424]
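
Per-fold scores are usually summarized as a mean and a spread. The snippet below is a small sketch using only the standard library; the list is copied from the output above, and reporting the sample standard deviation is one common choice rather than something this notebook prescribes.

```python
from statistics import mean, stdev

# Fold scores copied from the output of the previous cell.
scores = [
    0.016554435006285135,
    0.01330262317066699,
    0.01414882820264428,
    0.015248866611056858,
    0.01784012344148424,
]

mean_score = mean(scores)
std_score = stdev(scores)  # sample standard deviation (n - 1 in the denominator)
print(f"validation loss: {mean_score:.4f} +/- {std_score:.4f}")
```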


## Demo 3 - nested cross validation

This example demonstrates nested cross-validation. The inner cross validation
is a 5-fold cross validation, and the outer cross validation is a 3-fold cross validation.
Grid search is used to find the best hyperparameters in the inner CV:
the hyperparameters giving the lowest average validation loss are kept.
Once the best hyperparameters are found, the model is retrained and evaluated on the test data
of the outer CV. The `nested_gridsearch_cv` function chains these steps together and outputs the
scores, models, and best hyperparameters of each outer fold.
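
The procedure described above can be sketched in plain Python. This is an illustrative toy, not the repository's `nested_gridsearch_cv`: all function names are hypothetical, a single merged grid replaces the separate model and training grids, the folds are simple interleaved index splits, and a shrunken-mean predictor with a made-up `shrink` hyperparameter stands in for the GCN.

```python
from itertools import product
from statistics import mean


def expand_grid(grid):
    """Turn {"a": [1, 2], "b": [3]} into [{"a": 1, "b": 3}, {"a": 2, "b": 3}]."""
    if not grid:
        return [{}]
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx); fold i validates on every k-th index starting at i."""
    for fold in range(k):
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        yield train_idx, val_idx


def fit_and_score(train, val, hparams):
    """Toy stand-in for training: shrink the training mean, return validation MSE."""
    prediction = hparams["shrink"] * mean(train)
    return mean((y - prediction) ** 2 for y in val)


def toy_nested_gridsearch(data, grid, n_outer=3, n_inner=5):
    outer_scores, best_hparams = [], []
    for outer_train_idx, test_idx in k_fold_indices(len(data), n_outer):
        outer_train = [data[i] for i in outer_train_idx]
        test = [data[i] for i in test_idx]

        def inner_loss(hparams):
            # Inner CV: average validation loss of one hyperparameter combination.
            return mean(
                fit_and_score(
                    [outer_train[i] for i in tr],
                    [outer_train[i] for i in va],
                    hparams,
                )
                for tr, va in k_fold_indices(len(outer_train), n_inner)
            )

        # Keep the combination with the lowest average inner validation loss.
        best = min(expand_grid(grid), key=inner_loss)
        # Refit on the whole outer training split and score on the held-out test fold.
        outer_scores.append(fit_and_score(outer_train, test, best))
        best_hparams.append(best)
    return outer_scores, best_hparams


data = [float(i % 7) for i in range(30)]
scores, chosen = toy_nested_gridsearch(data, {"shrink": [0.0, 0.5, 1.0]})
print(len(scores), len(chosen))  # one score and one hyperparameter set per outer fold
```

The key property is that the outer test fold is never seen while hyperparameters are being selected, so the outer scores estimate the performance of the whole tuning procedure rather than of one fixed configuration.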




In [7]:
k_fold_inner = KFold(n_splits=5, shuffle=True, random_state=42)
k_fold_outer = KFold(n_splits=3, shuffle=True, random_state=42)

# Nested grid-search CV.
# model_hparams_grid and train_hparams_grid are dictionaries of hyperparameter grids.
# If you are not exploring model or training hyperparameters, you can simply assign None.
# This example explores the space of n_gcn_layers and num_epochs for the GCN model.
# As a toy example, num_epochs is kept very small to avoid a long running time.
scores, models, hparams = nested_gridsearch_cv(
    dataset,
    model_class=GCN,
    inner_cv=k_fold_inner,
    outer_cv=k_fold_outer,
    model_hparams_grid={"n_gcn_layers": [2, 3]},
    train_hparams_grid={"num_epochs": [2, 5, 10], "lr": [1e-3]},
)

Outer Fold 0
Current hyperparameters:
Model hyperparameters: {'n_gcn_layers': 2}
Training hyperparameters: {'num_epochs': 2, 'lr': 0.001}
Epoch 1/2, Validation Loss: 0.2289
Epoch 2/2, Validation Loss: 0.1928
Epoch 1/2, Validation Loss: 0.2493
Epoch 2/2, Validation Loss: 0.2067
Epoch 1/2, Validation Loss: 0.2225
Epoch 2/2, Validation Loss: 0.1825
Epoch 1/2, Validation Loss: 0.2655
Epoch 2/2, Validation Loss: 0.2337
Epoch 1/2, Validation Loss: 0.2925
Epoch 2/2, Validation Loss: 0.2528
Current hyperparameters:
Model hyperparameters: {'n_gcn_layers': 3}
Training hyperparameters: {'num_epochs': 2, 'lr': 0.001}
Epoch 1/2, Validation Loss: 0.2285
Epoch 2/2, Validation Loss: 0.2028
Epoch 1/2, Validation Loss: 0.2397
Epoch 2/2, Validation Loss: 0.2050
Epoch 1/2, Validation Loss: 0.2253
Epoch 2/2, Validation Loss: 0.2170
Epoch 1/2, Validation Loss: 0.2475
Epoch 2/2, Validation Loss: 0.2359
Epoch 1/2, Validation Loss: 0.2964
Epoch 2/2, Validation Loss: 0.2778
Current hyperparameters:
Model hyperp

In [8]:
print(scores)

[0.1941574330025531, 0.17764239108308832, 0.13882469116373264]
