# GNN Training Demo

This notebook gives a short demonstration of how to train a simple GNN model on a dataset loaded with the ``chem_mat_data`` package.

In [1]:
import os
from typing import List

from rich.pretty import pprint

## Loading Graph Dataset

Given the string identifier of a dataset, the ``load_graph_dataset`` function loads that dataset from the remote file share server and returns it as a list of graph dictionary objects. The cell below prints the structure of one such graph dict.

In [2]:
from chem_mat_data import load_graph_dataset

graphs: List[dict] = load_graph_dataset('aqsoldb')
example_graph: dict = graphs[0]

print(f'Number of graphs in dataset: {len(graphs)}')
pprint(example_graph)

Number of graphs in dataset: 9889
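
The printed graph dict is not reproduced in this output. As a rough guide, the following sketch builds a toy graph dict by hand and summarizes the shapes of its arrays. The keys ``node_attributes`` and ``graph_labels`` are the ones this notebook relies on later; the keys ``edge_indices`` and ``edge_attributes``, the helper name ``describe_graph`` and all numeric values are assumptions made for illustration, so compare them against the actual output of ``pprint(example_graph)``.

```python
import numpy as np

# Toy stand-in for one graph dict (3 nodes, 2 edges). All values are made up.
toy_graph = {
    'node_attributes': np.zeros((3, 4), dtype=float),  # (num_nodes, num_node_features)
    'edge_indices': np.array([[0, 1], [1, 2]]),        # assumed key: (num_edges, 2)
    'edge_attributes': np.ones((2, 1), dtype=float),   # assumed key: (num_edges, num_edge_features)
    'graph_labels': np.array([-3.12]),                 # (num_targets,)
}


def describe_graph(graph: dict) -> dict:
    """Map every array-valued entry of a graph dict to its shape (hypothetical helper)."""
    return {key: value.shape for key, value in graph.items() if hasattr(value, 'shape')}


shapes = describe_graph(toy_graph)
print(shapes)
```

The same helper can be called on ``example_graph`` to see which arrays a real graph dict carries before deciding on model dimensions.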


In [3]:
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch import Tensor
from torch_geometric.nn.models import GIN
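from torch_geometric.data import Data
# DataLoader lives in torch_geometric.loader since PyG 2.0; the old
# torch_geometric.data.DataLoader import path is deprecated.
from torch_geometric.loader import DataLoader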
from torch_geometric.nn.aggr import SumAggregation

class GINModel(pl.LightningModule):
    
    def __init__(self, input_dim, hidden_dim=64, output_dim=1, num_layers=3):
        super().__init__()
        self.encoder = GIN(input_dim, hidden_channels=hidden_dim, out_channels=hidden_dim, num_layers=num_layers)
        self.global_pool = SumAggregation()
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x, edge_index, batch):
        x = self.encoder(x, edge_index)
        x = self.global_pool(x, batch)
        return self.predictor(x)

    def training_step(self, batch, batch_idx):
        x, edge_index, batch, y = batch.x, batch.edge_index, batch.batch, batch.y
        y_hat = self(x, edge_index, batch)
        loss = torch.nn.functional.mse_loss(y_hat, y.unsqueeze(-1))
        self.log('train_loss', loss, prog_bar=True, on_epoch=True, batch_size=y_hat.size(0))
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.001)
        return optimizer


# Infer the model dimensions from the example graph loaded above:
# the number of node features and the number of regression targets.
input_dim = example_graph['node_attributes'].shape[1]
output_dim = example_graph['graph_labels'].shape[0]
model = GINModel(input_dim=input_dim, output_dim=output_dim)
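
The ``global_pool`` step in the model above turns per-node embeddings into one embedding per graph by summing all nodes that share the same entry in the ``batch`` vector. The following pure-Python sketch reproduces that idea on plain lists; the function name ``sum_pool`` and the toy numbers are illustrative assumptions, not part of PyG or ``chem_mat_data``.

```python
from typing import List


def sum_pool(node_embeddings: List[List[float]], batch: List[int]) -> List[List[float]]:
    """Sum node embeddings per graph; batch[i] is the graph index of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    pooled = [[0.0] * dim for _ in range(num_graphs)]
    for embedding, graph_idx in zip(node_embeddings, batch):
        for d, value in enumerate(embedding):
            pooled[graph_idx][d] += value
    return pooled


# Three nodes: the first two belong to graph 0, the last one to graph 1.
node_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batch = [0, 0, 1]

pooled = sum_pool(node_embeddings, batch)
print(pooled)  # [[4.0, 6.0], [5.0, 6.0]]
```

Because the sum is taken over nodes, the pooled vector does not depend on the order of the nodes within a graph, which is why the predictor can work on molecules of different sizes.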

## Converting Graphs to Data Objects

To train a PyTorch Geometric (PyG) model, the data has to be converted into PyG ``Data`` objects. For this purpose, the package offers the converter function ``pyg_data_list_from_graphs``, which converts a list of graph dictionary objects, as loaded from the database, into a list of PyG-compatible data objects.

In [4]:
from chem_mat_data import pyg_data_list_from_graphs

# Create a PyTorch Geometric DataLoader
data_list: List[Data] = pyg_data_list_from_graphs(graphs)
train_loader = DataLoader(data_list, batch_size=32, shuffle=True)
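
The ``DataLoader`` merges several graphs into one large disconnected graph per mini-batch: node features are concatenated, edge indices are shifted by the number of nodes that came before, and a ``batch`` vector records which graph each node belongs to. The sketch below imitates this with plain Python lists. The function name ``collate`` and the dictionary keys ``x`` and ``edges`` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List


def collate(graphs: List[Dict]) -> Dict:
    """Merge small graphs into one disconnected graph, as a mini-batch would."""
    x, edges, batch = [], [], []
    offset = 0
    for graph_idx, graph in enumerate(graphs):
        num_nodes = len(graph['x'])
        x.extend(graph['x'])
        # Shift node indices so they stay unique within the merged graph.
        edges.extend((src + offset, dst + offset) for src, dst in graph['edges'])
        # Remember the source graph of every node.
        batch.extend([graph_idx] * num_nodes)
        offset += num_nodes
    return {'x': x, 'edges': edges, 'batch': batch}


g1 = {'x': [[1.0], [2.0]], 'edges': [(0, 1)]}
g2 = {'x': [[3.0], [4.0], [5.0]], 'edges': [(0, 1), (1, 2)]}

merged = collate([g1, g2])
print(merged['edges'])  # [(0, 1), (2, 3), (3, 4)]
print(merged['batch'])  # [0, 0, 1, 1, 1]
```

This is the ``batch`` vector that the model's ``forward`` method receives and passes on to the pooling layer.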




## Model Training

In [5]:
# Train the model
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, train_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/jonas/Programming/chem_mat_data/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default

  | Name        | Type           | Params | Mode 
-------------------------------------------------------
0 | encoder     | GIN            | 23.7 K | train
1 | global_pool | SumAggregation | 0      | train
2 | predictor   | Sequential     | 4.2 K  | train
-------------------------------------------------------
27.9 K    Trainable 

Epoch 5:  44%|████▎     | 135/310 [00:01<00:02, 70.48it/s, v_num=8, train_loss_step=1.690, train_loss_epoch=1.890]   


Detected KeyboardInterrupt, attempting graceful shutdown ...



In [1]:
import numpy as np
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Switch model to evaluation mode
model.eval()

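# Collect all predictions and true values. Note that this iterates over the
# training loader, so the scores below measure the fit on the training data.
all_preds = []
all_true = []

# Gradients are not needed for evaluation
with torch.no_grad():
    for data in train_loader:
        x, edge_index, batch, y = data.x, data.edge_index, data.batch, data.y
        preds = model(x, edge_index, batch)
        all_preds.append(preds.cpu().numpy())
        all_true.append(y.cpu().numpy())

# Concatenate the per-batch arrays into one array each
all_preds = np.concatenate(all_preds, axis=0)
all_true = np.concatenate(all_true, axis=0)

# Calculate the R2 score and the mean squared error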
r2_value = r2_score(all_true, all_preds)
mse_value = mean_squared_error(all_true, all_preds)
print(f'R2 score: {r2_value}')
print(f'MSE score: {mse_value}')

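
For reference, the two metrics used in the evaluation cell can be written out by hand: the mean squared error averages the squared residuals, and the R2 score compares the residual sum of squares with the total variance of the true values. The following is a minimal sketch on made-up numbers; the function names ``mse_by_hand`` and ``r2_by_hand`` are hypothetical and are not the scikit-learn implementations.

```python
from typing import Sequence


def mse_by_hand(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions; only the last prediction is off by one.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

mse_value = mse_by_hand(y_true, y_pred)
r2_value = r2_by_hand(y_true, y_pred)
print(f'MSE: {mse_value}')  # MSE: 0.25
print(f'R2: {r2_value}')
```

An R2 score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean of the targets.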

## Loading Raw Datasets

Although loading pre-processed graphs is a convenient option, the package is not limited to training graph neural networks. To obtain the raw version of this dataset, call the ``load_smiles_dataset`` function with the same dataset name; it returns a pandas DataFrame containing the raw data.

In [7]:
from chem_mat_data import load_smiles_dataset

df = load_smiles_dataset('aqsoldb')
df.head()

| Unnamed: 0 | index | ID | SMILES | InChIKey | Solubility | split |
|---|---|---|---|---|---|---|
| 0 | 0 | A-10 | Cc1cccc(C=C)c1 | JZHGRUMIRATHIU-UHFFFAOYSA-N | -3.12315 | 1 |
| 1 | 1 | A-100 | Cc1cc(cc(C)c1O)C(C)(C)c2cc(C)c(O)c(C)c2 | ODJUOZPKKHIEOZ-UHFFFAOYSA-N | -4.952869 | 1 |
| 2 | 2 | A-1000 | O=C1CCCCCCCCCOCCCCCO1 | MKEIDVFLAWJKMY-UHFFFAOYSA-N | -3.883849 | 1 |
| 3 | 3 | A-1002 | CCCCCCCCCCC(C)CCCCCCCC | FFVPRSKCTDQLBP-UHFFFAOYSA-N | -6.451105 | 1 |
| 4 | 4 | A-1003 | NC(=O)N=NC(N)=O | XOZUGNYVDXMRKW-UHFFFAOYSA-N | -3.546243 | 1 |
