# MiSaTo-Dataset: a tutorial

In this notebook, we show how our QM and MD datasets are stored in HDF5 (H5) files and how the data can be loaded for use by a deep learning model.

We start by importing the required packages and setting the file paths.

In [11]:
import h5py
 
import torch_geometric.transforms as T
from torch_geometric.loader import DataLoader

from data.components.datasets import MolDataset, ProtDataset
from data.components.transformQM import GNNTransformQM
from data.components.transformMD import GNNTransformMD
from data.qm_datamodule import QMDataModule
from data.md_datamodule import MDDataModule

In [12]:
qmh5_file = "../data/QM/h5_files/qm.hdf5"
norm_file = "../data/QM/h5_files/qm_norm_fold1.hdf5"
norm_txtfile = "../data/QM/splits/train_norm_fold1.txt"

## Presentation of the H5 files

We open the QM H5 file and the H5 file used to normalize the target values.

In [15]:
# Open both files read-only, using the paths defined above
qm_H5File = h5py.File(qmh5_file, "r")
qm_normFile = h5py.File(norm_file, "r")

The ligands can be accessed by their PDB ID. Below we show the first ten molecules in the file.

In [16]:
list(qm_H5File.keys())[:10]

['10gs',
 '11gs',
 '13gs',
 '16pk',
 '184l',
 '185l',
 '186l',
 '187l',
 '188l',
 '1a07']

Target values can be accessed by giving, in brackets, the molecule name, then mol_properties, and finally the name of the target value:

In [17]:
qm_H5File["5gmm"]["mol_properties"].keys()

<KeysViewHDF5 ['Electron_Affinity', 'Electronegativity', 'Hardness', 'Ionization_Potential', 'Koopman', 'molecular_weight', 'total_charge']>

In [18]:
qm_H5File["5gmm"]["mol_properties"]["Electron_Affinity"][()]

7.7383
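The nested access pattern above can also be wrapped in a small helper. The following is a minimal sketch that runs on a toy in-memory file instead of the real dataset. The helper name `read_properties`, the file name `toy_qm.hdf5` and the placeholder `total_charge` value are assumptions made for illustration; only the group layout and the Electron_Affinity value come from the cells above.

```python
import h5py


def read_properties(h5_file, pdb_id):
    """Return every scalar entry of mol_properties for one ligand as floats."""
    group = h5_file[pdb_id]["mol_properties"]
    return {name: float(dataset[()]) for name, dataset in group.items()}


# Toy in-memory file (never written to disk) that mimics the layout above.
# Only the Electron_Affinity value is copied from the notebook output;
# the total_charge value is a placeholder.
with h5py.File("toy_qm.hdf5", "w", driver="core", backing_store=False) as toy:
    props = toy.create_group("5gmm/mol_properties")
    props.create_dataset("Electron_Affinity", data=7.7383)
    props.create_dataset("total_charge", data=0)

    toy_ids = list(toy.keys())
    toy_properties = read_properties(toy, "5gmm")

print(toy_ids)
print(toy_properties)
```

The same helper should work on `qm_H5File`, provided every entry of mol_properties is a scalar dataset, as the outputs above suggest.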

The mean and standard deviation of each target value are stored in the normalization file.
We first specify the target value, then either mean or std.

In [19]:
qm_normFile.keys()

<KeysViewHDF5 ['Electron_Affinity', 'Electronegativity', 'Hardness', 'Ionization_Potential']>

In [21]:
print(qm_normFile["Electron_Affinity"]["mean"][()])
print(qm_normFile["Electron_Affinity"]["std"][()])

6.33265
18.636927
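To illustrate what these statistics are for, here is a minimal sketch of z-score standardization and its inverse, using the numbers printed above. We assume the usual formula `(value - mean) / std`; the exact formula applied inside MolDataset is not shown in this notebook, and the function names are hypothetical.

```python
def standardize(value, mean, std):
    """Map a raw target value to zero-mean, unit-variance scale."""
    return (value - mean) / std


def destandardize(z, mean, std):
    """Map a standardized value (e.g. a model prediction) back to raw units."""
    return z * std + mean


# Values copied from the outputs above
ea_mean, ea_std = 6.33265, 18.636927
ea_5gmm = 7.7383

z = standardize(ea_5gmm, ea_mean, ea_std)
restored = destandardize(z, ea_mean, ea_std)
print(f"standardized: {z:.4f}, restored: {restored:.4f}")
```

A model trained on standardized targets predicts values on the `z` scale, so the inverse mapping is needed before comparing predictions with the raw values stored in the H5 file.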


## Datasets and dataloaders

### PyTorch

The QM and MD datasets are wrapped in PyTorch Dataset classes named MolDataset and ProtDataset, respectively.
Their parameters and types can be displayed with help().

In [22]:
help(MolDataset)

Help on class MolDataset in module data.components.datasets:

class MolDataset(torch.utils.data.dataset.Dataset)
 |  MolDataset(data_file, idx_file, target_norm_file, transform, isTrain=False, post_transform=None)
 |  
 |  Load the QM dataset.
 |  
 |  Method resolution order:
 |      MolDataset
 |      torch.utils.data.dataset.Dataset
 |      typing.Generic
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, index: int)
 |  
 |  __init__(self, data_file, idx_file, target_norm_file, transform, isTrain=False, post_transform=None)
 |      Args:
 |          data_file (str): H5 file path
 |          idx_file (str): path of txt file which contains pdb ids for a specific split such as train, val or test.
 |          target_norm_file (str): H5 file path where training mean and std are stored.  
 |          transform (obj): class that convert a dict to a PyTorch Geometric graph.
 |          isTrain (bool, optional): Flag to standardize the target values (only used

In [23]:
help(ProtDataset)

Help on class ProtDataset in module data.components.datasets:

class ProtDataset(torch.utils.data.dataset.Dataset)
 |  ProtDataset(md_data_file, idx_file, transform=None, post_transform=None)
 |  
 |  Load the MD dataset
 |  
 |  Method resolution order:
 |      ProtDataset
 |      torch.utils.data.dataset.Dataset
 |      typing.Generic
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, index: int)
 |  
 |  __init__(self, md_data_file, idx_file, transform=None, post_transform=None)
 |      Args:
 |          md_data_file (str): H5 file path
 |          idx_file (str): path of txt file which contains pdb ids for a specific split such as train, val or test.
 |          transform (obj): class that convert a dict to a PyTorch Geometric graph.
 |          post_transform (PyTorch Geometric, optional): data augmentation. Defaults to None.
 |  
 |  __len__(self) -> int
 |  
 |  ----------------------------------------------------------------------
 |  Data and oth

We can load the data by instantiating MolDataset with the QM H5 file, the text file listing the molecules used for training, and the norm file used to normalize the target values.

Without a transform, MolDataset returns a dictionary that contains the elements and their coordinates. We use the GNNTransformQM class to convert this dictionary into a graph that a GNN can consume. The post_transform parameter is a further transformation used for data augmentation.

In [24]:
transform = T.RandomTranslate(0.25)
batch_size = 128
num_workers = 48

data_train = MolDataset(qmh5_file, norm_txtfile, target_norm_file=norm_file, transform=GNNTransformQM(), post_transform=transform)
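The way transform and post_transform are chained can be pictured with a toy, pure-Python dataset. This is only a sketch under assumptions: `ToyMolDataset`, `to_toy_graph` and `random_translate` are hypothetical stand-ins, the fully connected edge list is a simplification, and the real MolDataset and GNNTransformQM may differ in both order and detail. The jitter mimics what `T.RandomTranslate(0.25)` does, namely shifting each position by a random offset within the given interval.

```python
import random


class ToyMolDataset:
    """Hypothetical stand-in: raw dict -> transform -> post_transform."""

    def __init__(self, records, transform=None, post_transform=None):
        self.records = records
        self.transform = transform
        self.post_transform = post_transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        item = dict(self.records[index])  # shallow copy of the raw dictionary
        if self.transform is not None:
            item = self.transform(item)  # dictionary -> graph-like structure
        if self.post_transform is not None:
            item = self.post_transform(item)  # data augmentation
        return item


def to_toy_graph(item):
    """Hypothetical transform: connect every pair of distinct atoms."""
    n = len(item["elements"])
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    return {**item, "edge_index": edges}


def random_translate(max_shift, rng):
    """Return a transform that shifts each coordinate by at most max_shift."""
    def apply(item):
        moved = [[c + rng.uniform(-max_shift, max_shift) for c in xyz]
                 for xyz in item["coordinates"]]
        return {**item, "coordinates": moved}
    return apply


# One made-up three-atom molecule (atomic numbers and coordinates)
records = [{"elements": [8, 1, 1],
            "coordinates": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]}]

toy_dataset = ToyMolDataset(records, transform=to_toy_graph,
                            post_transform=random_translate(0.25, random.Random(0)))
sample = toy_dataset[0]
print(len(sample["edge_index"]), sample["coordinates"])
```

Because the augmentation runs inside `__getitem__`, each access to the same index yields a freshly perturbed copy while the stored record stays unchanged.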

Finally, we can load our data using the PyTorch DataLoader.

In [25]:
train_loader = DataLoader(data_train, batch_size, shuffle=True, num_workers=0)

for idx, val in enumerate(train_loader):
    print(val)
    break

DataBatch(x=[7138, 25], edge_index=[2, 132512], edge_attr=[132512, 1], y=[256], pos=[7138, 3], id=[128], batch=[7138], ptr=[129])
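In the printed DataBatch, `batch` has one entry per node and `ptr` has one entry more than the number of graphs (129 for 128 graphs). PyTorch Geometric merges all graphs of a mini-batch into one large disconnected graph, and these two vectors record which node belongs to which graph. The sketch below rebuilds them in plain Python for three made-up graph sizes; the function name is hypothetical.

```python
from itertools import accumulate


def build_batch_vectors(nodes_per_graph):
    """Rebuild the `batch` and `ptr` bookkeeping vectors of a merged batch."""
    # batch[i] is the index of the graph that node i belongs to
    batch = [g for g, n in enumerate(nodes_per_graph) for _ in range(n)]
    # ptr[g]:ptr[g + 1] is the slice of nodes that belongs to graph g
    ptr = [0] + list(accumulate(nodes_per_graph))
    return batch, ptr


batch_vec, ptr_vec = build_batch_vectors([3, 2, 4])
print(batch_vec)  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(ptr_vec)    # [0, 3, 5, 9]
```

The shape `y=[256]` is twice the batch size, which suggests that two target values are stored per molecule; this reading is an inference from the printed shapes, not something the notebook states.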


### PyTorch Lightning

QMDataModule is a class that inherits from LightningDataModule. It instantiates MolDataset for the training, validation, and test sets and returns a dataloader for each set.

We start by instantiating QMDataModule.

In [26]:
files_root =  "/p/project/hai_drug_qm/MiSaTo-dataset/data/QM/"
fold = 1
qmdata = QMDataModule(files_root, fold)
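The two arguments, files_root and fold, are enough to locate the files that were passed by hand to MolDataset earlier. The sketch below shows one way such a mapping could look, following the file names used at the top of this notebook. It is an assumption for illustration only: `qm_train_paths` is a hypothetical helper, and how QMDataModule actually resolves its paths is not shown here.

```python
from pathlib import PurePosixPath


def qm_train_paths(files_root, fold):
    """Hypothetical helper: derive the training file paths from root and fold."""
    root = PurePosixPath(files_root)
    return {
        "data_file": root / "h5_files" / "qm.hdf5",
        "target_norm_file": root / "h5_files" / f"qm_norm_fold{fold}.hdf5",
        "idx_file": root / "splits" / f"train_norm_fold{fold}.txt",
    }


paths = qm_train_paths("../data/QM/", fold=1)
for name, path in paths.items():
    print(f"{name}: {path}")
```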

Then we call the setup function to instantiate MolDataset for the training, validation, and test sets.

In [27]:
qmdata.setup()

Finally, we can return a dataloader for each set.

In [28]:
train_loader = qmdata.train_dataloader()

for idx, val in enumerate(train_loader):
    print(val)
    break
    

DataBatch(x=[8109, 25], edge_index=[2, 157024], edge_attr=[157024, 1], y=[256], pos=[8109, 3], id=[128], batch=[8109], ptr=[129])


## MD dataset

The same steps can be used to load the MD dataset: we first build a ProtDataset and a DataLoader directly, then use MDDataModule.

In [15]:
mdh5_file = '../data/MD/h5_files/MD_dataset_soft_hard_noH.hdf5'
train_idx = "../data/MD/splits/train_soft_hard.txt"

post_transform = T.RandomTranslate(0.1)

train_dataset = ProtDataset(mdh5_file, train_idx, transform=GNNTransformMD(), post_transform=post_transform)

In [16]:
train_loader = DataLoader(train_dataset, batch_size, shuffle=True, num_workers=0)

for idx, val in enumerate(train_loader):
    print(val)
    break

DataBatch(x=[429928, 11], edge_index=[2, 6822188], edge_attr=[6822188], y=[429928], pos=[429928, 3], ids=[128], batch=[429928], ptr=[129])
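The printed shapes also give a quick feel for how much larger the MD graphs are than the QM ones. The sketch below derives per-graph averages from the two DataBatch outputs shown in this notebook. The helper name is hypothetical, and we assume that the second dimension of edge_index counts directed edges.

```python
def batch_stats(num_nodes, num_edges, num_graphs):
    """Average graph size derived from the shapes printed for one batch."""
    return {
        "nodes_per_graph": num_nodes / num_graphs,
        "edges_per_node": num_edges / num_nodes,
    }


# Shapes copied from the DataBatch outputs of the two PyTorch DataLoader runs
qm_stats = batch_stats(num_nodes=7138, num_edges=132512, num_graphs=128)
md_stats = batch_stats(num_nodes=429928, num_edges=6822188, num_graphs=128)

print("QM:", qm_stats)
print("MD:", md_stats)
print("MD/QM node ratio:", md_stats["nodes_per_graph"] / qm_stats["nodes_per_graph"])
```

With the same batch size of 128, an MD batch holds several thousand atoms per graph on average, so a smaller batch size may be worth considering if memory becomes a constraint.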


In [17]:
files_root =  "../data/MD"

mddata = MDDataModule(files_root)

In [18]:
mddata.setup()

In [19]:
train_loader = mddata.train_dataloader()

for idx, val in enumerate(train_loader):
    print(val)
    break