### Loading Datasets




#### Underlying Code
All dataset preprocessing is done in the 'MolecularDataSet' class. To load custom data given as SMILES strings in a CSV file, create a derived class of 'MolecularDataSet'
that knows the structure of the CSV file (which columns are features and which are targets).

Derived classes already exist for several datasets (QM9, logP, logS). An example of a custom dataset class is shown below.

#### Usage
```python
import os

# MolecularDataSet is assumed to be in scope here (for example, because this
# class is defined next to it).


class LogSDataset(MolecularDataSet):

    def __init__(self, data_path):

        # A folder at data_path means processed data has already been saved
        if os.path.isdir(data_path):
            print("Loading dataset from folder")
            super().__init__("", "LogS")
            data_path = data_path + "/"
            self.load(data_path)

        else:
            print("Loading Raw Data and processing it")
            save_path = data_path + "/"
            data_path = data_path + ".csv"
            super().__init__(data_path, "LogS")
            self.process()

            # Save the processed data, creating the folder if needed
            print("Saving data.")
            os.makedirs(save_path, exist_ok=True)
            self.save(save_path)

    def seperate_smiles_and_targets(self):
        # Column 0 of the raw table holds the SMILES strings
        smiles = self.raw_table_data.iloc[:, 0].values

        # All remaining columns are stored together as a matrix of targets
        targets = self.raw_table_data.iloc[:, 1:].values

        return smiles, targets
```

As you can see, the initializer only resolves the data path and calls the relevant 'MolecularDataSet' methods. This part is mostly the same for all datasets.
The main difference lies in the 'seperate_smiles_and_targets' function, which tells the 'MolecularDataSet' class which column holds the SMILES strings and which columns hold all other information.
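
As a rough illustration of what such a function has to do, the sketch below splits a small in-memory CSV into a list of SMILES strings and a matrix of targets. It uses the standard library csv module instead of the pandas table used above so that it runs on its own. The function name split_smiles_and_targets, the column name smiles, and the sample values are made up for this example and are not part of the repository.

```python
import csv
import io


def split_smiles_and_targets(csv_text, smiles_column="smiles"):
    """Split CSV text into SMILES strings and rows of float targets.

    Hypothetical helper: every column other than `smiles_column` is
    treated as a numeric target.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    target_columns = [name for name in reader.fieldnames if name != smiles_column]

    smiles, targets = [], []
    for row in reader:
        smiles.append(row[smiles_column])
        targets.append([float(row[name]) for name in target_columns])
    return smiles, targets


# Made-up placeholder data, for illustration only
sample_csv = "smiles,target_a,target_b\nCCO,1.0,2.0\nc1ccccc1,3.0,4.0\n"
example_smiles, example_targets = split_smiles_and_targets(sample_csv)
print(example_smiles)
print(example_targets)
```

Selecting columns by name rather than by position is a design choice of this sketch; the class above selects them by position with iloc.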

Processing data can take a considerable amount of time. For this reason, 'MolecularDataSet' provides save and load functions for quickly storing and restoring processed data. They can be called manually, but by default the derived class's initializer takes care of this: it loads processed data if it finds it at the specified location, and otherwise processes the raw data and saves the result.
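
The process-once, load-afterwards pattern can be sketched in isolation as follows. This is only an approximation under stated assumptions: load_or_process is a hypothetical helper, the file name processed.pkl and the use of pickle are invented for the example, and the actual on-disk format used by the save and load functions may differ.

```python
import os
import pickle
import tempfile


def load_or_process(base_path, process_fn):
    """Return processed data, reusing a cached copy in `base_path/` if present.

    Hypothetical helper mirroring the initializer's logic: an existing
    folder means processed data was saved earlier; otherwise the data is
    processed once and written to that folder.
    """
    cache_file = os.path.join(base_path, "processed.pkl")  # made-up file name
    if os.path.isdir(base_path):
        with open(cache_file, "rb") as handle:
            return pickle.load(handle), "loaded"

    data = process_fn()
    os.makedirs(base_path, exist_ok=True)
    with open(cache_file, "wb") as handle:
        pickle.dump(data, handle)
    return data, "processed"


with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "toy_dataset")
    first = load_or_process(base, lambda: [1, 2, 3])   # no folder yet: process and save
    second = load_or_process(base, lambda: [1, 2, 3])  # folder exists: load from disk
    print(first[1], second[1])
```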



```python
# Here we import a small sample of the QM9 dataset (500 molecules).

from dataset import QM9

qm9_sample = QM9("../data/smaller_sample")
```

```
Loading dataset from folder
Initializing Molecular Representation Generator
```


```python
# To get an idea of the structure of the dataset, we can print the first molecule.
# The dataset itself is also printable.

molecule = qm9_sample[0]

# Print the dimensions of each feature with a description of what it is
print(f"Atomic Features: {molecule[0].shape} - This represents the atomic features of the molecule")
print(f"Bond Features: {molecule[1].shape} - This represents the bond features of the molecule")
print(f"Angle Features: {molecule[2].shape} - This represents the angle features of the molecule")
print(f"Dihedral Features: {molecule[3].shape} - This represents the dihedral features of the molecule")
print(f"Global Molecular Features: {molecule[4].shape} - This represents the global molecular features of the molecule")
print(f"Bond Indices: {molecule[5].shape} - This represents the bond indices of the molecule")
print(f"Angle Indices: {molecule[6].shape} - This represents the angle indices of the molecule")
print(f"Dihedral Indices: {molecule[7].shape} - This represents the dihedral indices of the molecule")
print(f"Target: {molecule[8].shape} - This represents the target of the molecule")

print("\n\n", qm9_sample)
```

```
Atomic Features: torch.Size([5, 80]) - This represents the atomic features of the molecule
Bond Features: torch.Size([8, 10]) - This represents the bond features of the molecule
Angle Features: torch.Size([12]) - This represents the angle features of the molecule
Dihedral Features: torch.Size([0]) - This represents the dihedral features of the molecule
Global Molecular Features: torch.Size([200]) - This represents the global molecular features of the molecule
Bond Indices: torch.Size([8, 2]) - This represents the bond indices of the molecule
Angle Indices: torch.Size([12, 2]) - This represents the angle indices of the molecule
Dihedral Indices: torch.Size([0]) - This represents the dihedral indices of the molecule
Target: torch.Size([19]) - This represents the target of the molecule


 Dataset Name: QM9
Number of Molecules Loaded: 498
```
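
Because each entry is accessed by position (0 to 8), it can help to give the positions names. The sketch below is one possible way to do that with a NamedTuple. MoleculeRecord is a hypothetical name that does not exist in the repository, and the sketch assumes each dataset entry has exactly the nine elements printed above. Placeholder strings stand in for the tensors so the example runs without the dataset.

```python
from typing import Any, NamedTuple


class MoleculeRecord(NamedTuple):
    """Hypothetical named view of the nine-element entry printed above."""
    atomic_features: Any
    bond_features: Any
    angle_features: Any
    dihedral_features: Any
    global_features: Any
    bond_indices: Any
    angle_indices: Any
    dihedral_indices: Any
    target: Any


# Stand-in values; with the real data this would be MoleculeRecord(*qm9_sample[0])
placeholder = tuple(f"field_{k}" for k in range(9))
record = MoleculeRecord(*placeholder)

print(record.target)                      # same element as record[8]
print(record[5] == record.bond_indices)   # positional access still works
```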


In the architecture used for prediction here, autoencoders reduce the sizes of the atomic and bond vectors before they are passed to the graph neural network. This lowers the number of parameters in the network, which reduces overfitting and increases speed. The autoencoders must, of course, be trained first.

A full and precise description of the implementation can be found in model.py.
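
To get a feel for the savings, the sketch below counts the parameters of a single fully connected layer applied to the raw vectors (sizes 80 and 10, as printed earlier) and to the compressed vectors (sizes 10 and 3, the latent sizes used in this walkthrough). The hidden width of 64 and the single-layer setup are assumptions chosen only for illustration; they do not describe the actual layers in model.py.

```python
def linear_layer_parameters(input_size, output_size):
    """Number of weights plus biases in one fully connected layer."""
    return input_size * output_size + output_size


HIDDEN_SIZE = 64  # assumed width, chosen only for illustration

raw_atom = linear_layer_parameters(80, HIDDEN_SIZE)
latent_atom = linear_layer_parameters(10, HIDDEN_SIZE)
raw_bond = linear_layer_parameters(10, HIDDEN_SIZE)
latent_bond = linear_layer_parameters(3, HIDDEN_SIZE)

print(raw_atom, latent_atom)   # parameters for raw vs. compressed atomic vectors
print(raw_bond, latent_bond)   # parameters for raw vs. compressed bond vectors
```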


```python
import torch
from model import Autoencoder  # The autoencoder model is defined in model.py

# Here we create two instances of the autoencoder model, one for atoms and the other for bonds.

# From the dimensions printed earlier, the atomic features have a dimension of 80
# and the bond features have a dimension of 10.
# We reduce these dimensions to 10 and 3 respectively.
atom_autoencoder = Autoencoder(80, 10)
bond_autoencoder = Autoencoder(10, 3)

# Training is done in two phases: first the autoencoders are trained, then the GNN.
# For now we begin by training the autoencoders.

mse_loss_fn = torch.nn.MSELoss()
atom_optimizer = torch.optim.Adam(atom_autoencoder.parameters())
bond_optimizer = torch.optim.Adam(bond_autoencoder.parameters())
```


```python
# We now write a simple training loop for the autoencoders.
# The printed labels say "logs" and "LOG S", but the loop iterates over qm9_sample.

n_epochs = 1
printstep = 50


print("Training autoencoders on logs")
for epoch_i in range(n_epochs):
    avg_atom_rmse_loss = 0
    avg_bond_rmse_loss = 0
    for i, molecule in enumerate(qm9_sample):

        atom_features = molecule[0]
        bond_features = molecule[1]

        # Forward pass
        reconstructed_atom = atom_autoencoder(atom_features)
        reconstructed_bond = bond_autoencoder(bond_features)

        # Reconstruction loss
        atom_loss = mse_loss_fn(reconstructed_atom, atom_features)
        bond_loss = mse_loss_fn(reconstructed_bond, bond_features)

        # Backward pass
        atom_optimizer.zero_grad()
        bond_optimizer.zero_grad()

        atom_loss.backward()
        bond_loss.backward()

        atom_optimizer.step()
        bond_optimizer.step()

        # Running average of the per-molecule RMSE
        avg_atom_rmse_loss = (avg_atom_rmse_loss * i + atom_loss.item() ** 0.5) / (i + 1)
        avg_bond_rmse_loss = (avg_bond_rmse_loss * i + bond_loss.item() ** 0.5) / (i + 1)

        if i % printstep == 0:
            print(f"LOG S Epoch: {epoch_i}, ex: {i}, Atom RMSE Loss: {avg_atom_rmse_loss}, Bond RMSE Loss: {avg_bond_rmse_loss}")
```

```
Training autoencoders on logs
LOG S Epoch: 0, ex: 0, Atom RMSE Loss: 0.28671990791939317, Bond RMSE Loss: 0.4493634861517593
LOG S Epoch: 0, ex: 50, Atom RMSE Loss: 0.2476801213662089, Bond RMSE Loss: 0.459343467845178
LOG S Epoch: 0, ex: 100, Atom RMSE Loss: 0.2064813139610872, Bond RMSE Loss: 0.4277830984895229
LOG S Epoch: 0, ex: 150, Atom RMSE Loss: 0.1904817979424135, Bond RMSE Loss: 0.3641877603035393
LOG S Epoch: 0, ex: 200, Atom RMSE Loss: 0.17994645552349764, Bond RMSE Loss: 0.3550276530088284
LOG S Epoch: 0, ex: 250, Atom RMSE Loss: 0.16693644293082743, Bond RMSE Loss: 0.3405718693533781
LOG S Epoch: 0, ex: 300, Atom RMSE Loss: 0.15665510029852903, Bond RMSE Loss: 0.3123839913822101
LOG S Epoch: 0, ex: 350, Atom RMSE Loss: 0.15087960591770555, Bond RMSE Loss: 0.3009052477743147
LOG S Epoch: 0, ex: 400, Atom RMSE Loss: 0.14413573272744484, Bond RMSE Loss: 0.28227833773886024
LOG S Epoch: 0, ex: 450, Atom RMSE Loss: 0.1385684196418199, Bond RMSE Loss: 0.275001315346767
```
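
The averages reported above use an incremental update, avg = (avg * i + x) / (i + 1), which gives the mean of all values seen so far without storing them. The short sketch below checks this on a few made-up loss values (they are not results from this notebook). It also shows that the mean of per-example RMSE values is not the same quantity as the square root of the mean MSE.

```python
import math

# Made-up per-example MSE values, for illustration only
mse_values = [0.04, 0.16, 0.36]

running_rmse = 0.0
running_mse = 0.0
for i, mse in enumerate(mse_values):
    # Same incremental update as in the training loop
    running_rmse = (running_rmse * i + mse ** 0.5) / (i + 1)
    running_mse = (running_mse * i + mse) / (i + 1)

# Reference value computed directly from the full list
batch_rmse_mean = sum(math.sqrt(m) for m in mse_values) / len(mse_values)

print(math.isclose(running_rmse, batch_rmse_mean))  # running update matches the plain mean
print(running_rmse < math.sqrt(running_mse))        # mean of RMSEs vs. root of mean MSE
```

For these values the mean of the RMSEs is smaller than the root of the mean MSE, so the two numbers should not be compared directly.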


```python
# We now finally move on to training the actual GNN model itself.
# The implementation of the model can be found in model.py
from model import GNN3D

# Vector sizes match the autoencoder latent sizes (10 and 3) and the 200 global molecular features
gnn3d = GNN3D(atomic_vector_size=10, bond_vector_size=3, number_of_molecular_features=200, number_of_targets=1)

gnn_optimizer = torch.optim.Adam(gnn3d.parameters())
```


```python
for epoch_i in range(n_epochs):
    avg_rmse_loss = 0
    avg_mse_loss = 0
    for i, molecule in enumerate(qm9_sample):

        target = molecule[8]
        # Keep all eight inputs (indices 0 to 7); index 8 is the target
        molecule = list(molecule[0:8])

        # Replace the raw atomic and bond features with their latent encodings
        molecule[0] = atom_autoencoder.encode(molecule[0])
        molecule[1] = bond_autoencoder.encode(molecule[1])

        # Forward pass
        prediction = gnn3d(molecule)

        # Calculating loss
        loss = mse_loss_fn(prediction, target)

        # Backward pass
        gnn_optimizer.zero_grad()
        loss.backward()
        gnn_optimizer.step()

        # Running averages of the per-molecule RMSE and MSE
        avg_rmse_loss = (avg_rmse_loss * i + (loss.item() ** 0.5)) / (i + 1)
        avg_mse_loss = (avg_mse_loss * i + loss.item()) / (i + 1)

        if i % printstep == 0:
            print(f"LOG P Epoch: {epoch_i}, ex: {i}, Avg. RMSE Loss: {avg_rmse_loss}, Avg. MSE Loss: {avg_mse_loss}")
```

The input slice must cover indices 0 to 7. With 'molecule[0:7]' the list has only seven entries, and the cell fails, most likely in the model's forward pass, with:

```
ValueError: not enough values to unpack (expected 8, got 7)
```
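
This kind of unpack error can be reproduced in isolation. The sketch below assumes, as the message suggests, that the model unpacks its input into eight values. The entry tuple is a stand-in built from placeholder strings, not real dataset content. A slice [0:7] stops before index 7 and so yields seven items, while [0:8] yields the eight inputs and leaves out only the target at index 8.

```python
# Stand-in for one dataset entry: eight inputs followed by the target
entry = tuple(f"input_{k}" for k in range(8)) + ("target",)

short = list(entry[0:7])  # indices 0..6, seven items
full = list(entry[0:8])   # indices 0..7, eight items

message = ""
try:
    # Unpacking seven items into eight names fails
    a, b, c, d, e, f, g, h = short
except ValueError as error:
    message = str(error)

# Unpacking the full slice succeeds
a, b, c, d, e, f, g, h = full

print(len(short), len(full))
print(message)
```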