# Molecular graph generation with PyTorch and PyGeometric
> We use [GraphVAE](https://arxiv.org/abs/1802.03480) for molecular generation: the decoder emits a probabilistic graph of predefined maximum size in one shot.

- toc: true 
- badges: true
- comments: false
- author: Anirudh Jain
- categories: [graph generation, pytorch, pygeometric, tutorial]

# Requirements

The following packages need to be installed:
- rdkit
- pytorch
- torch_geometric
- networkx

In [1]:
#collapse-hide

#Initial imports

import numpy as np
import torch
import matplotlib.pyplot as plt
from glob import glob
import tqdm

# Introduction

We represent a molecule as a graph $G = (\mathcal{X, A})$ using the PyTorch Geometric (PyGeometric) framework. Each molecule is described by a feature matrix $\mathcal{X}$ and an adjacency matrix $\mathcal{A}$. We use the QM9 dataset from [MoleculeNet: A Benchmark for Molecular Machine Learning](https://arxiv.org/abs/1703.00564), implemented in `torch_geometric.datasets.QM9`. PyGeometric relies on RDKit to parse the molecules and convert them into graphs.

We modify the data processing script in two ways:
- We strip hydrogen atoms from the molecules to keep only the heavy atoms.
- We kekulize the molecules to convert aromatic rings to Kekulé form.

The modified script can be found [here](https://gist.github.com/sponde25/7dfa5492c21c007cf1e60a02dced1334).

After processing, the dataset contains molecules built from four heavy-atom types (C, N, O, F) and three bond types (SINGLE, DOUBLE, and TRIPLE), with a maximum graph size of 9 atoms.

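The decoder outputs the graph as one-hot encoded matrices: `[9 x 5]` for atoms (four atom types plus one empty label) and `[MAX_E x 4]` for bonds (three bond types plus one empty label), where `MAX_E` is the number of candidate edges computed below. In both cases, label 0 represents an empty atom or edge.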
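As a minimal sketch of this encoding, the snippet below one-hot encodes the heavy atoms of a small molecule into a `[9 x 5]` matrix, padding unused slots with label 0. It uses plain Python lists rather than tensors to stay self-contained. The helper name `one_hot_atoms` and the ordering of the atom labels (C=1, N=2, O=3, F=4) are assumptions made for illustration and are not taken from the modified QM9 script.

```python
# Hypothetical label assignment: 0 is the empty slot, the rest follow C, N, O, F.
ATOM_LABELS = {"C": 1, "N": 2, "O": 3, "F": 4}
NUM_SLOTS = 9      # maximum graph size after stripping hydrogens
NUM_CLASSES = 5    # 4 heavy-atom types + the empty label


def one_hot_atoms(symbols):
    """Return a NUM_SLOTS x NUM_CLASSES one-hot matrix for a list of atom symbols."""
    if len(symbols) > NUM_SLOTS:
        raise ValueError("molecule has more atoms than available slots")
    # Pad with label 0 so that every molecule has exactly NUM_SLOTS rows.
    labels = [ATOM_LABELS[s] for s in symbols] + [0] * (NUM_SLOTS - len(symbols))
    return [[1 if c == label else 0 for c in range(NUM_CLASSES)] for label in labels]


# Heavy atoms of ethanol (SMILES: CCO)
atoms = one_hot_atoms(["C", "C", "O"])
for row in atoms:
    print(row)
```

Each row contains exactly one 1, and the rows after the third atom all select column 0, the empty label.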

In [4]:
#Imports for data pre-processing

import torch_geometric
from qm9_modified import QM9
from torch_geometric.utils.convert import to_networkx
import networkx

In [8]:
# Setting up variables for the dataset

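MAX_ATOM = 5  # Atom classes: 4 heavy-atom types + label 0 for an empty slot
MAX_EDGE = 4  # Bond classes: 3 bond types + label 0 for no edge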
path = '/scratch/project_2002655/datasets/qm9_noH' # Change the path for your local directory structure
dataset = QM9(path)

# Store the max. graph size
MAX_N = max(data.x.shape[0] for data in dataset)
MAX_E = MAX_N * (MAX_N - 1)
print('MAX ATOMS: {}'.format(MAX_N))    # Maximum number of atoms in a graph in the dataset
print('MAX EDGE: {}'.format(MAX_E))     # Number of ordered atom pairs; the upper triangle alone has MAX_E / 2 entries

MAX ATOMS: 9
MAX EDGE: 72
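
The two edge counts that appear in this section can be checked by direct enumeration. The sketch below uses only the standard library. It counts the ordered atom pairs, which matches `MAX_E` above, and the unordered pairs, which matches the number of entries above the diagonal of a `9 x 9` adjacency matrix.

```python
from itertools import combinations, permutations

MAX_N = 9  # maximum number of atoms, as printed above

ordered_pairs = len(list(permutations(range(MAX_N), 2)))    # (i, j) and (j, i) counted separately
unordered_pairs = len(list(combinations(range(MAX_N), 2)))  # only i < j, i.e. the upper triangle

print(ordered_pairs)    # 72 = MAX_N * (MAX_N - 1)
print(unordered_pairs)  # 36 = MAX_N * (MAX_N - 1) / 2
```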


`torch_geometric` stores each graph as a `torch_geometric.data.Data` object, from which we generate the one-hot representation of the graph $G$ described above. For each graph $G$, we create a one-hot atom matrix $\mathcal{X}$ of dimension `[MAX_N x MAX_ATOM]` and a one-hot bond matrix $\mathcal{A}$ of dimension `[MAX_E x MAX_EDGE]`.

![](../images/data_representation.png "A visualization of the graph, atom and edge representations")

In [11]:
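# Build a lookup table that maps each atom pair (i, j) with i < j to its index in the edge vector $\mathcal{A}$.
# Entries on and below the diagonal stay 0 and are not valid indices (0 is also the index of pair (0, 1)).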

index_array = torch.zeros([MAX_N, MAX_N], dtype=int)
idx = 0
for i in range(MAX_N):
    for j in range(MAX_N):
        if i < j:
            index_array[i, j] = idx
            idx+=1

print(index_array)

tensor([[ 0,  0,  1,  2,  3,  4,  5,  6,  7],
        [ 0,  0,  8,  9, 10, 11, 12, 13, 14],
        [ 0,  0,  0, 15, 16, 17, 18, 19, 20],
        [ 0,  0,  0,  0, 21, 22, 23, 24, 25],
        [ 0,  0,  0,  0,  0, 26, 27, 28, 29],
        [ 0,  0,  0,  0,  0,  0, 30, 31, 32],
        [ 0,  0,  0,  0,  0,  0,  0, 33, 34],
        [ 0,  0,  0,  0,  0,  0,  0,  0, 35],
        [ 0,  0,  0,  0,  0,  0,  0,  0,  0]])
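
The same mapping can also be computed in closed form, without a lookup table, and inverted to recover the atom pair from an edge index. The following is a minimal sketch in plain Python. `pair_to_index` and `index_to_pair` are hypothetical helper names that are not part of the original notebook.

```python
MAX_N = 9  # maximum number of atoms per graph


def pair_to_index(i, j, n=MAX_N):
    """Row-major index of the pair (i, j), i < j, within the strict upper triangle."""
    if not 0 <= i < j < n:
        raise ValueError("expected 0 <= i < j < n")
    # Rows 0..i-1 contribute (n-1) + (n-2) + ... + (n-i) entries before row i starts.
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def index_to_pair(idx, n=MAX_N):
    """Inverse of pair_to_index: recover (i, j) from an edge index."""
    if not 0 <= idx < n * (n - 1) // 2:
        raise ValueError("edge index out of range")
    i = 0
    # Skip whole rows until idx falls inside row i, which holds n - i - 1 entries.
    while idx >= n - i - 1:
        idx -= n - i - 1
        i += 1
    return i, i + 1 + idx


print(pair_to_index(0, 1))   # 0
print(pair_to_index(1, 2))   # 8
print(pair_to_index(7, 8))   # 35
print(index_to_pair(15))     # (2, 3)
```

Unlike the lookup table, these helpers reject pairs on or below the diagonal instead of returning 0, which avoids confusing an invalid pair with the valid index of pair (0, 1).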


# Model

# Results

# Conclusion