## Physics-Constrained Predictive Molecular Latent Space Discovery with Graph Scattering Variational Autoencoder

ABSTRACT: Recent advances in artificial intelligence have propelled the development of innovative computational materials modeling and design techniques. Generative deep learning models have been used for molecular representation, discovery and design. In this work, we assess the predictive capabilities of a molecular generative model developed based on variational inference and graph theory in the small data regime. Physical constraints that encourage energetically stable molecules are proposed. The encoding network is based on the scattering transform with adaptive spectral filters to allow for better generalization of the model. The decoding network is a one-shot graph generative model that conditions atom types on molecular topology. A Bayesian formalism is considered to capture uncertainties in the predictive estimates of molecular properties. The model’s performance is evaluated by generating molecules with desired target properties.

Link to paper: https://arxiv.org/pdf/2009.13878v2.pdf

Credit: https://github.com/zabaras/GSVAE

In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/zabaras/GSVAE.git
%cd GSVAE

/content/GSVAE


In [None]:
# Install dependencies / requirements
!pip install -r requirements.txt

# Install RDKit
!pip install rdkit-pypi==2021.3.1.5

### Data

Data samples are generated through `data_gen.py`, which also performs classic bootstrapping. The script accepts the following arguments:

```bash
optional arguments:
  --data_size           Total size of the training + test dataset (default: 100000)
  --N                   Size of the training set. This only affects the bootstrapping (default: 600)
  --n_samples           Number of the bootstrap samples (default: 1, no bootstrap)
```
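
As a rough illustration of what classic bootstrapping means here, the sketch below draws index resamples of the training set with replacement. The function name `bootstrap_indices` is hypothetical and is not taken from `data_gen.py`; its arguments only mirror the meaning of `--N` and `--n_samples`, and the repository's actual resampling code may differ.

```python
import random


def bootstrap_indices(n_train, n_samples, seed=0):
    """Return n_samples index lists, each drawn with replacement from range(n_train).

    Hypothetical helper for illustration; not part of the GSVAE repository.
    """
    rng = random.Random(seed)
    return [
        [rng.randrange(n_train) for _ in range(n_train)]
        for _ in range(n_samples)
    ]


# Three bootstrap resamples of a 600-molecule training set
resamples = bootstrap_indices(n_train=600, n_samples=3)
print(len(resamples), len(resamples[0]))
```

Because sampling is with replacement, each resample repeats some training molecules and omits others, which is what produces variation between the bootstrap replicas.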

Note that Bayesian bootstrapping is performed separately, in the main code (`main.py`). To generate data, run:

In [3]:
# The working directory is already /content/GSVAE
%cd data
!python data_gen.py
%cd ..

/content/GSVAE/data
Extracting QM9 dataset, it takes time...
Downloading from https://ndownloader.figshare.com/files/3195389...
100% 133885/133885 [00:06<00:00, 19479.81it/s]
100% 100000/100000 [02:05<00:00, 799.28it/s]


### Run

#### Training

The model is trained using `main.py`. The script accepts the following arguments:

```bash
optional arguments:
  --epochs              number of epochs to train (default: 1900)
  --batch_number        number of batches per epoch (default: 25)
  --gpu_mode            accelerate the script using GPU (default: 1)
  --z_dim               latent space dimensionality (default: 30)
  --seed                random seed (default: 1400)
  --loadtrainedmodel    path to trained model
  --mu_reg_1            regularization parameter for ghost nodes and valence constraint (default: 0)
  --mu_reg_2            regularization parameter for connectivity constraint (default: 0)
  --mu_reg_3            regularization parameter for 3-member cycle constraint (default: 0)
  --mu_reg_4            regularization parameter for cycle with triple bond constraint (default: 0)
  --N_vis               number of test data for visualization (default: 3000)
  --log_interval        number of epochs between visualizations (default: 200)
  --mol_vis             visualize sampled molecules (default: 0)
  --n_samples           number of generated samples from molecular space (default: 10000)
  --wlt_scales          number of wavelet scales (default: 12)
  --scat_layers         number of scattering layers (default: 4)
  --database            name of the training database (default: 'QM9')
  --datafile            name and location of the training file in data folder (default: 'QM9_0.data')
  --BB_samples          index for Bayesian bootstrap sample (default: 0)
  --N                   number of training data (default: 600)
  --res                 path for storing the results (default: 'results/')
  --y_id                index for target property in the conditional design (default: None, unconditional design)
  --y_target            target property value in the conditional design (default: None, unconditional design)
```

After generating the data, run

In [7]:
!python main.py

device_count() 1
get_device_name Tesla P100-PCIE-16GB
Train Epoch: 1	Loss: 1.989043
Train Epoch: 2	Loss: 1.236556
Train Epoch: 3	Loss: 1.109881
Train Epoch: 4	Loss: 1.063782
Train Epoch: 5	Loss: 1.038642
Train Epoch: 6	Loss: 1.017001
Train Epoch: 7	Loss: 1.005502
Train Epoch: 8	Loss: 0.990827
Train Epoch: 9	Loss: 0.981253
Train Epoch: 10	Loss: 0.976275
Train Epoch: 11	Loss: 0.972006
Train Epoch: 12	Loss: 0.969025
Train Epoch: 13	Loss: 0.966821
Train Epoch: 14	Loss: 0.962601
Train Epoch: 15	Loss: 0.959713
Train Epoch: 16	Loss: 0.954832
Train Epoch: 17	Loss: 0.953984
Train Epoch: 18	Loss: 0.955744
Train Epoch: 19	Loss: 0.950577
Train Epoch: 20	Loss: 0.950971
Train Epoch: 21	Loss: 0.953846
Train Epoch: 22	Loss: 0.959475
Train Epoch: 23	Loss: 0.962950
Train Epoch: 24	Loss: 0.960788
Train Epoch: 25	Loss: 0.956558
Train Epoch: 26	Loss: 0.953923
Train Epoch: 27	Loss: 0.950419
Train Epoch: 28	Loss: 0.947926
Train Epoch: 29	Loss: 0.952807
Train Epoch: 30	Loss: 0.947883
Train Epoch: 31	Loss: 0.9

The command above trains the base model. To run the constrained model, set the regularization parameters `mu_reg_1`, `mu_reg_2`, `mu_reg_3`, and `mu_reg_4` to positive values and tune them based on the output statistics.
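
To give a feel for how the four regularization parameters act, here is a minimal sketch that assumes each constraint contributes an additive penalty weighted by its `mu_reg_*` value. The additive form, the function `constrained_loss`, and the penalty numbers are assumptions made for illustration; `main.py` may combine the terms differently.

```python
def constrained_loss(base_loss, penalties, mu_reg):
    """Add weighted constraint penalties to the base loss.

    Assumed form: base_loss + sum_i mu_reg[i] * penalties[i].
    Hypothetical helper; not taken from the GSVAE code.
    """
    if len(penalties) != len(mu_reg):
        raise ValueError("one regularization weight is needed per penalty")
    return base_loss + sum(mu * p for mu, p in zip(mu_reg, penalties))


# Made-up penalty values for the valence, connectivity, 3-member cycle,
# and triple-bond cycle constraints
penalties = [0.2, 0.1, 0.05, 0.0]

base_only = constrained_loss(0.95, penalties, mu_reg=[0, 0, 0, 0])
constrained = constrained_loss(0.95, penalties, mu_reg=[1.0, 0.5, 0.5, 0.5])
print(base_only, constrained)
```

With all weights at their default of 0 the penalties have no effect, which corresponds to the base model.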

### Conditional design

The code performs conditional design by setting a target property value for the sampled molecules. Set the property ID with the `--y_id` argument (0: PSA, 1: MolWt, 2: LogP) and the target value with `--y_target`.
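
The property-ID mapping above can be wrapped in a small helper that assembles the command line. This is only a sketch: `PROPERTY_IDS` and `conditional_design_command` are hypothetical names, and the target value of 1.5 is an arbitrary illustration rather than a recommended setting.

```python
import shlex

# Property IDs as listed for --y_id
PROPERTY_IDS = {"PSA": 0, "MolWt": 1, "LogP": 2}


def conditional_design_command(property_name, target_value):
    """Build the argument list for a conditional design run of main.py."""
    y_id = PROPERTY_IDS[property_name]
    return [
        "python", "main.py",
        "--y_id", str(y_id),
        "--y_target", str(target_value),
    ]


cmd = conditional_design_command("LogP", 1.5)
print(shlex.join(cmd))  # python main.py --y_id 2 --y_target 1.5
```

The resulting list can be passed to `subprocess.run`, or the printed string can be pasted into a notebook cell after `!`.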

### Quantifying uncertainties

To perform UQ analysis, use `utils.py`. The `utils.py` script accepts the following arguments:

```bash
optional arguments:
  --BB_samples          number of samples for uncertainty quantification (default: 0)
  --N                   number of training data (default: 600)
  --database            name of the training database (default: 'QM9')
  --sample_file         predictive samples directory (default: 'BB_600')
  --gpu_mode            accelerate the script using GPU (default: 0)
```
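
For context, a Bayesian bootstrap replaces resampling with random continuous weights on the training points, drawn from a flat Dirichlet distribution. The sketch below generates such weights by normalizing exponential draws. How `main.py` uses the `--BB_samples` index internally is not shown here, so treat `bayesian_bootstrap_weights` and `weighted_mean` as hypothetical helpers.

```python
import random


def bayesian_bootstrap_weights(n, seed=0):
    """Draw one Dirichlet(1, ..., 1) weight vector of length n.

    Normalized Exp(1) draws follow a flat Dirichlet distribution.
    """
    rng = random.Random(seed)
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def weighted_mean(values, weights):
    """Mean of values under one Bayesian bootstrap weight vector."""
    return sum(v * w for v, w in zip(values, weights))


values = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up per-sample quantities
weights = bayesian_bootstrap_weights(len(values), seed=1)
print(weighted_mean(values, weights))
```

Repeating this with different seeds gives one weighted estimate per bootstrap sample, and the spread of those estimates reflects the uncertainty.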

To compute the confidence interval, use the following example script:

In [None]:
%%bash

ITR=25
DIR=B_200
N=200

# Train one model per Bayesian bootstrap sample
for i in $(seq 1 "${ITR}"); do
    python main.py --N "${N}" --BB_samples "${i}" --res "results/${DIR}"
done

# Collect the predictive samples under data/samples/
mkdir -p "data/samples/${DIR}"
mv "results/${DIR}"/*/samples_*.data "data/samples/${DIR}"

In [None]:
# Variables set in the %%bash cell do not persist across cells, so the values are repeated here
!python utils.py --BB_samples 25 --N 200 --sample_file B_200
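
One common way to turn a set of bootstrap estimates into a confidence interval is the percentile method, sketched below. It is an assumption that this matches what `utils.py` computes; the function `percentile_interval` and the input numbers are illustrative only.

```python
def percentile_interval(estimates, level=0.95):
    """Percentile confidence interval with linear interpolation between order statistics."""
    ordered = sorted(estimates)
    alpha = (1.0 - level) / 2.0

    def quantile(q):
        pos = q * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        frac = pos - lower
        return ordered[lower] * (1.0 - frac) + ordered[upper] * frac

    return quantile(alpha), quantile(1.0 - alpha)


# Made-up estimates, one per bootstrap sample
estimates = [float(i) for i in range(101)]
low, high = percentile_interval(estimates, level=0.95)
print(low, high)
```

With only 25 bootstrap samples, as in the script above, the interval endpoints are determined by very few order statistics, so they should be read as rough estimates.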

### Filters

You can run `filter.py` independently to perform the scattering transform and visualize the graph filters. The `filter.py` script accepts the following arguments:

```bash
optional arguments:
  --gpu_mode            accelerate the script using GPU (default: 0)
  --wlt_scales          number of wavelet scales (default: 12)
  --scat_layers         number of scattering layers (default: 4)
  --N                   number of training data (default: 600)
  --database            name of the training database (default: 'QM9')
```
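
To illustrate what wavelet scales mean on a graph, the sketch below builds fixed diffusion wavelets from a lazy random walk and computes first-order scattering coefficients for a signal on a 4-node path graph. This is a generic variant chosen for brevity: it does not reproduce the adaptive spectral filters used by the model, and none of the function names come from `filter.py`.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def lazy_walk(adj):
    """Lazy random walk matrix P = 0.5 * (I + A D^-1); assumes no isolated nodes."""
    n = len(adj)
    deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[0.5 * (float(i == j) + adj[i][j] / deg[j]) for j in range(n)]
            for i in range(n)]


def diffusion_wavelets(adj, n_scales):
    """Return band-pass filters [I - P, P - P^2, P^2 - P^4, ...] and the low-pass remainder."""
    n = len(adj)
    prev = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    power = lazy_walk(adj)
    filters = []
    for _ in range(n_scales):
        filters.append([[prev[i][j] - power[i][j] for j in range(n)]
                        for i in range(n)])
        prev, power = power, matmul(power, power)
    return filters, prev


def apply(mat, x):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in mat]


def first_order_scattering(adj, x, n_scales):
    """Sum over nodes of |wavelet response|, one coefficient per scale."""
    filters, _ = diffusion_wavelets(adj, n_scales)
    return [sum(abs(v) for v in apply(f, x)) for f in filters]


# Path graph on 4 nodes with a signal concentrated on the first node
path_graph = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
signal = [1.0, 0.0, 0.0, 0.0]
coeffs = first_order_scattering(path_graph, signal, n_scales=3)
print(coeffs)
```

Deeper scattering layers repeat the same step, applying the wavelets again to the absolute values of the previous layer's responses, so the number of coefficients grows with both the number of scales and the number of layers.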