Syntax-Directed Variational Autoencoder for Structured Data (https://arxiv.org/abs/1802.08786)
Download the data, grammars, and pretrained models from the following Dropbox link:
https://www.dropbox.com/sh/621ufmvqgg5h2d8/AAARWPpuADNfPx8eu9E8y-rha?dl=0
Put everything under the 'dropbox' folder, or create a symbolic link with name 'dropbox':
ln -s /path/to/your/downloaded/files dropbox
The final folder structure should look like this:
sdvae (project root)
|__ README.md
|__ mol_vae
|__ prog_vae
|__ dropbox
|   |__ data
|   |__ results
|   |__ context_free_grammars
|......
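The layout above can be sanity-checked with a short shell snippet. This is only a sketch: the helper name missing_dropbox_dirs is made up for illustration, and it checks just the three sub-folders named in the tree (the tree is elided, so the list is not exhaustive).

```shell
#!/bin/sh
# List expected sub-folders that are missing under <root>/dropbox.
# The three names come from the folder tree in this README; the helper
# name itself is hypothetical and not part of the repository.
missing_dropbox_dirs() {
    root="${1:-.}"
    for sub in data results context_free_grammars; do
        # -d follows symbolic links, so a symlinked 'dropbox' also passes.
        [ -d "$root/dropbox/$sub" ] || echo "$sub"
    done
}

# Run the check against the current directory (assumed to be the project root).
missing=$(missing_dropbox_dirs .)
if [ -z "$missing" ]; then
    echo "dropbox layout looks complete"
else
    echo "missing under dropbox:" $missing
fi
```

Run it from the project root; an empty list of missing folders means the download or symbolic link is in place.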
The current code depends on PyTorch 0.3.1. Most of the Python dependencies can be installed with pip. However, the Bayesian optimization depends on a customized build of Theano; please follow the instructions in GrammarVAE (https://github.com/mkusner/grammarVAE).
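Because the code targets a specific PyTorch release, it may help to confirm which version your interpreter actually imports before going further. The sketch below is an assumption-laden convenience, not part of the repository: the function name report_torch_version and the PYTHON environment variable are made up for illustration.

```shell
#!/bin/sh
# Print the PyTorch version seen by the interpreter, or a note if the
# import fails. The README states that the code depends on PyTorch 0.3.1.
# PYTHON is a hypothetical override for choosing a specific interpreter.
report_torch_version() {
    "${PYTHON:-python}" -c 'import torch; print(torch.__version__)' 2>/dev/null \
        || echo "torch not importable"
}

report_torch_version
```

If the printed version differs from 0.3.1, the training and evaluation scripts may not run as described here.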
Below we use mol_vae to illustrate training and evaluation; prog_vae works similarly.
Before training or evaluation, the raw txt dataset needs to be preprocessed:
cd mol_vae/data_processing
./run_data.sh
./run_cfg_dump.sh
The two scripts above compile the txt data into a binary file and a CFG dump, respectively.
To train the model on a GPU, run the following commands. You may also want to modify the parameters in the training script.
cd mol_vae/pytorch_train
./run_train.sh
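Each group of commands in this README starts with a cd relative to the project root, so running several groups in one shell session requires returning to the root in between. One possible way to avoid that, sketched here with a hypothetical helper named run_in, is to run each script in a subshell:

```shell
#!/bin/sh
# Run a command inside a directory without changing the caller's directory.
# The parentheses start a subshell, so the 'cd' is undone when it exits.
# run_in is a hypothetical helper, not part of the repository.
run_in() {
    dir="$1"
    shift
    (cd "$dir" && "$@")
}

# Example usage from the project root (commented out so that pasting this
# snippet does not start a long job):
# run_in mol_vae/data_processing ./run_data.sh
# run_in mol_vae/data_processing ./run_cfg_dump.sh
# run_in mol_vae/pytorch_train   ./run_train.sh
```

The same pattern applies to the later steps in this README; only the directory and script names change.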
The pretrained models are available under the dropbox folder, in dropbox/results.
Before evaluation, we need to first dump the latent encodings of programs/molecules:
cd mol_vae/pytorch_eval
./run_feature_dump.sh
To test reconstruction or to sample from the prior, see the corresponding scripts in the same folder.
To optimize a molecule property, run the Bayesian optimization:
cd mol_vae/mol_optimization
./run_bo.sh
After that, use the script get_final_results.py to collect the results. We use the same evaluation protocol as in GrammarVAE (https://github.com/mkusner/grammarVAE).
The results reported in the paper can be found under dropbox/results/zinc/bo. If you use the same random seeds, then the exact same results should be expected.
To test the regression performance using the latent embeddings of molecules/programs:
cd mol_vae/sparse_gp_regression
./run_regression.sh
Again, the 10 runs with different random seeds are reported under dropbox/results/zinc/sgp.
To interpolate the latent space, do the following:
cd mol_vae/visualize
./run_2dvis.sh
You may want to tune the gap, number of grids, etc., to get reasonable visualization results.
@article{dai2018syntax,
title={Syntax-Directed Variational Autoencoder for Structured Data},
author={Dai, Hanjun and Tian, Yingtao and Dai, Bo and Skiena, Steven and Song, Le},
journal={arXiv preprint arXiv:1802.08786},
year={2018}
}