This repository contains the implementation of the Property-Prioritized Generative Variational Auto-Encoder (PPGVAE). Model-based optimization (MBO) with PPGVAE robustly finds improved samples regardless of 1) the imbalance between low- and high-fitness training samples, and 2) the extent of their separation in the design space. MBO with our PPGVAE can be used for both discrete and continuous design spaces. Details of our comprehensive benchmark are covered below.
### Cloning the Repository

```bash
git clone --recursive https://github.com/sabagh1994/PGVAE.git
cd PGVAE
```
### Download the Datasets and Configs

To download the datasets used to create the oracles and to generate train sets at varying separation and imbalance ratios, run

```bash
./datasets/download.sh
```

The downloaded files will be `./datasets/aav.csv`, `./datasets/GB1.txt`, `./datasets/PhoQ.txt`, and `./datasets/pinn_poisson.npz`. The AAV dataset was retrieved from https://benchmark.protein.properties/landscapes. The PINN dataset was originally generated in the experiments of https://arxiv.org/abs/2305.17387; however, the `.npz` format can only be accessed from here. In addition to the protein and PINN datasets, the train set for the toy MNIST example, `toy_mnist.npz`, will be downloaded and moved to the `sample_trainset` directory. This is to avoid downloading MNIST and creating the train set from it. Train sets for the other datasets will be generated later; read "Generating the Train Sets and Oracles" for further details.

To download the config files, run

```bash
./configs/download.sh
```

The configs will be stored in the `configs` directory. There will be a sample config for each benchmark task. You can modify a config file to include more methods and train sets; read "Running Configuration" for more details. A quick sanity check of the downloads is sketched below.
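The following is a minimal sanity check, assuming the file layout described above; it only verifies that the expected downloads exist.

```python
from pathlib import Path

# Minimal sketch: verify that the expected files are in place after
# running the two download scripts above.
expected = [
    "datasets/aav.csv", "datasets/GB1.txt", "datasets/PhoQ.txt",
    "datasets/pinn_poisson.npz", "sample_trainset/toy_mnist.npz",
]
for path in expected:
    print(path, "OK" if Path(path).exists() else "MISSING")
```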
### Make a Virtual Environment

Before running MBO with PPGVAE (or other methods), make sure that all the required packages are installed. To create a virtual environment with all the required packages installed:

1. Install Python version 3.9 or higher. We used Python 3.9.
2. Run `make venv`. This step creates a folder named `./venv` which contains all the required packages.
3. Run `source venv/bin/activate` to activate the venv.
### Generating the Train Sets and Oracles

For each benchmark task, train sets and oracles should be generated before running MBO. Note that this step requires the datasets included in the `datasets` folder. Navigate to the `notebooks` folder and run the Jupyter notebook `ds_generator.ipynb`. This will create:

- Train sets for the semi-synthetic GB1 and PhoQ, AAV, PINN, and GMM benchmark tasks, at varying imbalance ratios and separation levels. Train sets will be stored in a separate folder for each benchmark task in the `sample_trainset` directory.
- Oracles used for the protein benchmark tasks. Oracles will be stored in the `oracles` directory, including `oracles/protein_aav`, `oracles/protein_gb_synth`, and `oracles/protein_phoq_synth`.

Note 1: GMM and PINN won't have any oracles stored. The GMM oracle can be constructed from its parameter specification, which is stored within each train set instance, e.g., `sample_trainset/gmm/ds0.npz`. The PINN oracle is generated from `datasets/pinn_poisson.npz` when its instance is created in `scripts/run_mbo.py`.

Note 2: For the semi-synthetic GB1 and PhoQ datasets, train sets and oracles are generated with an appended length of three, corresponding to the lowest separation. For higher separation, set the variable `ext_len` (default 3) in `notebooks/ds_generator.ipynb` to higher integer values. A sketch for checking the generated layout follows below.
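The following is a minimal sketch for listing the generated train-set instances, assuming the task-per-subfolder layout described above.

```python
from pathlib import Path

# Minimal sketch: count the generated train-set files per benchmark task.
for task_dir in sorted(Path("sample_trainset").iterdir()):
    if task_dir.is_dir():
        n_files = len(list(task_dir.glob("*.npz")))
        print(f"{task_dir.name}: {n_files} train-set file(s)")
```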
### Running MBO

To perform MBO, one config file is needed. An example config file is included in `configs/run_config.json`; read "Running Configuration" for a description of each field. To run MBO with the example config file, execute

```bash
python scripts/run_mbo.py --run_config configs/run_config.json &> log_mbo
```

This runs 10 MBO steps using PPGVAE on the example GMM train set located at `sample_trainset/ds0.npz`. The results will be stored at `results/ds0/*.pt`. Read "Train Set and Output Format" for the contents of the train set (`*.npz`) and output (`*.pt`) files for each benchmark task. A sketch for sweeping several configs is given below.
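If you need to run several configurations back to back, a minimal sweep sketch follows; any config name other than `run_config.json` is a hypothetical placeholder.

```python
import subprocess
from pathlib import Path

# Minimal sketch: run MBO for several config files in sequence,
# capturing each run's log separately.
configs = ["configs/run_config.json"]  # extend with your own config files
for cfg in configs:
    log_path = "log_mbo_" + Path(cfg).stem
    with open(log_path, "w") as log:
        subprocess.run(
            ["python", "scripts/run_mbo.py", "--run_config", cfg],
            stdout=log, stderr=subprocess.STDOUT, check=True,
        )
```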
### Summarizing the Results

After running MBO for each benchmark task, there will be multiple `*.pt` files in the `results` directory. To compute various statistics for the samples generated by MBO, e.g., the maximum property relative to the train/initial set, run the summary notebooks in the `notebooks` directory; e.g., `summary_gmm.ipynb` summarizes the results for the GMM benchmark. In each summary notebook, the following three dataframes are constructed and saved in the `summary` directory:

1. `df_stats`: each row represents a unique configuration of (imbalance ratio, separation level, seed number, method name, MBO step). Various statistics are computed for each unique configuration.
2. `df_bs`: contains the statistics computed in (1) as well as their 95% bootstrap confidence intervals for each unique configuration of (imbalance ratio, separation level, method name, MBO step). Note that "seed number" is not in the configuration. This dataframe was used to study the impact of varying imbalance for a given separation level in each benchmark task.
3. `df_bsg`: contains the statistics computed in (1) as well as their 95% bootstrap confidence intervals for each unique configuration of (separation level, method name, MBO step). Note that neither "seed number" nor "imbalance ratio" is in the configuration. This dataframe was used to generate the plots showing the impact of separation level aggregated over all imbalance ratios. A toy bootstrap sketch follows this list.
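For intuition, here is a toy sketch of a 95% bootstrap confidence interval in the spirit of the intervals stored in `df_bs` and `df_bsg`; it is not the notebooks' exact code.

```python
import numpy as np

# Toy sketch: percentile bootstrap CI for the mean of a statistic.
rng = np.random.default_rng(0)
stat = rng.normal(loc=1.0, size=50)   # stand-in for a statistic across runs
boot_means = [
    rng.choice(stat, size=stat.size, replace=True).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={stat.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```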
### Toy MNIST Example

Run `notebooks/mnist_demo.ipynb` to train PPGVAE and visualize its latent space on the toy MNIST example. The config for this notebook is located at `configs/mnist_config.json`.
### Example

An example configuration file, `configs/run_config.json`, is:

```json
{
  "description": "sample config file to run MBO with ppgvae or other methods",
  "ds_rootdir": "sample_trainset",
  "ds_names": ["ds0.npz"],
  "method_names": ["pgvae"],
  "weighted_opt_firststeps": [false],
  "n_samples_gens": [100],
  "savedir": "results",
  "vae_type": "mlp",
  "n_seeds": 10,
  "mbo_steps": 10
}
```
### Description of the Arguments

- `"description"`: notes about the configuration file, or whatever notes you want to keep for the configuration you are using.
- `"ds_rootdir"`: the root directory containing the train sets.
- `"ds_names"`: a list containing the file names of the train sets. A single train set `ds_name` is read from `ds_rootdir/ds_name`.
- `"method_names"`: a list containing the names of the methods, e.g., `["pgvae", "rwr", "cem-pi", "dbas", "cbas"]`.
- `"weighted_opt_firststeps"`: if false, the first MBO step uses uniform nonzero weights in weighted optimization, as done in the CbAS paper. If true, weighted optimization with non-uniform weights is performed in the first step as well. Both CbAS and PPGVAE run with `false`. Leave this as `[false]` for simplicity.
- `"n_samples_gens"`: a list containing the integer number of samples generated per MBO step.
- `"savedir"`: the directory where the MBO results are stored.
- `"vae_type"`: a string specifying the type of VAE. This should be set to `"mlp"` for all experiments in the paper.
- `"n_seeds"`: the number of models to be trained in parallel, each leading to a different chain of samples generated by MBO. See "Stacked Model and Data Training" for more details.
- `"mbo_steps"`: the number of MBO steps performed.

A sketch for loading the config programmatically follows.
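The following is a minimal sketch for loading and inspecting the example config; the printed fields are the ones described above.

```python
import json

# Minimal sketch: load the example run configuration and inspect it.
with open("configs/run_config.json") as f:
    cfg = json.load(f)

print(cfg["method_names"], cfg["ds_names"])
print(f'{cfg["n_seeds"]} seeds x {cfg["mbo_steps"]} MBO steps, '
      f'{cfg["n_samples_gens"][0]} samples per step')
```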
### Train Set Format

Train sets are in `*.npz` format. Each file consists of three fields: `x`, `y`, and `orc_spec`. Both `x` and `y` are numpy arrays containing the samples from the design space and their associated properties; `x` is an array of strings for the protein benchmarks. `orc_spec` is a dictionary containing the oracle specifications and the variables involved in train set generation. These depend on the benchmark task (gmm, pinn, aav, ...) as explained below; a minimal loading sketch follows this list.

- In GMM, `orc_spec` consists of:
  - `"mu_1st"`: mean of the first Gaussian mode (the less desired mode).
  - `"mu_2nd"`: mean of the second Gaussian mode (the more desired mode). Specifies the extent of separation, as `"mu_1st"` is set to zero.
  - `"ro"`: imbalance ratio between the less desired and more desired train samples.
  - `"data_type"`: type of the benchmark task, i.e., `"gmm"`, `"protein"`, or `"pinn"`. This affects the initialization of the `Dataset` object in `run_mbo.py`.
  - `"sigmas_gmm"`: numpy array containing the standard deviations of the two modes.
  - `"weights"`: peak height of each Gaussian mode.
  - `"N1"`: number of training samples taken from the less desired mode.
  - `"N"`: size of the train set.
- In AAV, `orc_spec` consists of:
  - `"mut_thr"`: an integer indicating the minimum number of mutated sites in less desired samples. Specifies the extent of separation.
  - `"ro"`: imbalance ratio.
  - `"data_type"`: type of the benchmark task, i.e., `"protein"`.
  - `"N"`: size of the train set.
  - `"orc_path"`: path to the oracle.
- The semi-synthetic GB1 and PhoQ tasks have the same `orc_spec` as AAV, with the exclusion of `mut_thr`.
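The following is a minimal loading sketch for a GMM train-set instance; the `.item()` call is an assumption about how the `orc_spec` dictionary round-trips through `np.savez`.

```python
import numpy as np

# Minimal sketch: inspect a GMM train-set instance.
# allow_pickle=True is needed to recover the stored orc_spec dictionary.
ds = np.load("sample_trainset/gmm/ds0.npz", allow_pickle=True)

x, y = ds["x"], ds["y"]            # design-space samples and their properties
orc_spec = ds["orc_spec"].item()   # 0-d object array -> plain dict

print(x.shape, y.shape)
print(orc_spec["mu_2nd"], orc_spec["ro"], orc_spec["N"])
```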
### Output Format

Outputs are in `*.pt` format. The output dictionary consists of the following fields:

- `x`: a `torch.tensor` with shape `(n_seeds, N_total, dim_rest)`. `dim_rest` varies depending on the type of dataset: for one-hot encoded protein sequences, `dim_rest = (sequence length, number of amino acids)`; for the GMM dataset, `dim_rest = 1`.
- `y`: a `torch.tensor` with shape `(n_seeds, N_total)` containing the property (fitness) values.
- `w_optm`: a `torch.tensor` with shape `(n_seeds, N_total)` containing the optimization weights. This is `None` for PPGVAE, as it does not perform weighted optimization.
- `step`: a `torch.tensor` with shape `(n_seeds, N_total)` containing the MBO step values.
- `orc_spec`: a dictionary containing the oracle specifications and variables used for train set generation, as explained in "Train Set Format".
- `method_name`: a string specifying the method used for MBO, e.g., `"pgvae"`, `"rwr"`, `"cem-pi"`.
- `n_samples_gen`: the integer number of samples generated per MBO step.
- `weighted_opt_firststep`: a bool determining whether non-uniform weighted optimization was used in the first MBO step. This was set to `false` in all experiments in the paper.
- `datadir`: the path to the train set.

Here, `n_seeds` is the number of models run in parallel to perform MBO with different seeds, and `N_total` is the total number of samples generated in MBO, i.e., the number of MBO steps times the number of samples generated per MBO step. A minimal loading sketch follows.
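The following is a minimal sketch for loading an output file and reporting each seed's best sample; the file name under `results/ds0/` is a hypothetical placeholder.

```python
import torch

# Minimal sketch: load an MBO output and find the best sample per seed.
out = torch.load("results/ds0/output.pt")

y = out["y"]                      # shape: (n_seeds, N_total)
best_y, best_idx = y.max(dim=1)   # best property value found by each seed
print(out["method_name"], best_y)
```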
### Stacked Model and Data Training

The entire training pipeline, including the models, the data, and the random number generators, is stacked along a first dimension across multiple independent runs. This allows our code to efficiently run many independent instances in parallel on a single GPU device. For further details, check the `BMLP` and `BatchRNG` classes in the `tch_utils.py` script. Note that the shape of most tensors in our implementation starts with an `(n_seeds, ...)` prefix, where `n_seeds` is the number of independent runs specified in the input config files. A toy illustration of this batching idea is given below.
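Here is a toy illustration of the stacking idea; it is not the actual `BMLP`/`BatchRNG` code, just a sketch of how one einsum can apply many seeds' parameters at once.

```python
import torch

# Toy sketch: one weight tensor holds the parameters of n_seeds independent
# linear maps, so a single einsum applies all of them in parallel.
n_seeds, batch, d_in, d_out = 10, 32, 8, 4

W = torch.randn(n_seeds, d_in, d_out)   # one weight matrix per seed
b = torch.randn(n_seeds, 1, d_out)      # one bias vector per seed
x = torch.randn(n_seeds, batch, d_in)   # inputs carry the (n_seeds, ...) prefix

# out[s] = x[s] @ W[s] + b[s] for every seed s, computed at once
out = torch.einsum("sbi,sio->sbo", x, W) + b
print(out.shape)  # torch.Size([10, 32, 4])
```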
This approach was first implemented in "On the Importance of Firth Bias Reduction in Few-Shot Classification" (https://github.com/ehsansaleh/firth_bias_reduction) and used in the following two studies as well:

- BEDwARS: a robust Bayesian approach to bulk gene expression deconvolution with noisy reference signatures (https://github.com/sabagh1994/BEDwARS)
- Learning from Integral Losses in Physics Informed Neural Networks (https://github.com/ehsansaleh/btspinn)

If you use our implementation in your work, please cite the Firth Bias Reduction paper (https://arxiv.org/abs/2110.02529) or this study.
### Paper and Citation

OpenReview link to the paper accepted at ICLR 2024.

The arXiv links to the paper:

- PDF: https://browse.arxiv.org/pdf/2305.13650v2.pdf
- Web page: https://arxiv.org/abs/2305.13650v2

Here is the BibTeX citation entry for our work:

```bibtex
@article{ghaffari2023,
  title={Robust Model-Based Optimization for Challenging Fitness Landscapes},
  author={Ghaffari, Saba and Saleh, Ehsan and Schwing, Alexander G and Wang, Yu-Xiong and Burke, Martin D and Sinha, Saurabh},
  journal={arXiv preprint arXiv:2305.13650v2},
  year={2023}
}
```