This is the official repository for our paper, PRIME: A Pretrained Representation–Induced Model for 3D Molecules in De Novo Binder Design.
We provide conda environment configurations for CUDA 11.7 + PyTorch 1.13.1 (env_cuda117.yaml) and CUDA 12.1 + PyTorch 2.1.2 (env_cuda121.yaml). For example, you can create the environment with:
conda env create -f env_cuda117.yaml
Remember to activate the environment before running the code:
conda activate PRIME
The pretrained EPT checkpoint is available at: Google Drive Link.
Please download it and place it at ckpts/ept.ckpt.
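Browser-less downloads from file-sharing services sometimes save an HTML page instead of the file itself. The following is a minimal sketch for checking that the checkpoint is in place before training. The helper name check_checkpoint and the HTML heuristic are illustrative assumptions and are not part of this repository:

```python
from pathlib import Path


def check_checkpoint(path="ckpts/ept.ckpt"):
    """Return (ok, message) describing the checkpoint file at `path`."""
    p = Path(path)
    if not p.is_file():
        return False, f"missing: {p}"
    with p.open("rb") as f:
        head = f.read(64).lstrip().lower()
    # Heuristic (assumption): a failed download is often an HTML page.
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return False, f"{p} looks like an HTML page, not a checkpoint"
    return True, f"found {p} ({p.stat().st_size} bytes)"
```

Calling check_checkpoint() from the repository root returns a flag and a short message that can be printed before launching training.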
PyRosetta is used to calculate the interface energy of generated peptides and antibody CDRs. Please follow the official PyRosetta installation instructions to install it.
Throughout the instructions, we assume all datasets are downloaded under ./datasets.
For peptides, suppose all data are saved under ./datasets/peptide, and set the environment variable with export PREFIX=./datasets/peptide. The peptide data include the following datasets:
- LNR: The test set of 93 complexes.
- PepBench: The training/validation dataset of about 6K complexes with peptide lengths between 4 and 25.
- ProtFrag: Augmented dataset with about 70K pseudo pocket-peptide complexes from local contexts of protein monomers.
Download:
# create the folder
mkdir -p $PREFIX
# LNR
wget "https://zenodo.org/records/13373108/files/LNR.tar.gz?download=1" -O ${PREFIX}/LNR.tar.gz
tar zxvf ${PREFIX}/LNR.tar.gz -C $PREFIX
# PepBench
wget "https://zenodo.org/records/13373108/files/train_valid.tar.gz?download=1" -O ${PREFIX}/pepbench.tar.gz
tar zxvf $PREFIX/pepbench.tar.gz -C $PREFIX
mv ${PREFIX}/train_valid ${PREFIX}/pepbench
# ProtFrag
wget "https://zenodo.org/records/13373108/files/ProtFrag.tar.gz?download=1" -O ${PREFIX}/ProtFrag.tar.gz
tar zxvf $PREFIX/ProtFrag.tar.gz -C $PREFIX
Processing:
python -m scripts.data_process.peptide.pepbench --index ${PREFIX}/LNR/test.txt --out_dir ${PREFIX}/LNR/processed --remove_het
python -m scripts.data_process.peptide.pepbench --index ${PREFIX}/pepbench/all.txt --out_dir ${PREFIX}/pepbench/processed
python -m scripts.data_process.peptide.transform_index --train_index ${PREFIX}/pepbench/train.txt --valid_index ${PREFIX}/pepbench/valid.txt --all_index_for_non_standard ${PREFIX}/pepbench/all.txt --processed_dir ${PREFIX}/pepbench/processed/
python -m scripts.data_process.peptide.pepbench --index ${PREFIX}/ProtFrag/all.txt --out_dir ${PREFIX}/ProtFrag/processed
For antibodies, suppose all data are saved under ./datasets/antibody, and set the environment variable with export PREFIX=./datasets/antibody. We use SAbDab, downloaded on Sep 24th, 2024, for training, validation, and testing on antibody CDR design, with the test complexes coming from RAbD. As the database is updated weekly, we have also uploaded the processed binary files and index files to Google Drive for reproduction and benchmarking purposes. We also provide the IDs of the complexes used in our paper under ./datasets/antibody (train/valid/test_id.txt), so that users can filter the downloaded SAbDab database with these IDs to reconstruct the splits used in our paper.
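Filtering a freshly downloaded SAbDab summary by the provided IDs could look like the sketch below. It rests on several assumptions that should be checked against the actual files: each ID file holds one PDB code per line, the summary is tab-separated with a pdb column, and matching is case-insensitive. The function names load_ids and filter_summary are hypothetical and not part of this repository:

```python
import csv


def load_ids(path):
    """Read one ID per line (assumed format), ignoring blank lines."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}


def filter_summary(summary_path, ids, id_column="pdb", delimiter="\t"):
    """Keep only the summary rows whose ID column is in `ids`.

    `id_column` and `delimiter` are assumptions about the summary layout.
    Returns (fieldnames, rows).
    """
    with open(summary_path, newline="") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        rows = [r for r in reader if r[id_column].strip().lower() in ids]
        return reader.fieldnames, rows


def write_summary(path, fieldnames, rows, delimiter="\t"):
    """Write the filtered rows back out in the same layout."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)
```

Running this once per split (train, valid, test) yields three filtered summaries. If the provided IDs carry chain suffixes or another layout, the matching key needs to be adapted accordingly.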
Download with the newest updates:
mkdir -p ${PREFIX}/SAbDab
# download the summary file
wget https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/all/ -O ${PREFIX}/SAbDab/summary.csv
# download the structure data
wget https://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/archive/all/ -O ${PREFIX}/SAbDab/all_structures.zip
# decompress the zip file
unzip $PREFIX/SAbDab/all_structures.zip -d $PREFIX/SAbDab/
Processing:
# process
python -m scripts.data_process.antibody.sabdab --index ${PREFIX}/SAbDab/summary.csv --out_dir ${PREFIX}/SAbDab/processed
# split by target protein sequence identity (40%)
# if you want to reconstruct the benchmark used in our paper, please manually generate the index files with the splits provided under ./datasets/antibody
python -m scripts.data_process.antibody.split --index ${PREFIX}/SAbDab/processed/index.txt --rabd_summary ${PREFIX}/RAbD/rabd_summary.jsonl
PRIME is trained in two stages:
- Pretrained Representation Learning: Training the Decoder using the EPT encoder.
- Latent Diffusion Model (LDM): Training the diffusion model on the frozen latent representations.
We trained on four 4090 GPUs; you can change the number of GPUs by modifying the GPU variable.
# Antibody
GPU=1,2,3,4 bash scripts/train.sh ./configs/IterAE/train_ab.yaml
GPU=1,2,3,4 bash scripts/train.sh ./configs/LDM/train_ab.yaml
# Peptide
GPU=1,2,3,4 bash scripts/train.sh ./configs/IterAE/train_pep.yaml
GPU=1,2,3,4 bash scripts/train.sh ./configs/LDM/train_pep.yaml
Training Configurations:
- AutoEncoder (./configs/IterAE/train.yaml): set training_mode to pretrain
- LDM (./configs/LDM/train.yaml):
The following commands generate 100 candidates for each target in the test sets.
# peptide
python generate.py --config configs/test/test_pep.yaml --ckpt /path/to/checkpoint.ckpt --gpu 0 --save_dir ./results/pep
# antibody
python generate.py --config configs/test/test_ab.yaml --ckpt /path/to/checkpoint.ckpt --gpu 0 --save_dir ./results/ab
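Before moving on to evaluation, it can be useful to confirm that every target received the expected 100 candidates. The sketch below assumes that results.jsonl in the save directory holds one JSON record per candidate and that each record carries a target identifier under the key id. Both the layout and the key name are assumptions to verify against the real output, and the function names are hypothetical:

```python
import json
from collections import Counter


def count_candidates(jsonl_path, target_key="id"):
    """Count records per target in a JSON Lines file.

    `target_key` is a hypothetical field name; adjust it to the real schema.
    """
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[record[target_key]] += 1
    return counts


def incomplete_targets(counts, expected=100):
    """Return the targets whose candidate count differs from `expected`."""
    return {target: n for target, n in counts.items() if n != expected}
```

For example, incomplete_targets(count_candidates("./results/pep/results.jsonl")) returns an empty dictionary when every target has exactly 100 records.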
Sampling Configuration (SPES):
To enable Semantics-Preserving Exploratory Sampling (SPES), use the following sample_opt settings in your inference config (e.g., configs/test/test_pep.yaml):
sample_opt:
  noise_beta: 1
  use_graph_laplacian: true # Use Graph Laplacian
  use_mc_cads: true
  mc_cads_tau_scaf: [0.3, 0.6] # Generation area
  mc_cads_tau_back: [0.4, 0.5] # Condition area
  mc_cads_noise_scale: 0.1
  mc_cads_rescale: true # Rescale corrupted condition to original mean/std
  mc_cads_mixing_factor: 0.5
❓ Due to the non-deterministic behavior of torch.scatter, the reproduced results might not be exactly the same as those reported in the paper, but should be very close to them.
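The mc_cads_* option names resemble the parameters of condition-annealed diffusion sampling (CADS): an annealing window, a noise scale, a rescaling switch, and a mixing factor. The sketch below illustrates that general published scheme in plain Python. It is an assumption that these options behave this way, and it is not PRIME's implementation; the repository code is authoritative. Here t runs from 1 (pure noise) down to 0, and the function names are hypothetical:

```python
import math
import random
import statistics


def annealing_gamma(t, tau1, tau2):
    """Piecewise-linear schedule: 1 for t <= tau1, 0 for t >= tau2."""
    if t <= tau1:
        return 1.0
    if t >= tau2:
        return 0.0
    return (tau2 - t) / (tau2 - tau1)


def corrupt_condition(y, t, tau=(0.3, 0.6), noise_scale=0.1,
                      rescale=True, mixing_factor=0.5, rng=None):
    """Corrupt a condition vector `y` (list of floats) at time `t`."""
    if rng is None:
        rng = random.Random(0)
    gamma = annealing_gamma(t, tau[0], tau[1])
    # Blend the clean condition with Gaussian noise according to gamma.
    noisy = [math.sqrt(gamma) * v
             + noise_scale * math.sqrt(1.0 - gamma) * rng.gauss(0.0, 1.0)
             for v in y]
    if not rescale:
        return noisy
    mu_in, sd_in = statistics.fmean(y), statistics.pstdev(y)
    mu, sd = statistics.fmean(noisy), statistics.pstdev(noisy)
    if sd == 0.0:
        return noisy  # nothing to standardize
    # Restore the original mean/std, then mix with the un-rescaled version.
    rescaled = [(v - mu) / sd * sd_in + mu_in for v in noisy]
    return [mixing_factor * r + (1.0 - mixing_factor) * v
            for r, v in zip(rescaled, noisy)]
```

Under this reading, a window such as [0.3, 0.6] leaves the condition untouched for t below 0.3, replaces it with scaled noise for t above 0.6, and interpolates in between, so conditioning is weakest early in sampling and fully restored near the end.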
The evaluation scripts are as follows. Note that evaluation is CPU-intensive: in our experiments, each run took 3-4 hours on 32 CPU cores.
# peptide
python -m scripts.metrics.peptide_o_ray --results ./results/pep/results.jsonl --num_workers 96 --calc_dg
# antibody
python -m scripts.metrics.peptide_o_ray --results ./results/ab/results.jsonl --antibody --log_suffix HCDR3 --num_workers 64 --calc_dg
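The commands above pass fixed values to --num_workers, which may exceed the cores available on a given machine. A small sketch for choosing a value from the local core count follows. The helper suggest_num_workers is hypothetical and not part of this repository, and whether oversubscribing workers helps or hurts depends on the evaluation code:

```python
import os


def suggest_num_workers(reserve=1, cap=None):
    """Suggest a worker count: available cores minus `reserve`, at least 1.

    `cap` optionally bounds the result from above.
    """
    cores = os.cpu_count() or 1  # cpu_count() may return None
    workers = max(1, cores - reserve)
    if cap is not None:
        workers = min(workers, cap)
    return workers
```

The returned value can then be substituted for the --num_workers argument, for example by printing it and pasting it into the command.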
Thank you for your interest in our work!
Please feel free to ask any questions about the algorithms or code, or to report problems encountered in running them, so that we can make the repository clearer and better. You can either create an issue in the GitHub repo or contact us directly.
Our repository is adapted from UniMoMo. We thank the authors for their open-source contribution.
