simplaj/PRIME

PRIME: A Pretrained Representation–Induced Model for 3D Molecules in De Novo Binder Design

🧬 Introduction

This is the official repository for our paper, PRIME: A Pretrained Representation–Induced Model for 3D Molecules in De Novo Binder Design.

🚀 Setup

Environment

We provide conda environment configurations for CUDA 11.7 + PyTorch 1.13.1 (env_cuda117.yaml) and CUDA 12.1 + PyTorch 2.1.2 (env_cuda121.yaml). For example, to create the CUDA 11.7 environment:

conda env create -f env_cuda117.yaml

Remember to activate the environment before running the code:

conda activate PRIME
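
After activating the environment, it can help to confirm that the installed PyTorch build matches one of the two configurations above. The following is a minimal sketch, not part of the repository: `describe_torch` is a hypothetical helper name, and it simply reports what the active interpreter sees.

```python
import importlib.util
import sys


def describe_torch():
    """Return PyTorch/CUDA details, or None if torch is not importable.

    Hypothetical helper for illustration; it is not shipped with PRIME.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "python": sys.version.split()[0],
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    info = describe_torch()
    print(info if info else "torch is not installed in this environment")
```

With env_cuda117.yaml one would expect the reported versions to start with 1.13.1 and 11.7; with env_cuda121.yaml, 2.1.2 and 12.1.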

Trained Weights

The pretrained EPT checkpoint is available at: Google Drive Link. Please download and place it at ckpts/ept.ckpt.

📄 Reproduction of Paper Experiments

Additional Dependencies

PyRosetta

PyRosetta is used to calculate the interface energy of generated peptides and antibody CDRs. Please follow the official instructions here to install it.

Datasets

Throughout these instructions, we assume that all datasets are downloaded under ./datasets.

1. Peptide

Assume all data are saved under ./datasets/peptide, and set the environment variable with export PREFIX=./datasets/peptide. The peptide data consists of the following datasets:

  • LNR: The test set of 93 complexes.
  • PepBench: The training/validation dataset of about 6K complexes, with peptide lengths between 4 and 25.
  • ProtFrag: Augmented dataset with about 70K pseudo pocket-peptide complexes from local contexts of protein monomers.

Download:

# create the folder
mkdir -p ${PREFIX}
# URLs are quoted so that the shell does not treat "?" as a glob character
# LNR
wget "https://zenodo.org/records/13373108/files/LNR.tar.gz?download=1" -O ${PREFIX}/LNR.tar.gz
tar zxvf ${PREFIX}/LNR.tar.gz -C ${PREFIX}
# PepBench
wget "https://zenodo.org/records/13373108/files/train_valid.tar.gz?download=1" -O ${PREFIX}/pepbench.tar.gz
tar zxvf ${PREFIX}/pepbench.tar.gz -C ${PREFIX}
mv ${PREFIX}/train_valid ${PREFIX}/pepbench
# ProtFrag
wget "https://zenodo.org/records/13373108/files/ProtFrag.tar.gz?download=1" -O ${PREFIX}/ProtFrag.tar.gz
tar zxvf ${PREFIX}/ProtFrag.tar.gz -C ${PREFIX}

Processing:

python -m scripts.data_process.peptide.pepbench --index ${PREFIX}/LNR/test.txt --out_dir ${PREFIX}/LNR/processed --remove_het
python -m scripts.data_process.peptide.pepbench --index ${PREFIX}/pepbench/all.txt --out_dir ${PREFIX}/pepbench/processed
python -m scripts.data_process.peptide.transform_index --train_index ${PREFIX}/pepbench/train.txt --valid_index ${PREFIX}/pepbench/valid.txt --all_index_for_non_standard ${PREFIX}/pepbench/all.txt --processed_dir ${PREFIX}/pepbench/processed/
python -m scripts.data_process.peptide.pepbench --index ${PREFIX}/ProtFrag/all.txt --out_dir ${PREFIX}/ProtFrag/processed
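
Before training, a quick check that the three processing commands above produced their output directories can save a failed run. The sketch below is an illustration only: `missing_peptide_dirs` is a hypothetical helper, and it assumes that each `--out_dir` ends up as a directory on disk. It does not inspect the contents.

```python
from pathlib import Path

# Output directories named by the --out_dir arguments of the processing commands.
EXPECTED_DIRS = ("LNR/processed", "pepbench/processed", "ProtFrag/processed")


def missing_peptide_dirs(prefix):
    """Return the expected processed directories that do not exist under prefix."""
    root = Path(prefix)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_peptide_dirs("./datasets/peptide")
    print("all processed directories found" if not missing else f"missing: {missing}")
```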

2. Antibody

Assume all data are saved under ./datasets/antibody, and set the environment variable with export PREFIX=./datasets/antibody. We use the SAbDab snapshot downloaded on September 24, 2024 for training, validation, and testing on antibody CDR design, with the test complexes coming from RAbD. Because the database is updated weekly, we have also uploaded the processed binary files and index files to Google Drive for reproduction and benchmarking. We also provide the IDs of the complexes used in our paper under ./datasets/antibody (train/valid/test_id.txt), so that users can filter the downloaded SAbDab database by these IDs to reconstruct the splits used in our paper.

Download with the newest updates:

mkdir -p ${PREFIX}/SAbDab
# download the summary file
wget https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/all/ -O ${PREFIX}/SAbDab/summary.csv
# download the structure data
wget https://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/archive/all/ -O ${PREFIX}/SAbDab/all_structures.zip
# decompress the zip file
unzip $PREFIX/SAbDab/all_structures.zip -d $PREFIX/SAbDab/

Processing:

# process
python -m scripts.data_process.antibody.sabdab --index ${PREFIX}/SAbDab/summary.csv --out_dir ${PREFIX}/SAbDab/processed
# split by target protein sequence identity (40%)
# if you want to reconstruct the benchmark used in our paper, please manually generate the index files with the splits provided under ./datasets/antibody 
python -m scripts.data_process.antibody.split --index ${PREFIX}/SAbDab/processed/index.txt --rabd_summary ${PREFIX}/RAbD/rabd_summary.jsonl
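
Reconstructing the paper's splits by hand amounts to keeping only the index entries whose IDs appear in the provided ID lists. The sketch below shows that filtering step under explicit assumptions, since the file formats are not described here: it assumes each ID file holds one ID per line and each index line is tab-separated with the ID in its first column. `load_ids` and `filter_index` are hypothetical helpers, not part of the repository.

```python
def load_ids(lines):
    """Collect non-empty, stripped IDs (assumed: one ID per line)."""
    return {ln.strip() for ln in lines if ln.strip()}


def filter_index(index_lines, keep_ids):
    """Keep index lines whose first tab-separated field is in keep_ids.

    Assumption: the ID is the first column of each index line.
    """
    kept = []
    for ln in index_lines:
        if not ln.strip():
            continue
        entry_id = ln.split("\t", 1)[0].strip()
        if entry_id in keep_ids:
            kept.append(ln.rstrip("\n"))
    return kept


if __name__ == "__main__":
    ids = load_ids(["1abc\n", "\n", "2xyz\n"])
    index = ["1abc\tpath/a\n", "9zzz\tpath/b\n", "2xyz\tpath/c\n"]
    print(filter_index(index, ids))
```

In practice the two inputs would come from reading an ID file and `${PREFIX}/SAbDab/processed/index.txt` with `open(...)`. Check the actual column layout before relying on this.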

Training

PRIME is trained in two stages:

  1. Pretrained Representation Learning: Training the Decoder using the EPT encoder.
  2. Latent Diffusion Model (LDM): Training the diffusion model on the frozen latent representations.

We trained on four 4090 GPUs; you can change the number of GPUs by modifying the GPU variable.

# Antibody
GPU=1,2,3,4 bash scripts/train.sh ./configs/IterAE/train_ab.yaml
GPU=1,2,3,4 bash scripts/train.sh ./configs/LDM/train_ab.yaml  

# Peptide
GPU=1,2,3,4 bash scripts/train.sh ./configs/IterAE/train_pep.yaml
GPU=1,2,3,4 bash scripts/train.sh ./configs/LDM/train_pep.yaml  
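
Because the second stage depends on the first, the two commands for a task must run in order. A small driver can enforce that and stop if stage 1 fails. This is a minimal sketch built only from the commands above; `stage_commands` and `run_stages` are hypothetical names, and `dry_run=True` prints the commands instead of launching them.

```python
import os
import subprocess

# Stage 1 trains the autoencoder (IterAE); stage 2 trains the latent diffusion model.
STAGES = ("IterAE", "LDM")


def stage_commands(task):
    """Build the two training commands for task 'ab' (antibody) or 'pep' (peptide)."""
    if task not in ("ab", "pep"):
        raise ValueError(f"unknown task: {task!r}")
    return [
        ["bash", "scripts/train.sh", f"./configs/{stage}/train_{task}.yaml"]
        for stage in STAGES
    ]


def run_stages(task, gpus="1,2,3,4", dry_run=True):
    """Run (or just print) the stages in order, passing GPU through the environment."""
    env = dict(os.environ, GPU=gpus)
    cmds = stage_commands(task)
    for cmd in cmds:
        if dry_run:
            print(f"GPU={gpus}", " ".join(cmd))
        else:
            subprocess.run(cmd, env=env, check=True)  # raises if a stage fails
    return cmds


if __name__ == "__main__":
    run_stages("pep")  # dry run: prints the two commands
```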

Training Configurations:

  • AutoEncoder (./configs/IterAE/train.yaml):
    • training_mode: set to pretrain
  • LDM (./configs/LDM/train.yaml):

Inference & Configurations

The following commands generate 100 candidates for each target in the test sets.

# peptide
python generate.py --config configs/test/test_pep.yaml --ckpt /path/to/checkpoint.ckpt --gpu 0 --save_dir ./results/pep
# antibody
python generate.py --config configs/test/test_ab.yaml --ckpt /path/to/checkpoint.ckpt --gpu 0 --save_dir ./results/ab
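
After generation, one may want to confirm that every target actually received 100 candidates before starting the long evaluation. The sketch below counts records per target in a JSON Lines file. The schema of results.jsonl is not documented here, so the field name `id` is a placeholder assumption: pass the real key once you have inspected a record. Both function names are hypothetical.

```python
import json
from collections import Counter


def candidates_per_target(lines, key="id"):
    """Count JSONL records per target; `key` is an assumed field name."""
    counts = Counter()
    for ln in lines:
        ln = ln.strip()
        if ln:
            counts[json.loads(ln)[key]] += 1
    return counts


def targets_off_count(counts, expected=100):
    """Return targets whose candidate count differs from `expected`."""
    return {target: n for target, n in counts.items() if n != expected}


if __name__ == "__main__":
    demo = ['{"id": "t1"}', '{"id": "t1"}', '{"id": "t2"}']
    print(targets_off_count(candidates_per_target(demo), expected=2))
```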

Sampling Configuration (SPES):

To enable Semantics-Preserving Exploratory Sampling (SPES), use the following sample_opt settings in your inference config (e.g., configs/test/test_pep.yaml):

sample_opt:
  noise_beta: 1
  use_graph_laplacian: true  # Use Graph Laplacian 
  use_mc_cads: true
  mc_cads_tau_scaf: [0.3, 0.6]   # Generation area
  mc_cads_tau_back: [0.4, 0.5]   # Condition area
  mc_cads_noise_scale: 0.1
  mc_cads_rescale: true          # Rescale corrupted condition to original mean/std
  mc_cads_mixing_factor: 0.5
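
The option names above resemble the hyperparameters of condition-annealed diffusion sampling (CADS): an annealing window (tau), a noise scale, a rescaling switch, and a mixing factor. The sketch below implements that published CADS corruption step in plain Python to show what such options typically control. It is an assumption that PRIME's `mc_cads_*` options behave this way. The sketch uses a single tau pair, whereas the config has separate pairs for the generation and condition areas, and it is not taken from the repository.

```python
import math
import random
import statistics


def cads_gamma(t, tau1, tau2):
    """Piecewise-linear annealing: 1 for t <= tau1, 0 for t >= tau2."""
    if t <= tau1:
        return 1.0
    if t >= tau2:
        return 0.0
    return (tau2 - t) / (tau2 - tau1)


def corrupt_condition(y, t, tau=(0.4, 0.5), noise_scale=0.1,
                      rescale=True, mixing_factor=0.5, rng=None):
    """Noise a condition vector y at diffusion time t (t=1 is pure noise)."""
    if rng is None:
        rng = random.Random(0)
    gamma = cads_gamma(t, *tau)
    noisy = [
        math.sqrt(gamma) * v + noise_scale * math.sqrt(1.0 - gamma) * rng.gauss(0.0, 1.0)
        for v in y
    ]
    if not rescale:
        return noisy
    mu_in, sd_in = statistics.fmean(y), statistics.pstdev(y)
    mu, sd = statistics.fmean(noisy), statistics.pstdev(noisy)
    if sd == 0.0:
        return noisy
    # Restore the original mean/std, then blend with the un-rescaled version.
    rescaled = [(v - mu) / sd * sd_in + mu_in for v in noisy]
    return [mixing_factor * r + (1.0 - mixing_factor) * v
            for r, v in zip(rescaled, noisy)]


if __name__ == "__main__":
    print(corrupt_condition([0.0, 1.0, 2.0, 3.0], t=0.45))
```

Under this formulation the condition is left intact late in sampling (small t) and replaced by scaled noise early on (large t), which is what encourages more diverse samples.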

Evaluation

❓ Due to the non-deterministic behavior of torch.scatter, the reproduced results might not be exactly the same as those reported in the paper, but should be very close to them.

The evaluation scripts are as follows. Note that evaluation is CPU-intensive: in our experiments, each of them took 3-4 hours on 32 CPU cores.

# peptide
python -m scripts.metrics.peptide_o_ray --results ./results/pep/results.jsonl --num_workers 96 --calc_dg
# antibody
python -m scripts.metrics.peptide_o_ray --results ./results/ab/results.jsonl --antibody --log_suffix HCDR3 --num_workers 64 --calc_dg

💡 Contact

Thank you for your interest in our work!

Please feel free to ask any questions about the algorithms or the code, or to report problems encountered when running it, so that we can make it clearer and better. You can either create an issue in the GitHub repo or contact us here.

🤝 Acknowledgements

Our repository is adapted from UniMoMo. We thank the authors for their open-source contribution.
