This work presents a two-stage molecular pretraining approach. The first stage performs multi-modal molecular representation learning using paired 3D molecular and 2D molecular features from the PCQM4Mv2 dataset. The second stage uses single-modality data (3D-only or 2D-only subsets from Uni-Mol data) and leverages the decoder learned in the first stage to complete missing modalities.
The complete pretraining pipeline is illustrated in the figure.
- Dependencies
- Dataset Download
- Data Preprocessing
- Molecular Pretraining
- Molecular Property Prediction
- Molecular Conformation Generation
- License
## Dependencies

Create a conda environment and install the required packages:

```bash
conda create -n flexmol python=3.10
conda activate flexmol
```

Install the following dependencies:
```bash
# Core dependencies
# Note: for V100 GPUs (compute capability 7.0), use PyTorch 2.2.x or earlier;
# newer GPUs (A100, H100, etc.) can use PyTorch 2.2.0 or later
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
pip install torch-geometric
pip install pytorch-lightning
pip install rdkit
pip install cython
pip install omegaconf
pip install ogb
pip install scikit-learn
pip install pandas
pip install peft
pip install lmdb
pip install tensorboard
```

## Dataset Download

Download the PCQM4Mv2 dataset from OGB:
```bash
wget http://ogb-data.stanford.edu/data/lsc/pcqm4m-v2-train.sdf.tar.gz

# Verify download integrity
md5sum pcqm4m-v2-train.sdf.tar.gz  # Expected: fd72bce606e7ddf36c2a832badeec6ab

# Extract the dataset
tar -xf pcqm4m-v2-train.sdf.tar.gz  # Extracts pcqm4m-v2-train.sdf
```

Download the complete Uni-Mol pretraining dataset:

```bash
wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/pretrain/ligands.tar.gz
tar -xf ligands.tar.gz
```

Download the Uni-Mol fine-tuning datasets for conformation generation and molecular property prediction:

```bash
wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/finetune/conformation_generation.tar.gz
tar -xf conformation_generation.tar.gz

wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/finetune/molecular_property_prediction.tar.gz
tar -xf molecular_property_prediction.tar.gz
```

## Data Preprocessing

Build the required Cython extensions for data processing:
```bash
# Build Cython extensions
python setup.py build_ext --inplace
python setup_cython.py build_ext --inplace
```

Update the dataset paths in conf/dataset_paths.yml according to your local setup:

```yaml
# Example configuration
dictionary_path: "/path/to/your/unimol_pretrain/dict.txt"
pcqm_data_path: "/path/to/your/pcqm_nf.lmdb"
unimol_2d_path: "/path/to/your/unimol_2d"
unimol_3d_path: "/path/to/your/unimol_3d_1m"
```

Then run the preprocessing scripts:

```bash
# Preprocess the PCQM4Mv2 dataset
python process_pcqm.py

# Preprocess the 2D-only/3D-only datasets
python process_only_modal.py --modes both
# or
python process_only_modal.py --modes only3d --scale 500k
```
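Before preprocessing, it can help to sanity-check that the paths in conf/dataset_paths.yml actually exist. A minimal sketch, assuming the flat `key: "value"` format shown above (a full YAML parser such as PyYAML would be more robust); the helper names are illustrative, not part of the codebase:

```python
# Sanity-check dataset paths before preprocessing (illustrative helper,
# not part of the repository). Assumes the flat key: "value" layout of
# conf/dataset_paths.yml shown above.
from pathlib import Path


def read_flat_yaml(text: str) -> dict:
    """Parse simple `key: "value"` lines, skipping comments and blanks."""
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        paths[key.strip()] = value.strip().strip('"')
    return paths


def missing_paths(config: dict) -> list:
    """Return the config keys whose paths do not exist on disk."""
    return [key for key, value in config.items() if not Path(value).exists()]
```

For example, `missing_paths(read_flat_yaml(Path("conf/dataset_paths.yml").read_text()))` returns the keys that still need fixing before process_pcqm.py can find the data.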
## Molecular Pretraining

The pretraining consists of two stages.

Stage 1 trains the model using both 2D and 3D molecular features:

```bash
./scripts/train_multi_gpu.sh stage1 4 0,1,2,3
```

Stage 2a continues pretraining with 2D molecular features only:

```bash
./scripts/train_multi_gpu.sh stage2a 1 0
```

Stage 2b continues pretraining with 3D molecular features only:

```bash
./scripts/train_multi_gpu.sh stage2b 1 0
```
The training script automatically configures the following:

- Strategy: `ddp_find_unused_parameters_true` for distributed training
- Precision: mixed precision (FP16) for GPU training, FP32 for CPU
- Device detection: GPUs are detected automatically
- Gradient clipping: L2-norm clipping with a maximum value of 1.0
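The gradient-clipping step can be stated concretely. A minimal sketch of L2-norm clipping with max norm 1.0, shown on plain Python lists for clarity (in PyTorch Lightning this corresponds to the Trainer's `gradient_clip_val` argument; the real pipeline operates on tensors):

```python
# L2-norm gradient clipping, sketched on plain lists (illustrative only).
import math


def clip_grad_l2(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # Already within the limit; leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

For example, `clip_grad_l2([3.0, 4.0])` rescales the norm-5 gradient down to norm 1, while smaller gradients pass through untouched.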
## Molecular Property Prediction

Fine-tune the pretrained model for downstream molecular property prediction tasks:

```bash
sbatch ./scripts/train_multi_gpu.sh finetune 1 0
```

Set `task_num` according to the fine-tuning dataset:
| Dataset | BBBP | BACE | ClinTox | Tox21 | ToxCast | SIDER | HIV | PCBA | MUV |
|---|---|---|---|---|---|---|---|---|---|
| task_num | 2 | 2 | 2 | 12 | 617 | 27 | 2 | 128 | 17 |
| Dataset | ESOL | FreeSolv | Lipo | QM7 | QM8 | QM9 |
|---|---|---|---|---|---|---|
| task_num | 1 | 1 | 1 | 1 | 12 | 3 |
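For convenience, the tables above can be expressed as a lookup (illustrative only; the run scripts themselves read `task_num` from their configs):

```python
# task_num per dataset, mirroring the tables above (illustrative helper).
TASK_NUM = {
    # Classification benchmarks
    "BBBP": 2, "BACE": 2, "ClinTox": 2, "Tox21": 12, "ToxCast": 617,
    "SIDER": 27, "HIV": 2, "PCBA": 128, "MUV": 17,
    # Regression benchmarks
    "ESOL": 1, "FreeSolv": 1, "Lipo": 1, "QM7": 1, "QM8": 12, "QM9": 3,
}


def task_num_for(dataset: str) -> int:
    """Look up task_num case-insensitively; raise KeyError if unknown."""
    for name, n in TASK_NUM.items():
        if name.lower() == dataset.lower():
            return n
    raise KeyError(f"unknown dataset: {dataset}")
```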
## Molecular Conformation Generation

Perform molecular conformation generation using the pretrained model.

First, generate initial conformations for inference:

```bash
mode="gen_data"
nthreads=20                                                        # Number of threads
reference_file="./conformation_generation/qm9/test_data_200.pkl"   # Reference file path
output_dir="./conformation_generation/qm9"                         # Output directory

python ./flexmol/utils/conf_gen_cal_metrics.py \
    --mode $mode \
    --nthreads $nthreads \
    --reference-file $reference_file \
    --output-dir $output_dir
```

Then fine-tune the pretrained model on the conformation generation task:
```bash
python flexmol/generation/pl_gen.py
```

Finally, evaluate the generated conformations:

```bash
mode="cal_metrics"
threshold=0.5                                        # 0.5 for QM9, 1.25 for Drugs
nthreads=20                                          # Number of threads
predict_file="/path/to/your/inference/results.pkl"   # Generated conformations file
reference_file="/path/to/your/reference/data.pkl"    # Reference conformations file

python flexmol/utils/conf_gen_cal_metrics.py \
    --mode $mode \
    --threshold $threshold \
    --nthreads $nthreads \
    --predict-file $predict_file \
    --reference-file $reference_file
```

Training behavior can be customized through the configuration files in the conf/ directory. Key parameters include:
- Batch size: adjust `batch_size` to fit your GPU memory
- Learning rate: modify the `learning_rate` and `lr_scheduler` settings
- Model architecture: configure `encoder_layers`, `embed_dim`, etc.
- Training epochs: set `max_epochs` for the training duration
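A hypothetical fragment combining these knobs might look like the following (the key names follow the parameters listed above; the values are placeholders, not recommended settings):

```yaml
# Hypothetical training config fragment -- key names mirror the
# parameters above and may differ from the actual conf/ files.
batch_size: 32
learning_rate: 2.0e-4
lr_scheduler: polynomial_decay
max_epochs: 60
encoder_layers: 12
embed_dim: 768
```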
- Memory requirements:
  - Single GPU: ~16 GB VRAM recommended
  - Multi-GPU: 8 GB+ VRAM per GPU
- Training time:
  - Stage 1: ~1 day on 4x V100
  - Stage 2: ~6-8 hours per modality on 1x V100
- Recommended setup:
  - Use mixed precision (FP16) to reduce memory usage
  - Enable gradient checkpointing for larger models
Common issues:

- CUDA out of memory: reduce the batch size or enable gradient checkpointing
- DDP issues: ensure all GPUs are visible and properly configured
- Data loading: verify the dataset paths and that preprocessing has completed

### Alternative Multi-GPU Launch Method

For more control over distributed training, you can use PyTorch's distributed launcher (note that torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun):

```bash
# Set environment variables
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_ADDR="localhost"
export MASTER_PORT="29500"

# Launch with torch.distributed
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=29500 \
    flexmol/run_pcqm/pl_train.py
```
### Troubleshooting Multi-GPU Issues

- NCCL backend issues: set the following environment variables:

  ```bash
  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0  # Replace with your network interface
  ```

- Port conflicts: change the master port:

  ```bash
  export MASTER_PORT="29501"  # Use a different port
  ```
## Citation

If you use this work, please cite:

```bibtex
@inproceedings{song2025flexmol,
  author    = {Song, Tengwei and Wu, Min and Fang, Yuan},
  title     = {Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration},
  year      = {2025},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  pages     = {2750--2760},
  numpages  = {11},
  series    = {CIKM '25}
}
```

## License

This project is built upon Uni-Mol and Transformer-M. Please refer to their respective licenses for usage terms.
We thank the authors of Uni-Mol and Transformer-M for their foundational work and open-source contributions to the molecular modeling community.