This work presents a two-stage molecular pretraining approach. The first stage performs multi-modal molecular representation learning using paired 3D molecular and 2D molecular features from the PCQM4Mv2 dataset. The second stage uses single-modality data (3D-only or 2D-only subsets from Uni-Mol data) and leverages the decoder learned in the first stage to complete missing modalities.
The complete pretraining pipeline is illustrated in the figure.
- Dependencies
- Dataset Download
- Data Preprocessing
- Molecular Pretraining
- Molecular Property Prediction
- Molecular Conformation Generation
- License
## Dependencies

Create a conda environment and install the required packages:

```bash
conda create -n flexmol python=3.10
conda activate flexmol
```

Install the following dependencies:
```bash
# Core dependencies
# Note: for V100 GPUs (compute capability 7.0), use PyTorch 2.2.x or earlier;
# newer GPUs (A100, H100, etc.) can use PyTorch 2.2.0 or later
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
pip install torch-geometric
pip install pytorch-lightning
pip install rdkit
pip install cython
pip install omegaconf
pip install ogb
pip install scikit-learn
pip install pandas
pip install peft
pip install lmdb
pip install tensorboard
```

## Dataset Download

Download the PCQM4Mv2 dataset from OGB:
```bash
wget http://ogb-data.stanford.edu/data/lsc/pcqm4m-v2-train.sdf.tar.gz

# Verify download integrity
md5sum pcqm4m-v2-train.sdf.tar.gz  # Expected: fd72bce606e7ddf36c2a832badeec6ab

# Extract the dataset
tar -xf pcqm4m-v2-train.sdf.tar.gz  # Extracts pcqm4m-v2-train.sdf
```

Download the complete Uni-Mol pretraining dataset:

```bash
wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/pretrain/ligands.tar.gz
tar -xf ligands.tar.gz
```

Download the Uni-Mol fine-tuning datasets for conformation generation and molecular property prediction:

```bash
wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/finetune/conformation_generation.tar.gz
tar -xf conformation_generation.tar.gz

wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/finetune/molecular_property_prediction.tar.gz
tar -xf molecular_property_prediction.tar.gz
```

## Data Preprocessing

Build the required Cython extensions for data processing:
```bash
# Build Cython extensions
python setup.py build_ext --inplace
python setup_cython.py build_ext --inplace
```

Update the dataset paths in conf/dataset_paths.yml according to your local setup:

```yaml
# Example configuration
dictionary_path: "/path/to/your/unimol_pretrain/dict.txt"
pcqm_data_path: "/path/to/your/pcqm_nf.lmdb"
unimol_2d_path: "/path/to/your/unimol_2d"
unimol_3d_path: "/path/to/your/unimol_3d_1m"
```

Then run the preprocessing scripts:

```bash
# Preprocess the PCQM4Mv2 dataset
python process_pcqm.py

# Preprocess the 2D-only/3D-only datasets
python process_only_modal.py --modes both
# or
python process_only_modal.py --modes only3d --scale 500k
```
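Before preprocessing, it can help to sanity-check that the paths in conf/dataset_paths.yml actually exist. A minimal sketch, assuming the flat `key: "value"` format shown above (a full YAML parser such as PyYAML would be more robust); the helper names are illustrative, not part of the codebase:

```python
# Sanity-check dataset paths before preprocessing (illustrative helper,
# not part of the repository). Assumes the flat key: "value" layout of
# conf/dataset_paths.yml shown above.
from pathlib import Path


def read_flat_yaml(text: str) -> dict:
    """Parse simple `key: "value"` lines, skipping comments and blanks."""
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        paths[key.strip()] = value.strip().strip('"')
    return paths


def missing_paths(config: dict) -> list:
    """Return the config keys whose paths do not exist on disk."""
    return [key for key, value in config.items() if not Path(value).exists()]
```

For example, `missing_paths(read_flat_yaml(Path("conf/dataset_paths.yml").read_text()))` returns the keys that still need fixing before process_pcqm.py can find the data.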
## Molecular Pretraining

The pretraining consists of two stages.

Stage 1 trains the model using both 2D and 3D molecular features:

```bash
./scripts/train_multi_gpu.sh stage1 4 0,1,2,3
```

Stage 2a continues pretraining with 2D molecular features only:

```bash
./scripts/train_multi_gpu.sh stage2a 1 0
```

Stage 2b continues pretraining with 3D molecular features only:

```bash
./scripts/train_multi_gpu.sh stage2b 1 0
```
The training script automatically configures the following:

- Strategy: `ddp_find_unused_parameters_true` for distributed training
- Precision: mixed precision (FP16) for GPU training, FP32 for CPU
- Device detection: GPUs are detected automatically
- Gradient clipping: L2-norm clipping with a maximum value of 1.0
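The gradient-clipping step can be stated concretely. A minimal sketch of L2-norm clipping with max norm 1.0, shown on plain Python lists for clarity (in PyTorch Lightning this corresponds to the Trainer's `gradient_clip_val` argument; the real pipeline operates on tensors):

```python
# L2-norm gradient clipping, sketched on plain lists (illustrative only).
import math


def clip_grad_l2(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # Already within the limit; leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

For example, `clip_grad_l2([3.0, 4.0])` rescales the norm-5 gradient down to norm 1, while smaller gradients pass through untouched.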
## Molecular Property Prediction

Fine-tune the pretrained model for downstream molecular property prediction tasks:

```bash
sbatch ./scripts/train_multi_gpu.sh finetune 1 0
```

Set `task_num` according to the fine-tuning dataset:
| Dataset | BBBP | BACE | ClinTox | Tox21 | ToxCast | SIDER | HIV | PCBA | MUV |
|---|---|---|---|---|---|---|---|---|---|
| task_num | 2 | 2 | 2 | 12 | 617 | 27 | 2 | 128 | 17 |
| Dataset | ESOL | FreeSolv | Lipo | QM7 | QM8 | QM9 |
|---|---|---|---|---|---|---|
| task_num | 1 | 1 | 1 | 1 | 12 | 3 |
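For convenience, the tables above can be expressed as a lookup (illustrative only; the run scripts themselves read `task_num` from their configs):

```python
# task_num per dataset, mirroring the tables above (illustrative helper).
TASK_NUM = {
    # Classification benchmarks
    "BBBP": 2, "BACE": 2, "ClinTox": 2, "Tox21": 12, "ToxCast": 617,
    "SIDER": 27, "HIV": 2, "PCBA": 128, "MUV": 17,
    # Regression benchmarks
    "ESOL": 1, "FreeSolv": 1, "Lipo": 1, "QM7": 1, "QM8": 12, "QM9": 3,
}


def task_num_for(dataset: str) -> int:
    """Look up task_num case-insensitively; raise KeyError if unknown."""
    for name, n in TASK_NUM.items():
        if name.lower() == dataset.lower():
            return n
    raise KeyError(f"unknown dataset: {dataset}")
```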
## Molecular Conformation Generation

Perform molecular conformation generation using the pretrained model.

First, generate initial conformations for inference:

```bash
mode="gen_data"
nthreads=20                                                        # Number of threads
reference_file="./conformation_generation/qm9/test_data_200.pkl"   # Reference file path
output_dir="./conformation_generation/qm9"                         # Output directory

python ./flexmol/utils/conf_gen_cal_metrics.py \
    --mode $mode \
    --nthreads $nthreads \
    --reference-file $reference_file \
    --output-dir $output_dir
```

Then fine-tune the pretrained model on the conformation generation task:
```bash
python flexmol/generation/pl_gen.py
```

Finally, evaluate the generated conformations:

```bash
mode="cal_metrics"
threshold=0.5                                        # 0.5 for QM9, 1.25 for Drugs
nthreads=20                                          # Number of threads
predict_file="/path/to/your/inference/results.pkl"   # Generated conformations file
reference_file="/path/to/your/reference/data.pkl"    # Reference conformations file

python flexmol/utils/conf_gen_cal_metrics.py \
    --mode $mode \
    --threshold $threshold \
    --nthreads $nthreads \
    --predict-file $predict_file \
    --reference-file $reference_file
```

Training behavior can be customized through the configuration files in the conf/ directory. Key parameters include:
- Batch size: adjust `batch_size` to fit your GPU memory
- Learning rate: modify the `learning_rate` and `lr_scheduler` settings
- Model architecture: configure `encoder_layers`, `embed_dim`, etc.
- Training epochs: set `max_epochs` for the training duration
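A hypothetical fragment combining these knobs might look like the following (the key names follow the parameters listed above; the values are placeholders, not recommended settings):

```yaml
# Hypothetical training config fragment -- key names mirror the
# parameters above and may differ from the actual conf/ files.
batch_size: 32
learning_rate: 2.0e-4
lr_scheduler: polynomial_decay
max_epochs: 60
encoder_layers: 12
embed_dim: 768
```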
- Memory requirements:
  - Single GPU: ~16 GB VRAM recommended
  - Multi-GPU: 8 GB+ VRAM per GPU
- Training time:
  - Stage 1: ~1 day on 4x V100
  - Stage 2: ~6-8 hours per modality on 1x V100
- Recommended setup:
  - Use mixed precision (FP16) to reduce memory usage
  - Enable gradient checkpointing for larger models
Common issues:

- CUDA out of memory: reduce the batch size or enable gradient checkpointing
- DDP issues: ensure all GPUs are visible and properly configured
- Data loading: verify the dataset paths and that preprocessing has completed

### Alternative Multi-GPU Launch Method

For more control over distributed training, you can use PyTorch's distributed launcher (note that torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun):

```bash
# Set environment variables
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_ADDR="localhost"
export MASTER_PORT="29500"

# Launch with torch.distributed
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=29500 \
    flexmol/run_pcqm/pl_train.py
```
### Troubleshooting Multi-GPU Issues

- NCCL backend issues: set the following environment variables:

  ```bash
  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0  # Replace with your network interface
  ```

- Port conflicts: change the master port:

  ```bash
  export MASTER_PORT="29501"  # Use a different port
  ```
## Citation

If you use this work, please cite:

```bibtex
@inproceedings{song2025flexmol,
  author    = {Song, Tengwei and Wu, Min and Fang, Yuan},
  title     = {Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration},
  year      = {2025},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  pages     = {2750--2760},
  numpages  = {11},
  series    = {CIKM '25}
}
```

## License

This project is built upon Uni-Mol and Transformer-M. Please refer to their respective licenses for usage terms.
We thank the authors of Uni-Mol and Transformer-M for their foundational work and open-source contributions to the molecular modeling community.