
Unified Molecule Pre-training with Flexible 2D and 3D Modalities

Official repository of the CIKM 2025 paper "Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration" (FlexMol).

This work presents a two-stage molecular pretraining approach. The first stage performs multi-modal representation learning on paired 2D and 3D molecular features from the PCQM4Mv2 dataset. The second stage uses single-modality data (2D-only or 3D-only subsets of the Uni-Mol data) and leverages the decoder learned in the first stage to complete the missing modality.

The complete pretraining pipeline is illustrated in the figure below.

Dependencies

Create a conda environment and install the required packages:

conda create -n flexmol python=3.10
conda activate flexmol

Install the following dependencies:

# Core dependencies
# Note: For V100 GPUs (compute capability 7.0), use PyTorch 2.2.x or earlier
# For newer GPUs (A100, H100, etc.), you can use PyTorch 2.2.0 or later
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
pip install torch-geometric
pip install pytorch-lightning
pip install rdkit
pip install cython
pip install omegaconf
pip install ogb
pip install scikit-learn
pip install pandas
pip install peft
pip install lmdb
pip install tensorboard
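
After installation, a quick sanity check (not part of the repository, just a convenience) confirms that PyTorch sees your GPU and that the core libraries import cleanly:

python - <<'EOF'
import torch, torch_geometric, pytorch_lightning, rdkit
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torch_geometric:", torch_geometric.__version__)
print("pytorch_lightning:", pytorch_lightning.__version__)
print("rdkit:", rdkit.__version__)
EOF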

Dataset Download

PCQM4Mv2 Dataset

Download the PCQM4Mv2 dataset from OGB:

wget http://ogb-data.stanford.edu/data/lsc/pcqm4m-v2-train.sdf.tar.gz

# Verify download integrity
md5sum pcqm4m-v2-train.sdf.tar.gz # Expected: fd72bce606e7ddf36c2a832badeec6ab pcqm4m-v2-train.sdf.tar.gz

# Extract the dataset
tar -xf pcqm4m-v2-train.sdf.tar.gz # Extracts pcqm4m-v2-train.sdf
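
To confirm the extraction worked, you can peek at the first few molecules with RDKit (a quick check, not part of the pipeline); the training SDF holds roughly 3.4M molecules with 3D coordinates:

python - <<'EOF'
from rdkit import Chem

# Stream the SDF lazily instead of loading all ~3.4M molecules into memory
suppl = Chem.SDMolSupplier("pcqm4m-v2-train.sdf", removeHs=False)
for i, mol in enumerate(suppl):
    if i >= 3:
        break
    if mol is None:
        continue
    print(mol.GetNumAtoms(), "atoms,", mol.GetNumConformers(), "conformer(s)")
EOF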

Uni-Mol Dataset

Download the complete Uni-Mol pretraining dataset:

wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/pretrain/ligands.tar.gz
tar -xf ligands.tar.gz

Downstream Task Datasets

Molecule Conformation Generation

wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/finetune/conformation_generation.tar.gz
tar -xf conformation_generation.tar.gz

Molecule Property Prediction

wget https://bioos-hermite-beijing.tos-cn-beijing.volces.com/unimol_data/finetune/molecular_property_prediction.tar.gz
tar -xf molecular_property_prediction.tar.gz

Data Preprocessing

Step 1: Build Extensions

Build the required Cython extensions for data processing:

# Build Cython extensions
python setup.py build_ext --inplace
python setup_cython.py build_ext --inplace

Step 2: Configure Dataset Paths

Update the dataset paths in conf/dataset_paths.yml according to your local setup:

# Example configuration
dictionary_path: "/path/to/your/unimol_pretrain/dict.txt"
pcqm_data_path: "/path/to/your/pcqm_nf.lmdb"
unimol_2d_path: "/path/to/your/unimol_2d"
unimol_3d_path: "/path/to/your/unimol_3d_1m"
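
Since omegaconf is already installed, a short script can load this file and flag any missing paths (illustrative, assuming the keys shown above):

python - <<'EOF'
import os
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/dataset_paths.yml")
for key in ("dictionary_path", "pcqm_data_path", "unimol_2d_path", "unimol_3d_path"):
    path = cfg.get(key)
    status = "OK" if path and os.path.exists(str(path)) else "MISSING"
    print(f"{key}: {path} -> {status}")
EOF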

Step 3: Preprocess Data

# Preprocess PCQM4Mv2 dataset
python process_pcqm.py
# Preprocess the 2D-only/3D-only datasets
python process_only_modal.py --modes both
# or
python process_only_modal.py --modes only3d --scale 500k    
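
To sanity-check the preprocessing output, the LMDB files can be inspected directly. The sketch below assumes Uni-Mol-style storage (a single-file LMDB with one pickled record per key); the repository's exact layout may differ:

python - <<'EOF'
import lmdb
import pickle

# Open read-only; replace the placeholder with your preprocessed output path
env = lmdb.open("/path/to/your/pcqm_nf.lmdb", subdir=False, readonly=True, lock=False)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])
    cursor = txn.cursor()
    if cursor.first():
        record = pickle.loads(cursor.value())  # assumes pickled records, as in Uni-Mol
        print("first key:", cursor.key())
        if isinstance(record, dict):
            print("fields:", list(record.keys()))
EOF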

Molecular Pretraining

The pretraining consists of two stages:

Stage 1: Multi-modal Pretraining (2D+3D)

Train the model using both 2D and 3D molecular features:
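
# Usage (arguments inferred from the examples in this README):
#   ./scripts/train_multi_gpu.sh <stage> <num_gpus> <comma-separated GPU ids>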

./scripts/train_multi_gpu.sh stage1 4 0,1,2,3

Stage 2a: 2D-only Pretraining

Continue pretraining with 2D molecular features only:

./scripts/train_multi_gpu.sh stage2a 1 0

Stage 2b: 3D-only Pretraining

Continue pretraining with 3D molecular features only:

./scripts/train_multi_gpu.sh stage2b 1 0

Training Configuration

The training script automatically configures the following (see the Trainer sketch after this list):

  • Strategy: ddp_find_unused_parameters_true for distributed training
  • Precision: Mixed precision (FP16) for GPU training, FP32 for CPU
  • Device Detection: Automatic GPU detection
  • Gradient Clipping: L2 norm clipping with max value 1.0
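
For reference, these defaults correspond roughly to the pytorch-lightning Trainer configuration sketched below; this is an illustrative reconstruction, not the repository's actual launch code:

import pytorch_lightning as pl
import torch

# Illustrative Trainer mirroring the defaults listed above
trainer = pl.Trainer(
    accelerator="gpu" if torch.cuda.is_available() else "cpu",  # automatic device detection
    devices="auto",
    strategy="ddp_find_unused_parameters_true",  # DDP tolerant of unused parameters
    precision="16-mixed" if torch.cuda.is_available() else "32-true",  # FP16 on GPU, FP32 on CPU
    gradient_clip_val=1.0,  # L2-norm clipping with max value 1.0
    gradient_clip_algorithm="norm",
)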

Molecular Property Prediction

Fine-tune the pretrained model for downstream molecular property prediction tasks. The example below submits the job through Slurm; on a machine without Slurm, run the script directly instead of via sbatch:

sbatch ./scripts/train_multi_gpu.sh finetune 1 0

Set the task_num value according to the fine-tuning dataset:

Classification

Dataset    BBBP   BACE   ClinTox   Tox21   ToxCast   SIDER   HIV   PCBA   MUV
task_num   2      2      2         12      617       27      2     128    17

Regression

Dataset    ESOL   FreeSolv   Lipo   QM7   QM8   QM9
task_num   1      1          1      1     12    3

Molecular Conformation Generation

Perform molecular conformation generation using the pretrained model:

Step 1: Generate Initial RDKit Conformations

Generate initial conformations for inference:

mode="gen_data"
nthreads=20  # Number of threads
reference_file="./conformation_generation/qm9/test_data_200.pkl"  # Reference file path
output_dir="./conformation_generation/qm9"  # Output directory

python ./flexmol/utils/conf_gen_cal_metrics.py \
    --mode $mode \
    --nthreads $nthreads \
    --reference-file $reference_file \
    --output-dir $output_dir

Step 2: Fine-tune for Conformation Generation

Fine-tune the pretrained model on the conformation generation task:

python flexmol/generation/pl_gen.py

Step 3: Calculate Evaluation Metrics

Evaluate the generated conformations:

mode="cal_metrics"
threshold=0.5  # RMSD threshold in Å for metrics calculation (0.5 for QM9, 1.25 for Drugs)
nthreads=20  # Number of threads
predict_file="/path/to/your/inference/results.pkl"  # Generated conformations file
reference_file="/path/to/your/reference/data.pkl"  # Reference conformations file

python flexmol/utils/conf_gen_cal_metrics.py \
    --mode $mode \
    --threshold $threshold \
    --nthreads $nthreads \
    --predict-file $predict_file \
    --reference-file $reference_file
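
The threshold feeds into the standard coverage (COV) and matching (MAT) metrics for conformation generation. Assuming the script follows the usual GEOM-style definitions (an assumption; check the script for the exact variant), the recall-side metrics reduce to:

import numpy as np

def cov_mat_recall(rmsd: np.ndarray, threshold: float):
    # rmsd[i, j]: best-alignment RMSD between reference conformer i
    # and generated conformer j (GEOM-style recall metrics, assumed variant)
    best = rmsd.min(axis=1)                  # closest generated match per reference
    cov = float((best < threshold).mean())   # COV-R: fraction of references covered
    mat = float(best.mean())                 # MAT-R: mean best-match RMSD
    return cov, mat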

Configuration

The training behavior can be customized through configuration files located in the conf/ directory. Key parameters include (an illustrative fragment follows this list):

  • Batch size: Adjust batch_size for your GPU memory
  • Learning rate: Modify learning_rate and lr_scheduler settings
  • Model architecture: Configure encoder_layers, embed_dim, etc.
  • Training epochs: Set max_epochs for training duration
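
An illustrative fragment with these keys might look as follows (placeholder values, not tuned defaults; check the files in conf/ for the real options):

# illustrative values only
batch_size: 32
learning_rate: 1.0e-4
lr_scheduler: polynomial   # placeholder
encoder_layers: 12
embed_dim: 768
max_epochs: 50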

Performance Notes

  • Memory Requirements:
    • Single GPU: ~16GB VRAM recommended
    • Multi-GPU: 8GB+ VRAM per GPU
  • Training Time:
    • Stage 1: ~1 day on 4x V100
    • Stage 2: ~6–8 hours per modality on 1x V100
  • Recommended Setup:
    • Use mixed precision (FP16) to reduce memory usage
    • Enable gradient checkpointing for larger models (see the sketch after this list)
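
Gradient checkpointing is not exposed as a documented flag here; if you wire it in yourself, the generic PyTorch pattern for a stack of encoder layers is sketched below (illustrative, not the repository's code):

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    # Illustrative wrapper: recompute each layer's activations during the
    # backward pass instead of storing them, trading compute for memory.
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = checkpoint(layer, x, use_reentrant=False)
        return x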

Troubleshooting

Common Issues

  1. CUDA Out of Memory: Reduce batch size or enable gradient checkpointing

  2. DDP Issues: Ensure all GPUs are visible and properly configured

  3. Data Loading: Verify dataset paths and preprocessing completion

  4. Alternative Multi-GPU Launch Method

    For more control over distributed training, you can use PyTorch's distributed launcher:

    # Set environment variables
    export PYTHONPATH="${PYTHONPATH}:$(pwd)"
    export MASTER_ADDR="localhost"
    export MASTER_PORT="29500"
    
    # Launch with torchrun (torch.distributed.launch is deprecated in PyTorch 2.x)
    torchrun \
        --nproc_per_node=4 \
        --master_port=29500 \
        flexmol/run_pcqm/pl_train.py
  5. Troubleshooting Multi-GPU Issues

  • NCCL Backend Issues: Set environment variables:

    export NCCL_DEBUG=INFO
    export NCCL_SOCKET_IFNAME=eth0  # Replace with your network interface
  • Port Conflicts: Change the master port:

    export MASTER_PORT="29501"  # Use a different port

Citation

If you use this work, please cite:

@inproceedings{song2025flexmol,
  author    = {Song, Tengwei and Wu, Min and Fang, Yuan},
  title     = {Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration},
  year      = {2025},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  pages     = {2750--2760},
  numpages  = {11},
  series    = {CIKM '25}
}

License

This project is built upon Uni-Mol and Transformer-M. Please refer to their respective licenses for usage terms.

Acknowledgments

We thank the authors of Uni-Mol and Transformer-M for their foundational work and open-source contributions to the molecular modeling community.
