🚀 DisProtBench: Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts

Xinyue Zeng¹, Tuo Wang¹, Adithya Kulkarni⁵, Alexander Lu¹, Alexandra Ni¹, Phoebe Xing¹, Junhan Zhao²³, Siwei Chen²⁴, Dawei Zhou¹

¹ Virginia Tech, ² University of Chicago, ³ Harvard Medical School, ⁴ Broad Institute of MIT and Harvard, ⁵ Ball State University

DisProtBench


📌 Abstract

Recent advances in protein structure prediction have achieved near-atomic accuracy for well-folded proteins, but their reliability in biologically realistic settings involving intrinsically disordered regions (IDRs) remains poorly understood. Existing benchmarks primarily emphasize global structural accuracy and therefore fail to capture how structural uncertainty propagates to downstream functional tasks such as protein–protein interaction (PPI) prediction and structure-based drug discovery.

In realistic biological settings, evaluating protein structure prediction models (PSPMs) is challenging due to three fundamental factors:

  • Structural complexity: Proteins often contain extensive IDRs, flexible segments, and multimeric interfaces that violate assumptions of rigid folding.
  • Low availability of reliable ground truth: Reliable experimental structures are scarce for disordered regions, making direct structural evaluation unreliable.
  • Prediction uncertainty: Structural uncertainty propagates into downstream prediction uncertainty, leading to task-dependent degradation in functional performance.

We introduce DisProtBench, an IDR-centric, uncertainty-aware benchmark for evaluating PSPMs beyond static accuracy. DisProtBench integrates a large, multi-modal dataset spanning disease-associated IDRs, GPCR–ligand complexes, and multimeric protein assemblies, and evaluates models through downstream functional tasks. To explicitly diagnose uncertainty, we introduce Functional Uncertainty Sensitivity (FUS), which stratifies evaluation by task-relevant structural uncertainty rather than confidence alone.

Our results reveal clear task-dependent failure modes: PPI prediction is highly sensitive to IDR-driven uncertainty, whereas structure-based drug discovery remains comparatively robust. These differences are largely obscured by aggregate accuracy metrics. DisProtBench provides a reproducible and extensible framework for uncertainty-aware evaluation of protein structure prediction models in realistic biological contexts.

🔹 DisProtBench - A Unified Benchmark for IDR Investigation

We introduce DisProtBench with the following key contributions:

(1) Database Development: We curate a large benchmark dataset spanning biologically complex IDR scenarios, including thousands of disease-associated human proteins, GPCR–ligand interactions, and multimeric complexes with disorder-mediated interfaces. It captures structural heterogeneity essential for assessing model robustness in realistic contexts.

(2) IDR-centric Toolbox Development: We introduce a unified evaluation toolbox that benchmarks eleven PSPMs on disorder-sensitive downstream tasks, including PPI prediction and drug discovery, using consistent classification, regression, and structural interface metrics. Evaluation is performed via Functional Uncertainty Sensitivity (FUS), which stratifies predictions by task-relevant structural uncertainty rather than confidence alone, enabling diagnosis of task-dependent failure modes across model families.

(3) Visual Analytics Interface Development: The DisProtBench Portal provides 3D visualizations, model comparison heatmaps, and interactive results to explore structure–function links, assess disorder-specific performance, and support hypothesis generation, all without local setup.

📂 Datasets

We open-sourced our benchmark on GitHub here; it consists of the following subsets:

| Dataset | Description | Size | Source |
|---|---|---|---|
| DisProt-Based Dataset | Disorder in human disease | $\sim 10^3$ proteins / $\sim 10^4$ PPIs | First proposed in our work |
| Protein Interaction Dataset | Disorder-mediated interfaces | $\sim 10^5$ interactions | GitHub |
| Individual Protein Dataset | Disorder and ligand binding | $\sim 10^3$ complexes | GitHub |

πŸ—οΈ Toolbox

📂 Models Toolbox

We benchmark state-of-the-art PSPMs spanning diverse architectures, inputs, and structural representations across protein-related tasks, as summarized below:

| PSPM | Task | Architecture | Input | Source | Structural Representation |
|---|---|---|---|---|---|
| AF2 | PPI, Drug | Evoformer | MSA | Paper | Atomic |
| AF3 | PPI, Drug | Evoformer+LLM | MSA + Seq | Paper | Atomic + ligand |
| OpenFold | PPI, Drug | Evoformer | MSA | Paper | Atomic |
| UniFold | PPI, Drug | Evoformer | MSA | Paper | Atomic |
| Boltz | PPI, Drug | Transformer | Seq-only | Paper | Coarse-grained |
| Chai | PPI, Drug | Transformer | Seq-only | Paper | Coarse-grained |
| Protenix | PPI, Drug | Transformer+ | Seq-only | Paper | Atomic + ligand |
| ESMFold | Drug | Transformer | Seq-only | Paper | Coarse-grained |
| OmegaFold | Drug | Transformer | Seq-only | Paper | Coarse-grained |
| RoseTTAFold | Drug | Hybrid (CNN+Attn) | MSA | Paper | Atomic |
| DeepFold | Drug | Custom DL | Seq-only | Paper | Atomic |

📊 Evaluation Toolbox

We evaluate model performance using a comprehensive set of classification, regression, and structural interface metrics, defined as follows:

| Metric | Definition / Formula | Categorization |
|---|---|---|
| Precision (Positive Predictive Value) | TP / (TP + FP) | Classification Metrics |
| Recall (Sensitivity) | TP / (TP + FN) | Classification Metrics |
| F1 Score | 2 × TP / (2 × TP + FP + FN) | Classification Metrics |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Classification Metrics |
| Mean Absolute Error (MAE) | $\displaystyle \frac{1}{N}\sum_{i=1}^{N}\lvert y_i-\hat{y}_i\rvert$ | Regression Metrics |
| Mean Squared Error (MSE) | $\displaystyle \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2$ | Regression Metrics |
| Pearson Correlation Coefficient (R) | $\displaystyle \frac{\sum_i (y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_i (y_i-\bar{y})^2\,\sum_i (\hat{y}_i-\bar{\hat{y}})^2}}$ | Regression Metrics |
| Receptor Precision (RP) | size(True ∩ Pred Receptor) / size(Pred Receptor) | Structural Interface Metrics |
| Receptor Recall (RR) | size(True ∩ Pred Receptor) / size(True Receptor) | Structural Interface Metrics |
| Ligand Precision (LP) | size(True ∩ Pred Ligand) / size(Pred Ligand) | Structural Interface Metrics |
| Ligand Recall (LR) | size(True ∩ Pred Ligand) / size(True Ligand) | Structural Interface Metrics |
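The tabulated definitions translate directly into code. The sketch below is a minimal illustration of the classification, regression, and interface metrics; the helper names are ours, not from the DisProtBench toolbox:

```python
# Minimal sketch of the metric definitions above; helper names are
# illustrative, not from the DisProtBench toolbox.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def interface_pr(true_residues, pred_residues):
    """Interface precision/recall as overlaps of residue-index sets."""
    true_s, pred_s = set(true_residues), set(pred_residues)
    inter = len(true_s & pred_s)
    return inter / len(pred_s), inter / len(true_s)  # (precision, recall)
```

For example, a predicted receptor interface {3, 4, 5} against a true interface {1, 2, 3, 4} overlaps in two residues, giving RP = 2/3 and RR = 1/2.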

For the definitions of Receptor and Ligand, we follow the work *Multi-level analysis of intrinsically disordered protein docking methods*:

Definition

🎨 Visualization Portal

For more visualizations, please visit the DisProtBench Portal.

🔗 Visual-Interactive Interface: DisProtBench Portal

Visualization Examples

πŸ—οΈ Results

PPI Prediction

Generation Evaluation

The table below summarizes RR, RP, LR, and LP scores with 95% CIs for each PSPM under three confidence settings (full sequence, $\mathrm{FUS}_T(\tau=30)$, and $\mathrm{FUS}_T(\tau=50)$), highlighting interface prediction performance under varying disorder levels:

**Original (full sequence)**

| PSPM | RR | RP | LR | LP |
|---|---|---|---|---|
| AF2 | 0.749 ± 0.0217 | 0.7186 ± 0.0226 | 0.7052 ± 0.0215 | 0.7486 ± 0.0215 |
| Boltz | 0.7648 ± 0.0155 | 0.7666 ± 0.0153 | 0.7703 ± 0.0155 | 0.7624 ± 0.0140 |
| Chai | 0.7583 ± 0.0158 | 0.7746 ± 0.0149 | 0.7674 ± 0.0155 | 0.7571 ± 0.0155 |
| OpenFold | 0.6626 ± 0.0271 | 0.6127 ± 0.0312 | 0.6551 ± 0.0274 | 0.6322 ± 0.0280 |
| Protenix | 0.7344 ± 0.0164 | 0.7427 ± 0.0162 | 0.7385 ± 0.0159 | 0.7310 ± 0.0159 |
| UniFold | 0.5876 ± 0.0593 | 0.5431 ± 0.0604 | 0.5892 ± 0.0711 | 0.6116 ± 0.0609 |

**$\mathrm{FUS}_T(\tau=30)$**

| PSPM | RR | RP | LR | LP |
|---|---|---|---|---|
| AF2 | 0.7729 ± 0.0208 | 0.7426 ± 0.0217 | 0.7313 ± 0.0207 | 0.7719 ± 0.0204 |
| Boltz | 0.7876 ± 0.0147 | 0.7899 ± 0.0146 | 0.7934 ± 0.0148 | 0.7863 ± 0.0142 |
| Chai | 0.7815 ± 0.0150 | 0.7980 ± 0.0142 | 0.7898 ± 0.0147 | 0.7804 ± 0.0148 |
| OpenFold | 0.6904 ± 0.0263 | 0.6386 ± 0.0306 | 0.6828 ± 0.0267 | 0.6593 ± 0.0272 |
| Protenix | 0.7584 ± 0.0158 | 0.7658 ± 0.0155 | 0.7626 ± 0.0152 | 0.7544 ± 0.0155 |
| UniFold | 0.6127 ± 0.0576 | 0.5671 ± 0.0590 | 0.6172 ± 0.0707 | 0.6384 ± 0.0595 |

**$\mathrm{FUS}_T(\tau=50)$**

| PSPM | RR | RP | LR | LP |
|---|---|---|---|---|
| AF2 | 0.7959 ± 0.0197 | 0.7663 ± 0.0207 | 0.7569 ± 0.0198 | 0.7946 ± 0.0193 |
| Boltz | 0.8098 ± 0.0139 | 0.8121 ± 0.0138 | 0.8152 ± 0.0140 | 0.8093 ± 0.0134 |
| Chai | 0.8034 ± 0.0142 | 0.8203 ± 0.0134 | 0.8117 ± 0.0139 | 0.8029 ± 0.0140 |
| OpenFold | 0.7182 ± 0.0254 | 0.6644 ± 0.0297 | 0.7111 ± 0.0258 | 0.6861 ± 0.0263 |
| Protenix | 0.7811 ± 0.0150 | 0.7884 ± 0.0147 | 0.7857 ± 0.0144 | 0.7775 ± 0.0144 |
| UniFold | 0.6386 ± 0.0557 | 0.5923 ± 0.0576 | 0.6426 ± 0.0689 | 0.6655 ± 0.0578 |
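As a sketch of how $\mathrm{FUS}_T(\tau)$ stratification can be realized, the code below filters interface residues by a per-residue confidence score (e.g., pLDDT) before computing overlap metrics. The function names and the exact filtering rule (keep residues with confidence ≥ τ) are our assumptions for illustration, not the toolbox's implementation:

```python
# Hypothetical sketch of FUS-style stratification: compute interface overlap
# only on residues whose per-residue confidence (e.g., pLDDT) meets a
# threshold tau. Names and the >= tau rule are assumptions, not the
# DisProtBench toolbox's code.
def fus_mask(plddt, tau):
    """Indices of residues retained at confidence threshold tau."""
    return [i for i, c in enumerate(plddt) if c >= tau]

def stratified_overlap(true_idx, pred_idx, plddt, tau):
    """Interface precision/recall restricted to retained residues."""
    kept = set(fus_mask(plddt, tau))
    t = set(true_idx) & kept
    p = set(pred_idx) & kept
    inter = len(t & p)
    return (inter / len(p) if p else 0.0,   # precision on kept residues
            inter / len(t) if t else 0.0)   # recall on kept residues
```

Raising τ discards low-confidence (often disordered) residues, which is consistent with the monotone score improvements from the full sequence to $\tau=50$ in the table above.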

Prediction Evaluation

We use PPI prediction as a downstream task for evaluating PSPMs with the following pipeline:

PPI Pipeline

We examine the robustness of PSPMs in predicting PPIs, a setting where disordered regions frequently mediate transient or flexible binding. The results are as follows:

**Original (full sequence)**

| PSPM | Acc | Prec | Rec | F1 |
|---|---|---|---|---|
| AF2 | 0.793 | 0.783 | 0.799 | 0.791 |
| AF3 | 0.902 | 0.888 | 0.915 | 0.901 |
| Boltz | 0.850 | 0.848 | 0.853 | 0.850 |
| Chai | 0.850 | 0.841 | 0.863 | 0.852 |
| OpenFold | 0.624 | 0.605 | 0.605 | 0.605 |
| Protenix | 0.810 | 0.809 | 0.812 | 0.810 |
| UniFold | 0.552 | 0.378 | 0.667 | 0.483 |

**$\mathrm{FUS}_T(\tau=30)$**

| PSPM | Acc | Prec | Rec | F1 |
|---|---|---|---|---|
| AF2 | 0.802 | 0.791 | 0.812 | 0.801 |
| AF3 | 0.905 | 0.893 | 0.905 | 0.906 |
| Boltz | 0.858 | 0.853 | 0.863 | 0.858 |
| Chai | 0.858 | 0.847 | 0.873 | 0.860 |
| OpenFold | 0.643 | 0.622 | 0.638 | 0.630 |
| Protenix | 0.819 | 0.820 | 0.818 | 0.819 |
| UniFold | 0.567 | 0.389 | 0.667 | 0.491 |

**$\mathrm{FUS}_T(\tau=50)$**

| PSPM | Acc | Prec | Rec | F1 |
|---|---|---|---|---|
| AF2 | 0.818 | 0.809 | 0.825 | 0.817 |
| AF3 | 0.913 | 0.899 | 0.930 | 0.914 |
| Boltz | 0.869 | 0.870 | 0.868 | 0.869 |
| Chai | 0.869 | 0.857 | 0.887 | 0.871 |
| OpenFold | 0.671 | 0.656 | 0.651 | 0.653 |
| Protenix | 0.834 | 0.834 | 0.835 | 0.834 |
| UniFold | 0.597 | 0.417 | 0.714 | 0.526 |

Heatmaps of $-\log_{10}(p)$ values from McNemar tests comparing pairwise model performance on PPI prediction across different $\mathrm{FUS}_T(\tau)$ settings. Left: full sequence; Middle: $\mathrm{FUS}_T(\tau=30)$; Right: $\mathrm{FUS}_T(\tau=50)$. Higher values indicate greater statistical significance between PSPMs. Blank blocks indicate self-comparisons, which are omitted by definition.

PPI Heatmap
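The heatmap cells can be reproduced generically. The sketch below (ours, not the repository's script) computes the continuity-corrected McNemar statistic for two models' paired predictions and returns the plotted $-\log_{10}(p)$ value:

```python
import math

def mcnemar_neglog10_p(pred_a, pred_b, labels):
    """McNemar test on paired predictions from two models; returns -log10(p).

    Generic continuity-corrected chi-square version (1 d.o.f.), not the
    benchmark's exact script. b = cases model A is right and B is wrong;
    c = the reverse.
    """
    b = sum(1 for a, m, y in zip(pred_a, pred_b, labels) if a == y != m)
    c = sum(1 for a, m, y in zip(pred_a, pred_b, labels) if m == y != a)
    if b + c == 0:
        return 0.0  # models never disagree on correctness
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f. via the error function.
    p = math.erfc(math.sqrt(chi2 / 2))
    return -math.log10(max(p, 1e-300))  # clamp to avoid log(0)
```

Only the discordant pairs (b and c) enter the statistic, which is why McNemar is the standard choice for comparing two classifiers on the same test set.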

Drug Discovery

Prediction Evaluation

We use drug discovery as a downstream task for evaluating PSPMs with the following pipeline:

Drug Discovery Pipeline

We examine the robustness of PSPMs in drug discovery, a setting where disordered regions frequently mediate transient or flexible binding. The results are as follows:

| Model | MAE (Original) | R (Original) | MAE ($\mathrm{FUS}_T(\tau=30)$) | R ($\mathrm{FUS}_T(\tau=30)$) | MAE ($\mathrm{FUS}_T(\tau=50)$) | R ($\mathrm{FUS}_T(\tau=50)$) |
|---|---|---|---|---|---|---|
| AlphaFold3 | 0.048 | 0.999 | 0.048 | 0.999 | 0.049 | 0.999 |
| Protenix | 0.072 | 0.997 | 0.072 | 0.997 | 0.072 | 0.997 |
| Boltz | 0.079 | 0.996 | 0.079 | 0.996 | 0.080 | 0.996 |
| Chai | 0.096 | 0.995 | 0.096 | 0.995 | 0.097 | 0.995 |
| DeepFold | 0.144 | 0.988 | 0.144 | 0.988 | 0.144 | 0.988 |
| ESMFold | 0.150 | 0.987 | 0.150 | 0.987 | 0.151 | 0.987 |
| OmegaFold | 0.151 | 0.987 | 0.151 | 0.987 | 0.152 | 0.987 |
| OpenFold | 0.151 | 0.987 | 0.151 | 0.987 | 0.151 | 0.987 |
| AlphaFold2 | 0.160 | 0.985 | 0.160 | 0.985 | 0.160 | 0.985 |
| UniFold | 0.183 | 0.981 | 0.183 | 0.981 | 0.184 | 0.981 |
| RoseTTAFold | 0.190 | 0.979 | 0.190 | 0.979 | 0.190 | 0.979 |

Heatmaps of $-\log_{10}(p)$ values from Wilcoxon signed-rank tests comparing model performance in drug discovery tasks across different $\mathrm{FUS}_T(\tau)$ settings. Left: full sequence; Middle: $\mathrm{FUS}_T(\tau=30)$; Right: $\mathrm{FUS}_T(\tau=50)$. Higher values indicate greater statistical significance in pairwise differences between PSPMs. Blank blocks indicate self-comparisons, which are omitted by definition.

Drug Heatmap
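For the regression task, the paired comparison uses a Wilcoxon signed-rank test on per-sample errors. The sketch below is a generic normal-approximation version returning $-\log_{10}(p)$, written with the standard library only; it is our illustration, not the benchmark's exact script:

```python
import math

def wilcoxon_neglog10_p(errors_a, errors_b):
    """Wilcoxon signed-rank test (normal approximation) on paired errors;
    returns -log10 of the two-sided p-value. Generic sketch, not the
    benchmark's exact script."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0
    # Rank absolute differences, averaging ranks for ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return -math.log10(max(p, 1e-300))
```

A nonparametric paired test is appropriate here because per-complex affinity errors are not guaranteed to be normally distributed across models.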

Folder Structure

DisProtBench/
β”‚
β”œβ”€β”€ Database/
β”‚   β”œβ”€β”€ DisProt-based Datase/         # Scripts for extracting and processing DisProt-based data
β”‚   β”œβ”€β”€ Protein Interaction Dataset/  # Main dataset files (e.g., dataset1200.json)
β”‚   └── Individual Protein Datase/    # Scripts for individual protein data (e.g., GPCRs)
β”‚
β”œβ”€β”€ Downstream Evaluation/
β”‚   β”œβ”€β”€ toolbox/                      # Unified CLI and utility scripts for evaluation tasks
β”‚   β”œβ”€β”€ PPI/                          # PPI task scripts (preprocessing, training, testing)
β”‚   └── CPI-drug discovery/           # CPI/drug discovery task scripts and AlphaFold tools
β”‚
β”œβ”€β”€ disprotbenchmark-metadata.json    # Metadata and dataset description
β”œβ”€β”€ LICENSE                          # Project license (MIT)
└── README.md                        # This file

Getting Started

Prerequisites

For AlphaFold-related tasks, install dependencies from:

Downstream Evaluation/toolbox/cpi/alphafold/requirements.txt

Example (in a new environment):

pip install -r Downstream\ Evaluation/toolbox/cpi/alphafold/requirements.txt

Installation

Clone the repository and install the required dependencies for your tasks of interest.

Usage

Unified CLI for Downstream Tasks

All major PPI and CPI tasks can be run from a single entry point:

cd Downstream\ Evaluation/toolbox
python toolbox.py <ppi|cpi> <subcommand> [options]

PPI Tasks

  • Train a PPI model:

    python toolbox.py ppi train \
      --model DenseNet3D \
      --datapath ./data/example_dataset/distance \
      --train_set ./data/example_dataset/part_0_train.csv \
      --test_set ./data/example_dataset/part_0_val.csv \
      --savingPath ./models/example_model
  • Test a PPI model:

    python toolbox.py ppi test \
      --model DenseNet3D \
      --datapath ./data/example_dataset/distance \
      --weights ./models/example_model/<timestamp>/best_model.weights.h5 \
      --output ./models/example_model/<timestamp>/preds.npy \
      --test_set ./data/example_dataset/part_0_test.csv

CPI/Drug Discovery Tasks

  • Prepare CPI dataset:

    python toolbox.py cpi prepare_data --dataset data/original/top20_raw.csv --gpcr-col uniprot_id --smiles-col smiles --label-col pKi --file-name-col inchi_key --rep-path data/representations/top20/{}.npy --save-path data/ligands/top20/imgs --anno-path data/ligands/top20/anno -j 12 --test-size 0.3 --task regression
  • Train a CPI model:

    python toolbox.py cpi train --cfg configs/train/top20.yml
  • Predict with a CPI model:

    python toolbox.py cpi predict --cfg configs/prediction/pain.yml --data-dir data/pred/fda --rep-path data/representations/pain/P08908.npy --out-dir output/prediction/fda/P08908

Protein Structure Prediction

To generate protein structures, follow the instructions in the respective model repositories directly.


License

This project is licensed under the MIT License. See the LICENSE file for details.
