DisProtBench: Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts
Xinyue Zeng¹, Tuo Wang¹, Adithya Kulkarni⁵, Alexander Lu¹, Alexandra Ni¹, Phoebe Xing¹, Junhan Zhao²³, Siwei Chen²⁴, Dawei Zhou¹
¹ Virginia Tech, ² University of Chicago, ³ Harvard Medical School, ⁴ Broad Institute of MIT and Harvard, ⁵ Ball State University
Recent advances in protein structure prediction have achieved near-atomic accuracy for well-folded proteins, but their reliability in biologically realistic settings involving intrinsically disordered regions (IDRs) remains poorly understood. Existing benchmarks primarily emphasize global structural accuracy and therefore fail to capture how structural uncertainty propagates to downstream functional tasks such as protein–protein interaction (PPI) prediction and structure-based drug discovery.
In realistic biological settings, evaluating protein structure prediction models (PSPMs) is challenging due to three fundamental factors:
- Structural complexity: Proteins often contain extensive IDRs, flexible segments, and multimeric interfaces that violate assumptions of rigid folding.
- Low availability of reliable ground truth: Reliable experimental structures are scarce for disordered regions, making direct structural evaluation unreliable.
- Prediction uncertainty: Structural uncertainty propagates into downstream prediction uncertainty, leading to task-dependent degradation in functional performance.
We introduce DisProtBench, an IDR-centric, uncertainty-aware benchmark for evaluating PSPMs beyond static accuracy. DisProtBench integrates a large, multi-modal dataset spanning disease-associated IDRs, GPCR–ligand complexes, and multimeric protein assemblies, and evaluates models through downstream functional tasks. To explicitly diagnose uncertainty, we introduce Functional Uncertainty Sensitivity (FUS), which stratifies evaluation by task-relevant structural uncertainty rather than confidence alone.
Our results reveal clear task-dependent failure modes: PPI prediction is highly sensitive to IDR-driven uncertainty, whereas structure-based drug discovery remains comparatively robust. These differences are largely obscured by aggregate accuracy metrics. DisProtBench provides a reproducible and extensible framework for uncertainty-aware evaluation of protein structure prediction models in realistic biological contexts.
We introduce DisProtBench with the following key contributions:
(1) Database Development: We curate a large benchmark dataset spanning biologically complex IDR scenarios, including thousands of disease-associated human proteins, GPCR–ligand interactions, and multimeric complexes with disorder-mediated interfaces. It captures structural heterogeneity essential for assessing model robustness in realistic contexts.
(2) IDR-centric Toolbox Development: We introduce a unified evaluation toolbox that benchmarks eleven PSPMs on disorder-sensitive downstream tasks, including PPI prediction and drug discovery, using consistent classification, regression, and structural interface metrics. Evaluation is performed via Functional Uncertainty Sensitivity (FUS), which stratifies predictions by task-relevant structural uncertainty rather than confidence alone, enabling diagnosis of task-dependent failure modes across model families.
(3) Visual Analytics Interface Development: The DisProtBench Portal provides 3D visualizations, model comparison heatmaps, and interactive results to explore structure–function links, assess disorder-specific performance, and support hypothesis generation, all without local setup.
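The FUS idea described above can be made concrete with a small sketch. Note that the exact FUS formulation is not spelled out here, so this is a minimal illustration under our own assumptions: the function names, the quantile binning scheme, and the use of a per-example uncertainty score (e.g. mean 100 − pLDDT over task-relevant residues) are ours, not the toolbox API.

```python
import numpy as np

def stratified_metric(uncertainty, correct, n_bins=3):
    """Accuracy per uncertainty bin.

    `uncertainty`: per-example structural-uncertainty score (e.g. mean
    100 - pLDDT over task-relevant residues); `correct`: 0/1 task outcomes.
    Bin edges are uncertainty quantiles, so bins are equally populated.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.quantile(uncertainty, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.digitize(uncertainty, edges[1:-1])   # bin index 0 .. n_bins-1
    return np.array([correct[bins == b].mean() for b in range(n_bins)])

def functional_uncertainty_sensitivity(uncertainty, correct, n_bins=3):
    """One FUS-style summary: performance drop from the lowest- to the
    highest-uncertainty bin (larger = more uncertainty-sensitive)."""
    per_bin = stratified_metric(uncertainty, correct, n_bins)
    return per_bin[0] - per_bin[-1]
```

Under this reading, a task whose accuracy collapses in the high-uncertainty bin (as with IDR-heavy PPI prediction) yields a large sensitivity value, while a task that stays flat across bins yields a value near zero.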
We have open-sourced our benchmark on GitHub, consisting of the following subsets:
| Dataset | Description | # Proteins | Source |
|---|---|---|---|
| DisProt-Based Dataset | Disorder in human disease | | First proposed in our work |
| Protein Interaction Dataset | Disorder-mediated interfaces | | GitHub |
| Individual Protein Dataset | Disorder and ligand binding | | GitHub |
We benchmark state-of-the-art PSPMs spanning diverse architectures, inputs, and structural representations across protein-related tasks, as summarized below:
| PSPM | Task | Architecture | Input | Source | Structural Representation |
|---|---|---|---|---|---|
| AF2 | PPI, Drug | Evoformer | MSA | Paper | Atomic |
| AF3 | PPI, Drug | Evoformer+LLM | MSA + Seq | Paper | Atomic + ligand |
| OpenFold | PPI, Drug | Evoformer | MSA | Paper | Atomic |
| UniFold | PPI, Drug | Evoformer | MSA | Paper | Atomic |
| Boltz | PPI, Drug | Transformer | Seq-only | Paper | Coarse-grained |
| Chai | PPI, Drug | Transformer | Seq-only | Paper | Coarse-grained |
| Protenix | PPI, Drug | Transformer+ | Seq-only | Paper | Atomic + ligand |
| ESMFold | Drug | Transformer | Seq-only | Paper | Coarse-grained |
| OmegaFold | Drug | Transformer | Seq-only | Paper | Coarse-grained |
| RoseTTAFold | Drug | Hybrid (CNN+Attn) | MSA | Paper | Atomic |
| DeepFold | Drug | Custom DL | Seq-only | Paper | Atomic |
We evaluate model performance using a comprehensive set of classification, regression, and structural interface metrics, defined as follows:
| Metric | Definition / Formula | Categorization |
|---|---|---|
| Precision (Positive Predictive Value) | TP / (TP + FP) | Classification Metrics |
| Recall (Sensitivity) | TP / (TP + FN) | Classification Metrics |
| F1 Score | 2 × TP / (2 × TP + FP + FN) | Classification Metrics |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Classification Metrics |
| Mean Absolute Error (MAE) | (1/n) Σ\|yᵢ − ŷᵢ\| | Regression Metrics |
| Mean Squared Error (MSE) | (1/n) Σ(yᵢ − ŷᵢ)² | Regression Metrics |
| Pearson Correlation Coefficient (R) | Σ[(yᵢ − ȳ)(ŷᵢ − ŷ̄)] / √(Σ(yᵢ − ȳ)² × Σ(ŷᵢ − ŷ̄)²) | Regression Metrics |
| Receptor Precision (RP) | size(True ∩ Pred Receptor) / size(Pred Receptor) | Structural Interface Metrics |
| Receptor Recall (RR) | size(True ∩ Pred Receptor) / size(True Receptor) | Structural Interface Metrics |
| Ligand Precision (LP) | size(Pred Ligand ∩ True Receptor) / size(Pred Ligand) | Structural Interface Metrics |
| Ligand Recall (LR) | size(True Ligand ∩ Pred Receptor) / size(True Ligand) | Structural Interface Metrics |
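For concreteness, the tabulated classification and regression formulas can be implemented directly in NumPy. This is a sketch; the helper names are ours, not the toolbox API.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy from binary 0/1 arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),
        "f1":        2 * tp / (2 * tp + fp + fn),
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
    }

def regression_metrics(y, y_hat):
    """MAE, MSE, and Pearson R for real-valued predictions."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    dy, dh = y - y.mean(), y_hat - y_hat.mean()
    return {
        "mae": np.abs(y - y_hat).mean(),
        "mse": ((y - y_hat) ** 2).mean(),
        "r":   np.sum(dy * dh) / np.sqrt(np.sum(dy**2) * np.sum(dh**2)),
    }
```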
For the definitions of Receptor and Ligand, we follow the work *Multi-level analysis of intrinsically disordered protein docking methods*.
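With interface residues represented as sets, the four structural interface metrics in the table above reduce to simple set overlaps; a minimal sketch (function and argument names are ours):

```python
def interface_metrics(true_receptor, pred_receptor, true_ligand, pred_ligand):
    """Receptor/ligand precision and recall over interface-residue sets,
    following the size(A ∩ B) / size(B) definitions in the metrics table."""
    return {
        "RP": len(true_receptor & pred_receptor) / len(pred_receptor),
        "RR": len(true_receptor & pred_receptor) / len(true_receptor),
        "LP": len(pred_ligand & true_receptor) / len(pred_ligand),
        "LR": len(true_ligand & pred_receptor) / len(true_ligand),
    }
```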
For more visualizations, please visit the DisProtBench Portal.
Visual-Interactive Interface: DisProtBench Portal
The table below summarizes RR, RP, LR, and LP scores with 95% CI for each PSPM across three structural confidence thresholds (full sequence,
| Original | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSPM | RR | RP | LR | LP | RR | RP | LR | LP | RR | RP | LR | LP |
| AF2 | 0.749 ± 0.0217 | 0.7186 ± 0.0226 | 0.7052 ± 0.0215 | 0.7486 ± 0.0215 | 0.7729 ± 0.0208 | 0.7426 ± 0.0217 | 0.7313 ± 0.0207 | 0.7719 ± 0.0204 | 0.7959 ± 0.0197 | 0.7663 ± 0.0207 | 0.7569 ± 0.0198 | 0.7946 ± 0.0193 |
| Boltz | 0.7648 ± 0.0155 | 0.7666 ± 0.0153 | 0.7703 ± 0.0155 | 0.7624 ± 0.0140 | 0.7876 ± 0.0147 | 0.7899 ± 0.0146 | 0.7934 ± 0.0148 | 0.7863 ± 0.0142 | 0.8098 ± 0.0139 | 0.8121 ± 0.0138 | 0.8152 ± 0.0140 | 0.8093 ± 0.0134 |
| Chai | 0.7583 ± 0.0158 | 0.7746 ± 0.0149 | 0.7674 ± 0.0155 | 0.7571 ± 0.0155 | 0.7815 ± 0.0150 | 0.7980 ± 0.0142 | 0.7898 ± 0.0147 | 0.7804 ± 0.0148 | 0.8034 ± 0.0142 | 0.8203 ± 0.0134 | 0.8117 ± 0.0139 | 0.8029 ± 0.0140 |
| OpenFold | 0.6626 ± 0.0271 | 0.6127 ± 0.0312 | 0.6551 ± 0.0274 | 0.6322 ± 0.0280 | 0.6904 ± 0.0263 | 0.6386 ± 0.0306 | 0.6828 ± 0.0267 | 0.6593 ± 0.0272 | 0.7182 ± 0.0254 | 0.6644 ± 0.0297 | 0.7111 ± 0.0258 | 0.6861 ± 0.0263 |
| Protenix | 0.7344 ± 0.0164 | 0.7427 ± 0.0162 | 0.7385 ± 0.0159 | 0.7310 ± 0.0159 | 0.7584 ± 0.0158 | 0.7658 ± 0.0155 | 0.7626 ± 0.0152 | 0.7544 ± 0.0155 | 0.7811 ± 0.0150 | 0.7884 ± 0.0147 | 0.7857 ± 0.0144 | 0.7775 ± 0.0144 |
| UniFold | 0.5876 ± 0.0593 | 0.5431 ± 0.0604 | 0.5892 ± 0.0711 | 0.6116 ± 0.0609 | 0.6127 ± 0.0576 | 0.5671 ± 0.0590 | 0.6172 ± 0.0707 | 0.6384 ± 0.0595 | 0.6386 ± 0.0557 | 0.5923 ± 0.0576 | 0.6426 ± 0.0689 | 0.6655 ± 0.0578 |
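The 95% CIs in the table above can be reproduced with a standard percentile bootstrap over per-protein scores. This is a sketch under our own assumptions; the exact CI procedure used to produce the table may differ.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Mean score and percentile-bootstrap 95% CI half-width.

    `scores` is one score (e.g. RR) per protein; we resample proteins with
    replacement, recompute the mean, and take the central 1-alpha interval.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, float)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (hi - lo) / 2
```

A table cell like `0.7648 ± 0.0155` would then correspond to the returned `(mean, half_width)` pair.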
We use PPI prediction as a downstream task for evaluating PSPMs with the following pipeline:
We examine the robustness of PSPMs in predicting PPI, a setting where disordered regions frequently mediate transient or flexible binding. The results are as follows:
| Original | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSPM | Acc | Prec | Rec | F1 | Acc | Prec | Rec | F1 | Acc | Prec | Rec | F1 |
| AF2 | 0.793 | 0.783 | 0.799 | 0.791 | 0.802 | 0.791 | 0.812 | 0.801 | 0.818 | 0.809 | 0.825 | 0.817 |
| AF3 | 0.902 | 0.888 | 0.915 | 0.901 | 0.905 | 0.893 | 0.905 | 0.906 | 0.913 | 0.899 | 0.930 | 0.914 |
| Boltz | 0.850 | 0.848 | 0.853 | 0.850 | 0.858 | 0.853 | 0.863 | 0.858 | 0.869 | 0.870 | 0.868 | 0.869 |
| Chai | 0.850 | 0.841 | 0.863 | 0.852 | 0.858 | 0.847 | 0.873 | 0.860 | 0.869 | 0.857 | 0.887 | 0.871 |
| OpenFold | 0.624 | 0.605 | 0.605 | 0.605 | 0.643 | 0.622 | 0.638 | 0.630 | 0.671 | 0.656 | 0.651 | 0.653 |
| Protenix | 0.810 | 0.809 | 0.812 | 0.810 | 0.819 | 0.820 | 0.818 | 0.819 | 0.834 | 0.834 | 0.835 | 0.834 |
| UniFold | 0.552 | 0.378 | 0.667 | 0.483 | 0.567 | 0.389 | 0.667 | 0.491 | 0.597 | 0.417 | 0.714 | 0.526 |
Heatmaps of -log10(p) values from McNemar tests comparing pairwise model performance on PPI prediction across different structural confidence thresholds.
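Each heatmap cell compares two models' paired per-example correctness. A self-contained sketch using the exact (binomial) form of McNemar's test (function names are ours; any statistics library's implementation would do equally well):

```python
from math import comb, log10

def mcnemar_exact_p(correct_a, correct_b):
    """Exact two-sided McNemar test on paired 0/1 correctness vectors.

    Only discordant pairs matter: b = A right / B wrong, c = A wrong / B right.
    Under H0 each discordant pair is a fair coin flip, so min(b, c) follows
    a Binomial(b + c, 1/2) tail.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)   # the doubled tail can exceed 1 when b == c

def neg_log10_p(correct_a, correct_b):
    """Heatmap cell value: -log10(p)."""
    return -log10(mcnemar_exact_p(correct_a, correct_b))
```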
We use drug discovery as a downstream task for evaluating PSPMs with the following pipeline:
We examine the robustness of PSPMs in structure-based drug discovery, a setting where disordered regions can influence ligand binding. The results are as follows:
| Original | ||||||
|---|---|---|---|---|---|---|
| Model | MAE | R | MAE | R | MAE | R |
| AlphaFold3 | 0.048 | 0.999 | 0.048 | 0.999 | 0.049 | 0.999 |
| Protenix | 0.072 | 0.997 | 0.072 | 0.997 | 0.072 | 0.997 |
| Boltz | 0.079 | 0.996 | 0.079 | 0.996 | 0.080 | 0.996 |
| Chai | 0.096 | 0.995 | 0.096 | 0.995 | 0.097 | 0.995 |
| DeepFold | 0.144 | 0.988 | 0.144 | 0.988 | 0.144 | 0.988 |
| ESMFold | 0.150 | 0.987 | 0.150 | 0.987 | 0.151 | 0.987 |
| OmegaFold | 0.151 | 0.987 | 0.151 | 0.987 | 0.152 | 0.987 |
| OpenFold | 0.151 | 0.987 | 0.151 | 0.987 | 0.151 | 0.987 |
| AlphaFold2 | 0.160 | 0.985 | 0.160 | 0.985 | 0.160 | 0.985 |
| UniFold | 0.183 | 0.981 | 0.183 | 0.981 | 0.184 | 0.981 |
| RoseTTAFold | 0.190 | 0.979 | 0.190 | 0.979 | 0.190 | 0.979 |
Heatmaps of -log10(p) values from Wilcoxon signed-rank tests comparing model performance in drug discovery tasks across different structural confidence thresholds.
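These comparisons pair per-target errors between models. A sketch of the heatmap computation with SciPy's `scipy.stats.wilcoxon` (the dict layout and function name are ours):

```python
import numpy as np
from scipy.stats import wilcoxon

def pairwise_neg_log10_p(errors):
    """Build a -log10(p) heatmap from {model_name: per-target error array}
    using two-sided Wilcoxon signed-rank tests on paired errors."""
    names = list(errors)
    n = len(names)
    heat = np.zeros((n, n))           # diagonal stays 0 (model vs itself)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            _, p = wilcoxon(errors[names[i]], errors[names[j]])
            heat[i, j] = -np.log10(p)
    return names, heat
```

Brighter cells (larger -log10(p)) indicate model pairs whose error distributions differ significantly on the same targets.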
```
DisProtBench/
│
├── Database/
│   ├── DisProt-based Dataset/          # Scripts for extracting and processing DisProt-based data
│   ├── Protein Interaction Dataset/    # Main dataset files (e.g., dataset1200.json)
│   └── Individual Protein Dataset/     # Scripts for individual protein data (e.g., GPCRs)
│
├── Downstream Evaluation/
│   ├── toolbox/                        # Unified CLI and utility scripts for evaluation tasks
│   ├── PPI/                            # PPI task scripts (preprocessing, training, testing)
│   └── CPI-drug discovery/             # CPI/drug discovery task scripts and AlphaFold tools
│
├── disprotbenchmark-metadata.json      # Metadata and dataset description
├── LICENSE                             # Project license (MIT)
└── README.md                           # This file
```
- Python 3.7+
- TensorFlow (for PPI tasks)
- PyTorch (for CPI tasks)
- AlphaFold dependencies (for structure-based tasks)
- Other dependencies: numpy, pandas, scikit-learn, tqdm, biopython, jax, etc.
For AlphaFold-related tasks, install dependencies from `Downstream Evaluation/toolbox/cpi/alphafold/requirements.txt`. Example (in a new environment):

```shell
pip install -r Downstream\ Evaluation/toolbox/cpi/alphafold/requirements.txt
```

Clone the repository and install the required dependencies for your tasks of interest.
All major PPI and CPI tasks can be run from a single entry point:

```shell
cd Downstream\ Evaluation/toolbox
python toolbox.py <ppi|cpi> <subcommand> [options]
```

- Train a PPI model:

  ```shell
  python toolbox.py ppi train \
      --model DenseNet3D \
      --datapath ./data/example_dataset/distance \
      --train_set ./data/example_dataset/part_0_train.csv \
      --test_set ./data/example_dataset/part_0_val.csv \
      --savingPath ./models/example_model
  ```

- Test a PPI model:

  ```shell
  python toolbox.py ppi test \
      --model DenseNet3D \
      --datapath ./data/example_dataset/distance \
      --weights ./models/example_model/<timestamp>/best_model.weights.h5 \
      --output ./models/example_model/<timestamp>/preds.npy \
      --test_set ./data/example_dataset/part_0_test.csv
  ```

- Prepare a CPI dataset:

  ```shell
  python toolbox.py cpi prepare_data \
      --dataset data/original/top20_raw.csv \
      --gpcr-col uniprot_id \
      --smiles-col smiles \
      --label-col pKi \
      --file-name-col inchi_key \
      --rep-path data/representations/top20/{}.npy \
      --save-path data/ligands/top20/imgs \
      --anno-path data/ligands/top20/anno \
      -j 12 \
      --test-size 0.3 \
      --task regression
  ```

- Train a CPI model:

  ```shell
  python toolbox.py cpi train --cfg configs/train/top20.yml
  ```

- Predict with a CPI model:

  ```shell
  python toolbox.py cpi predict \
      --cfg configs/prediction/pain.yml \
      --data-dir data/pred/fda \
      --rep-path data/representations/pain/P08908.npy \
      --out-dir output/prediction/fda/P08908
  ```
For generating protein structures, refer directly to the instructions of the respective models.
This project is licensed under the MIT License. See the LICENSE file for details.






