Evaluating Multimodal Large Language Models Across the Full Lifecycle of Polymer Science
Overview • Highlights • Task Design • Getting Started • Evaluation Pipeline • Citation
Website: https://polyreal-benchmark.github.io/
PolyReal is a multimodal benchmark for real-world polymer science workflows.
It is designed to evaluate Multimodal Large Language Models (MLLMs) on the full lifecycle of polymer experimentation, rather than only isolated, simplified tasks.
Unlike prior chemistry or materials benchmarks that mostly focus on single-step problems, PolyReal emphasizes workflow-oriented evaluation grounded in authentic scientific practice, covering:
- Foundational knowledge application
- Lab safety analysis
- Experiment mechanism reasoning
- Raw data extraction and analysis
- Performance and application exploration
The benchmark contains 545 high-quality question-answer pairs built from real experimental scenarios, including lab images, spectra, mechanism diagrams, and raw CSV data.
PolyReal evaluates MLLMs across five key stages of real-world polymer science workflows.
- **Real-world workflow coverage**: PolyReal is built around the actual lifecycle of polymer science research, rather than disconnected toy tasks.
- **Multimodal and practice-grounded**: The benchmark includes diverse inputs such as spectra, charts, mechanism diagrams, lab scenes, and structured/unstructured raw data.
- **High-value scientific evaluation**: It targets challenging capabilities that matter in real research environments, including safety understanding, data interpretation, and application reasoning.
- **Beyond multiple-choice**: PolyReal includes open-ended questions and ranking tasks, enabling finer-grained assessment of scientific reasoning quality.
- **Strong diagnostic value**: The benchmark exposes a key weakness of current MLLMs: they often perform better on knowledge-heavy reasoning than on practice-oriented, context-dependent scientific tasks.
- Domain: Polymer Science
- Benchmark Type: Multimodal scientific evaluation
- Total Samples: 545
- Question Formats:
- Open-ended question answering
- Numerical extraction
- Ranking tasks
- Data Sources:
- Real experimental scenarios
- Scientific figures and mechanism diagrams
- Lab photos
- Spectra and raw CSV files
PolyReal is organized into five independent yet workflow-aligned modules:

1. **Foundational Knowledge Application**: Evaluates whether models can apply core scientific principles to realistic polymer science scenarios.
2. **Lab Safety Analysis**: Tests visual scene understanding and hazard identification in cluttered, real-world laboratory environments.
3. **Experiment Mechanism Reasoning**: Assesses causal and procedural reasoning over reaction diagrams, structures, and scientific processes.
4. **Raw Data Extraction and Analysis**: Measures a model's ability to parse and interpret raw scientific data, including NMR, IR, plots, and CSV-based data.
5. **Performance and Application Exploration**: Evaluates high-level reasoning on structure–property relationships and suitability for downstream applications.
Modern MLLMs are increasingly strong in general multimodal reasoning, yet scientific practice requires more than broad knowledge.
PolyReal is designed to answer a more demanding question:
Can multimodal models operate reliably in authentic scientific workflows, where safety, mechanism understanding, raw data interpretation, and application reasoning are tightly connected?
Our benchmark shows that even strong models still face substantial challenges when moving from abstract scientific knowledge to practical, real-world scientific decision-making.
- Leading closed-source models achieve the strongest overall performance.
- Models tend to perform better on knowledge-intensive reasoning tasks.
- Performance drops substantially on practice-based tasks, especially:
- Lab Safety Analysis
- Raw Data Extraction and Analysis
- Many strong models show high recall but relatively low precision, suggesting a tendency toward verbose yet noisy scientific answers.
- Performance also varies notably across different polymer sub-fields.
Hugging Face Dataset: `weidawang/PolyReal`
After downloading, the project root should contain:

```
PolyReal.json   # Main dataset
ref/            # Reference images, spectra, CSVs, and other auxiliary files
```
```bash
git clone git@github.com:wanhaoliu/PolyReal.git
cd PolyReal
pip install -r requirements.txt
pip install huggingface_hub
```
```bash
python - <<'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="weidawang/PolyReal",
    repo_type="dataset",
    local_dir=".",
)
EOF
```

Set your API endpoint and key before running inference:
```bash
export POLYREAL_API_BASE_URL="https://your-api-host.example.com"
export POLYREAL_API_KEY="your-api-key"
```

If you use `--model intern-s1`, also set:

```bash
export INTERN_S1_API_BASE_URL="https://your-intern-s1-host.example.com/api"
export INTERN_S1_API_KEY="your-intern-s1-api-key"
```

`POLYREAL_API_BASE_URL` should point to an OpenAI-compatible `/v1/chat/completions` endpoint, such as OpenAI, Together, OpenRouter, or a local vLLM deployment.
```bash
python test.py --model gpt-4o
```

Results will be saved to:

```
result/gpt-4o/results_gpt-4o.jsonl
```
Already-processed items are automatically skipped, so the script supports resume.
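The resume behavior can be approximated as follows (an illustrative sketch, not the exact logic in `test.py`; the helper name `pending_items` is hypothetical): completed question IDs are read back from the output JSONL, and only unseen items are dispatched.

```python
import json
from pathlib import Path

def pending_items(dataset: list, output_path: Path, id_key: str = "id") -> list:
    """Return dataset items whose IDs are not yet present in the output JSONL,
    so a re-run only processes what is missing."""
    done = set()
    if output_path.exists():
        with output_path.open() as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)[id_key])
    return [item for item in dataset if item[id_key] not in done]
```

Because each finished item is appended to the JSONL immediately, an interrupted run loses at most the in-flight requests.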
| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | Model name sent to the API |
| `--workers` | `10` | Number of concurrent threads |
| `--input_file` | `PolyReal.json` | Dataset path |
| `--image_dir` | `ref/` | Directory containing reference images and CSVs |
| `--output_dir` | `result/` | Root output directory |
```bash
python eval_precision.py --model gpt-4o --eval_model gemini-2.5-flash
```

This stage uses a second LLM evaluator to compute:

Precision = TP / (TP + FP)

Output:

```
result/gpt-4o/precision_gpt-4o.jsonl
```
```bash
python eval_recall.py --model gpt-4o --eval_model gemini-2.5-flash
```

This stage measures answer completeness and coverage of key scoring points.

Output:

```
result/gpt-4o/recall_gpt-4o.jsonl
```
```bash
python eval_ranking.py
```

This script computes ranking-task metrics across all model folders under `result/`.

Output:

```
result/{model}/ranking_{model}.jsonl
```
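The exact ranking metric is defined inside `eval_ranking.py`; one common choice for comparing a predicted ordering against a gold ordering is pairwise agreement (a Kendall-tau-style score). The sketch below is purely illustrative and is not guaranteed to match the script's implementation:

```python
from itertools import combinations

def pairwise_agreement(pred: list, gold: list) -> float:
    """Fraction of item pairs that pred orders the same way as gold.
    1.0 means identical ordering, 0.0 means fully reversed."""
    pos_pred = {item: i for i, item in enumerate(pred)}
    pos_gold = {item: i for i, item in enumerate(gold)}
    pairs = list(combinations(gold, 2))
    agree = sum(
        1 for a, b in pairs
        if (pos_pred[a] < pos_pred[b]) == (pos_gold[a] < pos_gold[b])
    )
    return agree / len(pairs)
```

A perfect ranking scores 1.0; swapping one adjacent pair in a three-item ranking drops the score to 2/3.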
PolyReal uses a more rigorous evaluation protocol than simple exact-match accuracy.
We report:
- Precision (P): correctness of the answer
- Recall (R): completeness of the answer
- F1 Score (F1): harmonic mean of precision and recall
For each question, domain experts define Key Points that capture the essential scientific content expected in a correct answer.
This allows PolyReal to evaluate not only whether a model answers, but also how well it answers in a scientifically meaningful way.
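Under this key-point protocol, the metrics reduce to standard counts: a claim in the model's answer that matches an expert Key Point is a true positive (TP), an unmatched model claim is a false positive (FP), and a missed Key Point is a false negative (FN). A minimal sketch of the arithmetic (function name illustrative):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute (P, R, F1) from key-point match counts, guarding zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, an answer matching 3 of 5 Key Points while adding 1 unsupported claim scores P = 3/4 and R = 3/5, illustrating the "high recall, low precision" pattern reported above for verbose models.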
```
PolyReal.json            # Main dataset
ref/                     # Images, spectra, CSVs, and other reference files
test.py                  # Inference script
eval_precision.py        # Precision evaluation
eval_recall.py           # Recall evaluation
eval_ranking.py          # Ranking evaluation
open_source_config.py    # API and path configuration helpers
requirements.txt         # Python dependencies
result/                  # Output directory for inference and evaluation
logs/                    # Log files
```
- API keys are loaded from environment variables only.
- `eval_precision.py` and `eval_recall.py` automatically skip ranking questions.
- Use `eval_ranking.py` for ranking-task evaluation.
- Please make sure to add your final open-source license file before public release.
If you find PolyReal useful in your research, please cite:
```bibtex
@article{liu2026polyreal,
  title={PolyReal: A Benchmark for Real-World Polymer Science Workflows},
  author={Liu, Wanhao and Wang, Weida and Xie, Jiaqing and Yang, Suorong and Wang, Jue and Chen, Benteng and Mei, Guangtao and Yang, Zonglin and Zhang, Shufei and Mo, Yuchun and Cheng, Lang and Zeng, Jin and Li, Houqiang and Ouyang, Wanli and Li, Yuqiang},
  journal={arXiv preprint arXiv:2604.02934},
  year={2026}
}
```

We thank all contributors and domain experts involved in the construction and validation of PolyReal.
If PolyReal is helpful to your work, please consider giving this repository a star.


