Large Language Models (LLMs) are predominantly evaluated on static benchmarks, which fail to capture real-world distribution shifts, semantic perturbations, and compositional input evolution.
Do LLMs remain robust when evaluation data evolves systematically, across both close-ended (option-based) and open-ended tasks?
To answer this question, we introduce AutoEvoEval — an automated, interpretable, and composable framework for evolving evaluation datasets to assess LLM robustness across both close-ended and open-ended tasks.
AutoEvoEval enables controlled dataset evolution through:
- 22 atomic evolution operations (for close-ended tasks)
- 4 semantic-preserving and noise-invariant operations (for open-ended tasks)
- Multi-round compositional evolution pipelines
- Robustness measurement across task formats

Figure 1. AutoEvoEval pipeline: dataset → atomic operations → multi-round evolution → model evaluation.
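To make the pipeline concrete, here is a minimal Python sketch of how atomic operations might compose into multi-round evolution chains. The operation names follow Table 1; the item layout, helper names, and distractor text are our illustration, not the paper's implementation.

```python
import random
from typing import Callable

# An atomic operation maps one multiple-choice item to an evolved item.
AtomicOp = Callable[[dict], dict]

def shuffle_opt_order(item: dict) -> dict:
    """ShuffleOptOrder: permute the options while tracking the gold answer."""
    options = item["options"][:]
    gold = options[item["answer"]]
    random.shuffle(options)
    return {**item, "options": options, "answer": options.index(gold)}

def add_irr_opts(item: dict) -> dict:
    """AddIrrOpts: append an irrelevant distractor option."""
    return {**item, "options": item["options"] + ["The Treaty of Westphalia"]}

def evolve(item: dict, pipeline: list[AtomicOp]) -> dict:
    """Multi-round evolution: apply one atomic operation per round."""
    for op in pipeline:
        item = op(item)
    return item

item = {"question": "Which element has atomic number 1?",
        "options": ["Helium", "Hydrogen", "Oxygen"],
        "answer": 1}
print(evolve(item, [add_irr_opts, shuffle_opt_order]))
```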
How do different types of atomic perturbations affect model performance?
| Method | History | Math | Medicine | Psychology | AVG |
|---|---|---|---|---|---|
| AbbrOptCont | -1.788 | -0.125 | -0.983 | -3.085 | -1.495 |
| AbbrQ | -5.264 | -3.000 | -8.746 | -9.254 | -6.566 |
| AddAboveWrong | -19.135 | -25.000 | -41.966 | -36.890 | -30.748 |
| AddIrrOpts | -3.950 | -4.750 | -2.195 | -4.175 | -3.768 |
| AddStrongDist | -7.484 | -4.875 | -2.624 | -3.894 | -4.719 |
| OptToJudge | -17.663 | -14.750 | -9.431 | -14.863 | -14.177 |
| ExpandOptsIrr | +4.167 | +19.375 | +8.591 | +5.536 | +9.417 |
| ExpandOptsRel | +5.063 | +17.125 | +8.896 | +6.290 | +9.344 |
| ExpandQuesIrr | +0.570 | -3.375 | -1.409 | -2.493 | -1.677 |
| ExpandQuesRel | +3.228 | -3.500 | +2.458 | +4.352 | +1.634 |
| InsertIrrChars | -6.377 | -10.800 | -14.206 | -13.153 | -11.134 |
| RevQ | -49.884 | -26.375 | -49.569 | -49.248 | -43.769 |
| RewriteOpt | -3.107 | -0.375 | -6.159 | -8.233 | -4.468 |
| RewriteOptRAG | -12.447 | -- | -20.548 | -20.753 | -17.916 |
| RewriteQ | -1.719 | -1.000 | -1.813 | -0.859 | -1.348 |
| RewriteQRAG | -0.717 | -- | -4.073 | +0.361 | -1.476 |
| ShuffleOptIds | -3.924 | -2.625 | -2.295 | -4.393 | -3.309 |
| ShuffleOptOrder | -2.896 | -5.500 | -2.495 | -4.674 | -3.891 |
| SwapQOpt | -12.368 | -13.000 | -11.269 | -16.864 | -13.375 |
| TransOptEnZh | +0.016 | -3.125 | -2.718 | -6.177 | -3.001 |
| TransQEnZh | -4.198 | -7.250 | -10.534 | -10.602 | -8.146 |
| UpdateOptIds | -4.789 | -3.625 | -4.055 | -10.085 | -5.639 |
| AVG | -6.576 | -4.828 | -8.052 | -9.234 | -7.283 |
Table 1. Average accuracy change (percentage points) per evolution operation, averaged across all models.
- RevQ (logic reversal): −43.77
- AddAboveWrong (None of the above): −30.75
- Average degradation: −7.28
- Models are highly sensitive to both structural and semantic edits.
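For reference, the numbers in Table 1 are accuracy deltas in percentage points (evolved minus original). A minimal sketch of that computation, under our assumed prediction format:

```python
def accuracy(preds: list[str], golds: list[str]) -> float:
    """Accuracy in percent over parallel prediction/gold lists."""
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_delta(orig_preds, evo_preds, orig_golds, evo_golds) -> float:
    """Percentage-point accuracy change after evolution; negative = degradation.
    Gold labels are tracked separately because some operations (e.g.,
    ShuffleOptIds) change which option letter is correct."""
    return accuracy(evo_preds, evo_golds) - accuracy(orig_preds, orig_golds)
```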
Do models maintain correct answers under minor and moderate perturbations?
| Model | RQ | SOO | IIC | UOI | COTJ | SQWO | Others | AVG |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 0.912 | 0.930 | 0.928 | 0.926 | 0.846 | 0.908 | 0.857 | 0.871 |
| DeepSeek-V3 | 0.909 | 0.832 | 0.818 | 0.882 | 0.643 | 0.752 | 0.853 | 0.840 |
| Gemini-1.5 | 0.897 | 0.810 | 0.810 | 0.777 | 0.673 | 0.608 | 0.766 | 0.765 |
| GLM-4 | 0.897 | 0.860 | 0.698 | 0.900 | 0.567 | 0.717 | 0.804 | 0.795 |
| Llama-3.1 | 0.579 | 0.236 | 0.251 | 0.161 | 0.410 | 0.201 | 0.460 | 0.418 |
| Mistral-small | 0.833 | 0.801 | 0.609 | 0.804 | 0.574 | 0.518 | 0.747 | 0.731 |
| GPT-3.5 | 0.896 | 0.872 | 0.699 | 0.914 | 0.598 | 0.721 | 0.804 | 0.799 |
| GPT-4 | 0.909 | 0.855 | 0.733 | 0.885 | 0.586 | 0.734 | 0.805 | 0.799 |
| Average | 0.854 | 0.774 | 0.693 | 0.781 | 0.612 | 0.645 | -- | -- |
Table 2. Recall of Performance (ROP) for various models across perturbation types; column abbreviations refer to atomic operations from Table 1.
- AutoEvoEval replicates and extends PertEval with richer transformations.
- Stronger models (GPT-4, DeepSeek-R1) show higher robustness.
- ROP drops below 0.5 for weaker models under realistic edits.
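On our reading, ROP follows PertEval's recall-style metric: the fraction of items a model answered correctly before perturbation that it still answers correctly after. A sketch under that assumption:

```python
def recall_of_performance(orig_correct: list[bool], evo_correct: list[bool]) -> float:
    """Of the items answered correctly on the original set, the fraction
    still answered correctly after perturbation (our reading of ROP)."""
    kept = sum(o and e for o, e in zip(orig_correct, evo_correct))
    base = sum(orig_correct)
    return kept / base if base else 0.0
```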
Do combinations of two operations amplify performance degradation?
| Method | Avg Drop |
|---|---|
| AddIrrOpts | -3.950 |
| AddStrongDist | -7.484 |
| ShuffleOptIds | -3.924 |
| ShuffleOptOrder | -2.896 |
| AddIrrOpts + AddIrrOpts | -6.809 |
| AddIrrOpts + ShuffleOptIds | -8.623 |
| AddIrrOpts + ShuffleOptOrder | -9.525 |
| AddStrongDist + AddIrrOpts | -8.629 |
| AddStrongDist + AddStrongDist | -7.874 |
Table 3. Average accuracy drop (percentage points) for single operations versus two-operation compositions.
Combinations lead to non-linear, compounding degradation, with effects stronger than the individual perturbations:
- AddIrrOpts + ShuffleOptOrder causes −9.5 points of average degradation.
- Some effects compound beyond the sum of the individual impacts.
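A quick arithmetic check on Table 3 makes the non-linearity concrete: if effects were additive, AddIrrOpts (−3.950) plus ShuffleOptOrder (−2.896) would predict a −6.846 drop, but the observed composed drop is −9.525.

```python
single = {"AddIrrOpts": -3.950, "ShuffleOptOrder": -2.896}
observed = -9.525  # AddIrrOpts + ShuffleOptOrder, Table 3

additive = sum(single.values())  # -6.846 if effects simply added up
excess = observed - additive     # -2.679: extra, super-additive damage
print(f"additive {additive:.3f} vs observed {observed:.3f} (excess {excess:.3f})")
```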
Do longer evolution chains lead to greater robustness failure?
| Method | DeepSeek-R1 | DeepSeek-V3 | Gemini-1.5 | GLM-4 | Llama-3.1 | Mistral-small | GPT-3.5 | GPT-4 | AVG Drop |
|---|---|---|---|---|---|---|---|---|---|
| Origin | 85.232 | 89.451 | 81.857 | 82.700 | 29.451 | 75.105 | 82.700 | 83.122 | -- |
| Rule | -6.751 | -45.148 | -71.308 | -68.354 | -23.882 | -70.886 | -68.776 | -68.354 | -52.932 |
| LLM | -27.848 | -27.848 | -43.038 | -29.114 | -7.257 | -37.553 | -28.692 | -28.692 | -28.755 |
| Hybrid | -18.017 | -24.219 | -35.865 | -21.688 | -17.637 | -23.376 | -20.844 | -22.110 | -22.969 |
Table 4. Performance degradation across evolution pipelines (Rule-based, LLM-based, Hybrid).
| Chain Type | Avg Drop |
|---|---|
| Rule-based | −52.93 |
| LLM-based | −28.76 |
| Hybrid | −22.97 |
Even top-tier models (e.g., GPT-4) exhibit substantial performance degradation under multi-step evolution.
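Table 4 contrasts three ways of building evolution chains. Below is a minimal sketch of what such chain construction could look like; the pool groupings are our assumption, not the paper's taxonomy.

```python
import random

RULE_OPS = ["RevQ", "AddAboveWrong", "SwapQOpt", "ShuffleOptOrder"]  # deterministic rewrites
LLM_OPS = ["RewriteQ", "RewriteOpt", "ExpandQuesRel", "AbbrQ"]       # model-driven rewrites

def build_chain(kind: str, rounds: int = 3) -> list[str]:
    """Assemble a multi-round evolution chain of the given pipeline type."""
    pools = {"Rule": RULE_OPS, "LLM": LLM_OPS, "Hybrid": RULE_OPS + LLM_OPS}
    return [random.choice(pools[kind]) for _ in range(rounds)]

print(build_chain("Hybrid"))  # e.g., ['RewriteQ', 'SwapQOpt', 'AbbrQ']
```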
Does dataset evolution generalize to open-ended tasks?
| Dataset | Rewrite | Shrink | AddIr | InsertText |
|---|---|---|---|---|
| BoolQ | -1.14 | -1.14 | -2.86 | -5.57 |
| DROP | +0.43 | -1.14 | +3.00 | -2.43 |
| GSM8K | -15.29 | -3.86 | -10.57 | -5.71 |
| Average | -5.33 | -2.05 | -3.48 | -4.57 |
Table 5. Accuracy change (percentage points) on open-ended benchmarks under the four open-ended evolution operations.
Overall average drop across datasets: −3.86
These results confirm that dataset evolution generalizes beyond multiple-choice formats and consistently exposes robustness weaknesses in open-ended reasoning tasks.
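As an illustration of an open-ended operation, here is a sketch of what InsertText-style noise injection might look like; the noise string and splice point are ours, not the paper's.

```python
def insert_text(question: str,
                noise: str = "Incidentally, the weather that day was mild.") -> str:
    """InsertText-style evolution: splice an irrelevant sentence into an
    open-ended question without touching the information needed to answer."""
    sentences = question.split(". ")
    sentences.insert(max(1, len(sentences) // 2), noise.rstrip("."))
    return ". ".join(sentences)

print(insert_text("Natalia sold clips to 48 friends in April. "
                  "She sold half as many in May. How many clips in total?"))
```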
Key contributions:
- 22 interpretable atomic operations for close-ended evaluation
- 4 evolution operations for open-ended tasks
- Multi-round compositional pipelines
- Evaluation across 8 LLMs and 4 domains
- Cross-format validation on 3 open-ended benchmarks
- Empirical evidence that static benchmarks overestimate robustness
Static benchmarks may overestimate generalization.
AutoEvoEval demonstrates that:
- Controlled input evolution consistently degrades performance
- Composition depth amplifies robustness failure
- Vulnerabilities persist across both close-ended and open-ended tasks
AutoEvoEval provides a composable, interpretable, and format-agnostic robustness evaluation framework for LLMs.