Large Language Models (LLMs) are predominantly evaluated on static benchmarks, which fail to capture real-world distribution shifts, semantic perturbations, and compositional input evolution.
Do LLMs remain robust when evaluation data evolves systematically, across both close-ended (option-based) and open-ended tasks?
To answer this question, we introduce AutoEvoEval — an automated, interpretable, and composable framework for evolving evaluation datasets to assess LLM robustness across both close-ended and open-ended tasks.
AutoEvoEval enables controlled dataset evolution through:
- 22 atomic evolution operations (for close-ended tasks)
- 4 semantic-preserving and noise-invariant operations (for open-ended tasks)
- Multi-round compositional evolution pipelines
- Robustness measurement across task formats

Figure 1. AutoEvoEval pipeline: dataset → atomic operations → multi-round evolution → model evaluation.
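To make the pipeline concrete, here is a minimal Python sketch of how atomic operations might compose into multi-round evolution chains. The operation names follow Table 1; the item layout, helper names, and distractor text are our illustration, not the paper's implementation.

```python
import random
from typing import Callable

# An atomic operation maps one multiple-choice item to an evolved item.
AtomicOp = Callable[[dict], dict]

def shuffle_opt_order(item: dict) -> dict:
    """ShuffleOptOrder: permute the options while tracking the gold answer."""
    options = item["options"][:]
    gold = options[item["answer"]]
    random.shuffle(options)
    return {**item, "options": options, "answer": options.index(gold)}

def add_irr_opts(item: dict) -> dict:
    """AddIrrOpts: append an irrelevant distractor option."""
    return {**item, "options": item["options"] + ["The Treaty of Westphalia"]}

def evolve(item: dict, pipeline: list[AtomicOp]) -> dict:
    """Multi-round evolution: apply one atomic operation per round."""
    for op in pipeline:
        item = op(item)
    return item

item = {"question": "Which element has atomic number 1?",
        "options": ["Helium", "Hydrogen", "Oxygen"],
        "answer": 1}
print(evolve(item, [add_irr_opts, shuffle_opt_order]))
```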
How do different types of atomic perturbations affect model performance?
| Method | History | Math | Medicine | Psychology | AVG |
|---|---|---|---|---|---|
| AbbrOptCont | -1.788 | -0.125 | -0.983 | -3.085 | -1.495 |
| AbbrQ | -5.264 | -3.000 | -8.746 | -9.254 | -6.566 |
| AddAboveWrong | -19.135 | -25.000 | -41.966 | -36.890 | -30.748 |
| AddIrrOpts | -3.950 | -4.750 | -2.195 | -4.175 | -3.768 |
| AddStrongDist | -7.484 | -4.875 | -2.624 | -3.894 | -4.719 |
| OptToJudge | -17.663 | -14.750 | -9.431 | -14.863 | -14.177 |
| ExpandOptsIrr | +4.167 | +19.375 | +8.591 | +5.536 | +9.417 |
| ExpandOptsRel | +5.063 | +17.125 | +8.896 | +6.290 | +9.344 |
| ExpandQuesIrr | +0.570 | -3.375 | -1.409 | -2.493 | -1.677 |
| ExpandQuesRel | +3.228 | -3.500 | +2.458 | +4.352 | +1.634 |
| InsertIrrChars | -6.377 | -10.800 | -14.206 | -13.153 | -11.134 |
| RevQ | -49.884 | -26.375 | -49.569 | -49.248 | -43.769 |
| RewriteOpt | -3.107 | -0.375 | -6.159 | -8.233 | -4.468 |
| RewriteOptRAG | -12.447 | -- | -20.548 | -20.753 | -17.916 |
| RewriteQ | -1.719 | -1.000 | -1.813 | -0.859 | -1.348 |
| RewriteQRAG | -0.717 | -- | -4.073 | +0.361 | -1.476 |
| ShuffleOptIds | -3.924 | -2.625 | -2.295 | -4.393 | -3.309 |
| ShuffleOptOrder | -2.896 | -5.500 | -2.495 | -4.674 | -3.891 |
| SwapQOpt | -12.368 | -13.000 | -11.269 | -16.864 | -13.375 |
| TransOptEnZh | +0.016 | -3.125 | -2.718 | -6.177 | -3.001 |
| TransQEnZh | -4.198 | -7.250 | -10.534 | -10.602 | -8.146 |
| UpdateOptIds | -4.789 | -3.625 | -4.055 | -10.085 | -5.639 |
| AVG | -6.576 | -4.828 | -8.052 | -9.234 | -7.283 |
Table 1. Average accuracy change (percentage points) per evolution operation, averaged across all models.
- RevQ (logic reversal): −43.77
- AddAboveWrong (None of the above): −30.75
- Average degradation: −7.28
- Models are highly sensitive to both structural and semantic edits.
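For reference, the numbers in Table 1 are accuracy deltas in percentage points (evolved minus original). A minimal sketch of that computation, under our assumed prediction format:

```python
def accuracy(preds: list[str], golds: list[str]) -> float:
    """Accuracy in percent over parallel prediction/gold lists."""
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_delta(orig_preds, evo_preds, orig_golds, evo_golds) -> float:
    """Percentage-point accuracy change after evolution; negative = degradation.
    Gold labels are tracked separately because some operations (e.g.,
    ShuffleOptIds) change which option letter is correct."""
    return accuracy(evo_preds, evo_golds) - accuracy(orig_preds, orig_golds)
```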
Do models maintain correct answers under minor and moderate perturbations?
| Model | RQ | SOO | IIC | UOI | COTJ | SQWO | Others | AVG |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 0.912 | 0.930 | 0.928 | 0.926 | 0.846 | 0.908 | 0.857 | 0.871 |
| DeepSeek-V3 | 0.909 | 0.832 | 0.818 | 0.882 | 0.643 | 0.752 | 0.853 | 0.840 |
| Gemini-1.5 | 0.897 | 0.810 | 0.810 | 0.777 | 0.673 | 0.608 | 0.766 | 0.765 |
| GLM-4 | 0.897 | 0.860 | 0.698 | 0.900 | 0.567 | 0.717 | 0.804 | 0.795 |
| Llama-3.1 | 0.579 | 0.236 | 0.251 | 0.161 | 0.410 | 0.201 | 0.460 | 0.418 |
| Mistral-small | 0.833 | 0.801 | 0.609 | 0.804 | 0.574 | 0.518 | 0.747 | 0.731 |
| GPT-3.5 | 0.896 | 0.872 | 0.699 | 0.914 | 0.598 | 0.721 | 0.804 | 0.799 |
| GPT-4 | 0.909 | 0.855 | 0.733 | 0.885 | 0.586 | 0.734 | 0.805 | 0.799 |
| Average | 0.854 | 0.774 | 0.693 | 0.781 | 0.612 | 0.645 | -- | -- |
Table 2. Recall of Performance (ROP) for various models across perturbation types; column abbreviations refer to atomic operations from Table 1.
- AutoEvoEval replicates and extends PertEval with richer transformations.
- Stronger models (GPT-4, DeepSeek-R1) show higher robustness.
- ROP drops below 0.5 for weaker models under realistic edits.
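On our reading, ROP follows PertEval's recall-style metric: the fraction of items a model answered correctly before perturbation that it still answers correctly after. A sketch under that assumption:

```python
def recall_of_performance(orig_correct: list[bool], evo_correct: list[bool]) -> float:
    """Of the items answered correctly on the original set, the fraction
    still answered correctly after perturbation (our reading of ROP)."""
    kept = sum(o and e for o, e in zip(orig_correct, evo_correct))
    base = sum(orig_correct)
    return kept / base if base else 0.0
```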
Do combinations of two operations amplify performance degradation?
| Method | Avg Drop |
|---|---|
| AddIrrOpts | -3.950 |
| AddStrongDist | -7.484 |
| ShuffleOptIds | -3.924 |
| ShuffleOptOrder | -2.896 |
| AddIrrOpts + AddIrrOpts | -6.809 |
| AddIrrOpts + ShuffleOptIds | -8.623 |
| AddIrrOpts + ShuffleOptOrder | -9.525 |
| AddStrongDist + AddIrrOpts | -8.629 |
| AddStrongDist + AddStrongDist | -7.874 |
Table 3. Average accuracy drop (percentage points) for single operations versus two-operation compositions.
Combinations lead to non-linear, compounding degradation, with effects stronger than the individual perturbations:
- AddIrrOpts + ShuffleOptOrder causes −9.5 points of average degradation.
- Some effects compound beyond the sum of the individual impacts.
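A quick arithmetic check on Table 3 makes the non-linearity concrete: if effects were additive, AddIrrOpts (−3.950) plus ShuffleOptOrder (−2.896) would predict a −6.846 drop, but the observed composed drop is −9.525.

```python
single = {"AddIrrOpts": -3.950, "ShuffleOptOrder": -2.896}
observed = -9.525  # AddIrrOpts + ShuffleOptOrder, Table 3

additive = sum(single.values())  # -6.846 if effects simply added up
excess = observed - additive     # -2.679: extra, super-additive damage
print(f"additive {additive:.3f} vs observed {observed:.3f} (excess {excess:.3f})")
```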
Do longer evolution chains lead to greater robustness failure?
| Method | DeepSeek-R1 | DeepSeek-V3 | Gemini-1.5 | GLM-4 | Llama-3.1 | Mistral-small | GPT-3.5 | GPT-4 | AVG Drop |
|---|---|---|---|---|---|---|---|---|---|
| Origin | 85.232 | 89.451 | 81.857 | 82.700 | 29.451 | 75.105 | 82.700 | 83.122 | -- |
| Rule | -6.751 | -45.148 | -71.308 | -68.354 | -23.882 | -70.886 | -68.776 | -68.354 | -52.932 |
| LLM | -27.848 | -27.848 | -43.038 | -29.114 | -7.257 | -37.553 | -28.692 | -28.692 | -28.755 |
| Hybrid | -18.017 | -24.219 | -35.865 | -21.688 | -17.637 | -23.376 | -20.844 | -22.110 | -22.969 |
Table 4. Performance degradation across evolution pipelines (Rule-based, LLM-based, Hybrid).
| Chain Type | Avg Drop |
|---|---|
| Rule-based | −52.93 |
| LLM-based | −28.76 |
| Hybrid | −22.97 |
Even top-tier models (e.g., GPT-4) exhibit substantial performance degradation under multi-step evolution.
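Table 4 contrasts three ways of building evolution chains. Below is a minimal sketch of what such chain construction could look like; the pool groupings are our assumption, not the paper's taxonomy.

```python
import random

RULE_OPS = ["RevQ", "AddAboveWrong", "SwapQOpt", "ShuffleOptOrder"]  # deterministic rewrites
LLM_OPS = ["RewriteQ", "RewriteOpt", "ExpandQuesRel", "AbbrQ"]       # model-driven rewrites

def build_chain(kind: str, rounds: int = 3) -> list[str]:
    """Assemble a multi-round evolution chain of the given pipeline type."""
    pools = {"Rule": RULE_OPS, "LLM": LLM_OPS, "Hybrid": RULE_OPS + LLM_OPS}
    return [random.choice(pools[kind]) for _ in range(rounds)]

print(build_chain("Hybrid"))  # e.g., ['RewriteQ', 'SwapQOpt', 'AbbrQ']
```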
Does dataset evolution generalize to open-ended tasks?
| Dataset | Rewrite | Shrink | AddIr | InsertText |
|---|---|---|---|---|
| BoolQ | -1.14 | -1.14 | -2.86 | -5.57 |
| DROP | +0.43 | -1.14 | +3.00 | -2.43 |
| GSM8K | -15.29 | -3.86 | -10.57 | -5.71 |
| Average | -5.33 | -2.05 | -3.48 | -4.57 |
Table 5. Accuracy change (percentage points) on open-ended benchmarks under the four open-ended evolution operations.
Overall average drop across datasets: −3.86
These results confirm that dataset evolution generalizes beyond multiple-choice formats and consistently exposes robustness weaknesses in open-ended reasoning tasks.
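As an illustration of an open-ended operation, here is a sketch of what InsertText-style noise injection might look like; the noise string and splice point are ours, not the paper's.

```python
def insert_text(question: str,
                noise: str = "Incidentally, the weather that day was mild.") -> str:
    """InsertText-style evolution: splice an irrelevant sentence into an
    open-ended question without touching the information needed to answer."""
    sentences = question.split(". ")
    sentences.insert(max(1, len(sentences) // 2), noise.rstrip("."))
    return ". ".join(sentences)

print(insert_text("Natalia sold clips to 48 friends in April. "
                  "She sold half as many in May. How many clips in total?"))
```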
Key contributions:
- 22 interpretable atomic operations for close-ended evaluation
- 4 evolution operations for open-ended tasks
- Multi-round compositional pipelines
- Evaluation across 8 LLMs and 4 domains
- Cross-format validation on 3 open-ended benchmarks
- Empirical evidence that static benchmarks overestimate robustness
Static benchmarks may overestimate generalization.
AutoEvoEval demonstrates that:
- Controlled input evolution consistently degrades performance
- Composition depth amplifies robustness failure
- Vulnerabilities persist across both close-ended and open-ended tasks
AutoEvoEval provides a composable, interpretable, and format-agnostic robustness evaluation framework for LLMs.