
AutoEvoEval: Metamorphic Dataset Evolution for Robust Evaluation of Large Language Models

🧠 Motivation

Large Language Models (LLMs) are predominantly evaluated on static benchmarks, which fail to capture real-world distribution shifts, semantic perturbations, and compositional input evolution.

Do LLMs remain robust when evaluation data systematically evolves — across both option-based and open-form tasks?

To answer this question, we introduce AutoEvoEval — an automated, interpretable, and composable framework for evolving evaluation datasets to assess LLM robustness across both close-ended and open-ended tasks.


🔧 Framework Overview

AutoEvoEval enables controlled dataset evolution through:

  • 22 atomic evolution operations (for close-ended tasks)
  • 4 semantic-preserving and noise-invariant operations (for open-form tasks)
  • Multi-round compositional evolution pipelines (sketched in code below)
  • Robustness measurement across task formats
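
A minimal sketch of what this pipeline could look like in code, assuming a simple multiple-choice record type; the names (`MCQ`, `EvolveOp`, `shuffle_opt_order`, `compose`) are illustrative, not the repository's actual API:

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass(frozen=True)
class MCQ:
    question: str
    options: List[str]   # option texts, without the "A./B./C." labels
    answer_idx: int      # index of the gold option

# An atomic evolution operation maps one sample to a perturbed sample.
EvolveOp = Callable[[MCQ], MCQ]

def shuffle_opt_order(sample: MCQ, seed: int = 0) -> MCQ:
    """Semantic-preserving atomic op: permute the options, remap the gold index."""
    perm = list(range(len(sample.options)))
    random.Random(seed).shuffle(perm)
    return replace(
        sample,
        options=[sample.options[i] for i in perm],
        answer_idx=perm.index(sample.answer_idx),
    )

def compose(*ops: EvolveOp) -> EvolveOp:
    """Multi-round evolution: apply atomic operations left to right."""
    def chained(sample: MCQ) -> MCQ:
        for op in ops:
            sample = op(sample)
        return sample
    return chained
```

Chaining operations, e.g. `compose(shuffle_opt_order, shuffle_opt_order)`, yields the multi-round pipelines evaluated in RQ3 and RQ4.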

📌 Framework Diagram

Figure 1. AutoEvoEval pipeline: dataset → atomic operations → multi-round evolution → model evaluation.


🔬 Research Questions and Results


🧪 RQ1: Impact of Atomic Evolution Operations

How do different types of atomic perturbations affect model performance?

| Method | History | Math | Medicine | Psychology | AVG |
|---|---|---|---|---|---|
| AbbrOptCont | -1.788 | -0.125 | -0.983 | -3.085 | -1.495 |
| AbbrQ | -5.264 | -3.000 | -8.746 | -9.254 | -6.566 |
| AddAboveWrong | -19.135 | -25.000 | -41.966 | -36.890 | -30.748 |
| AddIrrOpts | -3.950 | -4.750 | -2.195 | -4.175 | -3.768 |
| AddStrongDist | -7.484 | -4.875 | -2.624 | -3.894 | -4.719 |
| OptToJudge | -17.663 | -14.750 | -9.431 | -14.863 | -14.177 |
| ExpandOptsIrr | +4.167 | +19.375 | +8.591 | +5.536 | +9.417 |
| ExpandOptsRel | +5.063 | +17.125 | +8.896 | +6.290 | +9.344 |
| ExpandQuesIrr | +0.570 | -3.375 | -1.409 | -2.493 | -1.677 |
| ExpandQuesRel | +3.228 | -3.500 | +2.458 | +4.352 | +1.634 |
| InsertIrrChars | -6.377 | -10.800 | -14.206 | -13.153 | -11.134 |
| RevQ | -49.884 | -26.375 | -49.569 | -49.248 | -43.769 |
| RewriteOpt | -3.107 | -0.375 | -6.159 | -8.233 | -4.468 |
| RewriteOptRAG | -12.447 | -- | -20.548 | -20.753 | -17.916 |
| RewriteQ | -1.719 | -1.000 | -1.813 | -0.859 | -1.348 |
| RewriteQRAG | -0.717 | -- | -4.073 | +0.361 | -1.476 |
| ShuffleOptIds | -3.924 | -2.625 | -2.295 | -4.393 | -3.309 |
| ShuffleOptOrder | -2.896 | -5.500 | -2.495 | -4.674 | -3.891 |
| SwapQOpt | -12.368 | -13.000 | -11.269 | -16.864 | -13.375 |
| TransOptEnZh | +0.016 | -3.125 | -2.718 | -6.177 | -3.001 |
| TransQEnZh | -4.198 | -7.250 | -10.534 | -10.602 | -8.146 |
| UpdateOptIds | -4.789 | -3.625 | -4.055 | -10.085 | -5.639 |
| AVG | -6.576 | -4.828 | -8.052 | -9.234 | -7.283 |

Table 1. Average accuracy change (percentage points) per evolution operation, averaged across all models.

  • RevQ (logic reversal) causes the largest average drop: −43.77
  • AddAboveWrong ("None of the above" option): −30.75
  • Average degradation across all operations: −7.28
  • Models are highly sensitive to both structural and semantic edits; a sketch of how these deltas are measured follows.
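
A hedged sketch of how these per-operation deltas could be measured, reusing the `MCQ` and `EvolveOp` types from the framework sketch above; `model_answer` stands in for whatever LLM call the evaluation harness actually makes:

```python
from typing import Callable, List

Answerer = Callable[[MCQ], int]   # returns the model's chosen option index

def accuracy(model_answer: Answerer, data: List[MCQ]) -> float:
    """Fraction of samples answered correctly."""
    return sum(model_answer(s) == s.answer_idx for s in data) / len(data)

def accuracy_delta(model_answer: Answerer, data: List[MCQ], op: EvolveOp) -> float:
    """Accuracy change in percentage points after applying one atomic op;
    negative values indicate degradation, matching the sign in Table 1."""
    original = accuracy(model_answer, data)
    evolved = accuracy(model_answer, [op(s) for s in data])
    return 100.0 * (evolved - original)
```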

🧪 RQ2: Consistency Under Perturbation (ROP Score)

Do models maintain correct answers under minor and moderate perturbations?

| Model | RQ | SOO | IIC | UOI | COTJ | SQWO | Others | AVG |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 0.912 | 0.930 | 0.928 | 0.926 | 0.846 | 0.908 | 0.857 | 0.871 |
| DeepSeek-V3 | 0.909 | 0.832 | 0.818 | 0.882 | 0.643 | 0.752 | 0.853 | 0.840 |
| Gemini-1.5 | 0.897 | 0.810 | 0.810 | 0.777 | 0.673 | 0.608 | 0.766 | 0.765 |
| GLM-4 | 0.897 | 0.860 | 0.698 | 0.900 | 0.567 | 0.717 | 0.804 | 0.795 |
| Llama-3.1 | 0.579 | 0.236 | 0.251 | 0.161 | 0.410 | 0.201 | 0.460 | 0.418 |
| Mistral-small | 0.833 | 0.801 | 0.609 | 0.804 | 0.574 | 0.518 | 0.747 | 0.731 |
| GPT-3.5 | 0.896 | 0.872 | 0.699 | 0.914 | 0.598 | 0.721 | 0.804 | 0.799 |
| GPT-4 | 0.909 | 0.855 | 0.733 | 0.885 | 0.586 | 0.734 | 0.805 | 0.799 |
| Average | 0.854 | 0.774 | 0.693 | 0.781 | 0.612 | 0.645 | -- | -- |

Table 2. Recall of Performance (ROP) for each model across perturbation types. The column abbreviations appear to correspond to Table 1 operations (likely RQ = RewriteQ, SOO = ShuffleOptOrder, IIC = InsertIrrChars, UOI = UpdateOptIds, COTJ = OptToJudge, SQWO = SwapQOpt).

  • AutoEvoEval replicates and extends PertEval with richer transformations.
  • Stronger models (GPT-4, DeepSeek-R1) show higher robustness.
  • ROP falls below 0.5 for the weakest model (Llama-3.1) on most perturbation types.
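
Assuming ROP denotes the fraction of originally-correct answers a model retains after perturbation (a natural reading of "recall of performance"; the paper's exact definition may differ), a sketch building on the earlier types:

```python
def rop(model_answer: Answerer, data: List[MCQ], op: EvolveOp) -> float:
    """Recall of Performance: among items the model answers correctly on the
    original set, the fraction it still answers correctly after evolution."""
    correct_before, kept = 0, 0
    for sample in data:
        if model_answer(sample) != sample.answer_idx:
            continue
        correct_before += 1
        evolved = op(sample)
        if model_answer(evolved) == evolved.answer_idx:
            kept += 1
    return kept / correct_before if correct_before else 0.0
```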

🧪 RQ3: Effects of Combining Evolutionary Operations

Do combinations of two operations amplify performance degradation?

| Method | Avg Drop |
|---|---|
| AddIrrOpts | -3.950 |
| AddStrongDist | -7.484 |
| ShuffleOptIds | -3.924 |
| ShuffleOptOrder | -2.896 |
| AddIrrOpts + AddIrrOpts | -6.809 |
| AddIrrOpts + ShuffleOptIds | -8.623 |
| AddIrrOpts + ShuffleOptOrder | -9.525 |
| AddStrongDist + AddIrrOpts | -8.629 |
| AddStrongDist + AddStrongDist | -7.874 |

Table 3. Average accuracy drop for single operations versus two-operation combinations.

  • Combinations such as AddIrrOpts + ShuffleOptOrder cause a −9.53 average degradation, beyond the −6.85 sum of the two individual drops.
  • Compounding is non-linear: some pairs amplify each other, while repeating AddStrongDist (−7.87) adds little over a single application (−7.48); a numeric check follows.
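
A quick arithmetic check against the Table 3 numbers makes the non-linearity concrete:

```python
# Single-op and combined average drops, copied from Table 3.
single = {"AddIrrOpts": -3.950, "AddStrongDist": -7.484,
          "ShuffleOptIds": -3.924, "ShuffleOptOrder": -2.896}
combined = {("AddIrrOpts", "ShuffleOptOrder"): -9.525,
            ("AddStrongDist", "AddStrongDist"): -7.874}

for (a, b), drop in combined.items():
    naive = single[a] + single[b]
    kind = "super-additive" if drop < naive else "sub-additive"
    print(f"{a} + {b}: {drop:+.3f} vs naive sum {naive:+.3f} -> {kind}")
# AddIrrOpts + ShuffleOptOrder: -9.525 vs naive sum -6.846 -> super-additive
# AddStrongDist + AddStrongDist: -7.874 vs naive sum -14.968 -> sub-additive
```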

🧪 RQ4: Impact of Long Evolution Chains

Do longer evolution chains lead to greater robustness failure?

| Method | DeepSeek-R1 | DeepSeek-V3 | Gemini-1.5 | GLM-4 | Llama-3.1 | Mistral-small | GPT-3.5 | GPT-4 | AVG Drop |
|---|---|---|---|---|---|---|---|---|---|
| Origin | 85.232 | 89.451 | 81.857 | 82.700 | 29.451 | 75.105 | 82.700 | 83.122 | -- |
| Rule | -6.751 | -45.148 | -71.308 | -68.354 | -23.882 | -70.886 | -68.776 | -68.354 | -52.932 |
| LLM | -27.848 | -27.848 | -43.038 | -29.114 | -7.257 | -37.553 | -28.692 | -28.692 | -28.755 |
| Hybrid | -18.017 | -24.219 | -35.865 | -21.688 | -17.637 | -23.376 | -20.844 | -22.110 | -22.969 |

Table 4. Baseline accuracy (Origin row, %) and performance degradation (percentage points) under multi-step evolution pipelines (Rule-based, LLM-based, Hybrid).

| Chain Type | Avg Drop |
|---|---|
| Rule-based | −52.93 |
| LLM-based | −28.76 |
| Hybrid | −22.97 |

Even top-tier models (e.g., GPT-4) exhibit large-scale performance degradation under multi-step evolution.
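
A sketch of how the three chain types could be assembled from the earlier primitives; `llm_rewrite` is a hypothetical placeholder for an LLM-backed operation, and the pools and round count are illustrative, not the repository's actual configuration:

```python
import random
from dataclasses import replace

def llm_rewrite(sample: MCQ) -> MCQ:
    """Placeholder for an LLM-backed rewrite: in practice, prompt a model to
    paraphrase the question while keeping the gold answer fixed."""
    return replace(sample, question=sample.question)  # no-op stand-in

RULE_OPS = [shuffle_opt_order]   # rule-based pool (plus the other rule ops)
LLM_OPS = [llm_rewrite]          # LLM-based pool (plus the other LLM ops)

def build_chain(kind: str, rounds: int, rng: random.Random) -> EvolveOp:
    """Rule chains draw only rule-based ops, LLM chains only LLM-backed ops,
    and hybrid chains draw from both pools each round."""
    pool = {"rule": RULE_OPS, "llm": LLM_OPS, "hybrid": RULE_OPS + LLM_OPS}[kind]
    return compose(*(rng.choice(pool) for _ in range(rounds)))

# e.g. hybrid_chain = build_chain("hybrid", rounds=3, rng=random.Random(0))
```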


🧪 RQ5: Generalization to Open-Form Benchmarks

Does evolution-induced degradation generalize beyond multiple-choice to open-form tasks?

| Dataset | Rewrite | Shrink | AddIr | InsertText |
|---|---|---|---|---|
| BoolQ | -1.14 | -1.14 | -2.86 | -5.57 |
| DROP | +0.43 | -1.14 | +3.00 | -2.43 |
| GSM8K | -15.29 | -3.86 | -10.57 | -5.71 |
| Average | -5.33 | -2.05 | -3.48 | -4.57 |

Overall average drop across datasets: −3.86

These results confirm that dataset evolution generalizes beyond multiple-choice formats and consistently exposes robustness weaknesses in open-form reasoning tasks.
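
Open-form items have no options to remap, so these operations act on the question text alone. A sketch of one noise-invariant perturbation in the spirit of InsertText; the helper name and the distractor sentence are made up for illustration:

```python
import random

def insert_irrelevant_text(question: str, distractor: str, seed: int = 0) -> str:
    """Splice an irrelevant sentence between two sentences of the question;
    the gold answer should be unaffected."""
    parts = question.split(". ")
    pos = random.Random(seed).randrange(len(parts))
    return ". ".join(parts[:pos] + [distractor.rstrip(".")] + parts[pos:])

# Example on a GSM8K-style item:
q = "Tom has 3 boxes. Each box holds 4 pens. How many pens does Tom have?"
print(insert_irrelevant_text(q, "Tom's sister likes the color blue"))
```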


🧩 Contributions

  • 22 interpretable atomic operations for close-ended evaluation
  • 4 evolution operations for open-form tasks
  • Multi-round compositional pipelines
  • Evaluation across 8 LLMs and 4 domains
  • Cross-format validation on 3 open-form benchmarks
  • Empirical evidence of overestimated robustness in static benchmarks

🎯 Why It Matters

Static benchmarks may overestimate generalization.

AutoEvoEval demonstrates that:

  • Controlled input evolution consistently degrades performance
  • Composition depth amplifies robustness failure
  • Vulnerabilities persist across both close-ended and open-ended tasks

AutoEvoEval provides a composable, interpretable, and format-agnostic robustness evaluation framework for LLMs.
