CriterAlign

Criterion-Centric Rationale Alignment for Code Preference Judging. CriterAlign decomposes a pairwise A-vs-B preference call into five LLM stages: criterion generation, pairwise criterion judging, batched tie-driven criterion refinement (BTCR), swap-consistency criterion filtering (SCF), and final synthesis. The criterion generator, criterion judge, and final judge are conditioned on Human-Preference-Aligned Guidance (HPAG), which is distilled offline from rationale gaps between human votes and a monolithic judge.

Zhenyu Li · Aleksandar Cvejić · Zehui Chen · Peter Wonka

🌐 Project page · 📄 arXiv (coming soon) · 💻 Code

Paper: CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging (under review).
Project page: docs/index.html (also published via GitHub Pages from the main branch's /docs folder).
Headline result: 66.3 % accuracy on the BigCodeReward validation split with a Qwen2.5-VL-32B judge, vs. 60.4 % for the same judge run monolithically (+5.9 pp), and +11.3 pp over the strongest reproduced criterion-generation baseline.

What's in the pipeline

  ┌───────────┐   ┌───────────────┐   ┌──────┐   ┌──────┐   ┌──────────────┐
  │ Criterion │ → │ Pairwise      │ → │ BTCR │ → │ SCF  │ → │ Final judge  │
  │ generator │   │ criterion     │   │      │   │      │   │ over evidence│
  │ (+ HPAG)  │   │ judge (+HPAG) │   │      │   │      │   │ (+ HPAG)     │
  └───────────┘   └───────────────┘   └──────┘   └──────┘   └──────────────┘

Pairwise criterion judging. Each criterion gets a direct comparative verdict v ∈ {A, B, tie, insufficient} rather than two independent pointwise scores.
BTCR — batched tie-driven criterion refinement. Tied criteria are decomposed into up to two finer comparative sub-criteria in a single batched call, then re-judged.
SCF — swap-consistency criterion filtering. Each criterion is re-judged with the candidate order swapped; only criteria whose verdict survives the swap operator π are kept.
HPAG — Human-Preference-Aligned Guidance. Compact natural-language guidance synthesised offline from the 20 % training split by comparing human votes with monolithic-judge rationales. We synthesise two layers: G-HPAG (dataset-wide principles) and C-HPAG (per task family — web development, game development, creative coding, diagram creation, scientific computing, problem solving). HPAG is injected into the criterion generator, criterion judge, and final judge.
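The swap-consistency idea behind SCF can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the `judge` callable, its signature, and the toy judge are assumptions, and only the verdict labels and the swap operator π come from the description above. The sketch also keeps a `tie` that survives the swap, which may differ from how the real pipeline treats ties after BTCR.

```python
from typing import Callable, Dict, List

# Verdict labels taken from the text above; everything else is illustrative.
VERDICTS = {"A", "B", "tie", "insufficient"}

# Swap operator π: maps a verdict given on the swapped (B, A) ordering
# back into the original (A, B) frame.
PI = {"A": "B", "B": "A", "tie": "tie", "insufficient": "insufficient"}

# Hypothetical judge signature: (criterion, first, second) -> verdict,
# where "A" means the candidate shown first wins.
Judge = Callable[[str, str, str], str]


def swap_consistent(criteria: List[str], cand_a: str, cand_b: str,
                    judge: Judge) -> Dict[str, str]:
    """Keep only criteria whose verdict is unchanged after swapping order."""
    kept = {}
    for criterion in criteria:
        forward = judge(criterion, cand_a, cand_b)
        backward = judge(criterion, cand_b, cand_a)
        if forward in VERDICTS and PI.get(backward) == forward:
            kept[criterion] = forward
    return kept


def toy_judge(criterion: str, first: str, second: str) -> str:
    """Stand-in for the LLM judge: content-based on "brevity", position-biased otherwise."""
    if criterion == "brevity":
        if len(first) == len(second):
            return "tie"
        return "A" if len(first) < len(second) else "B"
    return "A"  # always prefers whichever candidate is shown first


result = swap_consistent(["brevity", "style"], "x = 1", "x = 1  # set x", toy_judge)
print(result)  # {'brevity': 'A'}
```

The position-biased "style" criterion returns "A" in both orderings, so its swapped verdict maps to "B" under π and the criterion is dropped.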

Results

BigCodeReward validation (n = 3,785), Qwen2.5-VL-32B judge, execution outputs + screenshots provided. The split is 20 % / 80 % training / validation; the training split is used exclusively for HPAG synthesis. Starting from the monolithic baseline (60.4 %), each component of the pipeline adds a cumulative gain up to 66.3 % for full CriterAlign.
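As a quick cross-check of the figures quoted here, the arithmetic can be reproduced directly. The sketch assumes the 4731 records listed for bigcodereward in the data download step are the population being split; a stratified split rounds per category, so an exact match on the validation count is not guaranteed in general.

```python
# Figures quoted in this README.
total_pairs = 4731        # bigcodereward size, from the download step
val_fraction = 0.80       # validation share of the 20 / 80 split
monolithic_acc = 60.4     # monolithic judge accuracy, in percent
criteralign_acc = 66.3    # full CriterAlign accuracy, in percent

expected_val = round(total_pairs * val_fraction)
gain_pp = round(criteralign_acc - monolithic_acc, 1)
print(expected_val, gain_pp)  # 3785 5.9
```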

For the full set of comparisons against criterion-generation baselines, the per-stage HPAG ablation, and the cross-judge generalisation study, see the paper and the project page.

Install

The pixi path is the recommended way to install — it pins the full vLLM / torch stack used in the paper.

git clone https://github.com/zhyever/CriterAlign.git
cd CriterAlign
curl -fsSL https://pixi.sh/install.sh | bash         # if you don't have pixi
pixi install

Pure pip (install vLLM separately if hosting the judge locally):

pip install -r requirements.txt
pip install -e .
pip install "vllm==0.8.5.post1"                       # only if hosting the judge locally

Download the data

The benchmark uses three Hugging Face Hub datasets that download to a single local Arrow tree under ./code_benchmark/:

pixi run download-data
# → ./code_benchmark/bigcodereward/                  (4731 instructions + code A/B + execution traces)
# → ./code_benchmark/bigcodearena-preference-5k/     (five-aspect human pairwise labels)
# → ./code_benchmark/bigcodereward-experiment-results/ (pre-computed monolithic baselines for 15 judges)

The 80 / 20 validation / training stratified split file ships under splits/dataset_split_seed42.json (seed 42; the same split used for every number in the paper).

Serve the judge (Qwen2.5-VL-32B)

The pipeline talks to any OpenAI-compatible endpoint. The default config in src/criteralign/config/judge_model_config.yaml points at http://localhost:8000/v1 for the qwen2.5-vl-32b entry.

Local 4×80 GB GPUs (4×H200 in the paper; 4×H100 / 4×A100 also work):

pixi run vllm-serve-qwen32b

SLURM cluster (long-lived server, 3-day allocation, 4 GPUs):

sbatch scripts/sbatch_vllm_server.sh
# Once it's RUNNING:
HOST=$(squeue -u $USER -h -o "%N" -j <jobid>)
export VLLM_API_BASE="http://${HOST}:8000/v1"

VLLM_API_BASE, if set, overrides every api_base: for api_type: vllm entries in the YAML — so the same client code works against either a local or a remote vLLM server.
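The override rule can be pictured as a small resolution function. This is a sketch under stated assumptions, not the code in src/criteralign: the function name `resolve_api_base` and the dict-shaped config entry are hypothetical, and only the `api_base` / `api_type` keys and the `VLLM_API_BASE` variable come from the description above.

```python
import os
from typing import Mapping, Optional


def resolve_api_base(entry: Mapping[str, str],
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Return the endpoint for one model entry, honouring VLLM_API_BASE."""
    env = os.environ if env is None else env
    override = env.get("VLLM_API_BASE")
    # Only vLLM-backed entries are redirected; other entries keep their own api_base.
    if override and entry.get("api_type") == "vllm":
        return override
    return entry["api_base"]


# Hypothetical entries mirroring the shape described for the YAML registry.
local_vllm = {"api_type": "vllm", "api_base": "http://localhost:8000/v1"}
other = {"api_type": "bedrock", "api_base": "https://example.invalid/v1"}

remote_env = {"VLLM_API_BASE": "http://node01:8000/v1"}
print(resolve_api_base(local_vllm, env={}))          # http://localhost:8000/v1
print(resolve_api_base(local_vllm, env=remote_env))  # http://node01:8000/v1
print(resolve_api_base(other, env=remote_env))       # https://example.invalid/v1
```

Passing the environment as an argument keeps the function easy to test without mutating `os.environ`.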

Reproduce the headline number (66.3 %)

Two steps after the server is up. The bundled HPAG JSON (guidance/guidance_v2_sonnet46.json) is the exact Sonnet 4.6-synthesised file used in the paper, so you do not need Bedrock credentials to reproduce the number.

# 1. Run the full pipeline (criterion gen → pairwise judge → BTCR → SCF → final synth)
python -m criteralign.scripts.run.run_pairwise_rrd \
    --guidance-file guidance/guidance_v2_sonnet46.json \
    --prompt-version v7 \
    --workers 32

# Output:
#   outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/
#       train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl
#   (the legacy output-dir name corresponds to the full CriterAlign pipeline:
#    pairwise RRD + BTCR + SCF + HPAG)

# 2. Evaluate on the validation split with the canonical "naive_overall" aggregator
python -m criteralign.scripts.eval.evaluate_decomposed \
    --results outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl \
    --judge qwen2_5_vl_32b \
    --label-source reward \
    --split-file splits/dataset_split_seed42.json \
    --split-side val

Expected: naive_overall accuracy ≈ 66.3 %.

Use --max-records 50 --workers 4 for a quick smoke test.
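To sanity-check a reported accuracy independently of evaluate_decomposed, a restricted-to-validation accuracy can be computed by hand. The sketch below is illustrative only: the record fields `id`, `prediction`, and `label` are hypothetical (the real JSONL schema and the split-file layout may differ), and it does not attempt to reproduce how the naive_overall aggregator treats ties.

```python
import json
from typing import Iterable, Set


def validation_accuracy(jsonl_lines: Iterable[str], val_ids: Set[str]) -> float:
    """Fraction of validation records whose prediction equals the human label."""
    correct = total = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        if rec["id"] not in val_ids:
            continue  # training-side records are excluded from the score
        total += 1
        correct += rec["prediction"] == rec["label"]
    if total == 0:
        raise ValueError("no validation records found")
    return correct / total


# Tiny in-memory example with hypothetical field names.
lines = [
    json.dumps({"id": "s1", "prediction": "A", "label": "A"}),
    json.dumps({"id": "s2", "prediction": "B", "label": "A"}),
    json.dumps({"id": "s3", "prediction": "B", "label": "B"}),
    json.dumps({"id": "s4", "prediction": "A", "label": "B"}),  # training side
]
acc = validation_accuracy(lines, val_ids={"s1", "s2", "s3"})
print(f"{acc:.3f}")  # 0.667
```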

Reproduce the monolithic baseline (60.4 %)

python -m criteralign.scripts.run.reproduce_baseline --workers 32
python -m criteralign.scripts.eval.compare_results \
    --reproduced outputs/qwen2.5-vl-32b/with_execution/train-judge_qwen2.5-vl-32b_with_execution.jsonl \
    --judge qwen2_5_vl_32b

Re-synthesise HPAG from scratch (optional)

The bundled guidance/guidance_v2_sonnet46.json was synthesised offline from the 20 % training split using Claude Sonnet 4.6 as the guidance synthesiser. The two-stage synthesis pipeline is included so you can rebuild HPAG (e.g. with a different synthesiser model). Both stages need outbound LLM access (Bedrock for Sonnet 4.6 in our setup, but any model registered in the YAML works).

# Stage 1 — Human-rationale reconstruction.
#   A strong LLM writes a concise rationale hypothesis for each human vote.
python -m criteralign.scripts.run.generate_human_rationales \
    --model sonnet46-bedrock --mode enhanced --workers 16

# Stage 2 — Guidance synthesis (HPAG).
#   Compare human and monolithic-judge rationales; distil recurring rationale
#   gaps into global + category-level guidance.
python -m criteralign.scripts.run.generate_prompt_guidance \
    --model sonnet46-bedrock \
    --template src/criteralign/prompts/rationale_gap/guidance_synthesis_v2.md \
    --output-file guidance/guidance_v2_<your-tag>.json

Synthesis is non-deterministic; expect small (≤ 1 pp) variation in the downstream validation accuracy across resyntheses. The paper's Table 4 shows HPAG is robust to the choice of synthesiser, including self-synthesis by the judge model itself.

Curate the project-page examples

The docs/ site is data-driven. To regenerate the six before / after cases after you've run the pipeline:

python scripts/curate_examples.py \
    --mono-jsonl       outputs/qwen2.5-vl-32b/with_execution/train-judge_qwen2.5-vl-32b_with_execution.jsonl \
    --criteralign-jsonl outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl \
    --reward-dir       code_benchmark/bigcodereward/train \
    --split-file       splits/dataset_split_seed42.json \
    --split-side val --split-ratio 0.20 --n 6

This rewrites examples/case_NN/, the web mirror under docs/assets/examples/, and the manifest docs/examples.json that the page reads at load time. To swap a single slot for a different session id without re-curating the whole list, edit scripts/replace_examples.py.

Browse the project page locally

python -m http.server 8765 --directory docs
# → http://127.0.0.1:8765/

To publish via GitHub Pages: in the repo settings, set Pages → Source → Deploy from a branch → main / docs. No build step.

Repo layout

CriterAlign/
├── README.md, LICENSE, pyproject.toml, pixi.toml, requirements.txt
├── src/criteralign/
│   ├── config/       settings.py + judge_model_config.yaml (model registry)
│   ├── data/         loader.py (pyarrow), split.py (val/train stratified split)
│   ├── judge/        prompt builder, API client, response parser, monolithic judger
│   ├── criterion/    generator, criterion_judger, pairwise_rrd (BTCR + SCF), naive_judger
│   ├── eval/         pairwise_rrd_runner (with checkpoint/resume), runner
│   ├── prompts/      v7/ (CriterAlign prompts), pairwise_rrd_batch/ (BTCR redundancy / conflict checks),
│   │                 pairwise_rrd_tie_decompose/, rationale_gap/ (HPAG synthesis),
│   │                 judge_prompt_unified.md (monolithic baseline)
│   └── scripts/
│       ├── run/      reproduce_baseline, run_pairwise_rrd,
│       │             generate_human_rationales, generate_prompt_guidance
│       └── eval/     evaluate_decomposed, compare_results
├── scripts/
│   ├── download_data.py        # HF datasets → ./code_benchmark/
│   ├── start_vllm_mgpus.sh     # local 4-GPU vLLM server boot
│   ├── sbatch_vllm_server.sh   # SLURM long-lived server (optional)
│   ├── curate_examples.py      # regenerates docs/examples.json
│   └── replace_examples.py     # surgical per-slot swap for hand-picked cases
├── guidance/
│   └── guidance_v2_sonnet46.json  # bundled HPAG, used for the 66.3 % run
├── splits/
│   └── dataset_split_seed42.json  # 20 % train / 80 % val stratified by category
├── examples/                   # canonical case-study artefacts (one folder per case)
└── docs/                       # GitHub Pages site (vanilla HTML/CSS/JS)
    ├── index.html, style.css, app.js, examples.json
    ├── assets/figures/             # teaser.svg, pipeline.svg
    └── assets/examples/case_NN/    # code, screenshots, criteria for the in-page viewer

Note on legacy paths. The output directory pairwise_rrd_v7v2_bf_bd_psf_with_execution/ and the JSONL field pairwise_rrd_metadata.n_position_swap_filtered predate the final paper naming. They correspond to the full CriterAlign pipeline — pairwise RRD with BTCR (BF + BD), SCF (PSF), and HPAG (V7v2 nested guidance). We keep the legacy names so existing runs remain readable; new documentation and the project page use the paper terminology.
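For scripts that need to label old runs with paper terminology, the correspondence above can be written as a lookup. This is a hypothetical helper, not part of the repository; the token-to-name mapping is taken from the note above, and the underscore-based tokenisation is an assumption about how the directory name is composed.

```python
from typing import List

# Legacy directory-name tokens and the paper components they correspond to.
LEGACY_TO_PAPER = {
    "v7v2": "HPAG",
    "bf": "BTCR",
    "bd": "BTCR",
    "psf": "SCF",
}


def paper_components(legacy_dir: str) -> List[str]:
    """List the paper-named components encoded in a legacy output-dir name."""
    seen: List[str] = []
    for token in legacy_dir.strip("/").split("_"):
        name = LEGACY_TO_PAPER.get(token)
        if name and name not in seen:
            seen.append(name)  # keep first-seen order, drop duplicates
    return seen


components = paper_components("pairwise_rrd_v7v2_bf_bd_psf_with_execution/")
print(components)  # ['HPAG', 'BTCR', 'SCF']
```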

License

MIT — see LICENSE.
