Criterion-Centric Rationale Alignment for Code Preference Judging. CriterAlign decomposes a pairwise A-vs-B preference call into five LLM stages: criterion generation, pairwise criterion judging, batched tie-driven criterion refinement (BTCR), swap-consistency criterion filtering (SCF), and final synthesis. It conditions every stage on Human-Preference-Aligned Guidance (HPAG) distilled offline from rationale gaps between human votes and a monolithic judge.
Zhenyu Li · Aleksandar Cvejić · Zehui Chen · Peter Wonka
🌐 Project page · 📄 arXiv (coming soon) · 💻 Code
- Paper: CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging (under review).
- Project page: docs/index.html (also published via GitHub Pages from the main branch's /docs folder).
- Headline result: 66.3 % accuracy on the BigCodeReward validation split with a Qwen2.5-VL-32B judge, vs. 60.4 % for the same judge run monolithically (+5.9 pp), and +11.3 pp over the strongest reproduced criterion-generation baseline.
┌───────────┐   ┌───────────────┐   ┌──────┐   ┌──────┐   ┌───────────────┐
│ Criterion │ → │ Pairwise      │ → │ BTCR │ → │ SCF  │ → │ Final judge   │
│ generator │   │ criterion     │   │      │   │      │   │ over evidence │
│ (+ HPAG)  │   │ judge (+HPAG) │   │      │   │      │   │ (+ HPAG)      │
└───────────┘   └───────────────┘   └──────┘   └──────┘   └───────────────┘
- Pairwise criterion judging. Each criterion gets a direct comparative verdict v ∈ {A, B, tie, insufficient} rather than two independent pointwise scores.
- BTCR (batched tie-driven criterion refinement). Tied criteria are decomposed into up to two finer comparative sub-criteria in a single batched call, then re-judged.
- SCF (swap-consistency criterion filtering). Each criterion is re-judged with the candidate order swapped; only criteria whose verdict survives the swap operator π are kept.
- HPAG (Human-Preference-Aligned Guidance). Compact natural-language guidance synthesised offline from the 20 % training split by comparing human votes with monolithic-judge rationales. We synthesise two layers: G-HPAG (dataset-wide principles) and C-HPAG (per task family: web development, game development, creative coding, diagram creation, scientific computing, problem solving). HPAG is injected into the criterion generator, criterion judge, and final judge.
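The swap-consistency check described above can be illustrated with a minimal sketch. It is not the repository's implementation: the `Judge` signature, the `SWAP` table (one plausible reading of the swap operator π, where A and B exchange and the other two labels are unchanged), and the toy judge are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Verdict labels as listed above. The mapping below is an ASSUMED reading of
# the swap operator π: A and B exchange, tie and insufficient stay as they are.
SWAP = {"A": "B", "B": "A", "tie": "tie", "insufficient": "insufficient"}

# Hypothetical judge signature: (criterion, first_shown, second_shown) -> verdict,
# where "A" means the first-shown candidate wins.
Judge = Callable[[str, str, str], str]


def swap_consistent(criteria: List[str], cand_a: str, cand_b: str,
                    judge: Judge) -> Dict[str, str]:
    """Keep only criteria whose verdict agrees after the candidate order is swapped."""
    kept = {}
    for criterion in criteria:
        forward = judge(criterion, cand_a, cand_b)
        backward = judge(criterion, cand_b, cand_a)  # same pair, order swapped
        if SWAP[backward] == forward:  # map the swapped verdict back, then compare
            kept[criterion] = forward
    return kept


def toy_judge(criterion: str, first: str, second: str) -> str:
    """Toy stand-in for an LLM call, used only to exercise the filter."""
    if criterion == "length":
        # Content-based: prefers the longer candidate regardless of position.
        if len(first) == len(second):
            return "tie"
        return "A" if len(first) > len(second) else "B"
    # Position-biased: always prefers whichever candidate is shown first.
    return "A"


result = swap_consistent(["length", "position"], "print('hello')", "pass", toy_judge)
print(result)  # {'length': 'A'}
```

In this toy run the content-based criterion survives, while the position-biased one is dropped because its verdict follows the slot rather than the candidate.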
BigCodeReward validation (n = 3,785), Qwen2.5-VL-32B judge, execution outputs + screenshots provided. The split is 20 % / 80 % training / validation; the training split is used exclusively for HPAG synthesis. Starting from the monolithic baseline (60.4 %), each component of the pipeline adds a cumulative gain up to 66.3 % for full CriterAlign.
For the full set of comparisons against criterion-generation baselines, the per-stage HPAG ablation, and the cross-judge generalisation study, see the paper and the project page.
The pixi path is the recommended way to install — it pins the full vLLM / torch stack used in the paper.
git clone https://github.com/zhyever/CriterAlign.git
cd CriterAlign
curl -fsSL https://pixi.sh/install.sh | bash # if you don't have pixi
pixi install
Pure pip (install vLLM separately if hosting the judge locally):
pip install -r requirements.txt
pip install -e .
pip install "vllm==0.8.5.post1" # only if hosting the judge locally
The benchmark uses three Hugging Face Hub datasets that download to a single
local Arrow tree under ./code_benchmark/:
pixi run download-data
# → ./code_benchmark/bigcodereward/ (4731 instructions + code A/B + execution traces)
# → ./code_benchmark/bigcodearena-preference-5k/ (five-aspect human pairwise labels)
# → ./code_benchmark/bigcodereward-experiment-results/ (pre-computed monolithic baselines for 15 judges)
The 80 / 20 validation / training stratified split file ships under
splits/dataset_split_seed42.json (seed 42; the same split used for every
number in the paper).
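For readers unfamiliar with seeded stratified splits, the idea can be sketched as follows. This is an illustration only: the function name, the `(id, category)` record shape, and the `"train"` / `"val"` keys are hypothetical, and the actual logic in `split.py` and the schema of the shipped JSON file may differ.

```python
import random
from collections import defaultdict


def stratified_split(records, train_ratio=0.20, seed=42):
    """Split (id, category) pairs so each category contributes ~train_ratio to train."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_category = defaultdict(list)
    for record_id, category in records:
        by_category[category].append(record_id)

    train, val = [], []
    for category in sorted(by_category):   # sorted for a deterministic visit order
        ids = sorted(by_category[category])
        rng.shuffle(ids)
        n_train = round(len(ids) * train_ratio)
        train.extend(ids[:n_train])
        val.extend(ids[n_train:])
    return {"train": sorted(train), "val": sorted(val)}


# Toy data: 100 records, half "web" and half "game" (hypothetical category names).
records = [(f"s{i:03d}", "web" if i % 2 else "game") for i in range(100)]
split = stratified_split(records)
print(len(split["train"]), len(split["val"]))  # 20 80
```

Stratifying by category keeps the task-family mix of the two sides comparable, which matters here because the smaller side is reserved for guidance synthesis.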
The pipeline talks to any OpenAI-compatible endpoint. The default config in
src/criteralign/config/judge_model_config.yaml points at
http://localhost:8000/v1 for the qwen2.5-vl-32b entry.
Local 4×80 GB GPUs (4×H200 in the paper; 4×H100 / 4×A100 also work):
pixi run vllm-serve-qwen32b
SLURM cluster (long-lived server, 3-day allocation, 4 GPUs):
sbatch scripts/sbatch_vllm_server.sh
# Once it's RUNNING:
HOST=$(squeue -u $USER -h -o "%N" -j <jobid>)
export VLLM_API_BASE="http://${HOST}:8000/v1"
VLLM_API_BASE, if set, overrides every api_base: for api_type: vllm
entries in the YAML — so the same client code works against either a local
or a remote vLLM server.
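The override rule can be pictured with a short sketch. It assumes each YAML entry is loaded as a plain dict with `api_type` and `api_base` keys; the helper name `resolve_api_base` and the non-vLLM entry are hypothetical, and the repository's own config loader may be structured differently.

```python
import os


def resolve_api_base(entry, env=None):
    """Return the endpoint for one model entry, honouring VLLM_API_BASE for vLLM entries."""
    env = os.environ if env is None else env
    override = env.get("VLLM_API_BASE")
    if override and entry.get("api_type") == "vllm":
        return override           # environment wins for vLLM-backed entries
    return entry["api_base"]      # otherwise keep the value from the YAML


# Hypothetical entries mirroring the shape described above.
vllm_entry = {"api_type": "vllm", "api_base": "http://localhost:8000/v1"}
other_entry = {"api_type": "openai", "api_base": "https://example.invalid/v1"}

remote = {"VLLM_API_BASE": "http://node123:8000/v1"}
print(resolve_api_base(vllm_entry, env={}))       # http://localhost:8000/v1
print(resolve_api_base(vllm_entry, env=remote))   # http://node123:8000/v1
print(resolve_api_base(other_entry, env=remote))  # https://example.invalid/v1
```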
Two steps after the server is up. The bundled HPAG JSON
(guidance/guidance_v2_sonnet46.json) is the exact Sonnet 4.6-synthesised
file used in the paper, so you do not need Bedrock credentials to
reproduce the number.
# 1. Run the full pipeline (criterion gen → pairwise judge → BTCR → SCF → final synth)
python -m criteralign.scripts.run.run_pairwise_rrd \
--guidance-file guidance/guidance_v2_sonnet46.json \
--prompt-version v7 \
--workers 32
# Output:
# outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/
# train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl
# (the legacy output-dir name corresponds to the full CriterAlign pipeline:
# pairwise RRD + BTCR + SCF + HPAG)
# 2. Evaluate on the validation split with the canonical "naive_overall" aggregator
python -m criteralign.scripts.eval.evaluate_decomposed \
--results outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl \
--judge qwen2_5_vl_32b \
--label-source reward \
--split-file splits/dataset_split_seed42.json \
--split-side val
Expected: naive_overall accuracy ≈ 66.3 %.
Use --max-records 50 --workers 4 for a quick smoke test.
python -m criteralign.scripts.run.reproduce_baseline --workers 32
python -m criteralign.scripts.eval.compare_results \
--reproduced outputs/qwen2.5-vl-32b/with_execution/train-judge_qwen2.5-vl-32b_with_execution.jsonl \
--judge qwen2_5_vl_32b
The bundled guidance/guidance_v2_sonnet46.json was synthesised offline from
the 20 % training split using Claude Sonnet 4.6 as the guidance synthesiser.
The two-stage synthesis pipeline is included so you can rebuild HPAG (e.g.
with a different synthesiser model). Both stages need outbound LLM access
(Bedrock for Sonnet 4.6 in our setup, but any model registered in the YAML
works).
# Stage 1 — Human-rationale reconstruction.
# A strong LLM writes a concise rationale hypothesis for each human vote.
python -m criteralign.scripts.run.generate_human_rationales \
--model sonnet46-bedrock --mode enhanced --workers 16
# Stage 2 — Guidance synthesis (HPAG).
# Compare human and monolithic-judge rationales; distil recurring rationale
# gaps into global + category-level guidance.
python -m criteralign.scripts.run.generate_prompt_guidance \
--model sonnet46-bedrock \
--template src/criteralign/prompts/rationale_gap/guidance_synthesis_v2.md \
--output-file guidance/guidance_v2_<your-tag>.json
Synthesis is non-deterministic; expect small (≤ 1 pp) variation in the downstream validation accuracy across resyntheses. The paper's Table 4 shows HPAG is robust to the choice of synthesiser, including self-synthesis by the judge model itself.
The docs/ site is data-driven. To regenerate the six before / after
cases after you've run the pipeline:
python scripts/curate_examples.py \
--mono-jsonl outputs/qwen2.5-vl-32b/with_execution/train-judge_qwen2.5-vl-32b_with_execution.jsonl \
--criteralign-jsonl outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl \
--reward-dir code_benchmark/bigcodereward/train \
--split-file splits/dataset_split_seed42.json \
--split-side val --split-ratio 0.20 --n 6
This rewrites examples/case_NN/, the web mirror under
docs/assets/examples/, and the manifest docs/examples.json that the page
reads at load time. To swap a single slot for a different session id without
re-curating the whole list, edit scripts/replace_examples.py.
python -m http.server 8765 --directory docs
# → http://127.0.0.1:8765/
To publish via GitHub Pages: in the repo settings, set Pages → Source → Deploy from a branch → main / docs. No build step.
CriterAlign/
├── README.md, LICENSE, pyproject.toml, pixi.toml, requirements.txt
├── src/criteralign/
│ ├── config/ settings.py + judge_model_config.yaml (model registry)
│ ├── data/ loader.py (pyarrow), split.py (val/train stratified split)
│ ├── judge/ prompt builder, API client, response parser, monolithic judger
│ ├── criterion/ generator, criterion_judger, pairwise_rrd (BTCR + SCF), naive_judger
│ ├── eval/ pairwise_rrd_runner (with checkpoint/resume), runner
│ ├── prompts/ v7/ (CriterAlign prompts), pairwise_rrd_batch/ (BTCR redundancy / conflict checks),
│ │ pairwise_rrd_tie_decompose/, rationale_gap/ (HPAG synthesis),
│ │ judge_prompt_unified.md (monolithic baseline)
│ └── scripts/
│ ├── run/ reproduce_baseline, run_pairwise_rrd,
│ │ generate_human_rationales, generate_prompt_guidance
│ └── eval/ evaluate_decomposed, compare_results
├── scripts/
│ ├── download_data.py # HF datasets → ./code_benchmark/
│ ├── start_vllm_mgpus.sh # local 4-GPU vLLM server boot
│ ├── sbatch_vllm_server.sh # SLURM long-lived server (optional)
│ ├── curate_examples.py # regenerates docs/examples.json
│ └── replace_examples.py # surgical per-slot swap for hand-picked cases
├── guidance/
│ └── guidance_v2_sonnet46.json # bundled HPAG, used for the 66.3 % run
├── splits/
│ └── dataset_split_seed42.json # 20 % train / 80 % val stratified by category
├── examples/ # canonical case-study artefacts (one folder per case)
└── docs/ # GitHub Pages site (vanilla HTML/CSS/JS)
├── index.html, style.css, app.js, examples.json
├── assets/figures/ # teaser.svg, pipeline.svg
└── assets/examples/case_NN/ # code, screenshots, criteria for the in-page viewer
Note on legacy paths. The output directory pairwise_rrd_v7v2_bf_bd_psf_with_execution/ and the JSONL field pairwise_rrd_metadata.n_position_swap_filtered predate the final paper naming. They correspond to the full CriterAlign pipeline: pairwise RRD with BTCR (BF + BD), SCF (PSF), and HPAG (V7v2 nested guidance). We keep the legacy names so existing runs remain readable; new documentation and the project page use the paper terminology.
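As a reading aid, the legacy-to-paper correspondence stated in the note can be written down as a small lookup. The helper below is purely illustrative (the name `paper_components` and the token-splitting approach are assumptions, not part of the repository); it only encodes the mapping given above.

```python
# Legacy directory-name tokens and the paper term each one corresponds to,
# as stated in the note on legacy paths.
LEGACY_TO_PAPER = {
    "bf": "BTCR",
    "bd": "BTCR",
    "psf": "SCF",
    "v7v2": "HPAG",
}


def paper_components(dir_name):
    """List the paper-terminology components encoded in a legacy output-dir name."""
    found = []
    for token in dir_name.strip("/").split("_"):
        term = LEGACY_TO_PAPER.get(token)
        if term and term not in found:  # bf and bd both map to BTCR; list it once
            found.append(term)
    return found


components = paper_components("pairwise_rrd_v7v2_bf_bd_psf_with_execution/")
print(components)  # ['HPAG', 'BTCR', 'SCF']
```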
MIT — see LICENSE.

