CriterAlign

Criterion-Centric Rationale Alignment for Code Preference Judging. CriterAlign decomposes a pairwise A-vs-B preference call into five LLM stages: criterion generation, pairwise criterion judging, batched tie-driven criterion refinement (BTCR), swap-consistency criterion filtering (SCF), and final synthesis. The criterion generator, criterion judge, and final judge are each conditioned on Human-Preference-Aligned Guidance (HPAG), distilled offline from rationale gaps between human votes and a monolithic judge.

Zhenyu Li · Aleksandar Cvejić · Zehui Chen · Peter Wonka

🌐 Project page · 📄 arXiv (coming soon) · 💻 Code

CriterAlign teaser: offline HPAG synthesis + inference-time injection into a frozen judge

  • Paper: CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging (under review).
  • Project page: docs/index.html (also published via GitHub Pages from the main branch's /docs folder).
  • Headline result: 66.3 % accuracy on the BigCodeReward validation split with a Qwen2.5-VL-32B judge, vs. 60.4 % for the same judge run monolithically (+5.9 pp), and +11.3 pp over the strongest reproduced criterion-generation baseline.

What's in the pipeline

  ┌───────────┐   ┌───────────────┐   ┌──────┐   ┌──────┐   ┌──────────────┐
  │ Criterion │ → │ Pairwise      │ → │ BTCR │ → │ SCF  │ → │ Final judge  │
  │ generator │   │ criterion     │   │      │   │      │   │ over evidence│
  │ (+ HPAG)  │   │ judge (+HPAG) │   │      │   │      │   │ (+ HPAG)     │
  └───────────┘   └───────────────┘   └──────┘   └──────┘   └──────────────┘
  • Pairwise criterion judging. Each criterion gets a direct comparative verdict v ∈ {A, B, tie, insufficient} rather than two independent pointwise scores.
  • BTCR — batched tie-driven criterion refinement. Tied criteria are decomposed into up to two finer comparative sub-criteria in a single batched call, then re-judged.
  • SCF — swap-consistency criterion filtering. Each criterion is re-judged with the candidate order swapped; only criteria whose verdict survives the swap operator π are kept.
  • HPAG — Human-Preference-Aligned Guidance. Compact natural-language guidance synthesised offline from the 20 % training split by comparing human votes with monolithic-judge rationales. We synthesise two layers: G-HPAG (dataset-wide principles) and C-HPAG (per task family — web development, game development, creative coding, diagram creation, scientific computing, problem solving). HPAG is injected into the criterion generator, criterion judge, and final judge.
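The SCF step described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the `Judge` signature, `swap_consistent`, and `toy_judge` are hypothetical names, and the sketch assumes that "surviving the swap" means the swapped-order verdict equals π applied to the original verdict, with `tie` and `insufficient` left unchanged by π.

```python
from typing import Callable, Dict, List

# Swap operator π: A and B trade places; tie and insufficient are unchanged.
PI: Dict[str, str] = {"A": "B", "B": "A", "tie": "tie", "insufficient": "insufficient"}

# Hypothetical judge signature: (criterion, first_candidate, second_candidate) -> verdict.
Judge = Callable[[str, str, str], str]


def swap_consistent(criteria: List[str], cand_a: str, cand_b: str, judge: Judge) -> List[str]:
    """Keep criteria whose swapped-order verdict equals π(original-order verdict)."""
    kept = []
    for criterion in criteria:
        original = judge(criterion, cand_a, cand_b)
        swapped = judge(criterion, cand_b, cand_a)
        if swapped == PI[original]:
            kept.append(criterion)
    return kept


def toy_judge(criterion: str, first: str, second: str) -> str:
    """Stand-in for an LLM call: content-based for one criterion, position-biased otherwise."""
    if criterion == "shorter code":
        if len(first) == len(second):
            return "tie"
        return "A" if len(first) < len(second) else "B"
    return "A"  # always prefers whichever candidate is shown first


if __name__ == "__main__":
    # The position-biased criterion fails the swap check and is dropped.
    print(swap_consistent(["shorter code", "nicer style"], "x = 1", "x = 1 + 0", toy_judge))
```

With the toy judge, only the content-based criterion is kept, because its verdict flips from A to B when the candidates are swapped, while the position-biased one answers A both times.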

Results

BigCodeReward validation (n = 3,785), Qwen2.5-VL-32B judge, execution outputs + screenshots provided. The split is 20 % / 80 % training / validation; the training split is used exclusively for HPAG synthesis. Starting from the monolithic baseline (60.4 %), accuracy rises as each component is added, reaching 66.3 % for full CriterAlign.
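The figures above can be cross-checked with a few lines of arithmetic. This sketch assumes that the 4731 instructions listed in the download step are the full set being split; the variable names are illustrative only.

```python
TOTAL_INSTRUCTIONS = 4731   # count listed for the bigcodereward dataset in this README
VAL_FRACTION = 0.80         # 20 % training / 80 % validation

val_size = round(TOTAL_INSTRUCTIONS * VAL_FRACTION)    # 3784.8 rounds to 3785
train_size = TOTAL_INSTRUCTIONS - val_size             # remaining pairs used for HPAG synthesis

monolithic_acc = 60.4
criteralign_acc = 66.3
gain_pp = round(criteralign_acc - monolithic_acc, 1)   # gain in percentage points

if __name__ == "__main__":
    print(f"validation size: {val_size}, training size: {train_size}")
    print(f"gain over monolithic judge: +{gain_pp} pp")
```

Under that assumption the validation size matches the reported n = 3,785, and the gain matches the +5.9 pp quoted in the headline result.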

Component ablation: pairwise RRD, BTCR, G-HPAG, SCF, and C-HPAG each contribute a cumulative gain over the monolithic Qwen2.5-VL-32B baseline.

For the full set of comparisons against criterion-generation baselines, the per-stage HPAG ablation, and the cross-judge generalisation study, see the paper and the project page.

Install

The pixi path is the recommended way to install — it pins the full vLLM / torch stack used in the paper.

git clone https://github.com/zhyever/CriterAlign.git
cd CriterAlign
curl -fsSL https://pixi.sh/install.sh | bash         # if you don't have pixi
pixi install

Pure pip (install vLLM separately if hosting the judge locally):

pip install -r requirements.txt
pip install -e .
pip install "vllm==0.8.5.post1"                       # only if hosting the judge locally

Download the data

The benchmark uses three Hugging Face Hub datasets that download to a single local Arrow tree under ./code_benchmark/:

pixi run download-data
# → ./code_benchmark/bigcodereward/                  (4731 instructions + code A/B + execution traces)
# → ./code_benchmark/bigcodearena-preference-5k/     (five-aspect human pairwise labels)
# → ./code_benchmark/bigcodereward-experiment-results/ (pre-computed monolithic baselines for 15 judges)

The 80 / 20 validation / training stratified split file ships under splits/dataset_split_seed42.json (seed 42; the same split used for every number in the paper).
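For readers unfamiliar with stratified splitting, the idea can be sketched as follows. This is an illustration only, not the code in src/criteralign/data/split.py: the function name, the `categories` mapping, and the per-category rounding rule are assumptions, and the shipped split file should always be used to reproduce the paper's numbers.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    categories: Dict[str, str], train_fraction: float = 0.20, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Split ids into (train, val), sampling train_fraction of each category."""
    by_category: Dict[str, List[str]] = defaultdict(list)
    for item_id, category in categories.items():
        by_category[category].append(item_id)

    rng = random.Random(seed)
    train: List[str] = []
    val: List[str] = []
    for category in sorted(by_category):       # fixed order keeps the split reproducible
        ids = sorted(by_category[category])
        rng.shuffle(ids)
        n_train = round(len(ids) * train_fraction)
        train.extend(ids[:n_train])
        val.extend(ids[n_train:])
    return train, val


if __name__ == "__main__":
    toy = {f"web-{i}": "web development" for i in range(10)}
    toy.update({f"game-{i}": "game development" for i in range(5)})
    train_ids, val_ids = stratified_split(toy)
    print(len(train_ids), len(val_ids))
```

Splitting within each category keeps the task-family proportions roughly equal on both sides, which matters here because C-HPAG is synthesised per task family.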

Serve the judge (Qwen2.5-VL-32B)

The pipeline talks to any OpenAI-compatible endpoint. The default config in src/criteralign/config/judge_model_config.yaml points at http://localhost:8000/v1 for the qwen2.5-vl-32b entry.

Local 4×80 GB GPUs (4×H200 in the paper; 4×H100 / 4×A100 also work):

pixi run vllm-serve-qwen32b

SLURM cluster (long-lived server, 3-day allocation, 4 GPUs):

sbatch scripts/sbatch_vllm_server.sh
# Once it's RUNNING:
HOST=$(squeue -u $USER -h -o "%N" -j <jobid>)
export VLLM_API_BASE="http://${HOST}:8000/v1"

VLLM_API_BASE, if set, overrides every api_base: for api_type: vllm entries in the YAML — so the same client code works against either a local or a remote vLLM server.
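The override rule can be pictured with a small resolver. This is a sketch of the described behaviour, not the repository's client code: `resolve_api_base` is a hypothetical helper, model entries are shown as plain dictionaries rather than parsed YAML, and the non-vLLM entry is made up for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_api_base(entry: Mapping[str, str], env: Optional[Mapping[str, str]] = None) -> str:
    """Return the endpoint for one model entry; VLLM_API_BASE wins for vLLM entries."""
    env = os.environ if env is None else env
    override = env.get("VLLM_API_BASE")
    if override and entry.get("api_type") == "vllm":
        return override
    return entry["api_base"]


if __name__ == "__main__":
    vllm_entry = {"api_type": "vllm", "api_base": "http://localhost:8000/v1"}
    other_entry = {"api_type": "other", "api_base": "https://example.invalid/v1"}  # hypothetical
    remote = {"VLLM_API_BASE": "http://gpu-node-01:8000/v1"}                        # hypothetical host

    print(resolve_api_base(vllm_entry, env={}))       # falls back to the configured api_base
    print(resolve_api_base(vllm_entry, env=remote))   # override applies
    print(resolve_api_base(other_entry, env=remote))  # non-vLLM entries are untouched
```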

Reproduce the headline number (66.3 %)

Two steps after the server is up. The bundled HPAG JSON (guidance/guidance_v2_sonnet46.json) is the exact Sonnet 4.6-synthesised file used in the paper, so you do not need Bedrock credentials to reproduce the number.

# 1. Run the full pipeline (criterion gen → pairwise judge → BTCR → SCF → final synth)
python -m criteralign.scripts.run.run_pairwise_rrd \
    --guidance-file guidance/guidance_v2_sonnet46.json \
    --prompt-version v7 \
    --workers 32

# Output:
#   outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/
#       train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl
#   (the legacy output-dir name corresponds to the full CriterAlign pipeline:
#    pairwise RRD + BTCR + SCF + HPAG)

# 2. Evaluate on the validation split with the canonical "naive_overall" aggregator
python -m criteralign.scripts.eval.evaluate_decomposed \
    --results outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl \
    --judge qwen2_5_vl_32b \
    --label-source reward \
    --split-file splits/dataset_split_seed42.json \
    --split-side val

Expected: naive_overall accuracy ≈ 66.3 %.

Use --max-records 50 --workers 4 for a quick smoke test.
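Conceptually, the evaluation step restricts the results to the validation side of the split and measures agreement with the human label. The sketch below shows only that idea. The record fields `session_id`, `prediction`, and `label` are hypothetical, the actual JSONL and split-file schemas may differ, and the internals of the naive_overall aggregator are not reproduced here.

```python
from typing import Dict, Iterable, Set


def accuracy_on_split(records: Iterable[Dict[str, str]], val_ids: Set[str]) -> float:
    """Fraction of validation-side records whose prediction matches the human label."""
    total = 0
    correct = 0
    for record in records:
        if record["session_id"] not in val_ids:   # skip training-side records
            continue
        total += 1
        correct += record["prediction"] == record["label"]
    return correct / total if total else 0.0


if __name__ == "__main__":
    toy_records = [
        {"session_id": "s1", "prediction": "A", "label": "A"},
        {"session_id": "s2", "prediction": "B", "label": "A"},
        {"session_id": "s3", "prediction": "tie", "label": "tie"},
        {"session_id": "s4", "prediction": "B", "label": "B"},   # training side, ignored
    ]
    print(round(accuracy_on_split(toy_records, {"s1", "s2", "s3"}), 3))
```

Filtering by split side matters because the training side was used to synthesise HPAG, so including it would inflate the reported accuracy.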

Reproduce the monolithic baseline (60.4 %)

python -m criteralign.scripts.run.reproduce_baseline --workers 32
python -m criteralign.scripts.eval.compare_results \
    --reproduced outputs/qwen2.5-vl-32b/with_execution/train-judge_qwen2.5-vl-32b_with_execution.jsonl \
    --judge qwen2_5_vl_32b

Re-synthesise HPAG from scratch (optional)

The bundled guidance/guidance_v2_sonnet46.json was synthesised offline from the 20 % training split using Claude Sonnet 4.6 as the guidance synthesiser. The two-stage synthesis pipeline is included so you can rebuild HPAG (e.g. with a different synthesiser model). Both stages need outbound LLM access (Bedrock for Sonnet 4.6 in our setup, but any model registered in the YAML works).

# Stage 1 — Human-rationale reconstruction.
#   A strong LLM writes a concise rationale hypothesis for each human vote.
python -m criteralign.scripts.run.generate_human_rationales \
    --model sonnet46-bedrock --mode enhanced --workers 16

# Stage 2 — Guidance synthesis (HPAG).
#   Compare human and monolithic-judge rationales; distil recurring rationale
#   gaps into global + category-level guidance.
python -m criteralign.scripts.run.generate_prompt_guidance \
    --model sonnet46-bedrock \
    --template src/criteralign/prompts/rationale_gap/guidance_synthesis_v2.md \
    --output-file guidance/guidance_v2_<your-tag>.json

Synthesis is non-deterministic; expect small (≤ 1 pp) variation in the downstream validation accuracy across resyntheses. The paper's Table 4 shows HPAG is robust to the choice of synthesiser, including self-synthesis by the judge model itself.

Curate the project-page examples

The docs/ site is data-driven. To regenerate the six before / after cases (matching --n 6 in the command below) after you've run the pipeline:

python scripts/curate_examples.py \
    --mono-jsonl       outputs/qwen2.5-vl-32b/with_execution/train-judge_qwen2.5-vl-32b_with_execution.jsonl \
    --criteralign-jsonl outputs/qwen2.5-vl-32b/pairwise_rrd_v7v2_bf_bd_psf_with_execution/train-pairwise_rrd_tie_decompose_qwen2.5-vl-32b_with_execution.jsonl \
    --reward-dir       code_benchmark/bigcodereward/train \
    --split-file       splits/dataset_split_seed42.json \
    --split-side val --split-ratio 0.20 --n 6

This rewrites examples/case_NN/, the web mirror under docs/assets/examples/, and the manifest docs/examples.json that the page reads at load time. To swap a single slot for a different session id without re-curating the whole list, edit scripts/replace_examples.py.

Browse the project page locally

python -m http.server 8765 --directory docs
# → http://127.0.0.1:8765/

To publish via GitHub Pages: in the repo settings, set Pages → Source → Deploy from a branch → main / docs. No build step.

Repo layout

CriterAlign/
├── README.md, LICENSE, pyproject.toml, pixi.toml, requirements.txt
├── src/criteralign/
│   ├── config/       settings.py + judge_model_config.yaml (model registry)
│   ├── data/         loader.py (pyarrow), split.py (val/train stratified split)
│   ├── judge/        prompt builder, API client, response parser, monolithic judger
│   ├── criterion/    generator, criterion_judger, pairwise_rrd (BTCR + SCF), naive_judger
│   ├── eval/         pairwise_rrd_runner (with checkpoint/resume), runner
│   ├── prompts/      v7/ (CriterAlign prompts), pairwise_rrd_batch/ (BTCR redundancy / conflict checks),
│   │                 pairwise_rrd_tie_decompose/, rationale_gap/ (HPAG synthesis),
│   │                 judge_prompt_unified.md (monolithic baseline)
│   └── scripts/
│       ├── run/      reproduce_baseline, run_pairwise_rrd,
│       │             generate_human_rationales, generate_prompt_guidance
│       └── eval/     evaluate_decomposed, compare_results
├── scripts/
│   ├── download_data.py        # HF datasets → ./code_benchmark/
│   ├── start_vllm_mgpus.sh     # local 4-GPU vLLM server boot
│   ├── sbatch_vllm_server.sh   # SLURM long-lived server (optional)
│   ├── curate_examples.py      # regenerates docs/examples.json
│   └── replace_examples.py     # surgical per-slot swap for hand-picked cases
├── guidance/
│   └── guidance_v2_sonnet46.json  # bundled HPAG, used for the 66.3 % run
├── splits/
│   └── dataset_split_seed42.json  # 20 % train / 80 % val stratified by category
├── examples/                   # canonical case-study artefacts (one folder per case)
└── docs/                       # GitHub Pages site (vanilla HTML/CSS/JS)
    ├── index.html, style.css, app.js, examples.json
    ├── assets/figures/             # teaser.svg, pipeline.svg
    └── assets/examples/case_NN/    # code, screenshots, criteria for the in-page viewer

Note on legacy paths. The output directory pairwise_rrd_v7v2_bf_bd_psf_with_execution/ and the JSONL field pairwise_rrd_metadata.n_position_swap_filtered predate the final paper naming. They correspond to the full CriterAlign pipeline — pairwise RRD with BTCR (BF + BD), SCF (PSF), and HPAG (V7v2 nested guidance). We keep the legacy names so existing runs remain readable; new documentation and the project page use the paper terminology.
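When reading older runs, the legacy tokens in a directory name can be translated to the paper terminology mechanically. The mapping below is taken from the note above; the helper `paper_components` itself is a hypothetical convenience and is not part of the repository.

```python
from typing import Dict, List

# Legacy token -> paper term, following the note on legacy paths.
LEGACY_TO_PAPER: Dict[str, str] = {
    "v7v2": "HPAG",   # V7v2 nested guidance
    "bf": "BTCR",     # BF and BD together make up BTCR
    "bd": "BTCR",
    "psf": "SCF",     # position-swap filtering
}


def paper_components(legacy_dir: str) -> List[str]:
    """List the paper-named components encoded in a legacy output directory name."""
    components: List[str] = []
    for token in legacy_dir.strip("/").split("_"):
        term = LEGACY_TO_PAPER.get(token)
        if term and term not in components:
            components.append(term)
    return components


if __name__ == "__main__":
    print(paper_components("pairwise_rrd_v7v2_bf_bd_psf_with_execution/"))
    # → ['HPAG', 'BTCR', 'SCF']
```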

License

MIT — see LICENSE.
