Research project investigating whether structured reasoning generation combined with step-aware reward-based selection improves best-of-N inference on dual-head language models (DHRD-style), across both classification and mathematical reasoning tasks.
.
├── docs/
│   ├── proposal.md              # Project proposal (title / description / main goal)
│   └── experiment_design.md     # Phase 1 experiment design (2×2 factorial) + Phase 2 sketch
├── paper/
│   ├── c2.pdf                   # C2 paper
│   ├── dual-head.pdf            # DHRD paper
│   ├── step_wise.pdf            # SRaR paper
│   ├── c2_code/C2/              # C2 reference code
│   └── step_wise_code/SRaR/     # SRaR reference code (fork of verl)
└── README.md
This repository includes two third-party codebases, both Apache-2.0 (their original LICENSE files are preserved in place):
- C2 — Cooperative-Critical Reward Modeling — upstream: https://github.com/asahi-research/C2
- SRaR — Step-wise Rubrics as Rewards — upstream: https://github.com/akarinmoe/SRaR
Phase 1 (minimum viable demo): see docs/experiment_design.md.
Phase 1 trains no models: the generator is frozen, and only inference and selection are evaluated in a 2×2 factorial (structured vs. unstructured CoT × outcome-only vs. step-aware selection).
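As a rough illustration of the Phase 1 setup, the sketch below enumerates the four cells of the 2×2 factorial and shows how two selection rules can pick different candidates from the same best-of-N pool. All names (`COT_STYLES`, `SELECTORS`, `outcome_only_score`, `step_aware_score`, `best_of_n`) are hypothetical and are not taken from this repository or from the C2 or SRaR code. The per-step scores are made-up numbers, and both scoring rules (last step as the outcome reward, minimum over steps as the step-aware aggregate) are assumptions chosen only for illustration; docs/experiment_design.md remains the reference for the actual design.

```python
from itertools import product
from typing import Callable, Sequence

# Hypothetical factor levels, named after the 2x2 description in this README.
COT_STYLES = ("structured", "unstructured")
SELECTORS = ("outcome_only", "step_aware")


def conditions() -> list[tuple[str, str]]:
    """Return the four (CoT style, selector) cells of the factorial."""
    return list(product(COT_STYLES, SELECTORS))


def outcome_only_score(step_scores: Sequence[float]) -> float:
    # Assumption: the last step's score stands in for an outcome-level reward.
    return step_scores[-1]


def step_aware_score(step_scores: Sequence[float]) -> float:
    # Assumption: aggregate per-step rewards by the weakest step (min).
    # Other aggregates (mean, product) would be equally plausible here.
    return min(step_scores)


def best_of_n(
    candidates: Sequence[Sequence[float]],
    score: Callable[[Sequence[float]], float],
) -> int:
    """Return the index of the highest-scoring candidate under `score`."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


if __name__ == "__main__":
    # Two candidate generations, each represented only by made-up
    # per-step reward scores; no model is called or trained.
    pool = [
        [0.9, 0.2, 0.95],  # strong final step, one weak intermediate step
        [0.7, 0.8, 0.85],  # uniformly reasonable steps
    ]
    for cot_style, selector in conditions():
        rule = outcome_only_score if selector == "outcome_only" else step_aware_score
        print(cot_style, selector, "-> picks candidate", best_of_n(pool, rule))
```

On this toy pool the outcome-only rule selects candidate 0 and the step-aware rule selects candidate 1, which is the kind of disagreement the factorial is meant to measure. In the sketch the CoT style only labels the cell; in a real run it would change how the frozen generator is prompted.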