Research Grader is a Codex skill for evaluating research papers with an area-chair style workflow: collect papers, extract local PDF text, send them through multiple anonymous reviewer passes, then produce a calibrated ranking with concrete rationales.
It is designed for moments when a single-paper summary is not enough. The strongest use case is comparing several papers from the same research area at once, preferably alongside already accepted conference papers or well-regarded baselines. That comparison makes the score more meaningful: novelty, rigor, and transferable value are judged against the field instead of against an isolated abstract.
Most paper reviews collapse into either a summary or a vibe score. Research Grader tries to do something stricter:
- Uses a fixed scoring formula: 40% Innovation + 40% Intrinsic Paper Value + 20% Rigor.
- Keeps writing quality and visual polish as a separate reference score, not part of the final score.
- Prioritizes method sections, evaluation protocol, ablations, limitations, and appendices.
- Penalizes benchmark chasing, thin disclosure, weak baselines, and marketing-style claims.
- Supports anonymous multi-reviewer scoring, where different reviewer styles inspect the same paper independently.
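The fixed weighting above can be sketched in a few lines. This is only an illustration: the `PaperScore` class, the field names, and the 0-10 scale are assumptions made for the example, not part of the skill itself. Only the 40/40/20 weights and the separate aesthetics score come from the description above.

```python
from dataclasses import dataclass

# Weights taken from the stated formula: 40% / 40% / 20%.
WEIGHTS = {"innovation": 0.40, "paper_value": 0.40, "rigor": 0.20}


@dataclass
class PaperScore:
    """Hypothetical container; a 0-10 scale is assumed for illustration."""
    innovation: float
    paper_value: float
    rigor: float
    aesthetics: float  # reference score only, never enters the total

    def total(self) -> float:
        # Weighted sum of the three scored dimensions; aesthetics is excluded.
        return (
            WEIGHTS["innovation"] * self.innovation
            + WEIGHTS["paper_value"] * self.paper_value
            + WEIGHTS["rigor"] * self.rigor
        )


score = PaperScore(innovation=8.0, paper_value=7.0, rigor=6.0, aesthetics=9.0)
print(round(score.total(), 2))  # 0.4*8 + 0.4*7 + 0.2*6
```

Changing `aesthetics` in this sketch leaves `total()` untouched, which mirrors the rule that polish is reported but not scored.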
In practice, this works best when Codex is run with a strong model and high reasoning budget. In our usage, Codex with GPT-5.5 at xhigh reasoning produced the most useful results: the anonymous reviewers were better at finding different failure modes, and the final aggregation was more stable when several close papers had to be ranked.
The skill defaults to multiple blind reviewer passes when sub-agents are available. Each reviewer receives the same rubric and local extracted paper text, but uses a different lens:
- strict-rigor: methodology, ablations, reproducibility, statistical caution.
- innovation: novelty, technical idea quality, and whether the work changes practice.
- value: reusable lessons, failure modes, negative results, and field-level insight.
- skeptical-meta: overclaiming, hidden assumptions, evaluator weakness, benchmark leakage.
The main Codex agent acts as the area chair. It does not expose one reviewer's scores to another reviewer before collection. After the reviews are in, it reconciles disagreements, reports uncertainty, and emits the final ranking.
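One way to picture the area-chair step is a per-dimension summary of the independent reviews. The sketch below is an assumption about how such a reconciliation could start, not the skill's actual logic: the `aggregate` function, the dimension keys, the 0-10 scale, and the use of a plain mean with a max-minus-min spread as the disagreement signal are all hypothetical.

```python
from statistics import mean

# Dimension names are hypothetical keys chosen for this example.
DIMENSIONS = ("innovation", "paper_value", "rigor")


def aggregate(reviews: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Summarize blind reviews per dimension.

    `reviews` maps a reviewer style to that reviewer's scores.
    Returns the mean and the spread (max - min) for each dimension;
    a large spread flags a disagreement worth reporting.
    """
    summary = {}
    for dim in DIMENSIONS:
        values = [scores[dim] for scores in reviews.values()]
        summary[dim] = {"mean": mean(values), "spread": max(values) - min(values)}
    return summary


# Made-up scores for one paper, one entry per reviewer style.
reviews = {
    "strict-rigor": {"innovation": 8.0, "paper_value": 7.0, "rigor": 6.0},
    "innovation": {"innovation": 7.0, "paper_value": 7.0, "rigor": 5.0},
    "value": {"innovation": 9.0, "paper_value": 8.0, "rigor": 6.0},
    "skeptical-meta": {"innovation": 6.0, "paper_value": 6.0, "rigor": 3.0},
}

summary = aggregate(reviews)
for dim, stats in summary.items():
    print(dim, stats)
```

Each reviewer's scores are collected before any summary is computed, so no reviewer sees another's numbers. The spread is a crude uncertainty signal; the area chair would still read the written rationales before settling a ranking.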
- Put target PDFs into one directory.
- Add accepted or canonical papers from the same area for calibration when possible.
- Ask Codex to use this skill to score the whole directory.
- Let the skill extract PDF text into analysis/text/.
- Run anonymous multi-reviewer scoring.
- Read the final ranking and the disagreement notes, not just the total scores.
Example prompt:
Use the Research Grader skill to evaluate all PDFs in this directory.
Use multiple anonymous sub-agents with different reviewer styles.
Compare these papers against the accepted conference papers in the same folder,
then rank them by innovation, intrinsic paper value, and rigor.
The standard output is a compact ranking table:
| Rank | Report | Innovation | Paper Value | Rigor | Aesthetics (ref) | Total |
|---|---|---|---|---|---|---|
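As a rough illustration of how rows for this table could be produced, the sketch below sorts papers by total and emits Markdown rows under the same header. The `render_ranking` function, the dictionary keys, and the sample values are hypothetical; the skill's real output code is not shown here.

```python
HEADER = "| Rank | Report | Innovation | Paper Value | Rigor | Aesthetics (ref) | Total |"
DIVIDER = "|---|---|---|---|---|---|---|"


def render_ranking(papers: list[dict]) -> str:
    """Return a Markdown ranking table, highest total first."""
    ranked = sorted(papers, key=lambda p: p["total"], reverse=True)
    lines = [HEADER, DIVIDER]
    for rank, p in enumerate(ranked, start=1):
        lines.append(
            f"| {rank} | {p['report']} | {p['innovation']:.1f} | "
            f"{p['paper_value']:.1f} | {p['rigor']:.1f} | "
            f"{p['aesthetics']:.1f} | {p['total']:.2f} |"
        )
    return "\n".join(lines)


# Made-up example data, not real paper scores.
papers = [
    {"report": "paper-b", "innovation": 6.0, "paper_value": 6.5,
     "rigor": 7.0, "aesthetics": 8.0, "total": 6.40},
    {"report": "paper-a", "innovation": 8.0, "paper_value": 7.0,
     "rigor": 6.0, "aesthetics": 9.0, "total": 7.20},
]

table = render_ranking(papers)
print(table)
```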
Each paper also gets a short rationale covering:
- Primary contribution.
- Why it scored where it did.
- Main limitation or reason it did not rank higher.
Optional tier labels are available:
- S 夯爆 (top tier): paradigm-level paper.
- A 很夯 (very strong): strong method route with broad research value.
- B 够硬 (solid): robust and valuable engineering or applied research.
- C 能打 (competitive): credible and useful but less foundational.
- D 偏拉 (weak): thin, mostly a capability report, or limited paper insight.
.
├── SKILL.md
├── README.md
├── README.zh-CN.md
├── agents/
│   └── openai.yaml
├── references/
│   └── rubric.md
└── scripts/
    └── extract_pdf_reports.py
Run from a directory containing PDFs:
python3 scripts/extract_pdf_reports.py .

The script writes extracted text and summaries under analysis/:
- analysis/text/<pdf-stem>.txt
- analysis/pdf_summaries.json
- optional representative page PNGs with --render-pages
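Before starting the reviewer passes, it can help to confirm that every PDF actually produced a text file. The check below is a minimal sketch based only on the analysis/text/<pdf-stem>.txt layout described above. The `missing_extractions` helper is hypothetical, and it assumes analysis/ sits inside the same directory as the PDFs.

```python
from pathlib import Path


def missing_extractions(pdf_dir: str) -> list[str]:
    """List PDFs in `pdf_dir` that have no analysis/text/<pdf-stem>.txt.

    Assumes analysis/ is created inside `pdf_dir` (an assumption of
    this sketch, not something the script's documentation guarantees).
    """
    root = Path(pdf_dir)
    text_dir = root / "analysis" / "text"
    missing = []
    for pdf in sorted(root.glob("*.pdf")):
        # The expected text file shares the PDF's stem.
        if not (text_dir / f"{pdf.stem}.txt").is_file():
            missing.append(pdf.name)
    return missing


if __name__ == "__main__":
    for name in missing_extractions("."):
        print(f"no extracted text for {name}")
```

An empty result means each PDF has a matching text file. It does not say anything about extraction quality, so scanned or image-only PDFs may still need a manual look.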
Research Grader scores the paper text and disclosed evidence. It does not verify that benchmark numbers are true, that private datasets exist, or that released code exactly matches the paper unless you explicitly ask Codex to perform that external audit.