Research Grader

Research Grader is a Codex skill for evaluating research papers with an area-chair-style workflow: collect papers, extract local PDF text, run multiple anonymous reviewer passes over that text, then produce a calibrated ranking with concrete rationales.

It is designed for moments when a single-paper summary is not enough. The strongest use case is comparing several papers from the same research area at once, preferably alongside already accepted conference papers or well-regarded baselines. That comparison makes the score more meaningful: novelty, rigor, and transferable value are judged against the field instead of against an isolated abstract.

Why This Skill

Most paper reviews collapse into either a summary or a vibe score. Research Grader tries to do something stricter:

Uses a fixed scoring formula: 40% Innovation + 40% Intrinsic Paper Value + 20% Rigor.
Keeps writing quality and visual polish as a separate reference score, not part of the final score.
Prioritizes method sections, evaluation protocol, ablations, limitations, and appendices.
Penalizes benchmark chasing, thin disclosure, weak baselines, and marketing-style claims.
Supports anonymous multi-reviewer scoring, where different reviewer styles inspect the same paper independently.
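
The fixed formula above can be written out as a small function. This is a minimal sketch: the weights come from the formula, but the 0-10 scale, the `WEIGHTS` mapping, and the `total_score` name are assumptions for illustration and not part of the skill itself.

```python
# Weights taken from the scoring formula: 40% Innovation,
# 40% Intrinsic Paper Value, 20% Rigor.
WEIGHTS = {"innovation": 0.4, "paper_value": 0.4, "rigor": 0.2}


def total_score(innovation: float, paper_value: float, rigor: float) -> float:
    """Return the weighted total; the aesthetics reference score is not an input."""
    scores = {"innovation": innovation, "paper_value": paper_value, "rigor": rigor}
    return sum(WEIGHTS[name] * value for name, value in scores.items())


if __name__ == "__main__":
    # Hypothetical scores on an assumed 0-10 scale.
    print(round(total_score(innovation=8.0, paper_value=7.0, rigor=6.0), 2))
```

Because writing quality and visual polish are kept as a reference score only, they do not appear in the function signature at all.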

In practice, this works best when Codex runs with a strong model and a high reasoning budget. In our usage, Codex with GPT-5.5 at xhigh reasoning produced the most useful results: the anonymous reviewers were better at finding distinct failure modes, and the final aggregation was more stable when several closely matched papers had to be ranked.

Anonymous Reviewer Protocol

The skill defaults to multiple blind reviewer passes when sub-agents are available. Each reviewer receives the same rubric and local extracted paper text, but uses a different lens:

strict-rigor: methodology, ablations, reproducibility, statistical caution.
innovation: novelty, technical idea quality, and whether the work changes practice.
value: reusable lessons, failure modes, negative results, and field-level insight.
skeptical-meta: overclaiming, hidden assumptions, evaluator weakness, benchmark leakage.

The main Codex agent acts as the area chair. It does not expose one reviewer's scores to another reviewer before collection. After the reviews are in, it reconciles disagreements, reports uncertainty, and emits the final ranking.
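
One way to picture the area-chair step is to collect one score set per reviewer lens, average each dimension, and report the spread as a disagreement signal. This is a rough sketch only: the README does not say how the skill reconciles scores, so mean aggregation, the max-min spread, the `reconcile` function, and the sample numbers are all assumptions.

```python
from statistics import mean

DIMENSIONS = ("innovation", "paper_value", "rigor")


def reconcile(reviews: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Average each dimension across reviewers and report the max-min spread."""
    summary = {}
    for dim in DIMENSIONS:
        values = [scores[dim] for scores in reviews.values()]
        summary[dim] = {"mean": mean(values), "spread": max(values) - min(values)}
    return summary


if __name__ == "__main__":
    # Hypothetical scores from the four reviewer lenses for one paper.
    reviews = {
        "strict-rigor": {"innovation": 7.0, "paper_value": 6.0, "rigor": 5.0},
        "innovation": {"innovation": 8.0, "paper_value": 7.0, "rigor": 6.0},
        "value": {"innovation": 7.0, "paper_value": 8.0, "rigor": 6.0},
        "skeptical-meta": {"innovation": 6.0, "paper_value": 7.0, "rigor": 4.0},
    }
    for dim, stats in reconcile(reviews).items():
        print(f"{dim}: mean={stats['mean']:.2f} spread={stats['spread']:.1f}")
```

A large spread on one dimension is the kind of disagreement the area chair would be expected to explain in the final notes instead of hiding it inside an average.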

Recommended Workflow

Put target PDFs into one directory.
Add accepted or canonical papers from the same area for calibration when possible.
Ask Codex to use this skill to score the whole directory.
Let the skill extract PDF text into analysis/text/.
Run anonymous multi-reviewer scoring.
Read the final ranking and the disagreement notes, not just the total scores.

Example prompt:

Use the Research Grader skill to evaluate all PDFs in this directory.
Use multiple anonymous sub-agents with different reviewer styles.
Compare these papers against the accepted conference papers in the same folder,
then rank them by innovation, intrinsic paper value, and rigor.

Output

The standard output is a compact ranking table:

Rank	Report	Innovation	Paper Value	Rigor	Aesthetics (ref)	Total
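
For readers who want to post-process results, a table with these columns could be assembled as shown here. This is an illustrative sketch, not the skill's own code: the dictionary keys, the tab-separated output, the number formatting, and the sample papers are hypothetical.

```python
HEADER = ["Rank", "Report", "Innovation", "Paper Value", "Rigor",
          "Aesthetics (ref)", "Total"]


def ranking_rows(papers: list[dict]) -> list[list[str]]:
    """Sort papers by total score (descending) and format one row per paper."""
    ordered = sorted(papers, key=lambda p: p["total"], reverse=True)
    rows = []
    for rank, p in enumerate(ordered, start=1):
        rows.append([
            str(rank), p["report"],
            f"{p['innovation']:.1f}", f"{p['paper_value']:.1f}",
            f"{p['rigor']:.1f}", f"{p['aesthetics']:.1f}", f"{p['total']:.2f}",
        ])
    return rows


if __name__ == "__main__":
    # Hypothetical papers and scores, used only to show the layout.
    papers = [
        {"report": "paper-a", "innovation": 6.0, "paper_value": 7.0,
         "rigor": 8.0, "aesthetics": 7.0, "total": 6.8},
        {"report": "paper-b", "innovation": 8.0, "paper_value": 7.0,
         "rigor": 6.0, "aesthetics": 5.0, "total": 7.2},
    ]
    print("\t".join(HEADER))
    for row in ranking_rows(papers):
        print("\t".join(row))
```

The aesthetics column is carried through for reference only and plays no part in the sort order.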

Each paper also gets a short rationale covering:

Primary contribution.
Why it scored where it did.
Main limitation or reason it did not rank higher.

Optional tier labels are available:

S (夯爆, "off the charts"): paradigm-level paper.
A (很夯, "very strong"): strong method route with broad research value.
B (够硬, "solid"): robust and valuable engineering or applied research.
C (能打, "holds its own"): credible and useful but less foundational.
D (偏拉, "on the weak side"): thin, mostly a capability report, or limited paper insight.

Repository Layout

.
├── SKILL.md
├── README.md
├── README.zh-CN.md
├── agents/
│   └── openai.yaml
├── references/
│   └── rubric.md
└── scripts/
    └── extract_pdf_reports.py

Extraction Script

Run from a directory containing PDFs:

python3 scripts/extract_pdf_reports.py .

The script writes extracted text and summaries under analysis/:

analysis/text/<pdf-stem>.txt
analysis/pdf_summaries.json
optional representative page PNGs with --render-pages
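
Before running the reviewers, it can help to check that extraction produced usable text. The sketch below relies only on the analysis/text/<pdf-stem>.txt layout listed above; the `text_report` helper, the UTF-8 encoding, and the 1000-character warning threshold are assumptions, and the schema of analysis/pdf_summaries.json is not documented here, so it is left unread.

```python
from pathlib import Path


def text_report(analysis_dir: str = "analysis") -> list[tuple[str, int]]:
    """Return (pdf-stem, character count) for each extracted text file."""
    text_dir = Path(analysis_dir) / "text"
    report = []
    for path in sorted(text_dir.glob("*.txt")):
        # Assumes UTF-8 output; undecodable bytes are replaced, not fatal.
        content = path.read_text(encoding="utf-8", errors="replace")
        report.append((path.stem, len(content)))
    return report


if __name__ == "__main__":
    for stem, n_chars in text_report():
        # The 1000-character threshold is an arbitrary, illustrative cutoff.
        flag = "  <- very little text, possibly a scanned PDF" if n_chars < 1000 else ""
        print(f"{stem}\t{n_chars}{flag}")
```

A paper with almost no extracted text would be scored on very thin evidence, so flagging it early avoids a misleading ranking.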

Notes

Research Grader scores the paper text and disclosed evidence. It does not verify that benchmark numbers are true, that private datasets exist, or that released code exactly matches the paper unless you explicitly ask Codex to perform that external audit.
