ReviewBench: An Extensible Framework for Benchmarking Human and AI Manuscript Review

Benchmark Design

  VenueRawDataFetchers       Canonicalization        InputGenerators       Core Benchmark Logic
 ┌─────────────────┐       ┌───────────────┐       ┌───────────────┐       ┌─────────────────┐
 │                 │       │               │       │               │       │                 │
 │    Venue 1      │──┐    │               │──────▶│ R3 Generator  │──┐    │                 │
 │  (e.g. ICLR)    │  │    │               │       │               │  │    │                 │
 └─────────────────┘  │    │   Parsers     │       └───────────────┘  │    │   Processor     │
                      ├───▶│               │                          ├───▶│                 │
       ...            │    │  raw_venue    │       ┌───────────────┐  │    │ extract_claims  │
                      │    │  ──▶ papers   │──────▶│    Control    │──┘    │ assess_comments │
 ┌─────────────────┐  │    │  ──▶ comments │       │   Generators  │       │  score          │
 │                 │  │    │               │       └───────────────┘       │                 │
 │    Venue n      │──┘    │               │                               │                 │
 │  (e.g. eLife)   │       │               │──── Human reviews ───────────▶│                 │
 └─────────────────┘       └───────────────┘  (parsed into comments)       └─────────────────┘

Pipeline Architecture

All data flows through Cloud SQL (PostgreSQL) via `src/db/queries.py` → `src/db/client.py` (psycopg, connected via `DATABASE_URL`).

| Step | File | Reads | Writes |
| --- | --- | --- | --- |
| Fetch | `src/fetchers/<venue>.py` | Venue API or local files | `raw_venue_data` |
| Parse | `src/parsers/<venue>.py` | `raw_venue_data` | `papers` + `comments` |
| Extract claims | `src/processor/extract_claims.py` | `papers` PDFs via Gemini | `claims` |
| Assess comments | `src/processor/assess_comments.py` | `comments` + `claims` + PDFs via Gemini context caching | `assessments` |
| Score | `src/processor/score.py` | `assessments` + `claims` | `results/` (stats CSV + figures) |
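
The read/write dependencies in the table above can be checked mechanically. The following is a minimal sketch, not code from the repository: the `Step` class, the `PIPELINE` list, and `missing_inputs` are hypothetical names, and `"external"` is a placeholder for venue APIs and local files. Only the table names come from the step table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    reads: frozenset    # tables this step consumes
    writes: frozenset   # tables or directories this step produces


# Table names mirror the step table; "external" is an illustrative
# placeholder for venue APIs and local files, not a real table.
PIPELINE = [
    Step("fetch", frozenset({"external"}), frozenset({"raw_venue_data"})),
    Step("parse", frozenset({"raw_venue_data"}), frozenset({"papers", "comments"})),
    Step("extract_claims", frozenset({"papers"}), frozenset({"claims"})),
    Step("assess_comments", frozenset({"comments", "claims"}), frozenset({"assessments"})),
    Step("score", frozenset({"assessments", "claims"}), frozenset({"results/"})),
]


def missing_inputs(steps):
    """Return, per step name, any inputs that no earlier step has written."""
    available = {"external"}
    problems = {}
    for step in steps:
        missing = set(step.reads) - available
        if missing:
            problems[step.name] = missing
        available |= step.writes
    return problems
```

Running `missing_inputs(PIPELINE)` on the full list reports no gaps, while dropping the fetch step flags `raw_venue_data` as an unmet input for the parser. This is why the steps have to run in the order listed.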

Assessment Pipeline

Comment assessment uses three separate LLM calls per source per paper, each with a dedicated prompt. Claim extraction runs once per paper before assessment.

| Step | Prompt | Model | Input |
| --- | --- | --- | --- |
| Extract claims | `src/prompts/extract_claims.md` | Gemini 3 Pro | Paper |
| Comment parsing | `src/prompts/comment_parser.md` | Gemini 3 Pro | Paper + comments |
| Claim mapping | `src/prompts/claim_mapping.md` | Gemini 3 Pro | Paper + comments + extracted claims |
| Critique typing | `src/prompts/critique_typing.md` | Gemini 3 Pro | Paper + comments |
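
To make the call count concrete, here is a hedged sketch that enumerates the LLM calls implied by the description above: one claim-extraction call per paper, then three assessment calls per source per paper. The prompt paths are taken from the table. The function name `plan_calls` and the source labels used in the usage note are assumptions for illustration only.

```python
from itertools import product
from typing import List, Optional, Tuple

# Prompt paths come from the table; everything else here is hypothetical.
PER_PAPER_PROMPTS = ["src/prompts/extract_claims.md"]
PER_SOURCE_PROMPTS = [
    "src/prompts/comment_parser.md",
    "src/prompts/claim_mapping.md",
    "src/prompts/critique_typing.md",
]


def plan_calls(paper_ids: List[str], sources: List[str]) -> List[Tuple[str, Optional[str], str]]:
    """List (paper_id, source, prompt_path) for every LLM call in one run."""
    calls = []
    for paper_id in paper_ids:
        # Claim extraction runs once per paper and is not tied to a source.
        for prompt in PER_PAPER_PROMPTS:
            calls.append((paper_id, None, prompt))
        # Each review source gets its own three assessment calls.
        for source, prompt in product(sources, PER_SOURCE_PROMPTS):
            calls.append((paper_id, source, prompt))
    return calls
```

For example, `plan_calls(["p1", "p2"], ["human", "r3"])` yields 1 + 2 × 3 = 7 calls per paper. The total grows linearly with both the number of papers and the number of review sources.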

Scoring and Output

The score step comprises four modules that run after assessment; `score.py` is the orchestrator that invokes the other three:

| Module | File | Description |
| --- | --- | --- |
| Metrics | `src/processor/metrics.py` | Computes per-paper rates (mapped, consequential, specification, justification, etc.) and cross-source comparisons (claim overlap, stance-matched overlap) |
| Statistics | `src/processor/statistics.py` | Pairwise Wilcoxon signed-rank tests with Holm-Bonferroni correction, Shapiro-Wilk normality tests, bootstrap 95% CIs, Cohen's d, and Cohen's kappa |
| Figures | `src/processor/figures.py` | Generates all `.png` figures, organized by paper section (1-7) |
| Score | `src/processor/score.py` | Orchestrator that runs metrics, statistics, and figures for each venue |

All output is written to results/ (figures and per-venue stats CSVs).
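
As an illustration of the Holm-Bonferroni correction named in the statistics module, the sketch below computes adjusted p-values in pure Python. It shows the standard step-down procedure and is not the code in `src/processor/statistics.py`, which may well rely on a statistics library instead. The function name `holm_bonferroni` is an assumption.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted
```

A pairwise comparison is then reported as significant when its adjusted p-value falls below the chosen threshold (for example 0.05). This controls the family-wise error rate across all pairwise tests while being less conservative than a plain Bonferroni correction.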

Data Accessibility

The benchmark database is hosted on Cloud SQL (PostgreSQL) and is publicly accessible with read-only credentials:

Host:     34.28.98.243
Port:     5432
Database: reviewbench
User:     user
Password: public

Connect with any PostgreSQL client (e.g. psql, DBeaver, TablePlus):

psql postgresql://user:public@34.28.98.243:5432/reviewbench
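
The same connection can be opened from Python. The sketch below assumes the third-party `psycopg` (version 3) package is installed and that the read-only user can query the canonical tables named in the pipeline table. The helper names `build_dsn` and `count_rows` are hypothetical, and since no column names are documented here, the query only counts rows.

```python
def build_dsn(host="34.28.98.243", port=5432, dbname="reviewbench",
              user="user", password="public"):
    """Assemble a PostgreSQL connection URI from the public credentials."""
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"


# Table names as listed in the pipeline table; access rights are an assumption.
CANONICAL_TABLES = {"raw_venue_data", "papers", "comments", "claims", "assessments"}


def count_rows(table):
    """Count rows in one canonical table over a read-only connection."""
    if table not in CANONICAL_TABLES:
        raise ValueError(f"unknown table: {table}")
    import psycopg  # third-party: pip install "psycopg[binary]"

    with psycopg.connect(build_dsn()) as conn:
        # Interpolation is safe here: table was checked against the allow-list.
        row = conn.execute(f"SELECT count(*) FROM {table}").fetchone()
    return row[0]
```

Calling `count_rows("papers")` needs network access to the host above. `build_dsn()` on its own returns the same URI as the `psql` command.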

Extensibility

The system is designed for multi-venue evaluation. Adding a new venue requires only writing a fetcher (to dump raw API data) and a parser (to extract papers and comments into canonical tables). The generators, processor, and scoring pipeline operate on the canonical schema and require no modification.
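
The fetcher/parser split might look like the following. This is a sketch under stated assumptions, not the repository's actual interface: the `Fetcher` and `Parser` protocols, the `Paper` and `Comment` dataclasses and their fields, and the `ExampleVenue*` classes are all hypothetical. The real modules live in `src/fetchers/<venue>.py` and `src/parsers/<venue>.py` and write to the database rather than returning lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass
class Paper:          # hypothetical shape of a canonical `papers` row
    paper_id: str
    title: str


@dataclass
class Comment:        # hypothetical shape of a canonical `comments` row
    paper_id: str
    source: str
    text: str


class Fetcher(Protocol):
    def fetch(self) -> List[Dict]: ...


class Parser(Protocol):
    def parse(self, raw: List[Dict]) -> Tuple[List[Paper], List[Comment]]: ...


class ExampleVenueFetcher:
    """Stand-in for a venue API client; returns canned raw records."""

    def fetch(self) -> List[Dict]:
        return [{"id": "p1", "title": "A paper", "reviews": ["Claim 2 lacks a control."]}]


class ExampleVenueParser:
    """Maps venue-specific raw records onto the canonical dataclasses."""

    def parse(self, raw: List[Dict]) -> Tuple[List[Paper], List[Comment]]:
        papers, comments = [], []
        for record in raw:
            papers.append(Paper(record["id"], record["title"]))
            for text in record["reviews"]:
                comments.append(Comment(record["id"], "human", text))
        return papers, comments


def ingest(fetcher: Fetcher, parser: Parser) -> Tuple[List[Paper], List[Comment]]:
    """Run one venue through fetch and parse; downstream steps see only canonical rows."""
    return parser.parse(fetcher.fetch())
```

Because everything after `ingest` sees only `Paper` and `Comment` records, a new venue needs only its own pair of classes, and the downstream steps stay unchanged.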

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for setup instructions, guidelines, and the pull request workflow.
