wanshuiyin/Anti-Autoresearch

Anti-Autoresearch 🛡️


🔬 The field has tolerated unreliable autoresearch long enough. Anti-Autoresearch is the read that finally catches it, so researchers can see through work that doesn't hold up at a glance.

🏆 Built on a battle-tested foundation: ARIS (~12.5k★ · HuggingFace Daily Papers #1 · 78+ skills across 7+ platforms). Anti-Autoresearch points ARIS's production audit DNA (experiment-audit · paper-claim-audit · citation-audit · kill-argument) outward — auditing a third party's submission instead of your own.

Autoresearch has gone mainstream, and a fast-growing share of what reaches the review pile is machine-generated — and a lot of it doesn't hold up: tables that don't match the text, baselines that aren't there, open-sourced code that won't reproduce its own paper. Reviewers, area chairs, and honest authors increasingly need to verify that, not just suspect it.

Regardless of who or what wrote a paper, does the science hold together and reflect its own evidence? Anti-Autoresearch audits a submission for self-consistency and fabrication, and produces a span-anchored, reviewer-ready report. It is not an opaque AI-text classifier (no authorship probabilities, no "AI-written" verdict) and does not judge misconduct — it surfaces discrepancies a human reviewer should investigate. Separately, it lists transparent, itemized AI writing-style impressions in a quarantined, zero-verdict-weight section (a paper can be integrity-CLEAN while listing many), because reviewers react to them.


📰 News

  • v0.5 (2026-06) — Added the AIS track (AI Writing-Style Impressions): 13 transparent, itemized writing-style signals (defensive hedging, LLM phrasing tics, clause-then-formula walls, bullet/bold spam, invented codenames, single-style figures, …) reported in a separate, zero-verdict-weight section — a paper can be integrity-CLEAN_GIVEN_EVIDENCE while listing many. The 5 pure-style patterns moved out of family F into AIS. Taxonomy restructured to 46 integrity patterns (A–H) + 13 AIS + 2 advisory; new /ai-style-impressions skill; the adjudicator now provably excludes zero-weight findings from the verdict (regression-tested). These are transparent impressions, never an authorship verdict — we are not an opaque AI-text classifier.

  • v0.4 (2026-06): Taxonomy v0.4 with 51 hack-patterns across 8 families: A. Numeric self-consistency (within-table, table-vs-text, and delta arithmetic must agree) · B. Method & scope (the stated method or scope differs from what was actually done) · C. Baseline integrity (comparison baselines missing, weak, or unfair) · D. Experiment integrity (fake GT, phantom results, code ≠ numbers; needs code) · E. Citation integrity (fabricated, misattributed, retracted) · F. Presentation & surface signals (typesetting, writing style, figures) · G. Proof & derivation integrity (skipped proofs, circular arguments, invalid derivations) · H. Evaluation design & validity (data leakage, LLM-judge reliability, selective reporting; new). The deterministic eval gate grew 3→8 patterns (GRIM / GRIMMER / statcheck, plus a conservative defensive-hedge density screen); added CI, the eval-design-forensics skill, the HP-INVENTED-CODENAME surface pattern, and a prior-art acknowledgments section. Two more checkable self-consistency patterns, HP-ACRONYM-DRIFT (family B) and HP-UNDEFINED-NOTATION (family G), were distilled from a "vibe-paper tells" thread while refusing its pure-stylometry items (we are not a vibe classifier).

  • v0.1 (2026-06) — Initial release: reviewer-side integrity forensics for autoresearch / AI-Scientist papers. Ships the evidence ledger, deterministic adjudicator, and observability tiers. Not an AI-text detector.

🚀 Quickstart

Agent workflow (normal use)

Anti-Autoresearch runs as a Claude Code skill workflow — the Python tools are the deterministic spine inside that workflow, not the usual interface.

```bash
# 1) Install the skills + workflow (global, or pass a project's .claude/skills dir)
git clone https://github.com/wanshuiyin/Anti-Autoresearch.git
./Anti-Autoresearch/tools/install_anti_autoresearch.sh              # → ~/.claude/skills
# project-local instead: ./Anti-Autoresearch/tools/install_anti_autoresearch.sh ./.claude/skills

# 2) Wire the cross-model reviewer (end state: Claude Code exposes mcp__codex__codex)
claude mcp add codex -- codex mcp-server
claude mcp list

# 3) Audit a paper
claude
> /anti-autoresearch ~/papers/submission
```

The run writes REPORT.md + report.json + claims.json + per-skill *.findings.json into the paper directory. Put the code/result artifacts alongside the paper to unlock L2 checks; PDF/source-only runs are observability-limited by design.

Zero verdict weight — the AIS + advisory tracks (reported, never moves the verdict)

Three skills produce output that is reported but carries zero weight on the integrity verdict: the AIS writing-style track and the two advisory memos. They matter to a human reviewer (a style impression, the worst-case rejection paragraph, prior-art overlap), so the report shows them in their own section, but the deterministic verdict is driven only by the 46 integrity patterns. A paper can be CLEAN_GIVEN_EVIDENCE while listing many zero-weight items. /anti-autoresearch runs them automatically; to run one standalone, build the ledger first (next section) and invoke it like any auditor.

| Skill | What it writes |
| --- | --- |
| /ai-style-impressions (AIS · separate report section) | AI writing-style impressions: defensive hedging, LLM phrasing tics, clause-then-formula walls, bullet/bold spam, invented codenames, single-style figures |
| /adversarial-case-builder (memo, no verdict) | the single strongest evidence-bound rejection paragraph a hostile reviewer would write |
| /novelty-duplication-advisory (memo, no verdict) | prior-work overlap: trivial-combination ("stitching" / stapling) and duplicate-publication candidates, laid out for a human to weigh |

Single-skill use

Every auditor is also a standalone skill. The installer drops all of them plus the workflow, so you can run just the axis you care about. They share one contract, so run them in order:

```text
claude
# 1) Build the evidence ledger ONCE: it is the spine every auditor anchors to. Skip it and
#    any auditor stops with:  NO_LEDGER: claims.json not found. Run /evidence-ledger FIRST
> /evidence-ledger ~/papers/submission        # → claims.json + observability level (L0/L1/L2)

# 2) Then run any auditor below against that ledger → <skill>.findings.json
```

The verdict-bearing auditors — each takes the paper dir, reads the ledger, and proposes span-anchored findings the deterministic adjudicator turns into the verdict (the zero-weight AIS + advisory skills are in the section above):

| Skill | What it catches |
| --- | --- |
| /consistency-audit | the paper against itself: inflated / mismatched numbers, method & scope drift, appendix-vs-body contradictions |
| /citation-forensics | citations: hallucinated references, and real papers cited for a claim they don't make |
| /baseline-comparison-audit | the missing / weak / mistuned baselines hiding behind a "SOTA" or "outperforms" claim |
| /experiment-forensics (L2, needs code+results) | fake / derived ground truth, score self-normalization, phantom results, placeholder data, code output ≠ reported numbers |
| /proof-derivation-forensics (L1, needs LaTeX source) | the written proof: skipped obligations, circularity, invalid steps, symbol drift, smuggled assumptions |
| /eval-design-forensics | the evaluation's validity: train/test leakage, a conflicted or unvalidated LLM-judge metric, selective reporting (dropped conditions / switched metrics) |
| /presentation-signals (capped at minor → at most SOFT) | checkable surface tells: duplicate tables, leftover pipeline/template strings, LLM-generated figures, page-padding; context, never a verdict |

A single skill only proposes span-anchored findings — it never returns a verdict. To get one, feed the findings to the deterministic adjudicator (the python3 tools/adjudicate_findings.py … --ledger … command in the next section); the model never grades. Two more notes: consistency-audit, presentation-signals, and ai-style-impressions also write a *.deterministic.findings.json (works with no cross-model reviewer wired); and /anti-autoresearch runs every auditor above in one shot, adding ingest (arxiv-id / pdf → workdir + pdftotext), automatic observability, auto-selection of which auditors apply, and the final cross-dimension verdict + REPORT.md.

Deterministic core (CI / offline / zero-dependency)

This bypasses the agent layer and exercises only the eval-tested deterministic checks — use it for CI, regression tests, or environments with no cross-model reviewer (Python 3 stdlib, nothing to install):

```bash
# Prove the pipeline on clean + corrupted fixtures (the regression gate)
python3 eval/run_eval.py
#   clean / delta_inflate / dup_table / headline_inflate  → all PASS
#   injected-defect recall: 100% (7 deterministic patterns) · clean FP: none
python3 tests/test_adjudicator.py        # gate unit tests (the anti-slop guarantee)

# Or run the spine by hand on a real paper:
python3 tools/build_claim_ledger.py --paper-id mypaper --latex main.tex sections/*.tex \
    --observability-level 1 --out claims.json
python3 tools/check_numeric_consistency.py --ledger claims.json --out findings.json
python3 tools/adjudicate_findings.py --findings findings.json --ledger claims.json \
    --paper-id mypaper --observability-level 1 --out report.json --md REPORT.md
#   --ledger is REQUIRED: a finding must quote a verbatim ledger span or it fails closed to info.
```
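For CI scripts it can help to assemble the adjudicator call in one place. The sketch below only builds the argument list from the flags shown in the commands above; the helper name `adjudicate_command` and its defaults are illustrative assumptions, not part of the repository.

```python
def adjudicate_command(findings: str, ledger: str, paper_id: str,
                       level: int, out: str = "report.json",
                       md: str = "REPORT.md") -> list[str]:
    """Build the argv for tools/adjudicate_findings.py (hypothetical helper)."""
    # The README names observability levels L0, L1, and L2.
    if level not in (0, 1, 2):
        raise ValueError("observability level must be 0, 1, or 2")
    return [
        "python3", "tools/adjudicate_findings.py",
        "--findings", findings,
        "--ledger", ledger,  # required: findings must quote ledger spans
        "--paper-id", paper_id,
        "--observability-level", str(level),
        "--out", out,
        "--md", md,
    ]


cmd = adjudicate_command("findings.json", "claims.json", "mypaper", 1)
print(" ".join(cmd))
```

To execute it from the repository root, pass `cmd` to `subprocess.run(cmd, check=True)`.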

🎯 Why this exists

Machine-generated papers and reviews are now a measurable share of the literature, and the failure that matters for an area chair is rarely "was this text written by an LLM?" (a human can write a dishonest paper; an LLM can write an honest one). It is: does the paper contradict itself, and is it backed by its own evidence?

That is what autoresearch pipelines get wrong — they hallucinate local coherence: an abstract number that no table reports, a "16% improvement" that the operands say is 6%, a citation for a claim the cited paper never makes, a method described one way and evaluated another.

Those are checkable under a declared observability level. Concretely, taxonomy v0.5 names 46 integrity patterns across 8 families (numeric self-consistency · method / scope · baseline integrity · experiment integrity · citation integrity · presentation / surface signals · proof & derivation integrity · evaluation design & validity) — the repo's coverage vocabulary, not a detector benchmark — plus a 13-signal AI writing-style impression track (AIS) that carries zero verdict weight.

Shipped v0: the deterministic spine and the seven ✓-marked patterns (across the representative list below and the full catalog) are eval-tested; the other 39 integrity patterns are agent-layer contracts (a cross-model reviewer proposes span-anchored findings, the deterministic adjudicator scores or demotes them) — not bundled-eval detector claims.

The full catalog, with detection signals and false-positive cases, lives in the taxonomy. A representative ten (✓ = gated by the deterministic eval today):

  • HP-NUM-INFLATE — abstract says 85.3%, but Table 2 never gets past 84.7%. ✓
  • HP-DELTA-ERROR — a "16% improvement" from 73.1 to 78.0 is really 6.7%. ✓
  • HP-DUP-TABLE — two tables carry the identical ordered numbers — usually copy-paste padding. ✓
  • HP-METHOD-DRIFT — the method section says "no labels"; the eval quietly uses gold-label calibration.
  • HP-SCOPE-INFLATE — "comprehensive" turns out to be two datasets, one domain, maybe one seed.
  • HP-MISSING-BASELINE — SOTA is claimed while the obvious recent baseline never appears in the table.
  • HP-FAKE-GT — (L2) the "reference" targets are model outputs, then reported as ground truth.
  • HP-PHANTOM-RESULT — (L2) a headline number points at a result file or metric key that isn't there.
  • HP-PROOF-CIRCULARITY — (L1) the "proof" restates the claim in different words and calls it done — it proves nothing.
  • HP-CITE-HALLUC — the DOI / arXiv id / venue / author list simply doesn't exist.
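Two of the ✓-marked patterns above reduce to plain arithmetic. The following is a minimal sketch of the idea, not the repository's implementation: the function names, the 0.5-point tolerance, and the sample table values are all assumptions made for illustration.

```python
def relative_improvement_pct(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


def delta_mismatch(claimed_pct: float, baseline: float, new: float,
                   tol: float = 0.5) -> bool:
    """True when the claimed improvement disagrees with its own operands."""
    return abs(relative_improvement_pct(baseline, new) - claimed_pct) > tol


def duplicate_tables(tables: dict[str, list[float]]) -> list[tuple[str, str]]:
    """Pairs of tables that carry the identical ordered numbers."""
    first_seen: dict[tuple[float, ...], str] = {}
    pairs = []
    for name, values in tables.items():
        key = tuple(values)  # order matters: same numbers, same sequence
        if key in first_seen:
            pairs.append((first_seen[key], name))
        else:
            first_seen[key] = name
    return pairs


# HP-DELTA-ERROR: the "16% improvement" from 73.1 to 78.0
print(round(relative_improvement_pct(73.1, 78.0), 1))  # 6.7
print(delta_mismatch(16.0, 73.1, 78.0))                # True

# HP-DUP-TABLE on made-up tables
print(duplicate_tables({
    "Table 3": [71.2, 68.4, 80.1],
    "Table 5": [71.2, 68.4, 80.1],
    "Table 6": [80.1, 68.4, 71.2],
}))  # [('Table 3', 'Table 5')]
```

Note that the absolute gain here is 4.9 points, so neither reading of "improvement" supports 16%.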
The other 36 integrity patterns and the 13 AIS impressions, in full:

A · Numeric self-consistency

  • HP-AGG-DRIFT — they write "mean over seeds", but the number is really the best seed.
  • HP-DENOM-DRIFT — one table averages all tasks; the conclusion quietly uses the applicable-only subset.
  • HP-UNIT-DIR-MISMATCH — points silently become percent, or a lower-better metric is celebrated upward.
  • HP-CAPTION-MISMATCH — the caption promises N=5 and method B; the plot shows neither.
  • HP-APPENDIX-CONTRA — the appendix reruns the same quantity and disagrees with the main text.
  • HP-GRANULARITY-IMPOSSIBLE — "84.7% on 500 items" is arithmetically impossible — no integer k/500 rounds to it (GRIM). ✓
  • HP-VARIANCE-IMPOSSIBLE — a reported SD bigger than a bounded metric can have at that mean (e.g. SD 18% at mean 98% — cap ≈15.7%). ✓
  • HP-STAT-INCONSISTENCY — the reported p contradicts its own test statistic and overstates significance ("z=1.10, p=.036" → really p≈.27). ✓

B · Method & scope

  • HP-ABLATION-ATTRIB — they credit component X, but every ablation keeps X bundled with Y.
  • HP-THEOREM-SCOPE-DRIFT — the abstract sells a general theorem; the assumptions do nearly all the work.
  • HP-ARGUMENT-CHAIN-BREAK — a substantive missing link: the problem motivated isn't the one the method addresses, or the experiments measure something the mechanism doesn't predict.
  • HP-CAUSAL-EVIDENCE-LEAP — a causal / equivalence relation is concluded that no experiment in the paper actually varies or tests.
  • HP-RESOURCE-IDENTITY-MISMATCH — a named dataset/model/benchmark described with a property its public record contradicts ("ImageNet-1k, 5,000 classes" — it's 1,000).
  • HP-ACRONYM-DRIFT — the same load-bearing component/term gets two incompatible names or acronym expansions across the paper.

C · Baseline integrity

  • HP-WEAK-BASELINE — the new method gets tuning and compute the baseline plainly did not.
  • HP-SIG-OVERLAP — "outperforms" by crumbs, with overlapping error bars or no seeds shown.

D · Experiment integrity (needs code/results — L2)

  • HP-SELF-NORM — (L2) the score nears 1.0 because it's divided by the model's own max.
  • HP-DEAD-METRIC — (L2) a metric function exists with no call site and no result, yet is discussed.
  • HP-SUSPICIOUS-REGULARITY — (L2) rows differ by a suspiciously clean offset — check the files before calling it fake.
  • HP-PLACEHOLDER-DATA — (L2) released code still ships placeholder/dummy/fake data feeding a reported figure or number.
  • HP-RESULT-ARTIFACT-MISMATCH — (L2) the released code / artifacts, run as written, produce numbers different from the paper's.
  • HP-MISSING-REPRO-ARTIFACT — (L2) an empirical paper ships neither code nor the prompts/configs its results depend on.

E · Citation integrity

  • HP-CITE-CONTEXT — real paper, wrong job: cited for a claim it explicitly doesn't make (incl. semantic-hallucination + a support/contrast/mention intent label).
  • HP-CITE-RETRACTED — a load-bearing citation that resolves to a retracted paper, with no note of the retraction (Crossref / Retraction Watch).

F · Presentation & surface signals (capped at minor — never a verdict)

  • HP-THIN-FLOAT — a "broad empirical study" somehow has two tables and one lonely figure.
  • HP-LLM-FIGURE — the "figure" is decorative model art, not a plot or a real diagram.
  • HP-PAGE-PADDING — oversized floats, repeated text, or empty prose doing page-count labor.
  • HP-PIPELINE-ARTIFACT — a leftover pipeline/template string ("As an AI language model", "regenerate response", "[INSERT X]") survives into the finished text. ✓ (exact-match, low-FP)
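An exact-match screen of the kind HP-PIPELINE-ARTIFACT describes could look like the sketch below. The three strings come from the bullet above; treating "[INSERT X]" as a bracketed template with any filler, and the function name itself, are assumptions for illustration.

```python
import re

# Leftover pipeline/template strings named in the pattern description.
ARTIFACT_PATTERNS = [
    re.compile(r"as an ai language model", re.IGNORECASE),
    re.compile(r"regenerate response", re.IGNORECASE),
    re.compile(r"\[INSERT [^\]\n]+\]"),  # assumed reading of "[INSERT X]"
]


def find_pipeline_artifacts(text: str) -> list[tuple[int, str]]:
    """Return (line number, matched string) for every leftover string."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in ARTIFACT_PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((lineno, match.group(0)))
    return hits


sample = (
    "We evaluate on three datasets.\n"
    "As an AI language model, I cannot access the results.\n"
    "The gain is [INSERT NUMBER] points.\n"
)
print(find_pipeline_artifacts(sample))
# [(2, 'As an AI language model'), (3, '[INSERT NUMBER]')]
```

Reporting the line number alongside the match keeps each hit anchored to a quotable span, in line with the span-anchoring rule the rest of the toolkit follows.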

G · Proof & derivation integrity (verdict-bearing at L1 — from the written math)

  • HP-PROOF-OBLIGATION-GAP — (L1) a required lemma / case / transition is skipped with "clearly" across a real gap.
  • HP-DERIVATION-INVALID — (L1) an algebra / probability / calculus step does not follow (a misapplied inequality, a wrong limit).
  • HP-SYMBOL-SEMANTIC-DRIFT — (L1) a symbol / operator / inequality direction changes meaning between definition, formula, and proof.
  • HP-ASSUMPTION-SMUGGLE — (L1) the proof relies on an assumption (independence, convexity, …) the theorem statement never lists.
  • HP-UNDEFINED-NOTATION — (L1) a load-bearing symbol is used in a key equation/proof but never defined and not inferable from standard convention.

H · Evaluation design & validity (L0/L1 stated → L2 confirmed)

  • HP-EVAL-LEAKAGE — train/test leakage (preprocess-before-split, duplicates across splits, temporal leak, pretraining contamination) means the score may not measure generalization. Adopts the Kapoor–Narayanan leakage taxonomy.
  • HP-JUDGE-VALIDITY — the load-bearing metric is an LLM judge that's conflicted (same family as a compared system) or unvalidated (no human-agreement check).
  • HP-SELECTIVE-REPORTING — a condition the setup declared (a dataset / baseline / metric / seed-count) is dropped from the results, or the metric is switched to favor the method.

AIS · AI writing-style impressions (zero verdict weight — reported in a separate section, NEVER move the verdict; impressions, not integrity findings)

  • AIS-NARRATIVE-ARC-BREAK — abrupt 1–2¶ intro / dump-like abstract; no background → contribution → evidence arc.
  • AIS-LLM-PHRASE-TICS — LLM phrasing tics ("it is worth noting", "not only … but also", clichéd em-dash/semicolon, flowery adverbs).
  • AIS-DEFENSIVE-HEDGE — pervasive "we do not claim … / not X but rather Y" instead of stating what was done (deterministic density screen).
  • AIS-JARGON-STUFF — dense term-stuffing with no surrounding content.
  • AIS-INVENTED-CODENAME — an undefined, internal-flavored run/experiment codename used as if defined.
  • AIS-CLAUSE-FORMULA-WALL — a short clause then a wall of formulas, repeated, no connective prose.
  • AIS-GRATUITOUS-PSEUDOCODE — pseudocode that just restates the prose / adds no operational content.
  • AIS-BULLET-LIST-OVERUSE — sequential logic flattened into parallel-looking bullets.
  • AIS-BOLD-MODULE-SPAM — verbose module names with excessive bolding.
  • AIS-RESTATE-OVERCLAIM — a rhetorical restatement loop ("we propose an X …" repeated).
  • AIS-FOCUS-DRIFT — high-level motivation pivots to a minor implementation detail.
  • AIS-SINGLE-STYLE-FIGURES — figures share a generic generated visual grammar.
  • AIS-APPENDIX-DUMPING-GROUND — the appendix reads like an unintegrated AI-trace dump.
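The deterministic density screen behind AIS-DEFENSIVE-HEDGE can be pictured as a template count normalized by length. The sketch below is an assumption-laden illustration: the three templates are phrases quoted elsewhere in this README, while the function name and the per-1,000-words unit are hypothetical, and no flagging threshold is implied.

```python
import re

# Hedge templates quoted in this README; the shipped list may differ.
HEDGE_TEMPLATES = [
    r"\bwe do not claim\b",
    r"\bthis is not to say\b",
    r"\bthis should not be taken to mean\b",
]


def hedge_density(text: str) -> float:
    """Defensive-hedge template hits per 1,000 words."""
    words = re.findall(r"\b\w+\b", text)
    if not words:
        return 0.0
    hits = sum(len(re.findall(pattern, text, flags=re.IGNORECASE))
               for pattern in HEDGE_TEMPLATES)
    return 1000.0 * hits / len(words)


sample = "We do not claim novelty. This is not to say the method fails."
print(hedge_density(sample))
```

Whatever the density, the result stays an impression: AIS findings are forced to info and never move the verdict.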

This is not hypothetical. Paraphrased from a public reviewer account during the NeurIPS 2026 cycle (illustrative, not a citation), one batch maps almost one-to-one onto the taxonomy this repo encodes:

  • Paper 1 — "data tables don't match the text; several rows are misaligned; there's an obvious add/subtract regularity across backbones — it doesn't look like it was actually run." → consistency · HP-SUSPICIOUS-REGULARITY
  • Paper 2 — "two tables fill a page and are identical; the one figure is LLM-generated; and it still didn't fill 9 pages." → HP-DUP-TABLE · presentation signals
  • Paper 3 — "formula derivations don't hold; the experiments look complete but the math can't give those results." → proof-derivation-forensics · HP-DERIVATION-INVALID
  • Paper 4 — "open-sourced, beautifully written and drawn — but I ran the code and it gives completely different results from the paper." → experiment-forensics (L2)

The fourth case is this repo's thesis in one line: surface polish is not integrity.

🔒 How it stays honest (the anti-"LLM-slop" design)

The obvious dismissal of any such tool is "an LLM grading another LLM's paper is just noise." Three structural defenses, not just a disclaimer:

  1. Evidence ledger. One deterministic pass turns the paper into claims.json — span-anchored, hashed claims. Every finding must cite a claim_id + verbatim span. No span → it cannot be a high-severity finding.
  2. The LLM never grades. Auditors propose findings; a deterministic adjudicator (tools/adjudicate_findings.py, pure rules) computes the verdict. Same findings → same verdict, with no model in the final decision.
  3. Observability levels. A run declares what it could see (L0 PDF-only → L2 repo+results); findings that need code are auto-demoted on a PDF-only run. You can never shout "fraud" from a PDF. See references/observability-levels.md.
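The first and third defenses can be sketched as a single gate applied to each proposed finding. This is a minimal illustration, not tools/adjudicate_findings.py: the `Finding` fields, the "major" severity label, the ledger shape (claim id → claim text), and demoting under-observed findings all the way to info are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    pattern: str
    severity: str        # "info" | "minor" | "major" (hypothetical scale)
    claim_id: str
    span: str            # verbatim quote the finding relies on
    needs_level: int = 0  # lowest observability level that can support it


def gate(finding: Finding, ledger: dict[str, str], run_level: int) -> Finding:
    """Fail closed: unanchored or under-observed findings drop to info."""
    claim_text = ledger.get(finding.claim_id)
    anchored = (bool(finding.span) and claim_text is not None
                and finding.span in claim_text)
    if not anchored or finding.needs_level > run_level:
        return replace(finding, severity="info")
    return finding


ledger = {"c1": "Our method reaches 85.3% accuracy on the benchmark."}
anchored = Finding("HP-NUM-INFLATE", "major", "c1", "85.3% accuracy")
paraphrased = Finding("HP-NUM-INFLATE", "major", "c1", "about 85 percent")
needs_code = Finding("HP-FAKE-GT", "major", "c1", "85.3% accuracy", needs_level=2)

print(gate(anchored, ledger, run_level=0).severity)     # major
print(gate(paraphrased, ledger, run_level=0).severity)  # info
print(gate(needs_code, ledger, run_level=0).severity)   # info
```

Because the gate is a pure function of the findings, the ledger, and the declared level, the same inputs always give the same outcome.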

Surface signals and AI writing-style impressions have firewalls. Family-F surface tells (duplicate tables, LLM-generated figures, page-padding, leftover pipeline strings) are reported only as high-false-positive context: the adjudicator hard-caps them at minor (SURFACE_ONLY_SKILLS / SURFACE_PATTERNS), so they reach at most SOFT_FLAGS. The pure AI writing-style impressions (the AIS track — defensive hedging, LLM phrasing tics, …) go further: they carry zero verdict weight — forced to info, excluded from overall_verdict / counts / dimensions, and rendered in a separate "NOT integrity" section. A paper can be CLEAN_GIVEN_EVIDENCE while listing many. Both caps are enforced in code (_is_zero_weight + the weight-1-only verdict in tools/adjudicate_findings.py), not just promised — and regression-tested.
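The two firewalls amount to a severity cap and a weight filter. The sketch below shows one way such rules could be written; it borrows the names SURFACE_PATTERNS and "zero weight" from the paragraph above, but the set contents, the severity scale, and both functions are illustrative assumptions rather than the adjudicator's actual code. The advisory memos, which are also zero-weight, are left out for brevity.

```python
SEVERITY_RANK = {"info": 0, "minor": 1, "major": 2}  # hypothetical scale

# Family-F surface tells named in this README.
SURFACE_PATTERNS = {
    "HP-DUP-TABLE", "HP-LLM-FIGURE", "HP-PAGE-PADDING",
    "HP-PIPELINE-ARTIFACT", "HP-THIN-FLOAT",
}


def is_zero_weight(pattern: str) -> bool:
    """AIS impressions never count toward the verdict."""
    return pattern.startswith("AIS-")


def effective_severity(pattern: str, severity: str) -> str:
    """Apply the caps: AIS forced to info, surface tells capped at minor."""
    if is_zero_weight(pattern):
        return "info"
    if pattern in SURFACE_PATTERNS and SEVERITY_RANK[severity] > SEVERITY_RANK["minor"]:
        return "minor"
    return severity


def max_weighted_severity(findings: list[tuple[str, str]]) -> str:
    """Highest capped severity among findings that carry verdict weight."""
    weighted = [effective_severity(p, s) for p, s in findings
                if not is_zero_weight(p)]
    return max(weighted, key=SEVERITY_RANK.__getitem__, default="info")


print(effective_severity("HP-DUP-TABLE", "major"))         # minor
print(effective_severity("AIS-LLM-PHRASE-TICS", "major"))  # info
print(max_weighted_severity([("AIS-DEFENSIVE-HEDGE", "major"),
                             ("HP-DUP-TABLE", "major")]))  # minor
```

In the README's vocabulary, a run whose weighted findings top out at minor reaches at most SOFT_FLAGS, and a run with only AIS impressions can still be CLEAN_GIVEN_EVIDENCE.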

And an eval harness (eval/) proves the deterministic core on clean and synthetically corrupted fixtures on every change: measured false-positive and recall rates, not vibes.

🏗️ Architecture

```text
input (pdf | pdf+latex | pdf+repo+results)
   │
   ▼  [evidence-ledger]  artifact_manifest.json (+ observability level) + claims.json   ← deterministic
   │
   ▼  fan out auditors (each reads the ledger, emits span-anchored findings):
        consistency-audit          flagship · paper vs itself · ARIS paper-claim-audit
        citation-forensics         exists? correct? right context? · ARIS citation-audit
        baseline-comparison-audit  missing/weak/mistuned baselines · ARIS paper-claim-audit
        experiment-forensics       L2: fake GT / self-norm / phantom · ARIS experiment-audit
        proof-derivation-forensics L1: proof gap / circularity / invalid step · verdict-bearing · ARIS proof-checker
        eval-design-forensics      L0/L1: data leakage / conflicted-or-unvalidated LLM judge / selective reporting
        presentation-signals       checkable surface tells · auxiliary, capped at minor
        ai-style-impressions       AI writing-style impressions · ZERO verdict weight · separate section
        adversarial-case-builder   evidence-bound memo, no verdict · ARIS kill-argument
        novelty-duplication-advisory  prior-work overlap memo, no verdict · ARIS novelty-check
   │
   ▼  [adjudicate_findings.py]  rules, not a model → REPORT.md + report.json            ← deterministic
```

| Path | What |
| --- | --- |
| skills/ | the eleven auditor / impression skills (LLM proposes findings, span-anchored) |
| workflows/anti-autoresearch/ | the end-to-end orchestrator |
| tools/ | deterministic spine: manifest/observability derivation · ledger builder · numeric checks · adjudicator |
| schemas/ | JSON contracts: claims · finding · report · artifact manifest |
| references/ | hack-pattern taxonomy (the core contribution) · observability levels · reviewer independence · forensics contract |
| eval/ | clean + synthetic-corruption fixtures + the regression harness |
| tests/ | gate unit tests for the adjudicator (the anti-slop invariants) |
| docs/ | positioning vs existing work · limitations |

⚠️ Honest limitations

  • Forensics ≠ proof of misconduct. Output is flags for a human, never an accusation.
  • PDF-only (L0) catches inconsistency + tells, not all fabrication — it cannot verify external ground truth or run code.
  • False positives exist (legitimate round numbers, single-seed pilots, deliberate scope choices). Hence tiers, FP-risk tags, and the eval harness — not a binary "guilty".
  • The taxonomy is a living document. Adversaries who know a signal can route around it; this is a safety net, not a guarantee. See docs/limitations.md.

🧬 Provenance: derived from ARIS

ARIS — Auto Research in Sleep is an AI research-agent skill platform that runs end-to-end research pipelines (literature → idea → experiment → paper) — and does so with integrity guardrails built in, which is what makes it a credible base for the auditor:

  • 🛡️ A three-layer audit stack keeps ARIS's own output honest: experiment-audit (fake GT / score-normalization / phantom results), result-to-claim (is the claim scientifically supported?), and zero-context paper-claim-audit + citation-audit (do the reported numbers and references hold up?). Anti-Autoresearch is these same audits pointed outward.
  • 🔬 Cross-model adversarial review is the core doctrine: the executor and the reviewer must be different model families, so no LLM ever judges its own output. Anti-Autoresearch inherits this and hardens it — here the model only proposes findings; a deterministic adjudicator decides.

Two sides of one coin. ARIS is how to do autoresearch responsibly; Anti-Autoresearch is how to flag autoresearch that wasn't. A generator that publishes its own audit stack knows precisely how these pipelines fail — because it engineered against those failures from the inside. That is the perspective this repo brings.

👉 ARIS main repo: https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep

How the skills map. Anti-Autoresearch's skills are ARIS's audit skills, copied and reframed for a third party auditing an unknown submission rather than an author checking their own work: consistency-audit ← paper-claim-audit, experiment-forensics ← experiment-audit, citation-forensics ← citation-audit, baseline-comparison-audit ← paper-claim-audit, proof-derivation-forensics ← proof-checker, adversarial-case-builder ← kill-argument, novelty-duplication-advisory ← novelty-check, plus the new evidence-ledger spine, presentation-signals, eval-design-forensics, and the zero-weight ai-style-impressions (the AIS track).

🤝 Prior art & acknowledgments

Anti-Autoresearch's design borrows ideas — and in places, taxonomy structure — from a body of integrity, reproducibility, and evaluation-hygiene work that predates it. We credit it explicitly. Taxonomies and ideas are adapted with credit; no external code is vendored — where a tool is GPL/AGPL we reimplemented the method from its paper rather than copying its source, and where a tool is proprietary we credit the concept only.

Deterministic self-consistency (the closest methodological cousins).

  • statcheck — Nuijten & Epskamp. Recomputes reported NHST p-values from their test statistics; the canonical "paper against itself" check. (GPL-3 — method reimplemented from the paper, source not vendored.) Informs HP-STAT-INCONSISTENCY.

  • GRIM / GRIMMER — Brown & Heathers (GRIM); Anaya (GRIMMER). Tests whether reported means/SDs are arithmetically possible for the stated sample size. Informs HP-GRANULARITY-IMPOSSIBLE / HP-VARIANCE-IMPOSSIBLE.

  • scrutiny — Jung. An R toolkit packaging GRIM/GRIMMER-style consistency tests. (MIT.)

    These inform taxonomy family A (numeric self-consistency); tools/check_stat_consistency.py is an independent, pure-stdlib reimplementation.
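The statcheck idea is easy to illustrate for a z statistic using only the standard library. This is a sketch of the method, not the contents of tools/check_stat_consistency.py; the function names and the 0.005 tolerance are assumptions. The two-sided p-value for a standard normal is erfc(|z| / √2).

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_inconsistent(z: float, reported_p: float, tol: float = 0.005) -> bool:
    """True when the reported p disagrees with the p recomputed from z."""
    return abs(two_sided_p_from_z(z) - reported_p) > tol


# The HP-STAT-INCONSISTENCY example: "z=1.10, p=.036"
print(round(two_sided_p_from_z(1.10), 2))  # 0.27
print(p_inconsistent(1.10, 0.036))         # True
```

A fuller version would also cover t, F, and chi-square statistics and one-sided tests, which need the corresponding distribution functions.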

Evaluation integrity & LLM-judge validity (why the model never grades).

  • Leakage taxonomy: Kapoor & Narayanan, Leakage and the Reproducibility Crisis in ML-based Science. The priority reference behind the family-H eval-integrity patterns (HP-EVAL-LEAKAGE adopts its taxonomy).
  • LLM-as-judge validity: Zheng et al. (judging LLM-as-a-judge), Panickssery et al. (evaluators favor their own generations), Wang et al. (position / identity bias). Precisely why our adjudicator is deterministic and the model only proposes findings.
  • "Show Your Work": Dodge et al. Reporting-hygiene discipline behind the selective-reporting checks (HP-SELECTIVE-REPORTING).

Citation & claim integrity.

  • Retraction Watch — the retraction-record project; conceptual basis for citation-status awareness.
  • Problematic Paper Screener — Cabanac, Labbé, Magazinov. Corpus-scale screening for tell-tale strings; the basis for HP-PIPELINE-ARTIFACT.
  • scite — supporting / contrasting citation context. (Proprietary — conceptual credit; informs HP-CITE-CONTEXT.)
  • SciFact — Wadden et al. Scientific-claim verification dataset/model behind the claim–evidence framing.
  • Fabricated-citation taxonomy — Ansari. Informs HP-CITE-HALLUC / HP-CITE-CONTEXT.

Reproducibility detection (prior art for planned L2 repro patterns — not yet implemented).

  • ODDPub — Riedel et al. Detects open-data / open-code statements. (AGPL-3 — conceptual/method prior art; no code vendored; any implementation will be independent.)
  • RTransparent — Serghiou et al. Large-scale data/code-sharing transparency detection. (GPL-3 — same.)
  • SciScore — automated methods-rigor / reproducibility checker. (Proprietary — conceptual credit only.)

Author-side self-audit (evaluated, scope-bounded).

  • academic-integrity-skill — 1anj. An author-side, wet-lab/biomedical self-audit skill (image-forensics-heavy). We evaluated its non-image deterministic screens — raw-data terminal-digit / exact-duplicate forensics (last-digit / Benford tradition) and reported-vs-source reconciliation — and adopt neither as a check: digit forensics need raw per-sample tables our reviewer-side tiers rarely see, and reported-vs-source overlaps family D. Credited as prior art and the author-side counterpart to this reviewer-side toolkit. (MIT.)
  • anti-defensive-writing — Kiterlin. An author-side Codex skill that revises defensive writing (removes unnecessary caveats/hedges, strengthens prose). The clean dual of our AIS-DEFENSIVE-HEDGE: they fix it for the author, we flag it for the reviewer (zero-weight). We cross-referenced its discouraged-construction list to extend our deterministic hedge templates ("this is not to say", "this should not be taken to mean", "rather than arguing X, we argue Y"). (MIT.)

Anti-Autoresearch's own contribution is not any single one of these checks but their combination into an autoresearch-specific taxonomy under a deterministic adjudicator and explicit observability tiers (see docs/positioning.md).

🔭 Related projects

Where Anti-Autoresearch sits relative to neighboring tools (stars / last update as gathered 2026-06-27; not a ranking).

| Project | Stars | Updated | Relation to Anti-Autoresearch |
| --- | --- | --- | --- |
| SakanaAI/AI-Scientist | 14.1k | 2025-12 | A generator whose output we audit: the class of pipeline this repo is built to check. |
| karpathy/autoresearch | 88.8k | 2026-03 | A generator whose output we audit; the namesake of the failure surface. |
| scienceverse/metacheck | 45 | 2026-06 | Closest cousin: modular deterministic paper checks. We add an autoresearch taxonomy + observability tiers + cross-model proposers. |
| MicheleNuijten/statcheck | 189 | 2026-03 | Deterministic self-consistency (NHST p-values): narrow, but exactly our spirit; informs family A. |
| lhdjung/scrutiny | 8 | 2026-05 | GRIM/GRIMMER consistency tests (R); same deterministic-self-consistency family. |
| allenai/scifact | 265 | 2023-10 | Claim verification against evidence: the claim–evidence framing, applied to external literature rather than the paper's own. |
| DEFENSE-SEU/FactReview | 70 | 2026-06 | Closest framing neighbor: an LLM reviewer that audits empirical claims and makes no accept/reject call. Differs by grounding against external literature + executing the repo (an L3 move we refuse) and model-produced claim statuses, vs our deterministic self-consistency + observability taxonomy. (AGPL-3.0) |
| 1anj/academic-integrity-skill | 51 | 2026-05 | Closest sibling skill, mirror stance: an author-side wet-lab/biomedical self-audit skill (image-forensics-heavy: blot/microscopy/flow, which we exclude). Authors self-check pre-submission vs us auditing third-party autoresearch output; its numeric/citation screens run on the author's raw data tables, not reviewer-side PDF/LaTeX. (MIT) |
| Kiterlin/anti-defensive-writing | 10 | 2026-06 | Author-side dual of our AIS-DEFENSIVE-HEDGE: a Codex skill that revises defensive writing (caveats/hedges → direct, claim-forward prose). They fix it pre-submission; we flag it reviewer-side at zero verdict weight. We cross-referenced its discouraged-construction list to extend our hedge templates. (MIT) |
| ahans30/Binoculars | 390 | 2024-05 | AI-text detector, which is what we are NOT: it answers "was this LLM-written?", a question orthogonal to integrity. |
| baoguangsheng/fast-detect-gpt | 414 | 2026-02 | AI-text detector, same boundary; stylometry ≠ integrity. |

A few framing-relevant efforts have no open repository and are credited by name only: Pangram, GPTZero, and the Problematic Paper Screener.

💬 Community

The taxonomy grows with the community. Caught an autoresearch / AI-Scientist paper pulling a trick that isn't in the pattern catalog yet? That is the single most valuable contribution here — open an issue with the concrete example, or send a PR adding the pattern (with an eval fixture + a false-positive case so it doesn't over-fire). New auditor skills, adjudicator gates, and corruption fixtures are just as welcome. CONTRIBUTING.md explains how a pattern is structured and the honesty rules every flag must follow (describe a checkable discrepancy, never impute misconduct or authorship).

Join the WeChat group (shared with the ARIS community) to swap autoresearch failure modes:

WeChat group QR code (shared with the ARIS community)

(The group QR rotates weekly — if it's expired, open an issue and we'll post a fresh one.)

📖 Citation

Anti-Autoresearch is derived from ARIS and reuses its audit DNA. If this repository helped your research / paper / review, please cite the ARIS methodology paper:

```bibtex
@article{yang2026aris,
  title={ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration},
  author={Yang, Ruofeng and Li, Yongcan and Li, Shuai},
  journal={arXiv preprint arXiv:2605.03042},
  year={2026}
}
```

⚖️ License

MIT — see LICENSE.
