Release v4.2.1 — Honest Reframing: Simulation is Prior, Not Result · zakky8/llm-jailbreak-taxonomy

v4.2.1 — Honest Reframing: Simulation is Prior, Not Result

An adversarial peer-review pass of v4.2.0 (conducted as part of internal QA before any external review) identified four critical methodological errors in how simulation outputs were being presented. All four are real. This release retracts those claims publicly rather than silently rewriting them.

Retracted from v4.2.0

| Claim | Why it was wrong |
| --- | --- |
| "Mean ASR" labels on simulation outputs presented as findings | The simulation restates a hand-tuned prior (`MODEL_BASE_ASR`, `CATEGORY_MULTIPLIERS` in `evaluate_phase2b.py`). Running it under different seeds restates the prior; it does not measure model behaviour. |
| "95% bootstrap CIs" on the seed-mean ranges | `ci95()` in `scripts/multi_seed.py` returned the min/max of seed means, not bootstrap CIs. The function is renamed to `seed_range()` and all docs are corrected. |
| "Claude Opus 4-8 produces zero Tier-3 outcomes — the simulation's most testable prediction" | Arithmetic floor: the severity-3 gate requires `effective_prob > 0.9`, but the maximum for Opus is `effective_prob = 0.07 × 9.0 = 0.63`. A Tier-3 outcome is impossible by construction. Reframed as a property of the parameterization, not a prediction. |
| "Cross-model differences statistically significant for 5 of 10 categories" | Cochran's Q requires matched subjects. The simulation produces independent random draws per (model, pattern, trial). The test is computable, but its p-values are not interpretable on simulated data. |

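The first and third retractions can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 0.07 base rate, the 9.0 multiplier, and the 0.9 gate are the values quoted in the table above, while the function names, the cap at 1.0, the Bernoulli draw, and the trial count are assumptions made for this example and are not taken from `evaluate_phase2b.py`.

```python
import random

# Values quoted in the retraction table above.
OPUS_BASE_ASR = 0.07    # base rate for the model in question
MAX_MULTIPLIER = 9.0    # multiplier behind the quoted maximum
SEVERITY3_GATE = 0.9    # severity-3 requires effective_prob > 0.9


def effective_prob(base_asr: float, multiplier: float) -> float:
    """Hypothetical stand-in: base rate scaled by a multiplier, capped at 1."""
    return min(1.0, base_asr * multiplier)


def can_reach_severity3(base_asr: float, max_multiplier: float) -> bool:
    """True only if the largest attainable probability clears the gate."""
    return effective_prob(base_asr, max_multiplier) > SEVERITY3_GATE


def simulated_asr(prob: float, trials: int, seed: int) -> float:
    """Fraction of successes over independent Bernoulli(prob) draws."""
    rng = random.Random(seed)
    return sum(rng.random() < prob for _ in range(trials)) / trials


if __name__ == "__main__":
    p_max = effective_prob(OPUS_BASE_ASR, MAX_MULTIPLIER)
    print(f"max effective_prob = {p_max:.2f}")  # 0.63
    print("severity-3 reachable:",
          can_reach_severity3(OPUS_BASE_ASR, MAX_MULTIPLIER))  # False
    # Different seeds only scatter around the value that was put in.
    for seed in range(3):
        print(seed, round(simulated_asr(p_max, 20_000, seed), 3))
```

Under these assumptions, the per-seed rates cluster around 0.63 because 0.63 is the input. Averaging them recovers the prior and says nothing about how a real model behaves, which is the reason the "Mean ASR" labels were retracted.
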
Where the retractions land

- `README.md`: Phase 2b section rewritten with a prominent disclaimer block
- `paper/research-paper.md`: "Headline empirical outputs" section retracted and rewritten
- `findings/v4_simulation_findings.md`: original 7 findings retracted; the document is now a retraction notice explaining what the simulation is and isn't
- `paper/anthropic_alignment_with_taxonomy.md`: editorial language removed; rewritten as a neutral comparison without insider-judgment framing
- `evaluate_phase2b.py`: module docstring leads with parameterized-risk-model framing
- `scripts/multi_seed.py`: `ci95` → `seed_range` (legacy alias preserved)
- `scripts/statistical_tests.py`: explicit assumption-violation caveats added
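
The `ci95` → `seed_range` rename separates two quantities that are easy to conflate. The following is a minimal sketch, not the contents of `scripts/multi_seed.py`: only the names `ci95` and `seed_range` come from this release. The function bodies, the deprecation warning on the alias, and the `bootstrap_ci95` helper are assumptions added for contrast.

```python
import random
import statistics
import warnings
from typing import Sequence, Tuple


def seed_range(seed_means: Sequence[float]) -> Tuple[float, float]:
    """Min and max of per-seed means: spread across seeds, not a CI."""
    return min(seed_means), max(seed_means)


def ci95(seed_means: Sequence[float]) -> Tuple[float, float]:
    """Legacy alias for old callers (the warning is an assumption)."""
    warnings.warn(
        "ci95() returns a seed range, not a 95% CI; use seed_range()",
        DeprecationWarning,
        stacklevel=2,
    )
    return seed_range(seed_means)


def bootstrap_ci95(
    samples: Sequence[float], n_resamples: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement, then take the 2.5th and 97.5th percentiles.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    seed_means = [0.41, 0.44, 0.39, 0.43, 0.42]  # made-up illustrative values
    print("seed range:  ", seed_range(seed_means))
    print("bootstrap CI:", bootstrap_ci95(seed_means))
```

The two functions answer different questions, so their outputs should not share a label. Even a correctly computed bootstrap interval over simulated draws would only quantify sampling noise around the hand-tuned prior, not uncertainty about model behaviour.
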

Why retract publicly instead of silently rewriting

- Audit trail. Anyone reading `git log` can see exactly what was claimed in v4.2.0 and what was retracted in v4.2.1.
- Research maturity signal. Self-correction under adversarial review is more credible than the appearance of never having erred.
- Pattern reuse. The same retraction discipline will apply when live Phase 2b data inevitably surfaces something different from the prior. Establishing the protocol now prevents drift later.

Unchanged

- 40-pattern taxonomy and mechanism-to-alignment-assumption mapping
- 17 cited papers (all direct-WebFetch verified in v4.0.1; that audit stands)
- Engineering infrastructure (PEP 621, Docker, CI on Python 3.10/3.11/3.12, 10/10 pytest)
- Phase 2b live harness (`evaluate_live.py`)
- Phase 3 defense framework spec
- Reproducibility checklist, Datasheet, Ethics statement

What this release strengthens

The Phase 2b live execution request becomes more clearly motivated, not less. The retractions make explicit that what v4.2.0 mislabeled as "findings" was a predicted shape derived from prior literature. A live run would produce the empirical data needed to confirm the prior, reject it, or surface novel structure.

That is the work the $1,000 credit allocation funds.
