Releases · safiqsindha/Ditto-V5

v5.0 — Phase D Closeout (2026-04-30)

Five-cell parallel violation-detection diagnostic on Claude Haiku 4.5 at n=1,200 chains/cell, two API calls per chain (paired baseline / intervention).

Headline result

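Four of five cells are significant after Bonferroni correction (per-cell threshold α/5), each by many orders of magnitude. Poker, already at ceiling at baseline, is the exception.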

Cell	Det@Base	Det@Int	Δ	95% CI	FP@Base	FP@Int	b	c	χ²	p_Bonf
pubg	75.9%	100.0%	+24.1%	[+21.8, +26.6]	0.0%	0.0%	0	289	287.0	1.1e-63
nba	57.4%	100.0%	+42.6%	[+39.8, +45.4]	0.8%	9.9%	0	511	509.0	5.2e-112
csgo	65.1%	98.1%	+32.9%	[+30.3, +35.7]	11.3%	29.8%	0	395	393.0	9.2e-87
rocket_league	0.2%	6.2%	+5.9%	[+4.7, +7.3]	0.0%	0.0%	0	71	69.0	4.9e-16
poker	100.0%	99.9%	-0.1%	[-0.25, 0.00]	0.5%	1.1%	1	0	0.0	1.000
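The release does not name the test behind the χ² column. The b, c, and χ² values are consistent with McNemar's test with continuity correction on discordant pairs (b = detected at baseline only, c = detected under intervention only), with p_Bonf equal to the raw p-value multiplied by the five cells and capped at 1. A minimal sketch under that assumption, using only the counts from the table (function names are illustrative and are not taken from the repository):

```python
import math

def mcnemar_cc(b: int, c: int) -> float:
    """McNemar chi-square with continuity correction from discordant pair counts."""
    n = b + c
    if n == 0:
        return 0.0
    return max(abs(b - c) - 1, 0) ** 2 / n

def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def bonferroni(p: float, m: int) -> float:
    """Bonferroni-adjusted p-value for m comparisons, capped at 1."""
    return min(1.0, p * m)

# (b, c) per cell, copied from the headline table.
cells = {
    "pubg": (0, 289),
    "nba": (0, 511),
    "csgo": (0, 395),
    "rocket_league": (0, 71),
    "poker": (1, 0),
}

results = {}
for name, (b, c) in cells.items():
    chi2 = mcnemar_cc(b, c)
    p_bonf = bonferroni(chi2_sf_df1(chi2), len(cells))
    results[name] = (chi2, p_bonf)
    print(f"{name:14s} chi2={chi2:6.1f}  p_bonf={p_bonf:.1e}")
```

Under these assumptions the script reproduces the χ² and p_Bonf columns to the precision shown in the table, which suggests (but does not confirm) that this is how they were computed.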

4-tier representational hierarchy

Tier	Cells	Anchored	Observable	Unary-reducible	Profile
0 Saturated	poker	✓	✓	✓	rule pre-internalized; no lift possible
1 Aligned	pubg, nba	✓	✓	✓	clean intervention lift to perfect detection
2 Partial	csgo	✓	✗ (bomb sites)	✓	high lift, elevated FP — confabulation signature
3 Misaligned	rocket_league	✓	✗ (positions)	✗	tiny but real lift; strict grounding suppresses confabulation

Defensible claim

Constraint reasoning in LLMs is gated by representational alignment between the rule and the observable event structure. It succeeds when violations reduce to observable unary predicates over event streams; it degrades predictably when some required variables are unobservable; and under strict grounding it is almost entirely suppressed when required variables are absent (rocket_league detection rises only from 0.2% to 6.2%).
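To make "observable unary predicate" concrete, here is a toy sketch. The event schema, the field names (`type`, `site`), and both function names are hypothetical and are not taken from the Ditto-V5 pipeline; the rule text is the CSGO rule quoted in the release ("Bomb plants only at sites A or B"). The predicate inspects one event at a time (unary) and reports `None` when the field it needs is not present in the rendered event, which is the situation the tier table marks as not observable.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Event = Dict[str, object]  # hypothetical event record: field name -> value

def violates_bomb_site_rule(event: Event) -> Optional[bool]:
    """Unary predicate over a single event.

    Returns True for a violation, False for a compliant or irrelevant event,
    and None when the field the rule depends on is missing from the event.
    """
    if event.get("type") != "bomb_plant":
        return False
    site = event.get("site")
    if site is None:
        return None  # rule cannot be evaluated from what is observable
    return site not in {"A", "B"}

def scan(
    events: Iterable[Event],
    predicate: Callable[[Event], Optional[bool]],
) -> Tuple[List[int], List[int]]:
    """Apply a unary predicate to each event independently.

    Returns (indices of violations, indices where the rule was unevaluable).
    """
    violations: List[int] = []
    unevaluable: List[int] = []
    for i, event in enumerate(events):
        verdict = predicate(event)
        if verdict is None:
            unevaluable.append(i)
        elif verdict:
            violations.append(i)
    return violations, unevaluable

# Hypothetical chain: one valid plant, one invalid plant,
# one plant with the site not surfaced, one unrelated event.
chain = [
    {"type": "bomb_plant", "site": "A"},
    {"type": "bomb_plant", "site": "mid"},
    {"type": "bomb_plant"},
    {"type": "kill"},
]
violations, unevaluable = scan(chain, violates_bomb_site_rule)
print(violations, unevaluable)
```

A deterministic checker can return "unevaluable" for the third event. The claim above is about what a model does in that position: either it declines to flag (strict grounding) or it asserts a violation it cannot see (confabulation).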

Mechanistic evidence (Layer-2 CoT)

NBA residual FPs (50/119 analyzed): 92% cite the 24-second shot clock — second-order shot-clock confabulation despite D-44's time_in_possession_s rename.
CSGO residual FPs (50/358 analyzed): 80% cite "Bomb plants only at sites A or B" — textbook constraint-triggered confabulation when the predicate variable isn't surfaced in the rendered chain.

The model fails exactly where the rule's required variables are absent — strongest evidence in v5 that the failure mode is representational, not reasoning-bound.

Total spend

~$5 across the entire experiment (Phase D + crash recovery + prior diagnostics).

What's next

v5.1: cross-model replication via OpenRouter (~11 models, n=300/cell, derived-state-marker ablation as second axis). Scoped separately.
v5.2: awpy CSGO + per-event RL extraction. Run after cross-model lands.

Documents

STATUS.md — end-state status
MEMO.md — internal memo + bridge document for the eventual V1–V5.1 arXiv preprint
DECISION_LOG.md — D-0 through D-45, every methodology decision
RESULTS/phase_d_final.json — locked headline numbers
RESULTS/phase_d_raw_batches/ — 23,998 archived raw API responses (~6 MB)
RESULTS/phase_d_cot_residual_fps.json — CoT mechanistic evidence

Known issues

9 pre-existing test failures in Fortnite + CSGO pipeline mocks (#5). Pre-date this release, unrelated to v5 results.
