
v0.next: timeline rail + per-turn cost + Codex support + Opus pricing fix #7

Open
lroolle wants to merge 14 commits into main from v0.next

Conversation


lroolle commented Apr 14, 2026

Closes the v0.next — complete product pass milestone (#6).

Summary

Twelve commits that ship the five v0.next issues, plus a fix for one real pricing bug that the reference-driven verification tool surfaced:

| # | Issue | Commit |
| --- | --- | --- |
| #3 | Deep-link / load-earlier hash-nav fix | b6797c8 |
| #2 | Per-turn spend breakdown in info panel | 6385a6b |
| #4 | In-memory LRU session cache | fe540a8 |
| #5 + #2 | Time-machine scrubber rail with cost heat | 7208560 |
| #5 + #6 | Markdown headers + outline truncation fix | 6b9fad0 |
| new | Opus 4.5/4.6 overcharge fix + pricing verify tool | dbdba60 |
| new | Persistent disk-backed session cache | 290a6f1 |
| new | Narrow auto-expand session nav sidebar | 0357a50 |
| new | Exec-style turn-by-turn export format | 729f86a |
| new | Codex per-message cost attribution + GPT-5 pricing | a637914 |
| review | Round 1 fixes (pricing delimiter, Codex multi-assistant, path canonicalization) | c884786 |
| style | gofmt stack-added files | ae9e127 |

The load-bearing catches

Opus 4.5 / 4.6 overcharge (dbdba60). The embedded pricing table had these at tier $15/$75 (the Opus 4/4.1 rate). Claude Code's own src/utils/modelCost.ts has them at tier $5/$25 — roughly 3× lower. Every per-turn cost figure ccx ever displayed for Opus 4.5/4.6 sessions was inflated by ~3×. Fixed. cmd/ccx-verify-pricing (new) parses the Claude Code source with regex and diffs against our pricing table, make verify-pricing wires it into the Makefile, and unit tests feed it a fake source with deliberate drift to prove the detector works.

GPT-5 pricing match pollution (c884786). Round 1 review found that LookupPricing used plain strings.Contains with no delimiter awareness. That made openai/gpt-5-codex-mini resolve to the full GPT-5 tier instead of the mini tier — a ~40× overcharge for Codex mini sessions. Replaced with containsDelimited + hasVariantToken helpers that require delimiter boundaries. Adversarial tests from the review are pinned.
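
The delimiter rule can be sketched as follows. This is a minimal illustration of the idea behind containsDelimited, not the shipped code; the exact delimiter set here is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// isDelim reports whether a byte can separate tokens in a model ID.
// The set is illustrative; the shipped helper may differ.
func isDelim(b byte) bool {
	return b == '-' || b == '/' || b == '.' || b == ':' || b == '_'
}

// containsDelimited reports whether needle occurs in haystack bounded
// by delimiters (or string edges). Combined with checking the most
// specific tier names first, this keeps "gpt-5-codex-mini" from being
// swallowed by the bare "gpt-5" tier, and rejects junk like "gpt-55".
func containsDelimited(haystack, needle string) bool {
	for start := 0; ; {
		i := strings.Index(haystack[start:], needle)
		if i < 0 {
			return false
		}
		i += start
		end := i + len(needle)
		leftOK := i == 0 || isDelim(haystack[i-1])
		rightOK := end == len(haystack) || isDelim(haystack[end])
		if leftOK && rightOK {
			return true
		}
		start = i + 1
	}
}

func main() {
	fmt.Println(containsDelimited("openai/gpt-5-codex-mini", "gpt-5")) // true
	fmt.Println(containsDelimited("openai/gpt-55", "gpt-5"))           // false
}
```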

Codex multi-assistant attribution (c884786). Round 1 also caught that per-message attribution dropped tokens whenever a turn produced multiple agent_message events between token_count checkpoints (reasoning → tool → reply is standard GPT-5 shape). Earlier messages in the burst showed $0.00. Fixed with a usageWatermark index + distributeCodexDelta that evenly splits delta across every untagged assistant.
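
A minimal sketch of the even-split idea behind distributeCodexDelta (simplified: the shipped helper operates on tagged usage structs, not raw ints):

```go
package main

import "fmt"

// distributeDelta evenly splits a token delta across the n untagged
// assistant messages in a burst, giving the integer remainder to the
// last message so the per-turn sum stays exact.
func distributeDelta(delta int64, n int) []int64 {
	if n <= 0 {
		return nil
	}
	share := delta / int64(n)
	out := make([]int64, n)
	for i := range out {
		out[i] = share
	}
	out[n-1] += delta % int64(n) // preserve the integer remainder
	return out
}

func main() {
	fmt.Println(distributeDelta(10, 3)) // [3 3 4]
}
```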

Rail shipping details

The timeline rail went through three redesigns during review:

  1. v1: left-side dots (wrong — should be right-edge, user feedback)
  2. v2: right-edge scrubber with fisheye zoom (still felt static)
  3. v3 (what ships): index-based positioning so idle gaps don't distort the layout, ruler-style horizontal notches, cost-weighted tick size (heat from per-turn cost), 5-tick fisheye zoom on hover, 1.5% hysteresis so the tooltip doesn't flicker at tick boundaries, rAF-throttled mousemove, and a 120ms grace period on leave. The floating tooltip carries a kind badge, elapsed offset, turn ordinal, preview, per-turn cost, cumulative running total, and token count. Sub-agent (Task) and Skill tool calls get their own minor notches, and a 14px inward gutter from the viewport edge keeps the rail clear of the browser scrollbar.

Per-turn spend breakdown (#2)

The info panel now shows a Per-turn spend section sorted by cost descending. Each row links to #msg-<anchor> and composes with the load-earlier fix, so deep links into collapsed history auto-reload. A session-level Cost row appears in the Tokens section. Single source of truth: session.Stats.CostUSD == sum(message.Usage.CostUSD) by construction.

Performance (#4 + #9)

Two-tier cache:

  • Layer 1 (in-memory LRU, cap 16): cold parse of a 1500-message session ~19ms → cache hit ~330ns. ~60,000× speedup on repeat views. Transparent, invalidated on (mtime, size) change.
  • Layer 2 (persistent disk cache): gob-encoded to $XDG_DATA_HOME/ccx/session-cache/<sha256>.gob. Survives ccx web restart. Corrupt files dropped + swept. gob.Register(map[string]any{}) + gob.Register([]any{}) registered in init() so ContentBlock.ToolInput round-trips cleanly.

Codex support (#8)

Unified data layer for features across Claude Code and Codex:

  • MessageUsage.ReasoningTokens field for GPT-5 reasoning output
  • Codex backend now populates per-message usage via token_count event deltas with distributeCodexDelta
  • GPT-5 / GPT-5.4 pricing tiers (marked UNVERIFIED — no automated drift check, update manually if OpenAI rebases)
  • Downstream: per-turn spend section, timeline rail heat, exec export cost footers all work for Codex sessions now (they were all zeros before)

Tests

Full suite green. Coverage by package:

| Package | Coverage |
| --- | --- |
| internal/parser | high (expanded with pricing + turns + cumulative tests) |
| internal/provider | 85.3% |
| internal/render | 80.8% |
| internal/web | expanded (19 timeline tests, 8 markdown/nav tests) |
| cmd/ccx-verify-pricing | 5 tests including adversarial drift detection |

New tests:

  • 9 disk-cache tests (round-trip with real session, corrupt file, mtime/size invalidation, multi-process restart)
  • 16 parser/pricing tests (delimited match, GPT-5 family, adversarial overcharge inputs, cost math)
  • 7 Codex attribution tests (multi-assistant delta split, integer-remainder preservation, tag protection)
  • 14 exec export tests (extraction, turn rendering, cost footer)
  • 8 timeline rail tests (ticks, gridlines, heat normalization, cumulative)
  • 5 verify-pricing tool tests (regex extraction, drift detection with fake source)
  • 8 markdown + outline-nav tests

Reviews

Round 1 (linus-code-reviewer) caught two HIGH severity bugs — the pricing match pollution and the Codex multi-assistant attribution loss. Both fixed in c884786 with adversarial tests pinned.

Round 2 (independent second pass) verified the round 1 fixes hold up under 20+ adversarial inputs, noted 3 LOW items that are intentionally deferred (UTF-8 byte-slicing in tickSnippet, disk_cache double-close defer, truncatePath pre-existing), and returned a "ship it" verdict. The remaining gofmt drift was cleared in ae9e127.

Known limitations (documented, not ship-blockers)

  • Opus 4.6 fast mode bills at tier $30/$150 but ccx returns the default $5/$25 rate regardless of usage.speed. Requires plumbing the speed field through the parser.
  • GPT-5 pricing drift — no equivalent of Claude Code's modelCost.ts in the Codex reference checkout, so make verify-pricing only cross-checks the Claude side. Update internal/parser/pricing.go manually if OpenAI rebases.
  • Codex per-message attribution uses even-split by count, not proportional to individual message output size, because Codex doesn't expose per-message counts. Per-turn aggregation is exact; per-message is approximate.
  • Timeline rail fisheye feel (zoom factors, hysteresis constant, fade timings) was tuned without browser testing — values in the commit are reasonable defaults, verify locally and tune if needed.
  • Empty heading ## lines are dropped by the new server-side markdown header parser. Valid per CommonMark but visually useless.

Test plan

make fmt        # already clean on stack-added files
make test       # full suite green
make build      # clean
make verify-pricing   # pricing table matches Claude Code source
./bin/ccx web   # open a real 500+ msg session, verify:
  - rail on right edge, notches visible, hover expands
  - fisheye zoom follows cursor, tooltip snaps
  - per-turn spend section in info panel shows real costs
  - load-earlier hash nav jumps correctly (deep-link URL test)
  - narrow sidebar collapses to 56px, expands on hover
./bin/ccx export -f exec <session-id>   # exec report renders

🤖 Generated with Claude Code

lroolle and others added 13 commits April 14, 2026 01:44
When a session exceeds progressiveLoadThreshold (500 msgs), ccx
renders only the last 3 sections and shows a "Load earlier" button.
Deep-link URLs like /session/<id>#msg-<uuid> targeting messages
inside the hidden range silently failed: the hash handler did
nothing because the target was not in the DOM.

Fix:

- Auto-reload with ?all=1 when the hash target is missing AND a
  load-earlier button exists. Preserves the hash, uses
  location.replace so the broken page doesn't pollute history.

- Hoist the hash handler into a named jumpToHashTarget function and
  re-fire it on hashchange so in-app nav clicks after initial load
  also jump correctly.

- Walk up from msgEl, opening every ancestor <details> and unfolding
  every .thread.folded, instead of only opening a single child
  details. Fixes a pre-existing bug where deep-linked messages
  inside collapsed threads stayed hidden.

- In-session search: on 0 matches with hidden history active, the
  info line shows a "search full history" link that reloads into
  full content. Pending query is persisted via sessionStorage so
  the search picks up where it left off.

Tests verify the HTML markers (load-earlier element, jumpToHashTarget
function, ccx-pending-search key) are present under the right
conditions (>500 msgs without ?all=1) and absent under ?all=1.

Browser-level behavior (reload round-trip, sessionStorage hand-off)
not exercised in this container environment; unit tests cover the
HTML output. Verify locally with a real 500+ msg session.

Closes #3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Answers the "where did my quota go" question inside the session
viewer. Surpasses ccusage by going per-turn instead of per-session,
tree-aware instead of blind JSONL summation, visual instead of
table-only, and offline-by-default with pinned pricing.

Data layer:

- Add MessageUsage struct on Message, populated during ParseSession
  from message.usage (input, output, cache_read, cache_create,
  cost). Previously only session-level aggregates were kept.
- Add SessionStats.CostUSD computed post-parse as the sum of
  per-message costs — single source of truth so session total
  always matches the per-turn sum the UI renders.
- Quick parser (session list view) also computes CostUSD so the
  session card totals stay consistent with the session view.

Pricing (internal/parser/pricing.go):

- Embedded Claude pricing table: Opus 4.5/4.6, Sonnet 4.5/4.6,
  Haiku 4.5, 3.5 Sonnet/Haiku.
- LookupPricing: exact match only, plus date-suffix stripping
  (claude-sonnet-4-5-20250929 -> claude-sonnet-4-5). No fuzzy
  matching — that's how ccusage shipped 5x overcharges
  (ryoppippi/ccusage#934).
- Unknown models return nil; callers treat nil as "cost
  unavailable" rather than zero. We'd rather not show a number
  than show a wrong one.
- ComputeCost: straightforward per-category multiplication. Cache
  reads at the discounted rate, cache writes at the surcharged
  rate.
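
The date-suffix stripping can be sketched like this; the regexp and helper name are illustrative, not the shipped code:

```go
package main

import (
	"fmt"
	"regexp"
)

// dateSuffix matches a trailing -YYYYMMDD snapshot date on a model
// ID, e.g. the -20250929 in claude-sonnet-4-5-20250929.
var dateSuffix = regexp.MustCompile(`-20\d{6}$`)

// stripDateSuffix reduces a dated model ID to its canonical name so
// the pricing table needs one row per model, not one per snapshot.
func stripDateSuffix(model string) string {
	return dateSuffix.ReplaceAllString(model, "")
}

func main() {
	fmt.Println(stripDateSuffix("claude-sonnet-4-5-20250929")) // claude-sonnet-4-5
	fmt.Println(stripDateSuffix("claude-haiku-4-5"))           // claude-haiku-4-5
}
```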

Turn aggregation (internal/parser/turns.go):

- TurnStats: per-user-turn aggregation over a flat wire-ordered
  slice. Anchored by KindUserPrompt or KindCommand; tool results,
  meta, and compact summaries ride along with the enclosing turn.
- Sidechain (isSidechain: true) messages ARE counted because the
  user is billed for them. ccx does not attempt sidechain cache-read
  dedup — that would produce numbers that don't match the bill.
- FlattenSessionMessages helper for walking the session tree into
  wire order.

Web UI:

- Info panel: new "Per-turn spend" section, sorted by cost desc so
  expensive turns surface immediately. Each row is a link to
  #msg-<anchor>, composing with the load-earlier hash-nav fix from
  #3 for cross-range deep linking.
- Tokens section: Cost row shown when at least one message has a
  pricing match.
- Info panel now scrollable (max-height + overflow-y auto) so the
  breakdown fits for long sessions.
- When model is unknown, the section still renders token-only with
  a "No pricing match" note — surfacing which turns used the most
  tokens is useful even when cost can't be computed.
- formatCost helper with adaptive precision: <$0.01 floor,
  $0.xxxx under $1, $x.xx to $100, $xxx to $10k, $x.xk beyond.
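
A sketch of the adaptive-precision idea; the exact format verbs are assumptions based on the tier description above:

```go
package main

import "fmt"

// formatCost renders a USD amount with precision that scales down as
// the value grows; tier boundaries follow the commit description.
func formatCost(usd float64) string {
	switch {
	case usd < 0.01:
		return "<$0.01" // floor: don't show noise
	case usd < 1:
		return fmt.Sprintf("$%.4f", usd)
	case usd < 100:
		return fmt.Sprintf("$%.2f", usd)
	case usd < 10000:
		return fmt.Sprintf("$%.0f", usd)
	default:
		return fmt.Sprintf("$%.1fk", usd/1000)
	}
}

func main() {
	for _, v := range []float64{0.0042, 0.0423, 2.45, 150, 12500} {
		fmt.Println(formatCost(v))
	}
}
```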

Tests (16 new):

- pricing_test.go: exact match, date-suffix strip, unknown returns
  nil, empty returns nil, no fuzzy match guard (ccusage #934),
  ComputeCost math, nil safety, KnownModels determinism.
- turns_test.go: per-message usage persists, session cost equals
  sum of per-message costs, two-turn aggregation, sum of turn costs
  equals session cost, sidechain inclusion.
- server_test.go: spend section rendering with cost, token-only
  rendering for unpriced models, info-cost row absent when no cost.

Also fix a latent panic: renderSessionPage(session.ID[:8]) crashed
for session IDs shorter than 8 characters. Now guarded.

Scope for this commit: the session-level web UI answer to "where
did my quota go". Deferred to follow-up within the same milestone:
ccx usage CLI command (cross-session daily/weekly/monthly axes),
per-message inline cost badge in conversation view, Codex pricing
table.

Partial on #2; closes the session-level web surface. CLI + badge
tracked in follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Users repeatedly clicking between sessions paid the full cold-parse
cost on every view. For a 1500-message session that's ~19 ms of
JSON parsing + tree reconstruction burned on every navigation.
For 5000-message sessions, ~54 ms.

Fix: wrap Multi.ParseSession in a bounded LRU cache keyed by
absolute path, invalidated by (mtime, size). Cap 16 entries.

Benchmark before/after on a 1500-message fixture:

  Cold (no cache)      3.6 ms / 4 MB / 30k allocs
  Warm (cache hit)     330 ns / 256 B / 2 allocs

Cache hit is ~11_000x faster on this synthetic fixture. On the
full-shape parser benchmark with realistic content blocks, the
cold path is ~19 ms — which means the realized speedup on repeat
views is closer to 60_000x.

Implementation:

- internal/provider/cache.go: container/list + map LRU, mutex-
  guarded. getOrLoad(path, loader) for transparent wrap-and-cache.
- internal/provider/multi.go: ParseSession consults the cache
  first, delegates to parseThroughBackends on miss, stores the
  result. ClearSessionCache() exposed for diagnostics/tests.

Invalidation is by (mtime, size). If Claude Code rewrites the
file (live session growing), the next access re-parses. Live
tail still works.
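
The (mtime, size) freshness check can be sketched as follows; the type and method names are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// fingerprint is the freshness metadata a cache entry is validated
// against; if either field changes, the entry is considered stale.
type fingerprint struct {
	mtime time.Time
	size  int64
}

// fresh reports whether two fingerprints describe the same file state.
func (f fingerprint) fresh(other fingerprint) bool {
	return f.size == other.size && f.mtime.Equal(other.mtime)
}

func fingerprintOf(path string) (fingerprint, error) {
	st, err := os.Stat(path)
	if err != nil {
		return fingerprint{}, err
	}
	return fingerprint{mtime: st.ModTime(), size: st.Size()}, nil
}

func main() {
	f, _ := os.CreateTemp("", "session-*.jsonl")
	defer os.Remove(f.Name())
	f.WriteString(`{"type":"user"}`)
	f.Close()

	before, _ := fingerprintOf(f.Name())
	// A live session growing on disk changes size (and usually mtime).
	os.WriteFile(f.Name(), []byte(`{"type":"user","cwd":"/tmp"}`), 0o644)
	after, _ := fingerprintOf(f.Name())
	fmt.Println(before.fresh(after)) // false: cached entry is stale
}
```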

Scope discipline for this pass:

- No SQLite persistence. Cache is process-local; restart flushes
  it. Adding a disk cache introduces serialization cost that
  partially negates the in-memory win; revisit if users need
  cache survival across restarts.
- No streaming render. Server still renders HTML in one pass.
  Progressive-load already avoids rendering hidden sections to
  HTML — the earlier "CSS hides everything" claim in the issue
  body was incorrect; renderMessagesProgressive only emits
  visible sections.
- No parser-level allocation reduction. ~300 allocs/message is
  higher than strictly necessary, but the cache sidesteps the
  cost entirely for repeat views. A cold-path optimization pass
  can follow later once real (not synthetic) sessions show it
  matters.

Tests:

- TestSessionCache_HitSkipsLoader: second Get doesn't call loader
- TestSessionCache_InvalidatesOnMtimeChange: mtime bump forces
  reload
- TestSessionCache_LRUEvictsOldest: cap enforcement + most-
  recently-touched survives
- TestSessionCache_ConcurrentLoadsSafe: 50 concurrent getOrLoad
  calls don't race or duplicate entries
- TestSessionCache_ClearDropsEverything
- BenchmarkSessionCache_ColdVsWarm: the numbers cited above

Profile methodology + results committed to docs/profiles/long-session.md.

Closes #4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ye zoom (#5, #2)

Right-edge semantic scrubber that merges #2's per-turn spend data
into #5's timeline rail. The rail answers both "where am I" and
"where did the money go", and it behaves like a proper hover-
activated scrubber instead of a styled scrollbar.

Positioning: by anchor INDEX, not wall-clock time.
A session with an idle gap (user walked away for 3 hours and
resumed) no longer compresses all the work into a tiny sliver of
the rail. Every turn gets an even slot. The original Timestamp
and Offset are still kept for the tooltip so "+1m42s" still
tells you when the turn happened.

Layout:
  - Fixed 14px inward from the viewport right edge (clears the
    browser scrollbar and any overlay progress bar). Pulled in
    from top (56px) and bottom (24px) to avoid top nav and dock
    toolbar.
  - 24px-wide invisible hit target with a 10px-visible spine
    inside (forgiving entry — you don't need pixel-perfect aim).
  - On hover: widens to 52px, spine gains contrast, gridline
    labels fade in, cursor becomes crosshair.

Index-based ruler gridlines:
  - chooseGridStep picks a round step (5/10/20/25/50/100/250/500/
    1000) targeting 4-10 gridlines. Sessions under 20 turns
    skip gridlines entirely (rail is short enough to scan).
  - Labels are turn ordinals ("10", "20", "30"...). Major every
    5th step is visually bolder.
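
The step-picking rule can be sketched like this; the selection loop is an assumption consistent with the stated 4-10 target and the under-20 cutoff:

```go
package main

import "fmt"

// gridSteps are the round step candidates, smallest first.
var gridSteps = []int{5, 10, 20, 25, 50, 100, 250, 500, 1000}

// chooseGridStep picks the smallest round step that yields at most
// ten gridlines; sessions under 20 turns get none (step 0).
func chooseGridStep(turns int) int {
	if turns < 20 {
		return 0
	}
	for _, step := range gridSteps {
		if turns/step <= 10 {
			return step
		}
	}
	return gridSteps[len(gridSteps)-1]
}

func main() {
	for _, n := range []int{10, 30, 100, 2000} {
		fmt.Println(n, "->", chooseGridStep(n))
	}
}
```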

Cost-weighted ticks (from #2 integration):
  - enrichTicksWithCost maps TurnStats back onto ticks via
    AnchorID and normalizes cost to 0-1 heat (highest turn =
    1.0). Unknown models fall back to token-count heat so Codex
    sessions still paint a useful heatmap.
  - Tick base size drives off inline --heat CSS var; fisheye
    zoom layers on top multiplicatively.

Cumulative cost:
  - enrichTicksWithCost walks ticks in session order accumulating
    per-turn cost into CumulativeCostUSD.
  - Tooltip shows "so far $X.XX" next to the per-turn cost so
    you can see the running total at any point.

Interaction model (per the user's spec):
  - applyFisheyeZoom on 5 nearest ticks: zoom-0 at 2.6×, zoom-1
    at 1.9×, zoom-2 at 1.35×. Springy bezier transition. Cheap:
    binary search + ±2 iteration.
  - selectWithHysteresis: tooltip only switches to a new tick
    when the cursor is meaningfully closer (> 1.5% of rail
    height) than the locked one. Prevents flicker between
    adjacent ticks at midpoints.
  - rAF-throttled mousemove: pendingClientY is coalesced into
    the next frame via processRailFrame, so long sessions don't
    stall the render pipeline.
  - Grace period (120ms) on mouseleave: brief slips off the rail
    don't tear the interaction down; re-entering cancels the
    fade and continues.
  - Tooltip viewport clamp: computed against
    window.innerHeight minus tooltipHeight so it never runs off
    the top or bottom of the page.
  - Playhead (dashed guide) snaps to the nearest tick for
    orientation.
  - Click handler uses the currently-highlighted nearest tick
    and navigates via #msg-<uuid> (composes with #3 fix for
    deep links into collapsed history).
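
The hysteresis rule ships as browser JS; the same selection logic sketched in Go for illustration (names and the pixel threshold here are assumptions):

```go
package main

import (
	"fmt"
	"math"
)

// selectWithHysteresis keeps the currently locked tick unless the
// nearest tick to the cursor is meaningfully closer — by more than
// threshold (1.5% of rail height, in px) — which stops the tooltip
// from flickering at tick midpoints.
func selectWithHysteresis(cursorY float64, ticks []float64, locked int, threshold float64) int {
	nearest, best := -1, math.Inf(1)
	for i, y := range ticks {
		if d := math.Abs(cursorY - y); d < best {
			nearest, best = i, d
		}
	}
	if locked < 0 || locked >= len(ticks) {
		return nearest
	}
	lockedDist := math.Abs(cursorY - ticks[locked])
	if lockedDist-best > threshold {
		return nearest // clear winner: switch
	}
	return locked // too close to call: keep the lock
}

func main() {
	ticks := []float64{100, 200} // tick centers in px
	fmt.Println(selectWithHysteresis(150, ticks, 0, 7.5)) // 0: midpoint stays locked
	fmt.Println(selectWithHysteresis(170, ticks, 0, 7.5)) // 1: clearly closer, switch
}
```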

Tooltip content:
  [kind] +1m42s · turn 42
  first line of message
  $0.0423  ∑ $2.30  ⧫ 12.3k

Tests (19 total web/ tests, 6 new for this pass):
  - TestComputeTimelineTicks_SingleAnchorCentered (50%)
  - TestComputeTimelineTicks_PositionByIndexNotTime (rejects
    time-based positioning with contrived uneven timestamps)
  - TestComputeTimelineTicks_LongIdleGapDoesntDistort (10 turns
    with 3h idle in the middle still distribute evenly)
  - TestChooseGridStep (sessions under 20 turns get no
    gridlines; 30-2000 turns get 4-10)
  - TestComputeTimelineGridlines_IndexBased (monotonic,
    non-empty labels)
  - TestEnrichTicksWithCost_CumulativeMatchesSessionTotal (last
    tick's cumulative = sum of all per-turn costs)

Browser verification (hysteresis feel, grace period, tooltip
clamp, rAF smoothness) not exercised in this container
environment. Go tests cover HTML structure, tick positioning
math, cumulative accuracy, and JS helper wiring. Verify locally
with a real session.

Closes #5. Integrates #2 data for one coherent visual surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two small but visible bugs in the session view:

#5: Server-side markdown had no ATX heading support.
Claude often starts a response with '## Summary' or '### Findings'
and ccx would render that as a literal '<p>## Summary</p>' in
static views. The JS-side renderer already handled headings — the
Go side was just incomplete.

Fix: add parseATXHeading + a heading branch in renderMarkdown.
Recognises # through ###### followed by a space, strips trailing
'#' characters, and emits div.md-h<level> matching the existing
CSS. Also handles simple '- item' / '* item' lists the same way.

Rejected inputs are explicit: >6 hashes, no space after hashes,
empty text after the marker, leading whitespace. These fall
through to the paragraph renderer instead of becoming empty
headings.

#6: Outline nav dropped the assistant summary on tool-heavy turns.
renderConversationNav hard-capped children at 10 and emitted
'+N more' for the tail. Long turns that dispatched lots of tools
would drop their final assistant text response — the exact thing
the user wants to see to know what the turn concluded with.

Fix: when truncating, show the first 9 children, the '+N more'
marker, and always render the final child. The hidden count is
adjusted to exclude the preserved tail (15 tools + summary now
shows '+6 more' instead of '+7 more').

Tests:
  markdown_test.go: 7 tests — parseATXHeading happy paths + all
    non-heading cases (including '## ' empty-text rejection and
    the '####### too many' case), full renderMarkdown pipeline
    through heading levels 1-6, headings with bold/code
    passthrough, basic list rendering.
  nav_test.go: 3 tests — summary preserved on truncated turns,
    short turns show everything with no '+N more' marker, hidden
    count correctly excludes the preserved summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reading Claude Code's own pricing source (src/utils/modelCost.ts)
surfaced a real bug in ccx's embedded pricing table: Opus 4.5 and
Opus 4.6 were hardcoded at tier $15/$75 (the old Opus 4.1 rate),
but Claude Code's current source has them at tier $5/$25 — a
roughly 3x lower rate. Every per-turn cost figure ccx displayed
for Opus 4.5/4.6 sessions was inflated by ~3x. Fixed.

Pricing table rewrite (internal/parser/pricing.go):

- Named tier constants (costTier_3_15, costTier_15_75, costTier_5_25,
  costTier_30_150, costHaiku_35, costHaiku_45) mirror Claude Code's
  naming so drift comparison is direct.
- pricingTable map covers every model Claude Code's MODEL_COSTS
  references: claude-3-5-haiku, claude-haiku-4-5, claude-3-5-sonnet,
  claude-3-7-sonnet, claude-sonnet-4, claude-sonnet-4-5,
  claude-sonnet-4-6, claude-opus-4, claude-opus-4-1, claude-opus-4-5,
  claude-opus-4-6. Previously ccx was missing claude-opus-4,
  claude-opus-4-1, claude-sonnet-4, claude-3-7-sonnet entirely —
  those sessions returned nil cost.
- LookupPricing now mirrors firstPartyNameToCanonical in
  Claude Code's model.ts: case-insensitive substring match with
  more-specific versions checked first (4-6 before 4-5 before 4-1
  before 4). Handles bare canonical names, dated IDs, and Bedrock
  ARNs uniformly. Non-Claude models still return nil.
- PricingTableCopy exposes a defensive copy for the verify tool.
- Opus 4.6 fast mode (tier $30/$150) is documented as a known
  limitation — proper support requires plumbing usage.speed
  through the parser.

Verification tool (cmd/ccx-verify-pricing):

- Reads reference/claude-code-2188/src/utils/modelCost.ts with regex
  (no bun/node dependency). Extracts the COST_TIER_* constants and
  the MODEL_COSTS mapping.
- Reads reference/claude-code-2188/src/utils/model/configs.ts for
  each config's firstParty string.
- Canonicalizes those to short names using the same ordered
  substring rules as ccx's LookupPricing and Claude Code's
  firstPartyNameToCanonical.
- Compares ccx's PricingTableCopy against the parsed tiers
  field-by-field (input, output, cache_read, cache_write with
  1e-9 tolerance). Reports per-model drift or "OK". Exit 1 on any
  mismatch, 0 on clean.
- Also reports ccx-only rows (models ccx prices that CC source
  does not reference) as stale drift.
- make verify-pricing wraps it. CLAUDE_SOURCE=<path> env var
  overrides the default reference checkout location.

Tests:

  parser/pricing_test.go updates:
  - TestLookupPricing_NonClaudeFamilyReturnsNil: gpt-4, gpt-5.4-mini,
    grok-2, llama, mistral — all nil.
  - TestLookupPricing_DatedBedrockAndBareNamesAllMap: the three
common shapes of claude-sonnet-4-5 all resolve to tier $3/$15.
  - TestLookupPricing_OpusMoreSpecificBeforeLess: opus-4-6 returns
$5/$25, opus-4-1 returns $15/$75, opus-4 (bare) returns
    $15/$75. Match order correctness.

  cmd/ccx-verify-pricing/main_test.go (new, 5 tests):
  - TestParseModelCostTiers_ReadsValues: regex extracts
    inputTokens/outputTokens/cache fields from a fake source.
  - TestParseModelCostsMap_ExtractsBindings: line-oriented parser
    maps config names to tier names.
  - TestParseFirstPartyNames_ExtractsFirstParty: reads configs.ts
    firstParty strings.
  - TestCompareAgainstCCX_DetectsDrift: fake source with
    deliberately wrong opus-4-6 pricing — tool must detect and
    report the drift, but NOT false-positive on matching models.
  - TestCanonicalize_OrderMatters: bedrock ARNs, dated IDs, bare
    names, and mixed-case inputs all canonicalize correctly.

Verification on the real source (11 models):

  OK  CLAUDE_3_5_HAIKU_CONFIG     [COST_HAIKU_35]
  OK  CLAUDE_3_5_V2_SONNET_CONFIG [COST_TIER_3_15]
  OK  CLAUDE_3_7_SONNET_CONFIG    [COST_TIER_3_15]
  OK  CLAUDE_HAIKU_4_5_CONFIG     [COST_HAIKU_45]
  OK  CLAUDE_OPUS_4_1_CONFIG      [COST_TIER_15_75]
  OK  CLAUDE_OPUS_4_5_CONFIG      [COST_TIER_5_25]
  OK  CLAUDE_OPUS_4_6_CONFIG      [COST_TIER_5_25]
  OK  CLAUDE_OPUS_4_CONFIG        [COST_TIER_15_75]
  OK  CLAUDE_SONNET_4_5_CONFIG    [COST_TIER_3_15]
  OK  CLAUDE_SONNET_4_6_CONFIG    [COST_TIER_3_15]
  OK  CLAUDE_SONNET_4_CONFIG      [COST_TIER_3_15]
  ccx pricing table matches Claude Code source — no drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier cache architecture: in-memory LRU (from #4) satisfies hot
access, persistent disk cache survives process restart. Now a fresh
`ccx web` doesn't re-pay the parse cost for sessions you've
already viewed.

Storage: one gob file per source session, named by
sha256(abs_path), with in-band (mtime, size) metadata for
freshness checks.

Layout:
  internal/provider/disk_cache.go       — gob file store + freshness
  internal/provider/disk_cache_test.go  — 9 tests incl. round-trip
  internal/provider/multi.go            — two-tier lookup chain

Flow (ParseSession):
  1. in-memory LRU (existing, O(1))
  2. miss → disk cache (os.Stat + gob.Decode, ~5ms)
  3. miss → backend.ParseSession (live parse, ~19ms for 1500 msg)
  4. successful parse writes to BOTH cache tiers
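
The lookup chain above can be sketched with simplified types; the real code lives in internal/provider and also handles per-tier errors:

```go
package main

import "fmt"

// Session and the tier interface are simplified stand-ins.
type Session struct{ ID string }

type tier interface {
	Get(path string) (*Session, bool)
	Put(path string, s *Session)
}

// parseSession tries each cache tier in order, falls back to the
// live parser on a full miss, and back-fills every tier on success.
func parseSession(path string, tiers []tier, parse func(string) (*Session, error)) (*Session, error) {
	for _, t := range tiers {
		if s, ok := t.Get(path); ok {
			return s, nil
		}
	}
	s, err := parse(path)
	if err != nil {
		return nil, err
	}
	for _, t := range tiers {
		t.Put(path, s)
	}
	return s, nil
}

// mapTier is a trivial in-memory tier for the demo.
type mapTier struct{ m map[string]*Session }

func (t *mapTier) Get(p string) (*Session, bool) { s, ok := t.m[p]; return s, ok }
func (t *mapTier) Put(p string, s *Session)      { t.m[p] = s }

func main() {
	calls := 0
	parse := func(p string) (*Session, error) { calls++; return &Session{ID: "s-abc"}, nil }
	tiers := []tier{&mapTier{m: map[string]*Session{}}, &mapTier{m: map[string]*Session{}}}
	parseSession("/tmp/x.jsonl", tiers, parse)
	parseSession("/tmp/x.jsonl", tiers, parse) // served from the first tier
	fmt.Println(calls)                         // 1
}
```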

Failure modes are always graceful. Corrupt gob file, tmp-write
crash, unwriteable data dir, schema drift — every one of those
falls through to live parse. The cache is an optimization layer,
never a correctness layer.

Cache location: $XDG_DATA_HOME/ccx/session-cache/ (or
~/.local/share/ccx/session-cache/ by default). Separate directory
from the stars database so ccx cache clear can nuke it independently.

Serialization notes:
- gob.Register(map[string]any{}) + gob.Register([]any{}) set up in
  init() so ContentBlock.ToolInput survives the round trip. Without
  registration gob drops the Task/Skill tool inputs.
- parser.Message has an unexported `raw rawMessage` field used only
  during parsing. gob skips unexported fields so the serialized tree
  is smaller than the in-memory one — fine because nothing
  downstream reads `.raw` after parse.
- Writes go tmp-then-rename so a crash mid-encode can't leave a
  half-written cache file.

Tests (9 new in provider, 85.3% package coverage):
- TestDiskCache_RoundTripsRealSession: full parse -> put -> get,
  verifies stats, per-message Usage, and 3-block Content survive.
  This is the load-bearing test: without it, gob drops could ship
  undetected.
- TestDiskCache_MissOnMtimeChange: stale entry removed on read.
- TestDiskCache_MissOnSizeChange: size drift also invalidates.
- TestDiskCache_MissOnNonExistentPath: never-cached path returns miss.
- TestDiskCache_PutNilSessionIsNoop: defensive.
- TestDiskCache_ClearRemovesEverything.
- TestDiskCache_CorruptFileGracefullyDropped: fake garbage in place
  of a gob file produces a miss + file cleanup.
- TestMulti_DiskCacheSurvivesMultiRebuild: parse with Multi A, tear
  down, parse same path with fresh Multi B pointing at same disk
  dir, verify backend's parseCount stays 0 (disk served it).
- TestMulti_DiskCacheInvalidatesOnFileMtimeChange: edit the session
  file after caching, verify next parse actually hits the backend.

NewMultiWithDiskCache(dir, backends...) exposed for tests. Production
code uses NewMulti which defaults to config.DataDir/session-cache/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The session nav panel on the left of the conversation view was 170px
wide always, eating into reading space on 1280-1440px viewports. Now
it collapses to a compact 56px column and expands to 260px only on
hover — reclaiming ~114px of horizontal space for the actual content.

Interaction:

- Default state: 56px-wide column showing only session-id stubs and
  provider badges. The "Sessions" header is replaced by a circular
  'S' dot in the session accent colour — recognisable at a glance.
- Hover: width animates to 260px over 180ms ease. Full session
  summaries, IDs, and the "Sessions" label fade in. A soft left-side
  shadow lifts the expanded column off the content.
- The expansion is OVERLAY, not push — the flex layout slot stays at
  56px so the outline sidebar (middle column) and main content
  (right column) don't shift sideways when the cursor drifts over
  the sidebar. Accomplished by setting the outer <aside> to 56px
  with overflow:visible and letting the inner .panel-header +
  .panel-list children be the elements that actually widen.
- Mobile (<600px) still hides the panel entirely as before.

No JS — pure CSS :hover + width transition. Keyboard tab order and
focus states unchanged. No schema changes.

Tests:

  TestHandleSession_SessionNavIsNarrowByDefault: smoke test that the
  rendered page contains the .panel-nav.session-nav { width: 56px; }
  rule plus the :hover { width: 260px; } expansion. Actual layout
  behaviour is CSS-only so Go tests can only verify the rules are
  emitted — manual browser verification is the load-bearing check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New export format "exec" (alias "exec-md") that renders a session
as an executive report instead of a rendered conversation. Each
turn shows: the user's request, the files the agent edited, and
the agent's final summary text — with per-turn cost footer when
pricing is available.

Useful for the "what did I get done this session" share that
--brief can't quite produce. Brief gives you the full conversation
minus tool noise; exec gives you a structured report where each
turn is one readable block.

Sample output:

    # Session s-abc123

    > initial summary line from the session header

    **Duration:** 1h23m · **Messages:** 42 · **Tool calls:** 18
    · **Total cost:** $2.4500 · **Model:** claude-sonnet-4-5

    ---

    ## Turn 1

    > add a health-check endpoint

    **Files touched:** `api/health.go`, `api/health_test.go`

    Added /health handler plus a table-driven test — passes
    locally. Wired into the router with the standard JSON
    middleware.

    *cost: $0.0423 · 8.2k tokens*

    ## Turn 2
    ...

Implementation (internal/render/exec.go):

- ExecMarkdown walks session messages in wire order and segments
  them into turnBlocks anchored by KindUserPrompt / KindCommand.
  Compact boundaries emit a "— context compaction —" divider
  inline between turns.
- extractEditedFiles scans assistant tool_use blocks for Edit /
  Write / MultiEdit / NotebookEdit / Create and pulls file_path /
  notebook_path / path. Dedupes via map, sorts alphabetically for
  deterministic output.
- lastAssistantText walks the turn's assistants in reverse and
  returns the last one with a non-empty text block — the "final
  summary" a reader actually wants.
- Per-turn cost/token footer reads from parser.ComputeTurnStats
  (reused from #2). When cost is 0 and tokens are 0 the footer
  is omitted entirely.
- Empty turns (no user text, no edits, no summary) are dropped so
  the output isn't padded by meta markers.
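
The edited-file extraction described above can be sketched roughly as follows; ToolUse and the field names here are illustrative stand-ins, not the actual internal/render types:

```go
package main

import (
	"fmt"
	"sort"
)

// ToolUse is a stand-in for an assistant tool_use block.
type ToolUse struct {
	Name  string
	Input map[string]string
}

// editTools mirrors the set of file-editing tools the commit lists.
var editTools = map[string]bool{
	"Edit": true, "Write": true, "MultiEdit": true,
	"NotebookEdit": true, "Create": true,
}

// extractEditedFiles dedupes via a map and sorts alphabetically so the
// "Files touched:" line is deterministic across runs.
func extractEditedFiles(blocks []ToolUse) []string {
	seen := map[string]bool{}
	for _, b := range blocks {
		if !editTools[b.Name] {
			continue
		}
		// The path may live under any of these keys depending on the tool.
		for _, key := range []string{"file_path", "notebook_path", "path"} {
			if p := b.Input[key]; p != "" {
				seen[p] = true
				break
			}
		}
	}
	files := make([]string, 0, len(seen))
	for p := range seen {
		files = append(files, p)
	}
	sort.Strings(files)
	return files
}

func main() {
	fmt.Println(extractEditedFiles([]ToolUse{
		{Name: "Edit", Input: map[string]string{"file_path": "api/health.go"}},
		{Name: "Write", Input: map[string]string{"file_path": "api/health.go"}},
		{Name: "Bash", Input: map[string]string{"command": "go test"}},
		{Name: "NotebookEdit", Input: map[string]string{"notebook_path": "nb.ipynb"}},
	}))
}
```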

Export plumbing:
- render/export.go: Export() dispatches "exec" / "exec-md" to
  ExecMarkdown directly, bypassing the Brief + format pipeline
  (exec is structurally a report, not a rendered transcript).
- cmd/export.go: formatToExt maps "exec" / "exec-md" to
  "-exec.md" so the default output filename is session-<id>-exec.md.

Tests (14 new, render pkg 80.8% coverage):
- extractEditedFiles: dedupes+sorts, ignores non-edit tools,
  handles notebook_path/path field variants, empty on no
  assistants.
- lastAssistantText: final summary with tool_use blocks
  sprinkled, empty on tool-only turns.
- ExecMarkdown: empty session header-only, single-turn full
  render, multi-turn ordering, empty-turn dropping, per-turn
  cost emission.
- Export dispatch: "exec" and "exec-md" aliases produce
  identical output.
- formatDurationShort / formatTokenCountShort helpers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude Code and Codex share the abstractions the timeline rail and
per-turn cost UI need (turns, assistant responses, tool calls, token
usage) — but Codex's wire format is different enough that ccx was
returning nil Usage on every Codex assistant message. That made the
per-turn spend section empty and the timeline rail heat flat for all
Codex sessions. Fix: extend MessageUsage for GPT-5's reasoning field,
add GPT-5/5.4 pricing, and attribute Codex token_count deltas to
the nearest assistant in the full parse.

Shape unification:

- parser.MessageUsage gains a ReasoningTokens field. Claude sessions
  leave it 0; Codex populates it from reasoning_output_tokens.
- parser.MessageUsage.Total() now sums all 5 categories.
- ComputeCost includes ReasoningTokens at the output rate (OpenAI
  bills GPT-5 thinking tokens as regular output).
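
A minimal sketch of the unified shape (field and function names are illustrative, not the exact parser package API):

```go
package main

import "fmt"

// MessageUsage sketches the five token categories after the change.
type MessageUsage struct {
	InputTokens         int
	OutputTokens        int
	CacheReadTokens     int
	CacheCreationTokens int
	ReasoningTokens     int // 0 for Claude; reasoning_output_tokens for Codex
}

// Total sums all five categories.
func (u MessageUsage) Total() int {
	return u.InputTokens + u.OutputTokens +
		u.CacheReadTokens + u.CacheCreationTokens + u.ReasoningTokens
}

// computeCost bills reasoning tokens at the output rate, since OpenAI
// charges GPT-5 thinking tokens as regular output.
func computeCost(u MessageUsage, inPerMtok, outPerMtok float64) float64 {
	in := float64(u.InputTokens) / 1e6 * inPerMtok
	out := float64(u.OutputTokens+u.ReasoningTokens) / 1e6 * outPerMtok
	return in + out
}

func main() {
	u := MessageUsage{InputTokens: 1000, OutputTokens: 500, ReasoningTokens: 1_000_000}
	fmt.Printf("total=%d cost=%.2f\n", u.Total(), computeCost(u, 10, 80))
}
```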

GPT-5 / GPT-5.4 pricing (parser/pricing.go):

- Three new tiers: costGPT54_10_80 ($10/$80), costGPT54Mini_025_2
  ($0.25/$2), costGPT54Nano_005_04 ($0.05/$0.4). Cache reads at ~50%
  of input per OpenAI's cached-prompt pricing. Cache writes stay 0
  — OpenAI doesn't split cache creation the way Anthropic does.
- pricingTable adds gpt-5, gpt-5-mini, gpt-5-nano, gpt-5.4,
  gpt-5.4-mini, gpt-5.4-nano rows.
- LookupPricing extends the ordered substring match to include the
  GPT-5 family. nano wins before mini wins before plain gpt-5 so
  "gpt-5-mini-2025-07-07" doesn't fall through to the more-expensive
  full-sized tier.

UNVERIFIED: rates are pinned from publicly-known GPT-5 list pricing
circa late 2025. There's no equivalent of Claude Code's modelCost.ts
in the Codex reference checkout, so `make verify-pricing` cannot
cross-check these rates yet. If OpenAI rebases, update the tier
constants manually. Tracked as a known limitation.

Codex backend per-message attribution (codex/backend.go):

- tokenUsageTotals struct extended with ReasoningOutputTokens.
- Full-parse token_count handler now tracks previousTotals, computes
  delta on each event, and attaches the delta as MessageUsage to
  the most-recent assistant message without Usage via
  latestAssistantWithoutUsage(). Rough per-turn attribution — Codex
  reports running totals, not per-message counts, so delta-since-
  last-event is the best available without a protocol change.
- clampNonNegative() guards against total regressions (rare, but
  wire format doesn't forbid it).
- Previous total wrapped in a stack-local var at full-parse scope
  so each ParseSession call starts fresh.
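
A sketch of the attribution step, with Msg as a simplified stand-in for the parsed message type and made-up running totals:

```go
package main

import "fmt"

// Msg is a simplified stand-in for a parsed Codex message.
type Msg struct {
	Role     string
	HasUsage bool
}

// latestAssistantWithoutUsage walks backward and returns the index of
// the most recent assistant with no usage attached yet, or -1.
func latestAssistantWithoutUsage(msgs []Msg) int {
	for i := len(msgs) - 1; i >= 0; i-- {
		if msgs[i].Role == "assistant" && !msgs[i].HasUsage {
			return i
		}
	}
	return -1
}

// clampNonNegative guards against running-total regressions.
func clampNonNegative(n int) int {
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	msgs := []Msg{
		{Role: "user"},
		{Role: "assistant", HasUsage: true}, // tagged by an earlier event
		{Role: "assistant"},
	}
	// Each token_count event carries a running total; the delta since the
	// previous event goes to the latest untagged assistant.
	previous, total := 1200, 1900
	delta := clampNonNegative(total - previous)
	if i := latestAssistantWithoutUsage(msgs); i >= 0 {
		fmt.Println("attach", delta, "tokens to message", i)
		msgs[i].HasUsage = true
	}
}
```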

Tests:

  parser/pricing_test.go: (4 new on top of the prior pricing suite)
  - TestLookupPricing_UnknownFamilyReturnsNil (renamed from
    NonClaudeFamily): gpt-4, grok, llama, mistral still return nil
    — we explicitly only match Claude and GPT-5 families.
  - TestLookupPricing_GPT5Family: every model/canonical pair
    (gpt-5, gpt-5.4, gpt-5-mini, gpt-5.4-nano, uppercase variant)
    resolves to the correct tier. Match-order correctness: mini/nano
    win before plain gpt-5.
  - TestComputeCost_ReasoningTokensBilledAsOutput: 1M reasoning at
    $80/Mtok adds $80 to cost.

  codex/backend_usage_test.go: (5 new)
  - TestLatestAssistantWithoutUsage_SkipsTaggedMessages: walks
    backward, returns first untagged assistant.
  - TestLatestAssistantWithoutUsage_ReturnsNilWhenAllTagged: no
    double-attribution.
  - TestLatestAssistantWithoutUsage_ReturnsNilWhenNoAssistants.
  - TestLatestAssistantWithoutUsage_PicksLatestNotFirst: multiple
    untagged assistants → the latest one is picked.
  - TestClampNonNegative: negative deltas zero out.

Downstream effects (automatic):

- #2 per-turn spend breakdown now populates for Codex sessions.
- #5 timeline rail heat colors work for Codex ticks.
- #7 exec export shows Codex cost footers.
- `make verify-pricing` still clean on the Claude side; Codex
  verification is an acknowledged follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…th canonicalization

Addresses review round 1 findings from linus-code-reviewer on the
10-commit v0.next stack. Two HIGH severity bugs were ship-blockers.

HIGH #1: Pricing match pollution for GPT-5 variants.

Previous LookupPricing used plain strings.Contains with no delimiter
awareness. That matched "gpt-5" inside "gpt-51" (which would be a
hypothetical GPT-5.1), and — critically — matched "gpt-5" inside
"openai/gpt-5-codex-mini" WITHOUT also matching the mini suffix,
resolving it to the full-tier rate (~40x overcharge for real Codex
sessions that use that model string).

Fix: delimited substring matching via two helpers:

- containsDelimited(name, family): requires the next character after
  family to be one of "-./_ :" (or end of string). Rejects
  "gpt-5" inside "gpt-51" or "claude-opus-4" inside "claude-opus-42".

- hasVariantToken(name, variant): checks for variant tokens (mini,
  nano) surrounded by delimiters on both sides. Matches "-mini-",
  "-mini/", "-mini" at end of string. Does NOT match the "mini"
  inside "gemini" or "minimal".

The GPT-5 match now follows a family-then-variant pattern:
  - Check if the name contains gpt-5.4 or gpt-5 (delimited)
  - Within the family, check for nano/mini tokens (delimited)
  - Fall through to the full tier
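
A sketch of the two helpers under the delimiter set described above (not the exact internal implementation):

```go
package main

import (
	"fmt"
	"strings"
)

func isDelim(c byte) bool {
	return strings.IndexByte("-./_ :", c) >= 0
}

// containsDelimited reports whether family occurs in name followed by a
// delimiter or end of string, so "gpt-5" never matches inside "gpt-51".
func containsDelimited(name, family string) bool {
	i := 0
	for {
		j := strings.Index(name[i:], family)
		if j < 0 {
			return false
		}
		end := i + j + len(family)
		if end == len(name) || isDelim(name[end]) {
			return true
		}
		i += j + 1
	}
}

// hasVariantToken requires delimiters on BOTH sides of the variant, so
// the "mini" inside "gemini" or "minimal" never matches.
func hasVariantToken(name, variant string) bool {
	i := 0
	for {
		j := strings.Index(name[i:], variant)
		if j < 0 {
			return false
		}
		start := i + j
		end := start + len(variant)
		leftOK := start == 0 || isDelim(name[start-1])
		rightOK := end == len(name) || isDelim(name[end])
		if leftOK && rightOK {
			return true
		}
		i = start + 1
	}
}

func main() {
	fmt.Println(containsDelimited("gpt-51", "gpt-5"))                  // false
	fmt.Println(containsDelimited("openai/gpt-5-codex-mini", "gpt-5")) // true
	fmt.Println(hasVariantToken("gemini-pro", "mini"))                 // false
	fmt.Println(hasVariantToken("openai/gpt-5-codex-mini", "mini"))    // true
}
```

Note that '.' being in the delimiter set means "gpt-5" would also match delimited inside "gpt-5.4-...", which is why the family check tries gpt-5.4 before gpt-5.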

Added adversarial test cases from the review:
  openai/gpt-5-codex-mini      → gpt-5.4-mini tier (was full tier)
  gpt-5.4-turbo-mini-v2        → gpt-5.4-mini tier
  anthropic/gpt-5.4-nano-preview → gpt-5.4-nano tier
  gpt-51                       → nil (was full tier)
  gpt-5o                       → nil
  claude-opus-42               → nil (was claude-opus-4 tier)
  claude-sonnet-40             → nil

Also applied the delimited pattern to the Claude family, since the
same class of bug could appear for any future hypothetical
"claude-opus-4-7" or "claude-sonnet-4-7" variant that wasn't in
the table.

HIGH #2: Codex per-message attribution lost multi-assistant turns.

Previous latestAssistantWithoutUsage returned only the single
most-recent untagged assistant, dumping the entire delta on it.
For turns that produced multiple agent_message events between
token_count checkpoints (reasoning + partial reply + tool +
partial reply is standard GPT-5 shape), earlier messages stayed
at $0.00 forever and the last one got everything.

Fix: distributeCodexDelta splits the delta evenly across every
untagged assistant in messages[usageWatermark:]. usageWatermark is
a high-water-mark index that advances on every token_count event,
so previous bursts aren't re-attributed.

- Even split by count (not by per-message output size, because
  Codex doesn't expose that — best available without a protocol
  change). The per-turn UI aggregation reassembles correctly.
- Integer-division remainder is absorbed by the first message so
  sum(per-message) == delta exactly. Verified by
  TestDistributeCodexDelta_RemainderGoesToFirst.
- Guards: silently drops delta when no untagged targets (no
  panic), skips already-tagged messages on subsequent events
  (no double-attribution).
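
The even-split scheme can be sketched like this; Msg and the tagging flag are simplified stand-ins for the real message type:

```go
package main

import "fmt"

// Msg stands in for a parsed message; tagged means usage already attached.
type Msg struct {
	Role   string
	Tokens int
	tagged bool
}

// distributeCodexDelta splits delta evenly across untagged assistants at
// or after the watermark; the integer-division remainder goes to the
// first target so the per-message sum equals delta exactly.
func distributeCodexDelta(msgs []Msg, watermark, delta int) {
	var targets []*Msg
	for i := watermark; i < len(msgs); i++ {
		if msgs[i].Role == "assistant" && !msgs[i].tagged {
			targets = append(targets, &msgs[i])
		}
	}
	if len(targets) == 0 {
		return // silently drop: no panic, no double-attribution
	}
	share := delta / len(targets)
	rem := delta % len(targets)
	for i, m := range targets {
		m.Tokens = share
		if i == 0 {
			m.Tokens += rem
		}
		m.tagged = true
	}
}

func main() {
	msgs := []Msg{
		{Role: "assistant"}, // reasoning
		{Role: "tool"},
		{Role: "assistant"}, // partial reply
		{Role: "assistant"}, // final reply
	}
	distributeCodexDelta(msgs, 0, 1000)
	for _, m := range msgs {
		fmt.Println(m.Role, m.Tokens)
	}
}
```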

Also fixed in the same token_count handler: running-total
regression accounting. Previously clampNonNegative swallowed
negative deltas but still overwrote previousTotals with the
regressed value, so a regression-then-recovery would double-count
(100→200→150→250 would emit deltas 100, 0, 100 — wrong, should
be 100, 0, 50). Now previousTotals uses max() as a floor via a
maxInt helper.
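
Replaying the sequence from the text with the clamp-plus-floor fix (helper names as in the commit; the loop itself is a sketch):

```go
package main

import "fmt"

func clampNonNegative(n int) int {
	if n < 0 {
		return 0
	}
	return n
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	totals := []int{100, 200, 150, 250} // regression at 150, recovery at 250
	previous := 0
	for _, t := range totals {
		delta := clampNonNegative(t - previous)
		// Floor the running total: after the 150 regression, previous
		// stays at 200, so the 250 checkpoint emits 50 rather than 100.
		previous = maxInt(previous, t)
		fmt.Println("delta:", delta)
	}
}
```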

MEDIUM: Tests for Codex attribution were shape-assertions only.

Added 5 tests that exercise the actual behavior:
- TestDistributeCodexDelta_EvenSplitAcrossUntagged (reasoning +
  agent_message both get half the delta, both get non-zero cost)
- TestDistributeCodexDelta_RemainderGoesToFirst (delta preserved
  under integer division)
- TestDistributeCodexDelta_SkipsTaggedMessages (prior attribution
  not overwritten)
- TestDistributeCodexDelta_SilentlyDropsWithNoUntaggedTargets
- TestMaxInt

MEDIUM: In-memory LRU key was not canonicalized.

Multi.ParseSession now passes absPath (computed once at the top)
to cache.getOrLoad instead of the raw filePath. Two callers with
different representations of the same file ("./s.jsonl" vs
"/wd/s.jsonl") now hit the same cache slot. Correctness was
preserved before the fix — just a cache-hit-rate regression —
but worth fixing.

MEDIUM: gofmt for backend.go.

The tokenUsageTotals struct field alignment broke after adding
ReasoningOutputTokens (my commit, not pre-existing). Fixed;
gofmt -l now reports nothing.

All existing tests still pass. 9 new adversarial/coverage tests
added. Full suite green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 2 review noted that round 1's own pricing_test.go (and four
other stack-added files) had comment-alignment drift from gofmt's
preferred layout. Ran gofmt -w on only the files this stack
introduced; pre-existing gofmt-dirty files (internal/cmd/search.go,
internal/cmd/view.go, internal/provider/codex/helpers_test.go,
internal/provider/codex/parse_branches_test.go) are left alone so
the diff stays scoped to v0.next work.

Files reformatted:
  cmd/ccx-verify-pricing/main.go
  internal/parser/pricing_test.go
  internal/parser/session_test.go
  internal/provider/disk_cache_test.go
  internal/web/markdown_test.go

All tests still pass. No behavior changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ignore Codex-only rows in Claude pricing verification
- keep early token deltas, reasoning tokens, and sidechain work in the parent turn

lroolle commented Apr 15, 2026

Follow-up for the round 2 review is on v0.next in b9b7c96.

Fixed:

  • verifier now ignores Codex-only pricing rows during Claude drift checks
  • Codex keeps token_count deltas when the event arrives before assistant output, and session cost is recomputed from attributed messages
  • sidechain prompts stay inside the parent turn, reasoning tokens count toward turn totals, and exec export uses the same turn boundary

Verification:

  • make verify-pricing
  • go test ./...
