fix(codex): skip replayed parent token history in thread_spawn subagent sessions #1218

Merged
ryoppippi merged 5 commits into main from pullfrog/950-fix-codex-subagent-replay-overcounting
Jun 8, 2026

Conversation

pullfrog Bot commented Jun 6, 2026

Contributor

Summary

Fixes #950 — Massive token overcounting for Codex subagent sessions (91x inflation).

When OpenAI Codex spawns subagent threads via thread_spawn, the subagent rollout JSONL files contain a full replay of the parent thread's token usage history, re-timestamped to the subagent creation time. This caused usage to be reported up to 91x higher than actual.

Root Cause (3-layer inflation)

  1. Parent history replay: Subagent files replay the parent's full token usage history with timestamps set to subagent creation time
  2. Duplicate entries: ~47% of replayed entries are exact duplicates within the same file
  3. Multiple subagents: 12 subagents each independently replay the same parent history

Fix

Detects thread_spawn subagent sessions by scanning for the thread_spawn byte pattern in the file prefix. For subagent sessions, a pre-scan identifies the replay timestamp pattern (≥2 token_count entries sharing the same second). In the main parse loop, all token_count entries matching the confirmed replay timestamp are skipped.
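
The detection and skip steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from parser.rs: the helper names (`looks_like_subagent`, `find_replay_second`, `real_token_count_lines`), the 4096-byte prefix, and the assumed line shape (a `"timestamp":"<ISO-8601>"` field plus a `"token_count"` marker) are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Timestamp truncated to whole seconds, e.g. "2026-06-06T13:57:01".
/// Assumes a `"timestamp":"<ISO-8601>"` field with no space after the colon.
fn timestamp_second(line: &str) -> Option<&str> {
    const KEY: &str = "\"timestamp\":\"";
    let start = line.find(KEY)? + KEY.len();
    line.get(start..start + 19) // "YYYY-MM-DDTHH:MM:SS" is 19 bytes
}

/// Assumed marker for token usage entries.
fn is_token_count(line: &str) -> bool {
    line.contains("\"token_count\"")
}

/// Subagent detection: search the first `prefix_len` bytes for `thread_spawn`.
fn looks_like_subagent(contents: &str, prefix_len: usize) -> bool {
    const MARKER: &[u8] = b"thread_spawn";
    let prefix = &contents.as_bytes()[..contents.len().min(prefix_len)];
    prefix.windows(MARKER.len()).any(|w| w == MARKER)
}

/// Pre-scan: the first second shared by at least two token_count lines.
fn find_replay_second(contents: &str) -> Option<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in contents.lines().filter(|l| is_token_count(l)) {
        if let Some(sec) = timestamp_second(line) {
            let n = counts.entry(sec).or_insert(0);
            *n += 1;
            if *n >= 2 {
                return Some(sec.to_string());
            }
        }
    }
    None
}

/// Main pass: keep token_count lines that do not fall on the replay second.
fn real_token_count_lines(contents: &str) -> Vec<&str> {
    let replay = if looks_like_subagent(contents, 4096) {
        find_replay_second(contents)
    } else {
        None
    };
    contents
        .lines()
        .filter(|l| is_token_count(l))
        .filter(|l| match (&replay, timestamp_second(l)) {
            (Some(r), Some(sec)) => r.as_str() != sec,
            _ => true,
        })
        .collect()
}
```

For a file with two replayed entries and two real ones (the shape of the first test fixture listed below), `real_token_count_lines` would return only the two real entries. Files without the `thread_spawn` marker are passed through unchanged, so ordinary sessions are unaffected in this sketch.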

Changes

  • parser.rs: Added is_codex_subagent_session(), detect_subagent_replay_second(), and replay skip logic in visit_codex_session_file
  • loader.rs: Added two test fixtures:
    • Single subagent with replayed parent history (2 replay + 2 real entries)
    • Three subagent files each replaying the same parent history

All 215 existing tests pass unchanged.

Pullfrog  | View workflow run | via Pullfrog | Using DeepSeek Pro (free via Pullfrog for OSS) | 𝕏


Summary by cubic

Fixes #950. Corrects token overcounting in Codex subagent sessions by skipping replayed parent history in thread_spawn rollouts and preserving cumulative baselines, preventing up to 91x inflation.

  • Bug Fixes

    • Detect subagent files by scanning for thread_spawn.
    • Pre-scan for the replay second (≥2 token_count lines in the same second).
    • Skip token_count entries with that timestamp; stop after the first non-replay line.
    • Preserve replayed total_token_usage as the baseline so the first real subagent entry is counted as a delta.
    • Added tests for single/multiple subagent files and the baseline-only case.
  • Dependencies

    • Pin actions/checkout@v6 and pullfrog/pullfrog@v0 to commit SHAs in the workflow.

Written for commit a668d2c. Summary will update on new commits.

cloudflare-workers-and-pages Bot commented Jun 6, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status: ✅ Deployment successful!
Name: ccusage-guide
Latest Commit: a668d2c
Preview URL: Commit Preview URL, Branch Preview URL
Updated (UTC): Jun 08 2026, 09:21 PM

@ryoppippi

Owner

@pullfrog rebase main

pullfrog Bot commented Jun 6, 2026

Contributor Author

Rebased pullfrog/950-fix-codex-subagent-replay-overcounting onto main. Resolved a merge conflict in .github/workflows/pullfrog.yml (pinned commit hashes vs unpinned with zizmor ignore comments — kept the pinned hashes, which supersede the unpinned approach). Force-pushed the rebased branch.

Task list (5/5 completed)
  • Fetch latest main branch
  • Rebase current branch onto main
  • Resolve merge conflict in .github/workflows/pullfrog.yml
  • Push rebased branch
  • Report results

pullfrog Bot added 4 commits June 6, 2026 13:57
…nt sessions

When Codex spawns subagent threads via thread_spawn, the subagent rollout JSONL files contain a full replay of the parent thread's token usage history, re-timestamped to the subagent creation time. This caused usage to be reported up to 91x higher than actual.

Detect subagent sessions by scanning for the thread_spawn byte pattern, then pre-scan the file for the replay timestamp pattern (>=2 token_count entries with the same second). In the main parse loop, skip all token_count entries matching the confirmed replay timestamp.

Fixes #950
pullfrog Bot force-pushed the pullfrog/950-fix-codex-subagent-replay-overcounting branch from 29c8a73 to a614b36 on June 6, 2026 14:08
pkg-pr-new Bot commented Jun 6, 2026

ccusage

npx https://pkg.pr.new/ccusage@1218

@ccusage/ccusage-darwin-arm64

npx https://pkg.pr.new/@ccusage/ccusage-darwin-arm64@1218

@ccusage/ccusage-darwin-x64

npx https://pkg.pr.new/@ccusage/ccusage-darwin-x64@1218

@ccusage/ccusage-linux-arm64

npx https://pkg.pr.new/@ccusage/ccusage-linux-arm64@1218

@ccusage/ccusage-linux-x64

npx https://pkg.pr.new/@ccusage/ccusage-linux-x64@1218

@ccusage/ccusage-win32-arm64

npx https://pkg.pr.new/@ccusage/ccusage-win32-arm64@1218

@ccusage/ccusage-win32-x64

npx https://pkg.pr.new/@ccusage/ccusage-win32-x64@1218

commit: a668d2c

github-actions Bot commented Jun 6, 2026

Contributor

ccusage performance comparison

PR SHA: 29c8a7300373
Base SHA: bee4a26e6cf5

This compares the PR package against the configured base package on the same CI runner.

Package runner startup

Execution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one bunx -p <url> ccusage --version run with an empty Bun install cache. Warm reuses that cache and reports the median of repeated runs.

Package SHA Execution setup Bunx temp cache Bunx warm median Warm samples
Base pkg.pr.new bee4a26e6cf5 1.331s 572.1ms 29.3ms 3
PR pkg.pr.new 29c8a73 490.3ms 479.3ms 29.5ms 3

Cached bunx execution performance

Runs the same large fixture through bunx -p <pkg.pr.new URL> ccusage after the Bun install cache has already been populated by the startup measurement. This separates cached package-runner execution from first-fetch package materialization.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base package: bee4a26e6cf5; PR package: 29c8a73. Both run through bunx -p <pkg.pr.new URL> ccusage using the warmed Bun install cache from package runner startup, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
bunx -p <pkg> ccusage claude --offline --json 1.01 GiB 548.6ms 549.9ms 1.00x 298.83 MiB 305.83 MiB 1.02x 1.84 GiB/s 1.83 GiB/s
bunx -p <pkg> ccusage codex --offline --json 1.01 GiB 367.3ms 365.6ms 1.00x 79.58 MiB 71.45 MiB 0.90x 2.74 GiB/s 2.75 GiB/s

Package runtime diagnostics

Compares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
All rows run --offline --json, measured by hyperfine with 0 warmups and 1 run. This isolates wrapper overhead from the installed native optional dependency and the workspace release binary built on the runner.

Command Runtime Input Median Throughput Samples
claude --offline --json Package wrapper 1.01 GiB 536.8ms 1.88 GiB/s 1
claude --offline --json Installed native binary 1.01 GiB 513.6ms 1.96 GiB/s 1
codex --offline --json Package wrapper 1.01 GiB 358.7ms 2.81 GiB/s 1
codex --offline --json Installed native binary 1.01 GiB 343.4ms 2.93 GiB/s 1

Committed fixture performance

Committed small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage.

Fixtures: Claude apps/ccusage/test/fixtures/claude (0.00 MiB, 2 files), Codex apps/ccusage/test/fixtures/codex (0.00 MiB, 1 file)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs the published ccusage package from pkg.pr.new, installed before measurement. Both run --offline --json, measured by hyperfine with 2 warmups and 7 runs.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude daily --offline --json 0.00 MiB 29.1ms 28.0ms 1.04x 43.61 MiB 43.73 MiB 1.00x 0.05 MiB/s 0.06 MiB/s
claude session --offline --json 0.00 MiB 28.2ms 28.2ms 1.00x 43.48 MiB 43.61 MiB 1.00x 0.05 MiB/s 0.05 MiB/s
codex daily --offline --json 0.00 MiB 27.4ms 27.2ms 1.01x 43.48 MiB 43.48 MiB 1.00x 0.03 MiB/s 0.03 MiB/s
codex session --offline --json 0.00 MiB 28.2ms 28.0ms 1.01x 43.61 MiB 43.48 MiB 1.00x 0.03 MiB/s 0.03 MiB/s

Large real-world-shaped fixture performance

Generated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs the published ccusage package from pkg.pr.new, installed before measurement. Both run --offline --json, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude --offline --json 1.01 GiB 551.7ms 541.2ms 1.02x 326.20 MiB 283.58 MiB 0.87x 1.82 GiB/s 1.86 GiB/s
codex --offline --json 1.01 GiB 358.7ms 350.4ms 1.02x 78.83 MiB 76.70 MiB 0.97x 2.81 GiB/s 2.87 GiB/s

Artifact size

Artifact Base PR Delta Ratio
packed ccusage-*.tgz 14.35 KiB 14.35 KiB +0.00 KiB 1.00x
installed native package binary 3289.62 KiB 3289.74 KiB +0.13 KiB 1.00x

Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees.
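
The derived columns in these tables can be reproduced approximately from the raw columns. The sketch below assumes, based only on the printed numbers, that "PR vs base" is the base median divided by the PR median, that "PR/base RSS" is PR peak RSS divided by base peak RSS, and that throughput is input size divided by median time. The function names are hypothetical and are not taken from the benchmark script.

```rust
/// "PR vs base" (assumed): base median / PR median, so values above 1 mean the PR was faster.
fn speedup(base_ms: f64, pr_ms: f64) -> f64 {
    base_ms / pr_ms
}

/// "PR/base RSS" (assumed): PR peak RSS / base peak RSS, so values below 1 mean less memory.
fn rss_ratio(base_mib: f64, pr_mib: f64) -> f64 {
    pr_mib / base_mib
}

/// Throughput in GiB/s (assumed): input size divided by the median time in seconds.
fn throughput_gib_s(input_gib: f64, median_ms: f64) -> f64 {
    input_gib / (median_ms / 1000.0)
}

/// Formats a ratio the way the tables print it, e.g. "1.02x".
fn as_ratio(value: f64) -> String {
    format!("{:.2}x", value)
}
```

Using the first cached bunx row (base 548.6ms and 298.83 MiB, PR 549.9ms and 305.83 MiB, 1.01 GiB input), `as_ratio(speedup(548.6, 549.9))` and `as_ratio(rss_ratio(298.83, 305.83))` match the printed 1.00x and 1.02x. Because the tables show rounded inputs, recomputed values can differ from the printed ones in the last digit.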

github-actions Bot commented Jun 6, 2026

Contributor

ccusage performance comparison

PR SHA: 29c8a7300373
Base SHA: bee4a26e6cf5

This compares the Rust PR release binary against the configured base package on the same CI runner.

Package runner startup

Execution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one bunx -p <url> ccusage --version run with an empty Bun install cache. Warm reuses that cache and reports the median of repeated runs.

Package SHA Execution setup Bunx temp cache Bunx warm median Warm samples
Base pkg.pr.new bee4a26e6cf5 602.3ms 613.6ms 32.4ms 3
PR pkg.pr.new 29c8a73 547.0ms 620.5ms 32.7ms 3

Cached bunx execution performance

Runs the same large fixture through bunx -p <pkg.pr.new URL> ccusage after the Bun install cache has already been populated by the startup measurement. This separates cached package-runner execution from first-fetch package materialization.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base package: bee4a26e6cf5; PR package: 29c8a73. Both run through bunx -p <pkg.pr.new URL> ccusage using the warmed Bun install cache from package runner startup, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
bunx -p <pkg> ccusage claude --offline --json 1.01 GiB 558.3ms 555.7ms 1.00x 300.33 MiB 321.70 MiB 1.07x 1.80 GiB/s 1.81 GiB/s
bunx -p <pkg> ccusage codex --offline --json 1.01 GiB 370.7ms 366.0ms 1.01x 81.58 MiB 71.58 MiB 0.88x 2.72 GiB/s 2.75 GiB/s

Package runtime diagnostics

Compares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
All rows run --offline --json, measured by hyperfine with 0 warmups and 1 run. This isolates wrapper overhead from the installed native optional dependency and the workspace release binary built on the runner.

Command Runtime Input Median Throughput Samples
claude --offline --json Package wrapper 1.01 GiB 548.7ms 1.83 GiB/s 1
claude --offline --json Installed native binary 1.01 GiB 541.7ms 1.86 GiB/s 1
codex --offline --json Package wrapper 1.01 GiB 364.5ms 2.76 GiB/s 1
codex --offline --json Installed native binary 1.01 GiB 337.7ms 2.98 GiB/s 1

Committed fixture performance

Committed small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage.

Fixtures: Claude apps/ccusage/test/fixtures/claude (0.00 MiB, 2 files), Codex apps/ccusage/test/fixtures/codex (0.00 MiB, 1 file)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs rust/target/release/ccusage directly. Both run --offline --json, measured by hyperfine with 2 warmups and 7 runs.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude daily --offline --json 0.00 MiB 29.0ms 4.0ms 7.29x 43.73 MiB 2.70 MiB 0.06x 0.05 MiB/s 0.39 MiB/s
claude session --offline --json 0.00 MiB 29.3ms 3.8ms 7.67x 43.61 MiB 2.70 MiB 0.06x 0.05 MiB/s 0.40 MiB/s
codex daily --offline --json 0.00 MiB 29.3ms 3.8ms 7.78x 43.73 MiB 2.70 MiB 0.06x 0.03 MiB/s 0.23 MiB/s
codex session --offline --json 0.00 MiB 28.4ms 3.6ms 7.86x 43.61 MiB 2.70 MiB 0.06x 0.03 MiB/s 0.24 MiB/s

Large real-world-shaped fixture performance

Generated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs rust/target/release/ccusage directly. Both run --offline --json, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude --offline --json 1.01 GiB 564.4ms 536.2ms 1.05x 300.45 MiB 305.95 MiB 1.02x 1.78 GiB/s 1.88 GiB/s
codex --offline --json 1.01 GiB 375.4ms 336.2ms 1.12x 74.08 MiB 69.83 MiB 0.94x 2.68 GiB/s 2.99 GiB/s

Artifact size

Artifact Base PR Delta Ratio
packed ccusage-*.tgz 14.35 KiB 14.35 KiB +0.00 KiB 1.00x
installed native package binary 3289.62 KiB 3289.74 KiB +0.13 KiB 1.00x

Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees.

github-actions Bot commented Jun 6, 2026

Contributor

ccusage performance comparison

PR SHA: a614b36e3b05
Base SHA: bee4a26e6cf5

This compares the Rust PR release binary against the configured base package on the same CI runner.

Package runner startup

Execution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one bunx -p <url> ccusage --version run with an empty Bun install cache. Warm reuses that cache and reports the median of repeated runs.

Package SHA Execution setup Bunx temp cache Bunx warm median Warm samples
Base pkg.pr.new bee4a26e6cf5 627.3ms 667.9ms 30.4ms 3
PR pkg.pr.new a614b36 923.2ms 608.3ms 30.6ms 3

Cached bunx execution performance

Runs the same large fixture through bunx -p <pkg.pr.new URL> ccusage after the Bun install cache has already been populated by the startup measurement. This separates cached package-runner execution from first-fetch package materialization.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base package: bee4a26e6cf5; PR package: a614b36. Both run through bunx -p <pkg.pr.new URL> ccusage using the warmed Bun install cache from package runner startup, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
bunx -p <pkg> ccusage claude --offline --json 1.01 GiB 538.2ms 539.8ms 1.00x 327.70 MiB 316.83 MiB 0.97x 1.87 GiB/s 1.86 GiB/s
bunx -p <pkg> ccusage codex --offline --json 1.01 GiB 365.3ms 360.8ms 1.01x 80.83 MiB 81.58 MiB 1.01x 2.76 GiB/s 2.79 GiB/s

Package runtime diagnostics

Compares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
All rows run --offline --json, measured by hyperfine with 0 warmups and 1 run. This isolates wrapper overhead from the installed native optional dependency and the workspace release binary built on the runner.

Command Runtime Input Median Throughput Samples
claude --offline --json Package wrapper 1.01 GiB 529.8ms 1.90 GiB/s 1
claude --offline --json Installed native binary 1.01 GiB 517.5ms 1.95 GiB/s 1
codex --offline --json Package wrapper 1.01 GiB 359.3ms 2.80 GiB/s 1
codex --offline --json Installed native binary 1.01 GiB 328.0ms 3.07 GiB/s 1

Committed fixture performance

Committed small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage.

Fixtures: Claude apps/ccusage/test/fixtures/claude (0.00 MiB, 2 files), Codex apps/ccusage/test/fixtures/codex (0.00 MiB, 1 file)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs rust/target/release/ccusage directly. Both run --offline --json, measured by hyperfine with 2 warmups and 7 runs.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude daily --offline --json 0.00 MiB 29.4ms 3.9ms 7.48x - 2.70 MiB - 0.05 MiB/s 0.39 MiB/s
claude session --offline --json 0.00 MiB 29.3ms 3.9ms 7.58x 43.61 MiB 2.70 MiB 0.06x 0.05 MiB/s 0.40 MiB/s
codex daily --offline --json 0.00 MiB 28.4ms 3.6ms 7.98x 43.48 MiB 2.70 MiB 0.06x 0.03 MiB/s 0.24 MiB/s
codex session --offline --json 0.00 MiB 28.3ms 3.6ms 7.84x - 2.70 MiB - 0.03 MiB/s 0.24 MiB/s

Large real-world-shaped fixture performance

Generated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs rust/target/release/ccusage directly. Both run --offline --json, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude --offline --json 1.01 GiB 532.8ms 523.5ms 1.02x 316.45 MiB 313.45 MiB 0.99x 1.89 GiB/s 1.92 GiB/s
codex --offline --json 1.01 GiB 361.5ms 328.3ms 1.10x 80.45 MiB 79.33 MiB 0.99x 2.79 GiB/s 3.07 GiB/s

Artifact size

Artifact Base PR Delta Ratio
packed ccusage-*.tgz 14.35 KiB 14.35 KiB +0.00 KiB 1.00x
installed native package binary 3289.62 KiB 3289.74 KiB +0.13 KiB 1.00x

Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees.

github-actions Bot commented Jun 6, 2026

Contributor

ccusage performance comparison

PR SHA: a614b36e3b05
Base SHA: bee4a26e6cf5

This compares the PR package against the configured base package on the same CI runner.

Package runner startup

Execution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one bunx -p <url> ccusage --version run with an empty Bun install cache. Warm reuses that cache and reports the median of repeated runs.

Package SHA Execution setup Bunx temp cache Bunx warm median Warm samples
Base pkg.pr.new bee4a26e6cf5 575.5ms 565.3ms 30.4ms 3
PR pkg.pr.new a614b36 635.9ms 650.4ms 31.6ms 3

Cached bunx execution performance

Runs the same large fixture through bunx -p <pkg.pr.new URL> ccusage after the Bun install cache has already been populated by the startup measurement. This separates cached package-runner execution from first-fetch package materialization.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base package: bee4a26e6cf5; PR package: a614b36. Both run through bunx -p <pkg.pr.new URL> ccusage using the warmed Bun install cache from package runner startup, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
bunx -p <pkg> ccusage claude --offline --json 1.01 GiB 545.4ms 544.0ms 1.00x 319.45 MiB 307.70 MiB 0.96x 1.85 GiB/s 1.85 GiB/s
bunx -p <pkg> ccusage codex --offline --json 1.01 GiB 364.5ms 364.5ms 1.00x 76.58 MiB 79.70 MiB 1.04x 2.76 GiB/s 2.76 GiB/s

Package runtime diagnostics

Compares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
All rows run --offline --json, measured by hyperfine with 0 warmups and 1 run. This isolates wrapper overhead from the installed native optional dependency and the workspace release binary built on the runner.

Command Runtime Input Median Throughput Samples
claude --offline --json Package wrapper 1.01 GiB 536.4ms 1.88 GiB/s 1
claude --offline --json Installed native binary 1.01 GiB 506.2ms 1.99 GiB/s 1
codex --offline --json Package wrapper 1.01 GiB 360.8ms 2.79 GiB/s 1
codex --offline --json Installed native binary 1.01 GiB 339.2ms 2.97 GiB/s 1

Committed fixture performance

Committed small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage.

Fixtures: Claude apps/ccusage/test/fixtures/claude (0.00 MiB, 2 files), Codex apps/ccusage/test/fixtures/codex (0.00 MiB, 1 file)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs the published ccusage package from pkg.pr.new, installed before measurement. Both run --offline --json, measured by hyperfine with 2 warmups and 7 runs.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude daily --offline --json 0.00 MiB 28.4ms 28.0ms 1.01x 43.61 MiB 43.48 MiB 1.00x 0.05 MiB/s 0.06 MiB/s
claude session --offline --json 0.00 MiB 28.3ms 28.7ms 0.99x 43.61 MiB 43.48 MiB 1.00x 0.05 MiB/s 0.05 MiB/s
codex daily --offline --json 0.00 MiB 28.6ms 28.7ms 1.00x 43.48 MiB 43.61 MiB 1.00x 0.03 MiB/s 0.03 MiB/s
codex session --offline --json 0.00 MiB 28.8ms 29.1ms 0.99x - 43.48 MiB - 0.03 MiB/s 0.03 MiB/s

Large real-world-shaped fixture performance

Generated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs the published ccusage package from pkg.pr.new, installed before measurement. Both run --offline --json, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude --offline --json 1.01 GiB 552.7ms 551.0ms 1.00x 313.83 MiB 304.70 MiB 0.97x 1.82 GiB/s 1.83 GiB/s
codex --offline --json 1.01 GiB 363.4ms 356.1ms 1.02x 73.70 MiB 78.58 MiB 1.07x 2.77 GiB/s 2.83 GiB/s

Artifact size

Artifact Base PR Delta Ratio
packed ccusage-*.tgz 14.35 KiB 14.35 KiB +0.00 KiB 1.00x
installed native package binary 3289.62 KiB 3289.74 KiB +0.13 KiB 1.00x

Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees.

When thread_spawn replay entries are skipped, keep their cumulative total_token_usage as the parser baseline. Without that baseline, Codex logs that only provide total_token_usage on the first real subagent entry are counted as the full replayed cumulative total instead of the post-replay delta.

Add a regression fixture that skips two replayed cumulative entries and verifies the following real entry is reported as a 100 input-token delta rather than the 1600-token cumulative total.
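
The baseline behaviour described in this commit message can be illustrated with a small sketch. It is not the parser's actual code: the `Entry` struct, the `counted_input_tokens` function, and the `keep_replay_baseline` switch are hypothetical, and only input tokens are modelled. The intermediate replayed totals (1000 and 1500) are assumed values chosen so that the real entry at 1600 yields the 100-token delta mentioned above.

```rust
/// One token_count entry, reduced to the fields this sketch needs.
struct Entry {
    /// True if the entry's timestamp matched the replay second.
    is_replay: bool,
    /// Cumulative input tokens from the entry's total_token_usage snapshot.
    total_input_tokens: u64,
}

/// Sums per-entry deltas of the cumulative total, skipping replayed entries.
/// With `keep_replay_baseline = false` the sketch reproduces the overcount:
/// the first real entry is measured against a baseline of zero.
fn counted_input_tokens(entries: &[Entry], keep_replay_baseline: bool) -> u64 {
    let mut baseline = 0u64;
    let mut counted = 0u64;
    for e in entries {
        if e.is_replay {
            // Replayed usage is never counted, but it may still move the baseline.
            if keep_replay_baseline {
                baseline = e.total_input_tokens;
            }
            continue;
        }
        counted += e.total_input_tokens.saturating_sub(baseline);
        baseline = e.total_input_tokens;
    }
    counted
}
```

With two replayed entries followed by one real entry at a cumulative 1600, the function returns 100 when the baseline is kept and 1600 when it is dropped, which is the difference the regression fixture is described as guarding against.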

github-actions Bot commented Jun 8, 2026

Contributor

ccusage performance comparison

PR SHA: a668d2c61adc
Base SHA: ae2881ffb48f

This compares the PR package against the configured base package on the same CI runner.

Package runner startup

Execution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one bunx -p <url> ccusage --version run with an empty Bun install cache. Warm reuses that cache and reports the median of repeated runs.

Package SHA Execution setup Bunx temp cache Bunx warm median Warm samples
Base pkg.pr.new ae2881ffb48f 578.3ms 546.3ms 31.3ms 3
PR pkg.pr.new a668d2c 746.2ms 517.2ms 32.8ms 3

Cached bunx execution performance

Runs the same large fixture through bunx -p <pkg.pr.new URL> ccusage after the Bun install cache has already been populated by the startup measurement. This separates cached package-runner execution from first-fetch package materialization.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base package: ae2881ffb48f; PR package: a668d2c. Both run through bunx -p <pkg.pr.new URL> ccusage using the warmed Bun install cache from package runner startup, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
bunx -p <pkg> ccusage claude --offline --json 1.01 GiB 569.6ms 544.5ms 1.05x 326.83 MiB 325.33 MiB 1.00x 1.77 GiB/s 1.85 GiB/s
bunx -p <pkg> ccusage codex --offline --json 1.01 GiB 373.0ms 377.3ms 0.99x 81.08 MiB 83.33 MiB 1.03x 2.70 GiB/s 2.67 GiB/s

Package runtime diagnostics

Compares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
All rows run --offline --json, measured by hyperfine with 0 warmups and 1 run. This isolates wrapper overhead from the installed native optional dependency and the workspace release binary built on the runner.

Command Runtime Input Median Throughput Samples
claude --offline --json Package wrapper 1.01 GiB 556.7ms 1.81 GiB/s 1
claude --offline --json Installed native binary 1.01 GiB 520.7ms 1.93 GiB/s 1
codex --offline --json Package wrapper 1.01 GiB 367.9ms 2.74 GiB/s 1
codex --offline --json Installed native binary 1.01 GiB 344.3ms 2.92 GiB/s 1

Committed fixture performance

Committed small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage.

Fixtures: Claude apps/ccusage/test/fixtures/claude (0.00 MiB, 2 files), Codex apps/ccusage/test/fixtures/codex (0.00 MiB, 1 file)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs the published ccusage package from pkg.pr.new, installed before measurement. Both run --offline --json, measured by hyperfine with 2 warmups and 7 runs.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

Command Input Base median PR median PR vs base Base peak RSS PR peak RSS PR/base RSS Base throughput PR throughput
claude daily --offline --json 0.00 MiB 30.3ms 30.3ms 1.00x 43.61 MiB 43.48 MiB 1.00x 0.05 MiB/s 0.05 MiB/s
claude session --offline --json 0.00 MiB 30.0ms 30.3ms 0.99x 43.48 MiB 43.73 MiB 1.01x 0.05 MiB/s 0.05 MiB/s
codex daily --offline --json 0.00 MiB 30.0ms 30.3ms 0.99x 43.48 MiB 43.61 MiB 1.00x 0.03 MiB/s 0.03 MiB/s
codex session --offline --json 0.00 MiB 30.1ms 29.9ms 1.01x 43.48 MiB 43.48 MiB 1.00x 0.03 MiB/s 0.03 MiB/s

Large real-world-shaped fixture performance

Generated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs the published ccusage package from pkg.pr.new, installed before measurement. Both run --offline --json, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

| Command | Input | Base median | PR median | PR vs base | Base peak RSS | PR peak RSS | PR/base RSS | Base throughput | PR throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude --offline --json | 1.01 GiB | 554.8ms | 556.1ms | 1.00x | 288.20 MiB | 307.45 MiB | 1.07x | 1.81 GiB/s | 1.81 GiB/s |
| codex --offline --json | 1.01 GiB | 363.6ms | 371.8ms | 0.98x | 81.08 MiB | 77.83 MiB | 0.96x | 2.77 GiB/s | 2.71 GiB/s |

Artifact size

| Artifact | Base | PR | Delta | Ratio |
| --- | --- | --- | --- | --- |
| packed ccusage-*.tgz | 14.35 KiB | 14.50 KiB | +0.15 KiB | 0.99x |
| installed native package binary | 3289.62 KiB | 3289.74 KiB | +0.13 KiB | 1.00x |

Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees.


github-actions Bot commented Jun 8, 2026


ccusage performance comparison

PR SHA: a668d2c61adc
Base SHA: ae2881ffb48f

This compares the Rust PR release binary against the configured base package on the same CI runner.

Package runner startup

Execution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one bunx -p <url> ccusage --version run with an empty Bun install cache. Warm reuses that cache and reports the median of repeated runs.

| Package | SHA | Execution setup | Bunx temp cache | Bunx warm median | Warm samples |
| --- | --- | --- | --- | --- | --- |
| Base pkg.pr.new | ae2881ffb48f | 1.171s | 868.2ms | 34.0ms | 3 |
| PR pkg.pr.new | a668d2c | 888.3ms | 850.5ms | 34.4ms | 3 |
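
As a rough illustration of how a warm median could be derived from a handful of repeated runs, here is a minimal Rust sketch. The function name `median_ms` and any sample values passed to it are hypothetical; the benchmark script's actual implementation is not shown in this comment.

```rust
/// Median of timing samples in milliseconds (hypothetical helper).
/// Returns None when no samples were collected.
fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp defines a total order on f64, so the sort cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        // Odd count: the middle element is the median.
        Some(sorted[mid])
    } else {
        // Even count: average the two middle elements.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```

With three illustrative samples, `median_ms(&[35.0, 33.0, 34.0])` returns `Some(34.0)`. With only 3 warm samples, a single noisy run cannot move the median, which is presumably why a median is reported rather than a mean.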

Cached bunx execution performance

Runs the same large fixture through bunx -p <pkg.pr.new URL> ccusage after the Bun install cache has already been populated by the startup measurement. This separates cached package-runner execution from first-fetch package materialization.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base package: ae2881ffb48f; PR package: a668d2c. Both run through bunx -p <pkg.pr.new URL> ccusage using the warmed Bun install cache from package runner startup, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

| Command | Input | Base median | PR median | PR vs base | Base peak RSS | PR peak RSS | PR/base RSS | Base throughput | PR throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bunx -p <pkg> ccusage claude --offline --json | 1.01 GiB | 560.2ms | 570.3ms | 0.98x | 301.95 MiB | 309.95 MiB | 1.03x | 1.80 GiB/s | 1.77 GiB/s |
| bunx -p <pkg> ccusage codex --offline --json | 1.01 GiB | 375.3ms | 382.9ms | 0.98x | 81.20 MiB | 78.33 MiB | 0.96x | 2.68 GiB/s | 2.63 GiB/s |

Package runtime diagnostics

Compares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
All rows run --offline --json, measured by hyperfine with 0 warmups and 1 run. This isolates wrapper overhead from the installed native optional dependency and the workspace release binary built on the runner.

| Command | Runtime | Input | Median | Throughput | Samples |
| --- | --- | --- | --- | --- | --- |
| claude --offline --json | Package wrapper | 1.01 GiB | 561.3ms | 1.79 GiB/s | 1 |
| claude --offline --json | Installed native binary | 1.01 GiB | 539.6ms | 1.87 GiB/s | 1 |
| codex --offline --json | Package wrapper | 1.01 GiB | 373.2ms | 2.70 GiB/s | 1 |
| codex --offline --json | Installed native binary | 1.01 GiB | 348.8ms | 2.89 GiB/s | 1 |

Committed fixture performance

Committed small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage.

Fixtures: Claude apps/ccusage/test/fixtures/claude (0.00 MiB, 2 files), Codex apps/ccusage/test/fixtures/codex (0.00 MiB, 1 file)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs rust/target/release/ccusage directly. Both run --offline --json, measured by hyperfine with 2 warmups and 7 runs.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

| Command | Input | Base median | PR median | PR vs base | Base peak RSS | PR peak RSS | PR/base RSS | Base throughput | PR throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude daily --offline --json | 0.00 MiB | 30.5ms | 4.2ms | 7.28x | 43.73 MiB | 2.83 MiB | 0.06x | 0.05 MiB/s | 0.37 MiB/s |
| claude session --offline --json | 0.00 MiB | 31.5ms | 4.5ms | 7.05x | 43.61 MiB | 2.83 MiB | 0.06x | 0.05 MiB/s | 0.35 MiB/s |
| codex daily --offline --json | 0.00 MiB | 31.1ms | 4.0ms | 7.75x | 43.48 MiB | 2.83 MiB | 0.07x | 0.03 MiB/s | 0.21 MiB/s |
| codex session --offline --json | 0.00 MiB | 30.8ms | 4.0ms | 7.63x | 43.48 MiB | 2.83 MiB | 0.07x | 0.03 MiB/s | 0.21 MiB/s |

Large real-world-shaped fixture performance

Generated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures.

Fixtures: Claude /home/runner/work/_temp/ccusage-large-fixture (1.01 GiB, 2,597 files), Codex /home/runner/work/_temp/ccusage-large-codex-fixture (1.01 GiB, 2,597 files)
Base runs the published ccusage package from pkg.pr.new, installed before measurement; PR runs rust/target/release/ccusage directly. Both run --offline --json, measured by hyperfine with 0 warmups and 1 run.
Peak RSS is measured separately with /usr/bin/time using 1 run. Lower RSS ratios are better.

| Command | Input | Base median | PR median | PR vs base | Base peak RSS | PR peak RSS | PR/base RSS | Base throughput | PR throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude --offline --json | 1.01 GiB | 561.0ms | 533.0ms | 1.05x | 320.45 MiB | 324.20 MiB | 1.01x | 1.79 GiB/s | 1.89 GiB/s |
| codex --offline --json | 1.01 GiB | 370.3ms | 346.7ms | 1.07x | 75.95 MiB | 80.20 MiB | 1.06x | 2.72 GiB/s | 2.90 GiB/s |
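
The derived columns in this table appear to follow simple ratios, which the Rust sketch below reproduces. The formulas are inferred from the reported values rather than taken from the benchmark script, and the function names are hypothetical. Throughput computed from the rounded 1.01 GiB input can differ from the table in the last digit, presumably because the script divides the unrounded byte count.

```rust
/// "PR vs base" column: base median divided by PR median.
/// Values above 1.0 mean the PR run was faster.
fn speedup(base_ms: f64, pr_ms: f64) -> f64 {
    base_ms / pr_ms
}

/// "PR/base RSS" column: PR peak RSS divided by base peak RSS.
/// Values below 1.0 mean the PR run used less memory.
fn rss_ratio(base_mib: f64, pr_mib: f64) -> f64 {
    pr_mib / base_mib
}

/// Throughput in GiB/s: input size divided by the median converted to seconds.
fn throughput_gib_s(input_gib: f64, median_ms: f64) -> f64 {
    input_gib / (median_ms / 1000.0)
}
```

For the claude row, `speedup(561.0, 533.0)` is roughly 1.05 and `rss_ratio(320.45, 324.20)` is roughly 1.01, matching the reported 1.05x and 1.01x. Note that the two ratios point in opposite directions: a higher speed ratio is better, while a lower RSS ratio is better.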

Artifact size

| Artifact | Base | PR | Delta | Ratio |
| --- | --- | --- | --- | --- |
| packed ccusage-*.tgz | 14.35 KiB | 14.50 KiB | +0.15 KiB | 0.99x |
| installed native package binary | 3289.62 KiB | 3289.74 KiB | +0.13 KiB | 1.00x |

Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees.

@ryoppippi ryoppippi merged commit 22e5944 into main Jun 8, 2026
37 checks passed
@ryoppippi ryoppippi deleted the pullfrog/950-fix-codex-subagent-replay-overcounting branch June 8, 2026 21:35

Development

Successfully merging this pull request may close these issues.

Bug: Massive token overcounting for Codex subagent sessions (91x inflation)
