fix(codex): skip replayed parent token history in thread_spawn subagent sessions#1218
Conversation
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
ccusage-guide | a668d2c | Commit Preview URL Branch Preview URL |
Jun 08 2026, 09:21 PM |
|
@pullfrog rebase main |
|
Rebased Task list (5/5 completed)
|
…nt sessions When Codex spawns subagent threads via thread_spawn, the subagent rollout JSONL files contain a full replay of the parent thread's token usage history, re-timestamped to the subagent creation time. This caused usage to be reported up to 91x higher than actual. Detect subagent sessions by scanning for the thread_spawn byte pattern, then pre-scan the file for the replay timestamp pattern (>=2 token_count entries with the same second). In the main parse loop, skip all token_count entries matching the confirmed replay timestamp. Fixes #950
29c8a73 to
a614b36
Compare
ccusage
@ccusage/ccusage-darwin-arm64
@ccusage/ccusage-darwin-x64
@ccusage/ccusage-linux-arm64
@ccusage/ccusage-linux-x64
@ccusage/ccusage-win32-arm64
@ccusage/ccusage-win32-x64
commit: |
ccusage performance comparisonPR SHA: This compares the PR package against the configured base package on the same CI runner. Package runner startupExecution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one
Cached bunx execution performanceRuns the same large fixture through Fixtures: Claude
Package runtime diagnosticsCompares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself. Fixtures: Claude
Committed fixture performanceCommitted small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage. Fixtures: Claude
Large real-world-shaped fixture performanceGenerated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures. Fixtures: Claude
Artifact size
Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees. |
ccusage performance comparisonPR SHA: This compares the Rust PR release binary against the configured base package on the same CI runner. Package runner startupExecution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one
Cached bunx execution performanceRuns the same large fixture through Fixtures: Claude
Package runtime diagnosticsCompares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself. Fixtures: Claude
Committed fixture performanceCommitted small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage. Fixtures: Claude
Large real-world-shaped fixture performanceGenerated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures. Fixtures: Claude
Artifact size
Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees. |
ccusage performance comparisonPR SHA: This compares the Rust PR release binary against the configured base package on the same CI runner. Package runner startupExecution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one
Cached bunx execution performanceRuns the same large fixture through Fixtures: Claude
Package runtime diagnosticsCompares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself. Fixtures: Claude
Committed fixture performanceCommitted small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage. Fixtures: Claude
Large real-world-shaped fixture performanceGenerated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures. Fixtures: Claude
Artifact size
Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees. |
ccusage performance comparisonPR SHA: This compares the PR package against the configured base package on the same CI runner. Package runner startupExecution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one
Cached bunx execution performanceRuns the same large fixture through Fixtures: Claude
Package runtime diagnosticsCompares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself. Fixtures: Claude
Committed fixture performanceCommitted small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage. Fixtures: Claude
Large real-world-shaped fixture performanceGenerated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures. Fixtures: Claude
Artifact size
Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees. |
When thread_spawn replay entries are skipped, keep their cumulative total_token_usage as the parser baseline. Without that baseline, Codex logs that only provide total_token_usage on the first real subagent entry are counted as the full replayed cumulative total instead of the post-replay delta. Add a regression fixture that skips two replayed cumulative entries and verifies the following real entry is reported as a 100 input-token delta rather than the 1600-token cumulative total.
ccusage performance comparisonPR SHA: This compares the PR package against the configured base package on the same CI runner. Package runner startupExecution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one
Cached bunx execution performanceRuns the same large fixture through Fixtures: Claude
Package runtime diagnosticsCompares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself. Fixtures: Claude
Committed fixture performanceCommitted small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage. Fixtures: Claude
Large real-world-shaped fixture performanceGenerated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures. Fixtures: Claude
Artifact size
Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees. |
ccusage performance comparisonPR SHA: This compares the Rust PR release binary against the configured base package on the same CI runner. Package runner startupExecution setup measures any pre-benchmark package materialization used by the execution benchmark. Bunx temp cache measures one
Cached bunx execution performanceRuns the same large fixture through Fixtures: Claude
Package runtime diagnosticsCompares the PR package wrapper, the installed native optional dependency binary, and the workspace release binary on the same large fixture. This identifies whether slow package results come from JavaScript wrapper overhead, the published native binary build, or the Rust core itself. Fixtures: Claude
Committed fixture performanceCommitted small fixtures for stable PR-to-PR feedback and explicit Claude/Codex command coverage. Fixtures: Claude
Large real-world-shaped fixture performanceGenerated fixtures shaped from aggregate local log statistics: thousands of JSONL files, many small sessions, and a long tail of larger sessions. No real prompts, paths, or outputs are stored in the fixtures. Fixtures: Claude
Artifact size
Lower medians and smaller artifacts are better. CI runner noise still applies; use same-run ratios as directional PR feedback, not release guarantees. |

Summary
Fixes #950 — Massive token overcounting for Codex subagent sessions (91x inflation).
When OpenAI Codex spawns subagent threads via
thread_spawn, the subagent rollout JSONL files contain a full replay of the parent thread's token usage history, re-timestamped to the subagent creation time. This caused usage to be reported up to 91x higher than actual.Root Cause (3-layer inflation)
Fix
Detects thread_spawn subagent sessions by scanning for the
thread_spawnbyte pattern in the file prefix. For subagent sessions, a pre-scan identifies the replay timestamp pattern (≥2 token_count entries sharing the same second). In the main parse loop, all token_count entries matching the confirmed replay timestamp are skipped.Changes
parser.rs: Addedis_codex_subagent_session(),detect_subagent_replay_second(), and replay skip logic invisit_codex_session_fileloader.rs: Added two test fixtures:All 215 existing tests pass unchanged.
DeepSeek Pro(free via Pullfrog for OSS) | 𝕏Summary by cubic
Fixes #950. Corrects token overcounting in Codex subagent sessions by skipping replayed parent history in
thread_spawnrollouts and preserving cumulative baselines, preventing up to 91x inflation.Bug Fixes
thread_spawn.token_countlines in the same second).token_countentries with that timestamp; stop after the first non-replay line.total_token_usageas the baseline so the first real subagent entry is counted as a delta.Dependencies
actions/checkout@v6andpullfrog/pullfrog@v0to commit SHAs in the workflow.Written for commit a668d2c. Summary will update on new commits.