test: add Codex JSONL shape benchmark #1017

Merged
steipete merged 1 commit into steipete:main from hhh2210:codex/jsonl-shape-benchmark
May 18, 2026
Conversation

hhh2210 (Contributor) commented May 17, 2026

Refs #1016.

This adds an RFC benchmark/survey harness for the Codex JSONL scanner workload:

  • CostUsageJsonlShapeBenchmarkTests builds a synthetic JSONL file shaped after a local 30-day Codex sessions sample, then compares the current streaming scanner with a front-buffer baseline.
  • Scripts/cost_jsonl_shape_survey.swift lets maintainers collect the same aggregate shape numbers locally without printing prompts, tool payloads, model values, file paths, or raw JSONL rows.
  • A small end-to-end parser assertion covers the relevant attribution shape: a turn_context row followed by token_count rows that do not carry an explicit model.
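For readers unfamiliar with the attribution shape in the last bullet, here is a minimal sketch of the idea: a turn_context row sets the model, and later token_count rows without an explicit model inherit it. The row layout, the field names ("payload", "model", "total_tokens"), and the function `attributeTokens` are assumptions made for illustration; they are not taken from the real Codex log schema or from the scanner in this repo.

```swift
import Foundation

// Hypothetical rows: field names are illustrative assumptions, not the real schema.
let rows = [
    #"{"type":"turn_context","payload":{"model":"model-a"}}"#,
    #"{"type":"event_msg","payload":{"type":"token_count","total_tokens":120}}"#,
    #"{"type":"event_msg","payload":{"type":"token_count","total_tokens":80}}"#,
]

/// Sums token counts per model, attributing model-less token_count rows
/// to the model of the most recent turn_context row.
func attributeTokens(_ lines: [String]) -> [String: Int] {
    var currentModel: String?
    var totals: [String: Int] = [:]
    for line in lines {
        guard let object = try? JSONSerialization.jsonObject(with: Data(line.utf8)) as? [String: Any],
              let type = object["type"] as? String,
              let payload = object["payload"] as? [String: Any]
        else { continue } // skip rows that are not JSON objects of the assumed shape

        if type == "turn_context" {
            // Remember the model for the rows that follow.
            if let model = payload["model"] as? String { currentModel = model }
        } else if type == "event_msg", (payload["type"] as? String) == "token_count" {
            // Prefer an explicit model, then the inherited one.
            let model = (payload["model"] as? String) ?? currentModel ?? "unknown"
            totals[model, default: 0] += (payload["total_tokens"] as? Int) ?? 0
        }
    }
    return totals
}

let totals = attributeTokens(rows)
print(totals)
```

With the three sample rows above, both token_count rows are attributed to "model-a", so the total for that key is 200.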

The benchmark intentionally does not assert a speed threshold. It is meant to make scanner tradeoffs reproducible without making CI sensitive to runner noise. The long turn_context correctness fix is separate in #1014; this PR only provides measurement and workload-shape coverage for #1016.

Local aggregate survey from my default Codex sessions over 30 days:

files: 532
total bytes: 1202851897
lines: 146748
relevant Codex scanner lines: 57418
line length p50/p90/p95/p99/max: 651 / 5065 / 12080 / 52701 / 6904210 bytes
lines > 32 KiB: 2602
lines > 256 KiB: 697
turn_context lines: 1945
turn_context lines > 32 KiB: 208
turn_context lines > 256 KiB: 0
turn_context model offset p50/p95/max: 443 / 2255 / 2382 bytes
turn_context model offset < 32 KiB: 1945 / 1945
turn_context model offset < 256 KiB: 1945 / 1945
token_count rows missing an explicit model: 22381 / 22381
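As a rough illustration of how aggregate numbers like the percentiles and size-threshold counts above could be derived from line lengths alone, the sketch below uses the nearest-rank method. Which percentile definition the survey script actually uses is not stated here, so treat `percentile` and the sample lengths as hypothetical.

```swift
/// Nearest-rank percentile over a non-empty array sorted in ascending order.
/// This is one common definition; the survey script may use another.
func percentile(_ sorted: [Int], _ p: Double) -> Int {
    precondition(!sorted.isEmpty && p > 0 && p <= 100)
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

// Made-up line lengths in bytes; only lengths are kept, never row content.
let lengths = [120, 651, 700, 5_065, 40_000].sorted()

let p50 = percentile(lengths, 50)                        // rank ceil(2.5) = 3, so 700
let maxLength = percentile(lengths, 100)                 // last element, 40_000
let over32KiB = lengths.filter { $0 > 32 * 1024 }.count  // 1 line above 32 KiB

print("p50=\(p50) max=\(maxLength) lines > 32 KiB: \(over32KiB)")
```

Because only integer lengths are collected, a summary built this way cannot leak prompts, payloads, or paths.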

Local benchmark sample:

Codex JSONL shape benchmark: divisor=20 bytes=19621792 lines=7290 truncated=129 current=113.9MB/s baseline=46.9MB/s speedup=2.4x
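The speedup figure in that line is consistent with the ratio of the two throughputs (113.9 / 46.9 ≈ 2.43). A minimal sketch of how such a report might be produced without asserting a threshold follows; the helper names and the choice of 1 MB = 1,000,000 bytes are assumptions, not details of the actual test.

```swift
import Foundation

/// Throughput in MB/s, assuming 1 MB = 1_000_000 bytes (the harness may differ).
func megabytesPerSecond(bytes: Int, seconds: Double) -> Double {
    Double(bytes) / 1_000_000 / seconds
}

/// Times `body` once and returns its throughput; no pass/fail threshold is applied.
func measure(bytes: Int, _ body: () -> Void) -> Double {
    let start = Date()
    body()
    let elapsed = max(Date().timeIntervalSince(start), .leastNonzeroMagnitude)
    return megabytesPerSecond(bytes: bytes, seconds: elapsed)
}

// Reproduce the ratio from the sample line above.
let current = 113.9
let baseline = 46.9
let speedup = current / baseline
let report = String(format: "current=%.1fMB/s baseline=%.1fMB/s speedup=%.1fx",
                    current, baseline, speedup)
print(report)
```

Printing the numbers instead of asserting on them keeps the result reproducible locally while leaving CI insensitive to runner noise.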

Validation:

  • swift test --filter CostUsageJsonlShapeBenchmarkTests
  • Scripts/cost_jsonl_shape_survey.swift --days 30
  • ./Scripts/lint.sh lint
  • git diff --check

hhh2210 marked this pull request as ready for review May 17, 2026 18:28
Copilot AI review requested due to automatic review settings May 17, 2026 18:28

Copilot AI left a comment

Pull request overview

Adds a reproducible benchmark + local survey harness to characterize Codex JSONL (NDJSON) scanner workload shape and compare the current streaming scanner against a front-buffer baseline, plus a small end-to-end attribution assertion for model-less token_count rows.

Changes:

  • Add CostUsageJsonlShapeBenchmarkTests to generate a synthetic JSONL fixture matching a 30-day Codex session “shape” and benchmark scanning throughput vs a baseline implementation.
  • Add Scripts/cost_jsonl_shape_survey.swift to scan local Codex session logs and print aggregate shape stats while redacting sensitive content.
  • Add a parser assertion ensuring turn_context model attribution is applied to subsequent token_count rows that omit an explicit model.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
Tests/CodexBarTests/CostUsageJsonlShapeBenchmarkTests.swift | New synthetic shape fixture + benchmark harness + attribution regression assertion.
Scripts/cost_jsonl_shape_survey.swift | New standalone Swift script to survey local JSONL shape metrics with redacted output.

Comments suppressed due to low confidence (1)

Scripts/cost_jsonl_shape_survey.swift:288

  • The error paths also call Foundation.exit(...) and use stderr; both typically come from Darwin/Glibc rather than Foundation. Consider importing the appropriate C module (or using FileHandle.standardError) and calling exit(...) directly so the script compiles cross-platform.
} catch let error as UsageError {
    fputs("error: \(error.description)\n\n", stderr)
    printUsage()
    Foundation.exit(2)
} catch {
    fputs("error: \(error.localizedDescription)\n", stderr)
    Foundation.exit(1)
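
One way to follow the suggestion above is sketched here: import the platform C module conditionally for `exit`, and write to standard error through `FileHandle` rather than the C `stderr` global. The helpers `printError` and `fail` are hypothetical names, not part of the script under review.

```swift
import Foundation
#if canImport(Darwin)
import Darwin
#elseif canImport(Glibc)
import Glibc
#endif

/// Writes a line to standard error without touching the C `stderr` global.
func printError(_ message: String) {
    FileHandle.standardError.write(Data((message + "\n").utf8))
}

/// Hypothetical helper: reports an error and terminates with the given status.
func fail(_ message: String, code: Int32) -> Never {
    printError("error: \(message)")
    exit(code) // resolved from Darwin or Glibc, not via `Foundation.exit`
}

// Example: report a problem on stderr without exiting.
printError("warning: example message")
```

A catch block could then call `fail(error.localizedDescription, code: 1)` in place of the `fputs` plus `Foundation.exit` pair.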

Comment on lines +121 to +124
    case "--help", "-h":
        printUsage()
        Foundation.exit(0)
    default:

func printSummary(_ survey: Survey, options: SurveyOptions) {
    print("root: \(redactedRootDescription(options.root))")
    print("window days: \(Int(options.days))")

Comment on lines +273 to +281
private struct CodexJsonlShapePlan {
    static let localThirtyDaySample = CodexJsonlShapePlan(
        totalLines: 145_797,
        relevantLines: 57063,
        tokenCountWithoutModelLines: 22235,
        turnContextLines: 1935,
        longTurnContextLines: 207,
        linesOver32KiB: 2584,
        linesOver256KiB: 697)


chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10366add30

Comment on lines +47 to +49
let isRelevant = line.contains(Marker.eventMessage)
    || line.contains(Marker.turnContext)
    || line.contains(Marker.sessionMetadata)

P2: Restrict relevant-line counting to scanned event subtypes

The survey currently increments relevantLines for every "type":"event_msg" row, but the actual Codex scanner only keeps event_msg rows when they contain token_count or task_started (Sources/CodexBarCore/Vendored/CostUsage/CostUsageScanner.swift lines 805–809). As a result, the reported relevant Codex scanner lines metric is inflated whenever session logs contain other event messages, which skews the shape numbers this benchmark is meant to reproduce.
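
A possible shape for the tighter check is sketched below: an event_msg row counts as relevant only when it also mentions token_count or task_started, while turn_context and session metadata rows count as before. The marker strings, the assumption that each line is a `String`, and the function name `isRelevantLine` are illustrative; the real `Marker` values in the script may differ.

```swift
import Foundation

// Hypothetical marker values; the script's actual constants are not shown in this thread.
enum Marker {
    static let eventMessage = #""type":"event_msg""#
    static let turnContext = #""type":"turn_context""#
    static let sessionMetadata = #""type":"session_meta""#
    static let tokenCount = #""token_count""#
    static let taskStarted = #""task_started""#
}

/// Counts a line as relevant only if the scanner would keep it.
func isRelevantLine(_ line: String) -> Bool {
    if line.contains(Marker.turnContext) || line.contains(Marker.sessionMetadata) {
        return true
    }
    // event_msg rows are kept only for the two subtypes named in the review.
    guard line.contains(Marker.eventMessage) else { return false }
    return line.contains(Marker.tokenCount) || line.contains(Marker.taskStarted)
}

let kept = isRelevantLine(#"{"type":"event_msg","payload":{"type":"token_count"}}"#)
let dropped = isRelevantLine(#"{"type":"event_msg","payload":{"type":"agent_message"}}"#)
print(kept, dropped)
```

Substring matching like this can still overcount when a payload happens to contain one of the marker strings, so it approximates the scanner's filter rather than reproducing it exactly.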

steipete merged commit dd2b3cc into steipete:main May 18, 2026
7 of 8 checks passed
3 participants