test: add Codex JSONL shape benchmark #1017
Pull request overview
Adds a reproducible benchmark + local survey harness to characterize Codex JSONL (NDJSON) scanner workload shape and compare the current streaming scanner against a front-buffer baseline, plus a small end-to-end attribution assertion for model-less token_count rows.
Changes:
- Add `CostUsageJsonlShapeBenchmarkTests` to generate a synthetic JSONL fixture matching a 30-day Codex session “shape” and benchmark scanning throughput vs a baseline implementation.
- Add `Scripts/cost_jsonl_shape_survey.swift` to scan local Codex session logs and print aggregate shape stats while redacting sensitive content.
- Add a parser assertion ensuring `turn_context` model attribution is applied to subsequent `token_count` rows that omit an explicit model.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Tests/CodexBarTests/CostUsageJsonlShapeBenchmarkTests.swift | New synthetic shape fixture + benchmark harness + attribution regression assertion. |
| Scripts/cost_jsonl_shape_survey.swift | New standalone Swift script to survey local JSONL shape metrics with redacted output. |
Comments suppressed due to low confidence (1)
Scripts/cost_jsonl_shape_survey.swift:288
- The error paths also call `Foundation.exit(...)` and use `stderr`; both typically come from `Darwin`/`Glibc` rather than Foundation. Consider importing the appropriate C module (or using `FileHandle.standardError`) and calling `exit(...)` directly so the script compiles cross-platform.
} catch let error as UsageError {
    fputs("error: \(error.description)\n\n", stderr)
    printUsage()
    Foundation.exit(2)
} catch {
    fputs("error: \(error.localizedDescription)\n", stderr)
    Foundation.exit(1)
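A minimal sketch of the cross-platform variant the suppressed comment suggests, assuming the script can import the platform C module conditionally. The helpers `writeError`, `exitCode(for:)`, and `fail(_:)` are hypothetical names, and `UsageError` here is a stand-in for the script's own type, not its actual definition.

```swift
import Foundation
#if canImport(Darwin)
import Darwin
#elseif canImport(Glibc)
import Glibc
#endif

/// Stand-in for the script's usage error type (assumed shape).
struct UsageError: Error, CustomStringConvertible {
    let description: String
}

/// Hypothetical helper: writes to standard error through FileHandle,
/// so the C `stderr` global is not needed.
func writeError(_ message: String) {
    FileHandle.standardError.write(Data((message + "\n").utf8))
}

/// Hypothetical helper: maps an error to the exit codes used in the excerpt
/// (2 for usage errors, 1 for everything else).
func exitCode(for error: Error) -> Int32 {
    error is UsageError ? 2 : 1
}

/// Hypothetical helper: reports the error and terminates with the mapped code,
/// calling the C library's `exit` directly.
func fail(_ error: Error) -> Never {
    if let usage = error as? UsageError {
        writeError("error: \(usage.description)")
    } else {
        writeError("error: \(error.localizedDescription)")
    }
    exit(exitCode(for: error))
}
```

Keeping the exit-code mapping in its own function means it can be checked without terminating the process.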
case "--help", "-h":
    printUsage()
    Foundation.exit(0)
default:
func printSummary(_ survey: Survey, options: SurveyOptions) {
    print("root: \(redactedRootDescription(options.root))")
    print("window days: \(Int(options.days))")
private struct CodexJsonlShapePlan {
    static let localThirtyDaySample = CodexJsonlShapePlan(
        totalLines: 145_797,
        relevantLines: 57063,
        tokenCountWithoutModelLines: 22235,
        turnContextLines: 1935,
        longTurnContextLines: 207,
        linesOver32KiB: 2584,
        linesOver256KiB: 697)
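As a rough illustration of how a plan like this could be sanity-checked before generating a fixture, the sketch below mirrors the field names from the excerpt in a hypothetical `ShapePlanSketch` type. The subset relationships it checks (for example, that model-less `token_count` lines are a subset of relevant lines) are assumptions made for the example, not invariants confirmed by the PR.

```swift
/// Hypothetical mirror of the plan's fields, for illustration only.
struct ShapePlanSketch {
    let totalLines: Int
    let relevantLines: Int
    let tokenCountWithoutModelLines: Int
    let turnContextLines: Int
    let longTurnContextLines: Int
    let linesOver32KiB: Int
    let linesOver256KiB: Int

    /// True when every count fits inside its assumed parent count.
    var isConsistent: Bool {
        relevantLines <= totalLines
            && tokenCountWithoutModelLines <= relevantLines
            && turnContextLines <= relevantLines
            && longTurnContextLines <= turnContextLines
            && linesOver256KiB <= linesOver32KiB
            && linesOver32KiB <= totalLines
    }

    /// Share of all lines that count as relevant (0 for an empty plan).
    var relevantFraction: Double {
        totalLines == 0 ? 0 : Double(relevantLines) / Double(totalLines)
    }
}

// Values copied from the excerpt above.
let sample = ShapePlanSketch(
    totalLines: 145_797,
    relevantLines: 57_063,
    tokenCountWithoutModelLines: 22_235,
    turnContextLines: 1_935,
    longTurnContextLines: 207,
    linesOver32KiB: 2_584,
    linesOver256KiB: 697)
```

A check like `sample.isConsistent` would catch a mistyped count before the synthetic fixture is built from it.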
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 10366add30
let isRelevant = line.contains(Marker.eventMessage)
    || line.contains(Marker.turnContext)
    || line.contains(Marker.sessionMetadata)
Restrict relevant-line counting to scanned event subtypes
The survey currently increments `relevantLines` for every `"type":"event_msg"` row, but the actual Codex scanner only keeps `event_msg` rows when they contain `token_count` or `task_started` (`Sources/CodexBarCore/Vendored/CostUsage/CostUsageScanner.swift` lines 805–809). As a result, the reported `relevant Codex scanner lines` metric is inflated whenever session logs contain other event messages, which skews the shape numbers this benchmark is meant to reproduce.
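One possible shape for the narrower check, sketched under assumptions: the `Marker` string values below are hypothetical placeholders (the script's real constants are not shown in the excerpt), and `tokenCount` and `taskStarted` are names introduced here for illustration.

```swift
import Foundation

/// Hypothetical marker strings; the script's real `Marker` values may differ.
enum Marker {
    static let eventMessage = "\"type\":\"event_msg\""
    static let turnContext = "\"type\":\"turn_context\""
    static let sessionMetadata = "\"type\":\"session_meta\""
    static let tokenCount = "\"token_count\""
    static let taskStarted = "\"task_started\""
}

/// Counts a line as relevant only if the scanner would keep it:
/// turn context and session metadata rows always qualify, while
/// event messages qualify only for the two subtypes named in the review.
func isRelevant(_ line: String) -> Bool {
    if line.contains(Marker.turnContext) || line.contains(Marker.sessionMetadata) {
        return true
    }
    guard line.contains(Marker.eventMessage) else { return false }
    return line.contains(Marker.tokenCount) || line.contains(Marker.taskStarted)
}
```

With this predicate, an `event_msg` row carrying some other payload type no longer adds to `relevantLines`, which is the inflation the review describes.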
Refs #1016.
This adds an RFC benchmark/survey harness for the Codex JSONL scanner workload:
- `CostUsageJsonlShapeBenchmarkTests` builds a synthetic JSONL file shaped after a local 30-day Codex sessions sample, then compares the current streaming scanner with a front-buffer baseline.
- `Scripts/cost_jsonl_shape_survey.swift` lets maintainers collect the same aggregate shape numbers locally without printing prompts, tool payloads, model values, file paths, or raw JSONL rows.
- It also covers model attribution for a `turn_context` row followed by `token_count` rows that do not carry an explicit model.

The benchmark intentionally does not assert a speed threshold. It is meant to make scanner tradeoffs reproducible without making CI sensitive to runner noise. The long `turn_context` correctness fix is separate in #1014; this PR only provides measurement and workload-shape coverage for #1016.

Local aggregate survey from my default Codex sessions over 30 days:
Local benchmark sample:
Validation:
- `swift test --filter CostUsageJsonlShapeBenchmarkTests`
- `Scripts/cost_jsonl_shape_survey.swift --days 30`
- `./Scripts/lint.sh lint`
- `git diff --check`
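The "no speed threshold" approach described above can be sketched roughly as follows. This is not the PR's benchmark: `measure`, `countNewlinesWholeBuffer`, and `countNewlinesChunked` are hypothetical toy stand-ins for a front-buffer baseline and a streaming scanner, used only to show timings being reported while the assertion is limited to result equality.

```swift
import Dispatch
import Foundation

/// Hypothetical helper: runs `work` once and returns its result with elapsed seconds.
func measure<T>(_ work: () -> T) -> (result: T, seconds: Double) {
    let start = DispatchTime.now().uptimeNanoseconds
    let result = work()
    let end = DispatchTime.now().uptimeNanoseconds
    return (result, Double(end - start) / 1_000_000_000)
}

/// Toy stand-in for a front-buffer baseline: walks the whole buffer in one pass.
func countNewlinesWholeBuffer(_ data: Data) -> Int {
    data.reduce(0) { $1 == 0x0A ? $0 + 1 : $0 }
}

/// Toy stand-in for a streaming scanner: walks the buffer in fixed-size chunks.
func countNewlinesChunked(_ data: Data, chunkSize: Int = 64 * 1024) -> Int {
    var count = 0
    var offset = data.startIndex
    while offset < data.endIndex {
        let end = min(offset + chunkSize, data.endIndex)
        for byte in data[offset..<end] where byte == 0x0A {
            count += 1
        }
        offset = end
    }
    return count
}

// Small synthetic fixture: 10,000 identical JSONL rows.
let fixture = Data(String(repeating: "{\"type\":\"event_msg\"}\n", count: 10_000).utf8)

let baseline = measure { countNewlinesWholeBuffer(fixture) }
let streaming = measure { countNewlinesChunked(fixture) }

// Only correctness is asserted; timings are printed for humans to compare.
precondition(baseline.result == streaming.result)
print("baseline: \(baseline.seconds)s, streaming: \(streaming.seconds)s")
```

Printing the timings instead of asserting on them keeps the comparison reproducible locally without tying CI results to runner noise.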