Fix codex / opencode env names and add persistent agent_session_id fallback#894

Merged
codyde merged 2 commits into master from cody/agent-session-id-propagation-fixes
May 11, 2026

Conversation

codyde commented May 11, 2026

Summary

  • Two env var name bug fixes in the agent_session_id precedence chain. The CLI was checking for variables that no agent actually exports:
    • CODEX_SESSION_ID → real name is CODEX_THREAD_ID (UUID v7)
    • OPENCODE_SESSION_ID → real name is OPENCODE_RUN_ID (UUID v4)
  • New persistent agent_session_id file at ~/.railway/sessions/<16-hex>.session keyed on parent process identity (pid + boot time + argv0). Subsequent railway invocations from the same parent reuse the recorded UUID
  • Adds AGENT_THREAD_ID to the precedence chain (cross-agent convention exposed by Amp and a few other harnesses)
  • Tightens the process-tree claude substring match to require claude-code / claude_code / anthropic.claude-code / bare claude argv0 (previous bare substring was over-attributing Claude Desktop helper paths and ~/.claude/ scripts to claude_code)
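
As a rough illustration of the corrected precedence chain, here is a minimal sketch. The variable names come from this description; their relative order (apart from AGENT_THREAD_ID being a late fallback), the function name `session_id_from_env`, and the skipping of blank values are assumptions for illustration, not the CLI's actual code.

```rust
/// Env vars consulted for an agent session id, highest precedence first.
/// Names are taken from the PR description; the ordering is an assumption,
/// except that AGENT_THREAD_ID is described as a late fallback.
const SESSION_ENV_VARS: [&str; 6] = [
    "CLAUDE_CODE_SESSION_ID",
    "CODEX_THREAD_ID",
    "OPENCODE_RUN_ID",
    "AMP_CURRENT_THREAD_ID",
    "CURSOR_TRACE_ID",
    "AGENT_THREAD_ID",
];

/// Returns the first non-blank value in precedence order.
/// `lookup` abstracts the environment so the chain can be exercised in tests.
/// Treating blank values as absent is an assumption of this sketch.
fn session_id_from_env<F>(lookup: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    SESSION_ENV_VARS
        .iter()
        .copied()
        .filter_map(|name| lookup(name))
        .map(|value| value.trim().to_string())
        .find(|value| !value.is_empty())
}
```

Against the real environment this would be called as `session_id_from_env(|k| std::env::var(k).ok())`; only when it returns `None` would the persistent-file fallback be considered.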

Why

Audit on 2026-05-10 found claude_code event count 21x ahead and session count 17x ahead of any other agent in the warehouse. Hex provenance check revealed the real bug: 99.8% of claude_code's "sessions" were the per-process cli_<base64> fallback because CLAUDE_CODE_SESSION_ID doesn't survive the Bash tool boundary. Same shape for every other agent — and even worse for codex (100% fallback) and opencode (~100% fallback) because the CLI was looking for env var names that those agents don't actually export.

Verified 2026-05-11 by capturing the env inside live shell-tool invocations of each agent:

| Agent | Real session env var | What the CLI was checking | Warehouse fallback rate |
|---|---|---|---|
| Codex | CODEX_THREAD_ID (UUID v7) | (nothing; codex absent entirely) | 100% |
| OpenCode | OPENCODE_RUN_ID (UUID v4) | OPENCODE_SESSION_ID (does not exist) | ~100% |
| Amp | AMP_CURRENT_THREAD_ID | AMP_CURRENT_THREAD_ID (correct) | high but population tiny |
| Cursor | (none; no session var exported) | CURSOR_TRACE_ID (kept for forward compat) | 95% |
| Claude Code | CLAUDE_CODE_SESSION_ID (UUID) | CLAUDE_CODE_SESSION_ID (correct) | 99.8% because Bash tool doesn't propagate |

The persistent-file fallback is the unified fix for the last row and for any future agent whose env doesn't propagate. The ID is deliberately a v4 UUID so that the dbt-side is_unstitched_agent_session macro treats it as a real stitched session (not subject to gap-windowing).
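
To make the format distinction concrete, here is a small std-only sketch. The function names are hypothetical, the caller is assumed to supply 16 random bytes, and the accepted base64 alphabet (standard plus URL-safe characters) is an assumption; the actual dbt macro and CLI code may differ.

```rust
/// Formats 16 caller-supplied random bytes as a version-4 UUID string,
/// forcing the version nibble and variant bits.
fn format_uuid_v4(mut b: [u8; 16]) -> String {
    b[6] = (b[6] & 0x0f) | 0x40; // version nibble = 4
    b[8] = (b[8] & 0x3f) | 0x80; // variant bits = 10xx
    let hex: String = b.iter().map(|x| format!("{:02x}", x)).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}

/// True when `s` is shaped like the per-process fallback id:
/// `cli_` followed by 22 base64 characters (alphabet is an assumption).
fn looks_like_cli_fallback(s: &str) -> bool {
    match s.strip_prefix("cli_") {
        Some(rest) => {
            rest.len() == 22
                && rest
                    .bytes()
                    .all(|c| c.is_ascii_alphanumeric() || matches!(c, b'+' | b'/' | b'-' | b'_'))
        }
        None => false,
    }
}

/// True when `s` is a hyphenated UUID with version 4 and the standard variant.
fn is_uuid_v4(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 36
        && b.iter().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => *c == b'-',
            _ => c.is_ascii_hexdigit(),
        })
        && b[14] == b'4'
        && matches!(b[19], b'8' | b'9' | b'a' | b'b' | b'A' | b'B')
}
```

An id produced by `format_uuid_v4` can never satisfy `looks_like_cli_fallback`, since it lacks the `cli_` prefix, which is the property the persistent ids rely on.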

File lifecycle (the part worth scrutinizing in review)

  • Written only when is_agent_caller(caller) == true. Humans typing railway interactively (tty / tty:*) and CI runs never get a file
  • Reused as long as parent pid is still alive AND boot time matches the persisted value (defends against PID reuse after reboot)
  • 7-day hard age cap as backstop against very-long-lived parents or boot-time detection drift
  • Stale files cleaned up on every CLI invocation (parent gone or btime mismatch → delete); 100-file directory ceiling as defense-in-depth (oldest by mtime evicted first)
  • RAILWAY_SESSIONS_DIR env var overrides the directory location (used by tests)
  • Telemetry disabled (DO_NOT_TRACK=1 or RAILWAY_NO_TELEMETRY=1) short-circuits before the file is written, so users who opted out keep nothing on disk
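
The reuse and write rules above can be sketched as two pure functions. All type, field, and function names here are hypothetical; the sketch assumes the caller has already probed the parent process and read the session file.

```rust
use std::time::Duration;

/// What this sketch assumes a session file records (field names are hypothetical).
struct SessionRecord {
    parent_pid: u32,
    boot_time: u64, // host boot time in seconds since the epoch
    age: Duration,  // time elapsed since the file was written
}

/// Live facts about the current parent process, gathered by the caller.
struct ParentProbe {
    pid: u32,
    alive: bool,
    boot_time: u64,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Reuse,
    DeleteAndMint,
}

/// The 7-day hard age cap from the lifecycle rules.
const MAX_AGE: Duration = Duration::from_secs(7 * 24 * 60 * 60);

/// Reuse only when the parent is the same live pid, the boot time matches
/// (guarding against PID reuse after reboot), and the file is within the cap.
fn decide(record: &SessionRecord, parent: &ParentProbe) -> Decision {
    let same_parent = record.parent_pid == parent.pid && parent.alive;
    let same_boot = record.boot_time == parent.boot_time;
    if same_parent && same_boot && record.age <= MAX_AGE {
        Decision::Reuse
    } else {
        Decision::DeleteAndMint
    }
}

/// A file is written only for agent callers, and never when telemetry is disabled.
fn may_write_file(is_agent_caller: bool, telemetry_disabled: bool) -> bool {
    is_agent_caller && !telemetry_disabled
}
```

Keeping the decision free of filesystem and process calls makes each rule individually testable; the 100-file directory ceiling is left out of this sketch.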

Test plan

  • 17/17 unit tests passing (3 new: claude_substring_no_longer_overmatches, new_session_uuid_is_v4_format, new_session_uuid_does_not_match_cli_fallback_regex)
  • End-to-end against locally built binary:
    • Phase 1: clean env + agent caller → fresh file with parent pid + valid UUID v4
    • Phase 2: same parent invokes twice → UUID preserved
    • Phase 3: three concurrent live subshells → three distinct files, three distinct UUIDs, three distinct parent_pids
    • Phase 4: CODEX_THREAD_ID set → file NOT written (env wins)
    • Phase 5: OPENCODE_RUN_ID set → file NOT written
    • Phase 6: planted stale file (pid 999999) → cleanup removes, fresh file written
    • Phase 7: tty caller → no file written
  • Post-merge: confirm warehouse rows from a CLI build of this branch surface agent_session_id values that look like UUIDs (not cli_<base64>) for agent-attributed traffic
  • Post-release: confirm codex / opencode rows in mcp_submit_tool / cli_* tables stitch correctly via their new env entries
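
For reviewers who want a feel for what claude_substring_no_longer_overmatches guards, here is a minimal sketch of the tightened match. The function name `is_claude_code`, the case-sensitive comparison, and the example paths are assumptions; the real matcher in the CLI may be structured differently.

```rust
use std::path::Path;

/// Returns true when an argv looks like Claude Code under the tightened rules:
/// an explicit claude-code style marker anywhere in argv, or an argv0 whose
/// file name is exactly `claude`. A bare "claude" substring is not enough.
fn is_claude_code(argv: &[&str]) -> bool {
    const MARKERS: [&str; 3] = ["claude-code", "claude_code", "anthropic.claude-code"];
    if argv
        .iter()
        .any(|arg| MARKERS.iter().any(|marker| arg.contains(*marker)))
    {
        return true;
    }
    // Compare only the final path component of argv0, case-sensitively
    // (case sensitivity is an assumption of this sketch).
    let argv0_name = argv
        .first()
        .and_then(|argv0| Path::new(*argv0).file_name())
        .and_then(|name| name.to_str());
    argv0_name == Some("claude")
}
```

Under these rules a script under ~/.claude/ run by bash, or a Claude Desktop helper binary, no longer counts as claude_code, while a bare `claude` executable still does.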

Companion PR

Analytics-side fix (gap-windowed sessionization for the unstitched fallback) lands in railwayapp/dbt-analytics#134. The two together address both producer and consumer of the problem; either can ship independently but they're best deployed close together.

🤖 Generated with Claude Code

…llback

The agent_session_id precedence chain in TelemetryContext had two entries
checking for env vars that no agent actually exports:

  CODEX_SESSION_ID   → real name is CODEX_THREAD_ID (UUID v7)
  OPENCODE_SESSION_ID → real name is OPENCODE_RUN_ID (UUID v4)

Both verified 2026-05-11 by capturing the env inside live codex/opencode
shell tool invocations. The wrong names matched the warehouse data:
codex was at 100% per-process fallback and opencode at ~100% because the
CLI was looking for variables that don't exist. Also adds AGENT_THREAD_ID
as a late fallback (cross-agent convention exposed by Amp and observed
in other harnesses' docs).

When no harness env var is present, the CLI now writes a persistent
session file to ~/.railway/sessions/<16-hex>.session keyed on parent
process identity (pid + boot time + argv0). Subsequent `railway`
invocations from the same parent reuse the recorded UUID, recovering
stable stitching for agents whose env doesn't propagate (notably
claude_code: 99.8% of sessions hit the per-process mint because
CLAUDE_CODE_SESSION_ID doesn't survive the Bash tool boundary).
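
One way to derive a 16-hex file name from the parent identity is sketched below. The commit does not say which hash is used; FNV-1a is swapped in here only because it is small and stable across builds, and the function names and byte layout of the key are hypothetical.

```rust
use std::path::PathBuf;

/// 64-bit FNV-1a. An illustrative choice, not necessarily the CLI's hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Builds `<16-hex>.session` from pid + boot time + argv0.
fn session_file_name(pid: u32, boot_time: u64, argv0: &str) -> String {
    let mut key = Vec::new();
    key.extend_from_slice(&pid.to_le_bytes());
    key.extend_from_slice(&boot_time.to_le_bytes());
    key.extend_from_slice(argv0.as_bytes());
    format!("{:016x}.session", fnv1a64(&key))
}

/// Resolves the session path: an override directory (the RAILWAY_SESSIONS_DIR
/// value, if set) wins over the default under the home directory.
fn session_path(dir_override: Option<&str>, home: &str, file_name: &str) -> PathBuf {
    match dir_override {
        Some(dir) => PathBuf::from(dir).join(file_name),
        None => PathBuf::from(home)
            .join(".railway")
            .join("sessions")
            .join(file_name),
    }
}
```

Because the key includes the boot time, the same pid after a reboot maps to a different file name.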

File lifecycle:
  - Written only for agent callers (tty/ci never get a file).
  - Reused as long as parent pid is alive AND boot time matches.
  - 7-day hard age cap as backstop against PID reuse.
  - Stale files (parent gone or btime mismatch) deleted on every
    invocation; directory capped at 100 files (oldest-by-mtime evicted).
  - Override location via RAILWAY_SESSIONS_DIR for tests.

UUID format chosen to match the dbt-side is_unstitched_agent_session
macro: a v4 UUID does not match the cli_<22-char-base64> regex, so
persistent IDs are treated as real stitched sessions in the warehouse,
not heuristically gap-binned.

Tightens the process-tree claude substring match to require
claude-code / claude_code / anthropic.claude-code / bare `claude` argv0.
The previous bare `claude` substring over-attributed Claude Desktop
helper paths, MCP server binaries with "claude" in argv, and
~/.claude/ scripts to claude_code.

Verified end-to-end against a locally built binary across:
fresh write, same-parent reuse (UUID preserved), multi-parent isolation
(concurrent subshells get distinct files), env precedence (CODEX_THREAD_ID
and OPENCODE_RUN_ID win over disk), stale cleanup (dead-pid files
removed), tty caller suppression (no file written). 17/17 unit tests
passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codyde added the release/patch label May 11, 2026
CI rustfmt check failed on three multiline-preference nits:
the UUID format! args, an assert_eq! over 100 chars, and a multi-arg
assert! over 100 chars. No behavioral change; cargo fmt --all auto-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codyde merged commit 01bde58 into master May 11, 2026
6 checks passed
codyde deleted the cody/agent-session-id-propagation-fixes branch May 11, 2026 06:35
codyde added a commit that referenced this pull request May 12, 2026
…diate parent (#896)

* fix(telemetry): anchor persistent session on agent ancestor, not immediate parent

#894 introduced a persistent ~/.railway/sessions/*.session fallback, but
parent_identity() reads me.ppid directly, so the session is keyed on the
immediate parent. For claude_code's claude_code -> bash -> railway
invocation chain, the parent is the short-lived bash spawned per Bash
tool call, which dies between invocations. Result: every railway call
mints a fresh UUID instead of reusing the file.

Warehouse confirms: ~107k single-event UUID sessions from claude_code
alone in 48h after 4.57.3 shipped, with stitching empirically worse
than the prior cli_<22b64> regime because the new UUIDs aren't
pattern-matchable as fallbacks (the dbt is_unstitched_agent_session
macro can't catch them).

Fix: extract the ancestor walk into agent_ancestor_pid() (mirrors the
15-level walk in agent_from_process_tree) and anchor the persistent
session on the recognized harness process — claude_code, codex,
cursor, etc. These are long-lived and stable across the agent's many
short-lived shell subprocesses. Falls back to the immediate parent
only when no recognized agent ancestor exists, preserving stitching
for unknown-but-long-lived parents.

Tests cover the claude_code-via-bash and codex-via-sh chains, the
no-agent fallback path, and a self-referential ppid cycle guard.
4 new tests, 21/21 telemetry tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rustfmt the long claude_code argv line in the new test

CI rustfmt failed on the multiline-preference threshold for the
`node(1, "...")` argument in the claude_code anchor test. Apply
cargo fmt --all auto-fix; no behavioral change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
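
The ancestor anchoring described in the #896 commit message can be sketched over an in-memory process table. The name agent_ancestor_pid and the 15-level limit come from that message; the `Proc` struct, the `is_agent` recognizer, the `anchor_pid` helper, and the signatures are hypothetical stand-ins for the real process-tree code.

```rust
use std::collections::HashMap;

/// Minimal view of one process (hypothetical fields).
struct Proc {
    ppid: u32,
    argv0: &'static str,
}

/// Depth limit mirroring the 15-level walk mentioned in the commit message.
const MAX_DEPTH: usize = 15;

/// Hypothetical recognizer; the real CLI matches more harnesses, more carefully.
fn is_agent(argv0: &str) -> bool {
    matches!(argv0, "claude" | "codex" | "cursor")
}

/// Walks up from `start_pid` and returns the pid of the first recognized agent
/// ancestor. Stops on unknown pids, self-referential ppids, or the depth limit.
fn agent_ancestor_pid(table: &HashMap<u32, Proc>, start_pid: u32) -> Option<u32> {
    let mut pid = start_pid;
    for _ in 0..MAX_DEPTH {
        let entry = table.get(&pid)?;
        if is_agent(entry.argv0) {
            return Some(pid);
        }
        if entry.ppid == pid {
            return None; // self-referential ppid: stop instead of looping
        }
        pid = entry.ppid;
    }
    None
}

/// Anchors on the agent ancestor, falling back to the immediate parent.
fn anchor_pid(table: &HashMap<u32, Proc>, my_ppid: u32) -> u32 {
    agent_ancestor_pid(table, my_ppid).unwrap_or(my_ppid)
}
```

For a claude -> bash -> railway chain this anchors on the long-lived claude process rather than the per-call bash, so repeated railway invocations resolve to the same session key.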