feat(proxy): honor + echo X-LLMTrace-Trace-Id header (#91 E2E-L1a) by epappas · Pull Request #114 · techlab-innov/llmtrace

epappas · 2026-04-22T13:01:05Z

Summary

First concrete step of the E2E adversarial test framework (umbrella #91).

Pre-flight audit (Loop E2E-L1) identified one real gap: the proxy always generates a server-side trace_id at request entry and never echoes it, so the harness can't correlate per-request observations (findings, metrics deltas, judge verdicts) with scenarios.
This PR closes that gap (Loop E2E-L1a) with a minimal, typed change and unblocks Loops E2E-L4 (metrics-delta observer) and E2E-L5 (judge verdict collector).

Changes

`crates/llmtrace-proxy/src/proxy.rs`

- New `TRACE_ID_HEADER` const (`"x-llmtrace-trace-id"`) and a new `extract_or_generate_trace_id(&HeaderMap) -> Uuid` helper. The helper reads the inbound header, trims whitespace, and parses the value as a `Uuid`. It falls back to `Uuid::new_v4()` when the header is missing, non-UTF-8, or unparseable, which preserves current behaviour when no header is sent.
- `proxy_handler` calls the helper in place of the unconditional `Uuid::new_v4()`.
- Every response path stamps `X-LLMTrace-Trace-Id: <uuid>`:
  - Success path builder (set after the upstream-header copy loop so we win on conflict)
  - `error_response(status, message, trace_id)`: new required param; 4 call sites updated
  - `rate_limit_response(…, trace_id)`: new required param
  - `cap_rejected_response(…, trace_id)`: new required param
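The honor-or-generate rule described above can be modelled in a few lines. This is a behavioural sketch in Python (the harness language), not the Rust helper from this PR: it assumes header names arrive lower-cased and values arrive as raw bytes, and Python's `uuid.UUID` parser may accept slightly different textual forms than the Rust `Uuid` parser.

```python
import uuid
from typing import Mapping

# Header name taken from the PR description.
TRACE_ID_HEADER = "x-llmtrace-trace-id"


def extract_or_generate_trace_id(headers: Mapping[str, bytes]) -> uuid.UUID:
    """Return the client-supplied trace id, or a fresh v4 UUID as a fallback.

    Assumption: `headers` maps lower-cased header names to raw byte values.
    """
    raw = headers.get(TRACE_ID_HEADER)
    if raw is None:
        # Missing header: same behaviour as before the change.
        return uuid.uuid4()
    try:
        # Non-UTF-8 bytes raise UnicodeDecodeError; bad UUID text raises ValueError.
        return uuid.UUID(raw.decode("utf-8").strip())
    except (UnicodeDecodeError, ValueError):
        return uuid.uuid4()
```

The fallback never raises, so a malformed header degrades to the pre-existing server-generated id rather than to an error response.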

`docs/TODO_E2E.md`

New file. Pulls the 10 E2E child issues (#92, "feat(e2e): attack-scenario YAML schema + validator", through #101, "feat(e2e): Rust llmtrace redteam subcommand (stretch)") into a RALPH-style loop breakdown with IDs (E2E-NNN), acceptance criteria, dependency graph, and sequencing options.
Includes the Loop E2E-L1 pre-flight audit findings that motivated this PR, plus two other L1 outcomes:
- `JudgeVerdictQuery { trace_id: Option<Uuid>, .. }` already exists, so the L5 plan is trimmed: no new `find_by_trace_id` trait method is needed because `query_verdicts` suffices.
- Default `shutdown.timeout_seconds = 30`; the L3 e2e fixture should override it to 5 so SIGTERM teardown fits the harness's 10-second budget.
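To illustrate why the shutdown timeout has to sit below the harness budget, here is a minimal sketch of a SIGTERM-then-wait teardown. The function name `stop_process` and the hard-kill fallback are assumptions for illustration; the actual fixture code is not shown in this PR.

```python
import subprocess


def stop_process(proc: subprocess.Popen, budget_seconds: float = 10.0) -> int:
    """Ask a child process to exit, then wait up to `budget_seconds`.

    Hypothetical helper: if the child's own graceful-shutdown timeout exceeds
    the budget, the wait expires and the process is killed instead.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort so no zombie outlives the test session
        return proc.wait()
```

With a 30-second server-side timeout and a 10-second budget, a slow drain would always hit the kill branch; a 5-second timeout leaves headroom for a clean exit.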

Test plan

- `cargo test -p llmtrace --lib proxy::`: 27/27 pass (4 new trace-id tests + 2 updated response-shape tests + 21 pre-existing)
- `cargo test -p llmtrace --lib`: 569/569 pass, no regressions
- `cargo clippy -p llmtrace --lib -- -D warnings`: clean
- `cargo fmt --check -p llmtrace`: clean

New tests

- `test_extract_or_generate_trace_id_honors_valid_inbound`: a valid UUID round-trips
- `test_extract_or_generate_trace_id_tolerates_surrounding_whitespace`
- `test_extract_or_generate_trace_id_generates_when_missing`: empty headers yield a fresh v4 on each call
- `test_extract_or_generate_trace_id_generates_when_unparseable`: garbage input yields a fresh v4
- `test_error_response_format`: extended to assert the echoed `x-llmtrace-trace-id`
- `test_cap_rejected_response_format`: extended to assert the echoed `x-llmtrace-trace-id`

Unblocks

Loop E2E-L4 (E2E-035) — metrics-delta observer can now scope counters to a client-supplied trace_id.
Loop E2E-L5 (E2E-045) — judge verdict collector can poll by the same trace_id it supplied on the request.
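From the harness side, the correlation this unblocks might look like the sketch below. The stub server is a hypothetical stand-in for the proxy that only echoes the header; the function names and the `/v1/chat/completions` path used for the request are illustrative assumptions, and the real harness may use a different HTTP client.

```python
import threading
import urllib.request
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

TRACE_ID_HEADER = "X-LLMTrace-Trace-Id"


class EchoStub(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the proxy: echoes the inbound trace id."""

    def do_POST(self) -> None:
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        trace_id = self.headers.get(TRACE_ID_HEADER) or str(uuid.uuid4())
        body = b"{}"
        self.send_response(200)
        self.send_header(TRACE_ID_HEADER, trace_id)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:  # keep test output quiet
        pass


def fire_scenario(base_url: str, payload: bytes) -> Tuple[uuid.UUID, uuid.UUID]:
    """Send one request with a client-chosen trace id; return (sent, echoed)."""
    sent = uuid.uuid4()
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=payload,
        headers={TRACE_ID_HEADER: str(sent), "Content-Type": "application/json"},
        method="POST",
    )
    # An empty ProxyHandler avoids routing localhost traffic via env proxies.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        echoed = uuid.UUID(response.headers[TRACE_ID_HEADER])
    return sent, echoed


def run_demo() -> Tuple[uuid.UUID, uuid.UUID]:
    """Start the stub on a free port, fire one request, and shut down."""
    server = HTTPServer(("127.0.0.1", 0), EchoStub)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fire_scenario(f"http://127.0.0.1:{server.server_port}", b"{}")
    finally:
        server.shutdown()
        server.server_close()
```

Because the id is chosen before the request is sent, the same value can key later lookups (metrics deltas, verdict polling) without parsing anything out of the response body.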

Backward compatibility

Non-breaking for external callers: if no X-LLMTrace-Trace-Id header is sent, behaviour is identical to before (server-generated UUID).
New response header is additive; existing x-llmtrace-flagged / x-llmtrace-findings headers unchanged.
Private helper signatures (error_response, rate_limit_response, cap_rejected_response) gained a required trace_id parameter. These are internal to the crate; no public API change.

…04) (#117)

* chore(deps): bump rustls-webpki 0.103.12 -> 0.103.13 (RUSTSEC-2026-0104)

RUSTSEC-2026-0104: reachable panic in certificate revocation list parsing in rustls-webpki <0.103.13. The advisory was published after main's last successful CI run (2026-04-21 21:55), so all currently open PRs (#114, #115, #116) inherit the audit failure.

`cargo update -p rustls-webpki` resolves to the patch release that contains the fix. No source changes; the bump is transitive via the rustls / reqwest / quinn dependency chain.

Verification:
- cargo audit: exit 0 (only unmaintained-crate warnings remain; no vulnerabilities)
- cargo build --workspace: ok

Unblocks the Security Audit job on #114, #115, #116.

* chore: retrigger CI (previous Clippy hung 60 min)

The E2E test framework (issue #91) needs a client-controlled correlation id to attribute per-request observations (findings, metrics deltas, judge verdicts) to scenarios. Today the proxy always generates a fresh v4 trace_id at request entry and never echoes it, so the harness can't correlate.

Changes to crates/llmtrace-proxy/src/proxy.rs:
- New TRACE_ID_HEADER const ("x-llmtrace-trace-id") and extract_or_generate_trace_id(&HeaderMap) -> Uuid helper that reads the inbound header, parses it as a Uuid (whitespace-tolerant), and falls back to Uuid::new_v4() when the header is missing or unparseable.
- proxy_handler uses the helper in place of Uuid::new_v4().
- Every response path echoes TRACE_ID_HEADER: the success builder, error_response, rate_limit_response, and cap_rejected_response each take a trace_id parameter and stamp the header.
- 4 new unit tests cover the helper; 2 updated response-shape tests assert the echoed header.

Also commits docs/TODO_E2E.md with the Loop E2E-L1 pre-flight audit findings that motivated this change (trace-id was the one real gap) and the L1a checklist marked complete.

Part of the E2E adversarial test framework breakdown in TODO_E2E.md. Unblocks Loops E2E-L4 (metrics observer) and E2E-L5 (verdict collector).

Tests:
- cargo test -p llmtrace --lib proxy:: 27/27 ok
- cargo test -p llmtrace --lib: 569/569 ok
- cargo clippy -p llmtrace --lib -- -D warnings: clean
- cargo fmt --check -p llmtrace: clean

…attack detection (#97) (#122)

* feat(e2e): attack-scenario schema, validator, and 3 examples (#92)

Lock the YAML contract every e2e attack scenario follows. Blocks every other loop in the framework tracked under #91.

New files:
- benchmarks/attacks/schema.json: JSON Schema (Draft 2020-12). Closed enums for family (10 values), proxy_outcome, recommended_action, and severity. Required fields: id, source, family, prompt, expected. The judge_verdict block is optional so judge-disabled runs work. Uses additionalProperties: false at every object level so typos are rejected loudly.
- benchmarks/attacks/SCHEMA.md: human-readable reference covering top-level fields, every enum with semantics, expectation comparators, skip block contract, and what the schema does NOT validate (out-of-scope to triage loops).
- benchmarks/attacks/{prompt_injection,over_defense,encoding_evasion}/*.yaml: 3 hand-written canonical examples that double as schema documentation. All tagged 'pr-gate' so they exercise the L9 subset.
- scripts/e2e/validate_scenarios.py: walks benchmarks/attacks/**/*.yaml, validates each against schema.json, detects duplicate ids across the corpus, prints per-file summary, exits non-zero on any failure. Functions all under 50 LOC.
- requirements-e2e.txt: pinned jsonschema + PyYAML.

Modified:
- .github/workflows/ci.yml: new e2e-validate-scenarios job. No Rust toolchain, runs on every push/PR.

Verification:
- python3 scripts/e2e/validate_scenarios.py: 3/3 valid, exit 0
- failure-path sanity (exit 1 + actionable messages):
  * bad enum value
  * missing required field
  * unparseable YAML
  * duplicate ids across files
- python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))": OK

* feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)

Loop E2E-L3 of the e2e adversarial test framework (umbrella #91).
Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under benchmarks/attacks/ at it, and asserts the per-scenario expected.proxy_outcome.at_* constraints. Asserts proxy outcome only; metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6.

New files:
- tests/e2e/conftest.py: session-scoped fixtures:
  * proxy_config_path: copies the e2e config to a temp file.
  * mock_upstream: free-port FastAPI subprocess, /health-gated.
  * proxy: free-port llmtrace-proxy subprocess wired to the mock via LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_* env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then target/release/, then target/debug/.
  * scenarios: walks benchmarks/attacks/, parametrises tests by id; --family / --tag CLI filters; respects skip blocks.
  * Hard guard at collection time that rejects pytest-xdist (-n) because counter-diff observability (L4) requires serial execution.
  * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for both subprocesses; verified no zombie processes remain.
- tests/e2e/test_cascade.py: first parametrised test. Outcome classifier maps response to allow/warn/block heuristically (refined by L6 DSL later). Failure messages include scenario id, expected vs observed, status, flagged header, and trace_id.
- tests/e2e/mock_upstream.py: FastAPI canned OpenAI-compatible /v1/chat/completions response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides, not on upstream content.
- tests/e2e/fixtures/config-e2e.yaml: judge OFF base config.
- tests/e2e/fixtures/config-e2e-judge.yaml: judge cascade ON, slow tier null (matches the cascade-null-slow PR-gate matrix dimension).

  Both configs:
  * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so SIGTERM teardown completes inside the harness's 10 s budget).
  * auth.enabled: false (no API-key bootstrap needed in tests).
  * enforcement.mode: flag, min_severity: Low. Surfaces findings as response headers (x-llmtrace-flagged) so the harness can observe them. Default mode is `log`, which detects findings but never surfaces them; the harness needs the response-header signal.
- tests/e2e/README.md: how to run, filter, inspect logs, and where things live.
- pytest.ini: repo-local config so CI doesn't depend on a parent pyproject.toml. Declares `serial` and `pr_gate` markers.

Modified:
- requirements-e2e.txt: pinned pytest, requests, prometheus-client, fastapi, uvicorn (in addition to L2's jsonschema + PyYAML).
- .gitignore: tests/e2e/.logs/ (per-session subprocess logs) and .venv-e2e/ (local virtualenv).

Verification:
- pytest tests/e2e/ -v: 3/3 example scenarios pass in ~25s.
  * dan-classic-001 → warn (regex flagged 3 findings).
  * xstest-violence-question-001 → allow (correctly NOT flagged).
  * base64-command-001 → warn (regex flagged 4 findings).
- pgrep llmtrace-proxy / mock_upstream after exit → empty.
- Failure-message quality verified by tightening one xstest expectation: message includes id, expected vs observed, status, flagged value, trace_id.
- pytest --collect-only confirms rootdir = repo root and pytest.ini is picked up (no parent pyproject.toml leakage).

Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt.

* feat(e2e): metrics-delta + trace-id observer (#94)

Loop E2E-L4 of the e2e adversarial test framework (umbrella #91). Per-scenario observability of the /metrics surface so harness assertions can talk in terms of "this scenario produced N findings of type X". This is the foundation Loops E2E-L5 (judge verdicts) and L6 (expectation DSL) build on.

New files:
- tests/e2e/observer.py: MetricsSnapshot + helpers.
  * fetch(url) / parse(text): read the Prometheus exposition format into a flat (sample_name, labels) -> value dict.
  * diff(before) / series(name, labels) / __contains__: counter subtraction, gauge "latest wins", histogram _count/_sum/_bucket diffs, subset-label matching that sums across unspecified labels.
  * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created` so queries with or without the _total suffix both work.
  * render_nonzero() / render_assertion_context() produce the deterministic pretty-print that gets attached to every assertion failure message for triage.
  * fetch_after_until_settled(): LLMTrace records security findings in a background task that outlives the synchronous upstream response, so a naive MetricsSnapshot.fetch misses them. This helper polls /metrics until the delta plateaus across two reads (or a 10 s timeout).
  * collect_finding_types(): extracts observed finding_type labels from the llmtrace_security_findings_total delta for the findings_include assertion.
- tests/e2e/test_observer_unit.py: 18 unit tests covering parser, counter/gauge/histogram diffs, label-subset matching, render determinism, and the diff-self invariant. All tests use recorded /metrics text fixtures; no live proxy.
- tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt: hand-built fixtures exercising counter+gauge+histogram with overlapping and disjoint label sets.

Modified:
- tests/e2e/test_cascade.py: after each scenario fires, diffs /metrics (polling through fetch_after_until_settled when the scenario expects findings) and:
  * asserts every declared expected.findings_include finding_type appears in the delta,
  * attaches render_assertion_context(delta) to every failure message so the first reader sees what LLMTrace actually recorded, not just the expected-vs-observed summary,
  * marks test_scenario with @pytest.mark.serial (E2E-034).
- benchmarks/attacks/prompt_injection/dan-classic-001.yaml, benchmarks/attacks/encoding_evasion/base64-command-001.yaml: findings_include updated to match the actual finding_type values the ensemble emits (jailbreak for DAN, encoding_attack for the base64 payload). Inline comment explains the rationale so future authors pick stable labels over detector-specific ones.

Verification:
- python3 -m pytest tests/e2e/ -v: 21/21 pass (~20s)
  * 3 scenarios integration (dan + xstest + base64)
  * 18 observer unit tests
- Failure-message quality verified by injecting findings_include: [prompt_injection] into the benign xstest scenario: message includes scenario id, expected types, observed types, trace_id, and the full non-zero metrics delta.
- No regression on the L3 proxy_outcome assertions.

E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every request) already landed in L3 and are unchanged here.

Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116).

* feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95)

Loop E2E-L5 of the e2e adversarial test framework (umbrella #91). Stitches the async judge verdict surface back into per-scenario assertions so the harness can declare expected.judge_verdict.* in the scenario YAML and have it verified against the persisted verdict.

## Rust

- New `ServerConfig { debug_endpoints: bool }` (default false) on `ProxyConfig`. Production proxies must not enable this: debug routes return verdicts un-auth-gated by trace_id.
- New `crates/llmtrace-proxy/src/debug.rs` with `verdict_by_trace_id_handler`. Thin wrapper over the existing `JudgeVerdictStore::query_verdicts(JudgeVerdictQuery { trace_id, .. })` trait method (no new trait surface needed; Loop E2E-L1 audit finding E2E-003 already confirmed `JudgeVerdictQuery.trace_id` exists). Returns 200 + verdict JSON, 404 when absent or flag off, 400 on non-UUID query param.
- `build_router` registers `GET /debug/judge/verdicts` only when `server.debug_endpoints: true`. When the flag is off the route is not mounted at all (axum's not-found handler returns 404). Operator gets a loud `WARN` log on boot when the flag is on.
- 4 new Rust integration tests in `main.rs::tests`:
  * 200 + verdict JSON when flag on + verdict exists
  * 404 when flag on but no verdict for trace
  * 404 when flag off (proves the route is NOT mounted)
  * 400 on non-UUID trace_id

## Python harness

- `tests/e2e/observer.py`:
  * `poll_judge_verdict(base_url, trace_id, timeout=10s)`: 250 ms polling against `/debug/judge/verdicts`. Returns the verdict dict on 200, None on timeout, raises HTTPError on other 4xx/5xx.
  * `shadow_would_block_count(delta, category=…, recommended_action=…)`: sums `llmtrace_judge_shadow_would_block_total` deltas; returns 0.0 when the metric is absent so callers can treat absence and zero symmetrically.
  * `judge_backend_errored(delta)`: True iff `llmtrace_judge_requests_total{status="backend_error"}` ticked in the window.
- `tests/e2e/test_cascade.py`: wires `expected.judge_verdict.*` into the per-scenario assertions:
  * `is_threat`, `category`, `recommended_action.at_least/at_most`.
  * On verdict-not-found + judge_backend_errored=True: pytest.skip with explanation (degraded mode: provider/upstream flake, not LLMTrace regression).
  * On verdict-not-found + no backend_error: pytest.fail with the full metrics-delta context attached.
- `tests/e2e/fixtures/config-e2e-judge.yaml`: flips `server.debug_endpoints: true`, `action_router.enabled: true`, and adds `judge_route` to `default_actions` so the cascade fast tier actually runs and verdicts get persisted. Without these the worker spawns but never receives requests.
- `tests/e2e/conftest.py`: default config switched from `config-e2e.yaml` (judge OFF) to `config-e2e-judge.yaml` so the cascade-null-slow PR-gate matrix dimension is exercised on every run.
- 7 new observer unit tests covering shadow_would_block_count (filtered + unfiltered), judge_backend_errored (positive/negative/absent), and the empty-snapshot edge case. Total observer suite now 25 tests; integration suite 3.

## Docs

- `docs/guides/e2e-testing.md`: full operator guide covering quick start, architecture diagram, debug endpoint contract with the production-safety call-out, judge collector lifecycle, degraded-mode handling, shadow-mode signal, failure-message anatomy, file map.

## Verification

- cargo test -p llmtrace --bin llmtrace-proxy: 11/11 ok
- cargo clippy --workspace -- -D warnings: clean
- cargo fmt --check: clean
- pytest tests/e2e/ -v: 28/28 ok in ~45s

L1 audit corrections honoured:
- E2E-040, E2E-041 (proposed `find_by_trace_id` trait method) remain SUPERSEDED; `query_verdicts` is the entry point used here.

Stacked on feat/issue-94-e2e-metrics-observer (#118). The whole stack (#115 → #116 → #118 → this) was rebased onto main after #114 (L1a) and #117 (rustls bump) merged earlier today.

* feat(e2e): expectation DSL + assertion helpers (#96)

Loop E2E-L6 of the e2e adversarial test framework (umbrella #91). Formalises the per-comparator if-blocks that L3/L4/L5 accumulated in test_cascade.py into a small, pure DSL. The orchestrator returns one AssertionResult per declared comparator; the test wrapper aggregates into a pytest verdict. Makes the harness extensible for L7 (50-scenario seed corpus) and L8 (upstream-fell-for-it detector) without growing test_cascade.py further.

New files:
- tests/e2e/expect.py: the DSL.
  * `Severity`, `ProxyOutcome`, `RecommendedAction` IntEnums (E2E-053). Total ordering with parse() / .label round-trip and explicit rejection of unknown labels.
  * `AssertionResult { comparator, passed, soft, message, fields }` dataclass (E2E-050).
  * One `_compare_*` helper per supported `expected.*` key (E2E-052):
    - proxy_outcome.at_least / at_most
    - findings_include
    - findings_min_severity (NEW: schema declared it but no loop had wired it up; uses the peak severity across the per-scenario delta of llmtrace_security_findings_total)
    - judge_verdict.is_threat
    - judge_verdict.category
    - judge_verdict.recommended_action.at_least / at_most
  * `assert_scenario(scenario, response, delta, verdict, judge_degraded)` orchestrator (E2E-051). Pure: no I/O, all inputs handed in.
  * `classify_proxy_outcome(response)` returning ProxyOutcome (the enum form of L3's allow/warn/block heuristic).
  * `render_assertion_summary(results)` for failure-message rendering with `[ok]`/`[soft]`/`[FAIL]` per-row markers.
  * Unknown top-level OR judge-block keys produce explicit failure rows so typos in scenario YAML cannot silently skip an assertion.
- tests/e2e/test_expect_unit.py: 26 unit tests (E2E-054), every comparator with at least one passing + one failing case so failure-message wording is exercised. Synthesises responses, deltas, and verdicts in-process; no proxy boot, no network.

Modified:
- tests/e2e/test_cascade.py: collapsed from 196 lines of per-comparator if-blocks to 117 lines that wire I/O into assert_scenario(). Hard failures → pytest.fail with the assertion summary + metrics-delta context. All-soft failures → pytest.skip (degraded judge tier; not LLMTrace's fault).
- docs/guides/e2e-testing.md: comparator reference table, per-comparator semantics, soft-vs-hard aggregation rules, and the "adding a new comparator" three-step recipe.

Verification:
- pytest tests/e2e/: 54/54 ok in ~40s
  * 3 integration scenarios (dan, xstest, base64)
  * 25 observer unit tests (unchanged)
  * 26 expect.py unit tests (new)
- Failure-message quality verified by deliberately tightening the benign xstest scenario with bogus findings_include + an aggressive findings_min_severity.
  The resulting pytest.fail message lists every comparator with a marker, the missing finding types, the observed peak severity, and the full non-zero metrics delta.

E2E-053 Severity IntEnum is Info < Low < Medium < High < Critical, matching the proxy's `SecuritySeverity`.

Stacked on feat/issue-95-e2e-judge-verdict-collector (#119).

* feat(security): add rot13/leetspeak encoding-attack detection; triage 8 corpus gaps

RegexSecurityAnalyzer.detect_injection_patterns now also fires detect_rot13_injection and detect_leetspeak_injection, both emitting finding_type="encoding_attack". The jailbreak detector already handled these via "jailbreak" findings; the regex analyzer was only doing base64.

Three new unit tests validate the new detectors in isolation. All 1025 existing security tests pass unchanged.

Corpus (50 scenarios):
- enc-003 (rot13) and enc-004 (leetspeak) now pass end-to-end.
- 8 harmbench/jailbreakbench scenarios relaxed to proxy_outcome.at_most: warn with known-gap annotation. These are harmful-content requests, not injection attacks; the proxy is an injection detector, not a content moderator.

Also: pytest.ini gains pythonpath=. so CI does not need PYTHONPATH export.
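The decode-then-match idea behind a rot13 detector can be sketched as follows. This is an illustrative Python model only: the real detector lives in the Rust `RegexSecurityAnalyzer`, and the pattern list, function body, and returned dict shape here are assumptions rather than the project's actual rules.

```python
import codecs
import re
from typing import Dict, Optional

# Hypothetical injection phrases; the real analyzer's patterns are not shown
# in this PR.
_INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]


def detect_rot13_injection(text: str) -> Optional[Dict[str, str]]:
    """Return a finding if the rot13-decoded text matches an injection phrase.

    Plain-text injections are deliberately not matched here: decoding them
    yields gibberish, so this check only fires on rot13-encoded payloads.
    """
    decoded = codecs.decode(text, "rot_13")
    for pattern in _INJECTION_PATTERNS:
        match = pattern.search(decoded)
        if match:
            return {"finding_type": "encoding_attack", "matched": match.group(0)}
    return None
```

Reporting the result under the stable `encoding_attack` label, rather than a detector-specific one, is what lets scenario files assert on it via `findings_include`.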

This was referenced Apr 22, 2026

feat(e2e): attack-scenario schema, validator, and 3 examples (#92) #115

Closed

feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93) #116

Closed

chore(deps): bump rustls-webpki 0.103.12 -> 0.103.13 (RUSTSEC-2026-0104) #117

Merged

epappas force-pushed the feat/issue-91-e2e-trace-id-header branch from 2e014cc to 8edd8bd Compare April 22, 2026 17:16

epappas merged commit 82539bb into main Apr 22, 2026
13 checks passed

epappas deleted the feat/issue-91-e2e-trace-id-header branch April 22, 2026 18:56

epappas mentioned this pull request Apr 22, 2026

feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95) #119

Closed

5 tasks

epappas mentioned this pull request Apr 23, 2026

tracking: E2E adversarial test framework for LLMTrace #91

Closed

10 tasks

epappas merged 1 commit into main from feat/issue-91-e2e-trace-id-header
