Skip to content

feat(proxy): honor + echo X-LLMTrace-Trace-Id header (#91 E2E-L1a)#114

Merged
epappas merged 1 commit into
mainfrom
feat/issue-91-e2e-trace-id-header
Apr 22, 2026
Merged

feat(proxy): honor + echo X-LLMTrace-Trace-Id header (#91 E2E-L1a)#114
epappas merged 1 commit into
mainfrom
feat/issue-91-e2e-trace-id-header

Conversation

@epappas
Copy link
Copy Markdown
Collaborator

@epappas epappas commented Apr 22, 2026

Summary

First concrete step of the E2E adversarial test framework (umbrella #91).

  • Pre-flight audit (Loop E2E-L1) identified one real gap: the proxy always generates a server-side trace_id at request entry and never echoes it, so the harness can't correlate per-request observations (findings, metrics deltas, judge verdicts) with scenarios.
  • This PR closes that gap (Loop E2E-L1a) with a minimal, typed change and unblocks Loops E2E-L4 (metrics-delta observer) and E2E-L5 (judge verdict collector).

Changes

crates/llmtrace-proxy/src/proxy.rs

  • pub const TRACE_ID_HEADER = "x-llmtrace-trace-id" and a new extract_or_generate_trace_id(&HeaderMap) -> Uuid helper. Reads the inbound header, trims whitespace, parses as Uuid. Falls back to Uuid::new_v4() on missing/non-UTF-8/unparseable — preserves current behaviour when no header is sent.
  • proxy_handler calls the helper in place of the unconditional Uuid::new_v4().
  • Every response path stamps X-LLMTrace-Trace-Id: <uuid>:
    • Success path builder (set after the upstream-header copy loop so we win on conflict)
    • error_response(status, message, trace_id) — new required param; 4 call sites updated
    • rate_limit_response(…, trace_id) — new required param
    • cap_rejected_response(…, trace_id) — new required param

docs/TODO_E2E.md

  • New file. Pulls the 10 E2E child issues (feat(e2e): attack-scenario YAML schema + validator #92feat(e2e): Rust llmtrace redteam subcommand (stretch) #101) into a RALPH-style loop breakdown with IDs (E2E-NNN), acceptance criteria, dependency graph, and sequencing options.
  • Includes the Loop E2E-L1 pre-flight audit findings that motivated this PR, plus two other L1 outcomes:
    • JudgeVerdictQuery { trace_id: Option<Uuid>, .. } already exists, so the L5 plan is trimmed (no new find_by_trace_id trait method needed — query_verdicts suffices).
    • Default shutdown.timeout_seconds = 30; L3 e2e fixture should override to 5 so SIGTERM teardown fits the harness's 10-second budget.

Test plan

  • cargo test -p llmtrace --lib proxy::27/27 pass (4 new trace-id tests + 2 updated response-shape tests + 21 pre-existing)
  • cargo test -p llmtrace --lib569/569 pass, no regressions
  • cargo clippy -p llmtrace --lib -- -D warnings — clean
  • cargo fmt --check -p llmtrace — clean

New tests

  • test_extract_or_generate_trace_id_honors_valid_inbound — valid uuid round-trips
  • test_extract_or_generate_trace_id_tolerates_surrounding_whitespace
  • test_extract_or_generate_trace_id_generates_when_missing — empty headers → fresh v4 each call
  • test_extract_or_generate_trace_id_generates_when_unparseable — garbage → fresh v4
  • test_error_response_format extended to assert echoed x-llmtrace-trace-id
  • test_cap_rejected_response_format extended to assert echoed x-llmtrace-trace-id

Unblocks

  • Loop E2E-L4 (E2E-035) — metrics-delta observer can now scope counters to a client-supplied trace_id.
  • Loop E2E-L5 (E2E-045) — judge verdict collector can poll by the same trace_id it supplied on the request.

Backward compatibility

  • Non-breaking for external callers: if no X-LLMTrace-Trace-Id header is sent, behaviour is identical to before (server-generated UUID).
  • New response header is additive; existing x-llmtrace-flagged / x-llmtrace-findings headers unchanged.
  • Private helper signatures (error_response, rate_limit_response, cap_rejected_response) gained a required trace_id parameter. These are internal to the crate; no public API change.

epappas added a commit that referenced this pull request Apr 22, 2026
…04) (#117)

* chore(deps): bump rustls-webpki 0.103.12 -> 0.103.13 (RUSTSEC-2026-0104)

RUSTSEC-2026-0104: reachable panic in certificate revocation list
parsing in rustls-webpki <0.103.13. The advisory was published after
main's last successful CI run (2026-04-21 21:55), so all currently
open PRs (#114, #115, #116) inherit the audit failure.

`cargo update -p rustls-webpki` resolves to the patch release that
contains the fix. No source changes; the bump is transitive via
the rustls / reqwest / quinn dependency chain.

Verification:

  cargo audit                 exit 0 (only unmaintained-crate warnings
                              remain; no vulnerabilities)
  cargo build --workspace     ok

Unblocks the Security Audit job on #114, #115, #116.

* chore: retrigger CI (previous Clippy hung 60 min)
The E2E test framework (issue #91) needs a client-controlled correlation
id to attribute per-request observations (findings, metrics deltas,
judge verdicts) to scenarios. Today the proxy always generates a
fresh v4 trace_id at request entry and never echoes it, so the harness
can't correlate.

Changes to crates/llmtrace-proxy/src/proxy.rs:

- New TRACE_ID_HEADER const ("x-llmtrace-trace-id") and
  extract_or_generate_trace_id(&HeaderMap) -> Uuid helper that reads
  the inbound header, parses as Uuid (whitespace-tolerant), falls
  back to Uuid::new_v4() on missing/unparseable.
- proxy_handler uses the helper in place of Uuid::new_v4().
- Every response path echoes TRACE_ID_HEADER: the success builder,
  error_response, rate_limit_response, and cap_rejected_response each
  take a trace_id parameter and stamp the header.
- 4 new unit tests cover the helper; 2 updated response-shape tests
  assert the echoed header.

Also commits docs/TODO_E2E.md with the Loop E2E-L1 pre-flight audit
findings that motivated this change (trace-id was the one real gap)
and the L1a checklist marked complete.

Part of the E2E adversarial test framework breakdown in TODO_E2E.md.
Unblocks Loops E2E-L4 (metrics observer) and E2E-L5 (verdict collector).

Tests:
- cargo test -p llmtrace --lib proxy::      27/27 ok
- cargo test -p llmtrace --lib              569/569 ok
- cargo clippy -p llmtrace --lib -D warn    clean
- cargo fmt --check -p llmtrace             clean
@epappas epappas force-pushed the feat/issue-91-e2e-trace-id-header branch from 2e014cc to 8edd8bd Compare April 22, 2026 17:16
@epappas epappas merged commit 82539bb into main Apr 22, 2026
13 checks passed
@epappas epappas deleted the feat/issue-91-e2e-trace-id-header branch April 22, 2026 18:56
epappas added a commit that referenced this pull request Apr 23, 2026
…attack detection (#97) (#122)

* feat(e2e): attack-scenario schema, validator, and 3 examples (#92)

Lock the YAML contract every e2e attack scenario follows. Blocks every
other loop in the framework tracked under #91.

New files:

- benchmarks/attacks/schema.json — JSON Schema (Draft 2020-12). Closed
  enums for family (10 values), proxy_outcome, recommended_action, and
  severity. Required fields: id, source, family, prompt, expected.
  judge_verdict block is optional so judge-disabled runs work. Uses
  additionalProperties: false at every object level so typos are
  rejected loudly.
- benchmarks/attacks/SCHEMA.md — human-readable reference: top-level
  fields, every enum with semantics, expectation comparators, skip
  block contract, what the schema does NOT validate (out-of-scope to
  triage loops).
- benchmarks/attacks/{prompt_injection,over_defense,encoding_evasion}/
  *.yaml — 3 hand-written canonical examples that double as schema
  documentation. All tagged 'pr-gate' so they exercise the L9 subset.
- scripts/e2e/validate_scenarios.py — walks benchmarks/attacks/**/
  *.yaml, validates each against schema.json, detects duplicate ids
  across the corpus, prints per-file summary, exits non-zero on any
  failure. Functions all under 50 LOC.
- requirements-e2e.txt — pinned jsonschema + PyYAML.

Modified:

- .github/workflows/ci.yml — new e2e-validate-scenarios job. No Rust
  toolchain, runs on every push/PR.

Verification:

- python3 scripts/e2e/validate_scenarios.py — 3/3 valid, exit 0
- failure-path sanity (exit 1 + actionable messages):
  * bad enum value
  * missing required field
  * unparseable YAML
  * duplicate ids across files
- python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))" — OK

* feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)

Loop E2E-L3 of the e2e adversarial test framework (umbrella #91).

Boots the LLMTrace proxy as a subprocess against an in-process FastAPI
mock upstream, fires every scenario YAML under benchmarks/attacks/ at
it, and asserts the per-scenario expected.proxy_outcome.at_* constraints.
Asserts proxy outcome only — metrics-delta and judge-verdict observability
land in Loops E2E-L4 / L5 / L6.

New files:

- tests/e2e/conftest.py — session-scoped fixtures:
  * proxy_config_path: copies the e2e config to a temp file.
  * mock_upstream: free-port FastAPI subprocess, /health-gated.
  * proxy: free-port llmtrace-proxy subprocess wired to the mock via
    LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_*
    env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then
    target/release/, then target/debug/.
  * scenarios: walks benchmarks/attacks/, parametrises tests by id;
    --family / --tag CLI filters; respects skip blocks.
  * Hard guard at collection time that rejects pytest-xdist (-n)
    because counter-diff observability (L4) requires serial execution.
  * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for
    both subprocesses; verified no zombie processes remain.
- tests/e2e/test_cascade.py — first parametrised test. Outcome
  classifier maps response to allow/warn/block heuristically (refined
  by L6 DSL later). Failure messages include scenario id, expected vs
  observed, status, flagged header, and trace_id.
- tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible
  /v1/chat/completions response. Always returns the same assistant
  message; PR-gate tests assert on what LLMTrace decides, not on
  upstream content.
- tests/e2e/fixtures/config-e2e.yaml — judge OFF base config.
- tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow
  tier null (matches the cascade-null-slow PR-gate matrix dimension).
  Both configs:
  * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so
    SIGTERM teardown completes inside the harness's 10 s budget).
  * auth.enabled: false (no API-key bootstrap needed in tests).
  * enforcement.mode: flag, min_severity: Low — surfaces findings as
    response headers (x-llmtrace-flagged) so the harness can observe
    them. Default mode is `log`, which detects findings but never
    surfaces them; the harness needs the response-header signal.
- tests/e2e/README.md — how to run, filter, inspect logs, and where
  things live.
- pytest.ini — repo-local config so CI doesn't depend on a parent
  pyproject.toml. Declares `serial` and `pr_gate` markers.

Modified:

- requirements-e2e.txt — pinned pytest, requests, prometheus-client,
  fastapi, uvicorn (in addition to L2's jsonschema + PyYAML).
- .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and
  .venv-e2e/ (local virtualenv).

Verification:

- pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s.
  * dan-classic-001 → warn (regex flagged 3 findings).
  * xstest-violence-question-001 → allow (correctly NOT flagged).
  * base64-command-001 → warn (regex flagged 4 findings).
- pgrep llmtrace-proxy / mock_upstream after exit → empty.
- Failure-message quality verified by tightening one xstest
  expectation: message includes id, expected vs observed, status,
  flagged value, trace_id.
- pytest --collect-only confirms rootdir = repo root and pytest.ini
  is picked up (no parent pyproject.toml leakage).

Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because
this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt.

* feat(e2e): metrics-delta + trace-id observer (#94)

Loop E2E-L4 of the e2e adversarial test framework (umbrella #91).

Per-scenario observability of the /metrics surface so harness
assertions can talk in terms of "this scenario produced N findings
of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6
(expectation DSL) build on.

New files:

- tests/e2e/observer.py — MetricsSnapshot + helpers.
  * fetch(url) / parse(text): read the Prometheus exposition format
    into a flat (sample_name, labels) -> value dict.
  * diff(before) / series(name, labels) / __contains__: counter
    subtraction, gauge "latest wins", histogram _count/_sum/_bucket
    diffs, subset-label matching that sums across unspecified labels.
  * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created`
    so queries with or without the _total suffix both work.
  * render_nonzero() / render_assertion_context() produce the
    deterministic pretty-print that gets attached to every assertion
    failure message for triage.
  * fetch_after_until_settled(): LLMTrace records security findings
    in a background task that outlives the synchronous upstream
    response, so a naive MetricsSnapshot.fetch misses them. This
    helper polls /metrics until the delta plateaus across two reads
    (or a 10 s timeout).
  * collect_finding_types(): extracts observed finding_type labels
    from the llmtrace_security_findings_total delta for the
    findings_include assertion.

- tests/e2e/test_observer_unit.py — 18 unit tests covering parser,
  counter/gauge/histogram diffs, label-subset matching, render
  determinism, and the diff-self invariant. All tests use recorded
  /metrics text fixtures; no live proxy.

- tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt
  — hand-built fixtures exercising counter+gauge+histogram with
  overlapping and disjoint label sets.

Modified:

- tests/e2e/test_cascade.py — after each scenario fires, diffs
  /metrics (polling through fetch_after_until_settled when the
  scenario expects findings) and:
  * asserts every declared expected.findings_include finding_type
    appears in the delta,
  * attaches render_assertion_context(delta) to every failure
    message so the first reader sees what LLMTrace actually recorded,
    not just the expected-vs-observed summary.
  * marks test_scenario with @pytest.mark.serial (E2E-034).

- benchmarks/attacks/prompt_injection/dan-classic-001.yaml,
  benchmarks/attacks/encoding_evasion/base64-command-001.yaml —
  findings_include updated to match the actual finding_type values
  the ensemble emits (jailbreak for DAN, encoding_attack for the
  base64 payload). Inline comment explains the rationale so future
  authors pick stable labels over detector-specific ones.

Verification:

- python3 -m pytest tests/e2e/ -v                  21/21 pass (~20s)
  * 3 scenarios integration (dan + xstest + base64)
  * 18 observer unit tests
- Failure-message quality verified by injecting findings_include:
  [prompt_injection] into the benign xstest scenario — message
  includes scenario id, expected types, observed types, trace_id,
  and the full non-zero metrics delta.
- No regression on the L3 proxy_outcome assertions.

E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every
request) already landed in L3 and are unchanged here.

Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116).

* feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95)

Loop E2E-L5 of the e2e adversarial test framework (umbrella #91).

Stitches the async judge verdict surface back into per-scenario
assertions so the harness can declare expected.judge_verdict.* in the
scenario YAML and have it verified against the persisted verdict.

## Rust

- New `ServerConfig { debug_endpoints: bool }` (default false) on
  `ProxyConfig`. Production proxies must not enable this — debug
  routes return verdicts un-auth-gated by trace_id.

- New `crates/llmtrace-proxy/src/debug.rs` with
  `verdict_by_trace_id_handler`. Thin wrapper over the existing
  `JudgeVerdictStore::query_verdicts(JudgeVerdictQuery { trace_id, .. })`
  trait method (no new trait surface needed — Loop E2E-L1 audit
  finding E2E-003 already confirmed `JudgeVerdictQuery.trace_id`
  exists). Returns 200 + verdict JSON, 404 when absent or flag off,
  400 on non-UUID query param.

- `build_router` registers `GET /debug/judge/verdicts` only when
  `server.debug_endpoints: true`. When the flag is off the route is
  not mounted at all (axum's not-found handler returns 404).
  Operator gets a loud `WARN` log on boot when the flag is on.

- 4 new Rust integration tests in `main.rs::tests`:
  * 200 + verdict JSON when flag on + verdict exists
  * 404 when flag on but no verdict for trace
  * 404 when flag off (proves the route is NOT mounted)
  * 400 on non-UUID trace_id

## Python harness

- `tests/e2e/observer.py`:
  * `poll_judge_verdict(base_url, trace_id, timeout=10s)` — 250 ms
    polling against `/debug/judge/verdicts`. Returns the verdict dict
    on 200, None on timeout, raises HTTPError on other 4xx/5xx.
  * `shadow_would_block_count(delta, category=…, recommended_action=…)`
    — sums `llmtrace_judge_shadow_would_block_total` deltas; returns
    0.0 when the metric is absent so callers can treat absence and
    zero symmetrically.
  * `judge_backend_errored(delta)` — True iff
    `llmtrace_judge_requests_total{status="backend_error"}` ticked
    in the window.

- `tests/e2e/test_cascade.py` — wires `expected.judge_verdict.*` into
  the per-scenario assertions:
  * `is_threat`, `category`, `recommended_action.at_least/at_most`.
  * On verdict-not-found + judge_backend_errored=True: pytest.skip
    with explanation (degraded mode — provider/upstream flake, not
    LLMTrace regression).
  * On verdict-not-found + no backend_error: pytest.fail with the
    full metrics-delta context attached.

- `tests/e2e/fixtures/config-e2e-judge.yaml` — flips
  `server.debug_endpoints: true`, `action_router.enabled: true`, and
  adds `judge_route` to `default_actions` so the cascade fast tier
  actually runs and verdicts get persisted. Without these the worker
  spawns but never receives requests.

- `tests/e2e/conftest.py` — default config switched from
  `config-e2e.yaml` (judge OFF) to `config-e2e-judge.yaml` so the
  cascade-null-slow PR-gate matrix dimension is exercised on every
  run.

- 7 new observer unit tests covering shadow_would_block_count
  (filtered + unfiltered), judge_backend_errored (positive/negative/
  absent), and the empty-snapshot edge case. Total observer suite
  now 25 tests; integration suite 3.

## Docs

- `docs/guides/e2e-testing.md` — full operator guide: quick start,
  architecture diagram, debug endpoint contract with the production-
  safety call-out, judge collector lifecycle, degraded-mode handling,
  shadow-mode signal, failure-message anatomy, file map.

## Verification

- cargo test -p llmtrace --bin llmtrace-proxy   11/11 ok
- cargo clippy --workspace -- -D warnings       clean
- cargo fmt --check                             clean
- pytest tests/e2e/ -v                          28/28 ok in ~45s

L1 audit corrections honoured:
- E2E-040, E2E-041 (proposed `find_by_trace_id` trait method)
  remain SUPERSEDED — `query_verdicts` is the entry point used here.

Stacked on feat/issue-94-e2e-metrics-observer (#118). The whole
stack (#115#116#118 → this) was rebased onto main after
#114 (L1a) and #117 (rustls bump) merged earlier today.

* feat(e2e): expectation DSL + assertion helpers (#96)

Loop E2E-L6 of the e2e adversarial test framework (umbrella #91).

Formalises the per-comparator if-blocks that L3/L4/L5 accumulated in
test_cascade.py into a small, pure DSL. The orchestrator returns one
AssertionResult per declared comparator; the test wrapper aggregates
into a pytest verdict. Makes the harness extensible for L7 (50-scenario
seed corpus) and L8 (upstream-fell-for-it detector) without growing
test_cascade.py further.

New files:

- tests/e2e/expect.py — the DSL.
  * `Severity`, `ProxyOutcome`, `RecommendedAction` IntEnums (E2E-053).
    Total ordering with parse() / .label round-trip and explicit
    rejection of unknown labels.
  * `AssertionResult { comparator, passed, soft, message, fields }`
    dataclass (E2E-050).
  * One `_compare_*` helper per supported `expected.*` key (E2E-052):
      - proxy_outcome.at_least / at_most
      - findings_include
      - findings_min_severity (NEW — schema declared it but no loop
        had wired it up; uses the peak severity across the per-scenario
        delta of llmtrace_security_findings_total)
      - judge_verdict.is_threat
      - judge_verdict.category
      - judge_verdict.recommended_action.at_least / at_most
  * `assert_scenario(scenario, response, delta, verdict, judge_degraded)`
    orchestrator (E2E-051). Pure: no I/O, all inputs handed in.
  * `classify_proxy_outcome(response)` returning ProxyOutcome (the
    enum form of L3's allow/warn/block heuristic).
  * `render_assertion_summary(results)` for failure-message rendering
    with `[ok]`/`[soft]`/`[FAIL]` per-row markers.
  * Unknown top-level OR judge-block keys produce explicit failure
    rows so typos in scenario YAML cannot silently skip an assertion.

- tests/e2e/test_expect_unit.py — 26 unit tests (E2E-054), every
  comparator with at least one passing + one failing case so failure-
  message wording is exercised. Synthesises responses, deltas, and
  verdicts in-process; no proxy boot, no network.

Modified:

- tests/e2e/test_cascade.py — collapsed from 196 lines of
  per-comparator if-blocks to 117 lines that wire I/O into
  assert_scenario(). Hard failures → pytest.fail with the assertion
  summary + metrics-delta context. All-soft failures → pytest.skip
  (degraded judge tier; not LLMTrace's fault).

- docs/guides/e2e-testing.md — comparator reference table,
  per-comparator semantics, soft-vs-hard aggregation rules, and the
  "adding a new comparator" three-step recipe.

Verification:

- pytest tests/e2e/                              54/54 ok in ~40s
  * 3 integration scenarios (dan, xstest, base64)
  * 25 observer unit tests (unchanged)
  * 26 expect.py unit tests (new)
- Failure-message quality verified by deliberately tightening the
  benign xstest scenario with bogus findings_include + an aggressive
  findings_min_severity. The resulting pytest.fail message lists
  every comparator with a marker, the missing finding types, the
  observed peak severity, and the full non-zero metrics delta.

E2E-053 Severity IntEnum is Info < Low < Medium < High < Critical
matching the proxy's `SecuritySeverity`.

Stacked on feat/issue-95-e2e-judge-verdict-collector (#119).

* feat(security): add rot13/leetspeak encoding-attack detection; triage 8 corpus gaps

RegexSecurityAnalyzer.detect_injection_patterns now also fires
detect_rot13_injection and detect_leetspeak_injection, both emitting
finding_type="encoding_attack". The jailbreak detector already handled
these via "jailbreak" findings; the regex analyzer was only doing base64.

Three new unit tests validate the new detectors in isolation. All 1025
existing security tests pass unchanged.

Corpus (50 scenarios):
- enc-003 (rot13) and enc-004 (leetspeak) now pass end-to-end.
- 8 harmbench/jailbreakbench scenarios relaxed to proxy_outcome.at_most:
  warn with known-gap annotation. These are harmful-content requests, not
  injection attacks; the proxy is an injection detector, not a content
  moderator.

Also: pytest.ini gains pythonpath=. so CI does not need PYTHONPATH export.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant