feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)#116
Closed
epappas wants to merge 1 commit into
Closed
Conversation
3 tasks
epappas
added a commit
that referenced
this pull request
Apr 22, 2026
…04) (#117) * chore(deps): bump rustls-webpki 0.103.12 -> 0.103.13 (RUSTSEC-2026-0104) RUSTSEC-2026-0104: reachable panic in certificate revocation list parsing in rustls-webpki <0.103.13. The advisory was published after main's last successful CI run (2026-04-21 21:55), so all currently open PRs (#114, #115, #116) inherit the audit failure. `cargo update -p rustls-webpki` resolves to the patch release that contains the fix. No source changes; the bump is transitive via the rustls / reqwest / quinn dependency chain. Verification: cargo audit exit 0 (only unmaintained-crate warnings remain; no vulnerabilities) cargo build --workspace ok Unblocks the Security Audit job on #114, #115, #116. * chore: retrigger CI (previous Clippy hung 60 min)
1463e70 to
e9be221
Compare
0e82d54 to
c5a9cac
Compare
3 tasks
e9be221 to
5e5ccac
Compare
…est (#93) Loop E2E-L3 of the e2e adversarial test framework (umbrella #91). Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under benchmarks/attacks/ at it, and asserts the per-scenario expected.proxy_outcome.at_* constraints. Asserts proxy outcome only — metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6. New files: - tests/e2e/conftest.py — session-scoped fixtures: * proxy_config_path: copies the e2e config to a temp file. * mock_upstream: free-port FastAPI subprocess, /health-gated. * proxy: free-port llmtrace-proxy subprocess wired to the mock via LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_* env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then target/release/, then target/debug/. * scenarios: walks benchmarks/attacks/, parametrises tests by id; --family / --tag CLI filters; respects skip blocks. * Hard guard at collection time that rejects pytest-xdist (-n) because counter-diff observability (L4) requires serial execution. * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for both subprocesses; verified no zombie processes remain. - tests/e2e/test_cascade.py — first parametrised test. Outcome classifier maps response to allow/warn/block heuristically (refined by L6 DSL later). Failure messages include scenario id, expected vs observed, status, flagged header, and trace_id. - tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible /v1/chat/completions response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides, not on upstream content. - tests/e2e/fixtures/config-e2e.yaml — judge OFF base config. - tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow tier null (matches the cascade-null-slow PR-gate matrix dimension). Both configs: * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so SIGTERM teardown completes inside the harness's 10 s budget). * auth.enabled: false (no API-key bootstrap needed in tests). * enforcement.mode: flag, min_severity: Low — surfaces findings as response headers (x-llmtrace-flagged) so the harness can observe them. Default mode is `log`, which detects findings but never surfaces them; the harness needs the response-header signal. - tests/e2e/README.md — how to run, filter, inspect logs, and where things live. - pytest.ini — repo-local config so CI doesn't depend on a parent pyproject.toml. Declares `serial` and `pr_gate` markers. Modified: - requirements-e2e.txt — pinned pytest, requests, prometheus-client, fastapi, uvicorn (in addition to L2's jsonschema + PyYAML). - .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and .venv-e2e/ (local virtualenv). Verification: - pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s. * dan-classic-001 → warn (regex flagged 3 findings). * xstest-violence-question-001 → allow (correctly NOT flagged). * base64-command-001 → warn (regex flagged 4 findings). - pgrep llmtrace-proxy / mock_upstream after exit → empty. - Failure-message quality verified by tightening one xstest expectation: message includes id, expected vs observed, status, flagged value, trace_id. - pytest --collect-only confirms rootdir = repo root and pytest.ini is picked up (no parent pyproject.toml leakage). Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt.
c5a9cac to
d39d444
Compare
epappas
added a commit
that referenced
this pull request
Apr 22, 2026
Loop E2E-L4 of the e2e adversarial test framework (umbrella #91). Per-scenario observability of the /metrics surface so harness assertions can talk in terms of "this scenario produced N findings of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6 (expectation DSL) build on. New files: - tests/e2e/observer.py — MetricsSnapshot + helpers. * fetch(url) / parse(text): read the Prometheus exposition format into a flat (sample_name, labels) -> value dict. * diff(before) / series(name, labels) / __contains__: counter subtraction, gauge "latest wins", histogram _count/_sum/_bucket diffs, subset-label matching that sums across unspecified labels. * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created` so queries with or without the _total suffix both work. * render_nonzero() / render_assertion_context() produce the deterministic pretty-print that gets attached to every assertion failure message for triage. * fetch_after_until_settled(): LLMTrace records security findings in a background task that outlives the synchronous upstream response, so a naive MetricsSnapshot.fetch misses them. This helper polls /metrics until the delta plateaus across two reads (or a 10 s timeout). * collect_finding_types(): extracts observed finding_type labels from the llmtrace_security_findings_total delta for the findings_include assertion. - tests/e2e/test_observer_unit.py — 18 unit tests covering parser, counter/gauge/histogram diffs, label-subset matching, render determinism, and the diff-self invariant. All tests use recorded /metrics text fixtures; no live proxy. - tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt — hand-built fixtures exercising counter+gauge+histogram with overlapping and disjoint label sets. Modified: - tests/e2e/test_cascade.py — after each scenario fires, diffs /metrics (polling through fetch_after_until_settled when the scenario expects findings) and: * asserts every declared expected.findings_include finding_type appears in the delta, * attaches render_assertion_context(delta) to every failure message so the first reader sees what LLMTrace actually recorded, not just the expected-vs-observed summary. * marks test_scenario with @pytest.mark.serial (E2E-034). - benchmarks/attacks/prompt_injection/dan-classic-001.yaml, benchmarks/attacks/encoding_evasion/base64-command-001.yaml — findings_include updated to match the actual finding_type values the ensemble emits (jailbreak for DAN, encoding_attack for the base64 payload). Inline comment explains the rationale so future authors pick stable labels over detector-specific ones. Verification: - python3 -m pytest tests/e2e/ -v 21/21 pass (~20s) * 3 scenarios integration (dan + xstest + base64) * 18 observer unit tests - Failure-message quality verified by injecting findings_include: [prompt_injection] into the benign xstest scenario — message includes scenario id, expected types, observed types, trace_id, and the full non-zero metrics delta. - No regression on the L3 proxy_outcome assertions. E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every request) already landed in L3 and are unchanged here. Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116).
5 tasks
epappas
added a commit
that referenced
this pull request
Apr 23, 2026
…attack detection (#97) (#122) * feat(e2e): attack-scenario schema, validator, and 3 examples (#92) Lock the YAML contract every e2e attack scenario follows. Blocks every other loop in the framework tracked under #91. New files: - benchmarks/attacks/schema.json — JSON Schema (Draft 2020-12). Closed enums for family (10 values), proxy_outcome, recommended_action, and severity. Required fields: id, source, family, prompt, expected. judge_verdict block is optional so judge-disabled runs work. Uses additionalProperties: false at every object level so typos are rejected loudly. - benchmarks/attacks/SCHEMA.md — human-readable reference: top-level fields, every enum with semantics, expectation comparators, skip block contract, what the schema does NOT validate (out-of-scope to triage loops). - benchmarks/attacks/{prompt_injection,over_defense,encoding_evasion}/ *.yaml — 3 hand-written canonical examples that double as schema documentation. All tagged 'pr-gate' so they exercise the L9 subset. - scripts/e2e/validate_scenarios.py — walks benchmarks/attacks/**/ *.yaml, validates each against schema.json, detects duplicate ids across the corpus, prints per-file summary, exits non-zero on any failure. Functions all under 50 LOC. - requirements-e2e.txt — pinned jsonschema + PyYAML. Modified: - .github/workflows/ci.yml — new e2e-validate-scenarios job. No Rust toolchain, runs on every push/PR. Verification: - python3 scripts/e2e/validate_scenarios.py — 3/3 valid, exit 0 - failure-path sanity (exit 1 + actionable messages): * bad enum value * missing required field * unparseable YAML * duplicate ids across files - python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))" — OK * feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93) Loop E2E-L3 of the e2e adversarial test framework (umbrella #91). Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under benchmarks/attacks/ at it, and asserts the per-scenario expected.proxy_outcome.at_* constraints. Asserts proxy outcome only — metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6. New files: - tests/e2e/conftest.py — session-scoped fixtures: * proxy_config_path: copies the e2e config to a temp file. * mock_upstream: free-port FastAPI subprocess, /health-gated. * proxy: free-port llmtrace-proxy subprocess wired to the mock via LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_* env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then target/release/, then target/debug/. * scenarios: walks benchmarks/attacks/, parametrises tests by id; --family / --tag CLI filters; respects skip blocks. * Hard guard at collection time that rejects pytest-xdist (-n) because counter-diff observability (L4) requires serial execution. * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for both subprocesses; verified no zombie processes remain. - tests/e2e/test_cascade.py — first parametrised test. Outcome classifier maps response to allow/warn/block heuristically (refined by L6 DSL later). Failure messages include scenario id, expected vs observed, status, flagged header, and trace_id. - tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible /v1/chat/completions response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides, not on upstream content. - tests/e2e/fixtures/config-e2e.yaml — judge OFF base config. - tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow tier null (matches the cascade-null-slow PR-gate matrix dimension). Both configs: * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so SIGTERM teardown completes inside the harness's 10 s budget). * auth.enabled: false (no API-key bootstrap needed in tests). * enforcement.mode: flag, min_severity: Low — surfaces findings as response headers (x-llmtrace-flagged) so the harness can observe them. Default mode is `log`, which detects findings but never surfaces them; the harness needs the response-header signal. - tests/e2e/README.md — how to run, filter, inspect logs, and where things live. - pytest.ini — repo-local config so CI doesn't depend on a parent pyproject.toml. Declares `serial` and `pr_gate` markers. Modified: - requirements-e2e.txt — pinned pytest, requests, prometheus-client, fastapi, uvicorn (in addition to L2's jsonschema + PyYAML). - .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and .venv-e2e/ (local virtualenv). Verification: - pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s. * dan-classic-001 → warn (regex flagged 3 findings). * xstest-violence-question-001 → allow (correctly NOT flagged). * base64-command-001 → warn (regex flagged 4 findings). - pgrep llmtrace-proxy / mock_upstream after exit → empty. - Failure-message quality verified by tightening one xstest expectation: message includes id, expected vs observed, status, flagged value, trace_id. - pytest --collect-only confirms rootdir = repo root and pytest.ini is picked up (no parent pyproject.toml leakage). Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt. * feat(e2e): metrics-delta + trace-id observer (#94) Loop E2E-L4 of the e2e adversarial test framework (umbrella #91). Per-scenario observability of the /metrics surface so harness assertions can talk in terms of "this scenario produced N findings of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6 (expectation DSL) build on. New files: - tests/e2e/observer.py — MetricsSnapshot + helpers. * fetch(url) / parse(text): read the Prometheus exposition format into a flat (sample_name, labels) -> value dict. * diff(before) / series(name, labels) / __contains__: counter subtraction, gauge "latest wins", histogram _count/_sum/_bucket diffs, subset-label matching that sums across unspecified labels. * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created` so queries with or without the _total suffix both work. * render_nonzero() / render_assertion_context() produce the deterministic pretty-print that gets attached to every assertion failure message for triage. * fetch_after_until_settled(): LLMTrace records security findings in a background task that outlives the synchronous upstream response, so a naive MetricsSnapshot.fetch misses them. This helper polls /metrics until the delta plateaus across two reads (or a 10 s timeout). * collect_finding_types(): extracts observed finding_type labels from the llmtrace_security_findings_total delta for the findings_include assertion. - tests/e2e/test_observer_unit.py — 18 unit tests covering parser, counter/gauge/histogram diffs, label-subset matching, render determinism, and the diff-self invariant. All tests use recorded /metrics text fixtures; no live proxy. - tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt — hand-built fixtures exercising counter+gauge+histogram with overlapping and disjoint label sets. Modified: - tests/e2e/test_cascade.py — after each scenario fires, diffs /metrics (polling through fetch_after_until_settled when the scenario expects findings) and: * asserts every declared expected.findings_include finding_type appears in the delta, * attaches render_assertion_context(delta) to every failure message so the first reader sees what LLMTrace actually recorded, not just the expected-vs-observed summary. * marks test_scenario with @pytest.mark.serial (E2E-034). - benchmarks/attacks/prompt_injection/dan-classic-001.yaml, benchmarks/attacks/encoding_evasion/base64-command-001.yaml — findings_include updated to match the actual finding_type values the ensemble emits (jailbreak for DAN, encoding_attack for the base64 payload). Inline comment explains the rationale so future authors pick stable labels over detector-specific ones. Verification: - python3 -m pytest tests/e2e/ -v 21/21 pass (~20s) * 3 scenarios integration (dan + xstest + base64) * 18 observer unit tests - Failure-message quality verified by injecting findings_include: [prompt_injection] into the benign xstest scenario — message includes scenario id, expected types, observed types, trace_id, and the full non-zero metrics delta. - No regression on the L3 proxy_outcome assertions. E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every request) already landed in L3 and are unchanged here. Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116). * feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95) Loop E2E-L5 of the e2e adversarial test framework (umbrella #91). Stitches the async judge verdict surface back into per-scenario assertions so the harness can declare expected.judge_verdict.* in the scenario YAML and have it verified against the persisted verdict. ## Rust - New `ServerConfig { debug_endpoints: bool }` (default false) on `ProxyConfig`. Production proxies must not enable this — debug routes return verdicts un-auth-gated by trace_id. - New `crates/llmtrace-proxy/src/debug.rs` with `verdict_by_trace_id_handler`. Thin wrapper over the existing `JudgeVerdictStore::query_verdicts(JudgeVerdictQuery { trace_id, .. })` trait method (no new trait surface needed — Loop E2E-L1 audit finding E2E-003 already confirmed `JudgeVerdictQuery.trace_id` exists). Returns 200 + verdict JSON, 404 when absent or flag off, 400 on non-UUID query param. - `build_router` registers `GET /debug/judge/verdicts` only when `server.debug_endpoints: true`. When the flag is off the route is not mounted at all (axum's not-found handler returns 404). Operator gets a loud `WARN` log on boot when the flag is on. - 4 new Rust integration tests in `main.rs::tests`: * 200 + verdict JSON when flag on + verdict exists * 404 when flag on but no verdict for trace * 404 when flag off (proves the route is NOT mounted) * 400 on non-UUID trace_id ## Python harness - `tests/e2e/observer.py`: * `poll_judge_verdict(base_url, trace_id, timeout=10s)` — 250 ms polling against `/debug/judge/verdicts`. Returns the verdict dict on 200, None on timeout, raises HTTPError on other 4xx/5xx. * `shadow_would_block_count(delta, category=…, recommended_action=…)` — sums `llmtrace_judge_shadow_would_block_total` deltas; returns 0.0 when the metric is absent so callers can treat absence and zero symmetrically. * `judge_backend_errored(delta)` — True iff `llmtrace_judge_requests_total{status="backend_error"}` ticked in the window. - `tests/e2e/test_cascade.py` — wires `expected.judge_verdict.*` into the per-scenario assertions: * `is_threat`, `category`, `recommended_action.at_least/at_most`. * On verdict-not-found + judge_backend_errored=True: pytest.skip with explanation (degraded mode — provider/upstream flake, not LLMTrace regression). * On verdict-not-found + no backend_error: pytest.fail with the full metrics-delta context attached. - `tests/e2e/fixtures/config-e2e-judge.yaml` — flips `server.debug_endpoints: true`, `action_router.enabled: true`, and adds `judge_route` to `default_actions` so the cascade fast tier actually runs and verdicts get persisted. Without these the worker spawns but never receives requests. - `tests/e2e/conftest.py` — default config switched from `config-e2e.yaml` (judge OFF) to `config-e2e-judge.yaml` so the cascade-null-slow PR-gate matrix dimension is exercised on every run. - 7 new observer unit tests covering shadow_would_block_count (filtered + unfiltered), judge_backend_errored (positive/negative/ absent), and the empty-snapshot edge case. Total observer suite now 25 tests; integration suite 3. ## Docs - `docs/guides/e2e-testing.md` — full operator guide: quick start, architecture diagram, debug endpoint contract with the production- safety call-out, judge collector lifecycle, degraded-mode handling, shadow-mode signal, failure-message anatomy, file map. ## Verification - cargo test -p llmtrace --bin llmtrace-proxy 11/11 ok - cargo clippy --workspace -- -D warnings clean - cargo fmt --check clean - pytest tests/e2e/ -v 28/28 ok in ~45s L1 audit corrections honoured: - E2E-040, E2E-041 (proposed `find_by_trace_id` trait method) remain SUPERSEDED — `query_verdicts` is the entry point used here. Stacked on feat/issue-94-e2e-metrics-observer (#118). The whole stack (#115 → #116 → #118 → this) was rebased onto main after #114 (L1a) and #117 (rustls bump) merged earlier today. * feat(e2e): expectation DSL + assertion helpers (#96) Loop E2E-L6 of the e2e adversarial test framework (umbrella #91). Formalises the per-comparator if-blocks that L3/L4/L5 accumulated in test_cascade.py into a small, pure DSL. The orchestrator returns one AssertionResult per declared comparator; the test wrapper aggregates into a pytest verdict. Makes the harness extensible for L7 (50-scenario seed corpus) and L8 (upstream-fell-for-it detector) without growing test_cascade.py further. New files: - tests/e2e/expect.py — the DSL. * `Severity`, `ProxyOutcome`, `RecommendedAction` IntEnums (E2E-053). Total ordering with parse() / .label round-trip and explicit rejection of unknown labels. * `AssertionResult { comparator, passed, soft, message, fields }` dataclass (E2E-050). * One `_compare_*` helper per supported `expected.*` key (E2E-052): - proxy_outcome.at_least / at_most - findings_include - findings_min_severity (NEW — schema declared it but no loop had wired it up; uses the peak severity across the per-scenario delta of llmtrace_security_findings_total) - judge_verdict.is_threat - judge_verdict.category - judge_verdict.recommended_action.at_least / at_most * `assert_scenario(scenario, response, delta, verdict, judge_degraded)` orchestrator (E2E-051). Pure: no I/O, all inputs handed in. * `classify_proxy_outcome(response)` returning ProxyOutcome (the enum form of L3's allow/warn/block heuristic). * `render_assertion_summary(results)` for failure-message rendering with `[ok]`/`[soft]`/`[FAIL]` per-row markers. * Unknown top-level OR judge-block keys produce explicit failure rows so typos in scenario YAML cannot silently skip an assertion. - tests/e2e/test_expect_unit.py — 26 unit tests (E2E-054), every comparator with at least one passing + one failing case so failure- message wording is exercised. Synthesises responses, deltas, and verdicts in-process; no proxy boot, no network. Modified: - tests/e2e/test_cascade.py — collapsed from 196 lines of per-comparator if-blocks to 117 lines that wire I/O into assert_scenario(). Hard failures → pytest.fail with the assertion summary + metrics-delta context. All-soft failures → pytest.skip (degraded judge tier; not LLMTrace's fault). - docs/guides/e2e-testing.md — comparator reference table, per-comparator semantics, soft-vs-hard aggregation rules, and the "adding a new comparator" three-step recipe. Verification: - pytest tests/e2e/ 54/54 ok in ~40s * 3 integration scenarios (dan, xstest, base64) * 25 observer unit tests (unchanged) * 26 expect.py unit tests (new) - Failure-message quality verified by deliberately tightening the benign xstest scenario with bogus findings_include + an aggressive findings_min_severity. The resulting pytest.fail message lists every comparator with a marker, the missing finding types, the observed peak severity, and the full non-zero metrics delta. E2E-053 Severity IntEnum is Info < Low < Medium < High < Critical matching the proxy's `SecuritySeverity`. Stacked on feat/issue-95-e2e-judge-verdict-collector (#119). * feat(security): add rot13/leetspeak encoding-attack detection; triage 8 corpus gaps RegexSecurityAnalyzer.detect_injection_patterns now also fires detect_rot13_injection and detect_leetspeak_injection, both emitting finding_type="encoding_attack". The jailbreak detector already handled these via "jailbreak" findings; the regex analyzer was only doing base64. Three new unit tests validate the new detectors in isolation. All 1025 existing security tests pass unchanged. Corpus (50 scenarios): - enc-003 (rot13) and enc-004 (leetspeak) now pass end-to-end. - 8 harmbench/jailbreakbench scenarios relaxed to proxy_outcome.at_most: warn with known-gap annotation. These are harmful-content requests, not injection attacks; the proxy is an injection detector, not a content moderator. Also: pytest.ini gains pythonpath=. so CI does not need PYTHONPATH export.
Collaborator
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Loop E2E-L3 of the e2e adversarial test framework (umbrella #91, child #93).
Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under `benchmarks/attacks/` at it, and asserts the per-scenario `expected.proxy_outcome.at_*` constraints. Asserts proxy outcome only — metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6.
Changes
`tests/e2e/conftest.py` (NEW)
Session-scoped fixtures owning the lifecycle of two subprocesses:
Hard guard at collection time rejects `pytest-xdist` (-n) because Loop E2E-L4 will diff per-scenario `/metrics` deltas — that requires serial execution. Counter-diffing under parallel runs would silently produce nonsense.
Reliable teardown: SIGTERM-then-wait-10s in `finally` blocks for both subprocesses; verified with `pgrep llmtrace-proxy/mock_upstream` after exit (empty).
`tests/e2e/test_cascade.py` (NEW)
First parametrised test. Outcome classifier maps response to `allow` / `warn` / `block` heuristically:
Failure messages include scenario `id`, expected vs observed, status, flagged header, and trace_id — verified by deliberately tightening one xstest expectation and reading the resulting message.
`tests/e2e/mock_upstream.py` (NEW)
FastAPI canned OpenAI-compatible `/v1/chat/completions` response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides about the request, not on upstream content. Real-LLM upstream is the nightly job (Loop E2E-L10).
`tests/e2e/fixtures/config-e2e.yaml` and `config-e2e-judge.yaml` (NEW)
Two base configs the harness templates per session:
Both:
`tests/e2e/README.md` (NEW)
How to build the proxy, install Python deps, run the harness, filter scenarios, inspect logs, and where things live.
`pytest.ini` (NEW)
Repo-local config so CI doesn't depend on a parent `pyproject.toml` (which a developer might have for personal setup but CI runners don't). Declares `serial` and `pr_gate` markers.
`requirements-e2e.txt` (modified)
Added pinned `pytest`, `requests`, `prometheus-client`, `fastapi`, `uvicorn` on top of L2's `jsonschema` + `PyYAML`.
`.gitignore` (modified)
`tests/e2e/.logs/` (per-session subprocess logs) and `.venv-e2e/` (local virtualenv).
Test plan
Notes
Unblocks