feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93) by epappas · Pull Request #116 · techlab-innov/llmtrace

epappas · 2026-04-22T14:29:44Z

Summary

Loop E2E-L3 of the e2e adversarial test framework (umbrella #91, child #93).

Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under `benchmarks/attacks/` at it, and asserts the per-scenario `expected.proxy_outcome.at_*` constraints. Asserts proxy outcome only — metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6.

Stacked on #115 (Loop E2E-L2). This branch targets `feat/issue-92-e2e-scenario-schema` because L3 reads `benchmarks/attacks/*.yaml` and reuses `requirements-e2e.txt`. Rebase onto `main` after #115 merges and switch the base.

Changes

`tests/e2e/conftest.py` (NEW)

Session-scoped fixtures owning the lifecycle of two subprocesses:

`mock_upstream`: free-port FastAPI subprocess, `/health`-gated.
`proxy`: free-port `llmtrace-proxy` subprocess, wired to the mock via env-var overrides (`LLMTRACE_LISTEN_ADDR` / `LLMTRACE_UPSTREAM_URL` / `LLMTRACE_STORAGE_*`). Binary discovered via `LLMTRACE_PROXY_BIN` → `target/release/` → `target/debug/`.
`scenarios`: walks `benchmarks/attacks/`, parametrises tests by scenario `id`. CLI filters `--family=` and `--tag=` (both repeatable).

Hard guard at collection time rejects `pytest-xdist` (-n) because Loop E2E-L4 will diff per-scenario `/metrics` deltas — that requires serial execution. Counter-diffing under parallel runs would silently produce nonsense.

Reliable teardown: SIGTERM-then-wait-10s in `finally` blocks for both subprocesses; verified with `pgrep llmtrace-proxy/mock_upstream` after exit (empty).

`tests/e2e/test_cascade.py` (NEW)

First parametrised test. Outcome classifier maps response to `allow` / `warn` / `block` heuristically:

Response	Outcome
HTTP `>= 400` with body `{"error": {"type": "proxy_*", ...}}`	`block`
HTTP `200` + `x-llmtrace-flagged: true` response header	`warn`
Any other HTTP `200`	`allow`

Failure messages include scenario `id`, expected vs observed, status, flagged header, and trace_id — verified by deliberately tightening one xstest expectation and reading the resulting message.

`tests/e2e/mock_upstream.py` (NEW)

FastAPI canned OpenAI-compatible `/v1/chat/completions` response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides about the request, not on upstream content. Real-LLM upstream is the nightly job (Loop E2E-L10).

`tests/e2e/fixtures/config-e2e.yaml` and `config-e2e-judge.yaml` (NEW)

Two base configs the harness templates per session:

`config-e2e.yaml` — judge OFF.
`config-e2e-judge.yaml` — judge cascade ON, slow tier null (matches the `cascade-null-slow` PR-gate matrix dimension from ci(e2e): PR-gate workflow (fast curated set, must-green) #99).

Both:

`shutdown.timeout_seconds: 5` per Loop E2E-L1 finding E2E-006.
`auth.enabled: false` — no API-key bootstrap needed in tests.
`enforcement.mode: flag`, `min_severity: Low` — surfaces findings as response headers (`x-llmtrace-flagged`) so the harness can observe them. Default mode is `log`, which detects findings but never surfaces them.

`tests/e2e/README.md` (NEW)

How to build the proxy, install Python deps, run the harness, filter scenarios, inspect logs, and where things live.

`pytest.ini` (NEW)

Repo-local config so CI doesn't depend on a parent `pyproject.toml` (which a developer might have for personal setup but CI runners don't). Declares `serial` and `pr_gate` markers.

`requirements-e2e.txt` (modified)

Added pinned `pytest`, `requests`, `prometheus-client`, `fastapi`, `uvicorn` on top of L2's `jsonschema` + `PyYAML`.

`.gitignore` (modified)

`tests/e2e/.logs/` (per-session subprocess logs) and `.venv-e2e/` (local virtualenv).

Test plan

`pytest tests/e2e/ -v` — 3/3 example scenarios pass in ~25s.
- `dan-classic-001` → `warn` (regex flagged 3 findings).
- `xstest-violence-question-001` → `allow` (correctly NOT flagged).
- `base64-command-001` → `warn` (regex flagged 4 findings).
`pgrep llmtrace-proxy / mock_upstream` after exit — empty (no zombie processes).
Failure-message quality verified by tightening one xstest expectation: message includes `id`, expected vs observed, `status`, `flagged`, `trace_id`.
`pytest --collect-only` confirms `rootdir = /home/epappas/workspace/spacejar/llmtrace` and `configfile: pytest.ini` is picked up (no parent pyproject leakage).
`tests/e2e/.logs/proxy.log` populated and gitignored.
CI integration is out of scope — that is Loop E2E-L9 (ci(e2e): PR-gate workflow (fast curated set, must-green) #99).

Notes

The L1a trace-id header (feat(proxy): honor + echo X-LLMTrace-Trace-Id header (#91 E2E-L1a) #114) is set by the harness on every request. If feat(proxy): honor + echo X-LLMTrace-Trace-Id header (#91 E2E-L1a) #114 is not yet merged, the proxy will simply ignore the inbound header and generate its own; nothing in this loop's assertions depends on the trace-id round-trip yet (that's L4 / L5).
The harness includes hooks for L4/L5/L8 (e.g. `metrics_url` on `ProxyHandle`, `trace_id` on every request) so subsequent loops only add observability, not plumbing.

Unblocks

Loop E2E-L4 (feat(e2e): metrics-delta + trace-id observer #94, metrics-delta observer) — `proxy.metrics_url` ready.
Loop E2E-L5 (feat(e2e): judge verdict collector (store poll + shadow counter) #95, judge verdict collector) — once the debug-endpoint Rust change lands.
Loop E2E-L6 (feat(e2e): expectation DSL + assertion helpers #96, expectation DSL) — slots in by replacing the heuristic in `test_cascade.py`.

…04) (#117) * chore(deps): bump rustls-webpki 0.103.12 -> 0.103.13 (RUSTSEC-2026-0104) RUSTSEC-2026-0104: reachable panic in certificate revocation list parsing in rustls-webpki <0.103.13. The advisory was published after main's last successful CI run (2026-04-21 21:55), so all currently open PRs (#114, #115, #116) inherit the audit failure. `cargo update -p rustls-webpki` resolves to the patch release that contains the fix. No source changes; the bump is transitive via the rustls / reqwest / quinn dependency chain. Verification: cargo audit exit 0 (only unmaintained-crate warnings remain; no vulnerabilities) cargo build --workspace ok Unblocks the Security Audit job on #114, #115, #116. * chore: retrigger CI (previous Clippy hung 60 min)

…est (#93) Loop E2E-L3 of the e2e adversarial test framework (umbrella #91). Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under benchmarks/attacks/ at it, and asserts the per-scenario expected.proxy_outcome.at_* constraints. Asserts proxy outcome only — metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6. New files: - tests/e2e/conftest.py — session-scoped fixtures: * proxy_config_path: copies the e2e config to a temp file. * mock_upstream: free-port FastAPI subprocess, /health-gated. * proxy: free-port llmtrace-proxy subprocess wired to the mock via LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_* env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then target/release/, then target/debug/. * scenarios: walks benchmarks/attacks/, parametrises tests by id; --family / --tag CLI filters; respects skip blocks. * Hard guard at collection time that rejects pytest-xdist (-n) because counter-diff observability (L4) requires serial execution. * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for both subprocesses; verified no zombie processes remain. - tests/e2e/test_cascade.py — first parametrised test. Outcome classifier maps response to allow/warn/block heuristically (refined by L6 DSL later). Failure messages include scenario id, expected vs observed, status, flagged header, and trace_id. - tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible /v1/chat/completions response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides, not on upstream content. - tests/e2e/fixtures/config-e2e.yaml — judge OFF base config. - tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow tier null (matches the cascade-null-slow PR-gate matrix dimension). Both configs: * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so SIGTERM teardown completes inside the harness's 10 s budget). * auth.enabled: false (no API-key bootstrap needed in tests). * enforcement.mode: flag, min_severity: Low — surfaces findings as response headers (x-llmtrace-flagged) so the harness can observe them. Default mode is `log`, which detects findings but never surfaces them; the harness needs the response-header signal. - tests/e2e/README.md — how to run, filter, inspect logs, and where things live. - pytest.ini — repo-local config so CI doesn't depend on a parent pyproject.toml. Declares `serial` and `pr_gate` markers. Modified: - requirements-e2e.txt — pinned pytest, requests, prometheus-client, fastapi, uvicorn (in addition to L2's jsonschema + PyYAML). - .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and .venv-e2e/ (local virtualenv). Verification: - pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s. * dan-classic-001 → warn (regex flagged 3 findings). * xstest-violence-question-001 → allow (correctly NOT flagged). * base64-command-001 → warn (regex flagged 4 findings). - pgrep llmtrace-proxy / mock_upstream after exit → empty. - Failure-message quality verified by tightening one xstest expectation: message includes id, expected vs observed, status, flagged value, trace_id. - pytest --collect-only confirms rootdir = repo root and pytest.ini is picked up (no parent pyproject.toml leakage). Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt.

Loop E2E-L4 of the e2e adversarial test framework (umbrella #91). Per-scenario observability of the /metrics surface so harness assertions can talk in terms of "this scenario produced N findings of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6 (expectation DSL) build on. New files: - tests/e2e/observer.py — MetricsSnapshot + helpers. * fetch(url) / parse(text): read the Prometheus exposition format into a flat (sample_name, labels) -> value dict. * diff(before) / series(name, labels) / __contains__: counter subtraction, gauge "latest wins", histogram _count/_sum/_bucket diffs, subset-label matching that sums across unspecified labels. * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created` so queries with or without the _total suffix both work. * render_nonzero() / render_assertion_context() produce the deterministic pretty-print that gets attached to every assertion failure message for triage. * fetch_after_until_settled(): LLMTrace records security findings in a background task that outlives the synchronous upstream response, so a naive MetricsSnapshot.fetch misses them. This helper polls /metrics until the delta plateaus across two reads (or a 10 s timeout). * collect_finding_types(): extracts observed finding_type labels from the llmtrace_security_findings_total delta for the findings_include assertion. - tests/e2e/test_observer_unit.py — 18 unit tests covering parser, counter/gauge/histogram diffs, label-subset matching, render determinism, and the diff-self invariant. All tests use recorded /metrics text fixtures; no live proxy. - tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt — hand-built fixtures exercising counter+gauge+histogram with overlapping and disjoint label sets. Modified: - tests/e2e/test_cascade.py — after each scenario fires, diffs /metrics (polling through fetch_after_until_settled when the scenario expects findings) and: * asserts every declared expected.findings_include finding_type appears in the delta, * attaches render_assertion_context(delta) to every failure message so the first reader sees what LLMTrace actually recorded, not just the expected-vs-observed summary. * marks test_scenario with @pytest.mark.serial (E2E-034). - benchmarks/attacks/prompt_injection/dan-classic-001.yaml, benchmarks/attacks/encoding_evasion/base64-command-001.yaml — findings_include updated to match the actual finding_type values the ensemble emits (jailbreak for DAN, encoding_attack for the base64 payload). Inline comment explains the rationale so future authors pick stable labels over detector-specific ones. Verification: - python3 -m pytest tests/e2e/ -v 21/21 pass (~20s) * 3 scenarios integration (dan + xstest + base64) * 18 observer unit tests - Failure-message quality verified by injecting findings_include: [prompt_injection] into the benign xstest scenario — message includes scenario id, expected types, observed types, trace_id, and the full non-zero metrics delta. - No regression on the L3 proxy_outcome assertions. E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every request) already landed in L3 and are unchanged here. Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116).

…attack detection (#97) (#122) * feat(e2e): attack-scenario schema, validator, and 3 examples (#92) Lock the YAML contract every e2e attack scenario follows. Blocks every other loop in the framework tracked under #91. New files: - benchmarks/attacks/schema.json — JSON Schema (Draft 2020-12). Closed enums for family (10 values), proxy_outcome, recommended_action, and severity. Required fields: id, source, family, prompt, expected. judge_verdict block is optional so judge-disabled runs work. Uses additionalProperties: false at every object level so typos are rejected loudly. - benchmarks/attacks/SCHEMA.md — human-readable reference: top-level fields, every enum with semantics, expectation comparators, skip block contract, what the schema does NOT validate (out-of-scope to triage loops). - benchmarks/attacks/{prompt_injection,over_defense,encoding_evasion}/ *.yaml — 3 hand-written canonical examples that double as schema documentation. All tagged 'pr-gate' so they exercise the L9 subset. - scripts/e2e/validate_scenarios.py — walks benchmarks/attacks/**/ *.yaml, validates each against schema.json, detects duplicate ids across the corpus, prints per-file summary, exits non-zero on any failure. Functions all under 50 LOC. - requirements-e2e.txt — pinned jsonschema + PyYAML. Modified: - .github/workflows/ci.yml — new e2e-validate-scenarios job. No Rust toolchain, runs on every push/PR. Verification: - python3 scripts/e2e/validate_scenarios.py — 3/3 valid, exit 0 - failure-path sanity (exit 1 + actionable messages): * bad enum value * missing required field * unparseable YAML * duplicate ids across files - python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))" — OK * feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93) Loop E2E-L3 of the e2e adversarial test framework (umbrella #91). Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under benchmarks/attacks/ at it, and asserts the per-scenario expected.proxy_outcome.at_* constraints. Asserts proxy outcome only — metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6. New files: - tests/e2e/conftest.py — session-scoped fixtures: * proxy_config_path: copies the e2e config to a temp file. * mock_upstream: free-port FastAPI subprocess, /health-gated. * proxy: free-port llmtrace-proxy subprocess wired to the mock via LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_* env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then target/release/, then target/debug/. * scenarios: walks benchmarks/attacks/, parametrises tests by id; --family / --tag CLI filters; respects skip blocks. * Hard guard at collection time that rejects pytest-xdist (-n) because counter-diff observability (L4) requires serial execution. * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for both subprocesses; verified no zombie processes remain. - tests/e2e/test_cascade.py — first parametrised test. Outcome classifier maps response to allow/warn/block heuristically (refined by L6 DSL later). Failure messages include scenario id, expected vs observed, status, flagged header, and trace_id. - tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible /v1/chat/completions response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides, not on upstream content. - tests/e2e/fixtures/config-e2e.yaml — judge OFF base config. - tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow tier null (matches the cascade-null-slow PR-gate matrix dimension). Both configs: * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so SIGTERM teardown completes inside the harness's 10 s budget). * auth.enabled: false (no API-key bootstrap needed in tests). * enforcement.mode: flag, min_severity: Low — surfaces findings as response headers (x-llmtrace-flagged) so the harness can observe them. Default mode is `log`, which detects findings but never surfaces them; the harness needs the response-header signal. - tests/e2e/README.md — how to run, filter, inspect logs, and where things live. - pytest.ini — repo-local config so CI doesn't depend on a parent pyproject.toml. Declares `serial` and `pr_gate` markers. Modified: - requirements-e2e.txt — pinned pytest, requests, prometheus-client, fastapi, uvicorn (in addition to L2's jsonschema + PyYAML). - .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and .venv-e2e/ (local virtualenv). Verification: - pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s. * dan-classic-001 → warn (regex flagged 3 findings). * xstest-violence-question-001 → allow (correctly NOT flagged). * base64-command-001 → warn (regex flagged 4 findings). - pgrep llmtrace-proxy / mock_upstream after exit → empty. - Failure-message quality verified by tightening one xstest expectation: message includes id, expected vs observed, status, flagged value, trace_id. - pytest --collect-only confirms rootdir = repo root and pytest.ini is picked up (no parent pyproject.toml leakage). Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt. * feat(e2e): metrics-delta + trace-id observer (#94) Loop E2E-L4 of the e2e adversarial test framework (umbrella #91). Per-scenario observability of the /metrics surface so harness assertions can talk in terms of "this scenario produced N findings of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6 (expectation DSL) build on. New files: - tests/e2e/observer.py — MetricsSnapshot + helpers. * fetch(url) / parse(text): read the Prometheus exposition format into a flat (sample_name, labels) -> value dict. * diff(before) / series(name, labels) / __contains__: counter subtraction, gauge "latest wins", histogram _count/_sum/_bucket diffs, subset-label matching that sums across unspecified labels. * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created` so queries with or without the _total suffix both work. * render_nonzero() / render_assertion_context() produce the deterministic pretty-print that gets attached to every assertion failure message for triage. * fetch_after_until_settled(): LLMTrace records security findings in a background task that outlives the synchronous upstream response, so a naive MetricsSnapshot.fetch misses them. This helper polls /metrics until the delta plateaus across two reads (or a 10 s timeout). * collect_finding_types(): extracts observed finding_type labels from the llmtrace_security_findings_total delta for the findings_include assertion. - tests/e2e/test_observer_unit.py — 18 unit tests covering parser, counter/gauge/histogram diffs, label-subset matching, render determinism, and the diff-self invariant. All tests use recorded /metrics text fixtures; no live proxy. - tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt — hand-built fixtures exercising counter+gauge+histogram with overlapping and disjoint label sets. Modified: - tests/e2e/test_cascade.py — after each scenario fires, diffs /metrics (polling through fetch_after_until_settled when the scenario expects findings) and: * asserts every declared expected.findings_include finding_type appears in the delta, * attaches render_assertion_context(delta) to every failure message so the first reader sees what LLMTrace actually recorded, not just the expected-vs-observed summary. * marks test_scenario with @pytest.mark.serial (E2E-034). - benchmarks/attacks/prompt_injection/dan-classic-001.yaml, benchmarks/attacks/encoding_evasion/base64-command-001.yaml — findings_include updated to match the actual finding_type values the ensemble emits (jailbreak for DAN, encoding_attack for the base64 payload). Inline comment explains the rationale so future authors pick stable labels over detector-specific ones. Verification: - python3 -m pytest tests/e2e/ -v 21/21 pass (~20s) * 3 scenarios integration (dan + xstest + base64) * 18 observer unit tests - Failure-message quality verified by injecting findings_include: [prompt_injection] into the benign xstest scenario — message includes scenario id, expected types, observed types, trace_id, and the full non-zero metrics delta. - No regression on the L3 proxy_outcome assertions. E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every request) already landed in L3 and are unchanged here. Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116). * feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95) Loop E2E-L5 of the e2e adversarial test framework (umbrella #91). Stitches the async judge verdict surface back into per-scenario assertions so the harness can declare expected.judge_verdict.* in the scenario YAML and have it verified against the persisted verdict. ## Rust - New `ServerConfig { debug_endpoints: bool }` (default false) on `ProxyConfig`. Production proxies must not enable this — debug routes return verdicts un-auth-gated by trace_id. - New `crates/llmtrace-proxy/src/debug.rs` with `verdict_by_trace_id_handler`. Thin wrapper over the existing `JudgeVerdictStore::query_verdicts(JudgeVerdictQuery { trace_id, .. })` trait method (no new trait surface needed — Loop E2E-L1 audit finding E2E-003 already confirmed `JudgeVerdictQuery.trace_id` exists). Returns 200 + verdict JSON, 404 when absent or flag off, 400 on non-UUID query param. - `build_router` registers `GET /debug/judge/verdicts` only when `server.debug_endpoints: true`. When the flag is off the route is not mounted at all (axum's not-found handler returns 404). Operator gets a loud `WARN` log on boot when the flag is on. - 4 new Rust integration tests in `main.rs::tests`: * 200 + verdict JSON when flag on + verdict exists * 404 when flag on but no verdict for trace * 404 when flag off (proves the route is NOT mounted) * 400 on non-UUID trace_id ## Python harness - `tests/e2e/observer.py`: * `poll_judge_verdict(base_url, trace_id, timeout=10s)` — 250 ms polling against `/debug/judge/verdicts`. Returns the verdict dict on 200, None on timeout, raises HTTPError on other 4xx/5xx. * `shadow_would_block_count(delta, category=…, recommended_action=…)` — sums `llmtrace_judge_shadow_would_block_total` deltas; returns 0.0 when the metric is absent so callers can treat absence and zero symmetrically. * `judge_backend_errored(delta)` — True iff `llmtrace_judge_requests_total{status="backend_error"}` ticked in the window. - `tests/e2e/test_cascade.py` — wires `expected.judge_verdict.*` into the per-scenario assertions: * `is_threat`, `category`, `recommended_action.at_least/at_most`. * On verdict-not-found + judge_backend_errored=True: pytest.skip with explanation (degraded mode — provider/upstream flake, not LLMTrace regression). * On verdict-not-found + no backend_error: pytest.fail with the full metrics-delta context attached. - `tests/e2e/fixtures/config-e2e-judge.yaml` — flips `server.debug_endpoints: true`, `action_router.enabled: true`, and adds `judge_route` to `default_actions` so the cascade fast tier actually runs and verdicts get persisted. Without these the worker spawns but never receives requests. - `tests/e2e/conftest.py` — default config switched from `config-e2e.yaml` (judge OFF) to `config-e2e-judge.yaml` so the cascade-null-slow PR-gate matrix dimension is exercised on every run. - 7 new observer unit tests covering shadow_would_block_count (filtered + unfiltered), judge_backend_errored (positive/negative/ absent), and the empty-snapshot edge case. Total observer suite now 25 tests; integration suite 3. ## Docs - `docs/guides/e2e-testing.md` — full operator guide: quick start, architecture diagram, debug endpoint contract with the production- safety call-out, judge collector lifecycle, degraded-mode handling, shadow-mode signal, failure-message anatomy, file map. ## Verification - cargo test -p llmtrace --bin llmtrace-proxy 11/11 ok - cargo clippy --workspace -- -D warnings clean - cargo fmt --check clean - pytest tests/e2e/ -v 28/28 ok in ~45s L1 audit corrections honoured: - E2E-040, E2E-041 (proposed `find_by_trace_id` trait method) remain SUPERSEDED — `query_verdicts` is the entry point used here. Stacked on feat/issue-94-e2e-metrics-observer (#118). The whole stack (#115 → #116 → #118 → this) was rebased onto main after #114 (L1a) and #117 (rustls bump) merged earlier today. * feat(e2e): expectation DSL + assertion helpers (#96) Loop E2E-L6 of the e2e adversarial test framework (umbrella #91). Formalises the per-comparator if-blocks that L3/L4/L5 accumulated in test_cascade.py into a small, pure DSL. The orchestrator returns one AssertionResult per declared comparator; the test wrapper aggregates into a pytest verdict. Makes the harness extensible for L7 (50-scenario seed corpus) and L8 (upstream-fell-for-it detector) without growing test_cascade.py further. New files: - tests/e2e/expect.py — the DSL. * `Severity`, `ProxyOutcome`, `RecommendedAction` IntEnums (E2E-053). Total ordering with parse() / .label round-trip and explicit rejection of unknown labels. * `AssertionResult { comparator, passed, soft, message, fields }` dataclass (E2E-050). * One `_compare_*` helper per supported `expected.*` key (E2E-052): - proxy_outcome.at_least / at_most - findings_include - findings_min_severity (NEW — schema declared it but no loop had wired it up; uses the peak severity across the per-scenario delta of llmtrace_security_findings_total) - judge_verdict.is_threat - judge_verdict.category - judge_verdict.recommended_action.at_least / at_most * `assert_scenario(scenario, response, delta, verdict, judge_degraded)` orchestrator (E2E-051). Pure: no I/O, all inputs handed in. * `classify_proxy_outcome(response)` returning ProxyOutcome (the enum form of L3's allow/warn/block heuristic). * `render_assertion_summary(results)` for failure-message rendering with `[ok]`/`[soft]`/`[FAIL]` per-row markers. * Unknown top-level OR judge-block keys produce explicit failure rows so typos in scenario YAML cannot silently skip an assertion. - tests/e2e/test_expect_unit.py — 26 unit tests (E2E-054), every comparator with at least one passing + one failing case so failure- message wording is exercised. Synthesises responses, deltas, and verdicts in-process; no proxy boot, no network. Modified: - tests/e2e/test_cascade.py — collapsed from 196 lines of per-comparator if-blocks to 117 lines that wire I/O into assert_scenario(). Hard failures → pytest.fail with the assertion summary + metrics-delta context. All-soft failures → pytest.skip (degraded judge tier; not LLMTrace's fault). - docs/guides/e2e-testing.md — comparator reference table, per-comparator semantics, soft-vs-hard aggregation rules, and the "adding a new comparator" three-step recipe. Verification: - pytest tests/e2e/ 54/54 ok in ~40s * 3 integration scenarios (dan, xstest, base64) * 25 observer unit tests (unchanged) * 26 expect.py unit tests (new) - Failure-message quality verified by deliberately tightening the benign xstest scenario with bogus findings_include + an aggressive findings_min_severity. The resulting pytest.fail message lists every comparator with a marker, the missing finding types, the observed peak severity, and the full non-zero metrics delta. E2E-053 Severity IntEnum is Info < Low < Medium < High < Critical matching the proxy's `SecuritySeverity`. Stacked on feat/issue-95-e2e-judge-verdict-collector (#119). * feat(security): add rot13/leetspeak encoding-attack detection; triage 8 corpus gaps RegexSecurityAnalyzer.detect_injection_patterns now also fires detect_rot13_injection and detect_leetspeak_injection, both emitting finding_type="encoding_attack". The jailbreak detector already handled these via "jailbreak" findings; the regex analyzer was only doing base64. Three new unit tests validate the new detectors in isolation. All 1025 existing security tests pass unchanged. Corpus (50 scenarios): - enc-003 (rot13) and enc-004 (leetspeak) now pass end-to-end. - 8 harmbench/jailbreakbench scenarios relaxed to proxy_outcome.at_most: warn with known-gap annotation. These are harmful-content requests, not injection attacks; the proxy is an injection detector, not a content moderator. Also: pytest.ini gains pythonpath=. so CI does not need PYTHONPATH export.

epappas · 2026-04-23T20:06:35Z

Superseded by #122 (squash-merged into main as commit fc71a5a). All L1-L6 changes landed in that single squash.

epappas mentioned this pull request Apr 22, 2026

chore(deps): bump rustls-webpki 0.103.12 -> 0.103.13 (RUSTSEC-2026-0104) #117

Merged

3 tasks

epappas force-pushed the feat/issue-92-e2e-scenario-schema branch from 1463e70 to e9be221 Compare April 22, 2026 17:16

epappas force-pushed the feat/issue-93-e2e-pytest-harness-stacked branch from 0e82d54 to c5a9cac Compare April 22, 2026 17:17

epappas mentioned this pull request Apr 22, 2026

feat(e2e): metrics-delta + trace-id observer (#94) #118

Closed

3 tasks

epappas force-pushed the feat/issue-92-e2e-scenario-schema branch from e9be221 to 5e5ccac Compare April 22, 2026 18:57

epappas force-pushed the feat/issue-93-e2e-pytest-harness-stacked branch from c5a9cac to d39d444 Compare April 22, 2026 18:58

epappas mentioned this pull request Apr 22, 2026

feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95) #119

Closed

5 tasks

epappas closed this Apr 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)#116

feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)#116
epappas wants to merge 1 commit into
feat/issue-92-e2e-scenario-schemafrom
feat/issue-93-e2e-pytest-harness-stacked

epappas commented Apr 22, 2026

Uh oh!

epappas commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

epappas commented Apr 22, 2026

Summary

Changes

`tests/e2e/conftest.py` (NEW)

`tests/e2e/test_cascade.py` (NEW)

`tests/e2e/mock_upstream.py` (NEW)

`tests/e2e/fixtures/config-e2e.yaml` and `config-e2e-judge.yaml` (NEW)

`tests/e2e/README.md` (NEW)

`pytest.ini` (NEW)

`requirements-e2e.txt` (modified)

`.gitignore` (modified)

Test plan

Notes

Unblocks

Uh oh!

epappas commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant