Skip to content

feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)#116

Closed
epappas wants to merge 1 commit into
feat/issue-92-e2e-scenario-schemafrom
feat/issue-93-e2e-pytest-harness-stacked
Closed

feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)#116
epappas wants to merge 1 commit into
feat/issue-92-e2e-scenario-schemafrom
feat/issue-93-e2e-pytest-harness-stacked

Conversation

@epappas
Copy link
Copy Markdown
Collaborator

@epappas epappas commented Apr 22, 2026

Summary

Loop E2E-L3 of the e2e adversarial test framework (umbrella #91, child #93).

Boots the LLMTrace proxy as a subprocess against an in-process FastAPI mock upstream, fires every scenario YAML under `benchmarks/attacks/` at it, and asserts the per-scenario `expected.proxy_outcome.at_*` constraints. Asserts proxy outcome only — metrics-delta and judge-verdict observability land in Loops E2E-L4 / L5 / L6.

Stacked on #115 (Loop E2E-L2). This branch targets `feat/issue-92-e2e-scenario-schema` because L3 reads `benchmarks/attacks/*.yaml` and reuses `requirements-e2e.txt`. Rebase onto `main` after #115 merges and switch the base.

Changes

`tests/e2e/conftest.py` (NEW)

Session-scoped fixtures owning the lifecycle of two subprocesses:

  • `mock_upstream`: free-port FastAPI subprocess, `/health`-gated.
  • `proxy`: free-port `llmtrace-proxy` subprocess, wired to the mock via env-var overrides (`LLMTRACE_LISTEN_ADDR` / `LLMTRACE_UPSTREAM_URL` / `LLMTRACE_STORAGE_*`). Binary discovered via `LLMTRACE_PROXY_BIN` → `target/release/` → `target/debug/`.
  • `scenarios`: walks `benchmarks/attacks/`, parametrises tests by scenario `id`. CLI filters `--family=` and `--tag=` (both repeatable).

Hard guard at collection time rejects `pytest-xdist` (-n) because Loop E2E-L4 will diff per-scenario `/metrics` deltas — that requires serial execution. Counter-diffing under parallel runs would silently produce nonsense.

Reliable teardown: SIGTERM-then-wait-10s in `finally` blocks for both subprocesses; verified with `pgrep llmtrace-proxy/mock_upstream` after exit (empty).

`tests/e2e/test_cascade.py` (NEW)

First parametrised test. Outcome classifier maps response to `allow` / `warn` / `block` heuristically:

Response Outcome
HTTP `>= 400` with body `{"error": {"type": "proxy_*", ...}}` `block`
HTTP `200` + `x-llmtrace-flagged: true` response header `warn`
Any other HTTP `200` `allow`

Failure messages include scenario `id`, expected vs observed, status, flagged header, and trace_id — verified by deliberately tightening one xstest expectation and reading the resulting message.

`tests/e2e/mock_upstream.py` (NEW)

FastAPI canned OpenAI-compatible `/v1/chat/completions` response. Always returns the same assistant message; PR-gate tests assert on what LLMTrace decides about the request, not on upstream content. Real-LLM upstream is the nightly job (Loop E2E-L10).

`tests/e2e/fixtures/config-e2e.yaml` and `config-e2e-judge.yaml` (NEW)

Two base configs the harness templates per session:

Both:

  • `shutdown.timeout_seconds: 5` per Loop E2E-L1 finding E2E-006.
  • `auth.enabled: false` — no API-key bootstrap needed in tests.
  • `enforcement.mode: flag`, `min_severity: Low` — surfaces findings as response headers (`x-llmtrace-flagged`) so the harness can observe them. Default mode is `log`, which detects findings but never surfaces them.

`tests/e2e/README.md` (NEW)

How to build the proxy, install Python deps, run the harness, filter scenarios, inspect logs, and where things live.

`pytest.ini` (NEW)

Repo-local config so CI doesn't depend on a parent `pyproject.toml` (which a developer might have for personal setup but CI runners don't). Declares `serial` and `pr_gate` markers.

`requirements-e2e.txt` (modified)

Added pinned `pytest`, `requests`, `prometheus-client`, `fastapi`, `uvicorn` on top of L2's `jsonschema` + `PyYAML`.

`.gitignore` (modified)

`tests/e2e/.logs/` (per-session subprocess logs) and `.venv-e2e/` (local virtualenv).

Test plan

  • `pytest tests/e2e/ -v` — 3/3 example scenarios pass in ~25s.
    • `dan-classic-001` → `warn` (regex flagged 3 findings).
    • `xstest-violence-question-001` → `allow` (correctly NOT flagged).
    • `base64-command-001` → `warn` (regex flagged 4 findings).
  • `pgrep llmtrace-proxy / mock_upstream` after exit — empty (no zombie processes).
  • Failure-message quality verified by tightening one xstest expectation: message includes `id`, expected vs observed, `status`, `flagged`, `trace_id`.
  • `pytest --collect-only` confirms `rootdir = /home/epappas/workspace/spacejar/llmtrace` and `configfile: pytest.ini` is picked up (no parent pyproject leakage).
  • `tests/e2e/.logs/proxy.log` populated and gitignored.
  • CI integration is out of scope — that is Loop E2E-L9 (ci(e2e): PR-gate workflow (fast curated set, must-green) #99).

Notes

Unblocks

epappas added a commit that referenced this pull request Apr 22, 2026
…04) (#117)

* chore(deps): bump rustls-webpki 0.103.12 -> 0.103.13 (RUSTSEC-2026-0104)

RUSTSEC-2026-0104: reachable panic in certificate revocation list
parsing in rustls-webpki <0.103.13. The advisory was published after
main's last successful CI run (2026-04-21 21:55), so all currently
open PRs (#114, #115, #116) inherit the audit failure.

`cargo update -p rustls-webpki` resolves to the patch release that
contains the fix. No source changes; the bump is transitive via
the rustls / reqwest / quinn dependency chain.

Verification:

  cargo audit                 exit 0 (only unmaintained-crate warnings
                              remain; no vulnerabilities)
  cargo build --workspace     ok

Unblocks the Security Audit job on #114, #115, #116.

* chore: retrigger CI (previous Clippy hung 60 min)
@epappas epappas force-pushed the feat/issue-92-e2e-scenario-schema branch from 1463e70 to e9be221 Compare April 22, 2026 17:16
@epappas epappas force-pushed the feat/issue-93-e2e-pytest-harness-stacked branch from 0e82d54 to c5a9cac Compare April 22, 2026 17:17
@epappas epappas force-pushed the feat/issue-92-e2e-scenario-schema branch from e9be221 to 5e5ccac Compare April 22, 2026 18:57
…est (#93)

Loop E2E-L3 of the e2e adversarial test framework (umbrella #91).

Boots the LLMTrace proxy as a subprocess against an in-process FastAPI
mock upstream, fires every scenario YAML under benchmarks/attacks/ at
it, and asserts the per-scenario expected.proxy_outcome.at_* constraints.
Asserts proxy outcome only — metrics-delta and judge-verdict observability
land in Loops E2E-L4 / L5 / L6.

New files:

- tests/e2e/conftest.py — session-scoped fixtures:
  * proxy_config_path: copies the e2e config to a temp file.
  * mock_upstream: free-port FastAPI subprocess, /health-gated.
  * proxy: free-port llmtrace-proxy subprocess wired to the mock via
    LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_*
    env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then
    target/release/, then target/debug/.
  * scenarios: walks benchmarks/attacks/, parametrises tests by id;
    --family / --tag CLI filters; respects skip blocks.
  * Hard guard at collection time that rejects pytest-xdist (-n)
    because counter-diff observability (L4) requires serial execution.
  * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for
    both subprocesses; verified no zombie processes remain.
- tests/e2e/test_cascade.py — first parametrised test. Outcome
  classifier maps response to allow/warn/block heuristically (refined
  by L6 DSL later). Failure messages include scenario id, expected vs
  observed, status, flagged header, and trace_id.
- tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible
  /v1/chat/completions response. Always returns the same assistant
  message; PR-gate tests assert on what LLMTrace decides, not on
  upstream content.
- tests/e2e/fixtures/config-e2e.yaml — judge OFF base config.
- tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow
  tier null (matches the cascade-null-slow PR-gate matrix dimension).
  Both configs:
  * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so
    SIGTERM teardown completes inside the harness's 10 s budget).
  * auth.enabled: false (no API-key bootstrap needed in tests).
  * enforcement.mode: flag, min_severity: Low — surfaces findings as
    response headers (x-llmtrace-flagged) so the harness can observe
    them. Default mode is `log`, which detects findings but never
    surfaces them; the harness needs the response-header signal.
- tests/e2e/README.md — how to run, filter, inspect logs, and where
  things live.
- pytest.ini — repo-local config so CI doesn't depend on a parent
  pyproject.toml. Declares `serial` and `pr_gate` markers.

Modified:

- requirements-e2e.txt — pinned pytest, requests, prometheus-client,
  fastapi, uvicorn (in addition to L2's jsonschema + PyYAML).
- .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and
  .venv-e2e/ (local virtualenv).

Verification:

- pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s.
  * dan-classic-001 → warn (regex flagged 3 findings).
  * xstest-violence-question-001 → allow (correctly NOT flagged).
  * base64-command-001 → warn (regex flagged 4 findings).
- pgrep llmtrace-proxy / mock_upstream after exit → empty.
- Failure-message quality verified by tightening one xstest
  expectation: message includes id, expected vs observed, status,
  flagged value, trace_id.
- pytest --collect-only confirms rootdir = repo root and pytest.ini
  is picked up (no parent pyproject.toml leakage).

Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because
this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt.
@epappas epappas force-pushed the feat/issue-93-e2e-pytest-harness-stacked branch from c5a9cac to d39d444 Compare April 22, 2026 18:58
epappas added a commit that referenced this pull request Apr 22, 2026
Loop E2E-L4 of the e2e adversarial test framework (umbrella #91).

Per-scenario observability of the /metrics surface so harness
assertions can talk in terms of "this scenario produced N findings
of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6
(expectation DSL) build on.

New files:

- tests/e2e/observer.py — MetricsSnapshot + helpers.
  * fetch(url) / parse(text): read the Prometheus exposition format
    into a flat (sample_name, labels) -> value dict.
  * diff(before) / series(name, labels) / __contains__: counter
    subtraction, gauge "latest wins", histogram _count/_sum/_bucket
    diffs, subset-label matching that sums across unspecified labels.
  * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created`
    so queries with or without the _total suffix both work.
  * render_nonzero() / render_assertion_context() produce the
    deterministic pretty-print that gets attached to every assertion
    failure message for triage.
  * fetch_after_until_settled(): LLMTrace records security findings
    in a background task that outlives the synchronous upstream
    response, so a naive MetricsSnapshot.fetch misses them. This
    helper polls /metrics until the delta plateaus across two reads
    (or a 10 s timeout).
  * collect_finding_types(): extracts observed finding_type labels
    from the llmtrace_security_findings_total delta for the
    findings_include assertion.

- tests/e2e/test_observer_unit.py — 18 unit tests covering parser,
  counter/gauge/histogram diffs, label-subset matching, render
  determinism, and the diff-self invariant. All tests use recorded
  /metrics text fixtures; no live proxy.

- tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt
  — hand-built fixtures exercising counter+gauge+histogram with
  overlapping and disjoint label sets.

Modified:

- tests/e2e/test_cascade.py — after each scenario fires, diffs
  /metrics (polling through fetch_after_until_settled when the
  scenario expects findings) and:
  * asserts every declared expected.findings_include finding_type
    appears in the delta,
  * attaches render_assertion_context(delta) to every failure
    message so the first reader sees what LLMTrace actually recorded,
    not just the expected-vs-observed summary.
  * marks test_scenario with @pytest.mark.serial (E2E-034).

- benchmarks/attacks/prompt_injection/dan-classic-001.yaml,
  benchmarks/attacks/encoding_evasion/base64-command-001.yaml —
  findings_include updated to match the actual finding_type values
  the ensemble emits (jailbreak for DAN, encoding_attack for the
  base64 payload). Inline comment explains the rationale so future
  authors pick stable labels over detector-specific ones.

Verification:

- python3 -m pytest tests/e2e/ -v                  21/21 pass (~20s)
  * 3 scenarios integration (dan + xstest + base64)
  * 18 observer unit tests
- Failure-message quality verified by injecting findings_include:
  [prompt_injection] into the benign xstest scenario — message
  includes scenario id, expected types, observed types, trace_id,
  and the full non-zero metrics delta.
- No regression on the L3 proxy_outcome assertions.

E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every
request) already landed in L3 and are unchanged here.

Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116).
epappas added a commit that referenced this pull request Apr 23, 2026
…attack detection (#97) (#122)

* feat(e2e): attack-scenario schema, validator, and 3 examples (#92)

Lock the YAML contract every e2e attack scenario follows. Blocks every
other loop in the framework tracked under #91.

New files:

- benchmarks/attacks/schema.json — JSON Schema (Draft 2020-12). Closed
  enums for family (10 values), proxy_outcome, recommended_action, and
  severity. Required fields: id, source, family, prompt, expected.
  judge_verdict block is optional so judge-disabled runs work. Uses
  additionalProperties: false at every object level so typos are
  rejected loudly.
- benchmarks/attacks/SCHEMA.md — human-readable reference: top-level
  fields, every enum with semantics, expectation comparators, skip
  block contract, what the schema does NOT validate (out-of-scope to
  triage loops).
- benchmarks/attacks/{prompt_injection,over_defense,encoding_evasion}/
  *.yaml — 3 hand-written canonical examples that double as schema
  documentation. All tagged 'pr-gate' so they exercise the L9 subset.
- scripts/e2e/validate_scenarios.py — walks benchmarks/attacks/**/
  *.yaml, validates each against schema.json, detects duplicate ids
  across the corpus, prints per-file summary, exits non-zero on any
  failure. Functions all under 50 LOC.
- requirements-e2e.txt — pinned jsonschema + PyYAML.

Modified:

- .github/workflows/ci.yml — new e2e-validate-scenarios job. No Rust
  toolchain, runs on every push/PR.

Verification:

- python3 scripts/e2e/validate_scenarios.py — 3/3 valid, exit 0
- failure-path sanity (exit 1 + actionable messages):
  * bad enum value
  * missing required field
  * unparseable YAML
  * duplicate ids across files
- python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))" — OK

* feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)

Loop E2E-L3 of the e2e adversarial test framework (umbrella #91).

Boots the LLMTrace proxy as a subprocess against an in-process FastAPI
mock upstream, fires every scenario YAML under benchmarks/attacks/ at
it, and asserts the per-scenario expected.proxy_outcome.at_* constraints.
Asserts proxy outcome only — metrics-delta and judge-verdict observability
land in Loops E2E-L4 / L5 / L6.

New files:

- tests/e2e/conftest.py — session-scoped fixtures:
  * proxy_config_path: copies the e2e config to a temp file.
  * mock_upstream: free-port FastAPI subprocess, /health-gated.
  * proxy: free-port llmtrace-proxy subprocess wired to the mock via
    LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_*
    env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then
    target/release/, then target/debug/.
  * scenarios: walks benchmarks/attacks/, parametrises tests by id;
    --family / --tag CLI filters; respects skip blocks.
  * Hard guard at collection time that rejects pytest-xdist (-n)
    because counter-diff observability (L4) requires serial execution.
  * Reliable teardown: SIGTERM-then-wait-10s in finally blocks for
    both subprocesses; verified no zombie processes remain.
- tests/e2e/test_cascade.py — first parametrised test. Outcome
  classifier maps response to allow/warn/block heuristically (refined
  by L6 DSL later). Failure messages include scenario id, expected vs
  observed, status, flagged header, and trace_id.
- tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible
  /v1/chat/completions response. Always returns the same assistant
  message; PR-gate tests assert on what LLMTrace decides, not on
  upstream content.
- tests/e2e/fixtures/config-e2e.yaml — judge OFF base config.
- tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow
  tier null (matches the cascade-null-slow PR-gate matrix dimension).
  Both configs:
  * shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so
    SIGTERM teardown completes inside the harness's 10 s budget).
  * auth.enabled: false (no API-key bootstrap needed in tests).
  * enforcement.mode: flag, min_severity: Low — surfaces findings as
    response headers (x-llmtrace-flagged) so the harness can observe
    them. Default mode is `log`, which detects findings but never
    surfaces them; the harness needs the response-header signal.
- tests/e2e/README.md — how to run, filter, inspect logs, and where
  things live.
- pytest.ini — repo-local config so CI doesn't depend on a parent
  pyproject.toml. Declares `serial` and `pr_gate` markers.

Modified:

- requirements-e2e.txt — pinned pytest, requests, prometheus-client,
  fastapi, uvicorn (in addition to L2's jsonschema + PyYAML).
- .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and
  .venv-e2e/ (local virtualenv).

Verification:

- pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s.
  * dan-classic-001 → warn (regex flagged 3 findings).
  * xstest-violence-question-001 → allow (correctly NOT flagged).
  * base64-command-001 → warn (regex flagged 4 findings).
- pgrep llmtrace-proxy / mock_upstream after exit → empty.
- Failure-message quality verified by tightening one xstest
  expectation: message includes id, expected vs observed, status,
  flagged value, trace_id.
- pytest --collect-only confirms rootdir = repo root and pytest.ini
  is picked up (no parent pyproject.toml leakage).

Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because
this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt.

* feat(e2e): metrics-delta + trace-id observer (#94)

Loop E2E-L4 of the e2e adversarial test framework (umbrella #91).

Per-scenario observability of the /metrics surface so harness
assertions can talk in terms of "this scenario produced N findings
of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6
(expectation DSL) build on.

New files:

- tests/e2e/observer.py — MetricsSnapshot + helpers.
  * fetch(url) / parse(text): read the Prometheus exposition format
    into a flat (sample_name, labels) -> value dict.
  * diff(before) / series(name, labels) / __contains__: counter
    subtraction, gauge "latest wins", histogram _count/_sum/_bucket
    diffs, subset-label matching that sums across unspecified labels.
  * `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created`
    so queries with or without the _total suffix both work.
  * render_nonzero() / render_assertion_context() produce the
    deterministic pretty-print that gets attached to every assertion
    failure message for triage.
  * fetch_after_until_settled(): LLMTrace records security findings
    in a background task that outlives the synchronous upstream
    response, so a naive MetricsSnapshot.fetch misses them. This
    helper polls /metrics until the delta plateaus across two reads
    (or a 10 s timeout).
  * collect_finding_types(): extracts observed finding_type labels
    from the llmtrace_security_findings_total delta for the
    findings_include assertion.

- tests/e2e/test_observer_unit.py — 18 unit tests covering parser,
  counter/gauge/histogram diffs, label-subset matching, render
  determinism, and the diff-self invariant. All tests use recorded
  /metrics text fixtures; no live proxy.

- tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt
  — hand-built fixtures exercising counter+gauge+histogram with
  overlapping and disjoint label sets.

Modified:

- tests/e2e/test_cascade.py — after each scenario fires, diffs
  /metrics (polling through fetch_after_until_settled when the
  scenario expects findings) and:
  * asserts every declared expected.findings_include finding_type
    appears in the delta,
  * attaches render_assertion_context(delta) to every failure
    message so the first reader sees what LLMTrace actually recorded,
    not just the expected-vs-observed summary.
  * marks test_scenario with @pytest.mark.serial (E2E-034).

- benchmarks/attacks/prompt_injection/dan-classic-001.yaml,
  benchmarks/attacks/encoding_evasion/base64-command-001.yaml —
  findings_include updated to match the actual finding_type values
  the ensemble emits (jailbreak for DAN, encoding_attack for the
  base64 payload). Inline comment explains the rationale so future
  authors pick stable labels over detector-specific ones.

Verification:

- python3 -m pytest tests/e2e/ -v                  21/21 pass (~20s)
  * 3 scenarios integration (dan + xstest + base64)
  * 18 observer unit tests
- Failure-message quality verified by injecting findings_include:
  [prompt_injection] into the benign xstest scenario — message
  includes scenario id, expected types, observed types, trace_id,
  and the full non-zero metrics delta.
- No regression on the L3 proxy_outcome assertions.

E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every
request) already landed in L3 and are unchanged here.

Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116).

* feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95)

Loop E2E-L5 of the e2e adversarial test framework (umbrella #91).

Stitches the async judge verdict surface back into per-scenario
assertions so the harness can declare expected.judge_verdict.* in the
scenario YAML and have it verified against the persisted verdict.

## Rust

- New `ServerConfig { debug_endpoints: bool }` (default false) on
  `ProxyConfig`. Production proxies must not enable this — debug
  routes return verdicts un-auth-gated by trace_id.

- New `crates/llmtrace-proxy/src/debug.rs` with
  `verdict_by_trace_id_handler`. Thin wrapper over the existing
  `JudgeVerdictStore::query_verdicts(JudgeVerdictQuery { trace_id, .. })`
  trait method (no new trait surface needed — Loop E2E-L1 audit
  finding E2E-003 already confirmed `JudgeVerdictQuery.trace_id`
  exists). Returns 200 + verdict JSON, 404 when absent or flag off,
  400 on non-UUID query param.

- `build_router` registers `GET /debug/judge/verdicts` only when
  `server.debug_endpoints: true`. When the flag is off the route is
  not mounted at all (axum's not-found handler returns 404).
  Operator gets a loud `WARN` log on boot when the flag is on.

- 4 new Rust integration tests in `main.rs::tests`:
  * 200 + verdict JSON when flag on + verdict exists
  * 404 when flag on but no verdict for trace
  * 404 when flag off (proves the route is NOT mounted)
  * 400 on non-UUID trace_id

## Python harness

- `tests/e2e/observer.py`:
  * `poll_judge_verdict(base_url, trace_id, timeout=10s)` — 250 ms
    polling against `/debug/judge/verdicts`. Returns the verdict dict
    on 200, None on timeout, raises HTTPError on other 4xx/5xx.
  * `shadow_would_block_count(delta, category=…, recommended_action=…)`
    — sums `llmtrace_judge_shadow_would_block_total` deltas; returns
    0.0 when the metric is absent so callers can treat absence and
    zero symmetrically.
  * `judge_backend_errored(delta)` — True iff
    `llmtrace_judge_requests_total{status="backend_error"}` ticked
    in the window.

- `tests/e2e/test_cascade.py` — wires `expected.judge_verdict.*` into
  the per-scenario assertions:
  * `is_threat`, `category`, `recommended_action.at_least/at_most`.
  * On verdict-not-found + judge_backend_errored=True: pytest.skip
    with explanation (degraded mode — provider/upstream flake, not
    LLMTrace regression).
  * On verdict-not-found + no backend_error: pytest.fail with the
    full metrics-delta context attached.

- `tests/e2e/fixtures/config-e2e-judge.yaml` — flips
  `server.debug_endpoints: true`, `action_router.enabled: true`, and
  adds `judge_route` to `default_actions` so the cascade fast tier
  actually runs and verdicts get persisted. Without these the worker
  spawns but never receives requests.

- `tests/e2e/conftest.py` — default config switched from
  `config-e2e.yaml` (judge OFF) to `config-e2e-judge.yaml` so the
  cascade-null-slow PR-gate matrix dimension is exercised on every
  run.

- 7 new observer unit tests covering shadow_would_block_count
  (filtered + unfiltered), judge_backend_errored (positive/negative/
  absent), and the empty-snapshot edge case. Total observer suite
  now 25 tests; integration suite 3.

## Docs

- `docs/guides/e2e-testing.md` — full operator guide: quick start,
  architecture diagram, debug endpoint contract with the production-
  safety call-out, judge collector lifecycle, degraded-mode handling,
  shadow-mode signal, failure-message anatomy, file map.

## Verification

- cargo test -p llmtrace --bin llmtrace-proxy   11/11 ok
- cargo clippy --workspace -- -D warnings       clean
- cargo fmt --check                             clean
- pytest tests/e2e/ -v                          28/28 ok in ~45s

L1 audit corrections honoured:
- E2E-040, E2E-041 (proposed `find_by_trace_id` trait method)
  remain SUPERSEDED — `query_verdicts` is the entry point used here.

Stacked on feat/issue-94-e2e-metrics-observer (#118). The whole
stack (#115#116#118 → this) was rebased onto main after
#114 (L1a) and #117 (rustls bump) merged earlier today.

* feat(e2e): expectation DSL + assertion helpers (#96)

Loop E2E-L6 of the e2e adversarial test framework (umbrella #91).

Formalises the per-comparator if-blocks that L3/L4/L5 accumulated in
test_cascade.py into a small, pure DSL. The orchestrator returns one
AssertionResult per declared comparator; the test wrapper aggregates
into a pytest verdict. Makes the harness extensible for L7 (50-scenario
seed corpus) and L8 (upstream-fell-for-it detector) without growing
test_cascade.py further.

New files:

- tests/e2e/expect.py — the DSL.
  * `Severity`, `ProxyOutcome`, `RecommendedAction` IntEnums (E2E-053).
    Total ordering with parse() / .label round-trip and explicit
    rejection of unknown labels.
  * `AssertionResult { comparator, passed, soft, message, fields }`
    dataclass (E2E-050).
  * One `_compare_*` helper per supported `expected.*` key (E2E-052):
      - proxy_outcome.at_least / at_most
      - findings_include
      - findings_min_severity (NEW — schema declared it but no loop
        had wired it up; uses the peak severity across the per-scenario
        delta of llmtrace_security_findings_total)
      - judge_verdict.is_threat
      - judge_verdict.category
      - judge_verdict.recommended_action.at_least / at_most
  * `assert_scenario(scenario, response, delta, verdict, judge_degraded)`
    orchestrator (E2E-051). Pure: no I/O, all inputs handed in.
  * `classify_proxy_outcome(response)` returning ProxyOutcome (the
    enum form of L3's allow/warn/block heuristic).
  * `render_assertion_summary(results)` for failure-message rendering
    with `[ok]`/`[soft]`/`[FAIL]` per-row markers.
  * Unknown top-level OR judge-block keys produce explicit failure
    rows so typos in scenario YAML cannot silently skip an assertion.

- tests/e2e/test_expect_unit.py — 26 unit tests (E2E-054), every
  comparator with at least one passing + one failing case so failure-
  message wording is exercised. Synthesises responses, deltas, and
  verdicts in-process; no proxy boot, no network.

Modified:

- tests/e2e/test_cascade.py — collapsed from 196 lines of
  per-comparator if-blocks to 117 lines that wire I/O into
  assert_scenario(). Hard failures → pytest.fail with the assertion
  summary + metrics-delta context. All-soft failures → pytest.skip
  (degraded judge tier; not LLMTrace's fault).

- docs/guides/e2e-testing.md — comparator reference table,
  per-comparator semantics, soft-vs-hard aggregation rules, and the
  "adding a new comparator" three-step recipe.

Verification:

- pytest tests/e2e/                              54/54 ok in ~40s
  * 3 integration scenarios (dan, xstest, base64)
  * 25 observer unit tests (unchanged)
  * 26 expect.py unit tests (new)
- Failure-message quality verified by deliberately tightening the
  benign xstest scenario with bogus findings_include + an aggressive
  findings_min_severity. The resulting pytest.fail message lists
  every comparator with a marker, the missing finding types, the
  observed peak severity, and the full non-zero metrics delta.

E2E-053 Severity IntEnum is Info < Low < Medium < High < Critical
matching the proxy's `SecuritySeverity`.

Stacked on feat/issue-95-e2e-judge-verdict-collector (#119).

* feat(security): add rot13/leetspeak encoding-attack detection; triage 8 corpus gaps

RegexSecurityAnalyzer.detect_injection_patterns now also fires
detect_rot13_injection and detect_leetspeak_injection, both emitting
finding_type="encoding_attack". The jailbreak detector already handled
these via "jailbreak" findings; the regex analyzer was only doing base64.

Three new unit tests validate the new detectors in isolation. All 1025
existing security tests pass unchanged.

Corpus (50 scenarios):
- enc-003 (rot13) and enc-004 (leetspeak) now pass end-to-end.
- 8 harmbench/jailbreakbench scenarios relaxed to proxy_outcome.at_most:
  warn with known-gap annotation. These are harmful-content requests, not
  injection attacks; the proxy is an injection detector, not a content
  moderator.

Also: pytest.ini gains pythonpath=. so CI does not need PYTHONPATH export.
@epappas
Copy link
Copy Markdown
Collaborator Author

epappas commented Apr 23, 2026

Superseded by #122 (squash-merged into main as commit fc71a5a). All L1-L6 changes landed in that single squash.

@epappas epappas closed this Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant