fix(test-suites): set external_cost_cents on confirmed paid-vendor suites #49
Merged
…ites

Per the 2026-05-04 paid-vendor audit + DEC-20260504-A audit-followup test coverage protocol. Stops the hourly-cadence bleed on Dilisense, eSortcode, and Anthropic Sonnet calls that PR #46 inadvertently amplified by moving the scheduler from 24h to 1h cadence.

Scope (intentionally tight; ground truth verified via executor source):
- pep-check, sanctions-check, adverse-media-check (Dilisense): 2-char name guard, then an unconditional fetch to api.dilisense.com per call. No free-probe path; no input bypass. Every scheduled run = 1 paid Dilisense call. → external_cost_cents = 1.
- uk-cop-check (eSortcode Pay.UK): the manifest currently passes test_outcome=MATCHED, for which eSortcode burns no credits, but leaving the cost at 0 still misclassifies the suite. A future test-input change could start burning credits unexpectedly. → external_cost_cents = 1 per the user's future-proofing instruction.
- risk-narrative-generate: Sonnet 4.6 (DEFAULT_MODEL constant in the executor), max_tokens 1500. Conservative cost upper bound: 4K input × $3/MTok + 1500 output × $15/MTok ≈ $0.034 ≈ €0.031. Rounded to 3 cents so the scheduler skips it (any non-zero value excludes a suite from the hourly cadence) and the cost reflects real magnitude. → external_cost_cents = 3.

Filter (idempotent + scope-bounded):
- active = true
- test_mode = 'live' (skip 'fixture' = saved data; skip 'canary' = existing non-zero values preserved)
- test_type IN ('known_answer', 'edge_case', 'negative', 'known_bad')
- external_cost_cents = 0 (preserve existing manual values)

Excludes by design (zero-cost-by-design test types):
- schema_check (dry-run mode, no API call)
- dependency_health (zero-cost auth-less probe per CLAUDE.md Principle A: skipAuth: true on the probe means a 401 proves connectivity without consuming quota)
- piggyback (not scheduled; populated by customer traffic)

Expected updated rows: 22 (16 Dilisense/eSortcode at 1¢ + 6 risk-narrative-generate at 3¢). Pre-flight verified against prod; suite IDs in PR description.

NOT in scope (explicitly deferred per user instruction; separate to-dos):
- Anthropic-Haiku bulk set (~80 caps): per-call cost depends on input/output token volume; a flat 1 cent gives a misleading false safety signal. Defer to a separate PR with proper per-call cost estimation (read max_tokens per executor + estimate typical input).
- Browserless suites (37 caps): Browserless billing is per-minute, not per-call; mapping it onto external_cost_cents requires a different model.
- Capability-level misclassifications (~8-12 caps tagged maintenance_class = 'commercial-stable-api' but actually free): Companies House, AviationStack free tier, RVO/Kadaster, Arbetsförmedlingen, CommonCrawl. Fix the maintenance_class classification on the capability, not the suite cost. Separate to-do.

Migration shape: drizzle/0062_paid_vendor_suite_cost.sql + idempotent runtime block in scripts/apply-migrations.ts. The number gap from 0059 to 0062 leaves 0060 + 0061 free for the open feat/marketplace-eligible-flag (PR #42) and feat/retire-solutions-and-web3-assurance (PR #45) branches; the migration-prefix lint catches collisions either way.

Post-condition assertion baked into both the SQL migration (DO $$ RAISE EXCEPTION on remaining_zero > 0) and the apply-migrations.ts block (logs remaining-at-zero count). If a new paid-vendor suite landed at cost=0 between audit and apply, the assertion fails with a clear message rather than letting the suite quietly bleed.

Tests: 3 new in src/jobs/paid-vendor-suite-cost.test.ts. Compiles the UPDATE SQL via PgDialect.sqlToQuery and asserts:
- Slug whitelist (4 + 1) is exact; bystander slugs are absent
- test_mode = 'live' filter present; fixture/canary excluded
- known_answer/edge_case/negative/known_bad included; schema_check/dependency_health/piggyback excluded
- external_cost_cents = 0 idempotency guard present
- scope-sanity test pins the in-scope slug set so a scope-creep edit (e.g. adding ~80 Anthropic-Haiku slugs) trips a test failure instead of silently shipping

Verification:
- tsc --noEmit clean
- vitest run: 450 pass, 11 skipped, 0 fail (3 new regression tests)
- lint:no-bare-catch / lint:no-new-console / lint:migration-prefixes all clean (62 migrations, no prefix collisions)
- 22 rows verified pre-flight against prod (suite IDs listed in PR description)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
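The upper-bound arithmetic behind the 3¢ figure can be reproduced with a short sketch. The function names here are hypothetical and do not come from the repository; the token counts and per-MTok prices are the ones quoted in the commit message above.

```typescript
// Upper-bound cost of a single LLM call in USD, given per-million-token prices.
function upperBoundCostUsd(
  inputTokens: number,
  outputTokens: number,
  inputUsdPerMTok: number,
  outputUsdPerMTok: number,
): number {
  const inputCost = (inputTokens / 1e6) * inputUsdPerMTok;
  const outputCost = (outputTokens / 1e6) * outputUsdPerMTok;
  return inputCost + outputCost;
}

// Hypothetical helper mirroring the rule stated in the commit message:
// any non-zero stored cost excludes a suite from the hourly cadence.
function isScheduledHourly(externalCostCents: number): boolean {
  return externalCostCents === 0;
}

// Figures quoted above: 4K input tokens, max_tokens 1500,
// $3/MTok input and $15/MTok output.
const riskNarrativeUsd = upperBoundCostUsd(4000, 1500, 3, 15);
// 4000 / 1e6 * 3 = 0.012 and 1500 / 1e6 * 15 = 0.0225, so about 0.0345 USD per call.
```

The exact stored value matters less than it being non-zero: under the rule above, a suite stored at 3 is no longer picked up by the hourly run.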
petterlindstrom79 added a commit that referenced this pull request on May 4, 2026
…sification dependency (#50)

Defensive note added per the 2026-05-04 paid-vendor audit follow-up.

The eSortcode CoP API does not deduct credits when `test_outcome` is set on the request: the executor (apps/api/src/capabilities/uk-cop-check.ts:88-112) passes it through verbatim, and eSortcode returns a deterministic test response. This makes the test scheduler's runs against the current fixture cost-free in practice, while a real production call (no `test_outcome` parameter) bills per call against the Pay.UK CoP scheme.

PR #49 set test_suites.external_cost_cents = 1 for uk-cop-check live suites despite the bypass. The rationale (per user instruction) was that classification should reflect the per-call billing model, so a future fixture change that silently loses the bypass doesn't quietly start burning credits.

This commit closes the gap by documenting the dependency in the manifest itself:
- Multi-line comment block above test_fixtures explains the bypass + the cost-classification implication.
- Inline comment on each `test_outcome: MATCHED` line points back to the explanation, so an editor scanning the input fields sees the warning at the line they're about to change.

YAML-only change. No code path. Manifest pipeline reads the file for content; comments are not roundtripped to DB but are visible to any engineer reading or editing the file.

Verified the YAML still parses cleanly via js-yaml after the edit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
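The dependency described in this commit could also be guarded mechanically. The sketch below assumes a fixture input is a plain key/value object; `usesTestBypass` and the `FixtureInput` type are hypothetical names, not part of the manifest pipeline. Only the `test_outcome` field name and the `MATCHED` value come from the commit message.

```typescript
// Hypothetical shape: a fixture's input fields as parsed from the manifest.
type FixtureInput = Record<string, unknown>;

// Returns true when the request would carry `test_outcome`, i.e. when,
// per the commit message, eSortcode returns a deterministic test response
// and deducts no credits.
function usesTestBypass(input: FixtureInput): boolean {
  const value = input["test_outcome"];
  return typeof value === "string" && value.length > 0;
}

// Current fixture shape: bypass present, run is cost-free in practice.
const currentFixture: FixtureInput = { test_outcome: "MATCHED" };

// A fixture edit that drops the field would bill per call.
const editedFixture: FixtureInput = {};
```

A check like this in a manifest test would turn the comment's warning into a failing assertion if the field is ever removed.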
4 tasks
petterlindstrom79 added a commit that referenced this pull request on May 4, 2026
…ply-migrations.ts (#51)

Recovery from the 2026-05-04 PR-#42 deploy outage.

Root cause: apps/api/scripts/apply-migrations.ts was a dead file. The Dockerfile CMD ran the API server directly with no pre-start migration hook, AND apps/api/tsconfig.json's `rootDir: "./src"` excluded the script from the build entirely. Every block we shipped through that file (PR #29 actual_cost_cents, PR #42 marketplace_eligible, PR #49 paid-vendor cost UPDATEs) silently never ran in production. PR #42's deploy outage made this visible: the new code referenced columns that the migration was supposed to add, but the migration never ran, so every public-surface request 500'd until the columns were applied manually.

Fix:
- New `src/lib/startup-migrations.ts` exporting `runStartupMigrations()` and one function per migration block (0028, 0029, 0060, 0062). Each block is independently testable via an injected MigrationExecutor stub.
- Wired into `index.ts:69` BEFORE `validateSchema()`, BEFORE the API listens, BEFORE any scheduler / job boots. Blocking: if any block throws, the process aborts via the existing `main().catch(...)`.
- Per-block structured logging (`startup-migration-block` label with block id, outcome, rows_affected, duration_ms). Top-level `startup-migrations-complete` summary at the end. Pre-fix, the dead script wrote to stdout via console.log; the new code uses pino through `lib/log.js` (passes lint:no-new-console).
- Block 0062 (paid-vendor) gains a post-condition `RAISE`-equivalent: if any of the 5 paid-vendor capabilities' active live non-probe suites still has external_cost_cents = 0 after the UPDATEs run, the function throws (would abort boot). Rationale: a new paid suite landing at cost=0 between deploys is exactly the silent bleed pattern this whole exercise is fixing.
- Idempotency rules per block:
  - 0028 (sqs_daily_snapshot): information_schema check + skip; CREATE TABLE / INDEX statements use IF NOT EXISTS as defence-in-depth.
  - 0029 (actual_cost_cents): information_schema check + skip (column has NOT NULL DEFAULT 0; ADD COLUMN IF NOT EXISTS isn't quite enough on its own without re-applying the default to a pre-existing column).
  - 0060 (marketplace_eligible): two ADD COLUMN IF NOT EXISTS unconditional; Postgres-level no-op on re-run.
  - 0062 (paid-vendor costs): UPDATE WHERE external_cost_cents = 0; second run finds zero matching rows.

Deleted apps/api/scripts/apply-migrations.ts. The historical admin endpoint at routes/internal-tests.ts:/admin/apply-migrations is left in place; it's a separate manual-recovery path with its own (older, incomplete) block list. Action 3 audit will surface what to do with it.

Tests: 9 new in src/lib/startup-migrations.test.ts using a stub MigrationExecutor (PgDialect.sqlToQuery for SQL shape; canned-result queue for behaviour). Per DEC-20260504-A audit-followup test coverage protocol, every block gets:
- First-run: assert the block runs the expected DDL/DML and reports the expected outcome.
- Second-run idempotency: assert that on re-run the block either skips (information_schema check returns 1) or matches zero rows (WHERE filter on the post-state).
- Block 0062 also gets a post-condition-violation test: stub the post-check to return `remaining_zero: 1`, assert the function throws (the failure-aborts-boot semantics).

The new structure has a single source of truth for migration blocks, runs synchronously at boot, and is regression-tested. Dead-file silent-failure mode eliminated.

Verification:
- tsc --noEmit clean
- vitest run: 469 pass, 11 skipped, 0 fail (9 new regression tests)
- lint:no-bare-catch / lint:no-new-console / lint:migration-prefixes all clean

Manual recovery completed prior to this commit:
- Migration 0060 marketplace_eligible columns applied to prod at 16:31 UTC 2026-05-04 (during the active outage; identical SQL to the migration file, ADD COLUMN IF NOT EXISTS, non-destructive).
- PR #49's paid-vendor UPDATEs applied to prod at ~22:00 UTC 2026-05-04 (22 rows: 16 Dilisense+eSortcode at 1¢ + 6 risk-narrative-generate at 3¢; idempotent re-run produced 0 changes).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
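The block-plus-stub pattern described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `src/lib/startup-migrations.ts`: the `MigrationExecutor` interface shown here, the synchronous `execute` signature, the `runBlock0062` and `stubExecutor` names, and the exact SQL text are all assumptions made for the example. Only the filter conditions, the slug set, and the `remaining_zero` post-check come from the commit messages.

```typescript
type QueryResult = { rowCount: number; rows: Array<Record<string, unknown>> };

// Assumed interface; kept synchronous for brevity (a real driver would be async).
interface MigrationExecutor {
  execute(sql: string): QueryResult;
}

const LIVE_NON_PROBE =
  "active = true AND test_mode = 'live' " +
  "AND test_type IN ('known_answer','edge_case','negative','known_bad')";
const ONE_CENT_SLUGS = "'pep-check','sanctions-check','adverse-media-check','uk-cop-check'";

// Runs both UPDATEs, then the post-condition check; throws if any in-scope
// suite is still at zero. Returns the total number of rows updated.
function runBlock0062(db: MigrationExecutor): number {
  const oneCent = db.execute(
    `UPDATE test_suites SET external_cost_cents = 1 WHERE capability_slug IN (${ONE_CENT_SLUGS}) ` +
      `AND ${LIVE_NON_PROBE} AND external_cost_cents = 0`,
  );
  const threeCents = db.execute(
    "UPDATE test_suites SET external_cost_cents = 3 WHERE capability_slug = 'risk-narrative-generate' " +
      `AND ${LIVE_NON_PROBE} AND external_cost_cents = 0`,
  );
  const check = db.execute(
    "SELECT count(*) AS remaining_zero FROM test_suites " +
      `WHERE capability_slug IN (${ONE_CENT_SLUGS},'risk-narrative-generate') ` +
      `AND ${LIVE_NON_PROBE} AND external_cost_cents = 0`,
  );
  const remaining = Number(check.rows[0]?.remaining_zero ?? 0);
  if (remaining > 0) {
    throw new Error(`block 0062: ${remaining} paid-vendor suite(s) still at external_cost_cents = 0`);
  }
  return oneCent.rowCount + threeCents.rowCount;
}

// Canned-result queue: each execute() call returns the next queued result.
function stubExecutor(queue: QueryResult[]): MigrationExecutor {
  return {
    execute: () => {
      const next = queue.shift();
      if (!next) throw new Error("stub queue exhausted");
      return next;
    },
  };
}
```

With this shape, the first-run, second-run, and post-condition-violation cases each reduce to a different canned queue handed to the same function.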
This was referenced May 4, 2026
Merged
petterlindstrom79 added a commit that referenced this pull request on May 4, 2026
…Anthropic-Haiku bleed (#55)

Block 0063: sibling of block 0062 (PR #49 paid-vendor classification), single-capability variant for invoice-extract.

## Why

Prod query 2026-05-04 found all 4 active non-probe live test suites (known_answer, edge_case, negative, known_bad) for `invoice-extract` at `external_cost_cents = 0`. Per DEC-20260503-B the scheduler skips suites with `external_cost_cents > 0` for paid vendors; at 0, they were being scheduled hourly and paying Anthropic Haiku vision to "extract invoice fields" from `httpbin.org/image/jpeg` (a JPEG of a dog). The fixture issue is hygiene-only; this block is the structural fix that flips the scheduler-skip semantic.

## What

    UPDATE test_suites
    SET external_cost_cents = 1
    WHERE capability_slug = 'invoice-extract'
      AND active = true
      AND test_mode = 'live'
      AND test_type IN ('known_answer','edge_case','negative','known_bad')
      AND external_cost_cents = 0

dependency_health and schema_check are explicitly excluded: they use the auth-less probe pattern (no paid call) and legitimately stay at 0. The 1¢ floor matches the PR #49 / block 0062 defensible-minimum pattern; real Haiku-vision-on-small-JPEG cost is below 1¢, but the floor's operational role is purely the scheduler-skip flip.

## Idempotency

The `external_cost_cents = 0` filter in the WHERE clause makes the UPDATE a no-op on re-run. Post-condition assertion fails boot if any new active live non-probe invoice-extract suite shows up at 0.

## Per active protocols

- DEC-20260504-A: 3 regression tests cover (a) first-run UPDATE shape including the test_type list and the negative assertion that probes are excluded, (b) idempotent re-run zero-rows, (c) post-condition violation throws.
- DEC-20260504-C: this is a startup-migrations block; PR #51/#52 wired runStartupMigrations() to fire blocking before validateSchema() at every API restart. No deploy-mechanism risk.

Notion to-do 35667c87-082c-817f-9905-cfad380b8ae3 (invoice-extract fixture). Fixture itself left as the dog JPEG intentionally: once the suite is no longer scheduled, the fixture quality is moot.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
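Why the `external_cost_cents = 0` guard makes the UPDATE idempotent can be seen with an in-memory model of the table. This is an illustrative sketch only: `SuiteRow`, `matchesBlock0063`, and `applyBlock0063` are hypothetical names, and the array stands in for `test_suites`.

```typescript
// In-memory stand-in for a test_suites row (field names are camelCase analogues).
type SuiteRow = {
  capabilitySlug: string;
  active: boolean;
  testMode: string;
  testType: string;
  externalCostCents: number;
};

const NON_PROBE_TYPES = ["known_answer", "edge_case", "negative", "known_bad"];

// Mirrors the WHERE clause of the block 0063 UPDATE.
function matchesBlock0063(row: SuiteRow): boolean {
  return (
    row.capabilitySlug === "invoice-extract" &&
    row.active &&
    row.testMode === "live" &&
    NON_PROBE_TYPES.includes(row.testType) &&
    row.externalCostCents === 0
  );
}

// Applies the 1-cent floor in place and returns the number of rows changed,
// analogous to the UPDATE's affected-row count.
function applyBlock0063(rows: SuiteRow[]): number {
  let changed = 0;
  for (const row of rows) {
    if (matchesBlock0063(row)) {
      row.externalCostCents = 1;
      changed += 1;
    }
  }
  return changed;
}

const suites: SuiteRow[] = [
  ...NON_PROBE_TYPES.map((testType) => ({
    capabilitySlug: "invoice-extract", active: true, testMode: "live", testType, externalCostCents: 0,
  })),
  // Probe suites are outside the test_type list and stay at 0.
  { capabilitySlug: "invoice-extract", active: true, testMode: "live", testType: "dependency_health", externalCostCents: 0 },
  { capabilitySlug: "invoice-extract", active: true, testMode: "live", testType: "schema_check", externalCostCents: 0 },
];
```

Because every row the first pass touches no longer satisfies the guard, a second pass over the same rows changes nothing.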
This was referenced May 11, 2026
petterlindstrom79 added a commit that referenced this pull request on May 11, 2026
…ly testing (#85)

Phase 1 (Contain): startup-migrations block 0064 sets external_cost_cents to 1¢ on the 73 always-LLM Haiku capabilities whose suites were still at 0 after PR #49 (which explicitly deferred the bulk-set). The test scheduler dispatch query (apps/api/src/jobs/test-scheduler.ts:262) filters on external_cost_cents = 0, so until this fix every always-LLM Haiku cap was being executed hourly with real Anthropic billing.

Phase 2 (Understand): see Notion Journal entry for the named failure pattern ("Compound-PR cost leak") and audit PR #84 for the full ramp attribution.

Phase 3 (Harden): new llm-capability-costs.ts holds the canonical map + CONDITIONAL_LLM_CAPABILITIES exclusion set. New CI test walks apps/api/src/capabilities/*.ts and fails if any @anthropic-ai/sdk importer is not registered as always-LLM, conditional-LLM, or deactivated. Adding a new LLM-using capability without registering its cost fails CI.

Idempotency: 0064 filters on `external_cost_cents = 0` in the WHERE clause + a post-condition SELECT that throws on remaining_zero > 0. Mirrors the PR #49 / PR #55 pattern.

Slug count: 73 always-LLM Haiku caps (audit PR #84 estimated ~50; the full grep + filter against DEACTIVATED + CONDITIONAL_LLM yields 73, which is between 30 and 80 and so does not trip the prompt's <30 / >80 halt thresholds).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
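The Phase 3 registration check can be sketched as a pure function over file contents. This is an assumption-laden illustration: the real CI test reads apps/api/src/capabilities/*.ts from disk and uses the registry in llm-capability-costs.ts, whereas here the files are passed in as strings, and `findUnregisteredLlmCapabilities`, `CapabilityFile`, and the example slugs are hypothetical.

```typescript
// Hypothetical in-memory stand-in for one capability source file.
type CapabilityFile = { slug: string; source: string };

// Returns the slugs of files that import the Anthropic SDK but appear in
// none of the three registries. An empty result means the check passes.
function findUnregisteredLlmCapabilities(
  files: CapabilityFile[],
  alwaysLlm: ReadonlySet<string>,
  conditionalLlm: ReadonlySet<string>,
  deactivated: ReadonlySet<string>,
): string[] {
  return files
    .filter((file) => file.source.includes("@anthropic-ai/sdk"))
    .map((file) => file.slug)
    .filter((slug) => !alwaysLlm.has(slug) && !conditionalLlm.has(slug) && !deactivated.has(slug))
    .sort();
}

// Example slugs are invented for illustration only.
const exampleFiles: CapabilityFile[] = [
  { slug: "example-summarise", source: 'import Anthropic from "@anthropic-ai/sdk";' },
  { slug: "example-classify", source: 'import Anthropic from "@anthropic-ai/sdk";' },
  { slug: "example-hash", source: 'import { createHash } from "node:crypto";' },
];

const unregistered = findUnregisteredLlmCapabilities(
  exampleFiles,
  new Set(["example-summarise"]), // registered as always-LLM
  new Set<string>(),              // conditional-LLM
  new Set<string>(),              // deactivated
);
// "example-classify" imports the SDK but is registered nowhere, so it is reported.
```

Matching on the import string is a deliberately coarse signal: it errs on the side of flagging a file, which fits a check whose purpose is to force an explicit cost decision.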
7 tasks
Summary
Per the 2026-05-04 paid-vendor audit + DEC-20260504-A audit-followup test coverage. Stops the hourly-cadence bleed on Dilisense, eSortcode, and Anthropic Sonnet calls that PR #46 inadvertently amplified by moving the scheduler from 24h to 1h cadence.
Scope (intentionally tight — verified ground-truth via executor source)
- `pep-check`, `sanctions-check`, `adverse-media-check` (Dilisense): 1¢
- `uk-cop-check` (eSortcode Pay.UK): 1¢
- `risk-narrative-generate`* (Anthropic Sonnet 4.6): 3¢
`uk-cop-check` currently passes `test_outcome=MATCHED`, for which eSortcode burns no credits. It is set to 1¢ anyway because leaving it at 0 misclassifies the suite, and a future test-input change could start burning credits unexpectedly.
* `risk-narrative-generate` cost derivation: the executor pins `claude-sonnet-4-6` with `max_tokens: 1500`. Sonnet 4.6 pricing is $3/MTok input + $15/MTok output. Conservative upper bound: 4K input × $3/MTok + 1.5K output × $15/MTok ≈ $0.034 ≈ €0.031. Rounded to 3 cents so the scheduler skips it and the cost reflects real magnitude.
Filter (idempotent + scope-bounded)
- `active = true`
- `test_mode = 'live'` (skips `fixture` and `canary`)
- `test_type IN ('known_answer', 'edge_case', 'negative', 'known_bad')`
- `external_cost_cents = 0` (preserves existing manual values)
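As a rough illustration of the filter, the predicate below decides which cost, if any, a suite row would receive. It is a sketch, not the migration itself: the `TestSuite` type, the camelCase field names, and `targetCostCents` are hypothetical stand-ins for the `test_suites` columns and the SQL WHERE clause.

```typescript
// Hypothetical in-memory analogue of a test_suites row.
type TestSuite = {
  capabilitySlug: string;
  active: boolean;
  testMode: "live" | "fixture" | "canary";
  testType: string;
  externalCostCents: number;
};

// In-scope slugs and their target cost in cents, as listed in the Scope section.
const TARGET_CENTS = new Map<string, number>([
  ["pep-check", 1],
  ["sanctions-check", 1],
  ["adverse-media-check", 1],
  ["uk-cop-check", 1],
  ["risk-narrative-generate", 3],
]);

const PAID_TEST_TYPES = new Set(["known_answer", "edge_case", "negative", "known_bad"]);

// Returns the cost the UPDATE would set, or null when the row is out of scope.
function targetCostCents(suite: TestSuite): number | null {
  const cents = TARGET_CENTS.get(suite.capabilitySlug);
  if (cents === undefined) return null;                 // bystander slug
  if (!suite.active) return null;
  if (suite.testMode !== "live") return null;           // fixture and canary are skipped
  if (!PAID_TEST_TYPES.has(suite.testType)) return null; // probes, dry runs, piggyback
  if (suite.externalCostCents !== 0) return null;       // preserve existing manual values
  return cents;
}
```

The last guard is what makes a repeat run a no-op: once a row has a non-zero cost, it no longer matches.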
Excluded by design (zero-cost-by-design test types):
- `schema_check`: dry-run mode, no API call
- `dependency_health`: zero-cost auth-less probe per CLAUDE.md Principle A (`skipAuth: true` returns 401 without consuming quota)
- `piggyback`: not scheduled; populated by customer traffic
Pre-flight verified against prod (22 rows total)
16 at 1¢ + 6 at 3¢ = 22 rows total. Some rows have `fixture` in their `test_name` but `test_mode='live'`; the name is a holdover, and the executor calls the real API regardless of name.
Migration shape
- `drizzle/0062_paid_vendor_suite_cost.sql`: canonical SQL (claims 0062 to leave 0060/0061 free for the open PRs #42 "feat(capability): add marketplace_eligible flag + onboarding classification gate" and #45 "feat(api): retire solutions surface; delete Web3 Assurance code"; the migration-prefix lint catches collisions either way).
- `scripts/apply-migrations.ts`: idempotent runtime block. Logs updated-row counts + remaining-at-zero post-check.
- If any in-scope suite still has `external_cost_cents = 0` after the migration runs, the SQL `RAISE EXCEPTION`s and the runtime applier logs a `WARNING`.
Tests (DEC-20260504-A audit-followup test coverage)
3 new in `src/jobs/paid-vendor-suite-cost.test.ts`. Compiles the UPDATE SQL via `PgDialect.sqlToQuery` and asserts:
- Bystander slugs (`risk-narrative-generate`, `us-company-data-cobalt`, `google-search`, `translate`) are absent from the Dilisense UPDATE
- `test_mode = 'live'` filter present (excludes fixture/canary)
- `known_answer` / `edge_case` / `negative` / `known_bad` included; `schema_check` / `dependency_health` / `piggyback` excluded
- `external_cost_cents = 0` idempotency guard present
NOT in scope (explicitly deferred; separate to-dos)
- Anthropic-Haiku bulk set (~80 caps). Per-call cost depends on input/output token volume; a flat 1 cent gives a misleading false safety signal. Needs a per-executor `max_tokens` + typical-input estimate. Defer.
- Browserless suites (37 caps). Browserless billing is per-minute, not per-call. Mapping onto `external_cost_cents` needs a different model. Defer.
- Capability-level misclassifications. ~8–12 caps tagged `maintenance_class = 'commercial-stable-api'` but actually free upstream:
  - `uk-company-data`, `uk-companies-house-officers` (Companies House: free public registry)
  - `flight-status` (AviationStack: has a free tier)
  - `backlink-check` (CommonCrawl: free)
  - `job-board-search` (Arbetsförmedlingen: Swedish public employment service, free)
  - `nl-bag-address`, `nl-energy-label` (RVO / Kadaster: free Dutch government APIs)
  - `address-parse`, `pii-redact`, `fake-data-generate`, `skill-extract` (data_source says "Algorithmic …"; should be `pure-computation`, not `commercial-stable-api`)
  Fix the `maintenance_class` on the capability, not the suite cost. Separate to-do.
Verification
- `tsc --noEmit` clean
- `vitest run`: 450 pass, 11 skipped, 0 fail (3 new regression tests)
- `lint:no-bare-catch`, `lint:no-new-console`, `lint:migration-prefixes` (62 migrations, no prefix collisions)
- `apply-migrations.ts` logs `Dilisense/eSortcode suites updated: 16` and `risk-narrative-generate suites updated: 6`; `paid-vendor remaining-at-zero post-check: 0`
Deploy posture
Auto-deploys on merge. Migration runs idempotently on every restart (no-op after first apply). Disk impact: negligible — 22 row UPDATEs.
🤖 Generated with Claude Code