fix(test-suites): set external_cost_cents on confirmed paid-vendor suites by petterlindstrom79 · Pull Request #49 · strale-io/strale

petterlindstrom79 · 2026-05-04T13:27:17Z

Summary

Per the 2026-05-04 paid-vendor audit + DEC-20260504-A audit-followup test coverage. Stops the hourly-cadence bleed on Dilisense, eSortcode, and Anthropic Sonnet calls that PR #46 inadvertently amplified by moving the scheduler from 24h to 1h cadence.

Scope (intentionally tight — verified ground-truth via executor source)

Capability	Vendor	Per-call billing?	New cost
`pep-check`	Dilisense	Yes	1¢
`sanctions-check`	Dilisense	Yes	1¢
`adverse-media-check`	Dilisense (+ Serper fallback)	Yes	1¢
`uk-cop-check`	eSortcode	Future-proofing flag*	1¢
`risk-narrative-generate`	Anthropic Sonnet 4.6	Yes	3¢

*uk-cop-check currently passes test_outcome=MATCHED which eSortcode burns no credits for. Setting to 1¢ anyway because the classification is wrong as a flag and a future test-input change could start burning credits unexpectedly.

risk-narrative-generate cost derivation: Executor pins claude-sonnet-4-6 with max_tokens: 1500. Sonnet 4.6 pricing is $3/MTok input + $15/MTok output. Conservative upper bound: 4K input tokens × $3/MTok + 1.5K output tokens × $15/MTok = $0.012 + $0.0225 ≈ $0.034 ≈ €0.031. Set to 3 cents (the nearest whole cent) so the scheduler skips it and the cost reflects real magnitude.
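The upper-bound arithmetic can be checked with a few lines of TypeScript. This is a minimal sketch: the prices and token counts come from the derivation above, while the function and constant names are hypothetical and not taken from the executor.

```typescript
// Per-million-token prices quoted in the derivation (USD).
const INPUT_USD_PER_MTOK = 3;
const OUTPUT_USD_PER_MTOK = 15;

// Upper-bound USD cost of one call for the given token counts.
function upperBoundUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_USD_PER_MTOK +
    (outputTokens / 1_000_000) * OUTPUT_USD_PER_MTOK
  );
}

// 4K input is the assumed ceiling; 1500 output is the executor's max_tokens.
const usd = upperBoundUsd(4_000, 1_500); // 0.012 + 0.0225
const usdCents = usd * 100;
```

The result is roughly 3.45 US cents per call, which is the magnitude the stored value of 3 is meant to reflect. The euro conversion is left out because the exchange rate is not stated in the source.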

Filter (idempotent + scope-bounded)

WHERE capability_slug IN (...)
  AND active = true
  AND test_mode = 'live'                 -- not fixture (saved data) / canary (existing non-zero)
  AND test_type IN (
    'known_answer', 'edge_case',
    'negative', 'known_bad'              -- exclude schema_check / dependency_health / piggyback
  )
  AND external_cost_cents = 0            -- preserve existing manual values; idempotent
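To make the idempotency argument concrete, the filter can be mirrored as an in-memory predicate. This is a sketch only: `TestSuite`, `matchesFilter`, and `applyCost` are hypothetical names, and the real change is the SQL UPDATE, not TypeScript code.

```typescript
type TestSuite = {
  capability_slug: string;
  active: boolean;
  test_mode: string;
  test_type: string;
  external_cost_cents: number;
};

// Test types that make a real vendor call; probes and piggyback are excluded.
const PAID_TEST_TYPES: ReadonlySet<string> = new Set([
  "known_answer",
  "edge_case",
  "negative",
  "known_bad",
]);

// Mirrors the WHERE clause, one condition per line.
function matchesFilter(s: TestSuite, slugs: ReadonlySet<string>): boolean {
  return (
    slugs.has(s.capability_slug) &&
    s.active &&
    s.test_mode === "live" &&
    PAID_TEST_TYPES.has(s.test_type) &&
    s.external_cost_cents === 0 // rows already at a non-zero cost are left alone
  );
}

// Applies the cost to matching rows and returns how many were changed.
function applyCost(suites: TestSuite[], slugs: ReadonlySet<string>, cents: number): number {
  let updated = 0;
  for (const s of suites) {
    if (matchesFilter(s, slugs)) {
      s.external_cost_cents = cents;
      updated++;
    }
  }
  return updated;
}
```

Because the last condition requires a cost of 0, a second call to `applyCost` over the same rows changes nothing, which is the same property the SQL relies on.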

Excluded by design (zero-cost-by-design test types):

schema_check — dry-run mode, no API call
dependency_health — zero-cost auth-less probe per CLAUDE.md Principle A (skipAuth: true returns 401 without consuming quota)
piggyback — not scheduled; populated by customer traffic

Pre-flight verified against prod (22 rows total)

adverse-media-check       edge_case       2fb2d99a-9b9f-424d-aa48-26e762e43b5e
adverse-media-check       known_answer    16b0562f-f3b4-4304-99f3-efa2958f584d
adverse-media-check       negative        c46407eb-b1c0-42aa-9473-91a34e441f1a
pep-check                 edge_case       0e71d89c-0df8-4428-92b9-b02b2b7de418
pep-check                 edge_case       69cece61-fe10-4ff6-ba1c-eb9de8945c84
pep-check                 known_answer    67a75c61-2f3c-4885-98d5-7b1083f394d6
pep-check                 known_answer    a0d840b4-e6bd-4458-b146-b68133db41bc
pep-check                 known_answer    a1e3290f-37b5-495d-8d76-cfb7060df5d1
pep-check                 known_answer    e2d5bbd8-6c79-46ee-a67a-830ed687796a
pep-check                 negative        e1b2bea1-b7c5-4bce-92ef-f3a1c5c5280e
sanctions-check           edge_case       98ad5bca-3328-4d33-b2f4-70fe179a4f30
sanctions-check           known_answer    0e3b2aa8-6a28-4b99-855e-543f45c5ebaf
sanctions-check           negative        a458ce66-a4dc-4441-b916-1e98835a98b5
uk-cop-check              edge_case       ad20e380-8972-487d-90e7-28d18bb470c7
uk-cop-check              known_answer    b860cbd9-55b5-4687-b20d-61dc1740fd12
uk-cop-check              negative        4be485fd-f422-4381-9684-fa00f320ad1e
risk-narrative-generate   edge_case       36dffc24-3052-4b22-9bbb-909096620b64
risk-narrative-generate   known_answer    134637c0-91e4-46c7-a945-946c0bc23ef3
risk-narrative-generate   known_answer    5e431628-9a08-4a45-9214-2738949021aa
risk-narrative-generate   known_answer    61ddd201-42ab-43f4-923c-ccd2efed6c54
risk-narrative-generate   known_answer    fa765d3c-b570-42d6-a47f-12b754865d1b
risk-narrative-generate   negative        a312f4cc-5c4e-47f6-bee2-fd21fffcd5b0

16 at 1¢ + 6 at 3¢ = 22 rows total. Some rows have fixture in their test_name but test_mode='live' — name is a holdover; the executor calls the real API regardless of name.

Migration shape

drizzle/0062_paid_vendor_suite_cost.sql: canonical SQL. Claims 0062 to leave 0060/0061 free for the open PRs #42 (feat(capability): add marketplace_eligible flag + onboarding classification gate) and #45 (feat(api): retire solutions surface; delete Web3 Assurance code); the migration-prefix lint catches collisions either way.
scripts/apply-migrations.ts: idempotent runtime block. Logs updated-row counts + remaining-at-zero post-check.
Post-condition assertion baked into both layers: if any of the 5 capabilities has an active live non-probe suite still at external_cost_cents = 0 after the migration runs, the SQL RAISE EXCEPTIONs and the runtime applier logs a WARNING.

Tests (DEC-20260504-A audit-followup test coverage)

3 new in src/jobs/paid-vendor-suite-cost.test.ts. Compiles the UPDATE SQL via PgDialect.sqlToQuery and asserts:

Slug whitelist (4 + 1) is exact; bystander slugs (risk-narrative-generate, us-company-data-cobalt, google-search, translate) are absent from the Dilisense UPDATE
test_mode = 'live' filter present (excludes fixture/canary)
known_answer/edge_case/negative/known_bad included; schema_check/dependency_health/piggyback excluded
external_cost_cents = 0 idempotency guard present
Scope-sanity test pins the exact in-scope slug set; a scope-creep edit (e.g. adding ~80 Anthropic-Haiku slugs) trips a test failure rather than silently shipping
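The scope-sanity idea can be sketched as a comparison between the slug list the migration uses and an independently pinned list. The slugs below come from the scope table; `scopeDiff` and the constant names are hypothetical, and the real test compiles the UPDATE via PgDialect.sqlToQuery rather than comparing arrays.

```typescript
// Slugs in scope, as listed in the scope table of this PR.
const ONE_CENT_SLUGS: readonly string[] = [
  "pep-check",
  "sanctions-check",
  "adverse-media-check",
  "uk-cop-check",
];
const THREE_CENT_SLUGS: readonly string[] = ["risk-narrative-generate"];

// Reports slugs that appear in only one of the two lists.
// Both arrays empty means the scope matches the pin exactly.
function scopeDiff(
  actual: readonly string[],
  pinned: readonly string[],
): { unexpected: string[]; missing: string[] } {
  const actualSet = new Set(actual);
  const pinnedSet = new Set(pinned);
  return {
    unexpected: actual.filter((s) => !pinnedSet.has(s)).sort(),
    missing: pinned.filter((s) => !actualSet.has(s)).sort(),
  };
}
```

A test that requires both arrays to be empty fails as soon as someone appends a slug to the migration without also updating the pinned list, which forces the scope change to be deliberate.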

NOT in scope (explicitly deferred — separate to-dos)

Anthropic-Haiku bulk set (~80 caps). Per-call cost depends on input/output token volume; flat 1-cent gives a misleading false safety signal. Needs a per-executor max_tokens + typical-input estimate. Defer.
Browserless suites (37 caps). Browserless billing is per-minute, not per-call. Mapping onto external_cost_cents needs a different model. Defer.
Capability-level misclassifications. ~8–12 caps tagged maintenance_class = 'commercial-stable-api' but actually free upstream:
- uk-company-data, uk-companies-house-officers (Companies House — free public registry)
- flight-status (AviationStack — has a free tier)
- backlink-check (CommonCrawl — free)
- job-board-search (Arbetsförmedlingen — Swedish public employment service, free)
- nl-bag-address, nl-energy-label (RVO / Kadaster — free Dutch government APIs)
- address-parse, pii-redact, fake-data-generate, skill-extract (data_source says "Algorithmic …" — should be pure-computation, not commercial-stable-api)
Fix the maintenance_class on the capability, not the suite cost. Separate to-do.

Verification

tsc --noEmit clean
vitest run — 450 pass, 11 skipped, 0 fail (3 new regression tests)
Linters clean: lint:no-bare-catch, lint:no-new-console, lint:migration-prefixes (62 migrations, no prefix collisions)
22 update-target rows verified pre-flight against prod
Post-deploy: confirm apply-migrations.ts logs Dilisense/eSortcode suites updated: 16 and risk-narrative-generate suites updated: 6; paid-vendor remaining-at-zero post-check: 0

Deploy posture

Auto-deploys on merge. Migration runs idempotently on every restart (no-op after first apply). Disk impact: negligible — 22 row UPDATEs.

🤖 Generated with Claude Code

…ites

Per the 2026-05-04 paid-vendor audit + DEC-20260504-A audit-followup test coverage protocol. Stops the hourly-cadence bleed on Dilisense, eSortcode, and Anthropic Sonnet calls that PR #46 inadvertently amplified by moving the scheduler from 24h to 1h cadence.

Scope (intentionally tight; verified ground-truth via executor source):
- pep-check, sanctions-check, adverse-media-check (Dilisense): 2-char name guard then unconditional fetch to api.dilisense.com per call. No free-probe path; no input bypass. Every scheduled run = 1 paid Dilisense call. → external_cost_cents = 1.
- uk-cop-check (eSortcode Pay.UK): the manifest currently passes test_outcome=MATCHED which eSortcode burns no credits for, but the classification is still wrong as a flag. Future test-input change could start burning credits unexpectedly. → external_cost_cents = 1 per the user's future-proofing instruction.
- risk-narrative-generate: Sonnet 4.6 (DEFAULT_MODEL constant in the executor), max_tokens 1500. Conservative cost upper bound: 4K input × $3/MTok + 1500 output × $15/MTok ≈ $0.034 ≈ €0.031. Set to 3 cents so the scheduler skips it (any non-zero value excludes from hourly cadence) and the cost reflects real magnitude. → external_cost_cents = 3.

Filter (idempotent + scope-bounded):
- active = true
- test_mode = 'live' (skip 'fixture' = saved data; skip 'canary' = existing non-zero values preserved)
- test_type IN ('known_answer', 'edge_case', 'negative', 'known_bad')
- external_cost_cents = 0 (preserve existing manual values)

Excludes by design (zero-cost-by-design test types):
- schema_check (dry-run mode, no API call)
- dependency_health (zero-cost auth-less probe per CLAUDE.md Principle A: skipAuth: true on probe means a 401 proves connectivity without consuming quota)
- piggyback (not scheduled; populated by customer traffic)

Expected updated rows: 22 (16 Dilisense/eSortcode at 1¢ + 6 risk-narrative-generate at 3¢). Pre-flight verified against prod; suite IDs in PR description.

NOT in scope (explicitly deferred per user instruction; separate to-dos):
- Anthropic-Haiku bulk set (~80 caps): per-call cost depends on input/output token volume; flat 1-cent gives misleading false safety signal. Defer to a separate PR with proper per-call cost estimation (read max_tokens per executor + estimate typical input).
- Browserless suites (37 caps): Browserless billing is per-minute, not per-call; mapping it onto external_cost_cents requires a different model.
- Capability-level misclassifications (~8-12 caps tagged maintenance_class = 'commercial-stable-api' but actually free): Companies House, AviationStack free tier, RVO/Kadaster, Arbetsförmedlingen, CommonCrawl. Fix the maintenance_class classification on the capability, not the suite cost. Separate to-do.

Migration shape: drizzle/0062_paid_vendor_suite_cost.sql + idempotent runtime block in scripts/apply-migrations.ts. Number gap from 0059 to 0062 leaves 0060 + 0061 free for the open feat/marketplace-eligible-flag (PR #42) and feat/retire-solutions-and-web3-assurance (PR #45) branches; the migration-prefix lint catches collisions either way.

Post-condition assertion baked into both the SQL migration (DO $$ RAISE EXCEPTION on remaining_zero > 0) and the apply-migrations.ts block (logs remaining-at-zero count). If a new paid-vendor suite landed at cost=0 between audit and apply, the assertion fails with a clear message rather than letting the suite quietly bleed.

Tests: 3 new in src/jobs/paid-vendor-suite-cost.test.ts. Compiles the UPDATE SQL via PgDialect.sqlToQuery and asserts:
- Slug whitelist (4 + 1) is exact; bystander slugs are absent
- test_mode = 'live' filter present; fixture/canary excluded
- known_answer/edge_case/negative/known_bad included; schema_check/dependency_health/piggyback excluded
- external_cost_cents = 0 idempotency guard present
- scope-sanity test pins the in-scope slug set so a scope-creep edit (e.g. adding ~80 Anthropic-Haiku slugs) trips a test failure instead of silently shipping

Verification:
- tsc --noEmit clean
- vitest run: 450 pass, 11 skipped, 0 fail (3 new regression tests)
- lint:no-bare-catch / lint:no-new-console / lint:migration-prefixes all clean (62 migrations, no prefix collisions)
- 22 rows verified pre-flight against prod (suite IDs listed in PR description)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sification dependency (#50)

Defensive note added per the 2026-05-04 paid-vendor audit follow-up.

The eSortcode CoP API does not deduct credits when `test_outcome` is set on the request: the executor (apps/api/src/capabilities/uk-cop-check.ts:88-112) passes it through verbatim, and eSortcode returns a deterministic test response. This makes the test scheduler's runs against the current fixture cost-free in practice, while a real production call (no `test_outcome` parameter) bills per call against the Pay.UK CoP scheme.

PR #49 set test_suites.external_cost_cents = 1 for uk-cop-check live suites despite the bypass. The rationale (per user instruction) was that classification should reflect the per-call billing model so a future fixture change that silently loses the bypass doesn't quietly start burning credits.

This commit closes the gap by documenting the dependency in the manifest itself:
- Multi-line comment block above test_fixtures explains the bypass + the cost-classification implication.
- Inline comment on each `test_outcome: MATCHED` line points back to the explanation, so an editor scanning the input fields sees the warning at the line they're about to change.

YAML-only change. No code path. Manifest pipeline reads the file for content; comments are not roundtripped to DB but are visible to any engineer reading or editing the file. Verified the YAML still parses cleanly via js-yaml after the edit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ply-migrations.ts (#51)

Recovery from the 2026-05-04 PR-#42 deploy outage.

Root cause: apps/api/scripts/apply-migrations.ts was a dead file. The Dockerfile CMD ran the API server directly with no pre-start migration hook, AND apps/api/tsconfig.json's `rootDir: "./src"` excluded the script from the build entirely. Every block we shipped through that file (PR #29 actual_cost_cents, PR #42 marketplace_eligible, PR #49 paid-vendor cost UPDATEs) silently never ran in production. PR #42's deploy outage made this visible: the new code referenced columns that the migration was supposed to add, but the migration never ran, so every public-surface request 500'd until the columns were applied manually.

Fix:
- New `src/lib/startup-migrations.ts` exporting `runStartupMigrations()` and one function per migration block (0028, 0029, 0060, 0062). Each block is independently testable via an injected MigrationExecutor stub.
- Wired into `index.ts:69` BEFORE `validateSchema()`, BEFORE the API listens, BEFORE any scheduler / job boots. Blocking: if any block throws, the process aborts via the existing `main().catch(...)`.
- Per-block structured logging (`startup-migration-block` label with block id, outcome, rows_affected, duration_ms). Top-level `startup-migrations-complete` summary at the end. Pre-fix, the dead script wrote to stdout via console.log; the new code uses pino through `lib/log.js` (passes lint:no-new-console).
- Block 0062 (paid-vendor) gains a post-condition `RAISE`-equivalent: if any of the 5 paid-vendor capabilities' active live non-probe suites still has external_cost_cents = 0 after the UPDATEs run, the function throws (would abort boot). Rationale: a new paid suite landing at cost=0 between deploys is exactly the silent bleed pattern this whole exercise is fixing.
- Idempotency rules per block:
  - 0028 (sqs_daily_snapshot): information_schema check + skip; CREATE TABLE / INDEX statements use IF NOT EXISTS as defence-in-depth.
  - 0029 (actual_cost_cents): information_schema check + skip (column has NOT NULL DEFAULT 0; ADD COLUMN IF NOT EXISTS isn't quite enough on its own without re-applying the default to a pre-existing column).
  - 0060 (marketplace_eligible): two ADD COLUMN IF NOT EXISTS unconditional; Postgres-level no-op on re-run.
  - 0062 (paid-vendor costs): UPDATE WHERE external_cost_cents = 0; second run finds zero matching rows.

Deleted apps/api/scripts/apply-migrations.ts. The historical admin endpoint at routes/internal-tests.ts:/admin/apply-migrations is left in place; it's a separate manual-recovery path with its own (older, incomplete) block list. Action 3 audit will surface what to do with it.

Tests: 9 new in src/lib/startup-migrations.test.ts using a stub MigrationExecutor (PgDialect.sqlToQuery for SQL shape; canned-result queue for behaviour). Per DEC-20260504-A audit-followup test coverage protocol, every block gets:
- First-run: assert the block runs the expected DDL/DML and reports the expected outcome.
- Second-run idempotency: assert that on re-run the block either skips (information_schema check returns 1) or matches zero rows (WHERE filter on the post-state).
- Block 0062 also gets a post-condition-violation test: stub the post-check to return `remaining_zero: 1`, assert the function throws (the failure-aborts-boot semantics).

The new structure has a single source of truth for migration blocks, runs synchronously at boot, and is regression-tested. Dead-file silent-failure mode eliminated.

Verification:
- tsc --noEmit clean
- vitest run: 469 pass, 11 skipped, 0 fail (9 new regression tests)
- lint:no-bare-catch / lint:no-new-console / lint:migration-prefixes all clean

Manual recovery completed prior to this commit:
- Migration 0060 marketplace_eligible columns applied to prod at 16:31 UTC 2026-05-04 (during the active outage; identical SQL to the migration file, ADD COLUMN IF NOT EXISTS, non-destructive).
- PR #49's paid-vendor UPDATEs applied to prod at ~22:00 UTC 2026-05-04 (22 rows: 16 Dilisense+eSortcode at 1¢ + 6 risk-narrative-generate at 3¢; idempotent re-run produced 0 changes).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
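The block-0062 behaviour described in this commit (run the UPDATE, run a post-check, throw if any suite is still at 0) can be sketched against a canned-result stub. The `MigrationExecutor` shape here is an assumption, since the real interface is not shown in the source, and it is synchronous only for brevity. `CannedExecutor` and `runPaidVendorBlock` are hypothetical names, and only the 1¢ UPDATE is modelled.

```typescript
// Assumed executor shape; the real MigrationExecutor may differ and would be async.
interface MigrationExecutor {
  execute(sql: string): { rowCount: number; rows: Array<Record<string, unknown>> };
}
type QueryResult = ReturnType<MigrationExecutor["execute"]>;

// Stub that replays canned results in order and records the SQL it received.
class CannedExecutor implements MigrationExecutor {
  readonly seen: string[] = [];
  private readonly queue: QueryResult[];
  constructor(queue: QueryResult[]) {
    this.queue = [...queue];
  }
  execute(sql: string): QueryResult {
    this.seen.push(sql);
    const next = this.queue.shift();
    if (!next) throw new Error("CannedExecutor: no canned result left");
    return next;
  }
}

// Shared scope for the UPDATE and the post-check (1-cent capabilities only).
const SCOPE = `capability_slug IN ('pep-check','sanctions-check','adverse-media-check','uk-cop-check')
  AND active = true AND test_mode = 'live'
  AND test_type IN ('known_answer','edge_case','negative','known_bad')
  AND external_cost_cents = 0`;

// Runs the UPDATE, then fails loudly if any in-scope suite is still at 0.
function runPaidVendorBlock(db: MigrationExecutor): number {
  const updated = db.execute(`UPDATE test_suites SET external_cost_cents = 1 WHERE ${SCOPE}`);
  const check = db.execute(`SELECT count(*)::int AS remaining_zero FROM test_suites WHERE ${SCOPE}`);
  const remaining = Number(check.rows[0]?.remaining_zero ?? 0);
  if (remaining > 0) {
    throw new Error(`paid-vendor post-condition failed: ${remaining} suite(s) still at 0`);
  }
  return updated.rowCount;
}
```

With a stub like this, the three cases listed under Tests map onto three canned queues: a first run that reports updated rows, a second run that reports zero, and a post-check that returns `remaining_zero: 1` and must throw.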

…Anthropic-Haiku bleed (#55)

Block 0063: sibling of block 0062 (PR #49 paid-vendor classification), single-capability variant for invoice-extract.

## Why

Prod query 2026-05-04 found all 4 active non-probe live test suites (known_answer, edge_case, negative, known_bad) for `invoice-extract` at `external_cost_cents = 0`. Per DEC-20260503-B the scheduler skips suites with `external_cost_cents > 0` for paid vendors; at 0, they were being scheduled hourly and paying Anthropic Haiku vision to "extract invoice fields" from `httpbin.org/image/jpeg` (a JPEG of a dog). The fixture issue is hygiene-only; this block is the structural fix that flips the scheduler-skip semantic.

## What

    UPDATE test_suites
    SET external_cost_cents = 1
    WHERE capability_slug = 'invoice-extract'
      AND active = true
      AND test_mode = 'live'
      AND test_type IN ('known_answer','edge_case','negative','known_bad')
      AND external_cost_cents = 0

dependency_health and schema_check are explicitly excluded: they use the auth-less probe pattern (no paid call) and legitimately stay at 0. 1¢ floor matches the PR #49 / block 0062 defensible-minimum pattern; real Haiku-vision-on-small-JPEG cost is below 1¢, but the floor's operational role is purely the scheduler-skip flip.

## Idempotency

The `external_cost_cents = 0` filter in the WHERE clause makes the UPDATE a no-op on re-run. Post-condition assertion fails boot if any new active live non-probe invoice-extract suite shows up at 0.

## Per active protocols

- DEC-20260504-A: 3 regression tests cover (a) first-run UPDATE shape including the test_type list and the negative assertion that probes are excluded, (b) idempotent re-run zero-rows, (c) post-condition violation throws.
- DEC-20260504-C: this is a startup-migrations block; PR #51/#52 wired runStartupMigrations() to fire blocking before validateSchema() at every API restart. No deploy-mechanism risk.

Notion to-do 35667c87-082c-817f-9905-cfad380b8ae3 (invoice-extract fixture). Fixture itself left as the dog JPEG intentionally: once the suite is no longer scheduled, the fixture quality is moot.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ly testing (#85)

Phase 1 (Contain): startup-migrations block 0064 sets external_cost_cents to 1¢ on the 73 always-LLM Haiku capabilities whose suites were still at 0 after PR #49 (which explicitly deferred the bulk-set). The test scheduler dispatch query (apps/api/src/jobs/test-scheduler.ts:262) filters on external_cost_cents = 0, so until this fix every always-LLM Haiku cap was being executed hourly with real Anthropic billing.

Phase 2 (Understand): see Notion Journal entry for the named failure pattern ("Compound-PR cost leak") and audit PR #84 for the full ramp attribution.

Phase 3 (Harden): new llm-capability-costs.ts holds the canonical map + CONDITIONAL_LLM_CAPABILITIES exclusion set. New CI test walks apps/api/src/capabilities/*.ts and fails if any @anthropic-ai/sdk importer is not registered as always-LLM, conditional-LLM, or deactivated. Adding a new LLM-using capability without registering its cost fails CI.

Idempotency: 0064 filters on `external_cost_cents = 0` in the WHERE clause + a post-condition SELECT that throws on remaining_zero > 0. Mirrors the PR #49 / PR #55 pattern.

Slug count: 73 always-LLM Haiku caps (audit PR #84 estimated ~50; the full grep + filter against DEACTIVATED + CONDITIONAL_LLM yields 73, within the prompt's <30 / >80 halt range).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
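The Phase 3 registration check can be sketched as a pure function over file contents. This is an illustration under stated assumptions: the registry entries below are placeholders, `ALWAYS_LLM_COST_CENTS`, `DEACTIVATED_CAPABILITIES`, and `findUnregisteredLlmCapabilities` are hypothetical names, and the sketch takes an in-memory map instead of walking apps/api/src/capabilities/*.ts on disk. Only `CONDITIONAL_LLM_CAPABILITIES` is a name taken from the commit message.

```typescript
// Placeholder registries; the real contents of llm-capability-costs.ts are not shown.
const ALWAYS_LLM_COST_CENTS: ReadonlyMap<string, number> = new Map([
  ["example-always-llm", 1],
]);
const CONDITIONAL_LLM_CAPABILITIES: ReadonlySet<string> = new Set(["example-conditional"]);
const DEACTIVATED_CAPABILITIES: ReadonlySet<string> = new Set(["example-retired"]);

// Matches an ES import of the Anthropic SDK, with either quote style.
const SDK_IMPORT = /from\s+["']@anthropic-ai\/sdk["']/;

// Given file name -> source text, returns slugs that import the SDK
// but are not present in any of the three registries.
function findUnregisteredLlmCapabilities(
  sources: Readonly<Record<string, string>>,
): string[] {
  return Object.entries(sources)
    .filter(([, code]) => SDK_IMPORT.test(code))
    .map(([file]) => file.replace(/\.ts$/, ""))
    .filter(
      (slug) =>
        !ALWAYS_LLM_COST_CENTS.has(slug) &&
        !CONDITIONAL_LLM_CAPABILITIES.has(slug) &&
        !DEACTIVATED_CAPABILITIES.has(slug),
    )
    .sort();
}
```

A CI test built on this would assert that the returned list is empty, so a new capability that imports the SDK fails the build until its cost is registered.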

petterlindstrom79 merged commit f23a130 into main May 4, 2026
1 check passed

petterlindstrom79 deleted the fix/paid-vendor-suite-cost-classification branch May 4, 2026 13:44

petterlindstrom79 mentioned this pull request May 4, 2026

chore(uk-cop-check): document test_outcome=MATCHED bypass + cost-classification dependency #50

Merged

petterlindstrom79 mentioned this pull request May 4, 2026

fix(deploy): wire startup migrations into API boot; replaces dead apply-migrations.ts #51

Merged

4 tasks

This was referenced May 4, 2026

chore(cleanup): P3 hygiene — delete dead test-scheduler-policy.ts + Notion path/tense fix #54

Merged

fix(invoice-extract): reclassify paid-vendor suite cost; stop hourly Anthropic-Haiku bleed #55

Merged

This was referenced May 11, 2026

docs: audit anthropic api cost drivers for may 2026 ramp #84

Open

fix: bump external_cost_cents on always-llm capabilities to gate hourly testing #85

Merged

petterlindstrom79 mentioned this pull request May 11, 2026

refactor: scheduled_testing_eligible column + derivation bridge (PR A) #88

Merged

7 tasks
