fix(test-suites): set external_cost_cents on confirmed paid-vendor suites#49

Merged
petterlindstrom79 merged 1 commit into
mainfrom
fix/paid-vendor-suite-cost-classification
May 4, 2026

Conversation

@petterlindstrom79
Member

Summary

Per the 2026-05-04 paid-vendor audit + DEC-20260504-A audit-followup test coverage protocol. Stops the hourly-cadence bleed on Dilisense, eSortcode, and Anthropic Sonnet calls that PR #46 inadvertently amplified by moving the scheduler from 24h to 1h cadence.

Scope (intentionally tight — verified ground-truth via executor source)

Capability                Vendor                          Per-call billing?       New cost
pep-check                 Dilisense                       Yes                     1¢
sanctions-check           Dilisense                       Yes                     1¢
adverse-media-check       Dilisense (+ Serper fallback)   Yes                     1¢
uk-cop-check              eSortcode                       Future-proofing flag*   1¢
risk-narrative-generate   Anthropic Sonnet 4.6            Yes                     3¢

*uk-cop-check currently passes test_outcome=MATCHED which eSortcode burns no credits for. Setting to 1¢ anyway because the classification is wrong as a flag and a future test-input change could start burning credits unexpectedly.

risk-narrative-generate cost derivation: Executor pins claude-sonnet-4-6 with max_tokens: 1500. Sonnet 4.6 pricing is $3/MTok input + $15/MTok output. Conservative upper bound: 4K input tokens × $3/MTok + 1.5K output tokens × $15/MTok = $0.012 + $0.0225 = $0.0345 ≈ €0.031. Rounded to 3 cents so the scheduler skips it (any non-zero value excludes a suite from the hourly cadence) and the cost reflects real magnitude.
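
The derivation above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the 0.9 EUR/USD rate is an illustrative assumption (the PR only states the resulting €0.031 figure, not the rate used).

```typescript
// Per-million-token prices quoted in the PR description (USD).
const INPUT_USD_PER_MTOK = 3;
const OUTPUT_USD_PER_MTOK = 15;

/** Upper-bound cost in USD for one call, given input and output token ceilings. */
function upperBoundUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_USD_PER_MTOK +
    (outputTokens / 1_000_000) * OUTPUT_USD_PER_MTOK
  );
}

/** Convert USD to whole euro cents. The exchange rate is supplied by the caller. */
function toEuroCents(usd: number, eurPerUsd: number): number {
  return Math.round(usd * eurPerUsd * 100);
}

// 4K input tokens and the executor's max_tokens of 1500 as the output ceiling.
const sonnetUsd = upperBoundUsd(4_000, 1_500); // 0.012 + 0.0225
// ASSUMPTION: 0.9 EUR/USD is illustrative only, not taken from the source.
const sonnetEuroCents = toEuroCents(sonnetUsd, 0.9);
```

Under these assumptions the result lands on 3 cents. The exact value matters less than it being non-zero, since that is what takes the suite off the hourly cadence.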

Filter (idempotent + scope-bounded)

WHERE capability_slug IN (...)
  AND active = true
  AND test_mode = 'live'                 -- not fixture (saved data) / canary (existing non-zero)
  AND test_type IN (
    'known_answer', 'edge_case',
    'negative', 'known_bad'              -- exclude schema_check / dependency_health / piggyback
  )
  AND external_cost_cents = 0            -- preserve existing manual values; idempotent

Excluded by design (zero-cost-by-design test types):

  • schema_check — dry-run mode, no API call
  • dependency_health — zero-cost auth-less probe per CLAUDE.md Principle A (skipAuth: true returns 401 without consuming quota)
  • piggyback — not scheduled; populated by customer traffic
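
The filter and its exclusions can be restated as a predicate over a suite row. The sketch below is illustrative only: the `TestSuite` shape and `isUpdateTarget` name are assumptions made for this example, not the repository's types, and the actual change is applied in SQL.

```typescript
type TestMode = "live" | "fixture" | "canary";
type TestType =
  | "known_answer" | "edge_case" | "negative" | "known_bad"
  | "schema_check" | "dependency_health" | "piggyback";

// Hypothetical row shape mirroring the columns named in the WHERE clause.
interface TestSuite {
  capabilitySlug: string;
  active: boolean;
  testMode: TestMode;
  testType: TestType;
  externalCostCents: number;
}

// Test types that make a real, billable vendor call.
const BILLABLE_TEST_TYPES: ReadonlySet<TestType> = new Set<TestType>([
  "known_answer", "edge_case", "negative", "known_bad",
]);

/** True when the row would be touched by the cost UPDATE. */
function isUpdateTarget(suite: TestSuite, slugs: ReadonlySet<string>): boolean {
  return (
    slugs.has(suite.capabilitySlug) &&
    suite.active &&
    suite.testMode === "live" &&               // fixture and canary rows are skipped
    BILLABLE_TEST_TYPES.has(suite.testType) && // probes and piggyback rows are skipped
    suite.externalCostCents === 0              // preserves manual values; idempotent
  );
}
```

Because the last condition is part of the predicate, a row stops matching as soon as it has been updated, which is what makes a second run a no-op.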

Pre-flight verified against prod (22 rows total)

adverse-media-check       edge_case       2fb2d99a-9b9f-424d-aa48-26e762e43b5e
adverse-media-check       known_answer    16b0562f-f3b4-4304-99f3-efa2958f584d
adverse-media-check       negative        c46407eb-b1c0-42aa-9473-91a34e441f1a
pep-check                 edge_case       0e71d89c-0df8-4428-92b9-b02b2b7de418
pep-check                 edge_case       69cece61-fe10-4ff6-ba1c-eb9de8945c84
pep-check                 known_answer    67a75c61-2f3c-4885-98d5-7b1083f394d6
pep-check                 known_answer    a0d840b4-e6bd-4458-b146-b68133db41bc
pep-check                 known_answer    a1e3290f-37b5-495d-8d76-cfb7060df5d1
pep-check                 known_answer    e2d5bbd8-6c79-46ee-a67a-830ed687796a
pep-check                 negative        e1b2bea1-b7c5-4bce-92ef-f3a1c5c5280e
sanctions-check           edge_case       98ad5bca-3328-4d33-b2f4-70fe179a4f30
sanctions-check           known_answer    0e3b2aa8-6a28-4b99-855e-543f45c5ebaf
sanctions-check           negative        a458ce66-a4dc-4441-b916-1e98835a98b5
uk-cop-check              edge_case       ad20e380-8972-487d-90e7-28d18bb470c7
uk-cop-check              known_answer    b860cbd9-55b5-4687-b20d-61dc1740fd12
uk-cop-check              negative        4be485fd-f422-4381-9684-fa00f320ad1e
risk-narrative-generate   edge_case       36dffc24-3052-4b22-9bbb-909096620b64
risk-narrative-generate   known_answer    134637c0-91e4-46c7-a945-946c0bc23ef3
risk-narrative-generate   known_answer    5e431628-9a08-4a45-9214-2738949021aa
risk-narrative-generate   known_answer    61ddd201-42ab-43f4-923c-ccd2efed6c54
risk-narrative-generate   known_answer    fa765d3c-b570-42d6-a47f-12b754865d1b
risk-narrative-generate   negative        a312f4cc-5c4e-47f6-bee2-fd21fffcd5b0

16 at 1¢ + 6 at 3¢ = 22 rows total. Some rows have fixture in their test_name but test_mode='live' — name is a holdover; the executor calls the real API regardless of name.
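
As a cross-check of the totals, the per-capability row counts from the listing above can be tallied by assigned cost. This is a small sketch with hypothetical names; the counts are read directly off the listing.

```typescript
// Rows per capability, counted from the pre-flight listing.
const rowsPerCapability: Record<string, number> = {
  "adverse-media-check": 3,
  "pep-check": 7,
  "sanctions-check": 3,
  "uk-cop-check": 3,
  "risk-narrative-generate": 6,
};

// Cost in cents assigned to each capability by this PR.
const costCentsPerCapability: Record<string, number> = {
  "adverse-media-check": 1,
  "pep-check": 1,
  "sanctions-check": 1,
  "uk-cop-check": 1,
  "risk-narrative-generate": 3,
};

/** Group row counts by the cost value they will be set to. */
function tallyByCost(
  rows: Record<string, number>,
  costs: Record<string, number>,
): Record<number, number> {
  const tally: Record<number, number> = {};
  for (const slug of Object.keys(rows)) {
    const cents = costs[slug];
    if (cents === undefined) throw new Error(`no cost recorded for ${slug}`);
    tally[cents] = (tally[cents] ?? 0) + (rows[slug] ?? 0);
  }
  return tally;
}

const rowTally = tallyByCost(rowsPerCapability, costCentsPerCapability); // { 1: 16, 3: 6 }
const totalRows = Object.keys(rowsPerCapability)
  .reduce((sum, slug) => sum + (rowsPerCapability[slug] ?? 0), 0); // 22
```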

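Migration shape

drizzle/0062_paid_vendor_suite_cost.sql plus an idempotent runtime block in scripts/apply-migrations.ts. Both carry a post-condition check: the SQL migration raises an exception if any in-scope suite remains at external_cost_cents = 0, and the apply-migrations.ts block logs the remaining-at-zero count.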

Tests (DEC-20260504-A audit-followup test coverage)

3 new tests in src/jobs/paid-vendor-suite-cost.test.ts. They compile the UPDATE SQL via PgDialect.sqlToQuery and assert:

  • Slug whitelist (4 + 1) is exact; bystander slugs (risk-narrative-generate, us-company-data-cobalt, google-search, translate) are absent from the Dilisense UPDATE
  • test_mode = 'live' filter present (excludes fixture/canary)
  • known_answer/edge_case/negative/known_bad included; schema_check/dependency_health/piggyback excluded
  • external_cost_cents = 0 idempotency guard present
  • Scope-sanity test pins the exact in-scope slug set; a scope-creep edit (e.g. adding ~80 Anthropic-Haiku slugs) trips a test failure rather than silently shipping
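
The assertions listed above amount to pinning the compiled statement's shape. The sketch below shows the idea without Drizzle: `buildCostUpdate` and `assertScope` are hypothetical helpers that build the SQL text by hand, whereas the real tests compile it through PgDialect.sqlToQuery.

```typescript
interface CompiledQuery {
  sql: string;
  params: unknown[];
}

// Hand-rolled stand-in for the compiled UPDATE; $1 is the cost, $2.. are the slugs.
function buildCostUpdate(slugs: readonly string[], cents: number): CompiledQuery {
  const placeholders = slugs.map((_, i) => `$${i + 2}`).join(", ");
  return {
    sql:
      `UPDATE test_suites SET external_cost_cents = $1 ` +
      `WHERE capability_slug IN (${placeholders}) ` +
      `AND active = true AND test_mode = 'live' ` +
      `AND test_type IN ('known_answer', 'edge_case', 'negative', 'known_bad') ` +
      `AND external_cost_cents = 0`,
    params: [cents, ...slugs],
  };
}

/** Throws if the query's slug set or guards drift from the pinned scope. */
function assertScope(query: CompiledQuery, expectedSlugs: readonly string[]): void {
  const actual = query.params.slice(1).map(String).sort();
  const expected = [...expectedSlugs].sort();
  if (JSON.stringify(actual) !== JSON.stringify(expected)) {
    throw new Error(`slug scope drifted: ${actual.join(", ")}`);
  }
  for (const guard of ["test_mode = 'live'", "external_cost_cents = 0"]) {
    if (!query.sql.includes(guard)) throw new Error(`missing guard: ${guard}`);
  }
  for (const excluded of ["schema_check", "dependency_health", "piggyback"]) {
    if (query.sql.includes(excluded)) throw new Error(`unexpected test type: ${excluded}`);
  }
}

const ONE_CENT_SCOPE = ["pep-check", "sanctions-check", "adverse-media-check", "uk-cop-check"];
assertScope(buildCostUpdate(ONE_CENT_SCOPE, 1), ONE_CENT_SCOPE); // passes silently
```

Comparing the full sorted slug list, rather than checking membership one slug at a time, is what makes a scope-creep edit fail instead of passing unnoticed.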

NOT in scope (explicitly deferred — separate to-dos)

  • Anthropic-Haiku bulk set (~80 caps). Per-call cost depends on input/output token volume; flat 1-cent gives a misleading false safety signal. Needs a per-executor max_tokens + typical-input estimate. Defer.

  • Browserless suites (37 caps). Browserless billing is per-minute, not per-call. Mapping onto external_cost_cents needs a different model. Defer.

  • Capability-level misclassifications. ~8–12 caps tagged maintenance_class = 'commercial-stable-api' but actually free upstream:

    • uk-company-data, uk-companies-house-officers (Companies House — free public registry)
    • flight-status (AviationStack — has a free tier)
    • backlink-check (CommonCrawl — free)
    • job-board-search (Arbetsförmedlingen — Swedish public employment service, free)
    • nl-bag-address, nl-energy-label (RVO / Kadaster — free Dutch government APIs)
    • address-parse, pii-redact, fake-data-generate, skill-extract (data_source says "Algorithmic …" — should be pure-computation, not commercial-stable-api)

    Fix the maintenance_class on the capability, not the suite cost. Separate to-do.

Verification

  • tsc --noEmit clean
  • vitest run — 450 pass, 11 skipped, 0 fail (3 new regression tests)
  • Linters: lint:no-bare-catch, lint:no-new-console, lint:migration-prefixes (62 migrations, no prefix collisions)
  • 22 update-target rows verified pre-flight against prod
  • Post-deploy: confirm apply-migrations.ts logs Dilisense/eSortcode suites updated: 16 and risk-narrative-generate suites updated: 6; paid-vendor remaining-at-zero post-check: 0

Deploy posture

Auto-deploys on merge. Migration runs idempotently on every restart (no-op after first apply). Disk impact: negligible — 22 row UPDATEs.

🤖 Generated with Claude Code

…ites

Per the 2026-05-04 paid-vendor audit + DEC-20260504-A audit-followup
test coverage protocol. Stops the hourly-cadence bleed on Dilisense,
eSortcode, and Anthropic Sonnet calls that PR #46 inadvertently
amplified by moving the scheduler from 24h to 1h cadence.

Scope (intentionally tight — verified ground-truth via executor source):

- pep-check, sanctions-check, adverse-media-check (Dilisense): 2-char
  name guard then unconditional fetch to api.dilisense.com per call.
  No free-probe path; no input bypass. Every scheduled run = 1 paid
  Dilisense call. → external_cost_cents = 1.
- uk-cop-check (eSortcode Pay.UK): the manifest currently passes
  test_outcome=MATCHED which eSortcode burns no credits for, but the
  classification is still wrong as a flag. Future test-input change
  could start burning credits unexpectedly. → external_cost_cents = 1
  per the user's future-proofing instruction.
- risk-narrative-generate: Sonnet 4.6 (DEFAULT_MODEL constant in the
  executor), max_tokens 1500. Conservative cost upper bound: 4K input
  × $3/MTok + 1500 output × $15/MTok ≈ $0.034 ≈ €0.031. Round to
  3 cents so the scheduler skips it (any non-zero value excludes it
  from the hourly cadence) and the cost reflects real magnitude. →
  external_cost_cents = 3.

Filter (idempotent + scope-bounded):
- active = true
- test_mode = 'live' (skip 'fixture' = saved data; skip 'canary' =
  existing non-zero values preserved)
- test_type IN ('known_answer', 'edge_case', 'negative', 'known_bad')
- external_cost_cents = 0 (preserve existing manual values)

Excludes by design (zero-cost-by-design test types):
- schema_check (dry-run mode, no API call)
- dependency_health (zero-cost auth-less probe per CLAUDE.md
  Principle A — skipAuth: true on probe means a 401 proves
  connectivity without consuming quota)
- piggyback (not scheduled; populated by customer traffic)

Expected updated rows: 22 (16 Dilisense/eSortcode at 1¢ + 6
risk-narrative-generate at 3¢). Pre-flight verified against prod;
suite IDs in PR description.

NOT in scope (explicitly deferred per user instruction; separate
to-dos):

- Anthropic-Haiku bulk set (~80 caps): per-call cost depends on
  input/output token volume; flat 1-cent gives misleading false
  safety signal. Defer to a separate PR with proper per-call cost
  estimation (read max_tokens per executor + estimate typical
  input).
- Browserless suites (37 caps): Browserless billing is per-minute,
  not per-call; mapping it onto external_cost_cents requires a
  different model.
- Capability-level misclassifications (~8-12 caps tagged
  maintenance_class = 'commercial-stable-api' but actually free):
  Companies House, AviationStack free tier, RVO/Kadaster,
  Arbetsförmedlingen, CommonCrawl. Fix the maintenance_class
  classification on the capability, not the suite cost. Separate
  to-do.

Migration shape: drizzle/0062_paid_vendor_suite_cost.sql + idempotent
runtime block in scripts/apply-migrations.ts. Number gap from 0059 to
0062 leaves 0060 + 0061 free for the open feat/marketplace-eligible-
flag (PR #42) and feat/retire-solutions-and-web3-assurance (PR #45)
branches; the migration-prefix lint catches collisions either way.

Post-condition assertion baked into both the SQL migration (DO $$
RAISE EXCEPTION on remaining_zero > 0) and the apply-migrations.ts
block (logs remaining-at-zero count). If a new paid-vendor suite
landed at cost=0 between audit and apply, the assertion fails with
a clear message rather than letting the suite quietly bleed.

Tests: 3 new in src/jobs/paid-vendor-suite-cost.test.ts. Compiles the
UPDATE SQL via PgDialect.sqlToQuery and asserts:
- Slug whitelist (4 + 1) is exact; bystander slugs are absent
- test_mode = 'live' filter present; fixture/canary excluded
- known_answer/edge_case/negative/known_bad included; schema_check/
  dependency_health/piggyback excluded
- external_cost_cents = 0 idempotency guard present
- scope-sanity test pins the in-scope slug set so a scope-creep
  edit (e.g. adding ~80 Anthropic-Haiku slugs) trips a test
  failure instead of silently shipping

Verification:
- tsc --noEmit clean
- vitest run — 450 pass, 11 skipped, 0 fail (3 new regression tests)
- lint:no-bare-catch / lint:no-new-console / lint:migration-prefixes
  all clean (62 migrations, no prefix collisions)
- 22 rows verified pre-flight against prod (suite IDs listed in PR
  description)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@petterlindstrom79 petterlindstrom79 merged commit f23a130 into main May 4, 2026
1 check passed
@petterlindstrom79 petterlindstrom79 deleted the fix/paid-vendor-suite-cost-classification branch May 4, 2026 13:44
petterlindstrom79 added a commit that referenced this pull request May 4, 2026
…sification dependency (#50)

Defensive note added per the 2026-05-04 paid-vendor audit follow-up.

The eSortcode CoP API does not deduct credits when `test_outcome` is
set on the request — the executor (apps/api/src/capabilities/uk-cop-
check.ts:88-112) passes it through verbatim, and eSortcode returns a
deterministic test response. This makes the test scheduler's runs
against the current fixture cost-free in practice, while a real
production call (no `test_outcome` parameter) bills per call against
the Pay.UK CoP scheme.

PR #49 set test_suites.external_cost_cents = 1 for uk-cop-check live
suites despite the bypass — the rationale (per user instruction) was
that classification should reflect the per-call billing model so a
future fixture change that silently loses the bypass doesn't quietly
start burning credits.

This commit closes the gap by documenting the dependency in the
manifest itself:

- Multi-line comment block above test_fixtures explains the bypass +
  the cost-classification implication.
- Inline comment on each `test_outcome: MATCHED` line points back to
  the explanation, so an editor scanning the input fields sees the
  warning at the line they're about to change.

YAML-only change. No code path. Manifest pipeline reads the file for
content; comments are not roundtripped to DB but are visible to any
engineer reading or editing the file. Verified the YAML still parses
cleanly via js-yaml after the edit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
petterlindstrom79 added a commit that referenced this pull request May 4, 2026
…ply-migrations.ts (#51)

Recovery from the 2026-05-04 PR-#42 deploy outage.

Root cause:
apps/api/scripts/apply-migrations.ts was a dead file. The Dockerfile CMD
ran the API server directly with no pre-start migration hook, AND
apps/api/tsconfig.json's `rootDir: "./src"` excluded the script from
the build entirely. Every block we shipped through that file (PR #29
actual_cost_cents, PR #42 marketplace_eligible, PR #49 paid-vendor
cost UPDATEs) silently never ran in production. PR #42's deploy
outage made this visible: the new code referenced columns that the
migration was supposed to add, but the migration never ran, so every
public-surface request 500'd until the columns were applied manually.

Fix:
- New `src/lib/startup-migrations.ts` exporting `runStartupMigrations()`
  and one function per migration block (0028, 0029, 0060, 0062). Each
  block is independently testable via an injected MigrationExecutor
  stub.
- Wired into `index.ts:69` BEFORE `validateSchema()`, BEFORE the API
  listens, BEFORE any scheduler / job boots. Blocking — if any block
  throws, the process aborts via the existing `main().catch(...)`.
- Per-block structured logging (`startup-migration-block` label with
  block id, outcome, rows_affected, duration_ms). Top-level
  `startup-migrations-complete` summary at the end. Pre-fix, the
  dead script wrote to stdout via console.log; the new code uses
  pino through `lib/log.js` (passes lint:no-new-console).
- Block 0062 (paid-vendor) gains a post-condition `RAISE`-equivalent:
  if any of the 5 paid-vendor capabilities' active live non-probe
  suites still has external_cost_cents = 0 after the UPDATEs run,
  the function throws (would abort boot). Rationale: a new paid
  suite landing at cost=0 between deploys is exactly the silent
  bleed pattern this whole exercise is fixing.
- Idempotency rules per block:
  - 0028 (sqs_daily_snapshot): information_schema check + skip;
    CREATE TABLE / INDEX statements use IF NOT EXISTS as
    defence-in-depth.
  - 0029 (actual_cost_cents): information_schema check + skip
    (column has NOT NULL DEFAULT 0; ADD COLUMN IF NOT EXISTS isn't
    quite enough on its own without re-applying the default to a
    pre-existing column).
  - 0060 (marketplace_eligible): two ADD COLUMN IF NOT EXISTS
    unconditional; Postgres-level no-op on re-run.
  - 0062 (paid-vendor costs): UPDATE WHERE external_cost_cents = 0;
    second run finds zero matching rows.
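
A rough sketch of how block 0062 and its post-condition could look with an injected executor follows. The `MigrationExecutor` shape, the `stubExecutor` helper, and the result type are assumptions made for illustration (the real interface is not shown here), and the executor is kept synchronous only to keep the sketch short.

```typescript
// ASSUMPTION: hypothetical executor shape; a real one would be async.
interface StatementResult {
  rowCount: number;
  rows: Array<Record<string, unknown>>;
}
interface MigrationExecutor {
  execute(sql: string): StatementResult;
}
interface BlockResult {
  block: string;
  outcome: "applied" | "noop";
  rowsAffected: number;
}

const LIVE_PAID_FILTER =
  "active = true AND test_mode = 'live' " +
  "AND test_type IN ('known_answer', 'edge_case', 'negative', 'known_bad')";
const ONE_CENT_SLUGS =
  "'pep-check', 'sanctions-check', 'adverse-media-check', 'uk-cop-check'";

function runPaidVendorBlock(db: MigrationExecutor): BlockResult {
  // The external_cost_cents = 0 guard makes both UPDATEs match nothing on re-run.
  const oneCent = db.execute(
    `UPDATE test_suites SET external_cost_cents = 1 ` +
      `WHERE capability_slug IN (${ONE_CENT_SLUGS}) AND ${LIVE_PAID_FILTER} ` +
      `AND external_cost_cents = 0`,
  );
  const threeCent = db.execute(
    `UPDATE test_suites SET external_cost_cents = 3 ` +
      `WHERE capability_slug = 'risk-narrative-generate' AND ${LIVE_PAID_FILTER} ` +
      `AND external_cost_cents = 0`,
  );
  // Post-condition: no in-scope suite may still sit at zero after the UPDATEs.
  const check = db.execute(
    `SELECT count(*) AS remaining_zero FROM test_suites ` +
      `WHERE capability_slug IN (${ONE_CENT_SLUGS}, 'risk-narrative-generate') ` +
      `AND ${LIVE_PAID_FILTER} AND external_cost_cents = 0`,
  );
  const remainingZero = Number(check.rows[0]?.remaining_zero ?? 0);
  if (remainingZero > 0) {
    throw new Error(`block 0062: ${remainingZero} paid-vendor suite(s) still at cost 0`);
  }
  const rowsAffected = oneCent.rowCount + threeCent.rowCount;
  return { block: "0062", outcome: rowsAffected > 0 ? "applied" : "noop", rowsAffected };
}

// Stub that replays canned results in order, one per execute() call.
function stubExecutor(results: StatementResult[]): MigrationExecutor {
  const queue = [...results];
  return {
    execute() {
      const next = queue.shift();
      if (!next) throw new Error("stub executor ran out of canned results");
      return next;
    },
  };
}
```

With a canned-result queue, the first-run, idempotent re-run, and post-condition-violation cases can each be exercised without a database.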

Deleted apps/api/scripts/apply-migrations.ts. The historical admin
endpoint at routes/internal-tests.ts:/admin/apply-migrations is left
in place — it's a separate manual-recovery path with its own (older,
incomplete) block list. Action 3 audit will surface what to do with
it.

Tests: 9 new in src/lib/startup-migrations.test.ts using a stub
MigrationExecutor (PgDialect.sqlToQuery for SQL shape; canned-result
queue for behaviour). Per DEC-20260504-A audit-followup test
coverage protocol — every block gets:
- First-run: assert the block runs the expected DDL/DML and reports
  the expected outcome.
- Second-run idempotency: assert that on re-run the block either
  skips (information_schema check returns 1) or matches zero rows
  (WHERE filter on the post-state).
- Block 0062 also gets a post-condition-violation test: stub the
  post-check to return `remaining_zero: 1`, assert the function
  throws (the failure-aborts-boot semantics).

The new structure has a single source of truth for migration blocks,
runs synchronously at boot, and is regression-tested. Dead-file
silent-failure mode eliminated.

Verification:
- tsc --noEmit clean
- vitest run — 469 pass, 11 skipped, 0 fail (9 new regression tests)
- lint:no-bare-catch / lint:no-new-console / lint:migration-prefixes
  all clean

Manual recovery completed prior to this commit:
- Migration 0060 marketplace_eligible columns applied to prod at
  16:31 UTC 2026-05-04 (during the active outage; identical SQL to
  the migration file, ADD COLUMN IF NOT EXISTS — non-destructive).
- PR #49's paid-vendor UPDATEs applied to prod at ~22:00 UTC
  2026-05-04 (22 rows: 16 Dilisense+eSortcode at 1¢ + 6 risk-
  narrative-generate at 3¢; idempotent re-run produced 0 changes).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
petterlindstrom79 added a commit that referenced this pull request May 4, 2026
…Anthropic-Haiku bleed (#55)

Block 0063 — sibling of block 0062 (PR #49 paid-vendor classification),
single-capability variant for invoice-extract.

## Why
Prod query 2026-05-04 found all 4 active non-probe live test suites
(known_answer, edge_case, negative, known_bad) for `invoice-extract`
at `external_cost_cents = 0`. Per DEC-20260503-B the scheduler skips
suites with `external_cost_cents > 0` for paid vendors; at 0, they
were being scheduled hourly and paying Anthropic Haiku vision to
"extract invoice fields" from `httpbin.org/image/jpeg` (a JPEG of a
dog). The fixture issue is hygiene-only; this block is the structural
fix that flips the scheduler-skip semantic.

## What
UPDATE test_suites SET external_cost_cents = 1 WHERE
  capability_slug = 'invoice-extract'
  AND active = true AND test_mode = 'live'
  AND test_type IN ('known_answer','edge_case','negative','known_bad')
  AND external_cost_cents = 0

dependency_health and schema_check are explicitly excluded — they use
the auth-less probe pattern (no paid call), legitimately stay at 0.
1¢ floor matches the PR #49 / block 0062 defensible-minimum pattern;
real Haiku-vision-on-small-JPEG cost is below 1¢, but the floor's
operational role is purely the scheduler-skip flip.

## Idempotency
The `external_cost_cents = 0` filter in the WHERE clause makes the
UPDATE a no-op on re-run. Post-condition assertion fails boot if any
new active live non-probe invoice-extract suite shows up at 0.

## Per active protocols
- DEC-20260504-A: 3 regression tests cover (a) first-run UPDATE shape
  including the test_type list and the negative assertion that probes
  are excluded, (b) idempotent re-run zero-rows, (c) post-condition
  violation throws.
- DEC-20260504-C: this is a startup-migrations block; PR #51/#52 wired
  runStartupMigrations() to fire blocking before validateSchema() at
  every API restart. No deploy-mechanism risk.

Notion to-do 35667c87-082c-817f-9905-cfad380b8ae3 (invoice-extract
fixture). Fixture itself left as the dog JPEG intentionally — once the
suite is no longer scheduled, the fixture quality is moot.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
petterlindstrom79 added a commit that referenced this pull request May 11, 2026
…ly testing (#85)

Phase 1 (Contain): startup-migrations block 0064 sets external_cost_cents
to 1¢ on the 73 always-LLM Haiku capabilities whose suites were still at
0 after PR #49 (which explicitly deferred the bulk-set). The test
scheduler dispatch query (apps/api/src/jobs/test-scheduler.ts:262) filters
on external_cost_cents = 0, so until this fix every always-LLM Haiku cap
was being executed hourly with real Anthropic billing.

Phase 2 (Understand): see Notion Journal entry for the named failure
pattern ("Compound-PR cost leak") and audit PR #84 for the full ramp
attribution.

Phase 3 (Harden): new llm-capability-costs.ts holds the canonical map +
CONDITIONAL_LLM_CAPABILITIES exclusion set. New CI test walks
apps/api/src/capabilities/*.ts and fails if any @anthropic-ai/sdk
importer is not registered as always-LLM, conditional-LLM, or
deactivated. Adding a new LLM-using capability without registering its
cost fails CI.
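
The registration check described in Phase 3 might look roughly like the sketch below. It is an assumption-laden illustration: the real CI test walks the capabilities directory on disk, whereas this version takes file contents as a plain map, and `LlmRegistry` and `findUnregisteredLlmCapabilities` are hypothetical names.

```typescript
// Hypothetical grouping of the three registration sets named in the commit message.
interface LlmRegistry {
  alwaysLlm: ReadonlySet<string>;
  conditionalLlm: ReadonlySet<string>;
  deactivated: ReadonlySet<string>;
}

/**
 * Returns slugs of capability files that import the Anthropic SDK
 * but appear in none of the three registration sets.
 * `sources` maps a file name such as "invoice-extract.ts" to its contents.
 */
function findUnregisteredLlmCapabilities(
  sources: Record<string, string>,
  registry: LlmRegistry,
): string[] {
  const unregistered: string[] = [];
  for (const fileName of Object.keys(sources)) {
    const source = sources[fileName] ?? "";
    if (!source.includes("@anthropic-ai/sdk")) continue; // not an LLM caller
    const slug = fileName.replace(/\.ts$/, "");
    const registered =
      registry.alwaysLlm.has(slug) ||
      registry.conditionalLlm.has(slug) ||
      registry.deactivated.has(slug);
    if (!registered) unregistered.push(slug);
  }
  return unregistered.sort();
}
```

A CI test built on this would fail whenever the returned list is non-empty, so a new SDK importer cannot ship without a cost decision.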

Idempotency: 0064 filters on `external_cost_cents = 0` in the WHERE
clause + a post-condition SELECT that throws on remaining_zero > 0.
Mirrors the PR #49 / PR #55 pattern.

Slug count: 73 always-LLM Haiku caps (audit PR #84 estimated ~50; the
full grep + filter against DEACTIVATED + CONDITIONAL_LLM yields 73,
inside the 30-80 window, so neither of the prompt's halt conditions,
<30 or >80, is triggered).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>