Skip to content

fix: post-audit P1 hardening (RLS, shutdown, CORS, secrets, cleanup)#10

Merged
yuzushi-dev merged 4 commits intomainfrom
fix/post-audit-p1-fixes
Apr 22, 2026
Merged

fix: post-audit P1 hardening (RLS, shutdown, CORS, secrets, cleanup)#10
yuzushi-dev merged 4 commits intomainfrom
fix/post-audit-p1-fixes

Conversation

@yuzushi-dev
Copy link
Copy Markdown
Owner

Landing the P1 block of the 2026-04-17 production audit plan (docs/plans/2026-04-17-production-audit-fix-plan.md). Four independent fixes grouped by concern; each commit is self-contained and reviewable on its own.

fix(api): always set app.current_tenant GUC (P1-1)

RLS policies on tenant tables read current_setting('app.current_tenant') without the missing_ok flag, so requests reaching get_db_session without a bound tenant blew up with 42704 unrecognized configuration parameter. The loudest symptom was the super-admin cross-tenant metrics endpoint logging Failed to resolve document names on every call. Fix: unconditionally emit set_config, using an empty string when no tenant is bound.

fix(platform): close() instead of aclose() on Neo4j/Milvus shutdown (P1-2)

Neo4jClient and MilvusVectorStore expose close(), not aclose() (only the Redis async client has aclose). The wrong method raised AttributeError on every Platform.shutdown(), leaving SQLAlchemy pool connections un-returned and triggering the downstream non-checked-in connection ... will be terminated errors from the GC.

chore: move src/test_*.py ad-hoc scripts → scripts/debug/ (P1-3)

Four asyncio smoke scripts (test_factory_chain, test_nim_timeout, test_or, test_timeout) were committed under the production Python package. Moved to scripts/debug/ so they no longer ship in the Docker image.

security(api): refuse wildcard CORS in prod + warn on dev SECRET_KEY_OLD (P1-4, P1-5)

  • CORS: a missing CORS_ORIGINS used to silently fall back to "*". With DEBUG=false we now raise RuntimeError at startup, forcing the operator to declare an explicit allow-list. Debug mode keeps the permissive default with a warning log.
  • SECRET_KEY_OLD: emit a warning if the rotation-fallback secret is one of the well-known dev defaults (amber-dev-key-2024, default-insecure-key), which defeat the dual-key keyring and leave legacy hashes forgeable.

Test plan

  • CI green
  • After deploy, verify production logs no longer contain:
    • Failed to resolve document names: ... unrecognized configuration parameter "app.current_tenant"
    • Error closing Neo4j: 'Neo4jClient' object has no attribute 'aclose'
    • Error closing Milvus store: 'MilvusVectorStore' object has no attribute 'aclose'
    • non-checked-in connection ... will be terminated
  • Super-admin metrics endpoint /v1/admin/maintenance/metrics/queries shows document filenames (not doc_ids) on the default tenant at minimum
  • With DEBUG=false and CORS_ORIGINS set, API boots; with CORS_ORIGINS unset, API refuses to start
  • Prod .env: remove SECRET_KEY_OLD once rotation window is confirmed closed (or replace with the actual retiring key)

These were never pytest test modules - they are smoke scripts with
asyncio entrypoints used to probe provider factory, NIM timeout, and
OpenRouter behaviour by hand. Having them under src/ meant they got
included in the Docker image and cluttered the production package.

Refs: docs/plans/2026-04-17-production-audit-fix-plan.md (P1-3)
…utdown

Neo4jClient and MilvusVectorStore both expose close(), not aclose()
(only the Redis async client has aclose). The wrong method name
raised AttributeError on every Platform.shutdown(), leaving SQLAlchemy
pool connections un-returned and triggering the follow-up
"non-checked-in connection ... will be terminated" errors from the GC.

Refs: docs/plans/2026-04-17-production-audit-fix-plan.md (P1-2)
RLS policies on tenant tables read current_setting(\"app.current_tenant\")
without the missing_ok flag, so any request that reached get_db_session
without a bound tenant raised "unrecognized configuration parameter"
42704 on the first query. The super-admin cross-tenant metrics path in
admin/maintenance was the loudest symptom, logging "Failed to resolve
document names" for every call.

Fix: always emit set_config(), using an empty string when no tenant is
bound. Behaviour for normal tenant-scoped requests is unchanged.

Refs: docs/plans/2026-04-17-production-audit-fix-plan.md (P1-1)
Two startup guardrails:

- CORS: a missing CORS_ORIGINS config used to silently fall back to
  "*". With DEBUG=false we now refuse to boot and raise RuntimeError,
  forcing the operator to set an explicit allow-list. DEBUG=true keeps
  the permissive default with a warning log.
- SECRET_KEY_OLD: log a warning when the rotation-fallback secret is
  one of the well-known non-entropic dev defaults (amber-dev-key-2024,
  default-insecure-key). Those values defeat the dual-key keyring and
  leave legacy hashes forgeable.

Refs: docs/plans/2026-04-17-production-audit-fix-plan.md (P1-4, P1-5)
Comment thread src/api/main.py
"SECRET_KEY_OLD is set to a known dev default (%s). "
"Complete the rotation and unset SECRET_KEY_OLD, or replace "
"it with the actual previous secret being retired.",
settings.secret_key_old[:12] + "…",
@yuzushi-dev yuzushi-dev merged commit 31a499c into main Apr 22, 2026
3 of 5 checks passed
@yuzushi-dev yuzushi-dev deleted the fix/post-audit-p1-fixes branch April 22, 2026 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants