
fix: run migrations on tenant schemas at startup and harden worker poller #237

Merged
cdbartholomew merged 1 commit into main from fix/tenant-schema-migrations
Jan 29, 2026
Conversation

@cdbartholomew
Contributor

Summary

  • Run database migrations on all existing tenant schemas at startup (not just the public schema)
  • Harden the worker poller so one broken tenant schema doesn't crash the entire polling loop

Problem

When new migrations are deployed, only the public schema gets migrated at startup. Existing tenant schemas are left behind on their old migration version. This caused the worker poller to crash silently when it tried to query columns (e.g. worker_id, task_payload) that didn't exist in tenant schemas.

The crash was silent because poller.run() is started via asyncio.create_task(), which stores an unhandled exception on the task instead of raising it in the caller. The exception came from recover_own_tasks(), which is called before any try/except block, so it ended the coroutine without any logging.
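The failure mode can be reproduced in isolation. The following is a minimal sketch, not the project's code: run() and recover_own_tasks() are stand-ins that only borrow the names from the description above, and the error message is invented for illustration.

```python
import asyncio


async def recover_own_tasks():
    # Stand-in for the real method: simulates a query against a column
    # that an unmigrated tenant schema does not have (message is illustrative).
    raise RuntimeError('column "worker_id" does not exist')


async def run():
    # The exception is raised here, before any try/except is entered,
    # so the polling loop below never starts.
    await recover_own_tasks()
    while True:
        await asyncio.sleep(1)


async def main():
    # Fire-and-forget: nothing awaits the task, so the exception is stored
    # on the Task object and never raised in this coroutine.
    task = asyncio.create_task(run())
    await asyncio.sleep(0.01)  # give the task time to run and fail
    # The task has already finished; the error is only visible if someone
    # asks the task for it explicitly.
    return task.done(), repr(task.exception())


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Nothing is logged at the moment the task fails. The error only becomes visible when task.exception() is called or the task is awaited.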

Changes

memory_engine.py — After running public schema migrations, iterate all tenants from the tenant extension and run migrations on each schema. Wrapped in try/except per-schema so one failure doesn't block others.
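The per-schema isolation described for memory_engine.py might look roughly like the sketch below. The names migrate_all_schemas, run_migrations and tenant_schemas are hypothetical and do not come from the PR. Letting a public-schema failure propagate is also an assumption.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def migrate_all_schemas(
    run_migrations: Callable[[str], None],  # hypothetical: migrates one schema
    tenant_schemas: Iterable[str],          # hypothetical: names from the tenant extension
) -> list[str]:
    """Migrate the public schema, then every tenant schema.

    Returns the names of tenant schemas whose migration failed.
    """
    # Assumption: a failure on the public schema is still fatal and propagates.
    run_migrations("public")

    failed = []
    for schema in tenant_schemas:
        try:
            run_migrations(schema)
        except Exception:
            # Log with traceback and continue, so one broken tenant
            # does not block migrations for the others.
            logger.exception("Migration failed for tenant schema %s", schema)
            failed.append(schema)
    return failed
```

Returning the failed schema names is a design choice of this sketch. It lets the caller decide whether to alert or retry, while startup proceeds either way.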

worker/poller.py — Defense-in-depth:

  • recover_own_tasks(): try/except per-schema so a broken schema doesn't prevent the polling loop from starting
  • _claim_batch_for_schema(): try/except wrapper so a broken schema doesn't prevent claiming tasks from other schemas
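The same pattern applied to the poller can be sketched as follows. This is an illustration under assumptions, not the code in worker/poller.py: claim_for_schema is a hypothetical coroutine standing in for the real database query, and claim_safely and claim_from_all are invented helper names.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

# Hypothetical signature: takes a schema name, returns the claimed task ids.
ClaimFn = Callable[[str], Awaitable[list]]


async def claim_safely(schema: str, claim_for_schema: ClaimFn) -> list:
    """Claim tasks from one schema; treat a failure as 'nothing claimed'."""
    try:
        return await claim_for_schema(schema)
    except Exception:
        # A broken schema is logged and skipped rather than aborting the poll.
        logger.exception("Failed to claim tasks from schema %s", schema)
        return []


async def claim_from_all(schemas: list, claim_for_schema: ClaimFn) -> list:
    """Collect claimed tasks across all schemas, skipping broken ones."""
    claimed = []
    for schema in schemas:
        claimed.extend(await claim_safely(schema, claim_for_schema))
    return claimed
```

Returning an empty list on failure means the calling loop needs no special case for a broken schema. A schema that fails on every poll will log on every poll, which is noisy but not silent.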

Test plan

  • Verified fix on hindsight-dev: ran migrations on 15 tenant schemas, restarted pods, worker poller successfully entered polling loop and processed tasks
  • Verified fix on hindsight-prod: ran migrations on 3 tenant schemas preemptively
  • Run unit tests

…ller

Tenant schemas were never migrated when new migrations were deployed.
Only the public schema was migrated at startup, and tenant schemas only
got migrations when first provisioned. This meant existing tenants
missed any new columns (e.g. task_payload, worker_id, claimed_at on
async_operations), causing the worker poller to crash silently.

Changes:
- Run migrations on all existing tenant schemas at startup when a
  tenant_extension is configured. Each schema migration is wrapped in
  try/except so one failure doesn't block others.
- Add try/except in WorkerPoller.recover_own_tasks() so a broken
  schema doesn't prevent the polling loop from starting.
- Add try/except in WorkerPoller._claim_batch_for_schema() so a
  broken schema doesn't prevent claiming tasks from other schemas.
@cdbartholomew cdbartholomew merged commit 657fe02 into main Jan 29, 2026
17 of 26 checks passed

2 participants