
feat(deploy): persist DuckLake catalog on EFS instead of ephemeral EBS#14

Merged: smithclay merged 1 commit into main from codex/deploy-efs-catalog on May 26, 2026

Conversation

@smithclay
Owner

Problem

The catalog DuckDB file (canardstack.ducklake) — the index over the immutable S3 Parquet data — lived on a service-managed EBS volume. Service-managed EBS forces deleteOnTermination=true (confirmed in AWS docs), so the volume is destroyed on every catalog task replacement (deploy, crash, scale), orphaning the S3 data and effectively losing the dataset. Any catalog restart = data loss.

Change

Move the catalog volume to EFS (filesystem + access point + per-AZ mount targets + NFS security group), which persists across task replacement. The container mount point is unchanged (/var/lib/canardstack), so this is infra-only — no binary/image change. Drop the now-unused EbsSizeGiB parameter (the app raw-spool volume stays on managed EBS via AppEbsSizeGiB).
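
As a rough illustration of the EFS wiring described above, the sketch below builds the volume and mount-point fragments of an ECS task definition in the shape that boto3's `register_task_definition` accepts. The filesystem ID, access point ID, volume name `catalog-data`, and the helper name are hypothetical, and the PR's actual deploy template is not shown here. Only the container path `/var/lib/canardstack` comes from the description.

```python
from typing import Any

CATALOG_MOUNT_PATH = "/var/lib/canardstack"  # container path stated in the PR (unchanged)


def efs_catalog_volume(
    file_system_id: str,
    access_point_id: str,
    volume_name: str = "catalog-data",  # hypothetical volume name
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (volume, mount_point) fragments for an ECS task definition."""
    volume = {
        "name": volume_name,
        "efsVolumeConfiguration": {
            "fileSystemId": file_system_id,
            # ECS requires transit encryption when an access point is used.
            "transitEncryption": "ENABLED",
            "authorizationConfig": {"accessPointId": access_point_id},
        },
    }
    mount_point = {
        "sourceVolume": volume_name,  # must match the volume name above
        "containerPath": CATALOG_MOUNT_PATH,
        "readOnly": False,  # the catalog is the single writer
    }
    return volume, mount_point


# Placeholder IDs for illustration only.
volume, mount_point = efs_catalog_volume(
    "fs-0123456789abcdef0", "fsap-0123456789abcdef0"
)
```

Because the container path stays the same, swapping the volume definition behind it is what keeps the change infra-only.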

Single-writer safety on a shared filesystem

EBS physically enforced single-writer (single attach); EFS is shared, so single-writer is now enforced by DesiredCount: 1 + MaximumPercent: 100 (stop-old-before-start-new, never two tasks) plus DuckDB's file lock as a backstop. The catalog must never scale past 1. DuckDB advises against read-write DB files on NFS; acceptable here only because serve-catalog is the single writer. Postgres remains the stronger durable-catalog option.
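
The single-writer argument can be checked with a little arithmetic. The sketch below follows the documented ECS rolling-update rule that the task ceiling is `desiredCount * maximumPercent / 100` rounded down and the healthy floor is `desiredCount * minimumHealthyPercent / 100` rounded up. The function names are hypothetical, and the PR does not state a `MinimumHealthyPercent` value; `0` is an assumption used here to show a configuration in which stop-old-before-start-new can actually proceed.

```python
def rolling_deploy_bounds(
    desired_count: int, maximum_percent: int, minimum_healthy_percent: int
) -> tuple[int, int]:
    """Return (max_tasks, min_healthy) for an ECS rolling deployment."""
    max_tasks = desired_count * maximum_percent // 100  # rounded down
    min_healthy = (desired_count * minimum_healthy_percent + 99) // 100  # rounded up
    return max_tasks, min_healthy


def is_single_writer_safe(desired_count: int, maximum_percent: int) -> bool:
    """True when a deployment can never run two catalog tasks at once."""
    max_tasks, _ = rolling_deploy_bounds(desired_count, maximum_percent, 0)
    return max_tasks <= 1


def can_roll(
    desired_count: int, maximum_percent: int, minimum_healthy_percent: int
) -> bool:
    """True when the scheduler has room to either start a new task or stop an old one."""
    max_tasks, min_healthy = rolling_deploy_bounds(
        desired_count, maximum_percent, minimum_healthy_percent
    )
    return max_tasks > desired_count or min_healthy < desired_count


# DesiredCount: 1 + MaximumPercent: 100, as in the PR.
safe = is_single_writer_safe(1, 100)
# The ECS default MaximumPercent of 200 would allow two tasks side by side.
default_is_safe = is_single_writer_safe(1, 200)
```

Under these assumptions, `DesiredCount: 1` with `MaximumPercent: 100` caps the service at one task, while a healthy floor of 100% would leave the deployment with no room to replace the task.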

Verified live

Deployed on top of the v0.0.6 image (which carries the catalog S3-creds compaction fix from #13):

  • Catalog rolled EBS→EFS and came up healthy (clean lock handoff).
  • Forced CHECKPOINT returned ran:true, status:ok (the operation that previously 503'd), with ducklake_checkpoint_runs_total{status="ok"} 1.
  • Fresh ingest + seals to S3 working.
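
The checkpoint counter in the list above can be checked mechanically. The following is a minimal sketch that sums the `status="ok"` samples of `ducklake_checkpoint_runs_total` from Prometheus text-format output. How the metrics text is fetched is left out, since the PR does not name the endpoint, and the helper name is hypothetical. The sample line is the one quoted above.

```python
import re

# name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>\S+)(?:\s+\d+)?$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def checkpoint_ok_count(metrics_text: str) -> float:
    """Sum ducklake_checkpoint_runs_total samples whose status label is "ok"."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != "ducklake_checkpoint_runs_total":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if labels.get("status") == "ok":
            total += float(match.group("value"))
    return total


sample = (
    "# TYPE ducklake_checkpoint_runs_total counter\n"
    'ducklake_checkpoint_runs_total{status="ok"} 1\n'
)
ok_runs = checkpoint_ok_count(sample)
```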

Service-managed EBS volumes force deleteOnTermination=true, so the catalog's
managed EBS volume (holding canardstack.ducklake, the index over the immutable
S3 Parquet data) is destroyed on every catalog task replacement -- deploy,
crash, scale -- which orphans the S3 data and effectively loses the dataset.

Move the catalog volume to EFS (filesystem + access point + per-AZ mount targets
+ NFS security group), which persists across task replacement. The catalog
container mount point is unchanged (/var/lib/canardstack), so this is infra-only
-- no binary/image change. Drop the now-unused EbsSizeGiB parameter; the app
raw-spool volume stays on managed EBS (AppEbsSizeGiB).

NOTE: DuckDB advises against read-write database files on NFS; acceptable here
only because serve-catalog is the single writer. Postgres remains the stronger
durable-catalog option.

smithclay force-pushed the codex/deploy-efs-catalog branch from 1432c79 to c94e763 on May 26, 2026 00:40
smithclay merged commit 77106ed into main on May 26, 2026
5 checks passed
smithclay deleted the codex/deploy-efs-catalog branch on May 26, 2026 04:59