v0.99.39
Performance — backup and restore both run −51% end-to-end on the 136 GB / 431M-row bench corpus (backup full 881 s → 435 s, restore 2810 s → 1390 s), shrinking the gap to the pg_dump/pg_restore -j8 specialists from ~3.1–3.2× to 1.83× / 1.51×; the O(N²) manifest-rewrite wall (~78 hours of pure checkpoint I/O at 100k tables) is gone; and PG schema reads no longer die with SQLSTATE 53100 on ≥50k-table catalogs under container-default 64 MB /dev/shm. Drop-in from v0.99.38 — no flag or default changes, and finalized backups keep the exact same on-disk format.
Performance
- Backup/restore per-row JSON codec rewritten as a direct buffer-append fast path (tasks #51/#52). Profiling the 136 GB bench corpus showed that the reflection-based encoding/json round trip of the per-row map was 49% of backup full CPU and 69% of restore CPU. The chunk row encode/decode now runs on a specialized codec that emits and parses the SAME wire bytes (byte-identical output for every shape the fast path accepts, no chunk-format change, old and new binaries read each other's chunks) and is about 10× faster per row in both directions (encode 82 → 0 allocs/row, decode 189 → 27). Any value or line outside the canonical shapes falls back to the legacy path, which remains the semantic and error oracle; differential sweeps plus two fuzz targets pin the two paths equivalent on arbitrary input. Together with the O(1) checkpoints below, measured end-to-end on the same corpus: both legs −51%, zero-loss (docs/comparison-backup.md has the full methodology).
- Backup checkpoints are O(1) per event: the manifest is no longer rewritten per chunk/table (task #54, ADR-0086). Every per-chunk / per-table checkpoint used to re-marshal the ENTIRE manifest (embedded schema included) and re-Put manifest.json, making the row sweep quadratic in table count; the #38 scale probe measured ~78 hours of pure manifest rewriting at 100k tables. The in-progress manifest is now a base written once plus an append-only manifest.progress.jsonl sidecar (one attempt-ID-stamped JSON line per event). The manifest is marshaled exactly twice per run (base + final), and the finalized backup folds everything back into one self-contained manifest.json, so the on-disk layout of a successful backup is unchanged. Pinned by a 10k-table committer benchmark asserting that checkpoint bytes are independent of table count. Stores without an append primitive (S3/GCS/Azure blob stores) keep the previous full-rewrite checkpoints and say so loudly on large corpora.
- PG→PG raw-copy single stream is ~4.9× faster (task #37, PR #196). The PG server emits one CopyData message per row on COPY TO STDOUT, and each row paid a synchronous unbuffered-pipe rendezvous plus a ~265-byte socket write to the target, costing 81.8% of single-stream CPU. A 64 KiB buffer ahead of the pipe coalesces the frames; this is byte-transparent because the COPY stream has no per-Write framing. Result: 72.6 s → 15.0 s on a 4M-row / 1040 MB single-stream run (14 → ~73 MB/s), checksum-verified zero-loss, pinned by the existing 13-test raw-copy + sync cold-start suites.
- MySQL LOAD DATA bulk writes get the same fix (PR #197). The TSV encoder issued one unbuffered pipe write per row (the identical per-row pipe-rendezvous class as the raw-copy lane) and is now buffered the same way (64 KiB, flushed before close on success; the error path still poisons the read so failures stay loud). Byte-transparent; the existing LOAD DATA zero-loss and warning-probe pins cover it.
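The effect of the two buffering fixes above can be illustrated with a minimal Go sketch. It is not sluice's code: countingWriter, writeRows, and coalescedCalls are hypothetical names, and the counting sink merely stands in for a pipe or socket where every Write call is expensive. It shows how a 64 KiB bufio.Writer in front of such a sink turns one downstream write per row into a much smaller number of large writes without changing the bytes.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter is a hypothetical stand-in for a pipe or socket:
// it only records how many Write calls and bytes reach it.
type countingWriter struct {
	calls int
	bytes int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	c.bytes += len(p)
	return len(p), nil
}

// writeRows issues one Write per row, like a per-row frame producer.
func writeRows(w io.Writer, rows int, row []byte) error {
	for i := 0; i < rows; i++ {
		if _, err := w.Write(row); err != nil {
			return err
		}
	}
	return nil
}

// coalescedCalls puts a buffer of bufSize bytes ahead of the sink and
// reports how many writes and bytes actually reach the sink.
func coalescedCalls(rows, rowLen, bufSize int) (calls, total int, err error) {
	sink := &countingWriter{}
	bw := bufio.NewWriterSize(sink, bufSize)
	if err = writeRows(bw, rows, make([]byte, rowLen)); err != nil {
		return 0, 0, err
	}
	// Flush before close so the tail of the stream is not lost.
	if err = bw.Flush(); err != nil {
		return 0, 0, err
	}
	return sink.calls, sink.bytes, nil
}

func main() {
	direct := &countingWriter{}
	_ = writeRows(direct, 10000, make([]byte, 265))
	calls, total, err := coalescedCalls(10000, 265, 64<<10)
	fmt.Println("unbuffered writes:", direct.calls)
	fmt.Println("buffered writes:", calls, "bytes:", total, "err:", err)
}
```

The row size of 265 bytes mirrors the figure quoted above; the row count is arbitrary. The design point is that buffering is safe only when the downstream format carries no per-write framing, which is the property the notes call byte-transparent.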
Fixed
- PG schema reads no longer die with SQLSTATE 53100 on huge catalogs under small /dev/shm (task #55, found by the #38 scale probe). At ≥50k-table catalog sizes Postgres plans parallel hash joins for several of sluice's catalog metadata queries; the parallel workers allocate their shared hash tables as dynamic shared memory segments in /dev/shm, which on container-default 64 MB shm exhausts with "could not resize shared memory segment … No space left on device" (SQLSTATE 53100). Every PG SchemaReader catalog query now runs in its own read-only transaction with SET LOCAL max_parallel_workers_per_gather = 0. Serial plans build hash tables in process-local work_mem and cannot hit the wall, and parallelism buys nothing on catalog reads (validated on a real 50k-table / 150k-index rig: ~15 s either way with headroom, and the fixed binary succeeds with /dev/shm 100% full where the previous one failed). This was always a loud failure: exit non-zero, no data ever moved or lost. Affects all prior releases reading very large PG catalogs in shm-constrained containers; no --shm-size tuning is needed anymore.
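The statement sequence described in this fix can be sketched as follows. This is an illustration under assumptions, not sluice's SchemaReader: the session interface, withSerialPlans, and recorder are hypothetical, and the catalog query in main is only an example. The sketch relies on the Postgres behavior that SET LOCAL lasts only until the end of the current transaction, so the setting cannot leak into other work on the same connection.

```go
package main

import "fmt"

// session is a hypothetical stand-in for one database connection.
type session interface {
	Exec(stmt string) error
}

// withSerialPlans runs fn inside a read-only transaction in which
// parallel query is disabled. SET LOCAL is scoped to the transaction,
// so the setting reverts at COMMIT or ROLLBACK.
func withSerialPlans(s session, fn func() error) error {
	if err := s.Exec("BEGIN READ ONLY"); err != nil {
		return err
	}
	if err := s.Exec("SET LOCAL max_parallel_workers_per_gather = 0"); err != nil {
		_ = s.Exec("ROLLBACK")
		return err
	}
	if err := fn(); err != nil {
		_ = s.Exec("ROLLBACK")
		return err
	}
	return s.Exec("COMMIT")
}

// recorder is a fake session that logs the statements it receives.
type recorder struct{ stmts []string }

func (r *recorder) Exec(stmt string) error {
	r.stmts = append(r.stmts, stmt)
	return nil
}

func main() {
	r := &recorder{}
	err := withSerialPlans(r, func() error {
		// Example catalog read; the real queries are not shown in these notes.
		return r.Exec("SELECT oid, relname FROM pg_catalog.pg_class WHERE relkind = 'r'")
	})
	fmt.Println(err)
	for _, s := range r.stmts {
		fmt.Println(s)
	}
}
```

Wrapping each query in its own transaction keeps the override narrow: a session-level SET would also work, but it would persist on a pooled connection after the catalog read finishes.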
Compatibility
- No breaking changes. Drop-in from v0.99.38 — no flag, default, or invocation changes. Migrate, sync, and CDC paths are untouched except for the two byte-transparent throughput fixes.
- No chunk-format change. The fast codec emits byte-identical wire bytes; backups written by v0.99.39 are readable by older binaries and vice versa.
- Manifest format v3 applies ONLY to in-progress (sidecar-layout) backups. An OLDER binary asked to RESUME a backup that crashed under v0.99.39 refuses loudly ("upgrade sluice") instead of silently resuming off a base that under-reports progress; v0.99.39 resumes old-format in-progress backups unchanged (pinned both directions). Finalized backups keep the pre-existing format version and are fully readable by older binaries.
- Recorded schema_hash values change for schemas with columns that have no default (task #49). ComputeSchemaHash now normalizes a nil column default to the explicit DefaultNone that the manifest decode materializes, so a reader-fresh schema and the same schema re-read from a manifest fingerprint identically. Harmless: the stored hash is write-only today (nothing compares against a previously stored value), and the resume drift guard always recomputes both sides.
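Why normalizing a nil default changes the recorded hash, and why it makes the two representations agree, can be shown with a small sketch. The Column and Default types, the "none" kind, and the JSON-plus-SHA-256 hashing scheme are all assumptions made for illustration; the notes do not describe how ComputeSchemaHash actually serializes a schema.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Default and Column are hypothetical shapes for this sketch.
type Default struct {
	Kind string `json:"kind"`
	Expr string `json:"expr,omitempty"`
}

type Column struct {
	Name    string   `json:"name"`
	Default *Default `json:"default"`
}

// defaultNone plays the role of the explicit "no default" marker.
const defaultNone = "none"

// schemaHash fingerprints the columns after mapping a nil default to the
// explicit marker, so both representations hash to the same value.
func schemaHash(cols []Column) (string, error) {
	norm := make([]Column, len(cols))
	for i, c := range cols { // c is a copy; the caller's slice is not modified
		if c.Default == nil {
			c.Default = &Default{Kind: defaultNone}
		}
		norm[i] = c
	}
	b, err := json.Marshal(norm)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	fresh := []Column{{Name: "id"}} // as read from the database
	decoded := []Column{{Name: "id", Default: &Default{Kind: defaultNone}}} // as re-read from a manifest
	h1, _ := schemaHash(fresh)
	h2, _ := schemaHash(decoded)
	fmt.Println(h1 == h2)
}
```

Without the normalization step, the nil default would serialize differently from the explicit marker and the same logical schema would produce two fingerprints.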
Who needs this — action required
- Everyone using backup/restore: both legs are −51% on the bench corpus. Action: just upgrade; same commands, same artifacts.
- Anyone backing up many-table schemas (thousands and up): the per-chunk manifest rewrite was quadratic in table count and is the difference between feasible and ~78 hours of pure checkpoint I/O at 100k tables. Action: upgrade before attempting very-large-table-count backups; note that blob stores without append (S3/GCS/Azure) keep the old checkpoint cost and warn loudly.
- Anyone reading ≥50k-table PG schemas from containers with default /dev/shm: previously died loudly with SQLSTATE 53100. Action: upgrade and drop any --shm-size workarounds. No data was ever affected; the old behavior was a loud exit, never silent loss.
- Mixed-version fleets: don't try to resume a v0.99.39 in-progress (crashed) backup with an older binary; it refuses loudly by design. Finalized backups interoperate in both directions. Everything else requires no action.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.39 · Container: ghcr.io/sluicesync/sluice:0.99.39
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md