Releases: sluicesync/sluice
v0.99.47
sluice v0.99.47
Bug-fix patch: the target connection-budget preflight no longer false-refuses cold start on a tight or modern (PG 18) managed Postgres. If a sluice migrate/sync ever refused with target connection budget exhausted on a small managed Postgres that clearly had connections to spare, this is the fix.
Fixed
- Connection-budget preflight counts only client backends. The preflight
computed in-use connections with an unfiltered
SELECT count(*) FROM pg_stat_activity, which also counts PostgreSQL's
background processes — checkpointer, background/wal writer, autovacuum
launcher, archiver, logical-replication launcher, and PG 18+ async I/O
workers. None of those consume a max_connections slot, so in-use was
over-reported by the background-process count (≈9 on a PG-18 managed
instance). On a tight target (a managed Postgres with max_connections=25
and ~9 background processes) this produced a false
target connection budget exhausted that blocked cold start entirely,
even though only ~4 real client backends were in use. The probe now counts
WHERE backend_type = 'client backend' (PG 10+; sluice's pgoutput CDC
already requires PG 10+). The role/database sub-probes already filtered
correctly — only the global count was affected.
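As a rough illustration of the over-count, the sketch below models pg_stat_activity as a slice of rows and compares the unfiltered count with a count restricted to client backends. The `activityRow` type, the helper names, and the sample process mix (9 background processes, 4 clients) are assumptions made for this example, not sluice's actual probe code.

```go
package main

import "fmt"

// activityRow is a simplified, hypothetical stand-in for one row of
// pg_stat_activity; only backend_type matters for this illustration.
type activityRow struct {
	BackendType string
}

// countAll mirrors an unfiltered SELECT count(*) FROM pg_stat_activity.
func countAll(rows []activityRow) int { return len(rows) }

// countClientBackends mirrors adding WHERE backend_type = 'client backend'.
func countClientBackends(rows []activityRow) int {
	n := 0
	for _, r := range rows {
		if r.BackendType == "client backend" {
			n++
		}
	}
	return n
}

// sampleActivity builds an illustrative process mix: 9 background
// processes plus 4 real client connections (labels are assumptions).
func sampleActivity() []activityRow {
	types := []string{
		"checkpointer", "background writer", "walwriter",
		"autovacuum launcher", "archiver", "logical replication launcher",
		"io worker", "io worker", "io worker",
		"client backend", "client backend", "client backend", "client backend",
	}
	rows := make([]activityRow, 0, len(types))
	for _, t := range types {
		rows = append(rows, activityRow{BackendType: t})
	}
	return rows
}

func main() {
	rows := sampleActivity()
	fmt.Println("unfiltered count:", countAll(rows))
	fmt.Println("client backends: ", countClientBackends(rows))
}
```

With this sample mix the unfiltered count is 13 while only 4 rows hold a connection slot, which is the gap the fix removes.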
Who needs this
- Anyone running sluice against a small managed Postgres (tight
max_connections) — especially PlanetScale Postgres and other PG 18
instances with async I/O workers — who hit a spurious
target connection budget exhausted on cold start. No flag or config
change needed; upgrade and re-run.
Install
# Linux/macOS/Windows binaries + checksums on the release page.
docker pull ghcr.io/sluicesync/sluice:v0.99.47
v0.99.46
sluice v0.99.46
Completes the schema-change-forwarding story: a PostgreSQL-source RENAME COLUMN now forwards through the live CDC stream instead of halting it — when sluice can prove it's a rename (not a drop+add) via the column's stable catalog identity. This is the final piece of the default-on schema-change forwarding that landed in v0.99.45 (ADR-0091); see Compatibility for the safety reasoning.
Features
- PostgreSQL-source RENAME COLUMN forwarding (ADR-0091, F7b). Under the
default --schema-changes=forward, a column rename on a PostgreSQL source now
forwards to the target as ALTER TABLE … RENAME COLUMN (data preserved on the
target), but only when it is provable. A rename and a DROP x + ADD y of
the same type are indistinguishable from the replication stream's row shape
alone, so sluice proves the distinction with pg_attribute.attnum, the
column's catalog identity that is stable across a rename: same attnum +
different name = a real rename (forward, preserve data); a different attnum = a
genuine drop+add (refuse loudly). The proof is definitive, so this can only
ever forward a true rename or refuse — it can never mis-forward a drop+add and
lose data. Works same-engine (PG → PG) and cross-engine (PG → MySQL, where the
rename is applied on the MySQL target). A MySQL source has no equivalent
stable column id, so a MySQL-source rename continues to refuse loudly (drain +
rename manually) — unchanged.
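The attnum rule above can be sketched as a comparison of two column snapshots taken before and after a schema-change boundary. This is a minimal illustration under assumptions: the `Verdict` type, `classifyColumnChange`, and the map-based snapshot shape are hypothetical, and the sketch only separates "provable rename" from "anything else", not the full forwarding matrix.

```go
package main

import "fmt"

// Verdict is a hypothetical outcome of comparing two column snapshots.
type Verdict string

const (
	NoChange Verdict = "no-change"
	Rename   Verdict = "rename"
	Refuse   Verdict = "refuse"
)

// classifyColumnChange compares attnum -> column-name snapshots.
// Same attnum with a different name is a provable rename; an attnum that
// vanished or appeared means columns were dropped/added, so this sketch
// refuses rather than guess.
func classifyColumnChange(before, after map[int16]string) Verdict {
	renamed, dropped, added := 0, 0, 0
	for attnum, oldName := range before {
		newName, ok := after[attnum]
		switch {
		case !ok:
			dropped++
		case newName != oldName:
			renamed++
		}
	}
	for attnum := range after {
		if _, ok := before[attnum]; !ok {
			added++
		}
	}
	switch {
	case renamed == 0 && dropped == 0 && added == 0:
		return NoChange
	case renamed == 1 && dropped == 0 && added == 0:
		return Rename
	default:
		return Refuse
	}
}

func main() {
	before := map[int16]string{1: "id", 2: "email"}

	// ALTER TABLE ... RENAME COLUMN keeps attnum 2.
	renamed := map[int16]string{1: "id", 2: "email_address"}
	fmt.Println(classifyColumnChange(before, renamed))

	// DROP email + ADD email_address allocates a new attnum (3).
	dropAdd := map[int16]string{1: "id", 3: "email_address"}
	fmt.Println(classifyColumnChange(before, dropAdd))
}
```

Both "after" snapshots look identical by name and type on the wire; only the attnum distinguishes them, which is why the check needs a catalog read.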
Compatibility
- Extends the v0.99.45 behavior change to PostgreSQL renames. v0.99.45 made
sluice sync forward source DDL by default; renames were the one shape it
still refused. On a PostgreSQL source, a RENAME COLUMN now forwards
automatically. To keep the conservative halt-on-DDL behavior for all shapes,
set --schema-changes=refuse (unchanged from v0.99.45). No on-disk/format
change; migrate and the cold-start copy path are untouched.
- The rename proof relies on a column-identity catalog read on the PostgreSQL
source at schema-change boundaries (rare events), so there is no steady-state cost.
Who needs this
- Operators running a live PostgreSQL-source sync who rename columns as part
of routine schema evolution: the stream now stays online through the rename
instead of refusing and forcing a drain + manual DDL.
- Everyone else is unaffected: MySQL-source renames still refuse (safely), and
no other shape's behavior changes from v0.99.45.
Install
# Linux/macOS/Windows binaries + checksums on the release page.
# Container image:
docker pull ghcr.io/sluicesync/sluice:v0.99.46
v0.99.45
sluice v0.99.45
New: sluice sync now forwards source schema changes by default — your CDC stream stays online through routine column adds/drops/type changes instead of refusing or crash-looping. Three changes from the first real PlanetScale soak: (1) default-on single-stream schema-change forwarding via a new tristate --schema-changes=forward|refuse (default forward); (2) source-side Vitess schema-resolution gaps (DDL cutover / historian off) now ride out in-process instead of killing the stream; (3) target schema-drift apply errors now self-heal instead of a tight restart crash-loop. This is a behavior change on upgrade — see Compatibility. No on-disk/format changes; migrate and the cold-start copy path are untouched.
Features
- Default-on schema-change forwarding for single-stream CDC (ADR-0091, F7a). A new tristate flag --schema-changes=forward|refuse (default forward) replaces the old opt-in ADD-COLUMN-only path. When the source applies an unambiguous DDL change, sluice now retargets it to the target dialect and applies it in-line on the CDC apply boundary, so the sync stays online instead of refusing and forcing a manual drain-and-DDL. The per-shape forwarding matrix (ADR-0091 §1d is the source of truth) is:
  - ADD COLUMN, DROP COLUMN, ALTER COLUMN TYPE: forward on both source engines, same-engine and cross-engine (MySQL↔PG). DROP COLUMN auto-applies on the target.
  - ALTER NULLABILITY: forwards on a MySQL source only (the reader's emission gate is widened to a separate nullability-delta signature; the value-fidelity decode signature is left untouched). PG-source nullability is not forwarded because pgoutput's wire carries no nullability flag.
  - REORDER: a no-op (sluice decodes by column name, not ordinal).
  - RENAME COLUMN: refuses loudly on both engines. From the stream alone a rename is indistinguishable from a same-type drop+add, and guessing wrong risks silent data loss. (PG attnum-proven rename forwarding is a planned follow-up, F7b.)
  - Multi-shape combos and ADD COLUMN with a volatile/computed DEFAULT (NOW()/nextval/random): refuse loudly, preserved verbatim from ADR-0058.
  - Documented limitations, not forwarded (the wire doesn't carry the metadata): PG-source nullability / index / check; MySQL-source index / check. These produce no boundary and so are invisible to forwarding. This is not silent corruption: any resulting incompatibility (e.g. a source DROP NOT NULL the target still enforces) surfaces as a loud apply error on the next affected row; a benign one (a missing secondary index) simply leaves the target without that object.
  - Safety against phantom-destructive forwards: a seed-guard never forwards a destructive shape classified against the cold-start baseline (only on a genuine CDC→CDC boundary), and the PG normalizer strips the generated columns + secondary indexes that pgoutput omits from the cold-start seed, so a phantom DROP can't be synthesized and applied (this was a CRITICAL regression caught by CI on the flip and fixed before ship).
  - --forward-schema-add-column is deprecated: it still forwards (now subsumed by --schema-changes=forward) and emits a one-time WARN. Pin the old conservative behavior with --schema-changes=refuse.
Fixed
- Source-side Vitess schema-resolution errors are now retriable, not terminal (F9). The source vstreamer resolves each row event against the table schema for the replay position; right after a DDL cutover, or when the Vitess schema historian is off (track_schema_versions is disabled by default on PlanetScale), that lookup transiently misses with unknown table <t> in schema or no schema found for table <t>. These arrive as free-text VStream errors with no gRPC status or MySQL-error wrapper, so they fell through to terminal and killed the stream on a window that clears itself once the historian catches up. They are now classified ir.RetriableError, so the ADR-0038 backoff rides out the cutover window in-process. Substring-matched and pinned, with a near-miss guard so a bare "unknown table" (a real DROP/typo) stays terminal. Affected releases: the reader-error classification shipped in v0.46.0, so every CDC release v0.46.0 through v0.99.44 carried the terminal misclassification (introduced by e320a49, internal/engines/mysql/reader_errors.go); it became materially more likely to bite on PlanetScale (historian off) and on self-hosted Vitess once warm-resume landed in v0.99.44.
- Target schema-drift apply errors no longer tight-restart crash-loop; the sync self-heals (F8). A source ADD COLUMN (or new table) that the operator hasn't yet created on the target made the apply fail terminal (PG 42703 undefined_column / 42P01 undefined_table, MySQL 1054 ER_BAD_FIELD_ERROR / 1146 ER_NO_SUCH_TABLE), exiting the process; under a supervisor that became a ~6s tight-restart loop (the soak observed NRestarts=1821). These codes are now classified ir.RetriableError with a remedy-named message, so the ADR-0038 exponential backoff rides them out in-process and the sync self-heals the moment the operator adds the missing column/table on the target (verified live on the soak). The wrap keeps the underlying *PgError / *MySQLError reachable via errors.As, so the offending column stays named on every (loud) retry; a genuine sluice bug producing these still fails loud after the retry budget, just not in a tight loop. Covers MySQL→MySQL (incl. PlanetScale→PlanetScale) and PG targets symmetrically. Scope: ADD COLUMN / missing-table only; DROP COLUMN, rename, and reorder drift are tracked separately and intentionally out of scope. Affected releases: the default-deny (terminal) classification of these codes has existed since the bounded-retry applier framework shipped in v0.42.0 (introduced by 008f2f2, internal/engines/postgres/applier_errors.go), so every CDC release v0.42.0 through v0.99.44 carried the crash-loop behavior.
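The "wrap as retriable but keep the driver error reachable via errors.As" pattern from the F8 fix can be sketched as follows. The `PgError` and `RetriableError` types here are hypothetical stand-ins written for this example (they are not sluice's `ir.RetriableError` or a real driver type); only the standard library `errors.As` / `Unwrap` behavior is relied on.

```go
package main

import (
	"errors"
	"fmt"
)

// PgError is a hypothetical stand-in for a driver error carrying a SQLSTATE.
type PgError struct {
	Code    string
	Message string
}

func (e *PgError) Error() string { return e.Code + ": " + e.Message }

// RetriableError is a hypothetical wrapper marking an error as worth retrying.
// Unwrap keeps the original error reachable through errors.As / errors.Is.
type RetriableError struct {
	Remedy string
	Err    error
}

func (e *RetriableError) Error() string {
	return fmt.Sprintf("retriable (%s): %v", e.Remedy, e.Err)
}

func (e *RetriableError) Unwrap() error { return e.Err }

// classify wraps the two schema-drift SQLSTATEs named in the release notes
// (42703 undefined_column, 42P01 undefined_table) and passes others through.
func classify(err error) error {
	var pg *PgError
	if errors.As(err, &pg) {
		switch pg.Code {
		case "42703", "42P01":
			return &RetriableError{Remedy: "add the missing column/table on the target", Err: err}
		}
	}
	return err
}

// isRetriable reports whether err (or anything it wraps) is a RetriableError.
func isRetriable(err error) bool {
	var r *RetriableError
	return errors.As(err, &r)
}

// underlyingCode recovers the SQLSTATE through the wrapper, or "" if absent.
func underlyingCode(err error) string {
	var pg *PgError
	if errors.As(err, &pg) {
		return pg.Code
	}
	return ""
}

func main() {
	err := classify(&PgError{Code: "42703", Message: `column "nickname" does not exist`})
	fmt.Println(isRetriable(err), underlyingCode(err))
	fmt.Println(err)
}
```

Because the wrapper implements `Unwrap`, a retry loop can branch on retriability while log lines still name the offending column from the original error.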
Compatibility
- Drop-in on disk, but a deliberate runtime behavior change on upgrade; read this. With the new default --schema-changes=forward, a continuous sync that previously refused loudly on a source DDL change now forwards it by default, including DROP COLUMN, which auto-applies (drops the column) on the target. If your operational model depended on sluice halting on source DDL so you could coordinate the change by hand, set --schema-changes=refuse to restore the exact pre-v0.99.45 conservative behavior. The behavior change applies only to the shapes that actually forward (ADR-0091 §1d's ✅ rows); refused shapes (RENAME, multi-shape, volatile DEFAULT) still refuse loudly as before.
- --forward-schema-add-column is deprecated (still works, warns). No action required unless you want to silence the warning, in which case replace it with the default --schema-changes=forward.
- No new required flags; no on-disk/format changes. Existing backups restore unchanged. migrate, the cold-start bulk-copy path, and cross-engine value translation are untouched. The F8/F9 retriability fixes need no configuration.
Who needs this — action required
- Anyone running live sluice sync (CDC) who does routine schema evolution on the source: review the new default before upgrading. After upgrade, source column add/drop/type changes (and MySQL nullability changes) forward to the target automatically. Action: decide whether you want that (the new default forward, recommended for uptime) or the old halt-on-DDL behavior (--schema-changes=refuse). If you relied on the stream stopping so you could apply DDL manually, you must set --schema-changes=refuse or your sync will now forward, including target column drops. No data is silently lost either way (refused/unforwardable shapes fail loud), but the forwarding is real and applies on upgrade.
- PlanetScale / Vitess sync users: upgrade recommended (F9). A DDL cutover or the (default-off on PlanetScale) schema historian no longer kills the stream on the transient unknown table … in schema window; it rides out in-process. No action beyond upgrading.
- Anyone who hit the schema-drift restart crash-loop (F8): upgrade fixes it; no re-verification needed. A missing target column/table no longer tight-restart-loops the process; the sync self-heals once you add the column/table on the target. This fix changes failure handling only (it does not affect already-applied data), so no count re-verification is required.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.45 · Container: ghcr.io/sluicesync/sluice:0.99.45
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md
v0.99.44
sluice v0.99.44
Reliable-by-default CDC throughput + VStream throttle resilience. Three changes from the first real PlanetScale long-haul soak: (1) sluice sync now adapts its apply batch size by default — >10× out-of-box CDC throughput — with a guard that keeps keyless tables safe; (2) a transient source-side Vitess throttle (or failover) is now ridden out in-process instead of crash-looping; (3) self-hosted --source-driver=vitess can finally warm-resume. Drop-in from v0.99.43 — no format or breaking changes; migrate, Postgres paths, and cross-engine behavior are untouched.
Changed
- sluice sync --apply-batch-size now defaults to auto (ADR-0089): >10× CDC apply throughput out of the box. The ADR-0052 AIMD batch-size controller (adapts to a p95-latency target, floor 1) has shipped since v0.72.0, but the conservative default --apply-batch-size=1 made its cap equal its floor, leaving it dormant for every default user: sluice shipped an adaptive throughput controller that did nothing unless you knew to pass a flag. The first real PlanetScale soak measured the cost: single-row apply drained a backlog at ~240 rows/s vs ~6,500 at auto (>10×), and the slow drain badly compounded throttle recovery. The default is now auto (engine ceiling 1000 mysql/postgres, 100 planetscale; the controller adapts within [1, ceiling] and backs off under pressure). Safety guard: a table with no PRIMARY KEY and no usable unique index (a non-idempotent plain INSERT on replay; Bug 125 class 3) is never batched. Each such change commits alone, so a crash-replay's duplicate blast radius stays at exactly 1 (identical to --apply-batch-size=1), while PK/unique tables batch and adapt; a one-time WARN names any such table. Restore the old behavior with --apply-batch-size=1 (conservative single-row) or --no-auto-tune (static cap).
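To make the "cap equal to floor leaves the controller dormant" point concrete, here is a minimal AIMD (additive-increase, multiplicative-decrease) sketch. The struct, the +10 increase step, the halving factor, and the millisecond inputs are all assumptions for illustration; the release notes only state that the real controller adapts to a p95-latency target within [1, ceiling].

```go
package main

import "fmt"

// aimd is a hypothetical batch-size controller: it grows the batch
// additively while latency is under target and halves it when over.
type aimd struct {
	size, floor, ceiling int
	targetMillis         int
}

// observe feeds one measured p95 apply latency (in milliseconds) and
// returns the batch size to use next, clamped to [floor, ceiling].
func (c *aimd) observe(p95Millis int) int {
	if p95Millis > c.targetMillis {
		c.size /= 2 // multiplicative decrease under pressure
	} else {
		c.size += 10 // additive increase (step size is an assumption)
	}
	if c.size < c.floor {
		c.size = c.floor
	}
	if c.size > c.ceiling {
		c.size = c.ceiling
	}
	return c.size
}

func main() {
	// ceiling > floor: the controller has room to adapt.
	auto := &aimd{size: 1, floor: 1, ceiling: 100, targetMillis: 50}
	fmt.Println(auto.observe(10), auto.observe(10), auto.observe(80))

	// ceiling == floor == 1: every observation clamps back to 1,
	// so the controller can never change the batch size.
	dormant := &aimd{size: 1, floor: 1, ceiling: 1, targetMillis: 50}
	fmt.Println(dormant.observe(10), dormant.observe(10))
}
```

The second controller mirrors the old default: with the cap pinned at 1, fast latencies cannot raise the batch size, which is why changing the default ceiling matters more than the controller logic itself.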
Fixed
- VStream throttle/large-transaction stalls no longer crash-loop a continuous sync (Bug 141 / ADR-0090). When a Vitess source's tablet throttler engages (e.g. a co-tenant migration, or a heavy write burst lagging the replica), vtgate withholds change events and, near its own 10-minute tolerance, even heartbeats, so sluice's liveness/progress watchdog fired and misdiagnosed the transient throttle as a failover hang. The error was terminal, so the sync process exited; under a supervisor (systemd, k8s) it restarted, warm-resumed to the same throttled position, re-stalled, and exited again: a tight, non-converging crash-loop. The watchdog timeouts are now ir.RetriableError, so the existing ADR-0038 exponential-backoff retry reconnects from the last position in-process and rides out the throttle until it clears (the correct recovery for a real failover too). A genuinely non-healing wedge (primary-only cluster with no serving replica) still fails loud after the bounded retry budget, just not in a tight loop. This was the headline finding of the soak, reproduced and root-caused on a self-hosted Vitess-24 cluster.
- Self-hosted --source-driver=vitess can now warm-resume (Bug 142). The vitess flavor's engine name ("vitess") wasn't in the position-decode accept set, so a resumed CDC position stamped Engine="vitess" was rejected: every restart of a self-hosted Vitess continuous sync crash-looped with wrong engine "vitess" and could never warm-resume. Unconditional, not throttle-gated. PlanetScale was unaffected (flavor name "planetscale"). The decoder now accepts "vitess".
Compatibility
- Drop-in from v0.99.43: no on-disk/format or breaking changes. Existing backups restore unchanged; migrate, Postgres sources/targets, and cross-engine translation are untouched.
- Behavior change (intended): sluice sync now batches CDC applies by default (--apply-batch-size=auto). For nearly all OLTP workloads this is a large throughput win at no correctness cost (ADR-0010 idempotency; keyless tables auto-clamp to single-row). Pin the old behavior with --apply-batch-size=1 or --no-auto-tune.
- No new required flags. The watchdog and warm-resume fixes need no configuration.
Who needs this
- Anyone running sluice sync (CDC), especially against Vitess/PlanetScale. You get an order-of-magnitude faster steady-state and backlog catch-up out of the box, and a transient source throttle now self-recovers instead of wedging the sync in a restart loop.
- Self-hosted Vitess (--source-driver=vitess) users: upgrade required for continuous sync. Before this release a vitess-flavor sync could not survive a restart at all.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.44 · Container: ghcr.io/sluicesync/sluice:0.99.44
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md
v0.99.43
sluice v0.99.43
Two MySQL/Vitess improvements: (1) backup full now reads tables in parallel on MySQL — ~2.6× faster dump, ~2.3× faster restore (ADR-0088); (2) a throttled or idle Vitess VStream is now surfaced loudly instead of stalling silently (observability). Drop-in from v0.99.42 — no format, flag, or breaking changes; migrate, the Postgres paths, and cross-engine behavior are untouched.
Added
- MySQL backup full: coordinated parallel backup snapshot (ADR-0088), with dump 2.63× / restore 2.26× faster. sluice's MySQL backup table sweep was serial (a single START TRANSACTION WITH CONSISTENT SNAPSHOT connection) because MySQL, unlike Postgres, has no shareable exported snapshot to lazily import onto N parallel readers. It now opens N reader transactions whose consistent snapshots coincide under a brief FLUSH TABLES WITH READ LOCK window (the same mechanism mydumper uses), so --table-parallelism > 1 (default auto = 4) overlaps cross-table reads on a vanilla MySQL source. Measured on a 16.25 GB / 33 M-row corpus: dump 184 s → 70 s (2.63×), restore 404 s → 179 s (2.26×), artifact byte-unchanged. Cross-table consistency and the anchored EndPosition are preserved: the N readers' snapshots are identical by construction, so a backup taken under concurrent writes is point-in-time consistent across every table. Falls back, loudly, to the serial single-reader path when the source role lacks the RELOAD privilege (most managed tiers), so it never silently produces an inconsistent parallel read. PlanetScale/Vitess sources are unaffected (they keep the VStream-COPY backup path). See docs/comparison-backup.md for the full fair-fight vs mysqldump/mydumper.
- Vitess/MySQL CDC: a throttled-or-idle VStream is surfaced, not silently stalled (observability; roadmap item 19(a)). When a Vitess source's tablet throttler engages mid-stream, vtgate withholds row/change events but keeps sending ~5 s heartbeats and strips the tablet's in-band VEvent.throttled flag, so the stream correctly stays alive (the progress watchdog re-arms on heartbeats) but the stall was silent: unbounded replication lag with zero diagnostic. Three observability-only changes (the resilient streaming behavior is unchanged; sluice still waits and catches up when the throttle clears):
  - the at-stream-open liveness error now names the source throttler as a candidate cause alongside the primary-only topology wedge (so operators aren't sent down the wrong path);
  - a new soft idle-WARN fires once per quiet spell ("alive (heartbeats flowing) but no change events for N s — the source may be throttled or genuinely idle; check SHOW VITESS_THROTTLED_APPS on the primary"), cleared by the next real change event, default 30 s, tunable per-DSN via vstream_idle_warn_timeout (0 disables the WARN only; the hard liveness guards are unaffected);
  - docs/vitess-vstream-troubleshooting.md documents the mechanism, detection, and the real-world triggers (the prime one being a co-tenant Vitess migration, OnlineDDL/MoveTables/Reshard, on the same keyspace, which moves the shared lag metric and throttles unrelated streams; not your own write rate).
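The "fires once per quiet spell, cleared by the next real change event" behavior of the idle WARN can be modeled as a small state machine. This is a sketch under assumptions: the `idleWatch` type, its method names, and the use of plain integer seconds are invented for the example and are not sluice's implementation.

```go
package main

import "fmt"

// idleWatch is a hypothetical tracker for the soft idle WARN: it warns
// once when heartbeats keep arriving but no change event has been seen
// for timeoutSec, and re-arms only after a real change event.
type idleWatch struct {
	timeoutSec    int64 // 0 disables the WARN (mirrors vstream_idle_warn_timeout=0)
	lastChangeSec int64
	warned        bool
}

// onHeartbeat reports whether a WARN should be emitted at time nowSec.
func (w *idleWatch) onHeartbeat(nowSec int64) bool {
	if w.timeoutSec <= 0 || w.warned {
		return false
	}
	if nowSec-w.lastChangeSec >= w.timeoutSec {
		w.warned = true // at most one WARN per quiet spell
		return true
	}
	return false
}

// onChange records a real change event and re-arms the WARN.
func (w *idleWatch) onChange(nowSec int64) {
	w.lastChangeSec = nowSec
	w.warned = false
}

func main() {
	w := &idleWatch{timeoutSec: 30}
	// Heartbeats every 5 s with no change events.
	for t := int64(5); t <= 40; t += 5 {
		if w.onHeartbeat(t) {
			fmt.Printf("t=%ds WARN: alive but no change events for %ds\n", t, t-w.lastChangeSec)
		}
	}
	w.onChange(42) // a real change event clears the quiet spell
	fmt.Println("re-armed:", !w.warned)
}
```

Keeping the WARN edge-triggered rather than level-triggered avoids one log line per heartbeat during a long throttle.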
Compatibility
- Drop-in from v0.99.42: no format or breaking changes. On a vanilla MySQL backup full, --table-parallelism > 1 (default auto = 4) now engages the ADR-0088 coordinated FTWRL path instead of sweeping serially; the artifact is byte-equivalent and the recorded position is unchanged. A source role without RELOAD transparently keeps the prior serial behavior (loud INFO).
- New opt-in DSN param: vstream_idle_warn_timeout (default 30s, 0 disables the idle WARN only). No existing DSN needs changing.
- No on-disk/chunk-format change; existing backups restore unchanged. migrate, Postgres sources/targets, and cross-engine translation are untouched.
Who needs this
- Anyone backing up MySQL with sluice: the parallel sweep is automatic at the default --table-parallelism on a vanilla MySQL source with RELOAD; just upgrade. (No RELOAD? You keep today's serial behavior, loudly logged.)
- Anyone running sluice sync against a Vitess/PlanetScale source: you now get a loud, actionable WARN when the source throttler pauses your stream, instead of a silent lag climb. If you watch for it, the docs/vitess-vstream-troubleshooting.md guide tells you the likely cause (usually a co-tenant migration) and that no data is lost: the stream converges once the throttle clears.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.43 · Container: ghcr.io/sluicesync/sluice:0.99.43
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md
v0.99.42
sluice v0.99.42
One MySQL-CDC correctness fix: a source-side TRUNCATE TABLE carrying a leading SQL comment was silently not propagated to the target (Bug 140). MySQL keeps a statement's leading comment in the binlog, and the reader's truncate detection required the statement to start with TRUNCATE, so a commented truncate (a hand-written migration note, an APM/ORM query tag) was dropped. On a live MySQL → {MySQL, Postgres} sync the target silently kept the rows the source truncated, with no error and no lag signal. Postgres sources were never affected. Drop-in from v0.99.41: no flag, default, format, or invocation changes; no re-verification of any prior migration or backup needed.
Fixed
- MySQL CDC now applies a TRUNCATE TABLE even when the statement carries a leading SQL comment (Bug 140). MySQL preserves a statement's leading comment verbatim in the binlog QUERY_EVENT (only the trailing delimiter is stripped). The reader's parseTruncateTable required the body to start with TRUNCATE, so a commented truncate (-- clear staging\nTRUNCATE TABLE t, or a query tag like /* trace=abc */ TRUNCATE …, as produced by sqlcommenter / ORM query-log tags / hand-written migrations) returned "not a truncate", fell through to generic DDL handling (schema-cache invalidation only), and never emitted a typed truncate event. On a live MySQL-source continuous sync the target therefore silently retained every row the source truncated, the stream never converged, and nothing surfaced an error or lag: a HIGH silent-divergence class on a routine operation. The reader now strips leading SQL comments (--, #, /* */) and whitespace before recognising the command; executable comments (/*! version-gated */, /*+ optimizer hint */) are deliberately left in place, because stripping them could discard conditionally-executed SQL, and a statement led by one simply falls through to generic DDL handling exactly as before (no typed event, but no incorrectness). Trailing comments remain out of scope (they fail the table-name parse into a loud apply-side error, not silent loss). Found by the deep sync-convergence property hunt (its harness renders every transaction with a -- tx N comment, which is exactly the real-world trigger); confirmed via instrumented binlog-event replay (the reader logged the truncate QUERY_EVENT with the comment attached and "parse = false"), and a minimal no-comment repro propagated cleanly, isolating the leading comment as the cause. Affected releases: every release with MySQL trigger-less binlog CDC, but only for truncates whose statement text carries a leading comment; a bare TRUNCATE TABLE t was always propagated correctly.
  Pinned by unit comment-variant cases (line / hash / block / stacked / CRLF / -- without a following space (a non-comment) / executable-comment pass-through) and a new MySQL-source integration test (TestBug140_MySQLToMySQL_CommentedTruncatePropagates; the only prior truncate integration test was Postgres-only).
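The comment-stripping rule described for Bug 140 could look roughly like the sketch below. It is not sluice's parseTruncateTable: the function names are hypothetical, and the `--` handling assumes MySQL's rule that `--` starts a comment only when followed by whitespace (or end of input).

```go
package main

import (
	"fmt"
	"strings"
)

// isDashComment reports whether s starts a "-- " line comment.
// Assumption: "--" counts only when followed by whitespace or end of input.
func isDashComment(s string) bool {
	if !strings.HasPrefix(s, "--") {
		return false
	}
	return len(s) == 2 || strings.ContainsRune(" \t\r\n", rune(s[2]))
}

// stripLeadingComments removes leading whitespace and plain SQL comments
// (-- , #, /* */). Executable comments (/*! ... */ and /*+ ... */) are
// left in place, since they can carry conditionally-executed SQL.
func stripLeadingComments(s string) string {
	for {
		s = strings.TrimLeft(s, " \t\r\n")
		switch {
		case strings.HasPrefix(s, "#"), isDashComment(s):
			i := strings.IndexByte(s, '\n')
			if i < 0 {
				return "" // the whole remainder is a line comment
			}
			s = s[i+1:]
		case strings.HasPrefix(s, "/*") &&
			!strings.HasPrefix(s, "/*!") &&
			!strings.HasPrefix(s, "/*+"):
			i := strings.Index(s[2:], "*/")
			if i < 0 {
				return s // unterminated block comment: leave untouched
			}
			s = s[2+i+2:]
		default:
			return s
		}
	}
}

// isTruncate reports whether the first keyword after leading comments
// is TRUNCATE (case-insensitive). A real parser would validate the rest.
func isTruncate(stmt string) bool {
	fields := strings.Fields(stripLeadingComments(stmt))
	return len(fields) > 0 && strings.EqualFold(fields[0], "TRUNCATE")
}

func main() {
	for _, stmt := range []string{
		"TRUNCATE TABLE t",
		"-- clear staging\nTRUNCATE TABLE t",
		"/* trace=abc */ TRUNCATE TABLE t",
		"/*! 50100 TRUNCATE TABLE t */",
		"--x\nTRUNCATE TABLE t",
	} {
		fmt.Printf("%q -> %v\n", stmt, isTruncate(stmt))
	}
}
```

The last two inputs are intentionally not recognised: the first is an executable comment, and the second starts with `--x`, which this sketch does not treat as a comment.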
Compatibility
- No breaking changes. Drop-in from v0.99.41: no flag, default, on-disk format, or invocation changes. migrate, sync, backup, and the Postgres CDC path are entirely unchanged. The only behavioral change is that a MySQL-source TRUNCATE with a leading comment now reaches the target (as a bare TRUNCATE always did).
- Postgres sources unaffected. pgoutput emits a typed truncate message; sluice never parsed a query string there, so PG → {PG, MySQL} truncate propagation was always correct.
- No re-verification required. This fix changes what a future commented truncate does; it does not alter any data already migrated. A target that diverged under the bug is corrected the moment the truncate is re-applied (or by re-running the sync), but nothing about prior, non-truncate data was wrong.
Who needs this — action required
- Anyone running a live MySQL-source continuous sync (sluice sync, MySQL → MySQL or MySQL → Postgres) whose source issues TRUNCATE TABLE statements that may carry a leading comment, i.e. truncates run through tools that prepend query tags (APM/sqlcommenter, ORM query-log tags) or hand-written migration scripts with comments. Before this release such a truncate left phantom rows on the target. Upgrade; the next commented truncate propagates. If you suspect a past commented truncate diverged a target, re-run the affected table's sync (or re-issue the truncate) to reconcile.
- Everyone else: a routine upgrade. A bare TRUNCATE TABLE t (no comment) was always handled correctly, and Postgres sources were never affected.
Also in this release (internal / test-only)
- The random-op sync-convergence property now also covers the cross-engine directions (PG↔MySQL) with a value-semantic canonical compare, hardening the cross-engine CDC apply path against silent divergence. Building that out fixed two harness-side cross-engine canonicalisation edges. Test-only — no runtime change.
- A flaky AIMD-controller integration test was stabilised (dynamic metrics port). Test-only.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.42 · Container: ghcr.io/sluicesync/sluice:0.99.42
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md
v0.99.41
One backup-DR fix — backup compact no longer refuses an ordinary rotated backup stream chain when a rotation-born segment never received a rollover commit in its creating session (the "rotate on a timer, stop while idle" workflow, or a crash/end at a rotation boundary). The bug always failed LOUDLY with zero data loss — it just made the compact DR-maintenance feature unusable across that one boundary, blaming "a pre-ADR-0067, imported, or corrupted lineage" for a chain its own rotation produced. Drop-in from v0.99.40 — no flag, default, or invocation changes; no re-verification of any prior backup needed.
Fixed
- backup compact (naive AND --smart-compaction) now splits a merge group at a rotation-boundary coverage gap instead of refusing the whole run (Bug 139, ADR-0087). A rotation-born segment whose creating session never committed an incremental carries no incremental_coverage_start stamp, so it resolves to its full's snapshot anchor S, which lands a few WAL bytes past the prior segment's end_position (P_N). That delta IS the gap. Pre-fix, compact read it as a position gap and refused the ENTIRE run with a message blaming "a pre-ADR-0067, imported, or corrupted lineage", and a later resume never healed it, so the boundary stayed permanently un-compactable. Compact now subdivides the merge group at every coverage-gap boundary (subdivideAtCoverageGaps): the stamp-less segment stays its own group (one operator-accurate WARN naming the boundary; no data lost, chain fully restorable) while every contiguous run around it still merges; naive, --smart-compaction, and --dry-run all produce the same subdivided plan. Separately, the next backup stream / backup incremental resume that lands on a rotation-born zero-incremental open segment now replays from the prior segment's end_position (P_N), so its first incremental honestly stamps incremental_coverage_start = P_N and the boundary heals: the chain becomes fully compactable (N→1). Neither half ever stamps coverage that no committed incremental proves (creation-time stamping and resume-backfill were deliberately REJECTED: a crash or walsender lag can leave real events in (P_N, S] that live only in the new segment's full, which compact drops; a fabricated stamp would convert today's loud refusal into silent DR loss). Affected releases: v0.88.0 through v0.99.40. The strict contiguous-rotation handoff that produces the S > P_N boundary shipped with ADR-0067 (Bug 95) at v0.88.0; before that, rotated chains were refused by design, not by this false-positive. The byte-identical refusal was reproducible on v0.99.39/v0.99.40 and with --smart-compaction off.
  Pinned by the compact-split unit matrix (trailing OPEN/CAPPED, mid-chain stamp-less, multi-gap, contiguous-still-merges-4-to-1, dry-run), the resume-rule unit test (resumes at P_N; first incremental stamps P_N; negative cases keep prior behavior), and a PG idle-stop integration repro (naive + smart split + restore == oracle; resume-heal → whole-chain N→1 + restore == oracle).
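A simplified model of splitting a merge group at coverage gaps is sketched below. Everything here is hypothetical: the `segment` fields, the integer positions, and the grouping rule (a stamp-less segment is isolated; stamped neighbours merge only when one starts exactly where the previous ended) are assumptions drawn from the prose above, not the real subdivideAtCoverageGaps.

```go
package main

import "fmt"

// segment is a hypothetical, simplified backup-chain segment.
type segment struct {
	Name          string
	CoverageStart uint64 // resolved coverage start position
	End           uint64 // end_position of the segment
	HasStamp      bool   // false models a missing incremental_coverage_start
}

// subdivideAtGaps splits a chain into merge groups. A segment starts a
// new group when it is stamp-less, follows a stamp-less segment, or does
// not begin exactly where the previous segment ended.
func subdivideAtGaps(chain []segment) [][]segment {
	var groups [][]segment
	for i, seg := range chain {
		startNew := i == 0 ||
			!seg.HasStamp ||
			!chain[i-1].HasStamp ||
			seg.CoverageStart != chain[i-1].End
		if startNew {
			groups = append(groups, []segment{seg})
			continue
		}
		last := len(groups) - 1
		groups[last] = append(groups[last], seg)
	}
	return groups
}

// groupSizes returns the number of segments in each group.
func groupSizes(groups [][]segment) []int {
	sizes := make([]int, len(groups))
	for i, g := range groups {
		sizes[i] = len(g)
	}
	return sizes
}

func main() {
	// seg3 is rotation-born and stamp-less: it resolves to S=205,
	// a few positions past seg2's end (P_N=200). For simplicity the
	// first segment is treated as stamped.
	chain := []segment{
		{"seg1", 0, 100, true},
		{"seg2", 100, 200, true},
		{"seg3", 205, 300, false},
		{"seg4", 300, 400, true},
		{"seg5", 400, 500, true},
	}
	fmt.Println(groupSizes(subdivideAtGaps(chain)))
}
```

In this model the contiguous runs on either side of the stamp-less segment still merge, while the boundary segment stays alone instead of failing the whole plan.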
Compatibility
- No breaking changes. Drop-in from v0.99.40: no flag, default, or invocation changes. migrate, sync, backup full, and CDC behavior are entirely unchanged. The only behavioral change is that backup compact succeeds (with a split + WARN) on a chain it previously refused, and that a rotation-boundary resume now replays from P_N so the first committed incremental stamps the coverage honestly and the boundary becomes compactable.
- No chunk-format or on-disk change. Existing chains are not rewritten. A chain that today straddles the rotation boundary will compact with a split (the stamp-less segment stays its own group); it heals to fully compactable on its next resumed write, with no manual intervention.
- Both engines, both compaction strategies. The fix covers naive compaction and --smart-compaction, on PG and MySQL chains; the resume rule was verified strictly-after on both engines (a fresh StreamChanges(P_N) with no skip-through, sound on WAL retention + ADR-0010 idempotent overlap replay).
Who needs this — action required
- Nobody needs to re-verify or re-run any prior backup. This was a LOUD refusal with zero data loss — no backup ever completed with wrong or missing data because of it. Every artifact compact previously refused to merge is intact and independently restorable.
- Anyone whose backup compact failed across a rotation boundary with "a pre-ADR-0067, imported, or corrupted lineage" on a chain their own backup stream produced (rotate-on-timer + idle stop, or a crash/end at a rotation boundary): upgrade and re-run compact. It now succeeds: the boundary segment is split into its own group (one WARN) and everything around it merges. No chain rewrite is required.
- Operators running rotated backup stream chains long-term: once on v0.99.41, existing straddling chains compact with a split immediately, and heal to fully compactable (no split, no WARN) on their next resumed write. No flag to set, no migration to run; just upgrade.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.41 · Container: ghcr.io/sluicesync/sluice:0.99.41
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md
v0.99.40
Two backup-correctness fixes — a single float NaN/±Infinity row no longer makes a database un-backupable (backup full refused the whole table loudly with json: unsupported value: NaN), and backup compact --strategy=smart no longer leaks one open store handle per compacted chunk (silent FD lingering on Linux, fatal "Access is denied" on Windows). Both bugs are pre-existing and both always failed LOUDLY — there is no silent loss anywhere in this range. Drop-in from v0.99.39 — no flag, default, or invocation changes.
Fixed
- backup full no longer refuses tables containing float NaN/±Infinity (Bug 138). PG float4/float8 columns legally hold IEEE specials and migrate carries them exactly, but the chunk codec rendered floats as JSON numbers, which cannot represent them, so one NaN row made the whole database un-backupable (loud refusal: json: unsupported value: NaN; numeric-typed NaN always backed up fine, only the float path refused). Non-finite floats now ride a new additive tagged envelope ({"_t":"f64s","v":"NaN"|"+Inf"|"-Inf"}) on BOTH codec paths: the fast path emits it byte-identically to the legacy marshal, and the fast decoder accepts exactly the three canonical sentinels, bailing to the legacy path (the loud error oracle) on anything else. The same envelope covers the CDC change-chunk path (live during sync streaming and backup incremental). Restores are float8send-BIT-IDENTICAL to a pg_dump round trip: ±Inf sign-exact, every NaN canonicalized to the IEEE quiet NaN (0x7ff8…0) exactly as PG's own text format does (NaN payload bits are not representable in either format). Affected releases: every release with backup, v0.15.0 through v0.99.39 (the float-as-JSON-number passthrough dates to the original Phase-1 chunk codec). Pinned by a full family × shape matrix ({NaN, +Inf, −Inf, −0, finite} × {f64, f32} × {scalar, list, map} with bit-level assertions through the real writer/reader), a strict-decode ladder (alien payloads fail loudly), the same matrix on the change-chunk path, a real-PG integration round trip asserting float8send bits at the target, and the differential sweep/fuzzers now comparing NaN bit patterns.
- backup compact --strategy=smart leaked one open store handle per compacted change chunk (task #9). The decode pass wrapped its byte-counting reader in io.NopCloser, so the handle opened by store.Get was released only on the constructor-error path. On Linux the leaked descriptor lingered silently until process exit (which is why CI never saw it); on Windows it was fatal: the rewrite step renames over the very path the leaked handle still holds open, failing loudly with "Access is denied". The counting reader now owns the store handle (its Close closes through), so the chunk reader releases it on every path. Affected releases: v0.85.0 (where smart compaction shipped) through v0.99.39. Pinned by a handle-tracking store wrapper (revert-verified: the old code leaks exactly one handle per chunk) and ground-truthed on the real Windows repro: the previously-deterministic smart-compaction integration failure now passes repeatedly.
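The handle leak above is an instance of a general Go pitfall: io.NopCloser discards Close, so anything wrapped in it is never released through the wrapper. The sketch below contrasts the two wrappings. It assumes hypothetical countingReader and trackedHandle types; sluice's real store and reader types are not shown in these notes, and only io.NopCloser and strings.NewReader are actual standard-library calls.

```go
package main

import (
	"io"
	"strings"
)

// countingReader counts bytes read and owns the underlying handle:
// its Close closes through to the wrapped io.ReadCloser.
type countingReader struct {
	rc io.ReadCloser
	n  int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.rc.Read(p)
	c.n += int64(n)
	return n, err
}

func (c *countingReader) Close() error { return c.rc.Close() }

// trackedHandle is a hypothetical store handle that records Close calls,
// in the spirit of the handle-tracking wrapper mentioned above.
type trackedHandle struct {
	io.Reader
	closed bool
}

func (h *trackedHandle) Close() error { h.closed = true; return nil }

func newTrackedHandle(body string) *trackedHandle {
	return &trackedHandle{Reader: strings.NewReader(body)}
}

// leakyWrap reproduces the bug class: NopCloser's Close is a no-op,
// so closing the returned reader never closes h.
func leakyWrap(h io.ReadCloser) io.ReadCloser {
	return io.NopCloser(&countingReader{rc: h})
}

// owningWrap returns the counting reader itself, so Close reaches h.
func owningWrap(h io.ReadCloser) io.ReadCloser {
	return &countingReader{rc: h}
}
```

Closing the result of leakyWrap leaves the tracked handle open, while closing the result of owningWrap releases it, which is the property a handle-tracking test can assert.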
Compatibility
- No breaking changes. Drop-in from v0.99.39: no flag, default, or invocation changes. migrate, sync, and CDC behavior are unchanged apart from change chunks now being able to carry non-finite floats.
- Chunk format: additive tag only. Chunks WITHOUT non-finite floats are byte-unchanged and fully interoperable with older binaries in both directions. A chunk that DOES contain a non-finite float is refused LOUDLY ("unknown value tag") by v0.99.39-and-older binaries, never read silently wrong; this is the format's designed additive-tag forward-compatibility mechanism.
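To make the tagged envelope concrete, here is a minimal Go sketch of an encoder and a strict decoder for the {"_t":"f64s","v":...} shape quoted in these notes. The function and type names (encodeFloat, decodeFloat, floatEnvelope) are hypothetical, the error strings are illustrative, and the sketch uses encoding/json rather than sluice's fast path; bit-level NaN canonicalization is out of scope here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// floatEnvelope mirrors the envelope shape quoted in the release notes.
type floatEnvelope struct {
	T string `json:"_t"`
	V string `json:"v"`
}

// encodeFloat emits finite floats as plain JSON numbers and the three
// non-finite values as a tagged envelope, since JSON numbers cannot
// represent NaN or infinities.
func encodeFloat(f float64) ([]byte, error) {
	switch {
	case math.IsNaN(f):
		return json.Marshal(floatEnvelope{T: "f64s", V: "NaN"})
	case math.IsInf(f, 1):
		return json.Marshal(floatEnvelope{T: "f64s", V: "+Inf"})
	case math.IsInf(f, -1):
		return json.Marshal(floatEnvelope{T: "f64s", V: "-Inf"})
	}
	return json.Marshal(f)
}

// decodeFloat accepts a JSON number or exactly the three canonical
// sentinels; any other tag or payload is an error, never a silent guess.
func decodeFloat(b []byte) (float64, error) {
	var f float64
	if err := json.Unmarshal(b, &f); err == nil {
		return f, nil
	}
	var env floatEnvelope
	if err := json.Unmarshal(b, &env); err != nil {
		return 0, err
	}
	if env.T != "f64s" {
		return 0, fmt.Errorf("unknown value tag %q", env.T)
	}
	switch env.V {
	case "NaN":
		return math.NaN(), nil
	case "+Inf":
		return math.Inf(1), nil
	case "-Inf":
		return math.Inf(-1), nil
	}
	return 0, fmt.Errorf("non-canonical f64s payload %q", env.V)
}
```

The design point this illustrates is the additive tag: finite values keep their existing wire form, so only chunks that actually contain a non-finite float look different to an older reader.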
Who needs this — action required
- Anyone whose backup full failed with json: unsupported value: NaN should upgrade and re-run the backup. The old behavior was a loud refusal before any artifact was finalized; no prior backup silently dropped or mangled the values.
- Anyone running backup compact --strategy=smart on Windows: this previously failed deterministically with "Access is denied"; upgrade and re-run. Linux users get the FD hygiene fix for free (the leak never corrupted output there; handles just lingered until exit).
- Mixed-version fleets: backups taken by v0.99.40 from data containing float NaN/±Inf cannot be restored by older binaries (loud "unknown value tag" refusal by design). Restore such backups with v0.99.40+.
- Nobody needs to re-verify prior migrations or backups. Both bugs were loud failures — neither could complete with wrong or missing data.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.40 · Container: ghcr.io/sluicesync/sluice:0.99.40
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md
v0.99.39
Performance — backup and restore both run −51% end-to-end on the 136 GB / 431M-row bench corpus (backup full 881 s → 435 s, restore 2810 s → 1390 s), shrinking the gap to the pg_dump/pg_restore -j8 specialists from ~3.1–3.2× to 1.83× / 1.51×; the O(N²) manifest-rewrite wall (~78 hours of pure checkpoint I/O at 100k tables) is gone; and PG schema reads no longer die with SQLSTATE 53100 on ≥50k-table catalogs under container-default 64 MB /dev/shm. Drop-in from v0.99.38 — no flag or default changes, and finalized backups keep the exact same on-disk format.
Performance
- Backup/restore per-row JSON codec rewritten as a direct buffer-append fast path (tasks #51/#52). Profiling the 136 GB bench corpus showed the reflection-based encoding/json round trip of the per-row map was 49% of backup full CPU and 69% of restore CPU. The chunk row encode/decode now runs on a specialized codec that emits and parses the SAME wire bytes (byte-identical output for every shape the fast path accepts, no chunk-format change, old and new binaries read each other's chunks) and is ~10× faster per row in both directions (encode 82 → 0 allocs/row, decode 189 → 27). Any value or line outside the canonical shapes falls back to the legacy path, which remains the semantic and error oracle; differential sweeps plus two fuzz targets pin the two paths equivalent on arbitrary input. Together with the O(1) checkpoints below, measured end-to-end on the same corpus: both legs −51%, zero-loss (docs/comparison-backup.md has the full methodology).
- Backup checkpoints are O(1) per event: the manifest is no longer rewritten per chunk/table (task #54, ADR-0086). Every per-chunk / per-table checkpoint used to re-marshal the ENTIRE manifest (embedded schema included) and re-Put manifest.json, making the row sweep quadratic in table count; the #38 scale probe measured ~78 hours of pure manifest rewriting at 100k tables. The in-progress manifest is now a base written once plus an append-only manifest.progress.jsonl sidecar (one attempt-ID-stamped JSON line per event); the manifest is marshaled exactly twice per run (base + final), and the finalized backup folds everything back into one self-contained manifest.json, so the on-disk layout of a successful backup is unchanged. Pinned by a 10k-table committer benchmark asserting checkpoint bytes are independent of table count. Stores without an append primitive (S3/GCS/Azure blob stores) keep the previous full-rewrite checkpoints and say so loudly on large corpora.
- PG→PG raw-copy single stream is ~4.9× faster (task #37, PR #196). The PG server emits one CopyData message per row on COPY TO STDOUT, and each row paid a synchronous unbuffered-pipe rendezvous plus a ~265-byte socket write to the target, which together accounted for 81.8% of single-stream CPU. A 64 KiB buffer ahead of the pipe coalesces the frames (byte-transparent, since the COPY stream has no per-Write framing): 72.6 s → 15.0 s on a 4M-row / 1040 MB single-stream run (14 → ~73 MB/s), checksum-verified zero-loss, pinned by the existing 13-test raw-copy + sync cold-start suites.
- MySQL LOAD DATA bulk writes get the same fix (PR #197). The TSV encoder issued one unbuffered pipe write per row (the identical per-row pipe-rendezvous class as the raw-copy lane) and is now buffered the same way (64 KiB, flushed before close on success; the error path still poisons the read so failures stay loud). Byte-transparent; the existing LOAD DATA zero-loss and warning-probe pins cover it.
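The write-coalescing idea behind the two buffering fixes above can be sketched in Go as follows. This is an illustration under stated assumptions, not sluice's code: countingWriter is a hypothetical stand-in for the pipe or socket that simply counts how many writes reach it, and the row helpers are invented for the example. bufio.NewWriterSize and io.Discard are the only library calls used.

```go
package main

import (
	"bufio"
	"io"
)

// countingWriter stands in for the pipe or socket: it counts how many
// Write calls actually reach the sink.
type countingWriter struct {
	sink   io.Writer
	writes int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.writes++
	return c.sink.Write(p)
}

func newCountingWriter() *countingWriter {
	return &countingWriter{sink: io.Discard}
}

// writeRowsUnbuffered issues one sink write per row, the per-row cost
// pattern described above.
func writeRowsUnbuffered(dst io.Writer, rows [][]byte) error {
	for _, row := range rows {
		if _, err := dst.Write(row); err != nil {
			return err
		}
	}
	return nil
}

// writeRowsBuffered places a 64 KiB buffer ahead of the sink so many small
// rows coalesce into a few large writes. The bytes delivered are identical;
// Flush must run before the sink is closed or the tail is lost.
func writeRowsBuffered(dst io.Writer, rows [][]byte) error {
	bw := bufio.NewWriterSize(dst, 64<<10)
	for _, row := range rows {
		if _, err := bw.Write(row); err != nil {
			return err
		}
	}
	return bw.Flush()
}
```

With 1000 rows of 265 bytes each, the unbuffered path makes 1000 sink writes while the buffered path makes only a handful, which is the effect the notes attribute the speedup to.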
Fixed
- PG schema reads no longer die with SQLSTATE 53100 on huge catalogs under small /dev/shm (task #55, found by the #38 scale probe). At ≥50k-table catalog sizes Postgres plans parallel hash joins for several of sluice's catalog metadata queries; the parallel workers allocate their shared hash tables as dynamic shared memory segments in /dev/shm, which is exhausted under the container-default 64 MB shm, failing with could not resize shared memory segment … No space left on device (SQLSTATE 53100). Every PG SchemaReader catalog query now runs in its own read-only transaction with SET LOCAL max_parallel_workers_per_gather = 0: serial plans build hash tables in process-local work_mem and cannot hit the wall, and parallelism buys nothing on catalog reads (validated on a real 50k-table / 150k-index rig: ~15 s either way with headroom, and the fixed binary succeeds with /dev/shm 100% full where the previous one failed). This was always a loud failure: exit non-zero, no data ever moved or lost. Affects all prior releases reading very large PG catalogs in shm-constrained containers; no --shm-size tuning needed anymore.
Compatibility
- No breaking changes. Drop-in from v0.99.38 — no flag, default, or invocation changes. Migrate, sync, and CDC paths are untouched except for the two byte-transparent throughput fixes.
- No chunk-format change. The fast codec emits byte-identical wire bytes; backups written by v0.99.39 are readable by older binaries and vice versa.
- Manifest format v3 applies ONLY to in-progress (sidecar-layout) backups. An OLDER binary asked to RESUME a backup that crashed under v0.99.39 refuses loudly ("upgrade sluice") instead of silently resuming off a base that under-reports progress; v0.99.39 resumes old-format in-progress backups unchanged (pinned both directions). Finalized backups keep the pre-existing format version and are fully readable by older binaries.
- Recorded schema_hash values change for schemas with columns that have no default (task #49). ComputeSchemaHash now normalizes a nil column default to the explicit DefaultNone the manifest decode materializes, so a reader-fresh schema and the same schema re-read from a manifest fingerprint identically. Harmless: the stored hash is write-only today (nothing compares against a previously stored value), and the resume drift guard always recomputes both sides.
Who needs this — action required
- Everyone using backup/restore: both legs are −51% on the bench corpus. Action: just upgrade; same commands, same artifacts.
- Anyone backing up many-table schemas (thousands and up): the per-chunk manifest rewrite was quadratic in table count and is the difference between feasible and ~78 hours of pure checkpoint I/O at 100k tables. Action: upgrade before attempting very-large-table-count backups; note blob stores without append (S3/GCS/Azure) keep the old checkpoint cost and warn loudly.
- Anyone reading ≥50k-table PG schemas from containers with default /dev/shm: previously died loudly with SQLSTATE 53100. Action: upgrade and drop any --shm-size workarounds. No data was ever affected; the old behavior was a loud exit, never silent loss.
- Mixed-version fleets: don't try to resume a v0.99.39 in-progress (crashed) backup with an older binary; it refuses loudly by design. Finalized backups interoperate in both directions. Everything else requires no action.
Install: brew install sluicesync/tap/sluice · go install sluicesync.dev/sluice/cmd/sluice@v0.99.39 · Container: ghcr.io/sluicesync/sluice:0.99.39
Full changelog: https://github.com/sluicesync/sluice/blob/main/CHANGELOG.md
v0.99.38
sluice v0.99.38
Crash-resumed anchored backups now keep a gap-free incremental chain (ADR-0085). Previously, resuming an interrupted backup full silently re-anchored the chain at a NEW snapshot position while keeping already-completed tables from the OLD one — writes to those kept tables in between landed in neither the full nor any incremental, with exit 0 everywhere. Found by code-reading during the ADR-0083 work and closed before any release carried a worse variant.
Fixed
- Silent-loss class closed: resume adopts the FIRST attempt's anchor (Bug-class: chain gap on crash-resume). The in-progress manifest now records the snapshot anchor from the moment it is first committed, and a resumed run records that position as the chain handoff: kept tables replay exactly, and tables re-streamed under the newer snapshot are healed by the first incremental's idempotent replay (the ADR-0010 applier convergence argument, now formally load-bearing for backup chains). Loud refusals guard the cases where that argument doesn't hold: a truly keyless table needing a re-stream, or schema DDL between the attempts (both name --force-overwrite as the escape hatch). Manifests from older binaries (no recorded anchor) fall back to re-streaming everything, with a WARN.
- --chain-slot crash recovery reversed: the resume now ADOPTS the surviving chain slot instead of telling you to drop it. The leaked slot after a hard crash is not debris; it is the only thing retaining the WAL that makes a sound resume possible. The already-exists refusal previously advised sluice slot drop + retry, which destroyed exactly that retention and funneled into the silent gap; the resume now verifies the slot can serve the original anchor (the chain preflight) and proceeds, and the refusal message names re-running the same command as the recovery.
- Incrementals and streams refuse to chain off an in-progress (crashed, unfinished) full. Previously this produced a confusing "from now" fallback; with anchors now recorded early it would have been silently wrong instead. The refusal says to finish or resume the full first.
- Test-infrastructure: the ADR-0046 crash-injection matrix no longer flakes on the post-crash walsender race (deterministic slot handoff between the crash and recovery runs).
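The anchor-adoption rule in the first item above can be sketched as a small planning function. Everything here is hypothetical: inProgress, resumePlan, and planResume are illustrative names, positions are plain strings, and the loud refusals for keyless tables and schema DDL are deliberately left out. It only shows the two outcomes the notes describe, adopting the first attempt's anchor or falling back to a full re-stream with a warning.

```go
package main

// inProgress is a hypothetical view of an in-progress manifest.
// Anchor is empty when the manifest was written by an older binary.
type inProgress struct {
	Anchor    string
	Completed map[string]bool
}

// resumePlan says which anchor the chain hands off from, which tables
// must be streamed again, and whether a warning should be logged.
type resumePlan struct {
	Anchor   string
	Restream []string
	Warn     bool
}

// planResume adopts the first attempt's anchor when one was recorded and
// keeps already-completed tables. Without a recorded anchor it re-streams
// every table under the new snapshot position and flags a warning.
func planResume(m inProgress, tables []string, newAnchor string) resumePlan {
	if m.Anchor == "" {
		return resumePlan{
			Anchor:   newAnchor,
			Restream: append([]string(nil), tables...),
			Warn:     true,
		}
	}
	p := resumePlan{Anchor: m.Anchor}
	for _, t := range tables {
		if !m.Completed[t] {
			p.Restream = append(p.Restream, t)
		}
	}
	return p
}
```

The point of keeping the first anchor is that the first incremental then starts at or before every kept table's snapshot, so idempotent replay can converge the tables that were re-streamed later.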
Compatibility
- A resumed anchored full's EndPosition now means "the FIRST attempt's anchor" (the chain-sound choice). Per-table data in a resumed full remains mixed-consistency across attempts exactly as before, but now the first incremental genuinely converges it.
- Resumes of backups taken by pre-v0.99.38 binaries re-stream all tables once (the old anchor is unknowable); subsequent crashes/resumes get the new table-granular behavior.
Who needs this
- Anyone using backup chains whose fulls can be interrupted, especially --chain-slot users, whose documented crash recovery previously led into the gap.
Install
Binaries for Linux/macOS/Windows (x86_64 + arm64) attached; container image ghcr.io/sluicesync/sluice:0.99.38. Verify with checksums.txt.