fix: Memory-efficient sync and reconciliation for large tables#76
Merged
Conversation
The sync_table method was loading entire tables into memory before processing, causing: - 10GB+ memory usage for tables with millions of rows - Connection timeouts when queries exceeded ELB idle timeouts - Failed syncs with "connection closed" errors Changes: - Use existing batched reader (read_changes_batched + fetch_batch) instead of loading all rows at once - Process and write each batch immediately (memory = O(batch_size)) - Update sync state after each batch for resume capability - Add progress logging every 10 batches - Increase default batch_size from 1000 to 10000 for better throughput - Check for xmin wraparound at start rather than during read This reduces memory from O(total_rows) to O(batch_size), enabling sync of tables with millions of rows without OOM or timeouts. Closes #74
Add cross-platform memory detection and automatic batch size calculation to prevent OOM on small instances while maximizing throughput on larger ones. New functions in utils.rs: - get_available_memory(): Cross-platform (Linux, macOS, Windows) - Linux: Reads MemAvailable from /proc/meminfo - macOS: Uses sysctl + vm_stat for free/inactive pages - Windows: Uses GlobalMemoryStatusEx Win32 API - calculate_optimal_batch_size(): Auto-calculates based on memory - Uses 25% of available memory as working budget - Assumes 2KB per row (conservative estimate) - Clamps between 1,000 and 50,000 rows Expected batch sizes by instance type: - t3.nano (512MB): ~1,000 rows - t3.small (2GB): ~10,000 rows - t3.large (8GB+): 50,000 rows (capped) Refs #74
The reconciler was loading ALL primary keys from both source and target tables into memory before comparing them. For tables with millions of rows (e.g., 14M rows), this caused: - 2-3 GB memory usage just for PKs - Potential OOM on memory-constrained instances - Connection timeouts during long-running PK fetch queries Changes: - Add reconcile_table_batched() using merge-join comparison - Implement PkBatchReader with keyset pagination (WHERE pk > last_pk) - Fetch PKs in sorted batches from both databases - Compare using single-pass merge-join (both streams sorted) - Delete orphans in batches as they're discovered - Add progress logging every 100K comparisons This reduces memory from O(total_rows) to O(batch_size), enabling reconciliation of tables with millions of rows without OOM. Closes #75
This commit fixes critical correctness issues identified in PR #76 review: ## Critical Fix 1: xmin batching skipping rows with same xmin The batched xmin reader was using `WHERE xmin > $1` which skips rows when multiple rows share the same xmin (bulk inserts in single transaction). Fix: Use (xmin, ctid) as compound pagination key. ctid provides a stable tie-breaker for rows with identical xmin values. - Add `last_ctid` field to BatchReader - Use `WHERE (xmin, ctid) > ($1, $2::tid)` for subsequent batches - Include `ctid::text` in SELECT and ORDER BY ## Critical Fix 2: Reconciler PK ordering mismatch PKs were cast to ::text in SELECT but ORDER BY used native column types. For numeric PKs: "10" < "2" lexicographically but 10 > 2 numerically. This caused false orphan detection and data loss. Fix: Use ::text cast in both SELECT and ORDER BY to ensure SQL stream order matches Rust's lexicographic string comparison. - Change ORDER BY from `"col"` to `"col"::text` - Change WHERE from `"col" > $1` to `"col"::text > $1` ## Moderate Fix: macOS page size detection Apple Silicon uses 16KB pages, not 4KB. Hardcoded 4KB underestimated available memory by 4x, leading to unnecessarily small batch sizes. Fix: Use `sysctl hw.pagesize` to get actual page size.
Contributor
Author
Review Fixes AppliedAddressed all findings from the code review: Critical Fixes1. xmin batching now handles duplicate xmin values (reader.rs)
2. Reconciler PK ordering now consistent (reconciler.rs)
Moderate Fix3. macOS page size detection (utils.rs)
All 228 tests pass, clippy clean. |
This was referenced Dec 10, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes critical memory and timeout issues in the xmin-based sync daemon that were causing failures on tables with millions of rows:
Changes
Batched sync processing (
sync_table)read_changes_batched()+fetch_batch()instead of loading all rowsAuto-detect batch size based on available memory
Batched reconciliation (
reconcile_table_batched)WHERE pk > last_pk) for efficient batchingMemory Impact
Testing
Test plan
cargo fmtCloses #74
Closes #75