
Chaos pg and perf improvements #61

Merged
vnvo merged 20 commits into main from chaos-pg
Mar 28, 2026

vnvo commented Mar 28, 2026

Summary

  • Kafka sink linger.ms fix: DEFAULT_LINGER_MS lowered from 20ms to 5ms; linger was the primary throughput bottleneck, capping small batches at ~8K events/s
  • Coordinator batch pipelining — new max_inflight setting decouples accumulation from delivery via bounded channel, preserving checkpoint ordering
  • Drain benchmark results — MySQL: 117K events/s peak, Postgres: 51K events/s peak (Docker, single machine, linger.ms=0, max_events=4000)
  • pgwire-replication 0.3.1 — drain-phase tight loop + reusable read buffer; Postgres CDC +32%
  • Kafka KRaft — dropped Zookeeper, migrated to cp-kafka:7.7.1
  • Chaos UI — activity bar, unified console log, Docker image selector with stale detection
  • Performance tuning guide — new docs/src/performance.md

What changed

Throughput

  • Kafka sink DEFAULT_LINGER_MS: 20 → 5 (steady-state), drain uses linger.ms=0
  • Coordinator: pipelined delivery with max_inflight (default 1, drain uses 4)
  • Drain defaults: max_events=4000, max_inflight=4, linger.ms=0
  • MySQL writer: 64-row multi-INSERT with connection reuse
  • pgwire-replication 0.3.1: tight loop reads up to 256 WAL messages without select!/timeout overhead

Infrastructure

  • Kafka KRaft mode (cp-kafka:7.7.1, no Zookeeper)
  • Grafana dashboard: collapsible rows, multi-pipeline support, running-pipeline filtering

Chaos UI

  • Activity bar with per-button loading and task history
  • Unified console log with smart auto-scroll (pauses when user scrolls up)
  • Docker image selector dropdown with tag/size/age metadata
  • Stale image badge when container runs an outdated image after rebuild
  • Bytes throughput metrics (source_bytes_total, bytes_total)
  • Per-scenario proxy bypass (--no-proxy)

Fixes

  • MySQL drain writer pool.disconnect() deadlock
  • escape_like test (test_build_pattern_query) now uses glob * instead of literal %
  • Failover e2e tests: connection retry loop replacing fixed sleep(10s)
  • Dockerfile restored to --locked after local pgwire-replication dev cycle
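
For reviewers, the failover-test change (a connection retry loop replacing the fixed sleep(10s)) can be pictured with the sketch below. It is a minimal illustration, not the code in this PR: `retry_until`, its parameters, and the closure standing in for a connection attempt are all hypothetical names.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: call `attempt` until it succeeds or `timeout` elapses.
/// Returns the first success, or the last error once the deadline has passed.
pub fn retry_until<T, E, F>(timeout: Duration, interval: Duration, mut attempt: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            // Out of time: surface the most recent error to the test.
            Err(err) if Instant::now() >= deadline => return Err(err),
            // Still within budget: wait briefly and try again.
            Err(_) => sleep(interval),
        }
    }
}
```

Compared with a fixed sleep, a loop like this returns as soon as the service accepts connections and fails with a real error when it never does.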

Docs

  • New performance tuning guide with source-specific configuration advice
  • Updated development guide with current drain defaults, pg-soak profile, expanded UI features
  • README: performance highlights in features section
  • Changelog updated

Test plan

  • cargo test --workspace — all unit tests pass (119 in sources)
  • cargo test -p sources --test failover_e2e -- --include-ignored --test-threads=1 — integration tests stable with retry loop
  • MySQL drain benchmark: --drain-max-events 4000 --drain-kafka-conf linger.ms=0 -> ~117K events/s
  • Postgres drain benchmark: same flags -> ~48K events/s
  • Chaos UI: activity bar visible during infra actions, console shows all output, image selector works, stale badge appears after rebuild
  • Kafka starts in KRaft mode without Zookeeper (docker compose up -d)

Checklist

  • Tests pass (cargo test)
  • Code formatted (cargo fmt)

vnvo added 20 commits March 26, 2026 01:10
- Add postgres support to backlog drain benchmark (run_pg)
- Add Config Lab tab with A/B comparison runner and presets
- Add PATCH proxy endpoint to chaos UI backend
- Fix pg_writer_loop: reconnect on error, f64->numeric type mismatch
- Fix escape_like: escape underscore, convert glob * to SQL %
- Promote writer errors from debug to warn level
- Add df_base/pipeline fields to SoakSource
Source hot-path optimizations (profiler-guided):
- fast_uuid_v7: atomic counter replaces getrandom syscall (3.1% -> 0.2%)
- Remove #[instrument] from read_next_event/dispatch_event
- try_send with async fallback in send_event
- Arc<Vec<RelationColumn>> and Arc<str> qualified_name per relation
- Arc<PostgresTableSchema> in LoadedSchema cache
- Static PG_VERSION LazyLock, hand-written checkpoint JSON
- Pre-allocated batch vector in coordinator
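
A rough sketch of the fast_uuid_v7 idea (an atomic counter in place of a getrandom syscall) follows. The function name `uuid_v7_from_counter`, the caller-supplied timestamp, and the way counter bits are split across the UUID fields are assumptions made for illustration; only the UUIDv7 field layout (48-bit millisecond timestamp, version 7, variant 10) is standard.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter standing in for the random bits (assumption).
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Build UUIDv7-shaped bytes from a millisecond timestamp and a counter.
/// No syscall is made: uniqueness within the process comes from `COUNTER`.
pub fn uuid_v7_from_counter(unix_ms: u64) -> [u8; 16] {
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    let mut out = [0u8; 16];

    // Bytes 0..6: low 48 bits of the timestamp, big-endian.
    out[0..6].copy_from_slice(&unix_ms.to_be_bytes()[2..8]);

    // Bytes 6..8: version nibble (7) followed by 12 bits taken from the
    // top of the counter, so byte order stays monotonic in `n`.
    let rand_a = ((n >> 62) & 0x0FFF) as u16;
    out[6] = 0x70 | ((rand_a >> 8) as u8 & 0x0F);
    out[7] = rand_a as u8;

    // Bytes 8..16: variant bits `10` followed by the low 62 counter bits.
    let rand_b = n & 0x3FFF_FFFF_FFFF_FFFF;
    out[8..16].copy_from_slice(&(0x8000_0000_0000_0000u64 | rand_b).to_be_bytes());
    out
}
```

The trade-off in a scheme like this is that the IDs are predictable: they are unique per process but should not be treated as unguessable.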

Chaos UI:
- Config Lab tab with A/B comparison runner and presets
- Service refresh button (--no-deps --force-recreate)
- Dynamic port/URL link badges from docker port
- Grafana in-flight panel color thresholds (green=draining, orange=buffering)
- Wider services card layout
- Update pgwire-replication to v0.3 (buffered WAL reads)
- Adapt Io error handling for Arc<io::Error> (use err.kind() not string match)
- Include ErrorKind in connect error details
- Fix clippy: range patterns, useless format!
Zero-copy tuple parsing:
- PgColumnValue::Text/Binary now hold Bytes slices (not String/Vec)
- parse_tuple_data uses Bytes::slice() instead of copy + from_utf8_lossy
- DML handlers pass payload_bytes for zero-copy slicing
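
To illustrate the zero-copy approach, here is a stdlib-only sketch of parsing pgoutput TupleData into slices that share one buffer. It deliberately swaps `bytes::Bytes` for a small stand-in (`Slice`, an `Arc<[u8]>` plus a range) so the example has no dependencies; `Slice`, `ColumnValue`, and this `parse_tuple_data` signature are illustrative and not the project's types.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Stand-in for `bytes::Bytes`: a shared buffer plus a window into it.
#[derive(Clone, Debug)]
pub struct Slice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl Slice {
    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

#[derive(Debug)]
pub enum ColumnValue {
    Null,
    UnchangedToast,
    Text(Slice),
    Binary(Slice),
}

/// Read an `n`-byte big-endian unsigned integer at `pos`, if in bounds.
fn read_be(buf: &[u8], pos: usize, n: usize) -> Option<u64> {
    let bytes = buf.get(pos..pos.checked_add(n)?)?;
    Some(bytes.iter().fold(0u64, |acc, b| (acc << 8) | u64::from(*b)))
}

/// Parse TupleData starting at `pos`: Int16 column count, then per column a
/// kind byte ('n', 'u', 't', 'b') and, for 't'/'b', an Int32 length + payload.
/// Payloads are referenced by range; no column bytes are copied.
pub fn parse_tuple_data(buf: &Arc<[u8]>, mut pos: usize) -> Option<Vec<ColumnValue>> {
    let ncols = read_be(buf, pos, 2)? as usize;
    pos += 2;
    let mut cols = Vec::with_capacity(ncols);
    for _ in 0..ncols {
        let kind = *buf.get(pos)?;
        pos += 1;
        match kind {
            b'n' => cols.push(ColumnValue::Null),
            b'u' => cols.push(ColumnValue::UnchangedToast),
            b't' | b'b' => {
                let len = read_be(buf, pos, 4)? as usize;
                pos += 4;
                let end = pos.checked_add(len)?;
                if len > i32::MAX as usize || end > buf.len() {
                    return None; // negative or truncated length
                }
                let slice = Slice { buf: Arc::clone(buf), range: pos..end };
                cols.push(if kind == b't' {
                    ColumnValue::Text(slice)
                } else {
                    ColumnValue::Binary(slice)
                });
                pos = end;
            }
            _ => return None, // unknown column kind
        }
    }
    Some(cols)
}
```

Each column costs one reference-count increment instead of an allocation and copy; UTF-8 validation can be deferred until a consumer actually needs a string.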

Counter cache:
- RunCtx.counter_cache caches metrics::Counter per (table, op)
- send_event does entry().or_insert_with_key() - no hash lookup per event
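
The counter cache can be sketched as follows. `Counter` and `Registry` are simplified stand-ins for the metrics crate's handle and recorder, and the `(u32, Op)` key is an assumed shape for "(table, op)"; only `HashMap::entry().or_insert_with_key()` is the real std API. Note that the sketch still hashes the small local key on every event; what it avoids is going back to the registry after the first event for a given key.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum Op { Insert, Update, Delete }

/// Stand-in for a metrics counter handle (assumption, not `metrics::Counter`).
#[derive(Clone, Default)]
pub struct Counter(Arc<AtomicU64>);

impl Counter {
    pub fn increment(&self, n: u64) { self.0.fetch_add(n, Ordering::Relaxed); }
    pub fn value(&self) -> u64 { self.0.load(Ordering::Relaxed) }
}

/// Stand-in registry that records how often it is consulted.
#[derive(Default)]
pub struct Registry {
    pub lookups: u64,
    counters: HashMap<(u32, Op), Counter>,
}

impl Registry {
    pub fn counter(&mut self, table_id: u32, op: Op) -> Counter {
        self.lookups += 1;
        self.counters.entry((table_id, op)).or_default().clone()
    }
}

/// Per-run context holding resolved counter handles.
#[derive(Default)]
pub struct RunCtx {
    counter_cache: HashMap<(u32, Op), Counter>,
}

impl RunCtx {
    /// Increment the counter for (table, op); returns the new count.
    /// The registry is consulted only the first time a key is seen.
    pub fn count_event(&mut self, registry: &mut Registry, table_id: u32, op: Op) -> u64 {
        let counter = self
            .counter_cache
            .entry((table_id, op))
            .or_insert_with_key(|&(t, o)| registry.counter(t, o));
        counter.increment(1);
        counter.value()
    }
}
```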

LSN cache:
- RunCtx.cached_lsn avoids reformatting same LSN within a transaction
- make_checkpoint_meta_str accepts pre-formatted LSN string
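
A minimal sketch of the LSN cache idea: format the LSN once and reuse the string while the value is unchanged. `LsnCache` and its `formats` counter are hypothetical names added for illustration; the `X/Y` hexadecimal text form (high and low 32 bits) is PostgreSQL's standard LSN notation.

```rust
use std::fmt::Write;

/// Caches the textual form of the most recently seen LSN.
pub struct LsnCache {
    raw: Option<u64>,
    text: String,
    /// Number of times the string was actually rebuilt (for demonstration).
    pub formats: u32,
}

impl LsnCache {
    pub fn new() -> Self {
        LsnCache { raw: None, text: String::new(), formats: 0 }
    }

    /// Return the LSN as `HIGH/LOW` hex, reformatting only when it changes.
    pub fn get(&mut self, lsn: u64) -> &str {
        if self.raw != Some(lsn) {
            self.text.clear(); // keeps the existing allocation
            write!(self.text, "{:X}/{:X}", lsn >> 32, lsn & 0xFFFF_FFFF)
                .expect("writing to a String cannot fail");
            self.raw = Some(lsn);
            self.formats += 1;
        }
        &self.text
    }
}
```

Inside a transaction many events can share one LSN, so most calls would take the fast path and return the cached string.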

Results (direct connection, 1M rows):
- Postgres: 48.5K avg, 52.1K peak (was 29.4K baseline)
- Now matches MySQL throughput (50K avg, 53.2K peak)
- Add deltaforge_source_bytes_total and deltaforge_bytes_total counters
- Add per-scenario proxy bypass (--no-proxy) with pipeline DSN patching
- Redesign Grafana dashboard with collapsible rows, multi-pipeline support
- Add proxy toggle and service refresh to chaos UI
- Add toxiproxy proxy_states/proxy_summary helpers
Reuse connections via outer/inner loop pattern and use multi-row
INSERT statements (64 rows per statement) to reduce round trips
during backlog population.
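
As an illustration of the multi-row INSERT shape, the sketch below builds a parameterised statement and counts how many statements a backlog needs. `build_multi_insert` and `statements_needed` are hypothetical helpers, and the `?` placeholder style assumes a MySQL client; the actual writer code is not shown in this PR text.

```rust
/// Rows per INSERT statement, matching the figure quoted above.
pub const ROWS_PER_STATEMENT: usize = 64;

/// Build `INSERT INTO t (a, b) VALUES (?, ?), (?, ?), ...` for `rows` rows.
pub fn build_multi_insert(table: &str, columns: &[&str], rows: usize) -> String {
    assert!(rows > 0 && !columns.is_empty(), "need at least one row and column");
    // One "(?, ?, ...)" group per row, one "?" per column.
    let group = format!("({})", vec!["?"; columns.len()].join(", "));
    let values = vec![group; rows].join(", ");
    format!("INSERT INTO {} ({}) VALUES {}", table, columns.join(", "), values)
}

/// Number of statements (round trips) needed for `total_rows`, rounding up.
pub fn statements_needed(total_rows: usize) -> usize {
    (total_rows + ROWS_PER_STATEMENT - 1) / ROWS_PER_STATEMENT
}
```

For example, a backlog of 1,000,000 rows would need 15,625 statements at 64 rows each, instead of one round trip per row.
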
Prevents linger wait from capping throughput on small coordinator
batches. With linger.ms=20, each 200-event batch waited the full
20ms before rdkafka sent — limiting throughput to ~8K events/s.
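
The arithmetic behind that ceiling can be made explicit. The sketch assumes one batch in flight at a time and that every batch waits the full linger interval; `linger_ceiling_events_per_sec` is a hypothetical helper, not part of the codebase.

```rust
/// Upper bound on throughput when each batch of `batch_events` must wait
/// `linger_ms` before it is sent and batches are delivered one at a time.
/// With `linger_ms == 0` the result is infinity: linger is no longer the limit.
pub fn linger_ceiling_events_per_sec(batch_events: u32, linger_ms: u32) -> f64 {
    f64::from(batch_events) * 1000.0 / f64::from(linger_ms)
}
```

With 200-event batches, a 20 ms linger gives a ceiling of 10,000 events/s; the observed ~8K events/s sits just below it, which is consistent with roughly 25 ms per batch once send and acknowledgement time are added to the linger wait. At 5 ms the same ceiling rises to 40,000 events/s.
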
Decouple event accumulation from sink delivery using a bounded mpsc
channel (capacity = max_inflight). A dedicated delivery task processes
batches in FIFO order, preserving checkpoint ordering while overlapping
accumulation with delivery.
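
The pipelining pattern can be sketched with the standard library. This version uses `std::sync::mpsc::sync_channel` and a thread in place of an async mpsc channel and task, purely so the example runs without dependencies; `Batch`, `run_pipeline`, and `deliver` are illustrative names.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

pub struct Batch {
    pub events: Vec<u64>,
    /// Position to persist once this batch has been delivered.
    pub checkpoint: u64,
}

/// Placeholder for the sink write (assumption: the real one is fallible and async).
fn deliver(_events: &[u64]) {}

/// Push batches through a bounded channel to a single delivery worker and
/// return the checkpoints in the order they were committed.
pub fn run_pipeline(batches: Vec<Batch>, max_inflight: usize) -> Vec<u64> {
    // Bounded queue: the producer blocks once `max_inflight` batches are waiting.
    let (tx, rx) = sync_channel::<Batch>(max_inflight);

    // One consumer draining the queue in FIFO order keeps checkpoints ordered.
    let delivery = thread::spawn(move || {
        let mut committed = Vec::new();
        for batch in rx {
            deliver(&batch.events);
            committed.push(batch.checkpoint);
        }
        committed
    });

    // Accumulation side: keeps producing while earlier batches are delivered.
    for batch in batches {
        tx.send(batch).expect("delivery worker exited early");
    }
    drop(tx); // close the channel so the worker's loop ends

    delivery.join().expect("delivery worker panicked")
}
```

Because there is exactly one consumer, a checkpoint is only recorded after every earlier batch has been delivered, while the bounded capacity provides backpressure on accumulation.
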
- Drain defaults: max_events=4000, linger.ms=0, max_inflight=4
- Add --drain-max-inflight CLI arg
- MySQL writer: 64-row multi-INSERT with connection reuse
- Remove pool.disconnect() deadlock
- Migrate Kafka from cp-kafka:7.5.0+Zookeeper to cp-kafka:7.7.1 KRaft
- Upgrade pgwire-replication to 0.3.1 (drain-phase tight loop, reusable
  read buffer — +32% Postgres CDC throughput)
- Activity bar with per-button loading state and task history
- Unified console log with smart auto-scroll
- Docker image selector dropdown with tag/size/age metadata
- Stale image badge when container uses outdated image ID
- Remove zookeeper from UI service list
- Fix test_build_pattern_query to use glob * instead of literal %
- Replace fixed sleep(10s) with connection retry loop in failover tests
- Restore Dockerfile.debug --locked flag after local pgwire dev cycle
- Update changelog with throughput optimizations
- New performance.md with benchmark results, tuning parameters,
  source-specific notes, profiling instructions, and drain benchmark usage
- Update development.md with current drain defaults, pg-soak profile,
  expanded playground UI features
vnvo merged commit 426b550 into main Mar 28, 2026
3 of 4 checks passed
vnvo deleted the chaos-pg branch March 28, 2026 23:15