Phase 10: automatic cluster membership lifecycle#87

Merged: transfix merged 3 commits into master from feature/phase10-cluster-membership on May 23, 2026
@transfix
Owner

Add state_cluster_membership: a background component that manages peer lifecycle through heartbeat emission, failure detection, and automatic eviction.

Components

state_cluster_membership (header + implementation + 21 tests)

Heartbeat Emission

  • Background tick loop periodically publishes OOB messages on __system.membership.<cluster>.<node>
  • Configurable interval (default 1s)

Failure Detection

Tracks last_heartbeat_ns per peer and advances each peer through a state machine with three transitions:

  • alive → suspect after suspect_timeout (default 3s)
  • suspect → dead after dead_timeout (default 5s)
  • dead → evicted after evict_timeout (default 10s); the peer is then removed from peer_registry and replica
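
The transitions above can be sketched as a pure function of elapsed silence. This is a minimal illustration, not the code in this PR: `peer_state`, `membership_timeouts`, and `classify` are hypothetical names. The description does not say whether each timeout is measured from the last heartbeat or from entry into the previous state; the sketch assumes the former.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration; the real header is not shown in this PR.
enum class peer_state { alive, suspect, dead, evicted };

constexpr std::uint64_t ns_per_s = 1000000000ULL;

struct membership_timeouts {
    std::uint64_t suspect_ns = 3 * ns_per_s;   // suspect_timeout, default 3s
    std::uint64_t dead_ns    = 5 * ns_per_s;   // dead_timeout, default 5s
    std::uint64_t evict_ns   = 10 * ns_per_s;  // evict_timeout, default 10s
};

// ASSUMPTION: every timeout is measured from the last heartbeat, so the
// thresholds are cumulative (3s, 5s, 10s of silence).
inline peer_state classify(std::uint64_t last_heartbeat_ns,
                           std::uint64_t now_ns,
                           const membership_timeouts& t = membership_timeouts()) {
    assert(now_ns >= last_heartbeat_ns);  // the clock must not run backwards
    const std::uint64_t silence = now_ns - last_heartbeat_ns;
    if (silence >= t.evict_ns)   return peer_state::evicted;
    if (silence >= t.dead_ns)    return peer_state::dead;
    if (silence >= t.suspect_ns) return peer_state::suspect;
    return peer_state::alive;
}
```

Keeping the classification free of side effects means the tick loop only has to compare the old and new state to decide which event to fire.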

Event Callbacks

  • peer_joined, peer_suspect, peer_dead, peer_evicted events
  • Multiple listeners via add_callback() / remove_callback()
  • Fired from tick loop thread (non-blocking handlers recommended)
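
A listener registry along these lines might look like the sketch below. The handler signature, the integer handle returned by `add_callback()`, and the `fire()` helper are assumptions made for illustration; only the names `add_callback()` and `remove_callback()` and the four event names come from this PR.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

enum class membership_event { peer_joined, peer_suspect, peer_dead, peer_evicted };

// Hypothetical listener registry; the real signatures may differ.
class callback_list {
public:
    using handler = std::function<void(membership_event, const std::string&)>;

    // Returns a handle that can later be passed to remove_callback().
    std::uint64_t add_callback(handler h) {
        std::lock_guard<std::mutex> lock(mu_);
        const std::uint64_t id = next_id_++;
        handlers_.emplace(id, std::move(h));
        return id;
    }

    // Returns true if a listener with this handle was registered.
    bool remove_callback(std::uint64_t id) {
        std::lock_guard<std::mutex> lock(mu_);
        return handlers_.erase(id) > 0;
    }

    // Called from the tick loop thread. Handlers are copied out under the
    // lock and invoked after it is released, so a handler may call
    // add_callback() or remove_callback() without deadlocking.
    void fire(membership_event ev, const std::string& peer_id) {
        std::vector<handler> snapshot;
        {
            std::lock_guard<std::mutex> lock(mu_);
            for (const auto& kv : handlers_) snapshot.push_back(kv.second);
        }
        for (const auto& h : snapshot) h(ev, peer_id);
    }

private:
    std::mutex mu_;
    std::uint64_t next_id_ = 1;
    std::map<std::uint64_t, handler> handlers_;
};
```

Because handlers run on the tick loop thread, a slow handler delays the next heartbeat and the next failure-detection pass, which is why non-blocking handlers are recommended.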

Inbound Heartbeat Processing

  • on_heartbeat() auto-registers unknown peers in registry + replica
  • Refreshes liveness; restores suspect/dead peers to alive
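
The inbound path can be sketched as follows. `peer_table`, `peer_entry`, and `heartbeat_result` are hypothetical stand-ins; the real `on_heartbeat()` also registers the peer in peer_registry and replica, which this sketch leaves out.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Evicted peers are removed from the table, so there is no evicted state here.
enum class peer_state { alive, suspect, dead };

struct peer_entry {
    std::uint64_t last_heartbeat_ns = 0;
    peer_state state = peer_state::alive;
};

enum class heartbeat_result { joined, refreshed, revived };

// Hypothetical in-memory peer table, standing in for the real registry.
class peer_table {
public:
    heartbeat_result on_heartbeat(const std::string& node_id, std::uint64_t now_ns) {
        auto it = peers_.find(node_id);
        if (it == peers_.end()) {
            // Unknown peer: auto-register it as alive.
            peer_entry e;
            e.last_heartbeat_ns = now_ns;
            peers_.emplace(node_id, e);
            return heartbeat_result::joined;
        }
        // Known peer: refresh liveness and restore suspect/dead to alive.
        const bool was_degraded = it->second.state != peer_state::alive;
        it->second.last_heartbeat_ns = now_ns;
        it->second.state = peer_state::alive;
        return was_degraded ? heartbeat_result::revived : heartbeat_result::refreshed;
    }

    // Used by the failure detector to demote a known peer.
    void mark(const std::string& node_id, peer_state s) {
        auto it = peers_.find(node_id);
        if (it != peers_.end()) it->second.state = s;
    }

    // Returns nullptr when the peer is unknown.
    const peer_entry* find(const std::string& node_id) const {
        auto it = peers_.find(node_id);
        return it == peers_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, peer_entry> peers_;
};
```

Returning a result code lets the caller decide whether to fire peer_joined; whether a revival fires any event is not stated in this PR.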

Observability

  • Counters: heartbeats sent/received, peers suspected/dead/evicted/joined
  • Clock injection via set_clock() for deterministic testing
  • Integrates with existing peer_registry, replica, shard, and transport
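
Clock injection of the kind `set_clock()` provides could be implemented roughly as below. The `clock_fn` type, the nanosecond return value, and the fallback to `std::chrono::steady_clock` are assumptions; only the name `set_clock()` appears in this PR.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical time source; the real set_clock() signature may differ.
class injectable_clock {
public:
    using clock_fn = std::function<std::uint64_t()>;

    // Install a custom time source; pass nullptr to restore the default.
    void set_clock(clock_fn fn) { clock_ = std::move(fn); }

    // Current time in nanoseconds, from the injected clock if one is set.
    std::uint64_t now_ns() const {
        return clock_ ? clock_() : steady_now_ns();
    }

private:
    // ASSUMPTION: a monotonic clock is the default, so wall-clock
    // adjustments cannot trigger spurious suspect/dead transitions.
    static std::uint64_t steady_now_ns() {
        using namespace std::chrono;
        return static_cast<std::uint64_t>(
            duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
    }

    clock_fn clock_;
};
```

A test can then advance a plain integer past a timeout and run one detection pass, instead of sleeping for seconds of real time.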

Tests (21)

Construction, start/stop lifecycle, peer registration, heartbeat processing, failure detection through all states (suspect/dead/evict), heartbeat revival, event callbacks, config updates, and two-node integration with shard + inproc transport.

Add state_cluster_membership: a background component that manages
peer lifecycle through heartbeat emission, failure detection, and
automatic eviction.

Heartbeat emission:
  Periodically publishes OOB messages on __system.membership.<cluster>.<node>
  so other nodes know this process is alive.

Failure detection:
  Tracks last_heartbeat_ns per peer and transitions through states:
    alive -> suspect (after suspect_timeout)
    suspect -> dead  (after dead_timeout)
    dead -> evicted  (removed from peer_registry + replica)

Event callbacks:
  peer_joined, peer_suspect, peer_dead, peer_evicted events fire
  from the tick loop thread for other components to react.

Inbound heartbeats:
  on_heartbeat() auto-registers unknown peers and refreshes liveness.
  Suspect/dead peers are restored to alive on fresh heartbeat.

Observability:
  Counters for heartbeats sent/received, peers suspected/dead/evicted/joined.
  Clock injection via set_clock() for deterministic testing.

21 tests cover construction, start/stop, peer registration, heartbeat
processing, failure detection through all states, event callbacks,
config updates, and two-node integration with shard + transport.
@transfix force-pushed the feature/phase10-cluster-membership branch from 9e13f68 to 3823a1b on May 22, 2026 20:37
@transfix merged commit 273cc81 into master on May 23, 2026
14 checks passed
@transfix deleted the feature/phase10-cluster-membership branch on May 23, 2026 04:33