[EE2-1481]: feature: VM pool#1
Merged
Merged
Conversation
Signed-off-by: Valery Piashchynski <piashchynski.valery@gmail.com>
Signed-off-by: Valery Piashchynski <piashchynski.valery@gmail.com>
Signed-off-by: Valery Piashchynski <piashchynski.valery@gmail.com>
skhaz
added a commit
that referenced
this pull request
Apr 9, 2026
… synchronous demonitor, monitor cleanup - Fix #1: Process exit leave events now include PID repeated per join occurrence (e.g. 3 joins -> leave event with [pid, pid, pid]), matching Erlang pg behavior for correct member count tracking. - Fix #2: handleSync uses differential comparison instead of full teardown+rebuild, emitting join/leave events only for actual PID changes. Eliminates spurious events on node reconnection/sync. Adds diffPIDs helper with correct multiplicity handling. - Fix #3: Monitor/Events unsubscribe is now synchronous - blocks until the event loop processes the removal, guaranteeing no events arrive after unsubscribe returns (matches Erlang pg:demonitor/2). - Fix #4: PG service now monitors subscriber PIDs via topology and automatically removes monitor subscriptions when the monitoring process dies (removeMonitorsByPID). Adds hasMonitorSubscriptions/ hasGroupMemberships helpers for correct demonitor lifecycle. All 147 Go tests pass, 0 golangci-lint issues, 47 Lua E2E tests pass.
wolfy-j
added a commit
that referenced
this pull request
May 28, 2026
* Add pg (process groups) module with review fixes
Implement distributed named process groups following Erlang/OTP pg semantics.
Fix CommandID collision with stream module, shared payload slice aliasing,
and duplicated sendToMembers logic found during code review.
## Matching Semantics with Erlang pg
| Feature | Erlang pg | This implementation |
|---|---|---|
| Named groups | Processes join arbitrary named groups | Same (Group = string) |
| Multi-join | A process can join the same group multiple times and must leave the same number of times | Same (state tracks duplicate group entries) |
| Auto-removal on exit | Processes are automatically removed from all groups when they terminate | Same (via monitorProcess / handleProcessExit) |
| Cluster-wide membership | Group membership is visible across all connected nodes | Same (local + remote state, discover/sync protocol) |
| Eventually consistent | No strong consistency guarantees; nodes converge via async messaging | Same (async broadcast of join/leave/sync messages) |
| get_members / get_local_members | Separate queries for all-nodes vs local-node members | Same (GetMembers vs GetLocalMembers) |
| which_groups | Returns all groups with at least one member | Same |
## Differences from Erlang pg
| Feature | Erlang pg | This implementation |
|---|---|---|
| Scopes | Supports named scopes (multiple independent pg instances) | Single global scope, no scope parameter |
| Monitor callbacks | pg:monitor/1,2 lets a process receive join/leave notifications | No equivalent, query-only |
| Broadcast | No built-in broadcast; sending to members is left to the caller | Adds Broadcast and BroadcastLocal as first-class operations |
| Consistency model | gen_server with ordered message processing per-scope | Single-goroutine event loop via actions channel (functionally equivalent) |
## Review Fixes
- Move pg CommandIDs from 50-56 to 200-206 to avoid collision with stream module (50-57)
- Add defensive payload slice copy in sendToMembers to prevent data corruption
- Extract duplicated sendToMembers into shared broadcast.go helper
* Fix pg dispatcher boot ordering: depend on pg system component
The pg dispatcher was loading before the pg service was in context,
causing GetProcessGroups(ctx) to return nil and silently skipping
all command handler registration. Add explicit dependency on the
pg boot component to ensure correct ordering.
* Move PGDispatcher to system components to fix core-only boot
PGDispatcher depends on the "pg" system component, but was included
in dispatchers.All() which is part of core.All(). This broke
TestCorePlugins because the "pg" node doesn't exist in core-only
boot, causing dependency resolution to fail with a stuck node error.
Move PGDispatcher to system.All() alongside PG() where it belongs,
keeping the boot ordering guarantee while allowing core-only loading.
* Add pg.events() for process group membership change notifications
Implements Erlang PG monitor pattern: Lua processes can subscribe to
join/leave events via pg.events(), which returns a subscription with
a channel that delivers membership change events in real time.
- Add MembershipEvent struct and event constants to api/pg
- Add emitJoinEvent/emitLeaveEvent to pg service, wired into all
join/leave paths including remote and process exit handlers
- Add EventsYield and Subscription type to Lua pg module
- Add unit tests, service event emission tests, and 4 integration
tests using inline.Pool (97 pg module tests, 107 system/pg tests)
* Implement full Erlang PG compliance: monitor, scopes, batch ops, RCU snapshots
Close all gaps from PG compliance review:
- Fix leaveRemote multi-join bug in state.go (Phase 1.1)
- Add pg.monitor(group) with atomic snapshot+subscribe (Phase 2.1)
- Update pg.events() to return groups snapshot as second value (Phase 2.1.5)
- Add named scopes via pg.scope(name) with prefix isolation (Phase 3.1)
- Add batch join/leave accepting table of groups (Phase 4.1)
- Add which_local_groups() (Phase 4.2)
- Add sub:close({flush=true}) via Channel.Drain() (Phase 4.3)
- Lock-free reads via RCU atomic.Pointer snapshots (Phase 5.1)
- Clear snapshot on Stop() so readers get nil after shutdown
- Add publishSnapshot() to protocol test submit closures
All 335 tests pass (139 system/pg, 17 api/pg, 136 module, 43 Lua E2E).
* Fix Erlang PG semantics: leaveAllLocal preserves duplicates, LeaveGroups uses best-effort
Two bugs found comparing against Erlang PG source (pg.erl):
1. leaveAllLocal was deduplicating groups, losing multi-join count for
remote broadcast. When a process joins group A twice and exits,
broadcastLeave now correctly sends A twice so remote nodes remove
both occurrences. Event emission is still deduplicated (one event
per unique group).
2. LeaveGroups was failing fast on first not-joined group, causing
partial state mutation. Now uses best-effort semantics: tries all
groups, skips non-members, returns ErrNotJoined only if NONE
succeeded.
All 134 Go tests pass, golangci-lint clean.
* Fix 4 Erlang PG semantic gaps: event multiplicity, differential sync, synchronous demonitor, monitor cleanup
- Fix #1: Process exit leave events now include PID repeated per join
occurrence (e.g. 3 joins -> leave event with [pid, pid, pid]),
matching Erlang pg behavior for correct member count tracking.
- Fix #2: handleSync uses differential comparison instead of full
teardown+rebuild, emitting join/leave events only for actual PID
changes. Eliminates spurious events on node reconnection/sync.
Adds diffPIDs helper with correct multiplicity handling.
- Fix #3: Monitor/Events unsubscribe is now synchronous - blocks
until the event loop processes the removal, guaranteeing no events
arrive after unsubscribe returns (matches Erlang pg:demonitor/2).
- Fix #4: PG service now monitors subscriber PIDs via topology and
automatically removes monitor subscriptions when the monitoring
process dies (removeMonitorsByPID). Adds hasMonitorSubscriptions/
hasGroupMemberships helpers for correct demonitor lifecycle.
All 147 Go tests pass, 0 golangci-lint issues, 47 Lua E2E tests pass.
* Fix spurious leave events in handleRemoteLeave and add tests
leaveRemote now returns a map of actually-removed PIDs per group,
and handleRemoteLeave only emits events for groups the PID was truly
a member of. Adds 4 state, 2 protocol, 2 service, and 1 integration
test to verify correct behavior.
* Refactor PG to registry-based scope manager with service addressing
Replace singleton ProcessGroups with per-scope ScopeService instances resolved
via resource registry. Make dispatcher stateless by embedding service reference
in commands. Replace synthetic PIDs with empty-UniqID service addresses routed
by host-level relay dispatch.
* fix: resolve internode connection and routing bugs for distributed PG
Fix 12 bugs discovered during distributed PG cluster integration testing:
- Fix PID generator using empty node ID (boot/components/core/infrastructure.go)
- Fix address parsing in connectToNode producing invalid host:port addresses
- Fix port mismatch when connection manager restarts with AutoPort
- Fix race condition in NodeJoined event ordering between internode and PG
- Fix router internode receiver silently ignored due to first-write-wins
- Fix MsgPack codec missing WriteExt flag causing string/bytes ambiguity
- Fix syncRemote deleting remote node entry on empty groups (discover loop)
- Fix stale control loop race on node leave/rejoin
- Fix inbound connections dropped from not-yet-managed nodes
- Fix cleanupControlLoop deleting replacement loop instance
- Fix ensureNodeManaged calling AddManagedNode on already-managed nodes
- Fix AddManagedNode destroying active inbound connections
* cleanup: pool release hygiene, log level downgrades, PID precomputation
- Change Groups[:0] to nil in JoinGroupsCmd/LeaveGroupsCmd Release()
to match every other pooled command type in api/ and avoid retaining
backing arrays in the pool
- Downgrade 4 chatty Info-level function-entry trace logs to Debug in
internode connection manager (handleConnect, handleConnected,
handleDisconnected, Node already managed)
- Add Precomputed() to PIDs constructed in NewServicePackage to avoid
repeated string rebuilds on every String() call
* fix: use Done channel instead of sleep for deadline test on Windows
time.Sleep(1ms) was insufficient for a 1ns context timeout on Windows
due to ~15ms timer resolution, causing the test to get ErrNotFound
instead of context.DeadlineExceeded.
* Add Raft-based Global Registry for cluster-wide process naming
Implements distributed global registry using Raft consensus:
API:
- api/globalreg/: Global registry API with Register, Unregister, Lookup
- api/raft/: Raft consensus API for distributed state machine
System:
- system/raft/: Raft node implementation with membership handler
- system/globalreg/: Sharded FSM + Service for global registry
- FSM: Replicated state machine for name→PID mappings
- Service: Leader forwarding, auto-cleanup on node departure
- Commands: Register, Unregister, RemoveNode, Heartbeat
Boot:
- boot/components/system/raft.go: Raft component initialization
- boot/components/system/all.go: Add Raft() to component list
Lua:
- runtime/lua/modules/process/module.go: Add GLOBAL registration mode
Features:
- Cluster-wide unique names via Raft linearizability
- Automatic cleanup when nodes leave cluster
- Leader forwarding for write operations
- Retry logic for follower leader discovery
* Add PG test harness and comprehensive e2e tests
- system/pg/harness/: New test harness for multi-node PG testing
- TestCluster: Multi-node test cluster setup
- TestNode: Individual node wrapper with mock topology/router
- Helper methods for join/leave/broadcast/assertions
- EventCollector for event verification
- e2e_test.go: 19 end-to-end test scenarios
- Single and multi-node operations
- Group membership, broadcasts, monitoring
- Edge cases and error conditions
- stress_test.go: 8 stress tests
- 100+ processes, concurrent operations
- Memory stability, burst traffic
- Concurrent group creation
- chaos_test.go: 10 chaos tests
- Node failures and recovery
- Network partitions
- Rapid failure sequences
- bench_test.go: 8 benchmark tests
- Join/leave/get members operations
- Parallel and atomic operations
Note: Tests using mock router don't have cross-node sync.
Tests requiring real inter-node communication should use
integration test setup with real relay.
* Add sharded global registry with 2PC distributed transactions
Implements lock-free sharded registry architecture:
- ShardCoordinator: Manages multiple shards using sync.Map
- Shard: Individual FSM per shard with atomic counters
- TransactionManager: 2PC coordinator for cross-shard atomicity
- FNV-1a consistent hashing for name distribution
- Lock-free reads via sync.Map and atomic operations
- All 14 sharded registry tests passing
- Fixes UnregisterMulti 2PC prepare phase for both register/unregister
* Add Lua process module global registry integration tests
Adds comprehensive tests for the Lua process module's global registry
integration using a mock registry implementation:
- Registration and lookup
- Conflict detection
- Concurrent access
- Multiple names
- Unregistration
- Node cleanup
- Linearizability
- Idempotent re-registration
- Stress testing
12 tests covering mock registry behavior for Lua integration.
* Add SyncedCluster test harness with 23 new PG synced tests
- Add SyncedCluster infrastructure with shared event bus
- Export ForwardingRouter for cross-node message routing
- Add SetMonitorError to mockTopology for error injection tests
- Fix 4 pre-existing E2E test bugs (ref-counting, local-view assertions)
- Add 23 new synced tests covering local behavior, error injection,
concurrent operations, and edge cases
* fix(runtime): prevent memory leak in fenceCaches sync.Map
The fenceCaches sync.Map in runtime/lua/modules/process/module.go was
growing unbounded because entries were never cleaned up in production.
When an LState was closed, the cache entry remained forever.
Fix: Register a cleanup callback via resource.Store when the cache is
first created. The callback runs when the process ends, removing the
entry from fenceCaches.
This follows the established pattern used by pg, fs, sql modules.
Changes:
- Add resource.Store cleanup registration in getFenceCache()
- Update fenceCaches comment to reflect actual cleanup mechanism
* Add 17 new E2E tests and 9 SyncedCluster chaos/stress tests
E2E test coverage additions:
- LeaveIdempotency, JoinAfterLeave, BroadcastMultipleMembers
- ConcurrentLeaves, MixedJoinLeave, DifferentPIDsSameNodeSameGroup
- SamePIDMultipleGroups, PartialLeaveGroups, NodeRecoveryMembership
- GroupNamesWithSpecialChars, EmptyStringGroup, VerifyTopologyMonitor
- ServiceStopStart, BroadcastLocalEmptyGroup, LeaveOneOfManyGroups
- Fixed existing tests to account for ref-counting behavior
SyncedCluster chaos/stress tests:
- RapidNodeFailure, ConcurrentNodeOperations, PartialNodeFailure
- CascadeFailure, HighVolumeJoins, SustainedLoad
- MemoryPattern, ConcurrentGroups, BurstOperations
* Fix pre-existing chaos/stress tests for local-only semantics
- Update chaos tests to count local members per-node instead of
expecting cross-node sync via AssertGroupSize
- Update stress tests to use local member counting
- Remove WaitForSync calls that expected cross-node synchronization
- Tests now correctly work with TestCluster (local-only) semantics
- 7 chaos tests and 8 stress tests now passing
* fix(globalreg,pg): dedupe registrations and persist fence tokens in snapshots
- Fix idempotent register to not increment nameCount on re-registration
- Persist AppliedAt fence tokens in snapshot/restore (adds 'a' codec field)
- Fix JoinGroups to dedupe groups when checking maxGroups limit
- Respect maxMembersPerGroup when same group appears multiple times
- Add comprehensive tests for all edge cases
* feat(pg): add circuit breaker, retry mechanism and queue backpressure
Add resilience features to process groups:
- Circuit breaker per node (3 failures / 10s reset)
- Exponential backoff retry queue for failed broadcasts
- Queue overflow protection with warnings at 75% capacity
- Protocol and broadcast timeouts (5s default)
- New errors: ErrQueueFull, ErrBroadcastTimeout, ErrCircuitOpen
Fix race condition in retry queue startup. All tests pass with -race.
Refs #pg-consistency
* refactor: reorder struct fields and clean up comments
Optimize struct field ordering for alignment, remove redundant
inline comments, and rename unused context parameters to _.
* chore: remove PG harness tests (moved to wippyai/pg-harness)
The PG harness test suite (harness infrastructure + 247 tests across
LIFE/CHAOS/LOAD/CONC/BP/PROTO sections) now lives in the standalone
github.com/wippyai/pg-harness repository at the sibling path
../pg-harness.
* fix(pg): race-free Start/Stop ctx swap via atomic.Pointer
Service.Start() previously wrote s.ctx and s.cancel as plain field
assignments while concurrent submitters read <-s.ctx.Done() in submit(),
submitError(), eventLoop(), Join/Leave/Broadcast and every other hot path.
go test -race -count=3 (exercised from the pg-testapp TestNODE_
RestartUnderPressure chaos scenario) reliably caught two distinct races:
WARNING: DATA RACE
Write at 0x...: pg.(*Service).Start() service.go:142
Previous read: pg.(*Service).submit() service.go:276
WARNING: DATA RACE
Read at 0x...: sync/atomic.LoadPointer via Join() service.go:558
Previous write: context.WithCancel from Start() service.go:142
Fix: replace the two mutable fields with a single
atomic.Pointer[serviceCtx] holding the Start-scoped ctx + cancel. The
holder is initialised at construction with a pre-cancelled sentinel
(closedServiceCtx) so pre-Start / post-Stop reads are lock-free and
need no nil guard; Start() publishes a fresh pair with one atomic
Store, Stop() atomically Swaps back to the sentinel and only then
cancels the old ctx. All hot-path reads go through s.currentCtx(),
a single atomic.Load.
Mirrors the existing atomic.Pointer[stateSnapshot] pattern already used
for lock-free membership reads. No lock added, no contention change.
Verified:
go test -race ./system/pg/... — clean (9.5 s)
go test -race -count=3 ./harness/... (pg-testapp) — clean (73 s,
88 tests × 3)
golangci-lint run ./system/pg/... — 0 issues
* wip: in-flight changes across pg, globalreg, raft, topology, cluster
Snapshot of uncommitted work on feature/pg-process-groups before pushing:
- system/pg/ — protocol/retry/state/errors refinements + test tweaks
- system/globalreg/ + sharded/
— commands, fsm, service, coordinator, shard,
transaction, state
- system/raft/membership.go
- system/topology/pid_registry.go
- cluster/internode/ — codec, state_manager
- boot/components/system/cluster.go
- runtime/lua/modules/pg/ — module, module_test, yields
- otel-harness.md — reference catalog (new)
Builds clean (go build ./...).
* feat(raft): cap voters with non-voter overflow and odd-size ladder
Adds a leader-driven membership reconciler that caps Raft voters at
MaxVoters (default 5) along an odd ladder (1,3,5,5,5,7,...), keeping
surplus eligible nodes as non-voters so they receive the log without
inflating quorum cost. Selection is rank-based (raft_priority asc,
NodeID asc) with failure-domain spreading and soft stickiness so
existing voters aren't churned across reconcile passes.
- api/raft: extend Service with AddNonvoter/DemoteVoter/
LeadershipTransfer; add ErrServerNotFound.
- system/raft: split selection (pure) from handler (loops); decoupled
subscriber + reconcile goroutines with debounce and coalesced signal;
self-demote/self-remove triggers LeadershipTransfer first to avoid
killing the local leader mid-pass; soft errors abort the pass.
- cluster/membership: add UpdateMeta and clone NodeMeta under lock to
fix a race in the gossip delegate.
- boot: gossip raft_eligible/raft_priority/failure_domain hints; wire
the reconciler with HandlerConfig from cluster config.
- tests: pure unit tests for selection + handler with fake services and
a real eventbus debounce check; //go:build integration tests spin up
3/5/7-node in-process clusters to verify the cap end-to-end against
hashicorp/raft.
* feat(telemetrytest): add in-memory recorder for telemetry unit tests
* feat(pg): emit OTel metrics for join/leave/broadcast operations
* feat(pg): emit OTel metrics for queue, circuit breaker, retry
* feat(pg): emit OTel metrics for dispatcher, fence tokens, globalreg
* feat(pg): emit OTel spans for join/leave/broadcast
* feat(raft): emit OTel metrics for state, term, leader changes, elections
* feat(raft): emit OTel metrics for commit/log lag and AppendEntries latency
* feat(raft): emit OTel metrics+spans for voter ladder and snapshots
* feat(gossip): emit OTel metrics for members, joins, messages
* feat(gossip): emit OTel metrics+spans for probes, suspicion, convergence
* feat(globalreg,relay): emit pg_fence_*/pg_globalreg_* OTel metrics
Wire the runtime dashboards (pg_fence_token, pg_fence_rejection_total,
pg_globalreg_size, pg_globalreg_dedupe_total) from where the work
actually happens:
- system/globalreg/telemetry.go: nil-safe recorders that emit pg_*
metric series via api/metrics.Collector. The pg_ prefix is a metric
naming choice (Grafana contract), not a package ownership statement.
- system/globalreg/state.go: introduce registerOutcome enum so the FSM
can distinguish fresh insert / idempotent dedupe / conflict, and add
Len() for size sampling.
- system/globalreg/fsm.go: emit fence-token gauge on inserted/resolved
registrations, dedupe counter on idempotent re-registration, and
size gauge after every state mutation (including Restore).
- system/globalreg/service.go: take optional metrics.Collector +
meter/tracer providers, install telemetry on the FSM, and expose
RecordFenceRejection so the relay router can report rejections via
callback without depending on metrics directly.
- system/relay/router.go: add SetOnFenceReject callback hook fired at
the existing fence-token rejection site; avoids plumbing telemetry
through ~30 NewRouter call sites.
- boot/components/system/raft.go: pass nil collector triple (matching
the existing raft wiring) and install the fence-reject callback.
- system/pg/telemetry.go: remove the unwired recorders left over from
Task 4 (recordDispatcherInflight, recordFence*, recordGlobalreg*);
metric names are unchanged so dashboards still resolve.
* feat(boot): wire metrics collector and OTel providers into pg/raft/gossip/globalreg constructors
* fix(cluster,raft): wire metrics dep + fix raft addr from gossip ip:port
* feat(raft,gossip): wrap AE pipeline+RequestVote+InstallSnapshot, synthesize gossip_message_total
* feat(observability): bootstrap rare counters, sticky local voter, AE pipeline wrap, 10s push interval
* perf: 30s push interval, lower workload rate (sleep 500ms), match k8s CPU limits
* docs(superpowers): no-crash runtime spec - bounded everything, silent nothing
* docs(superpowers): no-crash runtime implementation plan (13 tasks, TDD)
* test(internode): add baseline OOM reproducer (skipped placeholder)
* feat(internode): per-class bounded queues with drop-oldest|drop-newest
Replace unbounded *list.List per-peer message queue with fixed-capacity ring
buffers per QoS class (RaftControl 4096 drop-oldest, Gossip 1024 drop-newest,
PGBroadcast 2048 drop-newest+error). All drops emit
internode_dropped_total{class,reason="queue_full"} via new telemetry layer.
* fix(internode): meter requeue drops, distinct ErrUnknownClass, gauge on drain
* feat(internode): plumb Class through SendToNode + topic-based mapping
* fix(internode): requeue respects per-class cap (no duplication on reconnect)
Replace RequeueMessages with RequeueMessagesClass, which respects the
per-class ring buffer cap and applies drop-oldest/drop-newest semantics
consistently during requeue. Callers in handleDisconnected, cleanup, and
drainMessages conservatively requeue under ClassPGBroadcast since the
original class is lost at the connection level. Old RequeueMessages
removed entirely; existing tests migrated to RequeueMessagesClass with
ClassRaftControl as the safe default.
* feat(pg): heap-based capped retry queue with drop metrics
* fix(pg): bucket attempt label to bound prometheus cardinality
* feat(pg): observe ErrQueueFull from internode and count broadcast drops
* perf(otel): bound batcher queue/batch + span limits, non-blocking on overflow
* perf(raft): drop per-AppendEntries span on hot path (preserve metrics)
* fix(raft): bound LeadershipTransfer goroutine pin under partition
Extract awaitFutureWithTimeout helper with a buffered(1) channel so the
helper goroutine can always send and exit once hraft's future resolves —
even after the caller has already returned ErrTimeout. Under a network
partition the goroutine stays pinned inside f.Error() until the partition
heals or raft shuts down (ErrRaftShutdown); hraft.Future has no cancel
API so this single-goroutine pin per stuck transfer is the honest bound.
Remove pre-existing dead code defaultTimeout from config.go.
* chore: go mod tidy (remove spurious goleak indirect)
* fix(chaos): close 6 root-cause issues from chaos audit
Bound every unbounded growth point we saw OOM the runtime under
network-partition + stress chaos:
- internode: NodeConnection.activeQueue capped (MaxConnectionQueueSize,
default 4096) with drop-newest + internode_dropped_total
{reason="conn_queue_full"}; orphan-sweep loop reconciles managed
nodes against membership every 60s.
- raft: stderr adapter limiter table is a real LRU (cap 256) keyed by
first-64-byte prefix; previous map grew unbounded under chaos with
many distinct prefixes.
- pg: monitor entries owned by PIDs on a departed node are evicted
on cluster.NodeLeft; circuit breaker map capped at 1024 with a
defense-in-depth eviction counter.
Detect dead raft peers and break the broken-pipe storm at the
transport layer (not just at the log):
- system/raft/peer_state.go (new) wraps hraft.Transport with a
per-peer consecutive-failure tracker. After failureLimit=5 errors
the peer is marked dead for an exponential backoff window (100ms→5s).
Subsequent transport calls return errPeerDead immediately, never
reaching the dead socket.
- raft_peer_dead_total / _skip_total / _recovered_total + a backoff
gauge expose the throttle.
Make raft_leader_changes_total emit reliably:
- LeaderCh is non-buffered with non-blocking writes; the bootstrap
node misses its own first leader-elected event when the goroutine
isn't reading yet. Added an initial seed (state == Leader) and a
ticker reconciliation that compares actual state vs wasLeader and
fires the missing transition. Verified: 20+ leader changes now
recorded under partition where prior runs reported 0.
Silence chaos-time log spam:
- internode SendToNode (ErrNodeNotManaged) and onMessage delivery
failure: log demoted, internode_dropped_total{reason=...} replaces
the line.
- cmd/internal/logger gets zap.SamplingConfig{Initial:100,
Thereafter:100} on the non-Console default — defense-in-depth so a
future hot WARN cannot OOM the runtime.
- raft TCPTransport stderr piped through raftStderrAdapter (rate-
limited per first-64-byte prefix via golang.org/x/time/rate).
Activity-based liveness:
- system/health/ (new) is a process-wide registry of liveness checks.
- /livez handler in boot/components/prometheus aggregates checks; HTTP
503 if any reports unhealthy.
- pg.Service registers pg.broadcast_recent.<host> backed by
system/pg/activity.go atomic timestamp (touched on every TX/RX).
- raft boot registers raft.last_contact (LastContact() vs ceiling).
- cluster boot registers cluster.gossip (memberlist HealthScore).
- Ceiling 30s catches a pod stuck on the wrong side of a partition
before kubelet's failureThreshold expires.
Thread the runtime config that the configmap was already setting:
- raft snapshot_threshold/_interval/_retain, trailing_logs,
heartbeat_timeout, election_timeout, commit_timeout now read from
the boot config instead of being silently ignored.
- prometheus.enabled / prometheus.address now bind /metrics + /livez
on the configured port; pprof gated by WIPPY_DEBUG_PPROF=1.
Reusable cluster wiring for non-boot consumers:
- cluster/stack.go AssembleStack(cfg) returns a Node + Router +
membership + internode wired the same way the boot does, so the
pg-harness in cluster mode joins the gossip ring as a real peer.
Tests: peer_state_test (4 cases including per-peer isolation),
stderr_adapter_test (LRU bounded + LRU evicts oldest, not random),
service_test (removeMonitorsByNode evicts only entries on the
departed node, leaves others). golangci-lint clean across all
modified packages.
* feat(cluster): 3-tier name registry + raft hardening + KV drivers
Three cohesive workstreams that lay the groundwork for hosting userland
on top of the runtime cluster.
1. Three-tier process name registry. LOCAL and GLOBAL existed; this lands
the EVENTUAL tier for ~100k user-session-class names that cannot pay
Raft cost per registration. system/eventualreg implements an ORSWOT
CRDT with 64-shard digests + anti-entropy via memberlist push/pull,
tombstone GC with safe-counter tracking, OTel telemetry. The membership
delegate at cluster/membership/membership.go now multiplexes multiple
UserDelegates over the gossip channel (kind:1|len:4|payload framing)
so eventualreg, kveventual, etc. coexist. Triangular shadow-check in
PIDRegistry blocks LOCAL names from shadowing GLOBAL/EVENTUAL.
2. Re-registration flood bounds + raft rebalancing hardening. Watermarked
reestablishMonitors avoids full FSM rescan on leader failover. Chunked
applyRemoveNode (256 names/Apply) lets foreground writes interleave
during node-departure cleanup. pg discover fan-out capped at 4 random
peers so simultaneous restarts don't produce N^2 messages. New
runtime_name_reregistrations_total{scope} counter feeds the soak gate.
Raft additions: voter-op metrics + 60s churn-burst counter, proactive
dead-voter eviction (peerStateTracker streak + gossip both red),
domain-aware stickiness (same-domain victim preferred when swapping),
defensive quorum-safety clamp, live max_voters reconfig.
3. Cluster KV drivers. New api/kv interface (Get/Put/Delete/CAS/Watch/Scan,
Mode={Raft,Eventual}, ProviderRegistry). system/crdt extracted as a
generic ORSWOT base; kveventual rides it for gossip KV with
UserDelegate kind 0xE2 (vs eventualreg's 0xE1). kvraft runs its own
hashicorp/raft replication group on a separate port via
raft-boltdb/v2 storage; FSM commands Put/Delete/CAS/ReapTTL with
periodic leader-driven TTL reap, per-namespace watch hub with
overflow-drop + metric, snapshot/restore round-trip. Boot wiring
composes both backends into a single ProviderRegistry. No new
dependencies — uses libraries already in go.mod (hashicorp/raft,
raft-boltdb/v2, bbolt, go-msgpack/v2).
Tests: 60+ new unit tests across new packages. Full runtime suite green
(272 packages); lint clean. Covers ORSWOT merge correctness, two-replica
convergence, shard hash determinism, tombstone GC, FSM Put/Get/CAS/TTL,
snapshot+restore round-trip, watch-hub fan-out + overflow.
* fix(cluster): cluster-mode bugs found in chaos rig
Five bugs surfaced when running runtime + pg-harness in the k3d monkey
cluster under active network-partition + delay chaos. Each is fixed
at the root, not papered over.
1. kveventual telemetry label-set mismatch. The bootstrap registered
kveventual_space_open_total with {node} but recordSpaceOpen recorded
with {node, space}, panicking the prometheus exporter on first
space.Open. Same shape on _op_total/_bytes_total/_tombstones_gc_total
where space-level recorders were missing the `node` label entirely.
Bootstraps now use the FULL label set the recorder will later use;
space.go now carries `node` and includes it in every label map.
2. runtime_name_reregistrations_total label-set divergence between
globalreg.recordReregistration (`{scope}`) and eventualreg
(`{node, scope}`). Same exporter panic. globalreg now takes node as
a param so both writers register the same descriptor.
3. cluster.Stack.Start leaked the memberlist gossip port on Membership.
Start failure. memberlist.Create binds the TCP listener BEFORE Join,
so a Join failure left port 7948 bound and any retry hit "address
already in use". Stack.Start now tears membership down on failure,
making the harness retry path actually retryable.
4. kveventual_entries gauge was never set anywhere — dashboards blind
to KV size. Added space.setEntriesGauge() called from Put/Delete/
applyFrame and bootstrapped with {node, space, state} label set.
Verified end-to-end in the cluster: 7+ min soak under
network-partition-50pct + network-delay-200ms-jitter-100ms with all
4 pods at 0 restarts and all hard correctness gates green.
* fix(gossip): budget-aware Drain so user-broadcasts fit MTU
Bug 7 root cause: BroadcastQueue.Drain() ignored memberlist's per-cycle
byte budget and returned ALL pending frames. With the kveventual chaos
workload pushing ~470 ops/sec, each gossip tick produced 30+ frames
totaling 40+KB. memberlist packed them into one compound UDP packet
that exceeded MTU and was silently dropped at the IP layer (or by the
kernel UDP send buffer).
Fixes:
1. crdt + eventualreg BroadcastQueue.Drain(headerOverhead, byteBudget):
- Honor byteBudget — accumulate frames whose total cost
(len(frame) + headerOverhead) fits the budget.
- Re-queue undrained entries at the head of pending; next gossip
cycle picks up where this one left off (no data loss, just
pacing).
2. eventualreg.Service.DrainBroadcasts, kveventual.space.drainBroadcasts:
- Plumb byteBudget through to inner BroadcastQueue.
3. eventualreg.Delegate.GetBroadcasts, kveventual.Service.GetBroadcasts:
- Pass memberlist's `limit` (not _) to inner.
- In kveventual: apportion remaining budget across spaces; once a
gossip cycle is full, untouched spaces keep their entries pending.
4. cluster/membership outer multiplex GetBroadcasts:
- Pass `limit` (not `limit-5`) to inner — the 5-byte mux header is
part of muxOverhead, not a separate budget reservation.
- Add gossip_user_getbroadcasts_overshoot_total counter that fires
if an inner delegate exceeds its budget (defensive observability).
5. cluster/membership telemetry:
- Bootstrap gossip_message_total{kind="user", direction=rx|tx} so
dashboards/gates can distinguish "metric never registered" from
"no user-broadcasts received this window."
6. New regression tests in crdt/delta_budget_test.go assert Drain
honors a tight budget and re-queues; an absurdly tight budget
produces zero frames and keeps the queue intact.
7. New cluster/membership/multiplex_test.go: two-node test verifies
the multiplex routing actually delivers a user-broadcast end-to-end.
* fix(membership): pipe memberlist logs through zap; never silence ERR
When memberlist's stdlib logger was redirected to io.Discard in non-
verbose mode, every "[ERR] Failed to decode user message", "[WARN] Was
able to connect over TCP but UDP probes failed", and similar diagnostic
got dropped on the floor. That is exactly what masked Bug 7
(network-policy blocking UDP/7946) for an entire soak: the cluster
looked healthy from our perspective while the operator-visible failure
surface was reduced to "convergence is slow."
New behavior:
- Always surface [ERR]/[WARN] memberlist lines via zap at matching level.
- Drop [INFO]/[DEBUG] unless VeryVerbose.
- Lines are buffered and split on newline so the underlying log.Logger's
prefix bytes don't get fragmented into multiple zap records.
Add memberlistLogWriter (cluster/membership/log_writer.go) and wire it
in as mlConfig.LogOutput.
* fix(gossip): fair budget split across user delegates in outer multiplex
Bug 8: with eventualreg's broadcast queue constantly full (4096
entries pending under workload), the outer multiplex GetBroadcasts
loop hands it the entire per-cycle UDP budget on every call. After
this commit's predecessor (Bug 7 fix) re-enabled UDP gossip, the
symptom became visible: kveventual_bytes_total{dir=tx} went 0/s while
eventualreg shipped at 39 KB/s. kveventual entries on peers froze at
the first few that landed before eventualreg's queue saturated.
Two-pass fair share:
1. Pass 1: each delegate gets limit/N. Guarantees no delegate is
starved by a busy peer.
2. Pass 2: remaining budget redistributed to whoever still has
pending entries. Recovers throughput when one delegate is idle.
Plus deterministic iteration order (sort by Kind) so behavior doesn't
depend on Go's map iteration randomness — a bug like this is much
harder to reproduce when the order is whatever the hash table happens
to enumerate.
Inner delegates already re-queue undrained entries (Bug 7 commit), so
shrinking the per-call budget is not data loss; subsequent gossip
ticks pick up the rest.
* fix(crdt,eventualreg): Drain emits a short frame when budget < MaxFrameBytes
Follow-up to the budget-aware Drain (Bug 7 / 5a8fcfae) and the fair
budget split (Bug 8 / 8b7da4ae). Verified on the chaos cluster: even
after the fairness pass, kveventual still drained at ~0 bytes/s while
eventualreg shipped ~39 KB/s. The remaining starvation is in Drain
itself.
The previous logic flushed a frame only when accumulated entries hit
perFrameLimit (= MaxFrameBytes - headerOverhead, ~1385 bytes). When
the outer multiplex hands a per-delegate share of ~699 bytes (limit/N
across two delegates), Drain would pack ~17 entries up to ~1385 bytes
and only THEN attempt to flush. The flush check (bytesEmitted+cost
<= byteBudget) then fails (1400 > 699), so the frame is discarded and
the loop breaks with NO frames emitted. The entries stay queued, the
next call sees the same state, and the delegate makes zero forward
progress no matter how often it is called.
Fix: introduce activeFrameCap = min(perFrameLimit, byteBudget -
bytesEmitted - headerOverhead). Frames flush as soon as they fill the
active cap — which is correctly ~684 bytes when the budget is 699
and 0 byte emitted. Result: a short frame is emitted, ~8 entries
clear the queue per call, and forward progress is bounded only by the
fair budget share.
Added TestBroadcastQueue_DrainSmallBudget pinning the 699-byte case.
* fix(observability): bootstrap rare safety-gate metrics
Bug 9: validate-chaos-impact.sh's hard gates for
runtime_name_reregistrations_total (W2.4) and kv_watch_dropped_total
(W6.3) silently report MISSING whenever no event has ever fired for
that metric. Without bootstrapping, "MISSING" looks identical to
"feature not deployed" and the safety net is effectively dormant.
Fix: bootstrap each metric to zero at telemetry init with the same
label set the recorder will later use, so:
- "metric absent" = subsystem not deployed (true MISSING)
- "metric at 0" = deployed and so far healthy
- "metric > N" = real signal, gate trips on threshold
Bootstraps added:
- globalreg/telemetry.go: runtime_name_reregistrations_total{scope=global}
(newTelemetry now takes localNode for the bootstrap label)
- eventualreg/telemetry.go: runtime_name_reregistrations_total{scope=eventual}
- kveventual/telemetry.go: kv_watch_dropped_total{space=_init, mode=eventual}
- kvraft/telemetry.go: kv_watch_dropped_total{space=_init, mode=raft}
Tests updated to pass localNode to globalreg.newTelemetry.
* fix(observability): expose BroadcastQueue dropped counter
Bug 10: BroadcastQueue.Push() rejects entries when the bounded queue is
full and increments q.dropped, but the counter has never been wired to
Prometheus. Under sustained workload (eventualreg queue stays at 4096
in the chaos cluster) data IS being dropped silently — operator sees
queue depth = cap and has to infer drops from receiver-side gaps.
Wire q.Dropped() into telemetry on every gossip drain:
- eventualreg/telemetry.go: setQueueDropped() + bootstrap
eventualreg_queue_dropped_total{node} at zero.
- eventualreg/service.go DrainBroadcasts: publish current Dropped()
every cycle.
- kveventual/space.go drainBroadcasts: same per-space, gauge label
{node, space} so per-space saturation is visible.
- kveventual/telemetry.go: bootstrap kveventual_queue_dropped_total
with the {_init} space at zero.
Gauge (not Counter) because q.dropped is already a cumulative monotonic
counter held in memory; setting the gauge each cycle makes Prometheus
rate() compute drop rate the same way it would for a counter, with no
delta-tracking complexity.
* fix(raft): implement RequestPreVote on peerStateTracker so pre-vote enables
Bug 12: hashicorp/raft logs at startup
"raft: pre-vote is disabled because it is not supported by the Transport"
because peerStateTracker (our Transport wrapper for per-peer dead
backoff) didn't implement WithPreVote — only AppendEntries, RequestVote,
and AppendEntriesPipeline.
Without pre-vote, a partitioned node keeps incrementing its term during
isolation (Raft requires it to bump term on every election timeout
that fails to reach quorum). On rejoin the cluster sees a term higher
than the current leader and forces a step-down — even though the
cluster was healthy. Observed in the chaos cluster: runtime-0 logged
24 election timeouts in 5 minutes, terms inflated 144 → 166.
Add RequestPreVote following the same per-peer dead-backoff pattern
as RequestVote. The inner is hraft.NetworkTransport which natively
implements WithPreVote in v1.7.3; the type assertion guards against
future regressions where someone wraps a non-pre-vote transport
without realizing it.
Test pin: TestPeerStateTracker_SatisfiesWithPreVote asserts the tracker
satisfies the WithPreVote interface and that RequestPreVote forwards
to the inner transport.
* fix(crdt): single-lock snapshot in ShardHash to remove nil-deref race
Bug 13: harness panicked under sustained workload with TCP push/pull
anti-entropy active (after Bug 11 fixed netpol egress to 7948):
panic: runtime error: invalid memory address or nil pointer dereference
...
system/crdt.(*State).ShardHash ... state.go:388
system/crdt.MakeDigest ... digest.go:28
system/kveventual.(*Service).MergeRemoteState ... service.go:300
Root cause: ShardHash held the shard RLock to snapshot keys, RELEASED
it across sort.Strings, then re-acquired it for per-key entry lookups.
Between the two locks, an Unregister could remove a key — sh.entries[k]
returned nil and `e.Node` on the next line dereferenced nil.
The original split-lock design avoided holding the lock during sort,
but sort.Strings on ~1500 keys is fast (microseconds) compared to the
risk; safer to take a one-shot snapshot of {key, node, counter, wall,
deleted, value} under a single RLock and hash from the slice. Entry
fields are cheap-copy primitives plus a value []byte — and entry
values are never mutated in place (Overwrite/Apply always swap in
*Entry copies via cloneBytes), so aliasing the slice after unlock is
safe.
Test pin: TestShardHash_ConcurrentDelete (with -race, 8 readers ×
ShardCount × 200ms hammering ShardHash while a writer churns
Overwrite/Unregister) reliably panicked on the old code; passes
cleanly on the new implementation.
* fix(membership): rejoin loop when memberlist becomes isolated
Bug 14: when all peers cycle through chaos at once (e.g., the 3 runtime
pods all hit container-kill in the same window), the harness's
memberlist drops every peer and ends up alone — and stays that way
forever. memberlist has no built-in auto-rejoin: peers coming back
have a fresh empty member list with no record of the harness and can't
reach it via gossip.
Symptom in the chaos cluster:
- harness logs show "node left" for runtime-0/1/2 in a ~50s window,
then NOTHING for the next 10+ minutes.
- harness workload prints members_in_first_group=1 (just self).
- gossip_user_getbroadcasts_calls_total = 0/s on harness (no other
peers to gossip to → memberlist's gossip() loop returns early).
- runtime peers' eventualreg/kveventual entries flatline at near-zero
because the harness's data never propagates.
Fix: add rejoinLoop, started from Service.Start when JoinAddrs is set.
Every 30s, if Members() == [self], re-runs memberlist.Join with the
configured addresses. Fast path is the no-op check; under normal chaos
where peers cycle individually, the loop never triggers.
Reaction time chosen to be slow on purpose — single-peer flapping
recovers via gossip in seconds; this loop only catches the "all
peers gone at once" edge case.
* fix(otel): align histogram buckets with prometheus DefBuckets
Bug 15: pg_op_p99_seconds reported 4.9s in the chaos cluster while the
underlying observations were sub-millisecond. Cause: dual export.
The runtime feeds metrics into both the in-pod Prometheus exporter
(client_golang, DefBuckets = 0.005..10s) AND the OTel exporter (push
to otel-collector → prometheus_remote_write). OTel's SDK default
histogram bucket boundaries are 5, 10, 25, 50, 75, ..., 10000 — wildly
different scale, intended for second-grade observations.
Both paths land in the same Prometheus instance with the same metric
name. Prometheus joins them into one series whose `le` label values
are the UNION of both bucket sets, but observations from each path go
only into THEIR side's buckets. histogram_quantile() then mis-attributes
the OTel-side buckets (le=25..10000) as outlier observations, dragging
the computed p99 to ~5s.
Fix: in service/otel/metrics.go's getOrCreateHistogram, supply
WithExplicitBucketBoundaries that mirror prometheus.DefBuckets exactly.
Both paths now write to the same bucket set; the merged series is a
true sum and percentiles compute correctly.
* fix(eventualreg): rename antientropy_duration_ms to _seconds, divide by 1000
Bug 16: eventualreg_antientropy_duration_ms_bucket received millisecond
values into seconds-grade DefBuckets [0.005..10s]. Real anti-entropy
rounds take 10-200ms, so every observation hit the +Inf bucket and the
histogram was effectively unobservable.
Surfaced after Bug 15 made histogram bucket configuration consistent
across export paths — once the wrong-magnitude problem stopped being
masked by spurious OTel-side buckets, the +Inf-only distribution
became visible.
Fix: convert at the observation site (durationMs/1000.0) and rename to
follow Prometheus convention (_seconds). The recordAntiEntropy
signature still accepts durationMs because the caller computes
time.Since(start).Milliseconds() — keeping that interface avoids a
ripple of unit changes through the call chain.
* fix(raft): per-role last_contact ceiling — 30s voter, 5min non-voter
At 60+ replicas under chaos the leader's heartbeat fan-out cannot keep
every non-voter follower's last_contact below 30s, so /livez 503s on
non-voters and kubelet cycles them — observed cascade at scale=75/100
in the break-test.
Voters still gate quorum and must be timely (30s). Non-voters are
replication-only; permanent isolation is detected via the gossip-side
health check (cluster.gossip), not raft heartbeats.
Adds Node.IsVoter() reading the local entry from GetConfiguration().
* fix(pg): raise pg.broadcast_recent /livez ceiling to 5min
Same shape as Bug 21 (raft.last_contact). Under continuous random
network-partition chaos, a pod can be cut off from gossip TX/RX for
tens of seconds at a time; the previous 30s ceiling tripped /livez 503
regularly and kubelet cycled the pod, cascading the cluster.
cluster.gossip + raft.last_contact (per role) already detect truly
disconnected nodes. This check now only fires when the pg subsystem
itself stalls — event loop wedged, all peers gone, retry queue saturated.
* chore(F-series): F1.5 + F3.1 + F3.3 — log/eventbus instrumentation
F1.5 eventbus: cap subscribers at 4096 (DefaultMaxSubscribers); reject
with ErrSubscribersCapReached; emit eventbus_subscribers_active
gauge + eventbus_subscribers_rejected_total counter; SetCollector
bridge from metrics boot. Cap surfaces a runaway leak instead of
silently OOMing the process.
F3.1 pg: replace per-event WARN at the 7 submit() backpressure-drop
sites with the existing pg_queue_dropped_total counter; gate the
submit() queue-full / approaching-capacity log on atomic state
transitions only. Under partition storm: 0 WARN/s/node observed
(previously thousands).
F3.3 cmd/internal/logger + system/logs: add zap.Hook-driven
runtime_log_emissions_total{level,component} counter, bound to
the metrics collector through the SetEmissionCollector bridge
after the boot subsystem is up.
* fix(raft): instrumentedTransport must explicitly implement RequestPreVote
Bug 23: peerStateTracker.RequestPreVote (Bug 12) type-asserts its inner
to hraft.WithPreVote so it can forward the call. The chain in
production is peerStateTracker -> *instrumentedTransport -> *NetworkTransport.
*NetworkTransport implements WithPreVote, but *instrumentedTransport
embeds the hraft.Transport INTERFACE — which the hashicorp/raft team
defined to NOT include RequestPreVote ("wasn't in the original
interface specification"). Embedded interface method promotion
exposes only the interface's method set, not the concrete value's,
so *instrumentedTransport did not satisfy WithPreVote and the
assertion at peer_state.go:129 failed on every pre-vote RPC.
Visible only post-Bug-21: previously kubelet was killing the leader
on /livez 503 before the election storm could be observed. Once Bug
21 stopped the cycling, the leader pod (wippy-runtime-0) accumulated
22 restarts in ~10min via the storm.
Fix: implement RequestPreVote explicitly on *instrumentedTransport,
forwarding via the WithPreVote assertion on its concrete inner.
Also adds raft_request_pre_vote_total / _duration_seconds telemetry
mirroring RequestVote.
Locked with TestInstrumentedTransport_PreservesWithPreVote so a
future refactor of either wrapper would surface the regression at
test time, not in a 60-pod cluster.
* feat(chaos): admin surface, raft expose, eventualreg shard-pull, gossip boot grace
- system/admin: minimal read-only HTTP surface for the chaos harness
(raft status/log-head/configuration via shared dispatcher, eventualreg
digest, membership members). Hard-requires its deps from app context.
- system/raft: expose CommitIndex() + Term(); clamp LogHead to the
commit index so cross-pod comparison reflects Raft's Log Matching
Property (uncommitted suffix may legitimately diverge after partition
heal). Read CommitIndex before LastIndex to make the clamp
race-resistant.
- system/eventualreg: wire the dormant shard-pull anti-entropy path.
MergeRemoteState now emits FrameTypeShardRequest to the divergent peer
over reliable TCP; receiver replies with FrameTypeShardResponse.
Closes the silent-drop convergence hole (BroadcastQueue overflow had
no recovery channel beyond GC). 5s per-peer cooldown + telemetry
counters for sent/suppressed/send_error and rx/tx shard responses.
- cluster/membership: SendUserMessage helper that wraps memberlist
SendReliable with the same kind/len multiplex header the broadcast
path uses — single hop for targeted request/response.
- boot/components/system/cluster: 60s grace window on the cluster.gossip
liveness health check. Non-bootstrap pods rejoining the cluster used
to exit 0 mid-rejoin when kubelet SIGTERM'd them after 3 probe
failures against an above-ceiling HealthScore; with the grace they
stay alive long enough to converge.
- boot/components/system/admin: hard-required deps (membership,
globalRaft, eventualReg). Fail-loud at boot instead of nil + 503.
- govet inline: reflect.Ptr → reflect.Pointer rename across the
interpolator/payload/registry packages.
- gosec G124: nolint annotation on outgoing client cookie at
service/http/client/handler.go — Secure/HttpOnly/SameSite are
server-set attributes ignored by (*http.Request).AddCookie.
* fix(globalreg): forwarded apply returns typed FSM result
The follower-side forward path returned (nil, nil) on success, dropping
the typed FSM response. Callers such as Register that asserted the
result to *RegisterResult then failed with "unexpected register response
type: <nil>" — successful registrations on the leader looked like
errors on the follower.
The response envelope is now versioned. v0 (legacy [8B corrID][errMsg])
remains decodable for mixed-version clusters. v1 prepends a magic byte
+ version byte, then a msgpack body carrying the command kind, an
optional error string, and an opaque typed-result blob. The leader
encodes *RegisterResult / *UnregisterResult / *RemoveResult via wire
structs (errors are round-tripped as strings); the follower decodes by
command kind and returns the typed pointer.
Adds globalreg_forwarded_apply_total{cmd,result} and
globalreg_forwarded_apply_latency_seconds{cmd}.
Empirical proof under tests/proof/run_follower_register.sh (kept
untracked alongside the existing rig): two follower probes contend 50
GLOBAL registrations each on the same name set. The script PASSes
(exit 0) with the fix and reproducibly FAILs with the exact pre-fix
error string when the typed-result return is reverted.
* refactor(globalreg,eventualreg): unify Lookup behind options
Collapse Lookup / LookupWithFence / LookupByPID / ValidateFence into a
single Registry.Lookup(ctx, name, opts...) returning a rich LookupResult.
The three legacy methods stay on the interface as // Deprecated forwarders
for one transition cycle; new globalreg.ValidateFence(ctx, reg, name, token)
helper drops the FSM-read indirection at the relay fence hot path. Same
options surface mirrored on eventualreg (WithFence yields token=0; ByPID
unsupported - eventual has no reverse index).
* refactor(process): name-based send resolves and fences internally
Introduce a single system/process.ResolveDestination(ctx, dest) Go helper
that parses raw PIDs, then consults globalreg (fence-bearing), eventualreg
and the local PIDRegistry in order. The Lua process module now delegates
every send/terminate/cancel/monitor/link path to it and drops the
per-LState fenceCaches sync.Map (stale-fence footgun after re-registration).
The legacy process.registry.lookup_with_fence and validate_fence are
moved under process.registry.debug.* and the top-level aliases keep
working but emit a one-shot deprecation banner per LState. spec.md
updated; relay router test mock fixed to satisfy the T2 helper path.
globalreg.RecordFenceRejection emits an info log so chaos rigs without
a metrics scrape endpoint can correlate stale-fence drops by name and
reason.
* feat(globalreg): ROOT all-live-node ack scope; CONSISTENT + EVENTUAL split
Adds the strict ROOT scope (Raft singleton + all-live-node ack on the
committed epoch within a deadline) alongside the existing CONSISTENT
(formerly Global) and EVENTUAL scopes. The Lua surface exposes the four
constants LOCAL / EVENTUAL / CONSISTENT / ROOT; Global stays as a
back-compat alias.
The FSM gains CmdRegisterPending/Ack/Expired/Unreserve and a pending
map alongside the active names map. Acks flow over the relay; the
leader applies them through Raft, promotes pending -> active inline
once the set covers RequiredNodes, and a leader-side timer commits
CmdRegisterExpired when the deadline elapses. The membership snapshot
is captured at the pending commit and replayed deterministically across
replicas. RegisterScope blocks until Active or Expired or ctx.Done.
Adds /admin/globalreg/pending plus globalreg_root_pending/active/expired/
ack/release counters and a globalreg_root_pending_in_flight gauge.
* feat(internode): per-class sub-protocol multiplexer
Adds a 1-byte sub-protocol Class to the internode frame header so the
wire is self-describing and receivers can dispatch by class without
inspecting payload. protocolVersion is bumped 0x01 -> 0x02 and
frameHeaderSize 5 -> 6.
The existing class enum (RaftControl, Gossip, PGBroadcast) is extended
with ClassRaftMesh (wire byte 0x03), reserved for the mesh-backed Raft
transport landing in T5.2.
- NodeConnection.Send takes a Class; Run delivers it back to the
handler.
- DrainMessages returns []Outbound preserving the original class so
requeue-on-disconnect honors QoS instead of forcing every pending
frame into ClassPGBroadcast.
- ConnectionManager.RegisterClassReceiver lets a sub-system claim a
Class so its inbound frames bypass the default onMessage callback;
unclaimed classes flow through the existing path unchanged.
- State-manager queues sized to numClasses=4, ClassRaftMesh defaults
to drop-oldest with cap=4096 (etcd raft semantics).
mux_test.go covers wire round-trip per class, concurrent class senders
not interleaving frames, invalid wire-class surfacing a structured
protocolError, and the live RegisterClassReceiver dispatch on a real
ConnectionManager pair.
Local validation:
- go build ./...: clean
- go vet ./...: clean
- golangci-lint run ./...: 0 issues
- go test -race ./cluster/internode/...: pass
- go test -race ./...: pass except the pre-existing flaky
TestSupervisor_ServiceTimeout (reproduces identically on the
pre-T5.1 head f9414fc4)
- tests/proof/run_chaos.sh: pre-existing failure on f9414fc4 too
(raft leader-election starvation under loopback chaos); orthogonal
to this change which touches only internode wire format and
per-class dispatch
* feat(raft): Raft over Wippy internode mesh (T5.2) + rot sweep
Adds meshStreamLayer riding cluster/internode's ClassRaftMesh; default
Raft path now uses NodeID addressing with no separate TCP listener.
Bundles companion fixes: desiredVoterCount(2,_)=2 to stop demoting a
healthy 2-voter quorum post-failure, supervisor test ctx loosened to
15s, mesh nil-map panic on Close + regression test, deprecation
markers on legacy raft config fields.
* test(globalreg,raft,kv): close PR241 reviewer test gaps
Adds four reviewer-asked tests:
- TestConsistentScope_ConcurrentCrossNode (R1.a): cross-node
concurrent CONSISTENT registration; one winner, one
ErrNameAlreadyRegistered, both observe the same authoritative PID.
- TestRunReconcileOnce_NodeLeftRemovesServer (R1.c): membership
handler emits exactly one RemoveServer for a vanished peer.
- TestKVRaft_CASLinearizableAcrossNodes (R1.d): two services share
one Raft log; 100 concurrent CAS-increments produce final counter
= 100, zero lost updates.
- TestKVEventual_ConvergenceAcrossNodes (R1.d): disjoint Puts
converge after one gossip pump round; deletes propagate as
tombstones; wall-floor GC permits resurrection.
The partition/rejoin catch-up rig (R1.b) lives under tests/proof/
(untracked, like the other chaos rigs) and verifies via
/admin/raft/{status,configuration} that node-3 catches up after a
SIGSTOP/SIGCONT cycle covering 50 CONSISTENT registrations against
the surviving 2-voter quorum.
* perf(cluster): cut idle CPU of pg/raft/gossip clusters
Profiling a 60-node cluster showed the raft leader burning ~23% of its
800m CPU limit while idle — O(N) fan-out: the leader replicated raft
AppendEntries to all 59 peers through a deep transport, with ~77% of
the cost being non-IO overhead.
Changes:
- raft: heartbeat/election 1s->3s, commit 50ms->500ms (fewer idle AEs)
- raft: cap membership to voters + a bounded standby pool, so the leader
fans AppendEntries out to ~9 peers regardless of cluster size
- internode: writeLoop drains the per-class queues directly — drop the
second queue, the nodeControlLoop hop, and the redundant 10ms ticker
- internode: emit the queue-depth gauge only when it changes
- gossip: memberlist gossip interval 100ms->500ms
Results (idle, fresh 60-pod k3d cluster, before -> after):
- raft leader CPU: 184m -> 14m (23% -> 1.75% of the 800m limit)
- followers: ~40m -> ~7-9m
- cluster mean: ~40m -> 6m
Verified: full `go test ./...` + -race green; golangci-lint clean;
run_raft_over_mesh.sh and run_pg.sh PASS; raft quorum, leader failover
and pg broadcast continuity intact; zero internode/pg drops.
* perf(pg): per-group RCU snapshot, O(N) -> O(M_g) on join/leave
Replace global stateSnapshot rebuilt on every mutation with per-group
snapshots in sync.Map. State mutators mark a dirty set; publishDirty
drains only touched groups at the end of each event-loop closure.
At 100k groups populated, Join+Leave drops from 43.5ms to 94us (~463x),
allocs from 41MB to 2.14KB. BenchmarkPGSnapshotGroup confirms O(M_g):
M=1->34ns, M=100->1.06us, M=10k->208us. Cluster smoke (9-node docker)
passes Phase 1: 89k joins / 30s, 100% in p99<=50ms bucket, 0 drops.
Includes BenchmarkPG* suite + tests/proof/run_pg_bench.sh benchstat
regression gate against tests/proof/pg_bench/baseline.txt.
* perf(pg): batch broadcast across remotes, O(k*R) -> O(R) per JoinGroups
broadcastJoin / broadcastLeave now take map[group][]pid.PID and send one
packet per remote regardless of how many groups the call touches. The
prior per-group fan-out emitted k*R packets for JoinGroups(k); now it
emits R. encodeJoinsPayload also drops per-pid []any interface boxing
in favor of map[string][]string.
JoinGroups(1000) at R=16 peers: ~16k synthetic sends (~160ms theoretical)
collapses to 16 actual sends, 724us measured -- ~220x. Single Join at
R=64 peers: 53us vs ~640us theoretical (~12x). Adds near-flat scaling
in R: jumping R=4 -> R=16 at k=1000 costs +2%.
Bonus: local hot paths get -5% allocs and basal Join+Leave -33% (3.07us
-> 2.06us) from the cheaper encoding. Handlers stay backward compatible
with the legacy {group, pids} / {pids, groups} payload shapes for
rolling upgrade.
Adds BenchmarkPGBroadcast_SingleJoin (R sweep) and BenchmarkPGBroadcast_
Batch (R,k sweep) plus tests/proof/pg_bench/bench.broadcast.txt for
benchstat. go test -race ./system/pg/... ./api/pg/... ./runtime/lua/
modules/pg/... clean. golangci-lint clean.
* perf(pg): pool done-chan, skip OTel span when tracing disabled
Two hot-path low-risk wins for Join/Leave/JoinGroups/LeaveGroups:
1. sync.Pool for the buffered cap-1 chan error used to signal the
event-loop closure back to the caller. Acquired per call, released
on the success or submit-failure path. The ctx-cancelled path does
not release (the orphaned closure may still write later); that is a
bounded leak since Stop is rare.
2. Tracing is now opt-in via an explicit non-nil TracerProvider passed
to newTelemetry. When nil, startSpan returns a shared no-op span
and every span-related allocation collapses to zero. The call sites
gate trace.WithAttributes(...) construction behind t.tracing too, so
the variadic SpanStartOption slice is no longer built for the noop
case.
Combined effect on the Go micro-bench (M4, no-tracer config):
JoinLeave_Basal 2.06us -> 1.71us (-17%)
JoinLeave_ManyGroups N=1000 4.08us -> 3.44us (-16%)
Join_Parallel (10 prod) 2.83us -> 2.52us (-11%)
Broadcast_SingleJoin R=1 3.98us -> 3.41us (-15%)
Allocations halve on the common paths:
JoinLeave_Basal 2080 B/op -> 1008 B/op (-52%)
Join_Parallel 3120 B/op -> 2085 B/op (-33%)
Broadcast_SingleJoin R=1 4236 B/op -> 3196 B/op (-25%)
go test -race ./system/pg/... ./api/pg/... ./runtime/lua/modules/pg/...
clean; golangci-lint clean. tests/proof/pg_bench/bench.lowrisk.txt
captures the new baseline; benchstat against bench.broadcast.txt for
the A/B.
* feat(cluster,lua): move Membership context to api/cluster + expose system.node{,s}
Two changes that close out the boot-local membership TODO and give Lua
processes first-class access to cluster identity.
api/cluster/context.go (new): WithMembership / GetMembership helpers
keyed on cluster.membership in the AppContext. Replaces the boot-local
membershipServiceKey + WithMembership pair that lived in
boot/components/system/cluster.go and was marked // todo move to api.
Because AppContext keys compare by pointer identity, the migration is
atomic: the boot key is removed and all six readers (admin, kvraft,
kveventual, eventualreg, pg, raft) switch to clusterapi.GetMembership
in the same change. The two readers that need the concrete
*membership.Service (kveventual, eventualreg, for RegisterUserDelegate
and the sender/inventory adapters) keep their type assertion after
resolving the interface.
runtime/lua/modules/system/node.go (new): system.node.id() returns the
local node id via relay.GetNode; system.nodes.list() returns
[{id, is_local, addr, meta}] from cluster.GetMembership, de-duped
against the local node, sorted local-first then by id. Both are gated
by security.IsAllowed("system.read", ...). Wired into module.go with
two new immutable subtables; module_test.go shape-checks them and
node_test.go covers the success / no-relay / multi-node / permission
matrix.
Also confirms TestMultiplex_TwoNodes_UserBroadcastDelivers in
cluster/membership is not flaky: 5 isolated runs all pass in 4.2-4.4s
on this host. The 11s observed earlier was busy-host noise within the
test's 30s context and 10s eventual-consistency wait.
grep -r membershipServiceKey returns zero. go test -race + golangci-lint
clean on the affected packages.
* feat(cluster): bounded-raft foundation, STRONG/CONSISTENT name planes, system SDK
Architecture: write plane (non-member -> derived raft member -> leader, NodeID
over internode), distribute+route plane (leader-driven gossip of committed
name->PID bindings to all nodes, LWW by raftIndex, local cache), STRONG
composes all-node ack + join barrier + leader-reachability monitor on top.
Write plane (#32): system/raft.DeriveMembers exports the existing pure
selection pipeline; non-member forwards to a derived member, which
re-forwards via authoritative Leader() with hop-cap 1, election-safe. All
addressing is NodeID over the internode mesh. No raft_port, no raft_member
marker, no leader-identity gossip.
Distribute+route plane (#33): system/globalreg/dissem.go leader-driven
UserDelegate (Kind 0xC1) on membership gossip; LWW by raftIndex. Local cache
+ cold-miss forward-resolve via a member; 30s anti-entropy; 15min wall-floor
tombstone GC. JoinNameEpoch snapshot seeds CONSISTENT bindings.
STRONG protocol (#20/#30/#31): all-node ack with local-non-presence
attestation + per-node exclusion table (PENDING->ACTIVE persists, indexed
release delivered to holders). Join-epoch barrier: leader snapshot of
PENDING UNION ACTIVE; revoke local conflicts before name-ready; epoch-scoped
acks. Leader-reachability monitor (debounced); gate closes on lost contact,
re-barrier on regain.
EVENTUAL CRDT (#21/#23/#24): per-origin ORSWOT; convergent join keyed by
(Priority, FNV64(name, origin)) with NO counter/PID/wall (transitivity).
Observed-remove tombstones; live beats cross-origin tombstone. NodeLeft reap
via per-origin in-place tombstone, broadcast with true origin (prior
local-origin stamping made the reap a no-op cluster-wide). Compact-ID intern
table reclaims via free-list + refcount; bounded under ephemeral identities.
nam…
wolfy-j
added a commit
that referenced
this pull request
Jul 4, 2026
- Keep api/net.ApplyOverlayPair as a deprecated thin wrapper over OverlayResolver so out-of-tree Go callers do not break (codex). - Network boot component fails loud (ErrFrameResolversMissing) instead of silently skipping registration when the registry is absent, so a wiring bug cannot degrade the overlay to clearnet (fable #1). - Remove the now-redundant SetMultiple(task.Context) from the wasm and lua Execute handlers: the function registry executor already applies task.Context to the same forked frame and is the sole caller (fable #3). - Drop the speculative FrameResolverOrderFSRoot constant and fs-root comment references; this PR migrates only the network overlay (fable #4). - Fix stale comments and the test section header (fable #5).
wolfy-j
added a commit
that referenced
this pull request
Jul 4, 2026
* refactor(runtime): generic frame-context resolver registry Frame-decorating options (the network overlay, and future ones like a filesystem root) were hand-wired into every dispatcher: the process manager and both the wasm and lua function lifecycles each called netapi.ApplyOverlayPair, leaking api/net into generic machinery and forcing every new option to touch every dispatcher. Introduce ctxapi.FrameResolvers: an ordered registry of (ctx, options) -> ([]Pair, error) resolvers, carried on the AppContext and populated once at boot. The two real choke points — the function registry executor (shared by lua and wasm) and the process manager Start — apply the whole set generically. Reads are lock-free (atomic snapshot, copy-on-write on register), mirroring the interceptor registry; the no-resolver path is ~1.4ns with zero allocations. Network migrates onto it: api/net.OverlayResolver replaces ApplyOverlayPair and is registered by the network boot component. The dispatchers no longer import api/net or api/fs; a new frame-context option is one resolver plus one Register line, with no dispatcher edits. * review: address codex + fable findings on frame resolvers - Keep api/net.ApplyOverlayPair as a deprecated thin wrapper over OverlayResolver so out-of-tree Go callers do not break (codex). - Network boot component fails loud (ErrFrameResolversMissing) instead of silently skipping registration when the registry is absent, so a wiring bug cannot degrade the overlay to clearnet (fable #1). - Remove the now-redundant SetMultiple(task.Context) from the wasm and lua Execute handlers: the function registry executor already applies task.Context to the same forked frame and is the sole caller (fable #3). - Drop the speculative FrameResolverOrderFSRoot constant and fs-root comment references; this PR migrates only the network overlay (fable #4). - Fix stale comments and the test section header (fable #5). * review: fail closed on missing frame resolver * review: harden frame resolver claims * test: isolate frame resolver claim tests
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Jira: EE2-1482
API:
where: l - zap logger
Scripts are a set of scripts to compile:
Where
scriptIdis an ID of script written in the previous step. Data (or second arg) is our argsmap[string]anyconverted using Lua engine.respis an READ-ONLY channel with the response (string). Automatically closed on finish, so reading can be done using the simple range expression: