[EE2-1481]: feature: VM pool by rustatian · Pull Request #1 · wippyai/runtime

rustatian · 2024-10-30T22:24:08Z

API:

To create a Pool:

p, err := NewLuaPool(l, scripts, WithModules(base64M.New()), WithPollTimeout(time.Second*2))

where: l - zap logger
Scripts are a set of scripts to compile:

scripts := make(map[string]*PoolCfg)
scripts[scriptId] = NewPoolCfg(2, script, "hello") // 2 - number of VMs to create for that script, script - tool, "hello" - name of the main function

Queue the task:

resp := p.Queue(NewPoolTask(scriptId, data))

Where scriptId is an ID of script written in the previous step. Data (or second arg) is our args map[string]any converted using Lua engine.

resp is an READ-ONLY channel with the response (string). Automatically closed on finish, so reading can be done using the simple range expression:

for range resp {}

Signed-off-by: Valery Piashchynski <piashchynski.valery@gmail.com>

… synchronous demonitor, monitor cleanup - Fix #1: Process exit leave events now include PID repeated per join occurrence (e.g. 3 joins -> leave event with [pid, pid, pid]), matching Erlang pg behavior for correct member count tracking. - Fix #2: handleSync uses differential comparison instead of full teardown+rebuild, emitting join/leave events only for actual PID changes. Eliminates spurious events on node reconnection/sync. Adds diffPIDs helper with correct multiplicity handling. - Fix #3: Monitor/Events unsubscribe is now synchronous - blocks until the event loop processes the removal, guaranteeing no events arrive after unsubscribe returns (matches Erlang pg:demonitor/2). - Fix #4: PG service now monitors subscriber PIDs via topology and automatically removes monitor subscriptions when the monitoring process dies (removeMonitorsByPID). Adds hasMonitorSubscriptions/ hasGroupMemberships helpers for correct demonitor lifecycle. All 147 Go tests pass, 0 golangci-lint issues, 47 Lua E2E tests pass.

* Add pg (process groups) module with review fixes Implement distributed named process groups following Erlang/OTP pg semantics. Fix CommandID collision with stream module, shared payload slice aliasing, and duplicated sendToMembers logic found during code review. ## Matching Semantics with Erlang pg | Feature | Erlang pg | This implementation | |---|---|---| | Named groups | Processes join arbitrary named groups | Same (Group = string) | | Multi-join | A process can join the same group multiple times and must leave the same number of times | Same (state tracks duplicate group entries) | | Auto-removal on exit | Processes are automatically removed from all groups when they terminate | Same (via monitorProcess / handleProcessExit) | | Cluster-wide membership | Group membership is visible across all connected nodes | Same (local + remote state, discover/sync protocol) | | Eventually consistent | No strong consistency guarantees; nodes converge via async messaging | Same (async broadcast of join/leave/sync messages) | | get_members / get_local_members | Separate queries for all-nodes vs local-node members | Same (GetMembers vs GetLocalMembers) | | which_groups | Returns all groups with at least one member | Same | ## Differences from Erlang pg | Feature | Erlang pg | This implementation | |---|---|---| | Scopes | Supports named scopes (multiple independent pg instances) | Single global scope, no scope parameter | | Monitor callbacks | pg:monitor/1,2 lets a process receive join/leave notifications | No equivalent, query-only | | Broadcast | No built-in broadcast; sending to members is left to the caller | Adds Broadcast and BroadcastLocal as first-class operations | | Consistency model | gen_server with ordered message processing per-scope | Single-goroutine event loop via actions channel (functionally equivalent) | ## Review Fixes - Move pg CommandIDs from 50-56 to 200-206 to avoid collision with stream module (50-57) - Add defensive payload slice copy in sendToMembers to prevent data corruption - Extract duplicated sendToMembers into shared broadcast.go helper * Fix pg dispatcher boot ordering: depend on pg system component The pg dispatcher was loading before the pg service was in context, causing GetProcessGroups(ctx) to return nil and silently skipping all command handler registration. Add explicit dependency on the pg boot component to ensure correct ordering. * Move PGDispatcher to system components to fix core-only boot PGDispatcher depends on the "pg" system component, but was included in dispatchers.All() which is part of core.All(). This broke TestCorePlugins because the "pg" node doesn't exist in core-only boot, causing dependency resolution to fail with a stuck node error. Move PGDispatcher to system.All() alongside PG() where it belongs, keeping the boot ordering guarantee while allowing core-only loading. * Add pg.events() for process group membership change notifications Implements Erlang PG monitor pattern: Lua processes can subscribe to join/leave events via pg.events(), which returns a subscription with a channel that delivers membership change events in real time. - Add MembershipEvent struct and event constants to api/pg - Add emitJoinEvent/emitLeaveEvent to pg service, wired into all join/leave paths including remote and process exit handlers - Add EventsYield and Subscription type to Lua pg module - Add unit tests, service event emission tests, and 4 integration tests using inline.Pool (97 pg module tests, 107 system/pg tests) * Implement full Erlang PG compliance: monitor, scopes, batch ops, RCU snapshots Close all gaps from PG compliance review: - Fix leaveRemote multi-join bug in state.go (Phase 1.1) - Add pg.monitor(group) with atomic snapshot+subscribe (Phase 2.1) - Update pg.events() to return groups snapshot as second value (Phase 2.1.5) - Add named scopes via pg.scope(name) with prefix isolation (Phase 3.1) - Add batch join/leave accepting table of groups (Phase 4.1) - Add which_local_groups() (Phase 4.2) - Add sub:close({flush=true}) via Channel.Drain() (Phase 4.3) - Lock-free reads via RCU atomic.Pointer snapshots (Phase 5.1) - Clear snapshot on Stop() so readers get nil after shutdown - Add publishSnapshot() to protocol test submit closures All 335 tests pass (139 system/pg, 17 api/pg, 136 module, 43 Lua E2E). * Fix Erlang PG semantics: leaveAllLocal preserves duplicates, LeaveGroups uses best-effort Two bugs found comparing against Erlang PG source (pg.erl): 1. leaveAllLocal was deduplicating groups, losing multi-join count for remote broadcast. When a process joins group A twice and exits, broadcastLeave now correctly sends A twice so remote nodes remove both occurrences. Event emission is still deduplicated (one event per unique group). 2. LeaveGroups was failing fast on first not-joined group, causing partial state mutation. Now uses best-effort semantics: tries all groups, skips non-members, returns ErrNotJoined only if NONE succeeded. All 134 Go tests pass, golangci-lint clean. * Fix 4 Erlang PG semantic gaps: event multiplicity, differential sync, synchronous demonitor, monitor cleanup - Fix #1: Process exit leave events now include PID repeated per join occurrence (e.g. 3 joins -> leave event with [pid, pid, pid]), matching Erlang pg behavior for correct member count tracking. - Fix #2: handleSync uses differential comparison instead of full teardown+rebuild, emitting join/leave events only for actual PID changes. Eliminates spurious events on node reconnection/sync. Adds diffPIDs helper with correct multiplicity handling. - Fix #3: Monitor/Events unsubscribe is now synchronous - blocks until the event loop processes the removal, guaranteeing no events arrive after unsubscribe returns (matches Erlang pg:demonitor/2). - Fix #4: PG service now monitors subscriber PIDs via topology and automatically removes monitor subscriptions when the monitoring process dies (removeMonitorsByPID). Adds hasMonitorSubscriptions/ hasGroupMemberships helpers for correct demonitor lifecycle. All 147 Go tests pass, 0 golangci-lint issues, 47 Lua E2E tests pass. * Fix spurious leave events in handleRemoteLeave and add tests leaveRemote now returns a map of actually-removed PIDs per group, and handleRemoteLeave only emits events for groups the PID was truly a member of. Adds 4 state, 2 protocol, 2 service, and 1 integration test to verify correct behavior. * Refactor PG to registry-based scope manager with service addressing Replace singleton ProcessGroups with per-scope ScopeService instances resolved via resource registry. Make dispatcher stateless by embedding service reference in commands. Replace synthetic PIDs with empty-UniqID service addresses routed by host-level relay dispatch. * fix: resolve internode connection and routing bugs for distributed PG Fix 12 bugs discovered during distributed PG cluster integration testing: - Fix PID generator using empty node ID (boot/components/core/infrastructure.go) - Fix address parsing in connectToNode producing invalid host:port addresses - Fix port mismatch when connection manager restarts with AutoPort - Fix race condition in NodeJoined event ordering between internode and PG - Fix router internode receiver silently ignored due to first-write-wins - Fix MsgPack codec missing WriteExt flag causing string/bytes ambiguity - Fix syncRemote deleting remote node entry on empty groups (discover loop) - Fix stale control loop race on node leave/rejoin - Fix inbound connections dropped from not-yet-managed nodes - Fix cleanupControlLoop deleting replacement loop instance - Fix ensureNodeManaged calling AddManagedNode on already-managed nodes - Fix AddManagedNode destroying active inbound connections * cleanup: pool release hygiene, log level downgrades, PID precomputation - Change Groups[:0] to nil in JoinGroupsCmd/LeaveGroupsCmd Release() to match every other pooled command type in api/ and avoid retaining backing arrays in the pool - Downgrade 4 chatty Info-level function-entry trace logs to Debug in internode connection manager (handleConnect, handleConnected, handleDisconnected, Node already managed) - Add Precomputed() to PIDs constructed in NewServicePackage to avoid repeated string rebuilds on every String() call * fix: use Done channel instead of sleep for deadline test on Windows time.Sleep(1ms) was insufficient for a 1ns context timeout on Windows due to ~15ms timer resolution, causing the test to get ErrNotFound instead of context.DeadlineExceeded. * Add Raft-based Global Registry for cluster-wide process naming Implements distributed global registry using Raft consensus: API: - api/globalreg/: Global registry API with Register, Unregister, Lookup - api/raft/: Raft consensus API for distributed state machine System: - system/raft/: Raft node implementation with membership handler - system/globalreg/: Sharded FSM + Service for global registry - FSM: Replicated state machine for name→PID mappings - Service: Leader forwarding, auto-cleanup on node departure - Commands: Register, Unregister, RemoveNode, Heartbeat Boot: - boot/components/system/raft.go: Raft component initialization - boot/components/system/all.go: Add Raft() to component list Lua: - runtime/lua/modules/process/module.go: Add GLOBAL registration mode Features: - Cluster-wide unique names via Raft linearizability - Automatic cleanup when nodes leave cluster - Leader forwarding for write operations - Retry logic for follower leader discovery * Add PG test harness and comprehensive e2e tests - system/pg/harness/: New test harness for multi-node PG testing - TestCluster: Multi-node test cluster setup - TestNode: Individual node wrapper with mock topology/router - Helper methods for join/leave/broadcast/assertions - EventCollector for event verification - e2e_test.go: 19 end-to-end test scenarios - Single and multi-node operations - Group membership, broadcasts, monitoring - Edge cases and error conditions - stress_test.go: 8 stress tests - 100+ processes, concurrent operations - Memory stability, burst traffic - Concurrent group creation - chaos_test.go: 10 chaos tests - Node failures and recovery - Network partitions - Rapid failure sequences - bench_test.go: 8 benchmark tests - Join/leave/get members operations - Parallel and atomic operations Note: Tests using mock router don't have cross-node sync. Tests requiring real inter-node communication should use integration test setup with real relay. * Add sharded global registry with 2PC distributed transactions Implements lock-free sharded registry architecture: - ShardCoordinator: Manages multiple shards using sync.Map - Shard: Individual FSM per shard with atomic counters - TransactionManager: 2PC coordinator for cross-shard atomicity - FNV-1a consistent hashing for name distribution - Lock-free reads via sync.Map and atomic operations - All 14 sharded registry tests passing - Fixes UnregisterMulti 2PC prepare phase for both register/unregister * Add Lua process module global registry integration tests Adds comprehensive tests for the Lua process module's global registry integration using a mock registry implementation: - Registration and lookup - Conflict detection - Concurrent access - Multiple names - Unregistration - Node cleanup - Linearizability - Idempotent re-registration - Stress testing 12 tests covering mock registry behavior for Lua integration. * Add SyncedCluster test harness with 23 new PG synced tests - Add SyncedCluster infrastructure with shared event bus - Export ForwardingRouter for cross-node message routing - Add SetMonitorError to mockTopology for error injection tests - Fix 4 pre-existing E2E test bugs (ref-counting, local-view assertions) - Add 23 new synced tests covering local behavior, error injection, concurrent operations, and edge cases * fix(runtime): prevent memory leak in fenceCaches sync.Map The fenceCaches sync.Map in runtime/lua/modules/process/module.go was growing unbounded because entries were never cleaned up in production. When an LState was closed, the cache entry remained forever. Fix: Register a cleanup callback via resource.Store when the cache is first created. The callback runs when the process ends, removing the entry from fenceCaches. This follows the established pattern used by pg, fs, sql modules. Changes: - Add resource.Store cleanup registration in getFenceCache() - Update fenceCaches comment to reflect actual cleanup mechanism * Add 17 new E2E tests and 9 SyncedCluster chaos/stress tests E2E test coverage additions: - LeaveIdempotency, JoinAfterLeave, BroadcastMultipleMembers - ConcurrentLeaves, MixedJoinLeave, DifferentPIDsSameNodeSameGroup - SamePIDMultipleGroups, PartialLeaveGroups, NodeRecoveryMembership - GroupNamesWithSpecialChars, EmptyStringGroup, VerifyTopologyMonitor - ServiceStopStart, BroadcastLocalEmptyGroup, LeaveOneOfManyGroups - Fixed existing tests to account for ref-counting behavior SyncedCluster chaos/stress tests: - RapidNodeFailure, ConcurrentNodeOperations, PartialNodeFailure - CascadeFailure, HighVolumeJoins, SustainedLoad - MemoryPattern, ConcurrentGroups, BurstOperations * Fix pre-existing chaos/stress tests for local-only semantics - Update chaos tests to count local members per-node instead of expecting cross-node sync via AssertGroupSize - Update stress tests to use local member counting - Remove WaitForSync calls that expected cross-node synchronization - Tests now correctly work with TestCluster (local-only) semantics - 7 chaos tests and 8 stress tests now passing * fix(globalreg,pg): dedupe registrations and persist fence tokens in snapshots - Fix idempotent register to not increment nameCount on re-registration - Persist AppliedAt fence tokens in snapshot/restore (adds 'a' codec field) - Fix JoinGroups to dedupe groups when checking maxGroups limit - Respect maxMembersPerGroup when same group appears multiple times - Add comprehensive tests for all edge cases * feat(pg): add circuit breaker, retry mechanism and queue backpressure Add resilience features to process groups: - Circuit breaker per node (3 failures / 10s reset) - Exponential backoff retry queue for failed broadcasts - Queue overflow protection with warnings at 75% capacity - Protocol and broadcast timeouts (5s default) - New errors: ErrQueueFull, ErrBroadcastTimeout, ErrCircuitOpen Fix race condition in retry queue startup. All tests pass with -race. Refs #pg-consistency * refactor: reorder struct fields and clean up comments Optimize struct field ordering for alignment, remove redundant inline comments, and rename unused context parameters to _. * chore: remove PG harness tests (moved to wippyai/pg-harness) The PG harness test suite (harness infrastructure + 247 tests across LIFE/CHAOS/LOAD/CONC/BP/PROTO sections) now lives in the standalone github.com/wippyai/pg-harness repository at the sibling path ../pg-harness. * fix(pg): race-free Start/Stop ctx swap via atomic.Pointer Service.Start() previously wrote s.ctx and s.cancel as plain field assignments while concurrent submitters read <-s.ctx.Done() in submit(), submitError(), eventLoop(), Join/Leave/Broadcast and every other hot path. go test -race -count=3 (exercised from the pg-testapp TestNODE_ RestartUnderPressure chaos scenario) reliably caught two distinct races: WARNING: DATA RACE Write at 0x...: pg.(*Service).Start() service.go:142 Previous read: pg.(*Service).submit() service.go:276 WARNING: DATA RACE Read at 0x...: sync/atomic.LoadPointer via Join() service.go:558 Previous write: context.WithCancel from Start() service.go:142 Fix: replace the two mutable fields with a single atomic.Pointer[serviceCtx] holding the Start-scoped ctx + cancel. The holder is initialised at construction with a pre-cancelled sentinel (closedServiceCtx) so pre-Start / post-Stop reads are lock-free and need no nil guard; Start() publishes a fresh pair with one atomic Store, Stop() atomically Swaps back to the sentinel and only then cancels the old ctx. All hot-path reads go through s.currentCtx(), a single atomic.Load. Mirrors the existing atomic.Pointer[stateSnapshot] pattern already used for lock-free membership reads. No lock added, no contention change. Verified: go test -race ./system/pg/... — clean (9.5 s) go test -race -count=3 ./harness/... (pg-testapp) — clean (73 s, 88 tests × 3) golangci-lint run ./system/pg/... — 0 issues * wip: in-flight changes across pg, globalreg, raft, topology, cluster Snapshot of uncommitted work on feature/pg-process-groups before pushing: - system/pg/ — protocol/retry/state/errors refinements + test tweaks - system/globalreg/ + sharded/ — commands, fsm, service, coordinator, shard, transaction, state - system/raft/membership.go - system/topology/pid_registry.go - cluster/internode/ — codec, state_manager - boot/components/system/cluster.go - runtime/lua/modules/pg/ — module, module_test, yields - otel-harness.md — reference catalog (new) Builds clean (go build ./...). * feat(raft): cap voters with non-voter overflow and odd-size ladder Adds a leader-driven membership reconciler that caps Raft voters at MaxVoters (default 5) along an odd ladder (1,3,5,5,5,7,...), keeping surplus eligible nodes as non-voters so they receive the log without inflating quorum cost. Selection is rank-based (raft_priority asc, NodeID asc) with failure-domain spreading and soft stickiness so existing voters aren't churned across reconcile passes. - api/raft: extend Service with AddNonvoter/DemoteVoter/ LeadershipTransfer; add ErrServerNotFound. - system/raft: split selection (pure) from handler (loops); decoupled subscriber + reconcile goroutines with debounce and coalesced signal; self-demote/self-remove triggers LeadershipTransfer first to avoid killing the local leader mid-pass; soft errors abort the pass. - cluster/membership: add UpdateMeta and clone NodeMeta under lock to fix a race in the gossip delegate. - boot: gossip raft_eligible/raft_priority/failure_domain hints; wire the reconciler with HandlerConfig from cluster config. - tests: pure unit tests for selection + handler with fake services and a real eventbus debounce check; //go:build integration tests spin up 3/5/7-node in-process clusters to verify the cap end-to-end against hashicorp/raft. * feat(telemetrytest): add in-memory recorder for telemetry unit tests * feat(pg): emit OTel metrics for join/leave/broadcast operations * feat(pg): emit OTel metrics for queue, circuit breaker, retry * feat(pg): emit OTel metrics for dispatcher, fence tokens, globalreg * feat(pg): emit OTel spans for join/leave/broadcast * feat(raft): emit OTel metrics for state, term, leader changes, elections * feat(raft): emit OTel metrics for commit/log lag and AppendEntries latency * feat(raft): emit OTel metrics+spans for voter ladder and snapshots * feat(gossip): emit OTel metrics for members, joins, messages * feat(gossip): emit OTel metrics+spans for probes, suspicion, convergence * feat(globalreg,relay): emit pg_fence_*/pg_globalreg_* OTel metrics Wire the runtime dashboards (pg_fence_token, pg_fence_rejection_total, pg_globalreg_size, pg_globalreg_dedupe_total) from where the work actually happens: - system/globalreg/telemetry.go: nil-safe recorders that emit pg_* metric series via api/metrics.Collector. The pg_ prefix is a metric naming choice (Grafana contract), not a package ownership statement. - system/globalreg/state.go: introduce registerOutcome enum so the FSM can distinguish fresh insert / idempotent dedupe / conflict, and add Len() for size sampling. - system/globalreg/fsm.go: emit fence-token gauge on inserted/resolved registrations, dedupe counter on idempotent re-registration, and size gauge after every state mutation (including Restore). - system/globalreg/service.go: take optional metrics.Collector + meter/tracer providers, install telemetry on the FSM, and expose RecordFenceRejection so the relay router can report rejections via callback without depending on metrics directly. - system/relay/router.go: add SetOnFenceReject callback hook fired at the existing fence-token rejection site; avoids plumbing telemetry through ~30 NewRouter call sites. - boot/components/system/raft.go: pass nil collector triple (matching the existing raft wiring) and install the fence-reject callback. - system/pg/telemetry.go: remove the unwired recorders left over from Task 4 (recordDispatcherInflight, recordFence*, recordGlobalreg*); metric names are unchanged so dashboards still resolve. * feat(boot): wire metrics collector and OTel providers into pg/raft/gossip/globalreg constructors * fix(cluster,raft): wire metrics dep + fix raft addr from gossip ip:port * feat(raft,gossip): wrap AE pipeline+RequestVote+InstallSnapshot, synthesize gossip_message_total * feat(observability): bootstrap rare counters, sticky local voter, AE pipeline wrap, 10s push interval * perf: 30s push interval, lower workload rate (sleep 500ms), match k8s CPU limits * docs(superpowers): no-crash runtime spec - bounded everything, silent nothing * docs(superpowers): no-crash runtime implementation plan (13 tasks, TDD) * test(internode): add baseline OOM reproducer (skipped placeholder) * feat(internode): per-class bounded queues with drop-oldest|drop-newest Replace unbounded *list.List per-peer message queue with fixed-capacity ring buffers per QoS class (RaftControl 4096 drop-oldest, Gossip 1024 drop-newest, PGBroadcast 2048 drop-newest+error). All drops emit internode_dropped_total{class,reason="queue_full"} via new telemetry layer. * fix(internode): meter requeue drops, distinct ErrUnknownClass, gauge on drain * feat(internode): plumb Class through SendToNode + topic-based mapping * fix(internode): requeue respects per-class cap (no duplication on reconnect) Replace RequeueMessages with RequeueMessagesClass, which respects the per-class ring buffer cap and applies drop-oldest/drop-newest semantics consistently during requeue. Callers in handleDisconnected, cleanup, and drainMessages conservatively requeue under ClassPGBroadcast since the original class is lost at the connection level. Old RequeueMessages removed entirely; existing tests migrated to RequeueMessagesClass with ClassRaftControl as the safe default. * feat(pg): heap-based capped retry queue with drop metrics * fix(pg): bucket attempt label to bound prometheus cardinality * feat(pg): observe ErrQueueFull from internode and count broadcast drops * perf(otel): bound batcher queue/batch + span limits, non-blocking on overflow * perf(raft): drop per-AppendEntries span on hot path (preserve metrics) * fix(raft): bound LeadershipTransfer goroutine pin under partition Extract awaitFutureWithTimeout helper with a buffered(1) channel so the helper goroutine can always send and exit once hraft's future resolves — even after the caller has already returned ErrTimeout. Under a network partition the goroutine stays pinned inside f.Error() until the partition heals or raft shuts down (ErrRaftShutdown); hraft.Future has no cancel API so this single-goroutine pin per stuck transfer is the honest bound. Remove pre-existing dead code defaultTimeout from config.go. * chore: go mod tidy (remove spurious goleak indirect) * fix(chaos): close 6 root-cause issues from chaos audit Bound every unbounded growth point we saw OOM the runtime under network-partition + stress chaos: - internode: NodeConnection.activeQueue capped (MaxConnectionQueueSize, default 4096) with drop-newest + internode_dropped_total {reason="conn_queue_full"}; orphan-sweep loop reconciles managed nodes against membership every 60s. - raft: stderr adapter limiter table is a real LRU (cap 256) keyed by first-64-byte prefix; previous map grew unbounded under chaos with many distinct prefixes. - pg: monitor entries owned by PIDs on a departed node are evicted on cluster.NodeLeft; circuit breaker map capped at 1024 with a defense-in-depth eviction counter. Detect dead raft peers and break the broken-pipe storm at the transport layer (not just at the log): - system/raft/peer_state.go (new) wraps hraft.Transport with a per-peer consecutive-failure tracker. After failureLimit=5 errors the peer is marked dead for an exponential backoff window (100ms→5s). Subsequent transport calls return errPeerDead immediately, never reaching the dead socket. - raft_peer_dead_total / _skip_total / _recovered_total + a backoff gauge expose the throttle. Make raft_leader_changes_total emit reliably: - LeaderCh is non-buffered with non-blocking writes; the bootstrap node misses its own first leader-elected event when the goroutine isn't reading yet. Added an initial seed (state == Leader) and a ticker reconciliation that compares actual state vs wasLeader and fires the missing transition. Verified: 20+ leader changes now recorded under partition where prior runs reported 0. Silence chaos-time log spam: - internode SendToNode (ErrNodeNotManaged) and onMessage delivery failure: log demoted, internode_dropped_total{reason=...} replaces the line. - cmd/internal/logger gets zap.SamplingConfig{Initial:100, Thereafter:100} on the non-Console default — defense-in-depth so a future hot WARN cannot OOM the runtime. - raft TCPTransport stderr piped through raftStderrAdapter (rate- limited per first-64-byte prefix via golang.org/x/time/rate). Activity-based liveness: - system/health/ (new) is a process-wide registry of liveness checks. - /livez handler in boot/components/prometheus aggregates checks; HTTP 503 if any reports unhealthy. - pg.Service registers pg.broadcast_recent.<host> backed by system/pg/activity.go atomic timestamp (touched on every TX/RX). - raft boot registers raft.last_contact (LastContact() vs ceiling). - cluster boot registers cluster.gossip (memberlist HealthScore). - Ceiling 30s catches a pod stuck on the wrong side of a partition before kubelet's failureThreshold expires. Thread the runtime config that the configmap was already setting: - raft snapshot_threshold/_interval/_retain, trailing_logs, heartbeat_timeout, election_timeout, commit_timeout now read from the boot config instead of being silently ignored. - prometheus.enabled / prometheus.address now bind /metrics + /livez on the configured port; pprof gated by WIPPY_DEBUG_PPROF=1. Reusable cluster wiring for non-boot consumers: - cluster/stack.go AssembleStack(cfg) returns a Node + Router + membership + internode wired the same way the boot does, so the pg-harness in cluster mode joins the gossip ring as a real peer. Tests: peer_state_test (4 cases including per-peer isolation), stderr_adapter_test (LRU bounded + LRU evicts oldest, not random), service_test (removeMonitorsByNode evicts only entries on the departed node, leaves others). golangci-lint clean across all modified packages. * feat(cluster): 3-tier name registry + raft hardening + KV drivers Three cohesive workstreams that lay the groundwork for hosting userland on top of the runtime cluster. 1. Three-tier process name registry. LOCAL and GLOBAL existed; this lands the EVENTUAL tier for ~100k user-session-class names that cannot pay Raft cost per registration. system/eventualreg implements an ORSWOT CRDT with 64-shard digests + anti-entropy via memberlist push/pull, tombstone GC with safe-counter tracking, OTel telemetry. The membership delegate at cluster/membership/membership.go now multiplexes multiple UserDelegates over the gossip channel (kind:1|len:4|payload framing) so eventualreg, kveventual, etc. coexist. Triangular shadow-check in PIDRegistry blocks LOCAL names from shadowing GLOBAL/EVENTUAL. 2. Re-registration flood bounds + raft rebalancing hardening. Watermarked reestablishMonitors avoids full FSM rescan on leader failover. Chunked applyRemoveNode (256 names/Apply) lets foreground writes interleave during node-departure cleanup. pg discover fan-out capped at 4 random peers so simultaneous restarts don't produce N^2 messages. New runtime_name_reregistrations_total{scope} counter feeds the soak gate. Raft additions: voter-op metrics + 60s churn-burst counter, proactive dead-voter eviction (peerStateTracker streak + gossip both red), domain-aware stickiness (same-domain victim preferred when swapping), defensive quorum-safety clamp, live max_voters reconfig. 3. Cluster KV drivers. New api/kv interface (Get/Put/Delete/CAS/Watch/Scan, Mode={Raft,Eventual}, ProviderRegistry). system/crdt extracted as a generic ORSWOT base; kveventual rides it for gossip KV with UserDelegate kind 0xE2 (vs eventualreg's 0xE1). kvraft runs its own hashicorp/raft replication group on a separate port via raft-boltdb/v2 storage; FSM commands Put/Delete/CAS/ReapTTL with periodic leader-driven TTL reap, per-namespace watch hub with overflow-drop + metric, snapshot/restore round-trip. Boot wiring composes both backends into a single ProviderRegistry. No new dependencies — uses libraries already in go.mod (hashicorp/raft, raft-boltdb/v2, bbolt, go-msgpack/v2). Tests: 60+ new unit tests across new packages. Full runtime suite green (272 packages); lint clean. Covers ORSWOT merge correctness, two-replica convergence, shard hash determinism, tombstone GC, FSM Put/Get/CAS/TTL, snapshot+restore round-trip, watch-hub fan-out + overflow. * fix(cluster): cluster-mode bugs found in chaos rig Five bugs surfaced when running runtime + pg-harness in the k3d monkey cluster under active network-partition + delay chaos. Each is fixed at the root, not papered over. 1. kveventual telemetry label-set mismatch. The bootstrap registered kveventual_space_open_total with {node} but recordSpaceOpen recorded with {node, space}, panicking the prometheus exporter on first space.Open. Same shape on _op_total/_bytes_total/_tombstones_gc_total where space-level recorders were missing the `node` label entirely. Bootstraps now use the FULL label set the recorder will later use; space.go now carries `node` and includes it in every label map. 2. runtime_name_reregistrations_total label-set divergence between globalreg.recordReregistration (`{scope}`) and eventualreg (`{node, scope}`). Same exporter panic. globalreg now takes node as a param so both writers register the same descriptor. 3. cluster.Stack.Start leaked the memberlist gossip port on Membership. Start failure. memberlist.Create binds the TCP listener BEFORE Join, so a Join failure left port 7948 bound and any retry hit "address already in use". Stack.Start now tears membership down on failure, making the harness retry path actually retryable. 4. kveventual_entries gauge was never set anywhere — dashboards blind to KV size. Added space.setEntriesGauge() called from Put/Delete/ applyFrame and bootstrapped with {node, space, state} label set. Verified end-to-end in the cluster: 7+ min soak under network-partition-50pct + network-delay-200ms-jitter-100ms with all 4 pods at 0 restarts and all hard correctness gates green. * fix(gossip): budget-aware Drain so user-broadcasts fit MTU Bug 7 root cause: BroadcastQueue.Drain() ignored memberlist's per-cycle byte budget and returned ALL pending frames. With the kveventual chaos workload pushing ~470 ops/sec, each gossip tick produced 30+ frames totaling 40+KB. memberlist packed them into one compound UDP packet that exceeded MTU and was silently dropped at the IP layer (or by the kernel UDP send buffer). Fixes: 1. crdt + eventualreg BroadcastQueue.Drain(headerOverhead, byteBudget): - Honor byteBudget — accumulate frames whose total cost (len(frame) + headerOverhead) fits the budget. - Re-queue undrained entries at the head of pending; next gossip cycle picks up where this one left off (no data loss, just pacing). 2. eventualreg.Service.DrainBroadcasts, kveventual.space.drainBroadcasts: - Plumb byteBudget through to inner BroadcastQueue. 3. eventualreg.Delegate.GetBroadcasts, kveventual.Service.GetBroadcasts: - Pass memberlist's `limit` (not _) to inner. - In kveventual: apportion remaining budget across spaces; once a gossip cycle is full, untouched spaces keep their entries pending. 4. cluster/membership outer multiplex GetBroadcasts: - Pass `limit` (not `limit-5`) to inner — the 5-byte mux header is part of muxOverhead, not a separate budget reservation. - Add gossip_user_getbroadcasts_overshoot_total counter that fires if an inner delegate exceeds its budget (defensive observability). 5. cluster/membership telemetry: - Bootstrap gossip_message_total{kind="user", direction=rx|tx} so dashboards/gates can distinguish "metric never registered" from "no user-broadcasts received this window." 6. New regression tests in crdt/delta_budget_test.go assert Drain honors a tight budget and re-queues; an absurdly tight budget produces zero frames and keeps the queue intact. 7. New cluster/membership/multiplex_test.go: two-node test verifies the multiplex routing actually delivers a user-broadcast end-to-end. * fix(membership): pipe memberlist logs through zap; never silence ERR When memberlist's stdlib logger was redirected to io.Discard in non- verbose mode, every "[ERR] Failed to decode user message", "[WARN] Was able to connect over TCP but UDP probes failed", and similar diagnostic got dropped on the floor. That is exactly what masked Bug 7 (network-policy blocking UDP/7946) for an entire soak: the cluster looked healthy from our perspective while the operator-visible failure surface was reduced to "convergence is slow." New behavior: - Always surface [ERR]/[WARN] memberlist lines via zap at matching level. - Drop [INFO]/[DEBUG] unless VeryVerbose. - Lines are buffered and split on newline so the underlying log.Logger's prefix bytes don't get fragmented into multiple zap records. Add memberlistLogWriter (cluster/membership/log_writer.go) and wire it in as mlConfig.LogOutput. * fix(gossip): fair budget split across user delegates in outer multiplex Bug 8: with eventualreg's broadcast queue constantly full (4096 entries pending under workload), the outer multiplex GetBroadcasts loop hands it the entire per-cycle UDP budget on every call. After this commit's predecessor (Bug 7 fix) re-enabled UDP gossip, the symptom became visible: kveventual_bytes_total{dir=tx} went 0/s while eventualreg shipped at 39 KB/s. kveventual entries on peers froze at the first few that landed before eventualreg's queue saturated. Two-pass fair share: 1. Pass 1: each delegate gets limit/N. Guarantees no delegate is starved by a busy peer. 2. Pass 2: remaining budget redistributed to whoever still has pending entries. Recovers throughput when one delegate is idle. Plus deterministic iteration order (sort by Kind) so behavior doesn't depend on Go's map iteration randomness — a bug like this is much harder to reproduce when the order is whatever the hash table happens to enumerate. Inner delegates already re-queue undrained entries (Bug 7 commit), so shrinking the per-call budget is not data loss; subsequent gossip ticks pick up the rest. * fix(crdt,eventualreg): Drain emits a short frame when budget < MaxFrameBytes Follow-up to the budget-aware Drain (Bug 7 / 5a8fcfae) and the fair budget split (Bug 8 / 8b7da4ae). Verified on the chaos cluster: even after the fairness pass, kveventual still drained at ~0 bytes/s while eventualreg shipped ~39 KB/s. The remaining starvation is in Drain itself. The previous logic flushed a frame only when accumulated entries hit perFrameLimit (= MaxFrameBytes - headerOverhead, ~1385 bytes). When the outer multiplex hands a per-delegate share of ~699 bytes (limit/N across two delegates), Drain would pack ~17 entries up to ~1385 bytes and only THEN attempt to flush. The flush check (bytesEmitted+cost <= byteBudget) then fails (1400 > 699), so the frame is discarded and the loop breaks with NO frames emitted. The entries stay queued, the next call sees the same state, and the delegate makes zero forward progress no matter how often it is called. Fix: introduce activeFrameCap = min(perFrameLimit, byteBudget - bytesEmitted - headerOverhead). Frames flush as soon as they fill the active cap — which is correctly ~684 bytes when the budget is 699 and 0 byte emitted. Result: a short frame is emitted, ~8 entries clear the queue per call, and forward progress is bounded only by the fair budget share. Added TestBroadcastQueue_DrainSmallBudget pinning the 699-byte case. * fix(observability): bootstrap rare safety-gate metrics Bug 9: validate-chaos-impact.sh's hard gates for runtime_name_reregistrations_total (W2.4) and kv_watch_dropped_total (W6.3) silently report MISSING whenever no event has ever fired for that metric. Without bootstrapping, "MISSING" looks identical to "feature not deployed" and the safety net is effectively dormant. Fix: bootstrap each metric to zero at telemetry init with the same label set the recorder will later use, so: - "metric absent" = subsystem not deployed (true MISSING) - "metric at 0" = deployed and so far healthy - "metric > N" = real signal, gate trips on threshold Bootstraps added: - globalreg/telemetry.go: runtime_name_reregistrations_total{scope=global} (newTelemetry now takes localNode for the bootstrap label) - eventualreg/telemetry.go: runtime_name_reregistrations_total{scope=eventual} - kveventual/telemetry.go: kv_watch_dropped_total{space=_init, mode=eventual} - kvraft/telemetry.go: kv_watch_dropped_total{space=_init, mode=raft} Tests updated to pass localNode to globalreg.newTelemetry. * fix(observability): expose BroadcastQueue dropped counter Bug 10: BroadcastQueue.Push() rejects entries when the bounded queue is full and increments q.dropped, but the counter has never been wired to Prometheus. Under sustained workload (eventualreg queue stays at 4096 in the chaos cluster) data IS being dropped silently — operator sees queue depth = cap and has to infer drops from receiver-side gaps. Wire q.Dropped() into telemetry on every gossip drain: - eventualreg/telemetry.go: setQueueDropped() + bootstrap eventualreg_queue_dropped_total{node} at zero. - eventualreg/service.go DrainBroadcasts: publish current Dropped() every cycle. - kveventual/space.go drainBroadcasts: same per-space, gauge label {node, space} so per-space saturation is visible. - kveventual/telemetry.go: bootstrap kveventual_queue_dropped_total with the {_init} space at zero. Gauge (not Counter) because q.dropped is already a cumulative monotonic counter held in memory; setting the gauge each cycle makes Prometheus rate() compute drop rate the same way it would for a counter, with no delta-tracking complexity. * fix(raft): implement RequestPreVote on peerStateTracker so pre-vote enables Bug 12: hashicorp/raft logs at startup "raft: pre-vote is disabled because it is not supported by the Transport" because peerStateTracker (our Transport wrapper for per-peer dead backoff) didn't implement WithPreVote — only AppendEntries, RequestVote, and AppendEntriesPipeline. Without pre-vote, a partitioned node keeps incrementing its term during isolation (Raft requires it to bump term on every election timeout that fails to reach quorum). On rejoin the cluster sees a term higher than the current leader and forces a step-down — even though the cluster was healthy. Observed in the chaos cluster: runtime-0 logged 24 election timeouts in 5 minutes, terms inflated 144 → 166. Add RequestPreVote following the same per-peer dead-backoff pattern as RequestVote. The inner is hraft.NetworkTransport which natively implements WithPreVote in v1.7.3; the type assertion guards against future regressions where someone wraps a non-pre-vote transport without realizing it. Test pin: TestPeerStateTracker_SatisfiesWithPreVote asserts the tracker satisfies the WithPreVote interface and that RequestPreVote forwards to the inner transport. * fix(crdt): single-lock snapshot in ShardHash to remove nil-deref race Bug 13: harness panicked under sustained workload with TCP push/pull anti-entropy active (after Bug 11 fixed netpol egress to 7948): panic: runtime error: invalid memory address or nil pointer dereference ... system/crdt.(*State).ShardHash ... state.go:388 system/crdt.MakeDigest ... digest.go:28 system/kveventual.(*Service).MergeRemoteState ... service.go:300 Root cause: ShardHash held the shard RLock to snapshot keys, RELEASED it across sort.Strings, then re-acquired it for per-key entry lookups. Between the two locks, an Unregister could remove a key — sh.entries[k] returned nil and `e.Node` on the next line dereferenced nil. The original split-lock design avoided holding the lock during sort, but sort.Strings on ~1500 keys is fast (microseconds) compared to the risk; safer to take a one-shot snapshot of {key, node, counter, wall, deleted, value} under a single RLock and hash from the slice. Entry fields are cheap-copy primitives plus a value []byte — and entry values are never mutated in place (Overwrite/Apply always swap in *Entry copies via cloneBytes), so aliasing the slice after unlock is safe. Test pin: TestShardHash_ConcurrentDelete (with -race, 8 readers × ShardCount × 200ms hammering ShardHash while a writer churns Overwrite/Unregister) reliably panicked on the old code; passes cleanly on the new implementation. * fix(membership): rejoin loop when memberlist becomes isolated Bug 14: when all peers cycle through chaos at once (e.g., the 3 runtime pods all hit container-kill in the same window), the harness's memberlist drops every peer and ends up alone — and stays that way forever. memberlist has no built-in auto-rejoin: peers coming back have a fresh empty member list with no record of the harness and can't reach it via gossip. Symptom in the chaos cluster: - harness logs show "node left" for runtime-0/1/2 in a ~50s window, then NOTHING for the next 10+ minutes. - harness workload prints members_in_first_group=1 (just self). - gossip_user_getbroadcasts_calls_total = 0/s on harness (no other peers to gossip to → memberlist's gossip() loop returns early). - runtime peers' eventualreg/kveventual entries flatline at near-zero because the harness's data never propagates. Fix: add rejoinLoop, started from Service.Start when JoinAddrs is set. Every 30s, if Members() == [self], re-runs memberlist.Join with the configured addresses. Fast path is the no-op check; under normal chaos where peers cycle individually, the loop never triggers. Reaction time chosen to be slow on purpose — single-peer flapping recovers via gossip in seconds; this loop only catches the "all peers gone at once" edge case. * fix(otel): align histogram buckets with prometheus DefBuckets Bug 15: pg_op_p99_seconds reported 4.9s in the chaos cluster while the underlying observations were sub-millisecond. Cause: dual export. The runtime feeds metrics into both the in-pod Prometheus exporter (client_golang, DefBuckets = 0.005..10s) AND the OTel exporter (push to otel-collector → prometheus_remote_write). OTel's SDK default histogram bucket boundaries are 5, 10, 25, 50, 75, ..., 10000 — wildly different scale, intended for second-grade observations. Both paths land in the same Prometheus instance with the same metric name. Prometheus joins them into one series whose `le` label values are the UNION of both bucket sets, but observations from each path go only into THEIR side's buckets. histogram_quantile() then mis-attributes the OTel-side buckets (le=25..10000) as outlier observations, dragging the computed p99 to ~5s. Fix: in service/otel/metrics.go's getOrCreateHistogram, supply WithExplicitBucketBoundaries that mirror prometheus.DefBuckets exactly. Both paths now write to the same bucket set; the merged series is a true sum and percentiles compute correctly. * fix(eventualreg): rename antientropy_duration_ms to _seconds, divide by 1000 Bug 16: eventualreg_antientropy_duration_ms_bucket received millisecond values into seconds-grade DefBuckets [0.005..10s]. Real anti-entropy rounds take 10-200ms, so every observation hit the +Inf bucket and the histogram was effectively unobservable. Surfaced after Bug 15 made histogram bucket configuration consistent across export paths — once the wrong-magnitude problem stopped being masked by spurious OTel-side buckets, the +Inf-only distribution became visible. Fix: convert at the observation site (durationMs/1000.0) and rename to follow Prometheus convention (_seconds). The recordAntiEntropy signature still accepts durationMs because the caller computes time.Since(start).Milliseconds() — keeping that interface avoids a ripple of unit changes through the call chain. * fix(raft): per-role last_contact ceiling — 30s voter, 5min non-voter At 60+ replicas under chaos the leader's heartbeat fan-out cannot keep every non-voter follower's last_contact below 30s, so /livez 503s on non-voters and kubelet cycles them — observed cascade at scale=75/100 in the break-test. Voters still gate quorum and must be timely (30s). Non-voters are replication-only; permanent isolation is detected via the gossip-side health check (cluster.gossip), not raft heartbeats. Adds Node.IsVoter() reading the local entry from GetConfiguration(). * fix(pg): raise pg.broadcast_recent /livez ceiling to 5min Same shape as Bug 21 (raft.last_contact). Under continuous random network-partition chaos, a pod can be cut off from gossip TX/RX for tens of seconds at a time; the previous 30s ceiling tripped /livez 503 regularly and kubelet cycled the pod, cascading the cluster. cluster.gossip + raft.last_contact (per role) already detect truly disconnected nodes. This check now only fires when the pg subsystem itself stalls — event loop wedged, all peers gone, retry queue saturated. * chore(F-series): F1.5 + F3.1 + F3.3 — log/eventbus instrumentation F1.5 eventbus: cap subscribers at 4096 (DefaultMaxSubscribers); reject with ErrSubscribersCapReached; emit eventbus_subscribers_active gauge + eventbus_subscribers_rejected_total counter; SetCollector bridge from metrics boot. Cap surfaces a runaway leak instead of silently OOMing the process. F3.1 pg: replace per-event WARN at the 7 submit() backpressure-drop sites with the existing pg_queue_dropped_total counter; gate the submit() queue-full / approaching-capacity log on atomic state transitions only. Under partition storm: 0 WARN/s/node observed (previously thousands). F3.3 cmd/internal/logger + system/logs: add zap.Hook-driven runtime_log_emissions_total{level,component} counter, bound to the metrics collector through the SetEmissionCollector bridge after the boot subsystem is up. * fix(raft): instrumentedTransport must explicitly implement RequestPreVote Bug 23: peerStateTracker.RequestPreVote (Bug 12) type-asserts its inner to hraft.WithPreVote so it can forward the call. The chain in production is peerStateTracker -> *instrumentedTransport -> *NetworkTransport. *NetworkTransport implements WithPreVote, but *instrumentedTransport embeds the hraft.Transport INTERFACE — which the hashicorp/raft team defined to NOT include RequestPreVote ("wasn't in the original interface specification"). Embedded interface method promotion exposes only the interface's method set, not the concrete value's, so *instrumentedTransport did not satisfy WithPreVote and the assertion at peer_state.go:129 failed on every pre-vote RPC. Visible only post-Bug-21: previously kubelet was killing the leader on /livez 503 before the election storm could be observed. Once Bug 21 stopped the cycling, the leader pod (wippy-runtime-0) accumulated 22 restarts in ~10min via the storm. Fix: implement RequestPreVote explicitly on *instrumentedTransport, forwarding via the WithPreVote assertion on its concrete inner. Also adds raft_request_pre_vote_total / _duration_seconds telemetry mirroring RequestVote. Locked with TestInstrumentedTransport_PreservesWithPreVote so a future refactor of either wrapper would surface the regression at test time, not in a 60-pod cluster. * feat(chaos): admin surface, raft expose, eventualreg shard-pull, gossip boot grace - system/admin: minimal read-only HTTP surface for the chaos harness (raft status/log-head/configuration via shared dispatcher, eventualreg digest, membership members). Hard-requires its deps from app context. - system/raft: expose CommitIndex() + Term(); clamp LogHead to the commit index so cross-pod comparison reflects Raft's Log Matching Property (uncommitted suffix may legitimately diverge after partition heal). Read CommitIndex before LastIndex to make the clamp race-resistant. - system/eventualreg: wire the dormant shard-pull anti-entropy path. MergeRemoteState now emits FrameTypeShardRequest to the divergent peer over reliable TCP; receiver replies with FrameTypeShardResponse. Closes the silent-drop convergence hole (BroadcastQueue overflow had no recovery channel beyond GC). 5s per-peer cooldown + telemetry counters for sent/suppressed/send_error and rx/tx shard responses. - cluster/membership: SendUserMessage helper that wraps memberlist SendReliable with the same kind/len multiplex header the broadcast path uses — single hop for targeted request/response. - boot/components/system/cluster: 60s grace window on the cluster.gossip liveness health check. Non-bootstrap pods rejoining the cluster used to exit 0 mid-rejoin when kubelet SIGTERM'd them after 3 probe failures against an above-ceiling HealthScore; with the grace they stay alive long enough to converge. - boot/components/system/admin: hard-required deps (membership, globalRaft, eventualReg). Fail-loud at boot instead of nil + 503. - govet inline: reflect.Ptr → reflect.Pointer rename across the interpolator/payload/registry packages. - gosec G124: nolint annotation on outgoing client cookie at service/http/client/handler.go — Secure/HttpOnly/SameSite are server-set attributes ignored by (*http.Request).AddCookie. * fix(globalreg): forwarded apply returns typed FSM result The follower-side forward path returned (nil, nil) on success, dropping the typed FSM response. Callers such as Register that asserted the result to *RegisterResult then failed with "unexpected register response type: <nil>" — successful registrations on the leader looked like errors on the follower. The response envelope is now versioned. v0 (legacy [8B corrID][errMsg]) remains decodable for mixed-version clusters. v1 prepends a magic byte + version byte, then a msgpack body carrying the command kind, an optional error string, and an opaque typed-result blob. The leader encodes *RegisterResult / *UnregisterResult / *RemoveResult via wire structs (errors are round-tripped as strings); the follower decodes by command kind and returns the typed pointer. Adds globalreg_forwarded_apply_total{cmd,result} and globalreg_forwarded_apply_latency_seconds{cmd}. Empirical proof under tests/proof/run_follower_register.sh (kept untracked alongside the existing rig): two follower probes contend 50 GLOBAL registrations each on the same name set. The script PASSes (exit 0) with the fix and reproducibly FAILs with the exact pre-fix error string when the typed-result return is reverted. * refactor(globalreg,eventualreg): unify Lookup behind options Collapse Lookup / LookupWithFence / LookupByPID / ValidateFence into a single Registry.Lookup(ctx, name, opts...) returning a rich LookupResult. The three legacy methods stay on the interface as // Deprecated forwarders for one transition cycle; new globalreg.ValidateFence(ctx, reg, name, token) helper drops the FSM-read indirection at the relay fence hot path. Same options surface mirrored on eventualreg (WithFence yields token=0; ByPID unsupported - eventual has no reverse index). * refactor(process): name-based send resolves and fences internally Introduce a single system/process.ResolveDestination(ctx, dest) Go helper that parses raw PIDs, then consults globalreg (fence-bearing), eventualreg and the local PIDRegistry in order. The Lua process module now delegates every send/terminate/cancel/monitor/link path to it and drops the per-LState fenceCaches sync.Map (stale-fence footgun after re-registration). The legacy process.registry.lookup_with_fence and validate_fence are moved under process.registry.debug.* and the top-level aliases keep working but emit a one-shot deprecation banner per LState. spec.md updated; relay router test mock fixed to satisfy the T2 helper path. globalreg.RecordFenceRejection emits an info log so chaos rigs without a metrics scrape endpoint can correlate stale-fence drops by name and reason. * feat(globalreg): ROOT all-live-node ack scope; CONSISTENT + EVENTUAL split Adds the strict ROOT scope (Raft singleton + all-live-node ack on the committed epoch within a deadline) alongside the existing CONSISTENT (formerly Global) and EVENTUAL scopes. The Lua surface exposes the four constants LOCAL / EVENTUAL / CONSISTENT / ROOT; Global stays as a back-compat alias. The FSM gains CmdRegisterPending/Ack/Expired/Unreserve and a pending map alongside the active names map. Acks flow over the relay; the leader applies them through Raft, promotes pending -> active inline once the set covers RequiredNodes, and a leader-side timer commits CmdRegisterExpired when the deadline elapses. The membership snapshot is captured at the pending commit and replayed deterministically across replicas. RegisterScope blocks until Active or Expired or ctx.Done. Adds /admin/globalreg/pending plus globalreg_root_pending/active/expired/ ack/release counters and a globalreg_root_pending_in_flight gauge. * feat(internode): per-class sub-protocol multiplexer Adds a 1-byte sub-protocol Class to the internode frame header so the wire is self-describing and receivers can dispatch by class without inspecting payload. protocolVersion is bumped 0x01 -> 0x02 and frameHeaderSize 5 -> 6. The existing class enum (RaftControl, Gossip, PGBroadcast) is extended with ClassRaftMesh (wire byte 0x03), reserved for the mesh-backed Raft transport landing in T5.2. - NodeConnection.Send takes a Class; Run delivers it back to the handler. - DrainMessages returns []Outbound preserving the original class so requeue-on-disconnect honors QoS instead of forcing every pending frame into ClassPGBroadcast. - ConnectionManager.RegisterClassReceiver lets a sub-system claim a Class so its inbound frames bypass the default onMessage callback; unclaimed classes flow through the existing path unchanged. - State-manager queues sized to numClasses=4, ClassRaftMesh defaults to drop-oldest with cap=4096 (etcd raft semantics). mux_test.go covers wire round-trip per class, concurrent class senders not interleaving frames, invalid wire-class surfacing a structured protocolError, and the live RegisterClassReceiver dispatch on a real ConnectionManager pair. Local validation: - go build ./...: clean - go vet ./...: clean - golangci-lint run ./...: 0 issues - go test -race ./cluster/internode/...: pass - go test -race ./...: pass except the pre-existing flaky TestSupervisor_ServiceTimeout (reproduces identically on the pre-T5.1 head f9414fc4) - tests/proof/run_chaos.sh: pre-existing failure on f9414fc4 too (raft leader-election starvation under loopback chaos); orthogonal to this change which touches only internode wire format and per-class dispatch * feat(raft): Raft over Wippy internode mesh (T5.2) + rot sweep Adds meshStreamLayer riding cluster/internode's ClassRaftMesh; default Raft path now uses NodeID addressing with no separate TCP listener. Bundles companion fixes: desiredVoterCount(2,_)=2 to stop demoting a healthy 2-voter quorum post-failure, supervisor test ctx loosened to 15s, mesh nil-map panic on Close + regression test, deprecation markers on legacy raft config fields. * test(globalreg,raft,kv): close PR241 reviewer test gaps Adds four reviewer-asked tests: - TestConsistentScope_ConcurrentCrossNode (R1.a): cross-node concurrent CONSISTENT registration; one winner, one ErrNameAlreadyRegistered, both observe the same authoritative PID. - TestRunReconcileOnce_NodeLeftRemovesServer (R1.c): membership handler emits exactly one RemoveServer for a vanished peer. - TestKVRaft_CASLinearizableAcrossNodes (R1.d): two services share one Raft log; 100 concurrent CAS-increments produce final counter = 100, zero lost updates. - TestKVEventual_ConvergenceAcrossNodes (R1.d): disjoint Puts converge after one gossip pump round; deletes propagate as tombstones; wall-floor GC permits resurrection. The partition/rejoin catch-up rig (R1.b) lives under tests/proof/ (untracked, like the other chaos rigs) and verifies via /admin/raft/{status,configuration} that node-3 catches up after a SIGSTOP/SIGCONT cycle covering 50 CONSISTENT registrations against the surviving 2-voter quorum. * perf(cluster): cut idle CPU of pg/raft/gossip clusters Profiling a 60-node cluster showed the raft leader burning ~23% of its 800m CPU limit while idle — O(N) fan-out: the leader replicated raft AppendEntries to all 59 peers through a deep transport, with ~77% of the cost being non-IO overhead. Changes: - raft: heartbeat/election 1s->3s, commit 50ms->500ms (fewer idle AEs) - raft: cap membership to voters + a bounded standby pool, so the leader fans AppendEntries out to ~9 peers regardless of cluster size - internode: writeLoop drains the per-class queues directly — drop the second queue, the nodeControlLoop hop, and the redundant 10ms ticker - internode: emit the queue-depth gauge only when it changes - gossip: memberlist gossip interval 100ms->500ms Results (idle, fresh 60-pod k3d cluster, before -> after): - raft leader CPU: 184m -> 14m (23% -> 1.75% of the 800m limit) - followers: ~40m -> ~7-9m - cluster mean: ~40m -> 6m Verified: full `go test ./...` + -race green; golangci-lint clean; run_raft_over_mesh.sh and run_pg.sh PASS; raft quorum, leader failover and pg broadcast continuity intact; zero internode/pg drops. * perf(pg): per-group RCU snapshot, O(N) -> O(M_g) on join/leave Replace global stateSnapshot rebuilt on every mutation with per-group snapshots in sync.Map. State mutators mark a dirty set; publishDirty drains only touched groups at the end of each event-loop closure. At 100k groups populated, Join+Leave drops from 43.5ms to 94us (~463x), allocs from 41MB to 2.14KB. BenchmarkPGSnapshotGroup confirms O(M_g): M=1->34ns, M=100->1.06us, M=10k->208us. Cluster smoke (9-node docker) passes Phase 1: 89k joins / 30s, 100% in p99<=50ms bucket, 0 drops. Includes BenchmarkPG* suite + tests/proof/run_pg_bench.sh benchstat regression gate against tests/proof/pg_bench/baseline.txt. * perf(pg): batch broadcast across remotes, O(k*R) -> O(R) per JoinGroups broadcastJoin / broadcastLeave now take map[group][]pid.PID and send one packet per remote regardless of how many groups the call touches. The prior per-group fan-out emitted k*R packets for JoinGroups(k); now it emits R. encodeJoinsPayload also drops per-pid []any interface boxing in favor of map[string][]string. JoinGroups(1000) at R=16 peers: ~16k synthetic sends (~160ms theoretical) collapses to 16 actual sends, 724us measured -- ~220x. Single Join at R=64 peers: 53us vs ~640us theoretical (~12x). Adds near-flat scaling in R: jumping R=4 -> R=16 at k=1000 costs +2%. Bonus: local hot paths get -5% allocs and basal Join+Leave -33% (3.07us -> 2.06us) from the cheaper encoding. Handlers stay backward compatible with the legacy {group, pids} / {pids, groups} payload shapes for rolling upgrade. Adds BenchmarkPGBroadcast_SingleJoin (R sweep) and BenchmarkPGBroadcast_ Batch (R,k sweep) plus tests/proof/pg_bench/bench.broadcast.txt for benchstat. go test -race ./system/pg/... ./api/pg/... ./runtime/lua/ modules/pg/... clean. golangci-lint clean. * perf(pg): pool done-chan, skip OTel span when tracing disabled Two hot-path low-risk wins for Join/Leave/JoinGroups/LeaveGroups: 1. sync.Pool for the buffered cap-1 chan error used to signal the event-loop closure back to the caller. Acquired per call, released on the success or submit-failure path. The ctx-cancelled path does not release (the orphaned closure may still write later); that is a bounded leak since Stop is rare. 2. Tracing is now opt-in via an explicit non-nil TracerProvider passed to newTelemetry. When nil, startSpan returns a shared no-op span and every span-related allocation collapses to zero. The call sites gate trace.WithAttributes(...) construction behind t.tracing too, so the variadic SpanStartOption slice is no longer built for the noop case. Combined effect on the Go micro-bench (M4, no-tracer config): JoinLeave_Basal 2.06us -> 1.71us (-17%) JoinLeave_ManyGroups N=1000 4.08us -> 3.44us (-16%) Join_Parallel (10 prod) 2.83us -> 2.52us (-11%) Broadcast_SingleJoin R=1 3.98us -> 3.41us (-15%) Allocations halve on the common paths: JoinLeave_Basal 2080 B/op -> 1008 B/op (-52%) Join_Parallel 3120 B/op -> 2085 B/op (-33%) Broadcast_SingleJoin R=1 4236 B/op -> 3196 B/op (-25%) go test -race ./system/pg/... ./api/pg/... ./runtime/lua/modules/pg/... clean; golangci-lint clean. tests/proof/pg_bench/bench.lowrisk.txt captures the new baseline; benchstat against bench.broadcast.txt for the A/B. * feat(cluster,lua): move Membership context to api/cluster + expose system.node{,s} Two changes that close out the boot-local membership TODO and give Lua processes first-class access to cluster identity. api/cluster/context.go (new): WithMembership / GetMembership helpers keyed on cluster.membership in the AppContext. Replaces the boot-local membershipServiceKey + WithMembership pair that lived in boot/components/system/cluster.go and was marked // todo move to api. Because AppContext keys compare by pointer identity, the migration is atomic: the boot key is removed and all six readers (admin, kvraft, kveventual, eventualreg, pg, raft) switch to clusterapi.GetMembership in the same change. The two readers that need the concrete *membership.Service (kveventual, eventualreg, for RegisterUserDelegate and the sender/inventory adapters) keep their type assertion after resolving the interface. runtime/lua/modules/system/node.go (new): system.node.id() returns the local node id via relay.GetNode; system.nodes.list() returns [{id, is_local, addr, meta}] from cluster.GetMembership, de-duped against the local node, sorted local-first then by id. Both are gated by security.IsAllowed("system.read", ...). Wired into module.go with two new immutable subtables; module_test.go shape-checks them and node_test.go covers the success / no-relay / multi-node / permission matrix. Also confirms TestMultiplex_TwoNodes_UserBroadcastDelivers in cluster/membership is not flaky: 5 isolated runs all pass in 4.2-4.4s on this host. The 11s observed earlier was busy-host noise within the test's 30s context and 10s eventual-consistency wait. grep -r membershipServiceKey returns zero. go test -race + golangci-lint clean on the affected packages. * feat(cluster): bounded-raft foundation, STRONG/CONSISTENT name planes, system SDK Architecture: write plane (non-member -> derived raft member -> leader, NodeID over internode), distribute+route plane (leader-driven gossip of committed name->PID bindings to all nodes, LWW by raftIndex, local cache), STRONG composes all-node ack + join barrier + leader-reachability monitor on top. Write plane (#32): system/raft.DeriveMembers exports the existing pure selection pipeline; non-member forwards to a derived member, which re-forwards via authoritative Leader() with hop-cap 1, election-safe. All addressing is NodeID over the internode mesh. No raft_port, no raft_member marker, no leader-identity gossip. Distribute+route plane (#33): system/globalreg/dissem.go leader-driven UserDelegate (Kind 0xC1) on membership gossip; LWW by raftIndex. Local cache + cold-miss forward-resolve via a member; 30s anti-entropy; 15min wall-floor tombstone GC. JoinNameEpoch snapshot seeds CONSISTENT bindings. STRONG protocol (#20/#30/#31): all-node ack with local-non-presence attestation + per-node exclusion table (PENDING->ACTIVE persists, indexed release delivered to holders). Join-epoch barrier: leader snapshot of PENDING UNION ACTIVE; revoke local conflicts before name-ready; epoch-scoped acks. Leader-reachability monitor (debounced); gate closes on lost contact, re-barrier on regain. EVENTUAL CRDT (#21/#23/#24): per-origin ORSWOT; convergent join keyed by (Priority, FNV64(name, origin)) with NO counter/PID/wall (transitivity). Observed-remove tombstones; live beats cross-origin tombstone. NodeLeft reap via per-origin in-place tombstone, broadcast with true origin (prior local-origin stamping made the reap a no-op cluster-wide). Compact-ID intern table reclaims via free-list + refcount; bounded under ephemeral identities. nam…

- Keep api/net.ApplyOverlayPair as a deprecated thin wrapper over OverlayResolver so out-of-tree Go callers do not break (codex). - Network boot component fails loud (ErrFrameResolversMissing) instead of silently skipping registration when the registry is absent, so a wiring bug cannot degrade the overlay to clearnet (fable #1). - Remove the now-redundant SetMultiple(task.Context) from the wasm and lua Execute handlers: the function registry executor already applies task.Context to the same forked frame and is the sole caller (fable #3). - Drop the speculative FrameResolverOrderFSRoot constant and fs-root comment references; this PR migrates only the network overlay (fable #4). - Fix stale comments and the test section header (fable #5).

* refactor(runtime): generic frame-context resolver registry Frame-decorating options (the network overlay, and future ones like a filesystem root) were hand-wired into every dispatcher: the process manager and both the wasm and lua function lifecycles each called netapi.ApplyOverlayPair, leaking api/net into generic machinery and forcing every new option to touch every dispatcher. Introduce ctxapi.FrameResolvers: an ordered registry of (ctx, options) -> ([]Pair, error) resolvers, carried on the AppContext and populated once at boot. The two real choke points — the function registry executor (shared by lua and wasm) and the process manager Start — apply the whole set generically. Reads are lock-free (atomic snapshot, copy-on-write on register), mirroring the interceptor registry; the no-resolver path is ~1.4ns with zero allocations. Network migrates onto it: api/net.OverlayResolver replaces ApplyOverlayPair and is registered by the network boot component. The dispatchers no longer import api/net or api/fs; a new frame-context option is one resolver plus one Register line, with no dispatcher edits. * review: address codex + fable findings on frame resolvers - Keep api/net.ApplyOverlayPair as a deprecated thin wrapper over OverlayResolver so out-of-tree Go callers do not break (codex). - Network boot component fails loud (ErrFrameResolversMissing) instead of silently skipping registration when the registry is absent, so a wiring bug cannot degrade the overlay to clearnet (fable #1). - Remove the now-redundant SetMultiple(task.Context) from the wasm and lua Execute handlers: the function registry executor already applies task.Context to the same forked frame and is the sole caller (fable #3). - Drop the speculative FrameResolverOrderFSRoot constant and fs-root comment references; this PR migrates only the network overlay (fable #4). - Fix stale comments and the test section header (fable #5). * review: fail closed on missing frame resolver * review: harden frame resolver claims * test: isolate frame resolver claim tests

feature: VM pool

8d462e6

Signed-off-by: Valery Piashchynski <piashchynski.valery@gmail.com>

rustatian added the enhancement New feature or request label Oct 30, 2024

rustatian requested a review from wolfy-j October 30, 2024 22:24

rustatian self-assigned this Oct 30, 2024

rustatian changed the title ~~[DRAFT, DO NOT MERGE]: feature: VM pool~~ [EE2-1481, DRAFT, DO NOT MERGE]: feature: VM pool Oct 31, 2024

chore: finilize api

8b4bce2

Signed-off-by: Valery Piashchynski <piashchynski.valery@gmail.com>

rustatian changed the title ~~[EE2-1481, DRAFT, DO NOT MERGE]: feature: VM pool~~ [EE2-1481]: feature: VM pool Oct 31, 2024

chore: add CI

433ed1c

Signed-off-by: Valery Piashchynski <piashchynski.valery@gmail.com>

rustatian merged commit a88bfbb into main Oct 31, 2024

rustatian deleted the feature/vm-pool branch October 31, 2024 14:44

rustatian removed their assignment Jul 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[EE2-1481]: feature: VM pool#1

[EE2-1481]: feature: VM pool#1
rustatian merged 3 commits into
mainfrom
feature/vm-pool

rustatian commented Oct 30, 2024 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

rustatian commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rustatian commented Oct 30, 2024 •

edited

Loading