feat(state): Phase 9 — per-node telemetry, cluster aggregator, routing feedback #83
Merged
…g feedback

Add state_node_telemetry with EWMA (configurable half-life), power-of-2 latency histogram (22 buckets, O(1) record, p50/p90/p99), telemetry_snapshot POD struct (35+ fields), JSON serialize/deserialize, and OOB bus publishing on __telemetry.<cluster>.<node> topics.

Add state_telemetry_aggregator: subscribes to __telemetry.<cluster_id> on the OOB bus, ingests peer snapshots, computes cluster_telemetry_summary (aggregated counters, max latencies, summed rates), stale-peer detection (configurable threshold, default 5 s), evaluate_routing_feedback(policy) returning isolate/release node lists based on p99 latency and outbox drop thresholds.

Extend state_distributed_admin with attach_telemetry(), telemetry() accessor, and to_text() telemetry section.

Tests: 18 cases in state_node_telemetry_test (EWMA, histogram, JSON round-trip, sampling, latency recording, rate computation, publish/subscribe), 13 cases in state_telemetry_aggregator_test (aggregation, stale detection, routing feedback, admin integration). All pass.
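For readers unfamiliar with the power-of-2 histogram mentioned above, the following is a minimal sketch of how a 22-bucket histogram with O(1) recording and percentile queries could look. It is not the code from this PR: the type name `latency_histogram`, the bucket layout (bucket 0 for zero, an overflow bucket for values at or above 2^20), and the choice to report a bucket's inclusive upper bound are all assumptions made for illustration. `__builtin_clzll` is a GCC/Clang builtin.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a power-of-2 latency histogram; names and bucket
// layout are assumptions, not the PR's actual implementation.
struct latency_histogram {
    static constexpr std::size_t kBuckets = 22;
    std::array<std::uint64_t, kBuckets> counts{};
    std::uint64_t total = 0;

    // Bucket 0 holds the value 0; bucket i (1..20) holds [2^(i-1), 2^i);
    // bucket 21 holds everything >= 2^20.
    static std::size_t bucket_for(std::uint64_t v) {
        if (v == 0) return 0;  // __builtin_clzll(0) is undefined
        // 64 - clz(v) is the bit width of v, i.e. floor(log2(v)) + 1.
        std::size_t idx = 64u - static_cast<std::size_t>(__builtin_clzll(v));
        return idx < kBuckets ? idx : kBuckets - 1;
    }

    // O(1): one bit-scan, two increments.
    void record(std::uint64_t v) {
        ++counts[bucket_for(v)];
        ++total;
    }

    // Returns the inclusive upper bound of the bucket containing the
    // p-quantile (0 < p <= 1). The overflow bucket has no upper bound,
    // so its lower bound (2^20) is reported instead.
    std::uint64_t percentile(double p) const {
        assert(p > 0.0 && p <= 1.0);
        if (total == 0) return 0;
        const double exact = p * static_cast<double>(total);
        std::uint64_t rank = static_cast<std::uint64_t>(exact);
        if (static_cast<double>(rank) < exact) ++rank;  // ceil
        if (rank == 0) rank = 1;
        std::uint64_t seen = 0;
        for (std::size_t i = 0; i < kBuckets; ++i) {
            seen += counts[i];
            if (seen >= rank) {
                if (i == 0) return 0;
                if (i == kBuckets - 1) return std::uint64_t{1} << (kBuckets - 2);
                return (std::uint64_t{1} << i) - 1;
            }
        }
        return std::uint64_t{1} << (kBuckets - 2);
    }
};
```

The trade-off in this kind of design is resolution for speed: a percentile is only accurate to within a factor of two, but recording never allocates or searches, and the whole histogram is a fixed 22-counter array that is cheap to copy into a snapshot.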
Replace hyphenated node identifiers (node-1, node-2, node-3) with valid C identifiers (node1, node2, node3) to comply with the isValidStateName() validation added in Phase 7.
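To illustrate why `node-1` fails while `node1` passes, here is a rough sketch of a C-identifier check. The function `is_c_identifier` is a hypothetical stand-in: the real `isValidStateName()` from Phase 7 is not shown in this PR and may apply additional rules (length limits, reserved names, and so on).

```cpp
#include <string_view>

// Hypothetical stand-in for the Phase 7 validation: accepts only names of
// the form [A-Za-z_][A-Za-z0-9_]*. The real isValidStateName() may differ.
inline bool is_c_identifier(std::string_view name) {
    if (name.empty()) return false;
    auto is_alpha = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    };
    auto is_digit = [](char c) { return c >= '0' && c <= '9'; };
    if (!is_alpha(name.front())) return false;  // must not start with a digit
    for (char c : name.substr(1)) {
        if (!is_alpha(c) && !is_digit(c)) return false;  // rejects '-', '.', etc.
    }
    return true;
}
```

Under this rule the hyphen in `node-1` is the offending character, which is why the test fixtures were renamed rather than the validation relaxed.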
Summary
Implements the core of Phase 9: Network Analytics & Live Telemetry from the distributed state roadmap.
New classes
- `state_node_telemetry`: per-node sampler with:
  - EWMA (configurable half-life)
  - Power-of-2 latency histogram (22 buckets, O(1) record via `__builtin_clzll`, p50/p90/p99 queries)
  - `telemetry_snapshot` POD struct (35+ fields: counters, rates, latency percentiles, tree/cluster shape)
  - `sample()` reads attached shard, transport, and bus counters
  - `publish_snapshot()` serializes to JSON and publishes via OOB bus on `__telemetry.<cluster>.<node>` topics
- `state_telemetry_aggregator`: cluster-level rollup:
  - Subscribes to `__telemetry.<cluster_id>` on the OOB bus via `attach_bus()`
  - Ingests peer snapshots and computes `cluster_telemetry_summary` (aggregated counters, max latencies, summed rates)
  - Stale-peer detection (configurable threshold, default 5 s)
  - `evaluate_routing_feedback(policy)` returns isolate/release node lists based on p99 latency and outbox drop thresholds
  - `to_text()` for human-readable cluster telemetry dumps

Modified classes
- `state_distributed_admin`: `attach_telemetry()`, `telemetry()` accessor, `to_text()` now appends a `[telemetry]` section when an aggregator is attached

Tests
- `state_node_telemetry_test`: 18 test cases (EWMA convergence, histogram percentiles, JSON round-trip, sampling with shard/transport/bus, latency recording, rate computation, publish/subscribe via bus)
- `state_telemetry_aggregator_test`: 13 test cases (aggregation, stale detection, routing feedback isolation/release, admin integration)

C-identifier compliance (67ae6f8)
- Replace hyphenated node identifiers (`node-1`, `node-2`, `node-3`) with valid C identifiers (`node1`, `node2`, `node3`) in both test files to comply with the `isValidStateName()` validation from Phase 7.

Remaining Phase 9 work (future PRs)