Skip to content

feat(rpcgrpc/mux-bw): surface pump-phase route failures as MuxRouteFailure events#2756

Merged
0pcom merged 1 commit into
skycoin:developfrom
0pcom:feat/mux-bw-route-failure-event
May 21, 2026
Merged

feat(rpcgrpc/mux-bw): surface pump-phase route failures as MuxRouteFailure events#2756
0pcom merged 1 commit into
skycoin:developfrom
0pcom:feat/mux-bw-route-failure-event

Conversation

@0pcom
Copy link
Copy Markdown
Collaborator

@0pcom 0pcom commented May 21, 2026

Summary

Until this PR, a route's pump goroutine in `muxBwPumpRoute` that hit an error from `PingOnceWithEcho` (typically `route_group closed` / `i/o timeout` / `no route` after the route was established) silently logged the error to the visor's debug stream and returned. Nothing surfaced on the gRPC stream. The operator's only signal was `MuxBandwidthSample.active_routes` decrementing with no attribution.

This made the 2026-05-21 mux-via-intermediates measurements ambiguous: routes would establish (visible in `MuxRouteEstablished` events), then 0 bytes flowed for the pump duration. Consumer had no way to see WHY. Distinguishing "established but pump never delivered a byte" from "delivered N bytes then died" matters — they imply different root causes (rules-not-propagated vs intermediate flake vs peer concurrency limit).

What's added

  • `MuxRouteFailure` proto with `route_index`, `error_message`, `bytes_sent_before_failure`, `bytes_received_before_failure`, `elapsed_ns` (time-since-pump-start)
  • `route_failure = 15` oneof variant on `MuxBandwidthEvent`
  • `buildRouteFailureEvent` helper — split from the inline emit so field-population is unit-testable without a full VisorAPI mock
  • Server emit: in `muxBwPumpRoute`, when `PingOnceWithEcho` returns err AND `ctx.Err() == nil` (i.e. not a clean shutdown), emit the event before exiting. The existing `s.log.Debugf` is preserved for visor-side audit.
  • Human-mode CLI: `[+Xs] R3 ✗ pump-failed before=N/M err=...` line
  • TUI render: flips per-route `established=false` + `failed=true` so the per-route table reflects the new state, plus an event-feed line
  • JSON-mode classifier: returns `"route_failure"` so the NDJSON envelope carries the new discriminator

Test plan

  • `go build ./...`
  • `go test ./pkg/visor/rpcgrpc/... ./cmd/skywire-cli/commands/visor/ping/...` — green
  • `TestBuildRouteFailureEvent` — 2 subtests pin field shapes for both the populates-everything case and the zero-bytes-first-call-failure case (dominant 2026-05-21 repro)
  • `make format` clean
  • e2e validation against a live mux-bw run with intermediates — gated on auto-deploy of this PR + Fix dmsgpty #117/[M2] Multiple errors on visor restart #122 reaching the test visors

Operator impact

  • `cli visor ping mux-bw --routes N --min-hops 2`: when a pump fails, the consumer now sees `route_failure` events with the cause, not just a silent `active_routes` decrement
  • NDJSON consumers (Gamma's treeprobe-style harnesses): new `"type":"route_failure"` events with int64-string fields per protobuf convention

Refs the 2026-05-21 mux-via-intermediates investigation. Co-claim with Gamma's slice (b) — folded into one event type covering both setup-time and pump-time failures (well, pump-time; setup-time stays on the existing `MuxRouteEstablished{failed:true}` path).

…ilure events

Until this PR, when a route's pump goroutine in muxBwPumpRoute hit an
error from PingOnceWithEcho (typically "route_group closed", "i/o
timeout", or "no route" after the route was established), the error
was silently logged at debug level via s.log.Debugf and the goroutine
returned. Nothing surfaced on the gRPC stream. The operator's only
signal was MuxBandwidthSample.active_routes dropping by 1 with no
attribution.

This made the 2026-05-21 mux-via-intermediates measurement impossible
to interpret cleanly: routes would establish successfully (visible in
MuxRouteEstablished events), then 0 bytes would flow for the pump
duration, and the consumer had no way to see WHY. Distinguishing
"established but pump never delivered a byte" from "delivered N bytes
then died" matters because they imply different root causes (rules
not propagated vs intermediate flake vs peer-side concurrency).

Adds:
- MuxRouteFailure proto message with route_index, error_message,
  bytes_sent_before_failure, bytes_received_before_failure, and
  elapsed_ns (time-since-pump-start).
- New oneof variant route_failure = 15 on MuxBandwidthEvent.
- buildRouteFailureEvent helper that constructs the proto from the
  pump-route state — split out from the inline emit site so the
  field-population is unit-testable without a full VisorAPI mock.
- Emit site in muxBwPumpRoute: when PingOnceWithEcho returns err
  AND ctx.Err() is nil (i.e. not a clean shutdown), emit the event
  before exiting. Keeps the s.log.Debugf for visor-side audit too.
- Human-mode CLI renderer: "[+Xs] R3 ✗ pump-failed before=N/M err=..."
- TUI renderer: flips route's established=false + failed=true so the
  per-route table reflects the new state, plus an event-feed entry.
- JSON-mode classifier returns "route_failure" so the NDJSON envelope
  carries the new type discriminator.

Test: TestBuildRouteFailureEvent pins the field shapes against both
the populates-everything case and the zero-bytes-first-call-failure
case (the dominant repro on the 2026-05-21 measurement).

Refs the 2026-05-21 mux-via-intermediates investigation; Gamma's
slice (b) "established-then-died distinct event" + Beta's slice (a)
"surface pump errors" landed together as one event type.
@0pcom
Copy link
Copy Markdown
Collaborator Author

0pcom commented May 21, 2026

LGTM (Gamma review — slice (b) author).

The MuxRouteFailure event closes the observability gap we hit empirically last night — the "established but no bytes" case in N=8 --min-hops 2 runs was invisible without this. Folding (b) into the same event is the right call: pump-died and established-then-died are the same logical thing, just with bytes_sent_before_failure == 0 vs > 0.

Specific things I like:

  • Helper buildRouteFailureEvent split out for unit testability — avoids needing a VisorAPI mock.
  • The ctx.Err() == nil guard correctly skips emit on clean pump-duration shutdown (otherwise every successful run would emit a spurious failure when pump deadline fires mid-flight in PingOnceWithEcho).
  • Test for both first-call-zero-bytes and bytes-then-died cases covers the two distinct operator-attribution scenarios.
  • CLI human-mode renders inline; NDJSON discriminator route_failure keeps consumer parsers clean.

Nit (not blocking): elapsed_ns is computed inside buildRouteFailureEvent via time.Since(pumpStart), which adds a few µs for the call itself. For diagnostic precision this is sub-ms noise vs second-scale pump windows — fine. If a future consumer wants strict ordering between sample.elapsed_ns and route_failure.elapsed_ns, capturing the elapsed at the err-branch entry would tighten that.

Ready to merge when CI greens.

@0pcom 0pcom merged commit c27575e into skycoin:develop May 21, 2026
4 checks passed
0pcom added a commit that referenced this pull request May 21, 2026
… + MinHops (#2757)

Beta's #2756 MuxRouteFailure event surfaced the smoking gun on
mux-bw --routes N --min-hops 2: one route would establish but its
pump loop immediately failed with

  no ping connection for <pk>#0, call DialPing first

while MuxRouteEstablished named route_index = 2 for the same
route. The lookup PingRouteRef.Index was 0 even though the
pump goroutine had RouteIndex = 2.

ROOT CAUSE
The rpcgrpc PingConf → visor PingConfig adapter in
pkg/visor/init_apps.go's visorPingAdapter forwards RouteIndex
correctly for DialPing (line 238) and PingOnce (line 249), but
PingOnceWithEcho's adapter (lines 316-322) was missing the
field — same for MinHops. Aux-route pumps therefore degraded to
a primary-route (Index=0) lookup, finding nothing because
DialPing had registered the conn at the matching aux Index.

FIX
Add `RouteIndex: conf.RouteIndex` and `MinHops: conf.MinHops` to
the visorPingAdapter.PingOnceWithEcho conversion. Three-line
adapter parity fix.

DMSG adapter (DmsgPingOnceWithEcho) intentionally untouched —
v.dmsgPing.conns is keyed by PK alone, not by PingRouteRef, so
no DMSG path consumes RouteIndex.

TEST
TestPingAdapter_PingOnceWithEcho_ForwardsRouteIndex pins the
adapter contract: with no ping connection registered, the visor's
PingOnceWithEcho returns

  no ping connection for %s#%d, call DialPing first

The "%d" portion is conf.RouteIndex post-adapter. The test calls
the adapter with RouteIndex in {0, 1, 2, 7} and asserts the error
message contains the matching `#<idx>,` — so a regression that
drops RouteIndex (or stuffs in MinHops by accident) surfaces in
the error string directly. No mock VisorAPI, no fixtures.

EMPIRICAL CHAIN
This is the third bug in the mux-bw --min-hops measurement chain
to surface via the wire-event observability landed in #2746#2749#2751#2752#2750#2753#2754#2756. The pattern
holds: each event-surface fix lights up the next bug downstream.
The operator's "mux > direct" hypothesis test should now finally
be measurable end-to-end once #2756 (route_failure event) +
this PR auto-deploy.
0pcom added a commit that referenced this pull request May 21, 2026
* fix(visor/ping): narrow ping.mu/dmsgPing.mu critical section to map lookup

DOMINANT BOTTLENECK for mux-bw bandwidth measurements. The visor's
Ping/PingOnce/PingOnceWithEcho (and dmsg twins) held v.ping.mu (a
single visor-global *sync.Mutex) for the ENTIRE wire roundtrip:

  v.ping.mu.Lock()
  defer v.ping.mu.Unlock()
  pingEntry, ok := v.ping.conns[ref]
  // ... ~287ms of wire I/O at 2-hop with 32 KB payloads ...

mux-bw's N pump goroutines all call PingOnceWithEcho on DIFFERENT
PingRouteRefs. They each look up their OWN conn via the map; the
wire I/O is independent. But the global mutex serialized them
through one ~287ms slot each. So:

  - Aggregate throughput across N routes didn't scale with N
  - Per-route avg pinned at ~351 kbps even though single-call peak
    was ~1.7 Mbps (1 RTT × 32 KB)
  - --probe-rtt latency probes during a loaded pump measured
    "probe-mutex-wait + network RTT" instead of network RTT,
    swamping the queueing-delay signal at short hop counts
  - Bidirectional simultaneous mux-bw measurements showed
    mutual-starvation that LOOKED like shared-link contention
    but was actually mutex contention on each side's ping state

ROOT CAUSE

The mutex's actual job is to protect v.ping.conns (the map) from
concurrent insert (DialPing) and delete (StopPingRoute). The
wire I/O on the chosen conn does NOT need the map mutex held —
each mux-bw pump goroutine owns its own conn via its RouteIndex,
no aliasing.

FIX

Shrink the critical section to just the map lookup:

  v.ping.mu.Lock()
  pingEntry, ok := v.ping.conns[ref]
  v.ping.mu.Unlock()
  if !ok { ... }
  // ... wire I/O on pingEntry.conn WITHOUT holding the mutex ...

Applied to Ping, PingOnce, PingOnceWithEcho, DmsgPing,
DmsgPingOnce, DmsgPingOnceWithEcho. BandwidthTest already had the
correct narrow scope.

CONCURRENT-CLOSE SEMANTICS

Pre-fix: StopPing concurrent with PingOnceWithEcho serialized via
the mutex — they took turns, no race. Post-fix: StopPing can close
the conn while PingOnceWithEcho is doing wire I/O. The Read/Write
on the closed conn returns ErrClosed cleanly. mux-bw's pump loop
already handles read/write errors by exiting the pump goroutine;
the resulting failure is surfaced via Beta's MuxRouteFailure event
(#2756) so the operator sees the cause instead of an indefinite
block.

The same-PingRouteRef-from-multiple-goroutines case (always
undefined behavior on the underlying net.Conn) is unchanged —
callers must serialize themselves. mux-bw enforces one
goroutine per RouteIndex natively.

TESTS

pkg/visor/ping_mu_concurrency_test.go:
  - TestPingOnceWithEcho_DoesNotSerializeAcrossRouteIndexes:
    200 concurrent calls with distinct PingRouteRefs (no registered
    conns) complete in << 1s. A regression that re-introduces
    wire-I/O-under-lock would either time out or take orders of
    magnitude longer.
  - TestPingMu_NotHeldDuringConnAbsentCallpath: after
    PingOnceWithEcho returns, the mutex must be immediately
    acquirable from another goroutine. Catches the defer-on-entry
    pattern directly.

EMPIRICAL PREDICTION

Once this auto-deploys, the operator's "mux > direct" hypothesis
becomes testable WITHOUT --hops-via intermediate pinning. Per-
route avg should rise from ~351 kbps toward the single-call peak
of ~1.7 Mbps, and N=2..8 disjoint mux should aggregate roughly
linearly (modulo per-intermediate quality variance) instead of
flat-lining at single-route throughput.

CHAIN

The mux-bw measurement loop has now closed:
  #2745 per-route teardown
  #2746 disjoint-intermediate routing
  #2749 plumb --min-hops through DialPing
  #2750 stop poisoning dst breaker from intermediate failures
  #2751 tryDirectPingDial gate on MinHops
  #2752 honor caller SetupTimeout
  #2756 MuxRouteFailure pump-phase event
  #2757 PingOnceWithEcho adapter forwards RouteIndex
  this PR: ping.mu doesn't serialize parallel pumps

If the mux > direct hypothesis is real, it should be visible
in measurements after this lands.

* fix(visor/ping): errcheck on discarded PingOnceWithEcho returns in test

golangci-lint errcheck flagged both test sites that discard the
4-tuple return of v.PingOnceWithEcho via _, _, _, _. The discards
are intentional — the concurrency test asserts on wall-clock
serialization behavior, not on per-call success; the mutex-release
test asserts that the lock is acquirable post-return regardless of
whether the call itself succeeded or returned ErrNoPingConnection.

Add //nolint:errcheck comments explaining the intent. No behavior
change; CI lint pass.
0pcom added a commit that referenced this pull request May 21, 2026
…ssthrough (#2762)

Two related fixes that emerged from the post-#2761 measurement work:

1. SILENT PUMP STALL (#127) — pkg/visor/api_ping.go

   PingOnceWithEcho set read deadlines (30s) but NO write deadlines.
   If the underlying TCP send buffer filled (route under load, slow
   intermediate, etc.) conn.Write would block indefinitely. Empirical
   symptom: mux-bw runs with route_established success + 0 bytes
   pumped + 0 MuxRouteFailure events emitted, because the pump
   goroutine was stuck inside Write rather than returning with an
   error.

   In my Phase 6 sweep against Gamma post-#2761: 10 of 12 cells
   showed exactly this pattern (idle baseline n=40-46 probes,
   loaded n=0-2, sent=recv=0B for the full pump duration).

   Fix: SetWriteDeadline mirrors the SetReadDeadline calls already
   in place (30s). On deadline, Write returns net.ErrDeadlineExceeded
   and the pump's existing error-return path fires a MuxRouteFailure
   event (#2756) — making the stall observable instead of silent.

2. PROBE HARDCODED RouteIndex 0 (#130) — Gamma's find at 20:20Z

   muxBwProbeLoop's probeConf didn't set RouteIndex, so every probe
   used PingRouteRef{PK, Index: 0}. When N>1 and the first
   established route was at a non-zero index (Gamma's Phase D:
   only R2 of N=8 came up), every probe failed with "no ping
   connection". Same class as #2757's adapter bug but in the probe
   path.

   Fix: muxBwProbeLoop takes a routeIndex parameter. Both callers
   (idle baseline + loaded probe) pick the first established route
   and pass its index. The idle phase needs the same selection
   logic as the loaded phase — previously it just ran the loop
   blindly, which would also hit RouteIndex 0 even when no
   route 0 had been established.

For #127, the deferred SetReadDeadline reset paths gain the
matching SetWriteDeadline reset so the conn returns to its
deadline-free state on all return paths.

For #130, the loaded-phase probe selection logic at line 257-265
is preserved verbatim; the only change is passing probeTarget.index
into the goroutine. The idle-phase selection gets new "find first
established route" logic that mirrors the loaded-phase pattern.

Build / gofmt / golangci-lint clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant