fix(visor/router): plumb mux-bw --min-hops through DialPing → router #2749
Merged
Gamma's diagnostic on 2026-05-20 found that --min-hops on `cli visor ping mux-bw` was a no-op: the visor journal during a run showed all N parallel routes using the same direct stcpr transport. The +24/42/69% bandwidth gains we measured at N=2/4/8 were smux-stream parallelism over ONE transport, not the intermediate-route diversity the operator's hypothesis test intended.

The TODO comment at server_mux_bandwidth.go:290 in skycoin#2737 documented this gap; it was deferred and never landed.

ROOT CAUSE

The chain visor.PingConfig → rpcgrpc.PingConf → appnet.PingContext → router.DialRoutes had no per-call MinHops field. mux-bw set req.MinHops at the proto layer, but the value died at the rpcgrpc.PingConf boundary because there was no corresponding field, and the route layer's only MinHops source was the visor-global routing.min_hops config (typically 1).

Worse, router.DialRoutes contains an explicit fast-path downgrade: when a direct transport to the destination exists, r.conf.MinHops is temporarily set to 1 for the duration of the dial. So even with visor-global min_hops=2, a peer with a direct stcpr would still get routed direct.

FIX (this PR)

pkg/router/router.go
  + DialOptions.MinHops int
    Per-call min-hops constraint. > 0 = override Config.MinHops for THIS dial only; 0 = inherit.

pkg/router/router_dial.go
  - DialRoutes:
    * The direct-transport fast-path downgrade (`if isTpdExist(rPK) { conf.MinHops = 1 }`) now only fires when opts.MinHops <= 1. A caller passing opts.MinHops >= 2 has explicitly demanded a multi-hop path; silently routing direct would defeat the constraint.
  - fetchBestRoutes:
    * rfclient.RouteOptions{MinHops:...} uses opts.MinHops in preference to r.conf.MinHops when > 0.
  - calculateLocalRoutes:
    * Skip the 1-hop (direct) probe when dialOpts.MinHops > 1. The intermediate route search below becomes the only candidate generator.

pkg/app/appnet/networker.go
  + PingContextWithMinHops(ctx, pk, addr, minHops) helper.
    Resolves the skynet networker, builds DialOptions with MinHops, calls PingContextWithOpts. Falls back to plain PingContext when networker isn't SkywireNetworker or minHops<=0 (same pattern as PingContextWithTransport / PingContextWithRoute).

pkg/visor/api.go
  + PingConfig.MinHops int

pkg/visor/api_ping.go
  - DialPing's dial branch now picks the right call:
    * explicit ForwardHops/ReverseHops → PingContextWithRoute
    * explicit TransportID → PingContextWithTransport
    * MinHops > 0 → PingContextWithMinHops (NEW)
    * default → PingContext

pkg/visor/init_apps.go
  - visorPingAdapter.DialPing / PingOnce pass conf.MinHops through the rpcgrpc → visor boundary.

pkg/visor/rpcgrpc/server.go
  + PingConf.MinHops int (mirrors visor.PingConfig.MinHops)

pkg/visor/rpcgrpc/server_mux_bandwidth.go
  - muxBwSetupRoute populates PingConf.MinHops from cfg.MinHops (which was already pulled from req.MinHops; just no path to the dial layer existed before).

The fix does NOT address Gamma's second finding: the "destination circuit breaker open" error blocking route setup through intermediates. That's their (B) investigation slot; my (A) plumbing makes the constraint reach the router, theirs makes the route setup actually complete.

EFFECT POST-MERGE

`cli visor ping mux-bw <peer> --routes 4 --min-hops 2` will:
  - dial 4 routes, each with DialOptions.MinHops=2
  - the route-finder is asked for paths with ≥2 hops
  - the direct-transport downgrade is suppressed
  - the visor journal logs should show each route picking a different intermediate (subject to circuit-breaker issues Gamma is investigating separately)

When the circuit-breaker issue is also resolved, the operator's "mux via intermediates > direct" hypothesis can be cleanly tested for the first time.

No new unit tests in this PR: the plumbing fix is verified end-to-end by running mux-bw against a peer and inspecting the visor journal for distinct stcpr transports per route. Existing visor + router test suites pass.

Build / gofmt / golangci-lint clean.
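The override rule described in the FIX section can be sketched in isolation. This is a minimal illustration, not the code from pkg/router: the `DialOptions` struct here keeps only the `MinHops` field, and `effectiveMinHops` and `allowDirectDowngrade` are hypothetical helper names introduced for this example.

```go
package main

import "fmt"

// DialOptions mirrors only the field discussed above; the real struct in
// pkg/router carries other fields that this sketch omits.
type DialOptions struct {
	MinHops int // > 0 overrides the config for this dial only; 0 inherits
}

// effectiveMinHops (hypothetical helper) picks the per-call value when it is
// set and falls back to the visor-global default otherwise.
func effectiveMinHops(confMinHops int, opts DialOptions) int {
	if opts.MinHops > 0 {
		return opts.MinHops
	}
	return confMinHops
}

// allowDirectDowngrade (hypothetical helper) reports whether the
// direct-transport fast path may force a 1-hop route for this dial.
func allowDirectDowngrade(directExists bool, opts DialOptions) bool {
	return directExists && opts.MinHops <= 1
}

func main() {
	inherit := DialOptions{}
	multiHop := DialOptions{MinHops: 2}

	fmt.Println(effectiveMinHops(1, inherit))         // config value is used
	fmt.Println(effectiveMinHops(1, multiHop))        // per-call value is used
	fmt.Println(allowDirectDowngrade(true, inherit))  // fast path allowed
	fmt.Println(allowDirectDowngrade(true, multiHop)) // fast path suppressed
}
```

Keeping 0 as "inherit" means callers that never set the field see no behavior change, which matches the intent stated for DialOptions.MinHops.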
This was referenced May 20, 2026
0pcom added a commit that referenced this pull request on May 20, 2026
…2751)

#2749 plumbed MinHops through visor.PingConfig → rpcgrpc.PingConf → appnet.PingContextWithMinHops → router.DialOptions.MinHops, and router.DialRoutes correctly suppresses its direct-transport fast-path downgrade when opts.MinHops > 1.

But there's an earlier shortcut in pkg/app/appnet/skywire_networker.go:359 (PingContextWithOpts) that bypasses router.DialRoutes entirely:

    if directConn, ok := r.tryDirectPingDial(addr, opts); ok {
        return &SkywireConn{
            Conn:     directConn,
            freePort: freePort,
        }, nil // <-- nrg=nil, RouteHopDetails() empty
    }
    conn, err := r.r.PingRoute(ctx, pk, ..., opts)

tryDirectPingDial dials the destination's appDirectMux directly, returning a SkywireConn with no NoiseRouteGroup attached. So:

1. Even with opts.MinHops >= 2, when the destination has a direct transport reachable via the appDirectMux, the dial succeeds via the shortcut and the MinHops constraint is silently defeated: the same class of bug Gamma diagnosed pre-#2749, just one layer up.

2. The returned SkywireConn has nrg=nil. That's why MuxRouteEstablished.hops is empty in Gamma's post-#2749 run even though aggregate throughput appeared to scale: routes were still going direct, MuxBandwidthDone reflected the smux-stream parallelism over one transport, and consumers inspecting the chosen path saw nothing because SkywireConn.RouteHopDetails() falls through nrg.

FIX: skip tryDirectPingDial when opts.MinHops > 1.

    if opts.MinHops <= 1 {
        if directConn, ok := r.tryDirectPingDial(addr, opts); ok { ... }
    }
    // Falls through to r.r.PingRoute, which returns a real nrg.

EFFECT POST-MERGE

`mux-bw --routes N --min-hops 2` actually routes through intermediates end-to-end:
  - tryDirectPingDial skipped
  - r.r.PingRoute → router.DialRoutes with opts.MinHops=2
  - returned conn is a *router.NoiseRouteGroup wrapped in SkywireConn{nrg: ...}
  - SetForwardHops was called by router_dial.go:186 after route setup, so nrg.forwardHops is populated
  - SkywireConn.RouteHopDetails() returns the hops
  - visor.GetPingRouteDetailsAt(ref) returns the hops
  - MuxRouteEstablished.hops is populated

Gamma's hops_count=0 observation should disappear post-merge, and the operator's hypothesis test can finally use the actual chosen intermediates to verify route diversity from the wire.

Same single-line guard (opts.MinHops <= 1): no proto change, no new API, no behavior change for callers that don't set MinHops.

Build / golangci-lint clean.
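Why the shortcut produced an empty hop list can be shown with a toy model. The sketch below is an assumption-laden stand-in: `conn`, `routeGroup`, and `dial` are hypothetical types and functions, and the hop names are placeholders, not values from a real run.

```go
package main

import "fmt"

// routeGroup stands in for the route group that carries hop details.
type routeGroup struct {
	forwardHops []string
}

// conn stands in for a ping connection; nrg is nil on the direct shortcut.
type conn struct {
	nrg *routeGroup
}

// RouteHopDetails returns no hops when there is no route group attached.
func (c *conn) RouteHopDetails() []string {
	if c.nrg == nil {
		return nil
	}
	return c.nrg.forwardHops
}

// dial models the gate: the direct shortcut is only taken when the caller
// did not demand a multi-hop path.
func dial(minHops int, directAvailable bool) *conn {
	if minHops <= 1 && directAvailable {
		return &conn{} // direct shortcut: no route group attached
	}
	// Placeholder hops; a real route would come from route setup.
	return &conn{nrg: &routeGroup{forwardHops: []string{"intermediate", "destination"}}}
}

func main() {
	fmt.Println(len(dial(0, true).RouteHopDetails())) // shortcut taken, no hop details
	fmt.Println(len(dial(2, true).RouteHopDetails())) // shortcut skipped, hops present
}
```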
0pcom added a commit that referenced this pull request on May 20, 2026
…a hop (#2750)

Disjoint-mux scenario (N parallel routes through N different intermediates) exposed a circuit-breaker attribution bug: when an intermediate hop failed to dial during id_reservation, the destination's breaker accumulated the hit. Five bad intermediates tripped the destination's breaker even though the destination was reachable via the other N-5 healthy intermediates. All subsequent setup attempts to the dst then fast-failed for circuitOpenDuration.

Symmetric to the source-side fix already in place (lines 423-426 pre-change): source-side dial failures don't poison the dst breaker either. Intermediates needed the same treatment.

Changes

  - pkg/router/setupmetrics/stats.go: in finish() default branch (intermediate hop failed), set reason=ReasonIntermediateUnreachable, blameDst=false, and record the breaker hit against the intermediate's PK (not the dst's). New AllowIntermediate accessor mirrors AllowDestination; both share an allowPK helper. Adds ReasonIntermediateUnreachable FailureReason.

  - pkg/router/setupnode.go: CreateRouteGroup also consults AllowIntermediate for each hop on the forward path. A known-bad intermediate now short-circuits the setup with the same ErrCircuitOpen sentinel + the intermediate's PK in the reason string, saving the ~10s id_reservation timeout.

  - pkg/router/setupmetrics/stats_test.go: 2 new tests. TestCollector_CircuitBreaker_IntermediateUnreachable asserts that 5+ intermediate-fails do NOT trip the dst breaker, DO trip the intermediate's breaker, are reclassified to intermediate_unreachable, and don't blame the dst Failed counter. TestCollector_CircuitBreaker_IntermediateBreakerNotPoisoningDst asserts the asymmetry that makes disjoint-mux work: bad intermediate denied, good intermediate + dst still allowed.

Empirical motivation

Captured in PR #2746 followup investigation: mux-bw --routes 8 --min-hops 2 against Beta failed all 8 routes with "destination circuit breaker open" despite route-finder returning 7+ valid candidates through different intermediates. Each route's intermediate-dial failure ticked the dst breaker; once 5 in a 5-minute window had hit, the dst was locked out and the remaining ~3 healthy candidates were never even tried.

Pairs with #2746 (disjoint mux) and #2749 (--min-hops plumbing): once mux-bw actually routes through intermediates, intermediate flakiness no longer destroys the dst breaker semantics.
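The attribution change can be illustrated with a per-PK failure counter. This is a simplified sketch under stated assumptions: the `breakers` type and its methods are hypothetical, the threshold of 5 comes from the description above, and the 5-minute window and circuitOpenDuration are deliberately left out.

```go
package main

import "fmt"

// tripThreshold follows the "5 hits" figure from the description; the time
// window and the open-duration timer are omitted in this sketch.
const tripThreshold = 5

// breakers counts setup failures per public key (hypothetical type).
type breakers struct {
	hits map[string]int
}

func newBreakers() *breakers {
	return &breakers{hits: make(map[string]int)}
}

// recordFailure charges the failure to the hop that actually failed.
func (b *breakers) recordFailure(failedHopPK string) {
	b.hits[failedHopPK]++
}

// allow reports whether setup through (or to) pk may still be attempted.
func (b *breakers) allow(pk string) bool {
	return b.hits[pk] < tripThreshold
}

func main() {
	b := newBreakers()
	for i := 0; i < tripThreshold; i++ {
		b.recordFailure("bad-intermediate") // charged to the intermediate, not the dst
	}
	fmt.Println(b.allow("bad-intermediate"))  // breaker open for the failing hop
	fmt.Println(b.allow("good-intermediate")) // other intermediates unaffected
	fmt.Println(b.allow("dst"))               // destination breaker stays closed
}
```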
0pcom added a commit that referenced this pull request on May 21, 2026
The local route-calc fallback in calculateLocalRoutes only knows how to build 1-hop (direct) and 2-hop routes. When the global route-finder fails to produce a path (transient timeout, stale TPD, ErrNoSuitableTransport), DialRoutes falls back to calculateLocalRoutes, which would silently return a 2-hop route even when the caller demanded min-hops >= 3.

Empirically reproduced:

    cli visor ping mux-bw <peer> --min-hops 3  → hops_count=2
    cli visor ping mux-bw <peer> --min-hops 4  → hops_count=2
    cli visor ping mux-bw <peer> --min-hops 5  → hops_count=5 (correct)

The route-finder service does honor min-hops correctly when queried directly (`cli route find <peer> -n 3 -x 3` returns proper 3-hop paths). The bug is in the local fallback: it returns whatever 2-hop path it can build, regardless of opts.

FIX: short-circuit calculateLocalRoutes with an error when dialOpts.MinHops > 2. The router-level retry loop above will then either re-attempt the route-finder query or return the error to the caller, both of which are correct behaviors when the caller explicitly demands a multi-hop path the local cache can't satisfy.

Effect: `mux-bw --min-hops N` with N >= 3 either gets a route that actually has >= N hops (when the route-finder service is healthy) or fails cleanly with a useful error: no more silent constraint violations.

Build / gofmt / golangci-lint clean. The earlier MinHops > 1 guard (suppressing the direct-1-hop probe) added in #2749 is unchanged; this PR layers a second guard for the 2-hop case.
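The two layered guards can be sketched as a small decision function. This is an illustrative model only: the function below shares a name with the real one but has a hypothetical signature (it returns a hop count instead of routes), and the error values are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// maxLocalHops reflects the statement above that the local fallback can
// only build 1-hop and 2-hop routes.
const maxLocalHops = 2

// errLocalCannotSatisfy is a hypothetical sentinel for this sketch.
var errLocalCannotSatisfy = errors.New("local route calculation cannot satisfy min-hops > 2")

// calculateLocalRoutes (hypothetical signature) returns the hop count of the
// route the local fallback would pick, or an error when it cannot honor the
// caller's constraint.
func calculateLocalRoutes(minHops int, haveDirect, haveTwoHop bool) (int, error) {
	if minHops > maxLocalHops {
		return 0, errLocalCannotSatisfy // second guard: fail instead of under-delivering
	}
	if minHops <= 1 && haveDirect {
		return 1, nil // direct probe only when the caller allows 1 hop
	}
	if haveTwoHop {
		return 2, nil
	}
	return 0, errors.New("no local route available")
}

func main() {
	for _, minHops := range []int{0, 2, 3} {
		hops, err := calculateLocalRoutes(minHops, true, true)
		fmt.Println(minHops, hops, err)
	}
}
```

Returning an error rather than the best available 2-hop route leaves the decision to the caller's retry loop, which is the behavior the FIX paragraph argues for.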
0pcom added a commit that referenced this pull request on May 21, 2026
…ters (#2754)

* fix(router): calculateLocalRoutes must honor MinHops > 2

The local route-calc fallback in calculateLocalRoutes only knows how to build 1-hop (direct) and 2-hop routes. When the global route-finder fails to produce a path (transient timeout, stale TPD, ErrNoSuitableTransport), DialRoutes falls back to calculateLocalRoutes, which would silently return a 2-hop route even when the caller demanded min-hops >= 3.

Empirically reproduced:

    cli visor ping mux-bw <peer> --min-hops 3  → hops_count=2
    cli visor ping mux-bw <peer> --min-hops 4  → hops_count=2
    cli visor ping mux-bw <peer> --min-hops 5  → hops_count=5 (correct)

The route-finder service does honor min-hops correctly when queried directly (`cli route find <peer> -n 3 -x 3` returns proper 3-hop paths). The bug is in the local fallback: it returns whatever 2-hop path it can build, regardless of opts.

FIX: short-circuit calculateLocalRoutes with an error when dialOpts.MinHops > 2. The router-level retry loop above will then either re-attempt the route-finder query or return the error to the caller, both of which are correct behaviors when the caller explicitly demands a multi-hop path the local cache can't satisfy.

Effect: `mux-bw --min-hops N` with N >= 3 either gets a route that actually has >= N hops (when the route-finder service is healthy) or fails cleanly with a useful error: no more silent constraint violations.

Build / gofmt / golangci-lint clean. The earlier MinHops > 1 guard (suppressing the direct-1-hop probe) added in #2749 is unchanged; this PR layers a second guard for the 2-hop case.

* fix(rpcgrpc): mux-bw probe-after-pump race + per-level LevelDone counters

Two small unrelated fixes against the rpcgrpc handlers; bundling because the diffs are tiny and ship in the same package.

1. mux-bw probe-after-pump race

In muxBwProbeLoop, the `for { select { ... ticker.C ... } }` loop can land in the ticker-fired branch in the same scheduler window as the pump's ctx.Done; by the time PingOnce fires, the pump goroutine's defer has already torn down the route via StopPingRoute. Result: 1-2 trailing "no ping connection for ... call DialPing first" probe errors at every mux-bw run end. Cosmetic but noisy in the operator's NDJSON output.

Fix: re-check ctx.Err() at the top of the ticker case so the probe bails out cleanly when the pump is already shutting down.

2. LevelDone counters never populated for cache hits / failures

In server_ping_tree.go, levelDoneFor() emitted only the candidate-count under Attempted; Succeeded / Failed / SkippedCached were always zero because no per-level counters were tracked anywhere. Operator-visible: tree-stream's level_done human row read "attempted=509 succeeded=0 failed=0 skipped_cached=0" even when 484 entries actually came back from the transport-summary cache.

Fix: introduce a levelStats struct (per-level atomic int32 triple), accumulate alongside the existing pingTreeTotals counters in pingTreePingLevel, return it from the level function, and read into PingTreeLevelDone.{Succeeded,Failed,SkippedCached}. Both call sites (level-1 + level-N loop) updated; dry-run path populates SkippedCached locally.

Build / gofmt / golangci-lint clean. Existing rpcgrpc tests pass.
0pcom added a commit that referenced this pull request on May 21, 2026
… + MinHops (#2757)

Beta's #2756 MuxRouteFailure event surfaced the smoking gun on mux-bw --routes N --min-hops 2: one route would establish but its pump loop immediately failed with

    no ping connection for <pk>#0, call DialPing first

while MuxRouteEstablished named route_index = 2 for the same route. The lookup PingRouteRef.Index was 0 even though the pump goroutine had RouteIndex = 2.

ROOT CAUSE

The rpcgrpc PingConf → visor PingConfig adapter in pkg/visor/init_apps.go's visorPingAdapter forwards RouteIndex correctly for DialPing (line 238) and PingOnce (line 249), but PingOnceWithEcho's adapter (lines 316-322) was missing the field; the same goes for MinHops. Aux-route pumps therefore degraded to a primary-route (Index=0) lookup, finding nothing because DialPing had registered the conn at the matching aux Index.

FIX

Add `RouteIndex: conf.RouteIndex` and `MinHops: conf.MinHops` to the visorPingAdapter.PingOnceWithEcho conversion. Three-line adapter parity fix.

DMSG adapter (DmsgPingOnceWithEcho) intentionally untouched: v.dmsgPing.conns is keyed by PK alone, not by PingRouteRef, so no DMSG path consumes RouteIndex.

TEST

TestPingAdapter_PingOnceWithEcho_ForwardsRouteIndex pins the adapter contract: with no ping connection registered, the visor's PingOnceWithEcho returns

    no ping connection for %s#%d, call DialPing first

The "%d" portion is conf.RouteIndex post-adapter. The test calls the adapter with RouteIndex in {0, 1, 2, 7} and asserts the error message contains the matching `#<idx>,`, so a regression that drops RouteIndex (or stuffs in MinHops by accident) surfaces in the error string directly. No mock VisorAPI, no fixtures.

EMPIRICAL CHAIN

This is the third bug in the mux-bw --min-hops measurement chain to surface via the wire-event observability landed in #2746 → #2749 → #2751 → #2752 → #2750 → #2753 → #2754 → #2756. The pattern holds: each event-surface fix lights up the next bug downstream.

The operator's "mux > direct" hypothesis test should now finally be measurable end-to-end once #2756 (route_failure event) + this PR auto-deploy.
0pcom added a commit that referenced this pull request on May 21, 2026
* fix(visor/ping): narrow ping.mu/dmsgPing.mu critical section to map lookup
DOMINANT BOTTLENECK for mux-bw bandwidth measurements. The visor's
Ping/PingOnce/PingOnceWithEcho (and dmsg twins) held v.ping.mu (a
single visor-global *sync.Mutex) for the ENTIRE wire roundtrip:
    v.ping.mu.Lock()
    defer v.ping.mu.Unlock()
    pingEntry, ok := v.ping.conns[ref]
    // ... ~287ms of wire I/O at 2-hop with 32 KB payloads ...
mux-bw's N pump goroutines all call PingOnceWithEcho on DIFFERENT
PingRouteRefs. They each look up their OWN conn via the map; the
wire I/O is independent. But the global mutex serialized them
through one ~287ms slot each. So:
  - Aggregate throughput across N routes didn't scale with N
  - Per-route avg pinned at ~351 kbps even though single-call peak
    was ~1.7 Mbps (1 RTT × 32 KB)
  - --probe-rtt latency probes during a loaded pump measured
    "probe-mutex-wait + network RTT" instead of network RTT,
    swamping the queueing-delay signal at short hop counts
  - Bidirectional simultaneous mux-bw measurements showed
    mutual-starvation that LOOKED like shared-link contention
    but was actually mutex contention on each side's ping state
ROOT CAUSE
The mutex's actual job is to protect v.ping.conns (the map) from
concurrent insert (DialPing) and delete (StopPingRoute). The
wire I/O on the chosen conn does NOT need the map mutex held —
each mux-bw pump goroutine owns its own conn via its RouteIndex,
no aliasing.
FIX
Shrink the critical section to just the map lookup:
    v.ping.mu.Lock()
    pingEntry, ok := v.ping.conns[ref]
    v.ping.mu.Unlock()
    if !ok { ... }
    // ... wire I/O on pingEntry.conn WITHOUT holding the mutex ...
Applied to Ping, PingOnce, PingOnceWithEcho, DmsgPing,
DmsgPingOnce, DmsgPingOnceWithEcho. BandwidthTest already had the
correct narrow scope.
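A self-contained sketch of the narrowed critical section, assuming simplified stand-in types: `pingState`, `pingRouteRef`, and `pingOnce` are hypothetical names, and a closure stands in for the conn's wire I/O. The closure probes the mutex with `TryLock` (Go 1.18+) to show that the lock is free while the I/O runs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// pingRouteRef is a simplified stand-in for the map key described above.
type pingRouteRef struct {
	PK    string
	Index int
}

// pingState holds the conn map; each value is a closure standing in for the
// wire round trip on that conn.
type pingState struct {
	mu    sync.Mutex
	conns map[pingRouteRef]func() error
}

var errNoPingConn = errors.New("no ping connection")

// pingOnce holds the mutex only for the map lookup, then does the I/O
// without it so calls on different refs do not serialize.
func (p *pingState) pingOnce(ref pingRouteRef) error {
	p.mu.Lock()
	roundTrip, ok := p.conns[ref]
	p.mu.Unlock()
	if !ok {
		return errNoPingConn
	}
	return roundTrip() // wire I/O happens outside the critical section
}

func main() {
	p := &pingState{conns: make(map[pingRouteRef]func() error)}
	ref := pingRouteRef{PK: "peer", Index: 2}

	lockFreeDuringIO := false
	p.conns[ref] = func() error {
		if p.mu.TryLock() { // succeeds only if pingOnce released the lock
			lockFreeDuringIO = true
			p.mu.Unlock()
		}
		return nil
	}

	fmt.Println(p.pingOnce(ref), lockFreeDuringIO)
	fmt.Println(p.pingOnce(pingRouteRef{PK: "peer", Index: 0}))
}
```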
CONCURRENT-CLOSE SEMANTICS
Pre-fix: StopPing concurrent with PingOnceWithEcho serialized via
the mutex — they took turns, no race. Post-fix: StopPing can close
the conn while PingOnceWithEcho is doing wire I/O. The Read/Write
on the closed conn returns ErrClosed cleanly. mux-bw's pump loop
already handles read/write errors by exiting the pump goroutine;
the resulting failure is surfaced via Beta's MuxRouteFailure event
(#2756) so the operator sees the cause instead of an indefinite
block.
The same-PingRouteRef-from-multiple-goroutines case (always
undefined behavior on the underlying net.Conn) is unchanged —
callers must serialize themselves. mux-bw enforces one
goroutine per RouteIndex natively.
TESTS
pkg/visor/ping_mu_concurrency_test.go:
  - TestPingOnceWithEcho_DoesNotSerializeAcrossRouteIndexes:
    200 concurrent calls with distinct PingRouteRefs (no registered
    conns) complete in << 1s. A regression that re-introduces
    wire-I/O-under-lock would either time out or take orders of
    magnitude longer.
  - TestPingMu_NotHeldDuringConnAbsentCallpath: after
    PingOnceWithEcho returns, the mutex must be immediately
    acquirable from another goroutine. Catches the defer-on-entry
    pattern directly.
EMPIRICAL PREDICTION
Once this auto-deploys, the operator's "mux > direct" hypothesis
becomes testable WITHOUT --hops-via intermediate pinning. Per-
route avg should rise from ~351 kbps toward the single-call peak
of ~1.7 Mbps, and N=2..8 disjoint mux should aggregate roughly
linearly (modulo per-intermediate quality variance) instead of
flat-lining at single-route throughput.
CHAIN
The mux-bw measurement loop has now closed:
    #2745  per-route teardown
    #2746  disjoint-intermediate routing
    #2749  plumb --min-hops through DialPing
    #2750  stop poisoning dst breaker from intermediate failures
    #2751  tryDirectPingDial gate on MinHops
    #2752  honor caller SetupTimeout
    #2756  MuxRouteFailure pump-phase event
    #2757  PingOnceWithEcho adapter forwards RouteIndex
    this PR: ping.mu doesn't serialize parallel pumps
If the mux > direct hypothesis is real, it should be visible
in measurements after this lands.
* fix(visor/ping): errcheck on discarded PingOnceWithEcho returns in test
golangci-lint errcheck flagged both test sites that discard the
4-tuple return of v.PingOnceWithEcho via _, _, _, _. The discards
are intentional — the concurrency test asserts on wall-clock
serialization behavior, not on per-call success; the mutex-release
test asserts that the lock is acquirable post-return regardless of
whether the call itself succeeded or returned ErrNoPingConnection.
Add //nolint:errcheck comments explaining the intent. No behavior
change; CI lint pass.
Summary
Gamma's diagnostic on 2026-05-20 found that `cli visor ping mux-bw --min-hops 2` was a no-op — visor journal during a run showed all N parallel routes using the same direct stcpr transport. The +24/42/69% bandwidth gains we measured at N=2/4/8 were smux-stream parallelism over ONE transport, not the intermediate-route diversity the operator's hypothesis test intended.
Root cause
The chain `visor.PingConfig` → `rpcgrpc.PingConf` → `appnet.PingContext` → `router.DialRoutes` had no per-call MinHops field. mux-bw set `req.MinHops` at the proto layer but it died at the `rpcgrpc.PingConf` boundary — there was no corresponding field — and the route layer's only MinHops source was the visor-global `routing.min_hops` config (typically 1).
Worse, `router.DialRoutes` contained an explicit fast-path downgrade: when a direct transport to the destination existed, `r.conf.MinHops` was temporarily set to 1 for the duration of the dial. So even with visor-global min_hops=2, a peer with a direct stcpr would still get routed direct.
Fix (this PR)
Effect post-merge
`cli visor ping mux-bw --routes 4 --min-hops 2` will:
Coordination
Why no unit test
Verifying end-to-end requires a router fixture (transport manager + setup nodes + route-finder client) that doesn't exist in the test suite. Pragmatic validation: Beta + Gamma run `mux-bw --routes 4 --min-hops 2` against each other post-merge and grep the visor journal for distinct stcpr transports per route. Existing test suites still pass.
Test plan