Skip to content

feat(router): DisjointMux + ExcludeIntermediatePKs in DialOptions#2746

Merged
0pcom merged 4 commits into
skycoin:developfrom
0pcom:feat/disjoint-mux
May 20, 2026
Merged

feat(router): DisjointMux + ExcludeIntermediatePKs in DialOptions#2746
0pcom merged 4 commits into
skycoin:developfrom
0pcom:feat/disjoint-mux

Conversation

@0pcom
Copy link
Copy Markdown
Collaborator

@0pcom 0pcom commented May 20, 2026

Summary

Operator's directive 2026-05-20: 'multiplexed routes should not share any intermediate keys'. Re-scoped from a mux-bw-only fix to a router-layer affordance that benefits every --routes N caller (cli skynet, cli dmsg cat, cli skysocks-client, mux-bw, future apps) without requiring per-app changes.

Per cross-agent design 2026-05-20 18:50Z (operator + Alpha + Beta + me, 4-Q alignment):

  • DisjointMux is opt-in, not automatic (operator: "have the option, don't make automatic")
  • Aux routes 2..N are disjoint; primary unconstrained (primary route may still pick the direct edge — preserves "lowest-latency first leg + aux for additive bandwidth" shape)
  • Asymmetric forward/reverse routing deferred to a separate design (today's DialOptions assumes symmetric Min/Max counts)

Changes

pkg/router/router.go — DialOptions additions:

  • ExcludeIntermediatePKs []cipher.PubKey — the lower-level exclude mechanism; populated by the mux loop, also exposed for callers that want full control
  • DisjointMux bool — the high-level opt-in; when true on the primary dial, the mux loop seeds + accumulates the exclude-PK set across routes

pkg/router/router_dial.go:

  • fetchBestRoutes now post-filters the route-finder response via new pickDisjointPath helper. Walks paired forward/reverse paths until it finds a pair whose intermediate hops don't overlap with ExcludeIntermediatePKs. Falls back to local route calc when every candidate is excluded.
  • Back-compat hot path: when ExcludeIntermediatePKs is empty, returns paths[0] exactly as before — no behavior change for callers that didn't opt in.
  • calculateLocalRoutes 2-hop loop skips intermediate candidates whose PK is in the exclude set.
  • establishMuxRoutes (the mux loop):
    • Seeds excludePKs from the primary route's intermediates via intermediatesOfRouteGroup (reads NoiseRouteGroup.tps + filters endpoints).
    • Accumulates each aux route's intermediates into excludePKs via intermediatesOfHops so subsequent routes don't reuse them.
    • Only active when primary DialOptions.DisjointMux=true — remains nil/inert otherwise.

Test plan

  • go build ./pkg/router/... clean
  • go vet ./pkg/router/... clean
  • gofmt -l clean
  • go test ./pkg/router/ 37.7s — full suite passes including 11 new subtests:
    • TestPathTouchesIntermediate_* (empty / direct 1-hop / 2-hop with excluded mid / dst-in-exclude-is-not-intermediate)
    • TestPickDisjointPath_* (no-exclude returns first / skips excluded intermediate / all-excluded returns ok=false / reverse-also-filtered)
    • TestIntermediatesOfHops (empty / direct / 2-hop / 3-hop)

Follow-ups (separate PRs, scope-isolated)

  1. mux-bw CLI flag wiring: --strict-disjoint default true for cli visor ping mux-bw (measurement tool needs diversity for the operator's hypothesis test); default false for cli skynet start, cli dmsg cat, cli skysocks-client (back-compat). Threads through MuxBandwidthRequestPingConfig.ExcludeIntermediatePKs (or similar) → appnet.PingContextDialOptions. Touches proto + generated RPC code — distinct review surface.

  2. Per-route bytes in MuxBandwidthSample: extends the sample event with a per_route []PerRouteSample array so consumers can see which route is the bandwidth bottleneck (or whether one slow path drags the average). Proto change.

  3. Per-route intermediate-PKs in MuxBandwidthDone: post-hoc verification of the disjoint-routing claim. Proto change.

Empirical motivation

Pre-PR mux-bw sweep (cell A direct + cell B mux-via-intermediates at N=2/4/8) showed +24/+42/+69% bandwidth scaling. Sub-linear — operator hypothesized intermediate-PK sharing as the cause. This PR is the test of that hypothesis: with DisjointMux on, scaling should be closer to linear (per-route smux flow-control becomes the next ceiling, addressed by Beta's parallel #2731-trail PR).

0pcom added 4 commits May 20, 2026 18:57
Operator's 'multiplexed routes should not share intermediate keys'
constraint, lifted from a mux-bw-only fix to a router-layer
affordance that benefits every --routes N caller (cli skynet start,
cli dmsg cat --routes, cli skysocks-client --mux-routes, future apps)
without requiring any per-app change.

Per design discussion 2026-05-20 18:50Z + Alpha + Beta alignment:
  - DisjointMux is opt-in, not automatic (operator: "have the option,
    don't make automatic")
  - Aux routes (2..N) are disjoint; primary route is unconstrained
  - Asymmetric forward/reverse routing deferred to separate design

Changes:

pkg/router/router.go — DialOptions additions:
  + ExcludeIntermediatePKs []cipher.PubKey (the lower-level exclude
    mechanism; populated by the mux loop, also exposed for callers
    that want full control)
  + DisjointMux bool (the high-level opt-in; when true on the primary
    dial, the mux loop in establishMuxRoutes seeds + accumulates the
    exclude-PK set across routes)

pkg/router/router_dial.go:
  fetchBestRoutes:
    + pickDisjointPath helper post-filters the route-finder response;
      walks paired forward/reverse paths until it finds a pair whose
      intermediate hops do not overlap with ExcludeIntermediatePKs.
      Falls back to local route calc when every candidate is excluded.
    + back-compat hot path: when ExcludeIntermediatePKs is empty (the
      common single-route case), returns paths[0] as before — no
      behavior change for callers that didn't opt in.
  calculateLocalRoutes:
    + 2-hop loop skips intermediate candidates whose PK is in
      ExcludeIntermediatePKs (1-hop direct path is unaffected since
      it has no intermediate).
  establishMuxRoutes (the mux loop):
    + seeds excludePKs from the primary route's intermediates via
      intermediatesOfRouteGroup helper (reads NoiseRouteGroup's
      transports + filters endpoints).
    + accumulates each aux route's intermediates into excludePKs via
      intermediatesOfHops so subsequent routes don't reuse them.
    + only active when the primary DialOptions.DisjointMux is true;
      remains nil/inert otherwise so today's behavior is unchanged
      for non-opted-in callers.

Tests (pkg/router/router_dial_disjoint_test.go, 11 subtests):
  - pathTouchesIntermediate: empty / direct 1-hop / 2-hop with
    excluded mid / dst-in-exclude-is-not-intermediate
  - pickDisjointPath: no-exclude returns first / skips excluded
    intermediate / all-excluded returns ok=false / reverse-also-
    filtered
  - intermediatesOfHops: empty / direct / 2-hop / 3-hop

The CLI flag wiring for mux-bw (--strict-disjoint default true) and
non-measurement CLIs (default false) plus per-route bytes visibility
in MuxBandwidthSample land in a follow-up PR — those touch the proto
+ generated RPC code, distinct review surface.
CI lint-shell on PR skycoin#2746 surfaced the lurking SC2012 in
ci_scripts/tree-probe.sh:102 (added by skycoin#2734, merged today). The
filenames in question are alphanumeric so the warning is a false
positive for this specific pattern, but shellcheck flags it
unconditionally — switching to find keeps the lane green.
…intMux opt-in

Operator correction 2026-05-20 19:30Z: multiplexed routes should NOT
share intermediate visors as the default — that was the original
intent. Overriding with overlapping intermediates is the
user-specified exception (via explicit ForwardHops/ReverseHops),
not the default. The earlier "have the option, don't make automatic"
framing referred to keeping the explicit-route override available,
not to keeping disjoint-routing opt-in.

Changes from prior commit on this PR:
- Drop DialOptions.DisjointMux field — disjoint behavior is now
  unconditional when MuxRoutes > 1
- Mux loop in establishMuxRoutes always seeds excludePKs from the
  primary route's intermediates and always accumulates each aux
  route's intermediates into the exclude set
- ExcludeIntermediatePKs stays as the underlying mechanism — callers
  outside the mux loop can pre-populate if they have constraints
- Comment updates to reflect "automatic for mux loop, override via
  explicit hops"

Also fix pre-existing go vet copylocks error from PR skycoin#2739 (newly
surfaced when this PR's CI ran the full vet pass):
  cmd/skywire-cli/commands/visor/ping/mux_bandwidth_tui.go:
    pass *MuxBandwidthRequest by pointer (was by value, copying
    the embedded protoimpl.MessageState sync.Mutex)

Tests + build + vet all green.
CI surfaced TestFormatEntryLine/long_error_message_truncates regression
on PR skycoin#2746. formatEntryLine renders failed entries as a single row but
didn't cap the embedded error message; a 200-char setupErr ballooned
the row to 315 chars + missing the "..." marker the test asserts.

Cap errMsg at 80 chars + ellipsis. The row's fixed-width prefix is
~113 chars (1 glyph + 66 remotePK + 36 tpID + 6 tpType + separators)
which leaves ~85 chars of single-line budget for the payload before
horizontal-scroll territory. The test asserts total line width < 200.

Tests now pass; not from this PR's diff but caught by the shared CI
lane on develop tip — fixing it on this branch keeps the lane green
so the actual changeset (DialOptions disjoint-mux) lands cleanly.
@0pcom 0pcom merged commit 5af5aa7 into skycoin:develop May 20, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 20, 2026
…a hop (#2750)

Disjoint-mux scenario (N parallel routes through N different
intermediates) exposed a circuit-breaker attribution bug: when an
intermediate hop failed to dial during id_reservation, the
destination's breaker accumulated the hit. Five bad intermediates
tripped the destination's breaker even though the destination was
reachable via the other N-5 healthy intermediates. All subsequent
setup attempts to the dst then fast-failed for circuitOpenDuration.

Symmetric to the source-side fix already in place (lines 423-426
pre-change): source-side dial failures don't poison the dst breaker
either. Intermediates needed the same treatment.

Changes
- pkg/router/setupmetrics/stats.go: in finish() default branch
  (intermediate hop failed), set reason=ReasonIntermediateUnreachable,
  blameDst=false, and record the breaker hit against the intermediate's
  PK (not the dst's). New AllowIntermediate accessor mirrors
  AllowDestination; both share an allowPK helper. Adds
  ReasonIntermediateUnreachable FailureReason.

- pkg/router/setupnode.go: CreateRouteGroup also consults
  AllowIntermediate for each hop on the forward path. A known-bad
  intermediate now short-circuits the setup with the same
  ErrCircuitOpen sentinel + the intermediate's PK in the reason
  string, saving the ~10s id_reservation timeout.

- pkg/router/setupmetrics/stats_test.go: 2 new tests.
  TestCollector_CircuitBreaker_IntermediateUnreachable asserts that
  5+ intermediate-fails do NOT trip the dst breaker, DO trip the
  intermediate's breaker, are reclassified to intermediate_unreachable,
  and don't blame the dst Failed counter.
  TestCollector_CircuitBreaker_IntermediateBreakerNotPoisoningDst
  asserts the asymmetry that makes disjoint-mux work: bad
  intermediate denied, good intermediate + dst still allowed.

Empirical motivation
Captured in PR #2746 followup investigation: mux-bw --routes 8
--min-hops 2 against Beta failed all 8 routes with "destination
circuit breaker open" despite route-finder returning 7+ valid
candidates through different intermediates. Each route's
intermediate-dial failure ticked the dst breaker; once 5 in a
5-minute window had hit, the dst was locked out and the remaining
~3 healthy candidates were never even tried.

Pairs with #2746 (disjoint mux) and #2749 (--min-hops plumbing): once
mux-bw actually routes through intermediates, intermediate flakiness
no longer destroys the dst breaker semantics.
0pcom added a commit that referenced this pull request May 21, 2026
… routing (#2755)

Operator's observation: route-finder service randomization is the
dominant source of variance in mux-bw measurements (26x throughput
swing across intermediates for an identical request shape). Local
route calc is deterministic by construction — same input, same
output. But until now calculateLocalRoutes only supported 1-hop
(direct) and 2-hop routes. Anything deeper fell through to the
route-finder service (and its randomness).

This PR extends calculateLocalRoutes to find N-hop paths via BFS
over the transport-discovery graph, bounded by Config.MaxHops.
Combined with --local-route (SetForceLocalRoutes(true)) on the
caller side, this gives the operator a deterministic selector
for multi-hop runs:

  cli visor ping mux-bw <peer> --local-route --min-hops 3
    → same 3-hop path every call, picked deterministically by
      remote-PK string sort at each BFS expansion step.

Combined with DialOptions.ExcludeIntermediatePKs (Gamma's #2746),
the operator can run controlled N=2/4/8 disjoint-mux experiments
where each route is reproducibly chosen — the per-intermediate
variance becomes selectable rather than random.

IMPLEMENTATION

  - Drop the previous "if MinHops > 2: error out" guard.
  - Drop the previous 2-hop-only intermediate search loop.
  - BFS state: { pk, path []Hop }, level = path length.
  - Seed queue with localTps (1-hop paths to direct neighbors),
    excluding any in ExcludeIntermediatePKs.
  - At each level, check if any queued node is dst and the path
    is >= MinHops. If yes, return that node's path (this is the
    shortest acceptable path — BFS guarantees it).
  - Expand each node by walking transportsByEdge[node.pk],
    skipping already-visited nodes (no cycles).
  - Deterministic ordering: sort seed and each level's children
    by remote-PK string before queueing.
  - Stop when a node hits MaxHops or the queue empties.

  reverseHops + hopPath helpers added for forward-route flipping
  and log-line rendering respectively. Existing 1-hop and self-
  ping fast paths above the BFS are unchanged.

EFFECT

  - mux-bw --local-route --min-hops 5 now actually finds a 5-hop
    path locally without touching the route-finder service.
  - Same call repeated returns the same path → reproducible
    cross-call measurements.
  - --min-hops 3, 4 work too (was previously erroring out per
    #2753 since the 2-hop fallback couldn't satisfy them).
  - Backwards compat for callers that don't set MinHops: minHops
    defaults to 2 (same as previous behavior); BFS finds the
    same 2-hop path the loop would have.

NOTES

  - MaxHops is taken from r.conf.MaxHops (router-wide config,
    defaults to 50 per router.go's maxHops constant — bounded
    enough to prevent runaway BFS over a large global graph).
    Future enhancement: per-call MaxHops in DialOptions if the
    operator wants tighter control.
  - The previous "2-hop only" comment in the function docstring
    is updated to reflect N-hop BFS.

Build / gofmt / golangci-lint clean. Existing router tests pass
(37s wall — full pkg/router test suite).
0pcom added a commit that referenced this pull request May 21, 2026
… + MinHops (#2757)

Beta's #2756 MuxRouteFailure event surfaced the smoking gun on
mux-bw --routes N --min-hops 2: one route would establish but its
pump loop immediately failed with

  no ping connection for <pk>#0, call DialPing first

while MuxRouteEstablished named route_index = 2 for the same
route. The lookup PingRouteRef.Index was 0 even though the
pump goroutine had RouteIndex = 2.

ROOT CAUSE
The rpcgrpc PingConf → visor PingConfig adapter in
pkg/visor/init_apps.go's visorPingAdapter forwards RouteIndex
correctly for DialPing (line 238) and PingOnce (line 249), but
PingOnceWithEcho's adapter (lines 316-322) was missing the
field — same for MinHops. Aux-route pumps therefore degraded to
a primary-route (Index=0) lookup, finding nothing because
DialPing had registered the conn at the matching aux Index.

FIX
Add `RouteIndex: conf.RouteIndex` and `MinHops: conf.MinHops` to
the visorPingAdapter.PingOnceWithEcho conversion. Three-line
adapter parity fix.

DMSG adapter (DmsgPingOnceWithEcho) intentionally untouched —
v.dmsgPing.conns is keyed by PK alone, not by PingRouteRef, so
no DMSG path consumes RouteIndex.

TEST
TestPingAdapter_PingOnceWithEcho_ForwardsRouteIndex pins the
adapter contract: with no ping connection registered, the visor's
PingOnceWithEcho returns

  no ping connection for %s#%d, call DialPing first

The "%d" portion is conf.RouteIndex post-adapter. The test calls
the adapter with RouteIndex in {0, 1, 2, 7} and asserts the error
message contains the matching `#<idx>,` — so a regression that
drops RouteIndex (or stuffs in MinHops by accident) surfaces in
the error string directly. No mock VisorAPI, no fixtures.

EMPIRICAL CHAIN
This is the third bug in the mux-bw --min-hops measurement chain
to surface via the wire-event observability landed in #2746#2749#2751#2752#2750#2753#2754#2756. The pattern
holds: each event-surface fix lights up the next bug downstream.
The operator's "mux > direct" hypothesis test should now finally
be measurable end-to-end once #2756 (route_failure event) +
this PR auto-deploy.
0pcom added a commit that referenced this pull request May 21, 2026
* fix(visor/ping): narrow ping.mu/dmsgPing.mu critical section to map lookup

DOMINANT BOTTLENECK for mux-bw bandwidth measurements. The visor's
Ping/PingOnce/PingOnceWithEcho (and dmsg twins) held v.ping.mu (a
single visor-global *sync.Mutex) for the ENTIRE wire roundtrip:

  v.ping.mu.Lock()
  defer v.ping.mu.Unlock()
  pingEntry, ok := v.ping.conns[ref]
  // ... ~287ms of wire I/O at 2-hop with 32 KB payloads ...

mux-bw's N pump goroutines all call PingOnceWithEcho on DIFFERENT
PingRouteRefs. They each look up their OWN conn via the map; the
wire I/O is independent. But the global mutex serialized them
through one ~287ms slot each. So:

  - Aggregate throughput across N routes didn't scale with N
  - Per-route avg pinned at ~351 kbps even though single-call peak
    was ~1.7 Mbps (1 RTT × 32 KB)
  - --probe-rtt latency probes during a loaded pump measured
    "probe-mutex-wait + network RTT" instead of network RTT,
    swamping the queueing-delay signal at short hop counts
  - Bidirectional simultaneous mux-bw measurements showed
    mutual-starvation that LOOKED like shared-link contention
    but was actually mutex contention on each side's ping state

ROOT CAUSE

The mutex's actual job is to protect v.ping.conns (the map) from
concurrent insert (DialPing) and delete (StopPingRoute). The
wire I/O on the chosen conn does NOT need the map mutex held —
each mux-bw pump goroutine owns its own conn via its RouteIndex,
no aliasing.

FIX

Shrink the critical section to just the map lookup:

  v.ping.mu.Lock()
  pingEntry, ok := v.ping.conns[ref]
  v.ping.mu.Unlock()
  if !ok { ... }
  // ... wire I/O on pingEntry.conn WITHOUT holding the mutex ...

Applied to Ping, PingOnce, PingOnceWithEcho, DmsgPing,
DmsgPingOnce, DmsgPingOnceWithEcho. BandwidthTest already had the
correct narrow scope.

CONCURRENT-CLOSE SEMANTICS

Pre-fix: StopPing concurrent with PingOnceWithEcho serialized via
the mutex — they took turns, no race. Post-fix: StopPing can close
the conn while PingOnceWithEcho is doing wire I/O. The Read/Write
on the closed conn returns ErrClosed cleanly. mux-bw's pump loop
already handles read/write errors by exiting the pump goroutine;
the resulting failure is surfaced via Beta's MuxRouteFailure event
(#2756) so the operator sees the cause instead of an indefinite
block.

The same-PingRouteRef-from-multiple-goroutines case (always
undefined behavior on the underlying net.Conn) is unchanged —
callers must serialize themselves. mux-bw enforces one
goroutine per RouteIndex natively.

TESTS

pkg/visor/ping_mu_concurrency_test.go:
  - TestPingOnceWithEcho_DoesNotSerializeAcrossRouteIndexes:
    200 concurrent calls with distinct PingRouteRefs (no registered
    conns) complete in << 1s. A regression that re-introduces
    wire-I/O-under-lock would either time out or take orders of
    magnitude longer.
  - TestPingMu_NotHeldDuringConnAbsentCallpath: after
    PingOnceWithEcho returns, the mutex must be immediately
    acquirable from another goroutine. Catches the defer-on-entry
    pattern directly.

EMPIRICAL PREDICTION

Once this auto-deploys, the operator's "mux > direct" hypothesis
becomes testable WITHOUT --hops-via intermediate pinning. Per-
route avg should rise from ~351 kbps toward the single-call peak
of ~1.7 Mbps, and N=2..8 disjoint mux should aggregate roughly
linearly (modulo per-intermediate quality variance) instead of
flat-lining at single-route throughput.

CHAIN

The mux-bw measurement loop has now closed:
  #2745 per-route teardown
  #2746 disjoint-intermediate routing
  #2749 plumb --min-hops through DialPing
  #2750 stop poisoning dst breaker from intermediate failures
  #2751 tryDirectPingDial gate on MinHops
  #2752 honor caller SetupTimeout
  #2756 MuxRouteFailure pump-phase event
  #2757 PingOnceWithEcho adapter forwards RouteIndex
  this PR: ping.mu doesn't serialize parallel pumps

If the mux > direct hypothesis is real, it should be visible
in measurements after this lands.

* fix(visor/ping): errcheck on discarded PingOnceWithEcho returns in test

golangci-lint errcheck flagged both test sites that discard the
4-tuple return of v.PingOnceWithEcho via _, _, _, _. The discards
are intentional — the concurrency test asserts on wall-clock
serialization behavior, not on per-call success; the mutex-release
test asserts that the lock is acquirable post-return regardless of
whether the call itself succeeded or returned ErrNoPingConnection.

Add //nolint:errcheck comments explaining the intent. No behavior
change; CI lint pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant