fix(visor): write deadlines on PingOnceWithEcho + probe RouteIndex passthrough#2762
Merged
0pcom merged 1 commit intoMay 21, 2026
Merged
Conversation
…ssthrough Two related fixes that emerged from the post-skycoin#2761 measurement work: 1. SILENT PUMP STALL (skycoin#127) — pkg/visor/api_ping.go PingOnceWithEcho set read deadlines (30s) but NO write deadlines. If the underlying TCP send buffer filled (route under load, slow intermediate, etc.) conn.Write would block indefinitely. Empirical symptom: mux-bw runs with route_established success + 0 bytes pumped + 0 MuxRouteFailure events emitted, because the pump goroutine was stuck inside Write rather than returning with an error. In my Phase 6 sweep against Gamma post-skycoin#2761: 10 of 12 cells showed exactly this pattern (idle baseline n=40-46 probes, loaded n=0-2, sent=recv=0B for the full pump duration). Fix: SetWriteDeadline mirrors the SetReadDeadline calls already in place (30s). On deadline, Write returns net.ErrDeadlineExceeded and the pump's existing error-return path fires a MuxRouteFailure event (skycoin#2756) — making the stall observable instead of silent. 2. PROBE HARDCODED RouteIndex 0 (skycoin#130) — Gamma's find at 20:20Z muxBwProbeLoop's probeConf didn't set RouteIndex, so every probe used PingRouteRef{PK, Index: 0}. When N>1 and the first established route was at a non-zero index (Gamma's Phase D: only R2 of N=8 came up), every probe failed with "no ping connection". Same class as skycoin#2757's adapter bug but in the probe path. Fix: muxBwProbeLoop takes a routeIndex parameter. Both callers (idle baseline + loaded probe) pick the first established route and pass its index. The idle phase needs the same selection logic as the loaded phase — previously it just ran the loop blindly, which would also hit RouteIndex 0 even when no route 0 had been established. For skycoin#127, the deferred SetReadDeadline reset paths gain the matching SetWriteDeadline reset so the conn returns to its deadline-free state on all return paths. For skycoin#130, the loaded-phase probe selection logic at line 257-265 is preserved verbatim; the only change is passing probeTarget.index into the goroutine. The idle-phase selection gets new "find first established route" logic that mirrors the loaded-phase pattern. Build / gofmt / golangci-lint clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related fixes from the post-#2761 measurement work.
1. Silent pump stall — `pkg/visor/api_ping.go` (#127)
`PingOnceWithEcho` set read deadlines (30s) but NO write deadlines. If the underlying TCP send buffer filled (route under load, slow intermediate, etc.) `conn.Write` would block indefinitely. Empirical symptom: mux-bw runs with route_established success + 0 bytes pumped + 0 MuxRouteFailure events emitted, because the pump goroutine was stuck inside Write rather than returning with an error.
Phase 6 sweep against Gamma post-#2761: 10 of 12 cells showed exactly this — idle baseline n=40-46 probes, loaded n=0-2, sent=recv=0B for the full pump duration.
Fix: `SetWriteDeadline` mirrors the `SetReadDeadline` calls already in place (30s). On deadline, Write returns `net.ErrDeadlineExceeded` and the pump's existing error-return path fires a `MuxRouteFailure` event (#2756) — making the stall observable instead of silent.
2. Probe hardcoded RouteIndex 0 — Gamma's find at 20:20Z (#130)
`muxBwProbeLoop`'s `probeConf` didn't set `RouteIndex`, so every probe used `PingRouteRef{PK, Index: 0}`. When N>1 and the first established route was at a non-zero index (Gamma's Phase D: only R2 of N=8 came up), every probe failed with "no ping connection". Same class as #2757's adapter bug but in the probe path.
Fix: `muxBwProbeLoop` takes a `routeIndex` parameter. Both callers (idle baseline + loaded probe) pick the first established route and pass its index. The idle phase needs the same selection logic as the loaded phase — previously it ran blindly, also hitting RouteIndex 0.
Test plan