feat(tunnel): event-driven drain with adaptive long-poll#173

Merged
therealaleph merged 1 commit into therealaleph:main from dazzling-no-more:feature/event-driven-drain
Apr 25, 2026

@dazzling-no-more
Contributor

Summary

Replaces the tunnel-node's fixed-sleep batch drain (150 ms + optional 200 ms retry) with a Notify-driven wait that wakes on the first byte from any session in the batch. The same primitive enables long-polling for idle sessions: empty-poll batches hold the response open until upstream pushes data or the long-poll deadline elapses, turning push notifications and chat messages into ~RTT delivery instead of waiting for the client's next tick.
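
To make the "wake on the first byte from any session, or give up at the deadline" idea concrete, here is a minimal sketch. It is not the PR's implementation: the PR uses an async Notify per session, while this stand-in uses a blocking `std::sync::Condvar` shared by the whole batch so that it runs with the standard library alone. `SessionState`, `Batch`, `push`, and `mark_eof` are hypothetical names; only `wait_for_any_drainable` borrows its name from the PR.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical stand-in for per-session state (the real SessionInner is not shown here).
#[derive(Default)]
struct SessionState {
    buf: Vec<u8>,
    eof: bool,
}

/// In this sketch all sessions of a batch share one mutex and one condvar.
struct Batch {
    sessions: Mutex<Vec<SessionState>>,
    wake: Condvar,
}

/// A session is drainable once it has buffered bytes or has reached EOF.
fn drainable(sessions: &[SessionState]) -> bool {
    sessions.iter().any(|s| !s.buf.is_empty() || s.eof)
}

impl Batch {
    fn with_sessions(n: usize) -> Arc<Batch> {
        let sessions: Vec<SessionState> = (0..n).map(|_| SessionState::default()).collect();
        Arc::new(Batch { sessions: Mutex::new(sessions), wake: Condvar::new() })
    }

    /// Reader side: buffer upstream bytes, then signal the waiter.
    fn push(&self, idx: usize, bytes: &[u8]) {
        let mut sessions = self.sessions.lock().unwrap();
        sessions[idx].buf.extend_from_slice(bytes);
        self.wake.notify_all();
    }

    /// Reader side: record EOF, then signal the waiter.
    fn mark_eof(&self, idx: usize) {
        let mut sessions = self.sessions.lock().unwrap();
        sessions[idx].eof = true;
        self.wake.notify_all();
    }

    /// Drain side: block until any session is drainable or the deadline elapses.
    /// Returns true if something can be drained. Because the predicate is
    /// re-checked on every wake, a signal with no data behind it is ignored.
    fn wait_for_any_drainable(&self, deadline: Duration) -> bool {
        let guard = self.sessions.lock().unwrap();
        let (guard, _timed_out) = self
            .wake
            .wait_timeout_while(guard, deadline, |s| !drainable(s))
            .unwrap();
        drainable(&guard)
    }
}
```

A caller would pass a short deadline for active batches and a long one for pure polls; in both cases the wait returns as soon as a reader thread calls `push` or `mark_eof`.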

Changes

tunnel-node

  • New Notify on SessionInner, fired by reader_task on each buffer extend and on EOF.
  • New wait_for_any_drainable(inners, deadline) — signal-driven wait with self-filtering watchers (a stale permit left by a previous batch's spawn-race shortcut is consumed harmlessly without spuriously waking the caller).
  • handle_batch phase 2 picks an adaptive deadline:
    • ACTIVE_DRAIN_DEADLINE (350 ms) when the batch had writes or new connections, plus a 30 ms STRAGGLER_SETTLE after the first wake to catch neighboring responses.
    • LONGPOLL_DEADLINE (5 s) when the batch is a pure poll — empty data ops only.
  • Removed the legacy two-pass sleep+drain+retry block.
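
The deadline choice in phase 2 can be sketched as a pure function of the batch contents. The three constants and their values come from the list above; the `Op` enum and the function names are hypothetical, and two details are assumptions rather than facts from the PR: that the straggler settle applies only to active batches, and that a batch with no ops at all counts as a pure poll.

```rust
use std::time::Duration;

const ACTIVE_DRAIN_DEADLINE: Duration = Duration::from_millis(350);
const STRAGGLER_SETTLE: Duration = Duration::from_millis(30);
const LONGPOLL_DEADLINE: Duration = Duration::from_secs(5);

/// Hypothetical shape of one operation in a batch.
enum Op {
    /// Opens a new upstream connection.
    Connect,
    /// Carries client bytes; an empty payload is a poll.
    Data(Vec<u8>),
}

/// A pure poll contains only empty data ops (assumption: an empty batch also qualifies).
fn is_pure_poll(ops: &[Op]) -> bool {
    ops.iter().all(|op| matches!(op, Op::Data(d) if d.is_empty()))
}

/// Long-poll for pure polls, short active deadline for everything else.
fn pick_deadline(ops: &[Op]) -> Duration {
    if is_pure_poll(ops) {
        LONGPOLL_DEADLINE
    } else {
        ACTIVE_DRAIN_DEADLINE
    }
}

/// Extra wait after the first wake, to catch neighboring responses.
/// Assumption: only active batches settle; pure polls reply immediately.
fn settle_after_wake(ops: &[Op]) -> Duration {
    if is_pure_poll(ops) {
        Duration::ZERO
    } else {
        STRAGGLER_SETTLE
    }
}
```

Keeping the decision in one small function makes the active-versus-long-poll cap easy to unit test without opening any sockets.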

tunnel-client

  • tunnel_loop no longer skips empty polls, and the idle read-timeout cap drops from 30 s to 500 ms, because the server-side long-poll now controls cadence.
  • Backward-compat detection: an empty-in/empty-out round trip that returns under LEGACY_DETECT_THRESHOLD (1500 ms) sets a sticky server_no_longpoll flag on the mux; subsequent sessions revert to the pre-long-poll cadence (30 s read timeout, skip-empty-when-idle) so legacy tunnel-nodes don't get hammered with continuous empty polls.
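
The backward-compat rule can be sketched as follows. `LEGACY_DETECT_THRESHOLD`, the `server_no_longpoll` flag, and the 30 s / 500 ms timeouts come from the description above; the `Mux` struct shape, the method names, and the two timeout constant names are hypothetical, and the memory ordering is an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

const LEGACY_DETECT_THRESHOLD: Duration = Duration::from_millis(1500);
// Hypothetical constant names for the two idle read-timeout caps.
const LONGPOLL_IDLE_READ_TIMEOUT: Duration = Duration::from_millis(500);
const LEGACY_IDLE_READ_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical slice of the mux state; only the sticky flag matters here.
#[derive(Default)]
struct Mux {
    server_no_longpoll: AtomicBool,
}

impl Mux {
    /// Record one round trip. An empty request that gets an empty reply
    /// faster than the threshold means the server did not hold the poll open.
    /// The flag is only ever set, never cleared, which makes it sticky.
    fn observe_round_trip(&self, sent_empty: bool, got_empty: bool, elapsed: Duration) {
        if sent_empty && got_empty && elapsed < LEGACY_DETECT_THRESHOLD {
            self.server_no_longpoll.store(true, Ordering::Relaxed);
        }
    }

    /// Legacy servers get the old 30 s idle cap; long-poll servers get 500 ms.
    fn idle_read_timeout(&self) -> Duration {
        if self.server_no_longpoll.load(Ordering::Relaxed) {
            LEGACY_IDLE_READ_TIMEOUT
        } else {
            LONGPOLL_IDLE_READ_TIMEOUT
        }
    }

    /// Legacy servers also get the old skip-empty-when-idle behavior back.
    fn skip_empty_when_idle(&self) -> bool {
        self.server_no_longpoll.load(Ordering::Relaxed)
    }
}
```

Because nothing ever resets the flag, a single fast empty reply is enough to keep every later session on that mux at the legacy cadence.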

Trade-offs

LONGPOLL_DEADLINE is documented as a knob, not a constant. Lower values (e.g. 2 s) make typing-burst flows snappier; higher values minimize round-trips for push-only sessions. 5 s is a middle ground — a thinking pause between keystrokes can tax the next keystroke by up to that value, since tunnel_loop is strictly serial per session.
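
The keystroke tax can be worked through with a little arithmetic. This is a simplified model, not code from the PR: it assumes no upstream data arrives during the poll, ignores network RTT, and uses a hypothetical function name.

```rust
use std::time::Duration;

/// How long a keystroke waits behind an in-flight long-poll.
/// `poll_age` is how long the current empty poll has already been held open.
/// With a strictly serial loop, the keystroke cannot be sent until that poll
/// returns, which (absent upstream data) happens at the long-poll deadline.
fn keystroke_delay(poll_age: Duration, longpoll_deadline: Duration) -> Duration {
    longpoll_deadline.saturating_sub(poll_age)
}
```

With a 5 s deadline, a keystroke typed 1 s into a poll waits about 4 s, and one typed just after the poll starts waits nearly the full 5 s. Dropping the deadline to 2 s caps that worst case at 2 s, at the cost of more round-trips for push-only sessions.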

Test plan

  • cargo test --manifest-path tunnel-node/Cargo.toml — 17 tests pass, including:
    • 6 unit tests for wait_for_any_drainable (notify wake, any-of-N, deadline, stale-permit, eof, empty-list)
    • reader_task_notifies_on_incoming_bytes — end-to-end notify wiring through a real TCP pair
    • 3 integration tests on handle_batch: pure-poll wakes on push, active batch caps at active deadline (< 600 ms), Some("") payload engages long-poll
  • cargo test — full workspace, 92 client tests pass including new no_longpoll_cache_is_sticky

@therealaleph
Owner

Substantive review — this is good work. Verified locally and merging.

Tests:

  • cargo test --lib — 92 passes (was 91; +1 for no_longpoll_cache_is_sticky)
  • cargo test -p mhrv-tunnel-node — 17 passes (was 6; +11 for the new wait_for_any_drainable primitive, reader_task_notifies_on_incoming_bytes end-to-end wiring through a TCP pair, and the handle_batch integration tests for active-vs-long-poll cap selection)
  • Test-merged against current main (post-v1.4.1 + tunnel-docker workflow changes) — clean auto-merge

Design verified:

  • Notify::notify_one() on each reader_task buffer extend + on EOF — correct primitive for "wake on first byte from any of N sessions"
  • wait_for_any_drainable self-filters stale permits left by a previous batch's spawn-race shortcut — important to avoid spurious wakes that would burn the new 500 ms cadence on no actual data
  • LEGACY_DETECT_THRESHOLD (1500 ms) sits comfortably between the legacy fixed-sleep drain (~350 ms) and the new long-poll floor (~5 s). The two regimes are far enough apart that an empty round trip should not false-trigger in either direction at ordinary RTTs
  • server_no_longpoll is sticky per-mux (AtomicBool) so legacy tunnel-nodes only get one "fast empty reply" probe per mux lifetime, not per request
  • The 30 s → 500 ms cadence drop on the client side is exactly the change that makes long-poll viable; without it, the client's own idle-skip would still gate response delivery on the next data write

Trade-offs flagged in the body acknowledged:

  • 5 s LONGPOLL_DEADLINE is the right starting default. Lower would help typing-burst flows; higher minimizes round-trips for push-only sessions. Worth tracking the empirical distribution of "time between session pokes" once this lands and tuning if there's signal.
  • The STRAGGLER_SETTLE (30 ms) after first wake catches neighboring responses without holding the active batch open uselessly. Good.

Manual verification path (not blocking merge):

  • Start a tunnel-node, point a v1.5.0 client at it, watch a sustained Telegram session — incoming messages should now show up in roughly RTT instead of waiting for the client's poll tick. The win will be visible as "messages stop feeling laggy."
  • Pointing a v1.5.0 client at a pre-#173 tunnel-node should hit the legacy-detect path on its first empty poll and revert to the old cadence without hammering the node. This is verifiable by tracing logs once we plumb a log line for the detect (worth a follow-up PR).

Merging into v1.5.0 release. Thanks again for the depth.


[reply via Anthropic Claude | reviewed by @therealaleph]

therealaleph merged commit c392a33 into therealaleph:main on Apr 25, 2026
therealaleph added a commit that referenced this pull request Apr 25, 2026
… notes

Ships PR #173 (event-driven drain) plus three operational improvements:

PR #173 — long-poll tunnel mode. The tunnel-node's batch drain
switched from a fixed 150 ms sleep to an event-driven Notify wait;
idle sessions long-poll up to 5 s and wake on the first byte from
upstream. Push notifications and chat messages now arrive in roughly
RTT instead of waiting for the next client poll tick. Backward compat
with pre-#173 tunnel-nodes is automatic via a sticky AtomicBool that
detects fast empty replies and reverts to the legacy cadence.
92 client tests + 17 tunnel-node tests pass, including end-to-end
TCP-pair verification of the notify wiring.

Docker image for tunnel-node. Adds a hardened Dockerfile (BuildKit
cache mounts, non-root runtime user, ca-certificates for HTTPS
upstreams) and a .dockerignore to keep build context small. New
`tunnel-docker` job in the release workflow builds + pushes
multi-arch (linux/amd64 + linux/arm64) to
ghcr.io/therealaleph/mhrv-tunnel-node with `:latest`, `:1.5`, and
`:1.5.0` tags on every release. Setting up Full Tunnel mode goes
from "rustup + cargo build on a 1 GB VPS" (which fails on memory
half the time) to a one-liner. tunnel-node/README.md updated with
prebuilt-image + docker-compose recipes.

Brief Persian release note in Telegram caption. The release-post
caption now leads with `<blockquote>`-wrapped FA bullet headlines
extracted from `docs/changelog/v<ver>.md`, above the existing two
links (repo + release). Markdown links → Telegram HTML <a> for
clickability. Cap-budget-aware truncation at bullet boundaries
keeps total caption under Telegram's 1024-char limit. Headlines-only
rather than full bullets so multiple "what's new" items fit
comfortably (the full bullets remain on the GH release page and as
the optional --with-changelog reply-threaded message).

GitHub Releases page bodies now lead with the changelog content
(Persian section + `---` + English) instead of just a Full Changelog
comparison link. The auto comparison link is appended at the bottom
via `append_body: true` rather than removed.

Workflow changes:
- New `permissions: packages: write` at the workflow level (required
  for ghcr push via docker/login-action).
- New `tunnel-docker` job needs `build` (not the full matrix) to
  serialize the QEMU buildx layer with the matrix cache.
- Release job composes the body from `docs/changelog/v${VER}.md`
  in a pre-step that handles both tag-push and workflow_dispatch
  paths (uses inputs.version || github.ref_name like the rest of
  the workflow).

Tested locally:
- `cargo test` — 92 lib tests pass
- `cargo test -p mhrv-tunnel-node` — 17 tests pass
- `docker build` of tunnel-node Dockerfile — 32 MB image, runs as
  non-root, /health returns "ok", auth rejection works correctly,
  legitimate requests open sessions to remote hosts
- Telegram script `--dry-run` mode added; rendered captions for
  v1.4.0, v1.4.1, v1.5.0 all fit under 900 chars

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
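
The caption truncation described in the commit message above (cut at bullet boundaries so the total stays under the limit) can be sketched as below. The release script itself is not shown in this thread, so everything here is illustrative: the function name, the one-character newline cost per bullet, and counting length in Unicode scalar values are all assumptions. Only the 1024-character limit comes from the commit message.

```rust
/// Caption limit quoted in the commit message.
const TELEGRAM_CAPTION_LIMIT: usize = 1024;

/// Keep as many leading bullets as fit next to a fixed footer (for example
/// the repo and release links). Bullets are kept whole or dropped whole,
/// so the caption never ends in the middle of a bullet.
fn fit_bullets<'a>(bullets: &[&'a str], footer: &str, limit: usize) -> Vec<&'a str> {
    // Budget already spent on the part of the caption that must always appear.
    let mut used = footer.chars().count();
    let mut kept = Vec::new();
    for &bullet in bullets {
        // Each bullet costs its own length plus one newline separator.
        let cost = bullet.chars().count() + 1;
        if used + cost > limit {
            break;
        }
        used += cost;
        kept.push(bullet);
    }
    kept
}
```

Passing headlines rather than full bullets, as the commit describes, lets more items fit inside the same budget.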
therealaleph added a commit that referenced this pull request Apr 25, 2026
feat(tunnel): event-driven drain with adaptive long-poll