
fix: v1.9.15 full-mode warm-up race during h2 init (#924) #1029

Merged
therealaleph merged 3 commits into therealaleph:main from rezaisrad:fix/full-mode-warm-race-924
May 10, 2026

Conversation

@rezaisrad
Contributor

Summary

Fixes #924, the canonical tracking thread for an 18+ duplicate cluster of "Full-tunnel mode broken since v1.9.15" reports.

What affected users see:

  • batch timed out after 30s and tunnel connect_data error … batch timed out in mhrv-rs logs
  • curl: (28) SSL connection timeout against 127.0.0.1:8086 (the SOCKS5 listener)
  • Apps-Script mode keeps working — narrows the regression to the Full-mode data plane: SOCKS5 → domain-fronted relay → Apps Script → tunnel-node → internet
  • Only available workaround: the "force_http1": true kill switch added by PR #799 itself

Root cause:

Bisect (see below) lands on PR #799, which added HTTP/2 multiplexing on the relay leg. That change is correct in steady state — one multiplexed TCP/TLS socket replacing the legacy 8–80-socket pool fixes the head-of-line blocking that #781/#773 were filed for. The bug is one line of startup ordering inside warm():

  • v1.9.14: prewarmed the h1 socket pool unconditionally — by the time the user's first request arrived, 6–16 sockets were ready
  • v1.9.15 (PR #799): gated the h1 prewarm loop behind ensure_h2().await. ensure_h2() is bounded by H2_OPEN_TIMEOUT_SECS (8 s) but can take the full 8 s on a cold first connection, and during that window the h1 pool is empty

Fallout chain when a request arrives in the warm-up window:

  1. ensure_h2() is still holding h2_open_lock (or has yet to populate the cell) — request gets Err((Relay("h2 unavailable"), No)) immediately and falls back to h1
  2. h1 path calls acquire() → empty pool → open() → fresh TCP+TLS to connect_host:443
  3. Same network conditions that stalled the h2 handshake also stall the h1 handshake; cold open exceeds the 30 s batch_timeout enforced in tunnel_client.rs::dispatch_full_tunnel
  4. The handshake never completes in time to send the POST → tunnel-node never sees the request → the user sees a 30 s timeout that the "apps_script mode still works" observation can't explain

This PR:

  • Restores v1.9.14 ordering — h1 prewarm runs in parallel with the h2 handshake instead of after it
  • Fallback pool is ready whether or not h2 succeeds
  • h2 fast path is unchanged; multiplexing still wins in steady state
  • run_pool_refill trims the h1 pool back to POOL_MIN_H2_FALLBACK (2) once h2 lands
  • Two commits: the first is the race fix; the second adds bounded timeouts on open() and synchronous detection of mid-session h2 deaths, closing two adjacent paths where the same stall pattern could recur

Bisect

Bisected across git log v1.9.14..v1.9.15 (4 commits, 2 of them functional). The kill switch added by PR #799 itself made it possible to confirm the offending SHA at runtime instead of running a full git bisect run:

| Build | Probe result |
|---|---|
| v1.9.14 baseline | ✅ 200 OK in 17.9 s |
| v1.9.15 default (h2 enabled) | curl: (28) SSL connection timeout after 12 s |
| v1.9.15 + "force_http1": true | ✅ 200 OK in 17.7 s |

Bisected commit: 0e678630a8bde9ccb02b9d40dd2de4bc65bdfbf4 ("HTTP/2 multiplexing on relay leg with idempotency-safe h1 fallback", PR #799).

Symptoms

Unique log lines on the failing default-v1.9.15 run (comm -23 bad good):

DEBUG h2 batch request pre-send failure: relay error: h2 unavailable — falling back to h1
DEBUG h2 open timed out after 8s — falling back to h1
WARN  batch timed out after 30s (script AKfycbwj…, 1 ops)
ERROR tunnel connect_data error for api.ipify.org:443: batch timed out
  • h1 fallback also fails on default v1.9.15
  • Manual curl to the same Apps Script deployment with the same AUTH_KEY cleanly drives a tunnel-node session
  • The h1 path itself is fine; it's just empty when it's needed

Changes

Commit 1: fix: warm h1 pool in parallel with h2 (#924)

  • Spawn the h2 prewarm in a separate task so the h1 prewarm loop runs concurrently
  • All n h1 sockets (6–16, depending on num_scripts.clamp) are warm before user traffic arrives, even when the h2 handshake is slow or hits its 8 s timeout
  • Trade-off: extra TLS handshakes at startup
    • run_pool_refill trims the pool back to POOL_MIN_H2_FALLBACK (2) within 5 s when h2 ends up the active fast path
    • Surplus h1 sockets age out via POOL_TTL_SECS rather than being kept alive

Commit 2: fix: bound h1 open() with timeout, detect dead h2 cells synchronously

Two changes, both domain_fronter.rs-only:

  • H1_OPEN_TIMEOUT_SECS = 8 wraps the TCP+TLS handshake in open()
  • H2Cell.dead: Arc<AtomicBool> flipped by the connection driver task when Connection::await ends (GOAWAY, network error, normal close)
    • ensure_h2's fast path and run_pool_refill's pool-target check both consult the flag
    • A known-dead cell is rejected within ≤5 s instead of waiting for H2_CONN_TTL_SECS (540 s) to expire or for a request to discover the breakage via ready() failure

API impact:

  • h2_handshake_post_tls's return type changes to (SendRequest, Arc<AtomicBool>)
  • One existing test (h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1) tweaks its Ok arm to match — panic message unchanged

Testing

Unit + integration

  • 209 lib tests pass (was 208; added ensure_h2_rejects_dead_cell_within_ttl)
  • 36 tunnel-node tests pass
cargo test --lib
cargo test --manifest-path tunnel-node/Cargo.toml

Live end-to-end

Topology:

  • macOS arm64 client
  • → Tailscale exit-node Hetzner Helsinki
  • → Google edge 216.239.38.120
  • → Apps Script (web app deployment)
  • → cloudflared quick-tunnel
  • → tunnel-node v1.9.18 in local Docker
  • → real internet
| Scenario | Result |
|---|---|
| 5 cold restarts (warm-up race window) | ✅ 5/5 — 9.6 s, 13.0 s, 22.5 s, 16.4 s, 10.6 s |
| Concurrent burst (5 simultaneous SOCKS5 streams) | ✅ 5/5 — all completed in 10.29 s (h2 multiplex) |
| Default full.json baseline (no force_http1) | ✅ 200 OK in 13.3 s |
| force_http1: true (sanity, h2 disabled) | ✅ 200 OK in 17.7 s |

A/B vs PR #903 (per-session pipelining, currently approved-but-unmerged)

| Branch | Pass rate (5 alternating cold restarts) | Mean probe time |
|---|---|---|
| pr/903 alone | 5/5 | 8.86 s |
| pr/903 + this PR | 5/5 | 6.78 s |

If #903 lands first, this PR will need a small rebase but the overlap is mechanical.

Reproducing

A public-facing tunnel-node is required (Apps Script runs in Google's cloud and can't reach localhost).

  1. Stand up tunnel-node + a public URL:

    docker run -d --name mhrv-tunnel-test \
      -p 127.0.0.1:8080:8080 \
      -e TUNNEL_AUTH_KEY="$(openssl rand -hex 24)" \
      ghcr.io/therealaleph/mhrv-tunnel-node:latest
    cloudflared tunnel --no-autoupdate --protocol http2 --url http://127.0.0.1:8080

    (Or use a real VPS / Cloud Run / GitHub Actions — see assets/github-actions-tunnel/.)

  2. Deploy assets/apps_script/CodeFull.gs as a Web App with:

    • TUNNEL_SERVER_URL pointing at the URL from step 1
    • TUNNEL_AUTH_KEY matching the docker run -e value
    • A fresh AUTH_KEY
    • "Execute as: Me", "Who has access: Anyone"
  3. Write config.full.json with:

    • "mode": "full"
    • "google_ip": "216.239.38.120"
    • "front_domain": "www.google.com"
    • The deployment ID as script_id
    • The matching auth_key
  4. Reproduce the regression on v1.9.15:

    git worktree add ../mhrv-v1915 v1.9.15
    cd ../mhrv-v1915
    cargo build --bin mhrv-rs
    ./target/debug/mhrv-rs --config /path/to/config.full.json &
    curl --socks5-hostname 127.0.0.1:8086 --max-time 30 https://api.ipify.org?format=json
    # → curl: (28) SSL connection timeout (within ~12 s)
  5. Verify on this branch:

    git checkout fix/full-mode-warm-race-924
    cargo build --bin mhrv-rs
    ./target/debug/mhrv-rs --config /path/to/config.full.json &
    curl --socks5-hostname 127.0.0.1:8086 --max-time 30 https://api.ipify.org?format=json
    # → 200 OK with {"ip": "..."} body, typically 5–17 s
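For reference, a config.full.json matching step 3 might look like this (the script_id and auth_key values are placeholders, not working credentials):

```json
{
  "mode": "full",
  "google_ip": "216.239.38.120",
  "front_domain": "www.google.com",
  "script_id": "YOUR_DEPLOYMENT_ID",
  "auth_key": "YOUR_AUTH_KEY"
}
```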

Probe-time variance comes from the Tailscale / Apps Script / cloudflared chain, not the proxy itself. On a direct VPS path, expect 3–5 s.

Out of scope

  • H2_OPEN_TIMEOUT_SECS (8 s) occasionally hitting on a cold first connection. Variable network behaviour (Apps Script container warm-up, Google edge cache miss, exit-node tunneling latency), not a code bug. The warm-race + dead-flag + open-timeout fixes in this PR mean a slow h2 init is gracefully handled by the warm h1 pool.
  • ngrok-free.dev / trycloudflare DNS-blocking at TCI (Iran ISP). Tracked separately in the discussion on #924.
  • Mid-session h2 stability ("works for hours then dies" reports). The dead-flag fix closes one mechanism (silent stale-cell reuse); broader long-running stability tracked separately.

@github-actions bot added the type: fix label (fix: PR — auto-applied by release-drafter) May 10, 2026
Comment thread src/domain_fronter.rs Outdated
// h1 pool above is already populated either way. JoinError
// collapses to "h2 not alive" — same as if ensure_h2 returned
// None — so we still log a useful line.
let h2_alive = h2_handle.await.unwrap_or(false);
Contributor


A panic in the spawned ensure_h2() task is now invisible — unwrap_or(false) swallows
JoinError and we just log "pool pre-warmed with N connection(s)" as if h2 simply wasn't
available. Pre-PR a panic here would have crashed the warm task and shown up; now it's
indistinguishable from a clean ALPN refusal.

Suggest:

let h2_alive = match h2_handle.await {
    Ok(v) => v,
    Err(e) => {
        tracing::warn!("h2 prewarm task failed to join: {}", e);
        false
    }
};

Comment thread src/domain_fronter.rs
// a fake address in `fronter_for_test`, so the open will fail
// — and ensure_h2 returns None. The point is: the stale gen=1
// SendRequest was NOT served.
let result = fronter.ensure_h2().await;
Contributor


This test relies on 127.0.0.1:443 being closed so open_h2's TCP-connect fails fast
with ECONNREFUSED. On most dev/CI hosts that holds, but if a developer has HTTPS bound on
localhost (mkcert/Caddy/local nginx) the connect will succeed and the TLS handshake will
stall up to H2_OPEN_TIMEOUT_SECS = 8s before the assertion finally passes — flaky and
slow.

The test only cares that a dead-flagged cell isn't served, not that the open path actually
runs. Pre-set the failure-backoff so ensure_h2 short-circuits before any network I/O:

*fronter.h2_open_failed_at.lock().await = Some(Instant::now());

That isolates the assertion to the new dead-flag check and makes the test deterministic
regardless of what's running on localhost.

Bonus: a follow-up assertion that *fronter.h2_cell.lock().await is None after the call
would lock down the "cleanup on failed reopen" invariant too.

Comment thread src/domain_fronter.rs Outdated
/// prewarm behind `ensure_h2()` so the h1 pool stayed empty during
/// the h2 init window.
///
/// Staggered 500 ms apart so we don't burst N TLS handshakes at the
Contributor


This paragraph predates the parallel-prewarm change. With h2 now spawned alongside h1[0],
startup actually fires two TLS handshakes back-to-back (h2 handshake + h1[0]) before any
500 ms stagger kicks in — the stagger only applies between h1[i] and h1[i+1] for i ≥ 1.

The new behavior is intentional and acceptable, but a future reader chasing "why are we
seeing 2 handshakes simultaneously at boot?" will be misled by the current wording.
Suggest one sentence noting that the spawned h2 handshake races h1[0], and the 500 ms
stagger only applies to h1[1..n].

@therealaleph therealaleph merged commit 4d3e621 into therealaleph:main May 10, 2026
1 check passed
therealaleph added a commit that referenced this pull request May 10, 2026
Bumps Cargo.toml v1.9.19 → v1.9.20 and ships the changelog. Headline
fix: the v1.9.15 Full-mode regression that's been tracking in #924 for
~3 weeks is resolved by @rezaisrad's PR #1029. Bisect-quality root
cause (h1 prewarm gated behind h2 handshake, both stall on cold start
under the same network conditions). Affected users can drop the
`force_http1: true` workaround now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
therealaleph pushed a commit that referenced this pull request May 10, 2026
Skip H2 for `tunnel_batch_request_to` (the Full-mode batch path).
Tunnel batches already coalesce N ops into one HTTP request — H2 stream
multiplexing has nothing to multiplex. The H2 try/fallback path on this
specific code path introduced three regressions vs v1.9.14:

1. **Long-poll stalls**: idle polls completed at 16-17s (`LONGPOLL_DEADLINE`
   + network latency) instead of timing out at 10s on H1. Each poll held
   an Apps Script execution slot 60% longer.
2. **Silent batch drops**: `RequestSent::Maybe` failures dropped the
   entire batch with no retry — a failure mode H1 doesn't have.
3. **Pool starvation**: `POOL_MIN_H2_FALLBACK = 2` trimmed the H1 pool
   from 8 → 2 once H2 connected, but tunnel batches still used H1 and
   needed the full pool.

H2 multiplexing is **kept active for relay mode** (non-full) where each
browser request is a separate HTTP call that genuinely benefits from
stream multiplexing. r0ar's controlled A/B test in #962 confirmed h2
is strictly better than `force_http1: true` for apps_script-mode users,
and that path is unchanged here.

## Changes

- `tunnel_batch_request_to`: remove H2 try/fallback/NonRetryable block,
  go straight to H1 pool `acquire()` (-54 lines).
- `run_pool_refill`: always maintain `POOL_MIN = 8`. Remove the
  `POOL_MIN_H2_FALLBACK = 2` trim that was starving tunnel batches
  (-12 lines).

## A/B results (Pixel 6 Pro, 30 batch samples each)

| Metric | H2 (stock v1.9.20) | H1 (this PR) | v1.9.14 (baseline) |
|---|---|---|---|
| 16-17s batches | **8-10/30** | **0/30** | **0/30** |
| 10s timeouts | 0 | 4/30 | 5/30 |
| Active RTTs | 1.4-2.4s | 1.3-2.2s | 1.4-2.3s |

Restores v1.9.14 tunnel performance while keeping all v1.9.15+
improvements (H2 for relay, zero-copy mux, block DoH/QUIC, TLS pool
tuning, PR #1029's warm-race fix).

## Verified locally on top of v1.9.20

- `cargo test --lib --release`: 209/209 ✅ (matches v1.9.20 baseline)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean ✅

## Interaction with PR #1029 (just shipped in v1.9.20)

PR #1029 added `H2Cell.dead: Arc<AtomicBool>` for synchronous dead-cell
detection. With this PR removing the H2 path for tunnel batches, the
dead-cell flag is no longer consulted on the tunnel batch path — that's
intentional (the flag now scopes to relay mode, which is the path it
was protecting in practice).

Reviewed via Anthropic Claude.

Co-Authored-By: yyoyoian-pixel <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

type: fix fix: PR — auto-applied by release-drafter


Development

Successfully merging this pull request may close these issues.

Problem with full-tunnel mode on VPS from version 1.9.15 onward

3 participants