feat(relay): prioritize mux dispatch and expose script health #1388
Open
maybeknott wants to merge 4 commits into
Classify tunnel mux messages by dispatch urgency before applying the coalescing wait. Plain connection opens, connect-and-send opens, and data-bearing TCP or UDP operations now bypass the short batching delay once any already queued work has been drained. Empty polling operations and close notices remain batch-friendly so idle long-poll cadence and cleanup traffic can still piggyback without forcing extra Apps Script batches.

The change leaves batch serialization, response indexing, payload-size limits, operation-count limits, deployment selection, and Apps Script quota accounting unchanged. It only decides whether the mux should wait for additional operations before processing the current group, reducing avoidable latency for interactive flows while preserving batching behavior for low-urgency traffic.

Add focused unit coverage for immediate opening and payload-carrying messages, batchable empty polls and closes, and mixed groups where one urgent operation should short-circuit the wait.
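The urgency rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `MuxOp`, `Urgency`, `classify`, and `should_wait_for_more` are hypothetical names, and the real mux message type almost certainly carries more fields (connection IDs, reply channels, and so on).

```rust
/// Hypothetical, simplified stand-in for a tunnel mux operation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MuxOp {
    /// Plain connection open.
    Connect,
    /// Connect-and-send open carrying an initial payload.
    ConnectData(Vec<u8>),
    /// TCP data; an empty payload models an idle poll.
    TcpData(Vec<u8>),
    /// UDP data; an empty payload models an idle poll.
    UdpData(Vec<u8>),
    /// Close notice.
    Close,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Urgency {
    /// Skip the coalescing wait.
    Immediate,
    /// Safe to hold briefly so other operations can piggyback.
    Batchable,
}

/// Opens and payload-carrying operations are urgent; empty polls and
/// close notices stay batch-friendly.
pub fn classify(op: &MuxOp) -> Urgency {
    match op {
        MuxOp::Connect | MuxOp::ConnectData(_) => Urgency::Immediate,
        MuxOp::TcpData(p) | MuxOp::UdpData(p) if !p.is_empty() => Urgency::Immediate,
        MuxOp::TcpData(_) | MuxOp::UdpData(_) | MuxOp::Close => Urgency::Batchable,
    }
}

/// A group waits for more work only if nothing in it is urgent, so a
/// single urgent operation short-circuits the wait for the whole group.
pub fn should_wait_for_more(group: &[MuxOp]) -> bool {
    !group.iter().any(|op| classify(op) == Urgency::Immediate)
}
```

In this sketch the decision only affects whether the group waits; it does not reorder or split the group, which matches the statement that batch serialization and response indexing stay unchanged.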
Apps Script quota is consumed per relay invocation, but a plain round-robin selector has no memory of how heavily this client has used each deployment inside the recent quota window. When multiple script IDs are configured, continuing to select an already saturated deployment while another configured deployment is still locally underused wastes available capacity and increases the chance of quota-related relay stalls.

DomainFronter now keeps a per-script local ledger of selection timestamps in a rolling 24-hour window. Before choosing a script ID, the selector prunes expired observations and prefers non-blacklisted deployments whose local call count remains below the free-tier request budget. Both the single-request selector and the parallel fan-out selector use the same ledger so Apps Script batches and relay fan-out draw from the same local capacity model.

The ledger records selections at dispatch time. That deliberately accounts for concurrent fan-out attempts and for requests that may still complete server-side after the Rust future is dropped. The ledger is a local steering signal rather than an authoritative Google quota reading: if every non-blacklisted deployment is locally saturated, the selector still returns a deployment instead of creating a client-side outage. This preserves connectivity for paid Workspace quotas, shared deployments whose external usage is invisible to this process, and cases where the local estimate is conservative.

Selection remains decoupled from the existing failure blacklist. Blacklisted deployments are still skipped first; the rolling quota ledger only orders otherwise healthy deployments by locally observed capacity. If all deployments are blacklisted, the existing earliest-cooldown recovery path is preserved and the selected deployment is recorded in the ledger.
The guide now describes the local rolling 24-hour ledger in the Full Mode deployment-scaling section, including the fact that it steers away from deployments this client has already driven near the free-tier request budget. Unit coverage exercises saturated deployment skipping, expired observation pruning, all-saturated connectivity fallback, and parallel selection preferring unsaturated deployments.
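To make the ledger behavior concrete, here is a small sketch under stated assumptions. `QuotaLedger`, its fields, and `select` are hypothetical names rather than the PR's API; the window and budget are constructor parameters instead of the 24-hour and free-tier values; the selector picks the first unsaturated deployment rather than modeling round-robin; and the all-blacklisted case simply falls back to the first ID instead of the earliest-cooldown recovery path.

```rust
use std::collections::{HashMap, HashSet, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical local ledger of selection timestamps per script ID.
pub struct QuotaLedger {
    window: Duration,
    budget: usize,
    calls: HashMap<String, VecDeque<Instant>>,
}

impl QuotaLedger {
    pub fn new(window: Duration, budget: usize) -> Self {
        Self { window, budget, calls: HashMap::new() }
    }

    /// Drop observations that have aged out of the rolling window.
    fn prune(&mut self, now: Instant) {
        let window = self.window;
        for stamps in self.calls.values_mut() {
            while let Some(&oldest) = stamps.front() {
                if now.saturating_duration_since(oldest) >= window {
                    stamps.pop_front();
                } else {
                    break;
                }
            }
        }
    }

    /// Number of locally observed selections still inside the window.
    pub fn count(&self, id: &str) -> usize {
        self.calls.get(id).map_or(0, |stamps| stamps.len())
    }

    /// Prefer a healthy deployment under budget. If every healthy one is
    /// saturated, still return one so the client never blocks itself.
    pub fn select(
        &mut self,
        ids: &[&str],
        blacklisted: &HashSet<&str>,
        now: Instant,
    ) -> Option<String> {
        self.prune(now);
        let healthy: Vec<&str> = ids
            .iter()
            .copied()
            .filter(|id| !blacklisted.contains(*id))
            .collect();
        let chosen: &str = healthy
            .iter()
            .copied()
            .find(|id| self.count(id) < self.budget)
            .or_else(|| healthy.first().copied())
            // Simplification: the PR keeps an earliest-cooldown recovery path here.
            .or_else(|| ids.first().copied())?;
        // Record at dispatch time, before the request outcome is known.
        self.calls.entry(chosen.to_string()).or_default().push_back(now);
        Some(chosen.to_string())
    }
}
```

Passing `now` explicitly keeps the sketch deterministic to test; recording before the request completes mirrors the stated goal of counting fan-out attempts and dropped futures.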
A single cooldown duration is too coarse for Apps Script deployment failures. Quota exhaustion and account-level authorization failures recover on a much longer cadence than transient Google edge or Apps Script backend failures. Treating both classes the same either probes exhausted deployments too aggressively or removes transiently unhealthy deployments for longer than necessary.

Relay failure handling now classifies script failures into two explicit quarantine classes. HTTP 429, HTTP 403, and response bodies that match quota or service-invocation limit text are treated as hard quota/account failures and quarantined for 24 hours. Google or Apps Script transient 5xx responses are treated as temporary relay failures and use the existing short cooldown window.

The transient class is deliberately narrow. Generic upstream 5xx bodies such as a destination-origin bad gateway do not quarantine a script ID by themselves; the body must look like a Google, Apps Script, GFE, backend, service-unavailable, temporary, or timeout failure. This avoids punishing healthy deployments for ordinary origin-side errors that Apps Script relayed correctly.

The same classifier is used across the direct relay path, h1 fallback path, tunnel single-operation path, and tunnel batch path. Quota-like errors returned inside the Apps Script JSON envelope still force the hard quarantine path even when the outer HTTP status is 200.

The English and Persian guides now describe auto-quarantine as two failure classes instead of a single ten-minute blacklist. Unit coverage verifies hard quota/account classification, transient Google-edge classification, ordinary upstream 5xx pass-through, and the quarantine durations for both classes.
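The two-class rule can be illustrated with a short sketch. Everything named here is an assumption for illustration: `Quarantine`, `classify_failure`, and the marker lists are hypothetical, the matched substrings are examples rather than the PR's actual patterns, and the ten-minute transient cooldown is inferred from the guide's earlier "ten-minute blacklist" wording.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Quarantine {
    /// Quota exhaustion or account-level authorization failure.
    HardQuota,
    /// Transient Google edge or Apps Script backend failure.
    Transient,
}

impl Quarantine {
    pub fn duration(self) -> Duration {
        match self {
            Quarantine::HardQuota => Duration::from_secs(24 * 60 * 60),
            // Assumed value for the "existing short cooldown window".
            Quarantine::Transient => Duration::from_secs(10 * 60),
        }
    }
}

// Illustrative markers only; the real classifier may match different text.
const QUOTA_MARKERS: &[&str] = &["quota", "service invoked too many times"];
const TRANSIENT_MARKERS: &[&str] = &[
    "google", "apps script", "gfe", "backend",
    "service unavailable", "temporar", "timeout",
];

fn contains_any(haystack: &str, needles: &[&str]) -> bool {
    needles.iter().any(|n| haystack.contains(*n))
}

/// Returns `None` when the failure should not quarantine the script at
/// all, for example an ordinary origin-side 5xx relayed correctly.
pub fn classify_failure(status: u16, body: &str) -> Option<Quarantine> {
    let body = body.to_ascii_lowercase();
    // Quota text wins even when the outer HTTP status is 200.
    if status == 429 || status == 403 || contains_any(&body, QUOTA_MARKERS) {
        return Some(Quarantine::HardQuota);
    }
    // The transient class is narrow: 5xx alone is not enough.
    if (500..600).contains(&status) && contains_any(&body, TRANSIENT_MARKERS) {
        return Some(Quarantine::Transient);
    }
    None
}
```

Checking the quota markers first is what lets a quota error inside a 200 envelope still reach the hard quarantine path.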
Add a read-only per-deployment health snapshot over the existing relay state so operators can inspect how deployment selection is behaving without changing the scheduler itself. The snapshot reports masked script IDs, locally observed rolling quota usage, the configured local quota threshold, saturation state, active cooldown seconds, cooldown reason, and timeout strike count. Cooldown reasons are tracked alongside the existing blacklist timestamps and are pruned whenever expired blacklist entries are removed.

Surface the snapshot in the desktop UI as a collapsible Script health table, clear stale rows when the proxy stops or exits, and document that these values are local client observations rather than authoritative Google-side quota counters.

Add focused unit coverage for quota saturation, cooldown reason exposure, timeout strike visibility, and compact duration formatting. The relay routing, quarantine durations, and selection behavior remain unchanged.
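As a rough picture of what one snapshot row might hold, the sketch below mirrors the reported fields. The struct, `mask_id`, `compact_seconds`, and `snapshot` are hypothetical names; the masking format (first and last four characters) and the truncating duration format are assumptions, since the PR text does not specify either.

```rust
/// Hypothetical read-only row for the Script health table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScriptHealth {
    pub masked_id: String,
    pub calls_in_window: usize,
    pub quota_threshold: usize,
    pub saturated: bool,
    pub cooldown_secs: u64,
    pub cooldown_reason: Option<String>,
    pub timeout_strikes: u32,
}

/// Assumed masking: keep the first and last four characters, and fully
/// star out IDs too short to mask meaningfully.
pub fn mask_id(id: &str) -> String {
    let chars: Vec<char> = id.chars().collect();
    if chars.len() <= 8 {
        return "*".repeat(chars.len());
    }
    let head: String = chars[..4].iter().collect();
    let tail: String = chars[chars.len() - 4..].iter().collect();
    format!("{}...{}", head, tail)
}

/// Assumed compact format: whole seconds, minutes, or hours, truncated.
pub fn compact_seconds(secs: u64) -> String {
    match secs {
        0..=59 => format!("{}s", secs),
        60..=3599 => format!("{}m", secs / 60),
        _ => format!("{}h", secs / 3600),
    }
}

/// Build a row from local observations; the reason is only exposed
/// while a cooldown is actually active.
pub fn snapshot(
    id: &str,
    calls_in_window: usize,
    quota_threshold: usize,
    cooldown_secs: u64,
    reason: Option<&str>,
    timeout_strikes: u32,
) -> ScriptHealth {
    ScriptHealth {
        masked_id: mask_id(id),
        calls_in_window,
        quota_threshold,
        saturated: calls_in_window >= quota_threshold,
        cooldown_secs,
        cooldown_reason: if cooldown_secs > 0 {
            reason.map(str::to_string)
        } else {
            None
        },
        timeout_strikes,
    }
}
```

Because the row is built purely from values the client already tracks, producing it does not touch routing or selection, consistent with the read-only intent described above.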
This was referenced May 24, 2026
Contributor
One thing I would recommend: first finish the task you want to do, then push the branch, and only then open the PR. That generally makes it easier to merge, and doing it the other way around can cause problems in more active repos. The owner, who is the only maintainer, is sleeping for a few days, so it's fine in this case; I'm just mentioning it for later, especially the push part. If you make a mistake or want to merge some commits into one another, it's a lot easier to fix if you haven't pushed the branch yet. Once you push, the history stays and the branch commits can get a bit messy. A fully local branch is a lot easier to change around and fix than one that is already pushed, just so you know :)
This groups the relay batching, local quota steering, failure quarantine, and script-health visibility pieces into one coherent relay behavior slice.
TunnelMux now classifies queued operations by dispatch urgency before applying the coalescing wait. Connection opens, connect-and-send opens, and non-empty TCP/UDP data operations bypass the short batching delay once already queued work has been drained. Empty polls and close notices remain batch-friendly so idle long-poll cadence and cleanup traffic can still piggyback without forcing extra Apps Script batches.
DomainFronter also keeps a local rolling 24-hour ledger per configured Apps Script deployment. Selection prunes expired observations and prefers non-blacklisted deployments whose locally observed call count is still below the free-tier steering threshold. If all healthy deployments are locally saturated, routing continues instead of creating a client-side outage, preserving compatibility with paid quotas, shared deployments, and cases where the local estimate is conservative.
Failure handling now separates hard quota/account failures from transient relay failures. HTTP 429/403 and recognizable quota-exceeded bodies quarantine the affected deployment for 24 hours. Transient relay failures such as 5xx responses use a short cooldown, while timeout strikes continue to protect against repeatedly selecting a stalled deployment.
The desktop UI exposes a read-only Script health panel with masked deployment IDs, local rolling-window usage, saturation status, cooldown expiry, failure reason, and timeout strike counts. The guide documents that this is local telemetry rather than an authoritative Google quota reading.
Validation:
git diff --check
cargo test mux_priority --lib
cargo test should_fire --lib
cargo test quota --lib
cargo test quarantine --lib
cargo test script_health --lib
cargo test compact_seconds_formatter --features ui --bin mhrv-rs-ui