feat(ilp): multi-host ws ingress + egress failover hardening #20
Open
Summary
Multi-host failover for ws/wss ingress; egress `QwpQueryClient` failover loop refactored onto a shared host-health tracker; role-aware `503 + X-QuestDB-Role` upgrade-reject handling unified across both.

Ingress
- `addr=` accepts multiple hosts (comma list, repeated key, IPv4, bracketed and bare IPv6); TCP/UDP remain single-host.
- `QwpWebSocketSender.buildAndConnect()` walks the shared `QwpHostHealthTracker` on every reconnect (initial, I/O loop, orphan-drainer replays); it is `synchronized` to serialise concurrent callers.
- `auth_timeout_ms` (default 15s) bounds the upgrade response read per endpoint. This catches "TCP accepts but never replies" blackholes that the OS connect timeout misses.
- `gorilla=on|off` is now passed through `connect()` (previously a post-connect setter).

Egress (`QwpQueryClient`)
- `reconnectViaTracker()` replaces `reconnectSkippingIndex(failedIndex)`: the next host is picked by tracker priority instead of "skip the failed index" round-robin.
- `lb_strategy=random|first` (default `random`): `random` shuffles the initial endpoint list so N clients spread across N hosts.
- `failover_max_duration_ms` (default 30s): wall-clock cap, complementary to `failover_max_attempts`.
- Backoff jitter is now drawn from `[0, capped]`. The previous additive `[base, 2·base)` range, clamped at `failover_max_backoff_ms`, self-cancelled at saturation: once `base` reached the cap, every draw clamped to the same value.
- `target=any` still reads from any role. With `target=primary|replica`, SERVER_INFO mismatches and `503 + X-QuestDB-Role` rejects are classified by role: `PRIMARY_CATCHUP` as transient, `REPLICA` as topological.

Shared
- `QwpHostHealthTracker`: priority `Healthy → Unknown → TransientReject → TransportError → TopologyReject`; sticky-`Healthy` across rounds.
- `QwpIngressRoleRejectedException` carries `role + host:port` with `isTransient` / `isTopological`.
- `WebSocketClient.getUpgradeRejectRole()` parses `X-QuestDB-Role` from `503` upgrade responses.

Config keys
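| Key | Default |
| --- | --- |
| `auth_timeout_ms` | `15000` (15s) |
| `gorilla` | `on` |
| `lb_strategy` | `random` |
| `failover_max_duration_ms` | `30000` (30s) |

Connect/error logs now emit the currently-bound endpoint, not the first configured host.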
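To illustrate the backoff jitter change listed under Egress, here is a minimal sketch comparing the two schemes. It is not the client's code: the class and method names are hypothetical, and it assumes `capped` means the base delay clamped to `failover_max_backoff_ms`. Drawing uniformly from `[0, capped]` is often called full jitter.

```java
import java.util.Random;

// Illustrative sketch only; class and method names are hypothetical,
// not the client's actual API.
final class BackoffJitterSketch {
    private BackoffJitterSketch() {
    }

    // Previous scheme as described: uniform in [base, 2*base), then clamped.
    // Once baseMs >= maxBackoffMs, every draw clamps to maxBackoffMs.
    static long additiveJitterMs(long baseMs, long maxBackoffMs, Random rnd) {
        long jittered = baseMs + (long) (rnd.nextDouble() * baseMs);
        return Math.min(jittered, maxBackoffMs);
    }

    // New scheme as described: clamp first, then draw uniformly from [0, capped].
    static long fullJitterMs(long baseMs, long maxBackoffMs, Random rnd) {
        long capped = Math.min(baseMs, maxBackoffMs);
        long draw = (long) (rnd.nextDouble() * (capped + 1));
        return Math.min(draw, capped); // guard against floating-point rounding
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        long baseMs = 8_000;       // assumed: base delay already past the cap
        long maxBackoffMs = 5_000; // stands in for failover_max_backoff_ms
        for (int i = 0; i < 5; i++) {
            System.out.println("additive=" + additiveJitterMs(baseMs, maxBackoffMs, rnd)
                    + " full=" + fullJitterMs(baseMs, maxBackoffMs, rnd));
        }
    }
}
```

With the base delay past the cap, the additive variant returns the cap on every call, so clients that failed together retry together. The full-jitter variant keeps spreading retries over the whole interval.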
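The priority order listed under Shared can be pictured with a small selection sketch. This is not the actual `QwpHostHealthTracker`: the class, its methods, and the per-round "attempted" bookkeeping are assumptions made for illustration. How the real tracker resets or ages states between rounds is not specified here, so the sketch simply keeps every recorded state until it is overwritten.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical stand-in, not the real QwpHostHealthTracker.
final class HostHealthSketch {
    // Declaration order encodes priority: earlier constants are tried first.
    enum State { HEALTHY, UNKNOWN, TRANSIENT_REJECT, TRANSPORT_ERROR, TOPOLOGY_REJECT }

    private final List<String> hosts;
    private final State[] states;
    private final boolean[] attempted; // assumed per-round bookkeeping

    HostHealthSketch(List<String> hosts) {
        this.hosts = new ArrayList<>(hosts);
        this.states = new State[hosts.size()];
        this.attempted = new boolean[hosts.size()];
        Arrays.fill(states, State.UNKNOWN);
    }

    // Records the outcome of the latest attempt against a host.
    void record(int index, State state) {
        states[index] = state;
    }

    // Starts a new round: every host may be tried again, recorded states are kept.
    void beginRound() {
        Arrays.fill(attempted, false);
    }

    // Returns the best-priority host not yet tried this round (ties go to the
    // earliest configured host) and marks it attempted; -1 when none are left.
    int pickNext() {
        int best = -1;
        for (int i = 0; i < states.length; i++) {
            if (attempted[i]) {
                continue;
            }
            if (best < 0 || states[i].ordinal() < states[best].ordinal()) {
                best = i;
            }
        }
        if (best >= 0) {
            attempted[best] = true;
        }
        return best;
    }

    String host(int index) {
        return hosts.get(index);
    }
}
```

A caller would invoke `pickNext()` until it returns `-1`, record each outcome with `record(...)`, then call `beginRound()` before the next pass. Because the last host known to be `HEALTHY` sorts first, the next round starts with it.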