Investigate weird disconnects/bridge lock-ups of 2015-04-27 #115
Comments
@Merovius supplied a logfile which clarifies how long a connection attempt took:
We should set a more aggressive timeout than that. More importantly, though, it seems like bridges were getting 404s from the server and abandoned the robustsession as a consequence, leading to the timeout and eventual reconnect. It’s not clear to me yet why IRC clients/bridges did not reconnect more aggressively. It seems like only WeeChat users (using the bridge via SOCKS) were affected, though. irssi users reconnected quicker. Maybe WeeChat doesn’t have the same lagcheck mechanism that irssi has? (disconnect after no PING reply for n seconds) Here’s an excerpt from a WeeChat log:
One thing to look at is how long a request can be stuck in the server-side proxying that we do. We should definitely set an aggressive deadline for that as well.
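For illustration, here is a minimal sketch of what an end-to-end deadline on the server-side proxying could look like, assuming the proxying goes through a plain net/http client; the function name and the 10-second value are made up for the example, not taken from the actual code:

```go
package proxy

import (
	"io"
	"net/http"
	"time"
)

// proxyClient bounds every proxied request (dial, request, response) with a
// single deadline, so a request can never be stuck in server-side proxying
// indefinitely. The 10-second value is only an example.
var proxyClient = &http.Client{
	Timeout: 10 * time.Second,
}

// forwardToLeader is a hypothetical helper: it forwards a request body to the
// current leader and fails fast once the client-wide deadline is exceeded.
func forwardToLeader(url string, body io.Reader) (*http.Response, error) {
	return proxyClient.Post(url, "application/json", body)
}
```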
I have a commit pending which makes the bridge use the default […]. This addresses the case where the TCP connection can be detected as broken by the OS, which would have helped for the issue at hand. On the server, we should use an aggressive end-to-end timeout for the entire proxied request, i.e. set […]. To be even more robust, we should also detect connections that are still okay on the TCP level, but just don’t send any data. One can use […].
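As a sketch of the bridge-side part (letting the OS detect broken connections), the dialer can be given an explicit connect timeout and a TCP keep-alive interval; both durations below are assumptions for illustration, not the values from the pending commit:

```go
package bridge

import (
	"net"
	"net/http"
	"time"
)

// newBridgeTransport returns an http.Transport whose connection attempts are
// bounded by an explicit timeout and whose connections send TCP keep-alive
// probes, so the OS eventually reports a dead peer even when no application
// data is flowing. Both durations are illustrative.
func newBridgeTransport() *http.Transport {
	return &http.Transport{
		Dial: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second,
		}).Dial,
	}
}
```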
Two quick thoughts on the last paragraph before I have to leave: I think the only way to cleanly get out of the state is to use […].
Actually, maybe we could wrap the connection and use SetReadDeadline (extending it after every successful read) in order to have an idle timeout on the long-running connection. Will check later if that works with a net/http client in a reasonably straightforward manner.
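A minimal sketch of that idea, assuming the long-running request goes through an http.Transport whose Dial hook wraps the connection; the type and function names are made up, and the 70-second value matches the one mentioned in the commit message below:

```go
package bridge

import (
	"net"
	"net/http"
	"time"
)

// deadlineConn arms a read deadline before every Read, so each Read fails
// unless data arrives within idleTimeout. On a long-running connection this
// acts as an idle timeout: a peer that stays up on the TCP level but goes
// silent is detected after idleTimeout.
type deadlineConn struct {
	net.Conn
	idleTimeout time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.idleTimeout)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}

// newIdleTimeoutTransport wires the wrapper into an http.Transport so that a
// blocking read on the response body returns an error once the server has
// been silent for idleTimeout (e.g. 70 seconds).
func newIdleTimeoutTransport(idleTimeout time.Duration) *http.Transport {
	return &http.Transport{
		Dial: func(network, addr string) (net.Conn, error) {
			conn, err := net.DialTimeout(network, addr, 10*time.Second)
			if err != nil {
				return nil, err
			}
			return &deadlineConn{Conn: conn, idleTimeout: idleTimeout}, nil
		},
	}
}
```

Arming the deadline before each Read rather than extending it after a successful one is equivalent for this purpose: a blocked Read then fails as soon as no data has arrived for idleTimeout.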
This is to reliably detect unresponsive servers. See robustirc/robustirc#115 for context. This change also prevents a lockup in the bridge where the bridge would hang in the sync.WaitGroup.Wait call because (encoding/json).Decoder.Decode() would block on the Read until the underlying connection times out on the TCP level (can take O(hours)). This lockup is now impossible because the Read will return with an error after 70 seconds.
This bug results in a crash and contributed to the trouble described in issue #115.
alp.robustirc.net was hard-rebooted using sysrq at 20:16 CEST (i.e. no proper TCP connection terminations). This was done because systemd (204) had locked up after a failed assertion and the system could not be rebooted any other way. In hindsight, killing the robustirc process would have been a good move.
A number of sessions were running into a network-side timeout, whereas others survived (e.g. sPhErE using the bridge on eris did not disconnect and still sent a message at 21:21 CEST).