Investigate weird disconnects/bridge lock-ups of 2015-04-27 #115

Closed
stapelberg opened this issue Apr 27, 2015 · 5 comments

Comments

@stapelberg
Contributor

alp.robustirc.net was hard-rebooted using sysrq at 20:16 CEST (i.e. no proper TCP connection terminations). This was done because systemd (204) had locked up after a failed assertion and the system could not be rebooted any other way. In hindsight, killing the robustirc process would have been a good move.

A number of sessions ran into a network-side timeout, whereas others survived (e.g. sPhErE, using the bridge on eris, did not disconnect and still sent a message at 21:21 CEST).

@stapelberg
Contributor Author

@Merovius supplied a logfile which clarifies how long a connection attempt took:

2015/04/27 20:34:38 <-irc: "NICK Merovius"
2015/04/27 20:36:49 sendRequest("/robustirc/v1/session") failed: Post https://alp.robustirc.net:60667/robustirc/v1/session: dial tcp 46.20.246.99:60667: connection timed out

That dial took over two minutes to fail; we should set a more aggressive timeout than that.
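For illustration, a tighter dial timeout on the bridge’s HTTP client could look roughly like the sketch below (the function name and the 5-second value are made up for this example, not the actual bridge code):

package bridge

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient returns an HTTP client whose TCP connection attempts give up
// after a few seconds instead of waiting for the OS-level "connection timed
// out", which took over two minutes in the log above.
func newHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			Dial: (&net.Dialer{
				Timeout: 5 * time.Second, // illustrative value
			}).Dial,
		},
	}
}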

More importantly, though, it seems like bridges were getting 404s from the server and abandoned the robustsession as a consequence, leading to the timeout and eventual reconnect.

It’s not clear to me yet why IRC clients/bridges did not reconnect more aggressively. It seems like only WeeChat users (using the bridge via SOCKS) were affected, though; irssi users reconnected more quickly. Maybe WeeChat doesn’t have the same lag-check mechanism that irssi has (disconnect after no PING reply for n seconds)?

Here’s an excerpt from a WeeChat log:

20:34:32 robustirc =!= │ Could not post message to RobustIRC: No such RobustIRC session (killed by the network?)
20:34:32 robustirc =!= │ irc: reading data on socket: error 104 Connection reset by peer
20:34:32 robustirc  -- │ irc: disconnecting from server...
20:34:32 robustirc  -- │ irc: disconnected from server
20:34:32 robustirc  -- │ irc: reconnecting to server in 10 seconds
20:34:42 robustirc  -- │ irc: reconnecting to server...
20:34:42 robustirc  -- │ irc: connecting to server robustirc.net/6667 via socks5 proxy localhost/1080...
20:34:44 robustirc =!= │ irc: proxy fails to establish connection to server (check username/password if used and if server address/port
                       │ is allowed by proxy)
20:34:44 robustirc  -- │ irc: reconnecting to server in 20 seconds
20:35:04 robustirc  -- │ irc: reconnecting to server...
20:35:04 robustirc  -- │ irc: connecting to server robustirc.net/6667 via socks5 proxy localhost/1080...
20:35:04 robustirc  -- │ irc: connected to robustirc.net/6667 (127.0.0.1)
20:35:04 robustirc  -- │ Welcome to RobustIRC!

@stapelberg
Contributor Author

One thing to look at is how long a request can be stuck in the server-side proxying that we do. We should definitely set an aggressive deadline for that as well.
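As a sketch of what such a deadline could look like (assuming the proxying goes through a plain net/http client; the names below are illustrative), http.Client.Timeout bounds the whole proxied request:

package server

import (
	"net/http"
	"time"
)

// proxyClient is used for requests that get proxied to other nodes.
// Client.Timeout covers the entire exchange: connection setup, sending the
// request and reading the response.
var proxyClient = &http.Client{
	Timeout: 10 * time.Second,
}

// forwardRequest returns an error after at most 10 seconds if the target
// node hangs, instead of blocking indefinitely.
func forwardRequest(url string) (*http.Response, error) {
	return proxyClient.Get(url)
}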

@stapelberg
Contributor Author

I have a commit pending which makes the bridge use http.DefaultTransport instead of creating our own custom transport. That way, we inherit the TCP keepalive settings on modern Go versions (1.3 and newer, at least). I still need to backport this for go1.0.2 users (Debian wheezy).

This addresses the case where the TCP connection can be detected as broken by the OS, which would have helped with the issue at hand.
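For reference, on Go 1.3+ http.DefaultTransport is set up roughly like the transport below (proxy and TLS handshake settings omitted), which is why switching to it picks up TCP keepalives; a hand-rolled transport would have to enable them explicitly:

package bridge

import (
	"net"
	"net/http"
	"time"
)

// keepAliveTransport mirrors what http.DefaultTransport does on Go 1.3+: the
// net.Dialer enables TCP keepalive probes, so the OS eventually notices a
// peer that disappeared without closing the connection.
var keepAliveTransport = &http.Transport{
	Dial: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}).Dial,
}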

On the server, we should use an aggressive end-to-end timeout for the entire proxied request, i.e. set http.Client.Timeout = 10 * time.Second.

To be even more robust, we should also detect connections that are still okay on the TCP level but just don’t send any data; one can use kill -SIGSTOP on the server to simulate that. Currently, the bridge hangs in sync.WaitGroup.Wait() until the connection expires, because the JSON decoder blocks on Read without ever receiving new data. I’m not sure yet how to solve this.

@stapelberg
Contributor Author

Two quick thoughts on the last paragraph before I have to leave: I think the only way to cleanly get out of that state is to use CancelRequest (which in a first test didn’t work well for me and is not present in go1.0.2). But perhaps a better insight is that leaving goroutines and TCP connections open for the duration of the problem might actually be okay; they’ll vanish once the server is restarted, at the latest.
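For reference, the CancelRequest route (added in Go 1.1, hence missing in go1.0.2) would look roughly like this sketch, where tr and req stand for the bridge’s transport and the in-flight request:

package bridge

import (
	"net/http"
	"time"
)

// cancelAfter aborts the in-flight request req on tr if it has not completed
// after d. The returned timer should be stopped once the response has been
// read, so that healthy requests are not cancelled.
func cancelAfter(tr *http.Transport, req *http.Request, d time.Duration) *time.Timer {
	return time.AfterFunc(d, func() {
		tr.CancelRequest(req)
	})
}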

@stapelberg
Contributor Author

Actually, maybe we could wrap the connection and use SetReadDeadline (extending it after every successful read) in order to have an idle timeout on the long-running connection. I’ll check later whether that works with a net/http client in a reasonably straightforward manner.
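A sketch of that idea, assuming the wrapping happens in a custom Dial function on the transport (the type name and timeout value are illustrative):

package bridge

import (
	"net"
	"net/http"
	"time"
)

// readTimeoutConn extends the read deadline before every Read, so a
// long-running connection that stops delivering data (e.g. a server stopped
// with kill -SIGSTOP) errors out after the idle timeout instead of hanging.
type readTimeoutConn struct {
	net.Conn
	timeout time.Duration
}

func (c *readTimeoutConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.timeout)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}

// newIdleTimeoutTransport wraps every connection the transport dials.
func newIdleTimeoutTransport(timeout time.Duration) *http.Transport {
	return &http.Transport{
		Dial: func(network, addr string) (net.Conn, error) {
			conn, err := net.DialTimeout(network, addr, 30*time.Second)
			if err != nil {
				return nil, err
			}
			return &readTimeoutConn{Conn: conn, timeout: timeout}, nil
		},
	}
}

With the JSON decoder reading from such a connection, Decoder.Decode() returns an error once the deadline expires, which lets the bridge get out of sync.WaitGroup.Wait().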

stapelberg added a commit to robustirc/bridge that referenced this issue Apr 29, 2015
This is to reliably detect unresponsive servers.
See robustirc/robustirc#115 for context.

This change also prevents a lockup in the bridge where the bridge
would hang in the sync.WaitGroup.Wait call because
(encoding/json).Decoder.Decode() would block on the Read until the
underlying connection times out on the TCP level (can take O(hours)).
This lockup is now impossible because the Read will return with an
error after 70 seconds.
stapelberg added a commit that referenced this issue May 1, 2015
This bug results in a crash and contributed to the trouble
described in issue #115.