Server went into deadlock in cluster setup #786
Comments
@aforge I guess we can make some tweaks, such as detecting congestion here (Line 236 in ebcb536), dialing with a timeout and setting a deadline instead:

```go
conn, err := net.DialTimeout("tcp", n.address, time.Second)
if err != nil {
	// handle dialling error
}
// SetDeadline takes an absolute time.Time, not a time.Duration.
conn.SetDeadline(time.Now().Add(time.Second))
n.endpoint = rpc.NewClient(conn)
```

But doing just that will lead to cluster de-synchronization.
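For illustration, a hedged sketch of how the RPC hop itself could be bounded so the hub goroutine is never parked indefinitely, using net/rpc's asynchronous client.Go instead of the blocking client.Call (the method name mirrors Cluster.Route from this issue; the address, error, args, and reply types are placeholders, not Tinode's actual code):

```go
package main

import (
	"errors"
	"log"
	"net/rpc"
	"time"
)

// errRouteTimeout is a placeholder error for this sketch.
var errRouteTimeout = errors.New("cluster: Cluster.Route timed out")

// routeWithTimeout issues the RPC asynchronously via client.Go and
// gives up after one second instead of blocking like client.Call does.
func routeWithTimeout(client *rpc.Client, args, reply interface{}) error {
	call := client.Go("Cluster.Route", args, reply, make(chan *rpc.Call, 1))
	select {
	case done := <-call.Done:
		return done.Error
	case <-time.After(time.Second):
		// The call may still complete later; the caller must treat this
		// as congestion, which is where the de-sync risk comes from.
		return errRouteTimeout
	}
}

func main() {
	client, err := rpc.Dial("tcp", "127.0.0.1:12345") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	var reply string
	if err := routeWithTimeout(client, "msg", &reply); err != nil {
		log.Println("route failed:", err)
	}
}
```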
I think we need to reject client requests at the proxy topic when the proxy topic is unable to communicate with the master fast enough.

Actually, rejecting requests at the proxy topic looks pretty straightforward.
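A minimal sketch of that idea, assuming a buffered route channel like globals.hub.routeSrv (the names here are illustrative, not the merged fix): a non-blocking send fails fast when the buffer is full, so the proxy can reject the client request instead of parking its goroutine.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrClusterCongested is a hypothetical error for this sketch.
var ErrClusterCongested = errors.New("cluster congested, request rejected")

// tryRoute attempts a non-blocking send on a buffered route channel.
// If the buffer is full it fails fast instead of blocking the caller.
func tryRoute(routeSrv chan<- string, msg string) error {
	select {
	case routeSrv <- msg:
		return nil
	default:
		return ErrClusterCongested
	}
}

func main() {
	routeSrv := make(chan string, 1)
	fmt.Println(tryRoute(routeSrv, "first"))  // <nil>: fits in the buffer
	fmt.Println(tryRoute(routeSrv, "second")) // rejected: buffer full, nothing draining it
}
```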
Remove deadlock on server overload, #786
Fix merged, please verify. Thanks.
Deadlock Condition Observed
There is a cluster of 3 nodes. Under high load with tinode running, the servers went into a deadlock because the goroutine in the context of hub.go got stuck in an RPC call to another server within the cluster. This happened on every node in the cluster at once, creating a circular dependency between goroutines and hence a deadlock.

Your environment
Server-side
Cluster: true (3 nodes)

Steps to reproduce
In this scenario there are 3 servers: S1, S2, and S3. When the servers were put under high load, the channel globals.hub.routeSrv filled up and blocked (a buffered channel blocks once its buffer is full). If at this point S1's hub calls globals.cluster.routeToTopicIntraCluster, that function makes a request to Cluster.Route on S2. S1's hub then waits for the response, but S2's globals.hub.routeSrv channel is itself blocked, as described above, so S1's hub is stuck. This happens on all three nodes at the same time, so all servers end up in a deadlock from which they do not recover until restarted. A minimal reduction of this circular wait is sketched below.

chat/server/cluster.go, Line 584 in ebcb536
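The failure pattern can be reduced to a two-node sketch (illustrative, not Tinode code; plain channel sends stand in for the RPC hop): each "hub" blocks sending into the peer's full buffered channel, so neither side ever drains its own inbox.

```go
package main

func main() {
	// Buffered channels standing in for each node's hub.routeSrv.
	s1 := make(chan string, 1)
	s2 := make(chan string, 1)

	// Pre-fill both buffers to model the high-load state.
	s1 <- "backlog"
	s2 <- "backlog"

	done := make(chan struct{})

	// S1's hub routes to S2 (the Cluster.Route hop) and parks forever,
	// because s2 is full and S2's hub never gets to drain it.
	go func() { s2 <- "from S1"; done <- struct{}{} }()
	// S2's hub does the symmetric thing toward S1: a circular wait.
	go func() { s1 <- "from S2"; done <- struct{}{} }()

	// All goroutines are now blocked; the Go runtime aborts with
	// "fatal error: all goroutines are asleep - deadlock!".
	<-done
	<-done
}
```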
To verify this, I have attached goroutine stacks from when the cluster went into deadlock.

Hub goroutine stacks: S1, S2, S3
Cluster.Route goroutine stacks: S1, S2, S3
Actual behaviour

No user is able to perform operations like sub, pub, etc. from any client, and there is no response (no ctrl messages) from any server in the cluster.

Server-side GoRoutine Stacks
tinode-3-gr.txt
tinode-2-gr.txt
tinode-1-gr.txt