Server went in deadlock in cluster setup #786

Closed
1 of 3 tasks
aman-gupta-doc opened this issue Aug 31, 2022 · 5 comments
@aman-gupta-doc

Deadlock Condition Observed

There is a cluster of 3 nodes running tinode. Under high load the servers went into a deadlock because the hub goroutine (hub.go) on each node got stuck in an RPC call to another server in the cluster. This happened on all nodes at the same time, creating a circular dependency between goroutines and hence a deadlock.

Your environment

Server-side

  • web.tinode.co, api.tinode.co
  • sandbox.tinode.co
  • Your own setup:
    • Platform: Linux
    • Version: 0.19.3
    • DB: MongoDB
    • cluster: true (3 nodes)

Steps to reproduce

In this scenario there are 3 servers: S1, S2 and S3. Under high load the channel globals.hub.routeSrv fills up, so sends to it block (a buffered channel blocks when its buffer is full). If at that point S1's hub calls globals.cluster.routeToTopicIntraCluster, that function makes an RPC request to Cluster.Route on S2. S1's hub now waits for the response, but S2's Cluster.Route is itself blocked sending into S2's full globals.hub.routeSrv channel (the line below). The same thing happens on all three nodes at once, so every hub is stuck and the whole cluster deadlocks; it does not recover until the servers are restarted.

globals.hub.routeSrv <- msg.SrvMsg
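
A minimal, self-contained Go sketch of this circular wait (the node type, the tiny buffer size and the integer payload are made up for illustration; only the routeSrv / Route / hub roles mirror the real code):

package main

import "time"

// Hypothetical two-node model of the scenario above: routeSrv stands in for
// globals.hub.routeSrv, peer for the ClusterNode this hub forwards to.
type node struct {
	routeSrv chan int
	peer     *node
}

// Route mimics Cluster.Route on the receiving node: it hands the message to
// the local hub. When routeSrv is full, this send blocks ("chan send" at
// cluster.go:584 in the stacks below).
func (n *node) Route(msg int) {
	n.routeSrv <- msg
}

// hubRun mimics Hub.run: it drains routeSrv but makes a synchronous
// cross-node call per message, which does not return until the peer's Route
// returns (net/rpc Client.Call blocks; "chan receive" at cluster.go:286).
func (n *node) hubRun() {
	for msg := range n.routeSrv {
		n.peer.Route(msg) // stands in for routeToTopicIntraCluster -> rpc call
	}
}

func main() {
	a := &node{routeSrv: make(chan int, 4)} // tiny buffers to reach the state quickly
	b := &node{routeSrv: make(chan int, 4), peer: a}
	a.peer = b
	go a.hubRun()
	go b.hubRun()

	// Simulated client load on both nodes. Once both buffers are full, each
	// hub is blocked inside its cross-node call while the other node's Route
	// is blocked sending into the full buffer: a circular wait. In-process the
	// Go runtime reports "all goroutines are asleep - deadlock!"; over real
	// RPC the processes simply hang until restarted.
	for i := 0; ; i++ {
		a.routeSrv <- i
		b.routeSrv <- i
		time.Sleep(time.Millisecond)
	}
}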

To verify this, I have attached the goroutine stacks captured when the cluster went into deadlock.

Hub goroutine stacks:
S1

goroutine 51 [chan receive, 38 minutes]:
net/rpc.(*Client).Call(...)
	/usr/local/go/src/net/rpc/client.go:321
main.(*ClusterNode).call(0xc000160960, {0x111d390, 0xd}, {0xf018c0, 0xc00d7edb40}, {0xeeee80, 0xc00da4604d})
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:286 +0xfe
main.(*ClusterNode).route(...)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:381
main.(*Cluster).routeToTopicIntraCluster(0xc000092410, {0xc00ca5e650, 0xe}, 0xc00ca6ab40, 0x0)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:843 +0x1a5
main.(*Hub).run(0xc000092870)
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:254 +0x1165
created by main.newHub
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:143 +0x65b

S2

goroutine 50 [chan receive, 38 minutes]:
net/rpc.(*Client).Call(...)
	/usr/local/go/src/net/rpc/client.go:321
main.(*ClusterNode).call(0xc0003a8a20, {0x111d390, 0xd}, {0xf018c0, 0xc00145fd80}, {0xeeee80, 0xc0043d187d})
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:286 +0xfe
main.(*ClusterNode).route(...)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:381
main.(*Cluster).routeToTopicIntraCluster(0xc000100280, {0xc00202b190, 0xe}, 0xc006f00d80, 0x0)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:843 +0x1a5
main.(*Hub).run(0xc000c36000)
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:254 +0x1165
created by main.newHub
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:143 +0x65b

S3

goroutine 68 [chan receive, 38 minutes]:
net/rpc.(*Client).Call(...)
	/usr/local/go/src/net/rpc/client.go:321
main.(*ClusterNode).call(0xc000566a80, {0x111d390, 0xd}, {0xf018c0, 0xc00363dac0}, {0xeeee80, 0xc002d796bd})
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:286 +0xfe
main.(*ClusterNode).route(...)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:381
main.(*Cluster).routeToTopicIntraCluster(0xc000100280, {0xc005842420, 0xe}, 0xc002b098c0, 0x0)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:843 +0x1a5
main.(*Hub).run(0xc0001005f0)
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:254 +0x1165
created by main.newHub
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:143 +0x65b

Cluster.Route goroutine stacks

S1

goroutine 1224359 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00da86900, 0xc00da8aceb)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224394 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00d7edb80, 0xc00da460b3)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224393 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00d7edb00, 0xc00da25fbd)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224392 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00d7edac0, 0xc00da25fb3)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224358 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00da868c0, 0xc00da8ac8b)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

S2

goroutine 1211292 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc002312740, 0xc0036a8073)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1211328 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc0051b4040, 0xc0039f2823)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

S3

Couldn't find any matching Cluster.Route goroutines in this node's dump.

For the detailed goroutine stacks, refer to the server-side goroutine stack dumps attached below.

Actual behaviour

No user is able to perform operations such as sub, pub, etc. from any client. There is no response (ctrl messages) from any server in the cluster.

Server-side GoRoutine Stacks

tinode-3-gr.txt
tinode-2-gr.txt
tinode-1-gr.txt

@or-else
Contributor

or-else commented Aug 31, 2022

@aforge I guess we can make some tweaks, such as detecting congestion at n.endpoint.Call(proc, req, resp) and at globals.hub.routeSrv <- msg.SrvMsg; then, instead of fracturing the cluster, we can, say, start rejecting client requests. But ultimately someone will be able to overwhelm the cluster with enough load no matter what we do. What's the right way of handling it? Ideally we should handle as much traffic as we can and drop the rest. If not, what are the other options besides panic?
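
One possible shape of the congestion check at the channel send, sketched as a non-blocking select on a stand-in for the hub queue; the types, names and error value here are illustrative only, not the fix that was merged:

package main

import (
	"errors"
	"fmt"
)

// Stand-ins for globals.hub.routeSrv and its message type.
type srvMsg struct{ topic string }

var routeSrv = make(chan *srvMsg, 16)

var errCongested = errors.New("cluster: hub queue full, rejecting message")

// trySend attempts the send without blocking: if the buffer is full it fails
// fast, so the caller can reject or drop instead of parking forever in
// "chan send" the way Cluster.Route does in the stacks above.
func trySend(m *srvMsg) error {
	select {
	case routeSrv <- m:
		return nil
	default:
		return errCongested
	}
}

func main() {
	// Fill the buffer; the next send is rejected instead of blocking.
	for i := 0; i < cap(routeSrv); i++ {
		_ = trySend(&srvMsg{topic: "grpTest"})
	}
	if err := trySend(&srvMsg{topic: "grpTest"}); err != nil {
		fmt.Println(err) // e.g. reject the client request that produced it
	}
}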

@or-else
Contributor

or-else commented Aug 31, 2022

Here

if n.endpoint, err = rpc.Dial("tcp", n.address); err == nil {

instead of rpc.Dial("tcp", n.address) we can use something like

  conn, err := net.DialTimeout("tcp", n.address, time.Second)
  if err != nil {
    // handle dialling error
  }
  conn.SetDeadline(time.Now().Add(time.Second))
  n.endpoint = rpc.NewClient(conn)

But doing just that will lead to cluster de-synchronization.
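
A related note on the snippet above: net.Conn.SetDeadline takes an absolute time.Time, so a deadline set once right after dialling expires a second later and is not refreshed for subsequent calls; a timeout would have to be applied per call. A sketch of one way to bound an individual call using rpc.Client.Go (callWithTimeout and errCallTimeout are made-up names, not Tinode or net/rpc APIs):

package main

import (
	"errors"
	"net/rpc"
	"time"
)

var errCallTimeout = errors.New("cluster: RPC call timed out")

// callWithTimeout bounds a single net/rpc call. Client.Go is the asynchronous
// form of Call and signals completion on Done, so the caller can give up on a
// stuck call instead of parking in "chan receive" like the hub goroutines above.
func callWithTimeout(client *rpc.Client, proc string, req, resp interface{}, timeout time.Duration) error {
	call := client.Go(proc, req, resp, make(chan *rpc.Call, 1))
	select {
	case done := <-call.Done:
		return done.Error
	case <-time.After(timeout):
		// The underlying call may still complete later and nothing closes the
		// connection here, which is part of why timeouts alone can leave the
		// cluster de-synchronized, as noted above.
		return errCallTimeout
	}
}

func main() {
	// Usage from ClusterNode.call would look roughly like:
	//   err := callWithTimeout(n.endpoint, "Cluster.Route", req, resp, 2*time.Second)
	_ = callWithTimeout // no live RPC server in this sketch
}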

@or-else
Contributor

or-else commented Aug 31, 2022

I think we need to reject client requests at the proxy topic when the proxy topic is unable to communicate with the master fast enough.
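
A sketch of what "fast enough" could look like on the proxy side, using a bounded wait rather than the fail-fast variant above; the channel, function and error names are hypothetical and this is not the merged fix:

package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical forwarding queue from a proxy topic to its master on another node.
var toMaster = make(chan string, 16)

var errMasterSlow = errors.New("topic proxy: master not keeping up, rejecting request")

// forwardToMaster waits a bounded time for room in the queue and then gives
// up, so the topic goroutine never blocks indefinitely and the caller can
// answer the client with an error ctrl message instead.
func forwardToMaster(req string, wait time.Duration) error {
	select {
	case toMaster <- req:
		return nil
	case <-time.After(wait):
		return errMasterSlow
	}
}

func main() {
	// With nobody draining toMaster, the 17th request is rejected after 50ms.
	for i := 0; i < cap(toMaster)+1; i++ {
		if err := forwardToMaster(fmt.Sprintf("sub #%d", i), 50*time.Millisecond); err != nil {
			fmt.Println(err)
		}
	}
}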

@or-else
Contributor

or-else commented Aug 31, 2022

Actually, rejecting requests at the proxy topic looks pretty straightforward.

or-else added a commit that referenced this issue Sep 5, 2022
or-else added a commit that referenced this issue Sep 6, 2022
Remove deadlock on server overload, #786
@or-else
Contributor

or-else commented Sep 6, 2022

Fix merged, please verify. Thanks.

or-else closed this as completed Oct 25, 2022