Server went in deadlock in cluster setup #786

Closed
1 of 3 tasks
aman-gupta-doc opened this issue Aug 31, 2022 · 5 comments
@aman-gupta-doc

Deadlock Condition Observed

There is a cluster of 3 nodes running tinode. Under high load the servers went into a deadlock because the hub goroutine (hub.go) on each node got stuck in an RPC call to another server in the cluster. This happened on all nodes at the same time, creating a circular dependency between goroutines and hence a deadlock.

Your environment

Server-side

  • web.tinode.co, api.tinode.co
  • sandbox.tinode.co
  • Your own setup:
    • Platform: Linux
    • Version: 0.19.3
    • DB: MongoDB
    • cluster: true (3 nodes)

Steps to reproduce

In this scenario there are 3 servers: S1, S2 and S3. Under high load the channel globals.hub.routeSrv fills up, so sends to it block (a buffered channel blocks when its buffer is full). If at that point S1's hub calls globals.cluster.routeToTopicIntraCluster, that function makes an RPC request to Cluster.Route on S2. S1's hub now waits for the response, but S2's Cluster.Route is itself blocked sending into S2's full globals.hub.routeSrv channel (the line below). The same thing happens on all three nodes at once, so every hub is stuck and the whole cluster deadlocks; it does not recover until the servers are restarted.

globals.hub.routeSrv <- msg.SrvMsg
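
A minimal, self-contained Go sketch of this circular wait (the node type, the tiny buffer size and the integer payload are made up for illustration; only the routeSrv / Route / hub roles mirror the real code):

package main

import "time"

// Hypothetical two-node model of the scenario above: routeSrv stands in for
// globals.hub.routeSrv, peer for the ClusterNode this hub forwards to.
type node struct {
	routeSrv chan int
	peer     *node
}

// Route mimics Cluster.Route on the receiving node: it hands the message to
// the local hub. When routeSrv is full, this send blocks ("chan send" at
// cluster.go:584 in the stacks below).
func (n *node) Route(msg int) {
	n.routeSrv <- msg
}

// hubRun mimics Hub.run: it drains routeSrv but makes a synchronous
// cross-node call per message, which does not return until the peer's Route
// returns (net/rpc Client.Call blocks; "chan receive" at cluster.go:286).
func (n *node) hubRun() {
	for msg := range n.routeSrv {
		n.peer.Route(msg) // stands in for routeToTopicIntraCluster -> rpc call
	}
}

func main() {
	a := &node{routeSrv: make(chan int, 4)} // tiny buffers to reach the state quickly
	b := &node{routeSrv: make(chan int, 4), peer: a}
	a.peer = b
	go a.hubRun()
	go b.hubRun()

	// Simulated client load on both nodes. Once both buffers are full, each
	// hub is blocked inside its cross-node call while the other node's Route
	// is blocked sending into the full buffer: a circular wait. In-process the
	// Go runtime reports "all goroutines are asleep - deadlock!"; over real
	// RPC the processes simply hang until restarted.
	for i := 0; ; i++ {
		a.routeSrv <- i
		b.routeSrv <- i
		time.Sleep(time.Millisecond)
	}
}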

To verify this, I have attached the goroutine stacks captured when the cluster went into deadlock.

Hub goroutine stacks:
S1

goroutine 51 [chan receive, 38 minutes]:
net/rpc.(*Client).Call(...)
	/usr/local/go/src/net/rpc/client.go:321
main.(*ClusterNode).call(0xc000160960, {0x111d390, 0xd}, {0xf018c0, 0xc00d7edb40}, {0xeeee80, 0xc00da4604d})
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:286 +0xfe
main.(*ClusterNode).route(...)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:381
main.(*Cluster).routeToTopicIntraCluster(0xc000092410, {0xc00ca5e650, 0xe}, 0xc00ca6ab40, 0x0)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:843 +0x1a5
main.(*Hub).run(0xc000092870)
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:254 +0x1165
created by main.newHub
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:143 +0x65b

S2

goroutine 50 [chan receive, 38 minutes]:
net/rpc.(*Client).Call(...)
	/usr/local/go/src/net/rpc/client.go:321
main.(*ClusterNode).call(0xc0003a8a20, {0x111d390, 0xd}, {0xf018c0, 0xc00145fd80}, {0xeeee80, 0xc0043d187d})
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:286 +0xfe
main.(*ClusterNode).route(...)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:381
main.(*Cluster).routeToTopicIntraCluster(0xc000100280, {0xc00202b190, 0xe}, 0xc006f00d80, 0x0)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:843 +0x1a5
main.(*Hub).run(0xc000c36000)
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:254 +0x1165
created by main.newHub
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:143 +0x65b

S3

goroutine 68 [chan receive, 38 minutes]:
net/rpc.(*Client).Call(...)
	/usr/local/go/src/net/rpc/client.go:321
main.(*ClusterNode).call(0xc000566a80, {0x111d390, 0xd}, {0xf018c0, 0xc00363dac0}, {0xeeee80, 0xc002d796bd})
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:286 +0xfe
main.(*ClusterNode).route(...)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:381
main.(*Cluster).routeToTopicIntraCluster(0xc000100280, {0xc005842420, 0xe}, 0xc002b098c0, 0x0)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:843 +0x1a5
main.(*Hub).run(0xc0001005f0)
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:254 +0x1165
created by main.newHub
	/Users/gene/go/src/github.com/tinode/chat/server/hub.go:143 +0x65b

Cluster.Route goroutine stacks

S1

goroutine 1224359 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00da86900, 0xc00da8aceb)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224394 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00d7edb80, 0xc00da460b3)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224393 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00d7edb00, 0xc00da25fbd)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224392 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00d7edac0, 0xc00da25fb3)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1224358 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc00da868c0, 0xc00da8ac8b)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

S2

goroutine 1211292 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc002312740, 0xc0036a8073)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

goroutine 1211328 [chan send, 38 minutes]:
main.(*Cluster).Route(0x2, 0xc0051b4040, 0xc0039f2823)
	/Users/gene/go/src/github.com/tinode/chat/server/cluster.go:584 +0x109

S3

Couldn't find any matching Cluster.Route goroutines in this node's dump.

For the detailed goroutine stacks, refer to the server-side goroutine stack dumps attached below.

Actual behaviour

No user is able to perform operations such as sub, pub, etc. from any client. There is no response (ctrl messages) from any server in the cluster.

Server-side GoRoutine Stacks

tinode-3-gr.txt
tinode-2-gr.txt
tinode-1-gr.txt

@or-else
Contributor

or-else commented Aug 31, 2022

@aforge I guess we can make some tweaks, such as detecting congestion at n.endpoint.Call(proc, req, resp) and at globals.hub.routeSrv <- msg.SrvMsg; then, instead of fracturing the cluster, we can, say, start rejecting client requests. But ultimately someone will be able to overwhelm the cluster with enough load no matter what we do. What's the right way of handling it? Ideally we should handle as much traffic as we can and drop the rest. If not, what are the other options besides panic?
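
One possible shape of the congestion check at the channel send, sketched as a non-blocking select on a stand-in for the hub queue; the types, names and error value here are illustrative only, not the fix that was merged:

package main

import (
	"errors"
	"fmt"
)

// Stand-ins for globals.hub.routeSrv and its message type.
type srvMsg struct{ topic string }

var routeSrv = make(chan *srvMsg, 16)

var errCongested = errors.New("cluster: hub queue full, rejecting message")

// trySend attempts the send without blocking: if the buffer is full it fails
// fast, so the caller can reject or drop instead of parking forever in
// "chan send" the way Cluster.Route does in the stacks above.
func trySend(m *srvMsg) error {
	select {
	case routeSrv <- m:
		return nil
	default:
		return errCongested
	}
}

func main() {
	// Fill the buffer; the next send is rejected instead of blocking.
	for i := 0; i < cap(routeSrv); i++ {
		_ = trySend(&srvMsg{topic: "grpTest"})
	}
	if err := trySend(&srvMsg{topic: "grpTest"}); err != nil {
		fmt.Println(err) // e.g. reject the client request that produced it
	}
}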

@or-else
Contributor

or-else commented Aug 31, 2022

Here

if n.endpoint, err = rpc.Dial("tcp", n.address); err == nil {

instead of rpc.Dial("tcp", n.address) we can use something like

  conn, err := net.DialTimeout("tcp", n.address, time.Second)
  if err != nil {
    // handle dialling error
  }
  conn.SetDeadline(time.Now().Add(time.Second))
  n.endpoint = rpc.NewClient(conn)

But doing just that will lead to cluster de-synchronization.
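
A related note on the snippet above: net.Conn.SetDeadline takes an absolute time.Time, so a deadline set once right after dialling expires a second later and is not refreshed for subsequent calls; a timeout would have to be applied per call. A sketch of one way to bound an individual call using rpc.Client.Go (callWithTimeout and errCallTimeout are made-up names, not Tinode or net/rpc APIs):

package main

import (
	"errors"
	"net/rpc"
	"time"
)

var errCallTimeout = errors.New("cluster: RPC call timed out")

// callWithTimeout bounds a single net/rpc call. Client.Go is the asynchronous
// form of Call and signals completion on Done, so the caller can give up on a
// stuck call instead of parking in "chan receive" like the hub goroutines above.
func callWithTimeout(client *rpc.Client, proc string, req, resp interface{}, timeout time.Duration) error {
	call := client.Go(proc, req, resp, make(chan *rpc.Call, 1))
	select {
	case done := <-call.Done:
		return done.Error
	case <-time.After(timeout):
		// The underlying call may still complete later and nothing closes the
		// connection here, which is part of why timeouts alone can leave the
		// cluster de-synchronized, as noted above.
		return errCallTimeout
	}
}

func main() {
	// Usage from ClusterNode.call would look roughly like:
	//   err := callWithTimeout(n.endpoint, "Cluster.Route", req, resp, 2*time.Second)
	_ = callWithTimeout // no live RPC server in this sketch
}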

@or-else
Contributor

or-else commented Aug 31, 2022

I think we need to reject client requests at the proxy topic when the proxy topic is unable to communicate with the master fast enough.
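
A sketch of what "fast enough" could look like on the proxy side, using a bounded wait rather than the fail-fast variant above; the channel, function and error names are hypothetical and this is not the merged fix:

package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical forwarding queue from a proxy topic to its master on another node.
var toMaster = make(chan string, 16)

var errMasterSlow = errors.New("topic proxy: master not keeping up, rejecting request")

// forwardToMaster waits a bounded time for room in the queue and then gives
// up, so the topic goroutine never blocks indefinitely and the caller can
// answer the client with an error ctrl message instead.
func forwardToMaster(req string, wait time.Duration) error {
	select {
	case toMaster <- req:
		return nil
	case <-time.After(wait):
		return errMasterSlow
	}
}

func main() {
	// With nobody draining toMaster, the 17th request is rejected after 50ms.
	for i := 0; i < cap(toMaster)+1; i++ {
		if err := forwardToMaster(fmt.Sprintf("sub #%d", i), 50*time.Millisecond); err != nil {
			fmt.Println(err)
		}
	}
}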

@or-else
Contributor

or-else commented Aug 31, 2022

Actually, rejecting requests at the proxy topic looks pretty straightforward.

or-else added a commit that referenced this issue Sep 5, 2022
or-else added a commit that referenced this issue Sep 6, 2022
Remove deadlock on server overload, #786
@or-else
Contributor

or-else commented Sep 6, 2022

Fix merged, please verify. Thanks.

or-else closed this as completed Oct 25, 2022