
gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} #11

Closed
KrlosWd opened this issue May 24, 2017 · 3 comments

Comments

@KrlosWd

KrlosWd commented May 24, 2017

Hello,

I'm trying to do some benchmarking with multiple servers running mzbench + vmq_mzbench. However, whenever I reach over 35k publishers (distributed across 14 nodes) I start getting some timeout-related errors, as shown below:

12:42:29.004 [error] <0.196.0> gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:call/2 line 204
12:42:31.532 [error] <0.196.0> CRASH REPORT Process mzb_time with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:terminate/7 line 826
12:42:31.540 [error] <0.132.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at <0.196.0> exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context child_terminated
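From what I understand, this exit reason is just the standard OTP shape for a gen_server:call/2 that did not get a reply within the default 5-second timeout. A rough illustration (not MZBench source; the function names are made up):

%% gen_server:call/2 waits at most 5000 ms for a reply; if the server is too
%% busy to answer in time, the caller exits with
%% {timeout,{gen_server,call,[ServerRef,Request]}} -- the shape in the log above.
get_director_default() ->
    gen_server:call(mzb_interconnect, get_director).        %% implicit 5 s timeout

%% call/3 takes an explicit timeout, e.g. 30 s, if the server is expected to be slow.
get_director_patient() ->
    gen_server:call(mzb_interconnect, get_director, 30000).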

Does anyone have any idea what could be causing this?

This is the scenario I'm trying to run:

#!benchDL

make_install(git = "https://github.com/erlio/vmq_mzbench.git",
             branch = "master")

defaults("topic" = "topic1", 
            "subtopic" = "topic1",
            "sub_host" = "192.168.144.11", 
            "pub_host" = "192.168.144.11",
            "pubs"     = 1000, 
            "subname"  = "subscriber1",
            "poolname" = "pool1")


pool(size = 1,
     worker_type = mqtt_worker):

            connect([t(host, var("sub_host")),
                    t(port,1883),
                    t(client,var("subname")),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,4)
                    ])

            wait(1 sec)
            subscribe(var("subtopic"), 0)


pool(size = numvar("pubs"),
     worker_type = mqtt_worker,
     worker_start = linear(40 rps)):

            connect([t(host, var("pub_host")),
                    t(port,1883),
                    t(client,fixed_client_id(var("poolname"), worker_id())),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,4)
                    ])

            wait(15 sec)
            set_signal("connect1",1)
            wait_signal("connect1", numvar("pubs"))
            loop(time = 10 min, rate = 1 rps):
                publish(var("topic"), random_binary(150), 0)
            disconnect()

This scenario is executed independently by each node (14 nodes in total); each node has its own topic, they all publish/subscribe to the same server, and the error occurs in at least one node after reaching 35k publishers, i.e. 2.5k publishers per node.

Thanks in advance for your help,
Best,

Carlos

@ioolkos
Contributor

ioolkos commented May 24, 2017

Hi @KrlosWd, thanks for asking... I have seen those kinds of errors. I can't give you a single reason why.

A couple of observations:

  • Why use 14 nodes for only 35k publishers? You could use way fewer nodes.
  • You can go faster than 40 rps for worker setup if you want (and have configured enough acceptors).
  • 35k msg/s into a single queue will likely just block that queue. If your use case is massive fan-in to a single consumer, you'd most likely have to use additional strategies; see the sketch below.
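For the ramp-up point, the only thing that needs to change is the worker_start rate on the publisher pool. A rough sketch based on your scenario (200 rps is just a placeholder, and I've dropped the signal synchronisation for brevity; whether the broker keeps up depends entirely on its acceptor setup):

pool(size = numvar("pubs"),
     worker_type = mqtt_worker,
     worker_start = linear(200 rps)):   # was linear(40 rps)

        connect([t(host, var("pub_host")),
                t(port, 1883),
                t(client, fixed_client_id(var("poolname"), worker_id())),
                t(clean_session, true),
                t(keepalive_interval, 60),
                t(proto_version, 4), t(reconnect_timeout, 4)
                ])

        loop(time = 10 min, rate = 1 rps):
            publish(var("topic"), random_binary(150), 0)

        disconnect()

For the fan-in point, giving each pool (or each node) its own topic, so that no single queue sees the full 35k msg/s, would be one such strategy.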

@KrlosWd
Author

KrlosWd commented May 24, 2017

Hi @ioolkos, thanks for your quick answer.
As for your observations, it is worth mentioning that I'm benchmarking the open-source MQTT broker Mosquitto, since I implemented some changes in it for experiments I'm conducting for a research project; the problem with Mosquitto is that it is single-threaded. So, with that in mind:

  • Why use 14 nodes for only 35k publishers? You could use way fewer nodes.
    The VMs I have access to have limited CPU (actually a single core each), so I found out that having more than 3k publishers per node causes the following error:
14:15:06.228 [error] emulator Error in process <0.10270.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}
14:15:51.412 [error] emulator Error in process <0.10277.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}
14:16:18.462 [error] emulator Error in process <0.10280.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}
14:17:09.136 [error] emulator Error in process <0.10283.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}
  • You can go faster than 40 rps for worker setup if you want (and have configured enough acceptors).

Since Mosquitto is single-threaded, the number of connections it can accept in one second is pretty limited. With 40 rps per node I have a total of 560 rps; I think I could go higher than that, but I'm using this number as a safe rate in the meantime.

  • 35k msg/s into a single queue will likely just block that queue. If your use case is massive fan-in to a single consumer, you'd most likely have to use additional strategies.

Each node uses a different topic, so I actually have 14 queues, but I'm open to suggestions :D

@ioolkos
Contributor

ioolkos commented May 25, 2017

Thanks for the details, @KrlosWd!
Keep us posted on your testing progress and any results with Mosquitto (which of course is incredibly powerful on 1 core).

@ioolkos ioolkos closed this as completed Aug 8, 2020