lua, CRASH ERROR terminated with reason: timeout #341

Closed · funwun opened this issue Apr 2, 2017 · 32 comments

funwun commented Apr 2, 2017

Environment

  • VerneMQ Version: erlio/docker-vernemq:1.0.0

I use the Lua plugin and pub/sub every 10 ms, and I always get the CRASH ERROR below.
VerneMQ and the HTTP service that the Lua script calls are on the same Ethernet network.
The HTTP service always returns 200.

function request(url, params)
    local response = http.post('mqtt', endpoint .. url, json.encode(params), header)
    local ret = false
    if response then
      local body = http.body(response.ref) -- if I remove this, I get the ERROR after 50 requests; if I keep it, I get the ERROR after > 150 requests
      if response.status == 200 then
        ret = true
      end
    end
    return ret
end

function my_auth_on_register(reg)
  return request('/auth_on_register', reg)
end

function my_auth_on_publish(pub)
  return request('/auth_on_publish', pub)
end

function my_auth_on_subscribe(sub)
  return request('/auth_on_subscribe', sub)
end

hooks = {
    auth_on_register = my_auth_on_register,
    auth_on_publish = my_auth_on_publish,
    auth_on_subscribe = my_auth_on_subscribe
}

2017-04-02 08:06:32.948 [error] <0.561.0> Ranch listener {{172,17,0,17},1883} terminated with reason: {timeout,{gen_server,call,[<0.381.0>,{call_function,auth_on_register,[{addr,<<"192.168.5.104">>},{port,61806},{mountpoint,<<>>},{client_id,<<"mqttjs_94a1dfdf">>},{username,<<"aaaaa">>},{password,<<"bbbbb">>},{clean_session,true}]}]}}
2017-04-02 08:06:38.951 [error] <0.563.0> CRASH REPORT Process <0.563.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.381.0>,{call_function,auth_on_register,[{addr,<<"192.168.5.104">>},{port,61835},{mountpoint,<<>>},{client_id,<<"mqttjs_94a1dfdf">>},{username,<<"aaaaa">>},{password,<<"bbbbb">>},{clean_session,true}]}]}} in gen_server:call/2 line 204

funwun commented Apr 2, 2017

If I return false in my_auth_on_publish, the connection is disconnected every time, but returning false in my_auth_on_subscribe is fine.

function my_auth_on_publish(pub)
  return false
end

Maybe this is the key to the ERROR.

2017-04-02 13:36:28.448 [warning] <0.937.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:29.461 [debug] <0.944.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:29.466 [error] <0.941.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:29.466 [warning] <0.941.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:30.479 [debug] <0.948.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:31.407 [error] <0.945.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:31.407 [warning] <0.945.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:32.430 [debug] <0.953.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:32.437 [error] <0.950.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:32.437 [warning] <0.950.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:33.452 [debug] <0.957.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:33.458 [error] <0.954.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:33.458 [warning] <0.954.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:34.474 [debug] <0.961.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:34.480 [error] <0.958.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error

dergraf commented Apr 2, 2017

Your comment above seems correct. Returning false in an auth_on_publish hook should close the connection, as the client isn't authorized to publish. In the case of returning false in an auth_on_subscribe hook, we don't close the connection but should return the proper SUBACK packet.
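
For reference, a minimal sketch of both cases, reusing the hook shape from the script quoted above (nothing beyond the boolean return value described here is assumed):

-- Denying a publish: the broker closes the client connection,
-- since the client isn't authorized to publish.
function my_auth_on_publish(pub)
  return false
end

-- Denying a subscribe: the connection stays open and the client
-- is informed via the failure code in the SUBACK packet.
function my_auth_on_subscribe(sub)
  return false
end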

Regarding your first issue, I honestly don't know, as the log you're showing only describes a timeout that occurred, probably while reaching out to your web server. Can you add some Lua print statements around your HTTP request and verify that your web server is responding fast enough?
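
Something along these lines would do (a sketch of the request function quoted above; it assumes os.clock is exposed to the script sandbox, which may need checking):

function request(url, params)
    local started = os.clock() -- assumption: os.clock may not be available in the plugin sandbox
    local response = http.post('mqtt', endpoint .. url, json.encode(params), header)
    print('http.post took ' .. (os.clock() - started) .. ' s') -- should stay well below the hook timeout
    local ret = false
    if response then
      local body = http.body(response.ref)
      if response.status == 200 then
        ret = true
      end
    end
    return ret
end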

funwun commented Apr 3, 2017

1) The close-connection question:
I don't think this is correct. I use Mosquitto with an ACL; if the ACL check for a publish returns false, the connection is not closed and the message is simply not delivered. So why must we close the connection?

Is this the same as these issues?
erlio/vmq_webhooks#3
#312

I use mqtt.js, and with this config everything is OK:

protocolVersion: 3

2) The timeout question:

body = http.body(response.ref)

If I remove this line, I get the ERROR after exactly 50 requests (I tested this 20+ times; it is always 50), so I don't think this is normal.

I don't want to check the body of the HTTP response, because I only check the HTTP status. Why must I add this line?

The HTTP server and VerneMQ are on the same host, and I just return 200 with an empty body every time, so I don't think this is my HTTP server's problem.
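
One possible reading, not confirmed in this thread: if the underlying HTTP client pools connections and only returns one to the pool once its body has been consumed, every request with an unread body would leak a pooled connection, and a default pool size of 50 would match the fixed 50-request failure point. Under that assumption, a defensive sketch is to drain the body even when only the status matters:

function request(url, params)
    local response = http.post('mqtt', endpoint .. url, json.encode(params), header)
    if response then
      http.body(response.ref) -- drain and discard: under the assumption above, this releases the pooled connection
      return response.status == 200
    end
    return false
end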

funwun commented Apr 6, 2017

Has anybody had a chance to review the 2nd question (the timeout question)?

Thanks

dergraf commented Apr 7, 2017

We haven't looked at it yet, thanks for your patience.

@francescoPadovani81

Hello,
I'm facing the exact same issue.
I'm performing load tests against VerneMQ by running 4000 clients (which start to connect at a rate of 10 per second) over TLS. Around the 500th client, VerneMQ crashes with the following error:

[error] <0.5047.1> CRASH REPORT Process <0.5047.1> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.352.0>,{call_function,auth_on_register,[{addr,<<"34.249.106.240">>},{port,47266},{mountpoint,<<>>},{client_id,<<"38596">>},{username,<<"38596">>},{password,<<"547690844">>},{clean_session,true}]}]}} in gen_server:call/2 line 204

If I slow down the connection rate (e.g. 5 clients per second), VerneMQ holds up and I can reach steady state with all 4000 clients connected and subscribed.
Slow connections (with possible timeout errors and retries on the client side) would be OK, but a total crash of VerneMQ is not good :-(
Is there some advanced parameter I can configure?

Thanks

ioolkos commented Apr 28, 2017

Hi @francescoPadovani81, thanks for reporting... a couple of questions:

  • How do you see that VerneMQ crashed? (Can you verify with vernemq ping or the ps command?)
  • What exactly do you do in the auth_on_register hook? (Do you use a plugin?)

@francescoPadovani81

Hi @ioolkos,

  • I can see that VerneMQ crashed because it stops accepting any connections. The process is up (and listening on the acceptor port), but no client can connect to it, and for each connection attempt the error above is logged to console.log and error.log (at debug level).
  • Yes: we're using the "vmq_diversity.auth_redis" plugin.

Thanks in advance for your help.

Francesco

dergraf commented May 2, 2017

@francescoPadovani81 The default timeout is set to 5000 ms, which means that for whatever reason it took too long to process the Lua script. In this case the Lua script is very simple: it just accesses the database and returns the result. So either it took more than ~5000 ms to access the configured database, or too many requests piled up in the vmq_diversity_script mailbox.

This of course isn't a solution for your problem, but might help you understand the problem you're experiencing.

@francescoPadovani81

Thanks so much @dergraf for your quick reply.
Your supposition sounds right: probably, with too many requests per second, it took more than ~5000 ms to access Redis. Is there a way to increase the default timeout? Or any other way to better manage our use case (so many connections at the same time with the Redis plugin)?

Francesco

dergraf commented May 2, 2017

This timeout is currently fixed; we'd have to make it configurable.

However, no timeout isn't an option either, as that would just let even more messages accumulate in the mailbox of vmq_diversity_script. This would only simulate broker behaviour under connection-spike conditions, and the connect latencies won't tell you much at that point; it could be OK for benchmarking, but not for production.
If you expect such load in production, I'd recommend beefing up the Redis instances if Redis is the bottleneck, scaling up VerneMQ if the CPU is maxed out, and/or throttling connection setup at the edges.

Hope this helps.

@francescoPadovani81

I've just rerun the load test while monitoring the CPU and redis-server: unfortunately there seems to be no Redis resource problem (Redis CPU usage stays under 2%). The CPU of the whole system is also fine (never more than 50%), and so are RAM and I/O (both Redis and VerneMQ are great from this point of view).
I've also optimized our Redis instance as best I can; for clarity, it runs on the same server as VerneMQ. On Redis, our max latency so far is 67 microseconds, which I think is more than acceptable.
I'm no longer so sure Redis is the bottleneck :-(
I'll try using ACL files to see if anything changes.

@diego-gabriele

I have run into the same problem in my VerneMQ + Redis installation.
I see VerneMQ stop accepting any connections after about 4000 subscriptions. I also TRIED TO REPLICATE the Redis instance, but the result is EXACTLY the same:

2017-05-05 14:03:14.293 [error] <0.27867.1> CRASH REPORT Process <0.27867.1> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.392.0>,{call_function,auth_on_register,[{addr,<<"54.72.215.177">>},{port,43641},{mountpoint,<<>>},{client_id,<<"38332s">>},{username,<<"38332">>},{password,<<"246987977">>},{clean_session,true}]}]}} in gen_server:call/2 line 204

This problem does not show up when using ACLs, so I guess there is something related to Redis. Any suggestion would be appreciated.

ioolkos commented May 5, 2017

Hm, vmq_diversity_script is a gen_server, but how could it possibly be overloaded by 4000 subscriptions?
@dergraf could it be the Lua interpreter itself?

dergraf commented May 5, 2017

Yep, that's a good question. The single obvious explanation is that the access to Redis took too long.

diego-gabriele commented May 5, 2017

Is it worth using MongoDB as a replacement for Redis? Or maybe working on the Redis plugin to make it faster?

dergraf commented May 5, 2017

@diego-gabriele it depends. But if you have a chance to run the same benchmark using Mongo, we'd see whether we hit the exact same bottleneck. ;)

ioolkos commented May 5, 2017

We'd have to load-test this somehow anyway. Access to Redis taking too long... that just sounds impossible :) ... like a programmer not loving pizza or something.

@francescoPadovani81

I totally agree with ioolkos. I'm not so sure the problem is on the Redis side: when my load test crashed, Redis wasn't stressed; it was consuming under 2% of CPU. And the "redis-cli --intrinsic-latency 100" command showed a max latency of about 60 microseconds, which I think is more than acceptable. Maybe the problem is a limit on a different layer. I don't know...
Anyway, using passwd and ACL files, connecting (and thus authenticating) 16000 clients (8000 subscribers and 8000 publishers) works like a charm. The system only starts to be stressed when the publishers begin sending messages (2 msg per minute each), i.e. once I enter the authorization phase... but that's a topic for another issue :-)

dergraf commented May 13, 2017

#381 should fix the crashes.

@diego-gabriele

I connected VerneMQ to MongoDB instead of Redis, and the result in terms of subscribed users is EXACTLY the same. This means the bottleneck is somewhere inside the VerneMQ code.

@larshesel

We believe we've found the bottleneck and are working on a fix.

@diego-gabriele

Any idea when this fix will be released?

@larshesel

I hope soon (a week, maybe two), but I can't promise anything.

@diego-gabriele

Any update?

@larshesel

As soon as it's merged we'll let you know.

dergraf commented Jun 15, 2017

We merged a new feature that enables load balancing of Lua hooks. In the current master, the database Lua wrappers have this feature enabled.
If you want to use this feature in your own Lua script, you can set the global variable num_states = 10 in your script, which tells vmq_diversity to start 10 Lua states and load balance the hook calls among them. See the sketch below.
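
A minimal sketch of what that might look like (only num_states comes from the description above; the hook body here is illustrative):

num_states = 10 -- asks vmq_diversity to start 10 Lua states and load balance hook calls among them

function my_auth_on_register(reg)
  -- the hook body itself is unchanged; load balancing is controlled only by num_states
  return true
end

hooks = {
    auth_on_register = my_auth_on_register
}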

We'll soon release 1.1.0, which contains this improvement.
Could you give it a test run?

@diego-gabriele

Of course I will.

@larshesel

The fixes have been released as part of the VerneMQ 1.1.0 release, so I'm closing this. We'd of course appreciate any feedback you might have.

funwun commented Jun 23, 2017 via email

@larshesel

It's available on the download page: http://vernemq.com/downloads/

funwun commented Jun 26, 2017

wow, GOOOOOOOD
