lua, CRASH ERROR terminated with reason: timeout #341

Closed · funwun opened this issue Apr 2, 2017 · 32 comments

funwun commented Apr 2, 2017

Environment

  • VerneMQ Version: erlio/docker-vernemq:1.0.0

I use the Lua plugin and pub/sub every 10 ms, and I always get the CRASH ERROR below.
VerneMQ and the HTTP service that the Lua script calls are on the same Ethernet network.
The HTTP service always returns 200.

function request(url, params)
    local response = http.post('mqtt', endpoint .. url, json.encode(params), header)
    local ret = false
    if response then
      local body = http.body(response.ref) -- if I remove this, I get the ERROR after 50 requests; if I keep it, I get the ERROR after > 150 requests
      if response.status == 200 then
        ret = true
      end
    end
    return ret
end

function my_auth_on_register(reg)
  return request('/auth_on_register', reg)
end

function my_auth_on_publish(pub)
  return request('/auth_on_publish', pub)
end

function my_auth_on_subscribe(sub)
  return request('/auth_on_subscribe', sub)
end

hooks = {
    auth_on_register = my_auth_on_register,
    auth_on_publish = my_auth_on_publish,
    auth_on_subscribe = my_auth_on_subscribe
}

2017-04-02 08:06:32.948 [error] <0.561.0> Ranch listener {{172,17,0,17},1883} terminated with reason: {timeout,{gen_server,call,[<0.381.0>,{call_function,auth_on_register,[{addr,<<"192.168.5.104">>},{port,61806},{mountpoint,<<>>},{client_id,<<"mqttjs_94a1dfdf">>},{username,<<"aaaaa">>},{password,<<"bbbbb">>},{clean_session,true}]}]}}
2017-04-02 08:06:38.951 [error] <0.563.0> CRASH REPORT Process <0.563.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.381.0>,{call_function,auth_on_register,[{addr,<<"192.168.5.104">>},{port,61835},{mountpoint,<<>>},{client_id,<<"mqttjs_94a1dfdf">>},{username,<<"aaaaa">>},{password,<<"bbbbb">>},{clean_session,true}]}]}} in gen_server:call/2 line 204

funwun commented Apr 2, 2017

If I return false in my_auth_on_publish, the connection is disconnected every time, but returning false in my_auth_on_subscribe is fine.

function my_auth_on_publish(pub)
  return false
end

Maybe this is the key to the ERROR.

2017-04-02 13:36:28.448 [warning] <0.937.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:29.461 [debug] <0.944.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:29.466 [error] <0.941.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:29.466 [warning] <0.941.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:30.479 [debug] <0.948.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:31.407 [error] <0.945.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:31.407 [warning] <0.945.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:32.430 [debug] <0.953.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:32.437 [error] <0.950.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:32.437 [warning] <0.950.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:33.452 [debug] <0.957.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:33.458 [error] <0.954.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error
2017-04-02 13:36:33.458 [warning] <0.954.0>@vmq_ranch:teardown:127 session stopped abnormally due to 'publish_not_authorized_3_1_1'
2017-04-02 13:36:34.474 [debug] <0.961.0>@vmq_queue:state_change:898 transition from offline --> online because of add_session
2017-04-02 13:36:34.480 [error] <0.958.0>@vmq_mqtt_fsm:auth_on_publish:659 can't auth publish [<<"aaaa">>,{[],<<"mqttjs_db774851">>},0,[<<"f2">>,<<"d">>],<<"{test: 2}">>,false] due to error

dergraf commented Apr 2, 2017

Your comment above seems correct. Returning false in an auth_on_publish hook should close the connection, as the client isn't authorized to publish. In the case of returning false in an auth_on_subscribe hook, we don't close the connection but should return the proper SUBACK packet.
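
For reference, a minimal sketch of both cases, reusing the hook shape from the script quoted above (nothing beyond the boolean return value described here is assumed):

-- Denying a publish: the broker closes the client connection,
-- since the client isn't authorized to publish.
function my_auth_on_publish(pub)
  return false
end

-- Denying a subscribe: the connection stays open and the client
-- is informed via the failure code in the SUBACK packet.
function my_auth_on_subscribe(sub)
  return false
end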

Regarding your first issue, I honestly don't know, as the log you're showing only describes a timeout that occurred, probably while reaching out to your web server. Can you add some Lua print statements around your HTTP request and verify that your web server is responding fast enough?
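
Something along these lines would do (a sketch of the request function quoted above; it assumes os.clock is exposed to the script sandbox, which may need checking):

function request(url, params)
    local started = os.clock() -- assumption: os.clock may not be available in the plugin sandbox
    local response = http.post('mqtt', endpoint .. url, json.encode(params), header)
    print('http.post took ' .. (os.clock() - started) .. ' s') -- should stay well below the hook timeout
    local ret = false
    if response then
      local body = http.body(response.ref)
      if response.status == 200 then
        ret = true
      end
    end
    return ret
end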

funwun commented Apr 3, 2017

1) The close-connection question:
I don't think this is correct. I use Mosquitto with an ACL; if the ACL check for a publish returns false, the connection is not closed and the message is simply not delivered. So why must we close the connection?

Is this the same as these issues?
erlio/vmq_webhooks#3
#312

I use mqtt.js, and with this config everything is OK:

protocolVersion: 3

2) The timeout question:

body = http.body(response.ref)

If I remove this line, I get the ERROR after exactly 50 requests (I tested this 20+ times; it is always 50), so I don't think this is normal.

I don't want to check the body of the HTTP response, because I only check the HTTP status. Why must I add this line?

The HTTP server and VerneMQ are on the same host, and I just return 200 with an empty body every time, so I don't think this is my HTTP server's problem.
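
One possible reading, not confirmed in this thread: if the underlying HTTP client pools connections and only returns one to the pool once its body has been consumed, every request with an unread body would leak a pooled connection, and a default pool size of 50 would match the fixed 50-request failure point. Under that assumption, a defensive sketch is to drain the body even when only the status matters:

function request(url, params)
    local response = http.post('mqtt', endpoint .. url, json.encode(params), header)
    if response then
      http.body(response.ref) -- drain and discard: under the assumption above, this releases the pooled connection
      return response.status == 200
    end
    return false
end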

funwun commented Apr 6, 2017

Has anybody had a chance to review the 2nd question (the timeout question)?

Thanks

dergraf commented Apr 7, 2017

We haven't looked at it yet, thanks for your patience.

@francescoPadovani81

Hello,
I'm facing the exact same issue.
I'm performing load tests against VerneMQ by running 4000 clients (which start to connect at a rate of 10 per second) over TLS. Around the 500th client, VerneMQ crashes with the following error:

[error] <0.5047.1> CRASH REPORT Process <0.5047.1> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.352.0>,{call_function,auth_on_register,[{addr,<<"34.249.106.240">>},{port,47266},{mountpoint,<<>>},{client_id,<<"38596">>},{username,<<"38596">>},{password,<<"547690844">>},{clean_session,true}]}]}} in gen_server:call/2 line 204

If I slow down the connection rate (e.g. 5 clients per second), VerneMQ holds up and I can reach steady state with all 4000 clients connected and subscribed.
Slow connections (with possible timeout errors and retries on the client side) would be OK, but a total crash of VerneMQ is not good :-(
Is there some advanced parameter I can configure?

Thanks

ioolkos commented Apr 28, 2017

Hi @francescoPadovani81, thanks for reporting... a couple of questions:

  • How do you see that VerneMQ crashed? (Can you verify with vernemq ping or the ps command?)
  • What exactly do you do in the auth_on_register hook? (Do you use a plugin?)

@francescoPadovani81

Hi @ioolkos,

  • I can see that VerneMQ crashed because it stops accepting any connections. The process is up (and listening on the acceptor port), but no client can connect to it, and for each connection attempt the error above is logged to console.log and error.log (at debug level).
  • Yes: we're using the "vmq_diversity.auth_redis" plugin.

Thanks in advance for your help.

Francesco

dergraf commented May 2, 2017

@francescoPadovani81 The default timeout is set to 5000 ms, which means that for whatever reason it took too long to process the Lua script. In this case the Lua script is very simple: it just accesses the database and returns the result. So either it took more than ~5000 ms to access the configured database, or too many requests piled up in the vmq_diversity_script mailbox.

This of course isn't a solution for your problem, but might help you understand the problem you're experiencing.

@francescoPadovani81

Thanks so much @dergraf for your quick reply.
Your supposition sounds right: probably, with too many requests per second, it took more than ~5000 ms to access Redis. Is there a way to increase the default timeout? Or any other way to better manage our use case (so many connections at the same time with the Redis plugin)?

Francesco

dergraf commented May 2, 2017

This timeout is currently fixed; we'd have to make it configurable.

However, no timeout isn't an option either, as that would just let even more messages accumulate in the mailbox of vmq_diversity_script. This would only simulate broker behaviour under connection-spike conditions, and the connect latencies won't tell you much at that point; it could be OK for benchmarking, but not for production.
If you expect such load in production, I'd recommend beefing up the Redis instances if Redis is the bottleneck, scaling up VerneMQ if the CPU is maxed out, and/or throttling connection setup at the edges.

Hope this helps.

@francescoPadovani81

I've just rerun the load test while monitoring the CPU and redis-server: unfortunately there seems to be no Redis resource problem (Redis CPU usage stays under 2%). The CPU of the whole system is also fine (never more than 50%), and so are RAM and I/O (both Redis and VerneMQ are great from this point of view).
I've also optimized our Redis instance as best I can; for clarity, it runs on the same server as VerneMQ. On Redis, our max latency so far is 67 microseconds, which I think is more than acceptable.
I'm no longer so sure Redis is the bottleneck :-(
I'll try using ACL files to see if anything changes.

@diego-gabriele

I have run into the same problem in my VerneMQ + Redis installation.
I see VerneMQ stop accepting any connections after about 4000 subscriptions. I also TRIED TO REPLICATE the Redis instance, but the result is EXACTLY the same:

2017-05-05 14:03:14.293 [error] <0.27867.1> CRASH REPORT Process <0.27867.1> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.392.0>,{call_function,auth_on_register,[{addr,<<"54.72.215.177">>},{port,43641},{mountpoint,<<>>},{client_id,<<"38332s">>},{username,<<"38332">>},{password,<<"246987977">>},{clean_session,true}]}]}} in gen_server:call/2 line 204

This problem does not show up when using ACLs, so I guess there is something related to Redis. Any suggestion would be appreciated.

ioolkos commented May 5, 2017

Hm, vmq_diversity_script is a gen_server, but how could it possibly be overloaded by 4000 subscriptions?
@dergraf could it be the Lua interpreter itself?

dergraf commented May 5, 2017

Yep, that's a good question. The single obvious explanation is that the access to Redis took too long.

diego-gabriele commented May 5, 2017

Is it worth using MongoDB as a replacement for Redis? Or maybe working on the Redis plugin to make it faster?

dergraf commented May 5, 2017

@diego-gabriele it depends. But if you have a chance to run the same benchmark using Mongo, we'd see whether we hit the exact same bottleneck. ;)

ioolkos commented May 5, 2017

We'd have to load-test this somehow anyway. Access to Redis taking too long... that just sounds impossible :) ... like a programmer not loving pizza or something.

@francescoPadovani81

I totally agree with ioolkos. I'm not so sure the problem is on the Redis side: when my load test crashed, Redis wasn't stressed; it was consuming under 2% of CPU. And the "redis-cli --intrinsic-latency 100" command showed a max latency of about 60 microseconds, which I think is more than acceptable. Maybe the problem is a limit on a different layer. I don't know...
Anyway, using passwd and ACL files, connecting (and thus authenticating) 16000 clients (8000 subscribers and 8000 publishers) works like a charm. The system only starts to be stressed when the publishers begin sending messages (2 msg per minute each), i.e. once I enter the authorization phase... but that's a topic for another issue :-)

dergraf commented May 13, 2017

#381 should fix the crashes.

@diego-gabriele

I connected VerneMQ to MongoDB instead of Redis, and the result in terms of subscribed users is EXACTLY the same. This means the bottleneck is somewhere inside the VerneMQ code.

@larshesel

We believe we've found the bottleneck and are working on a fix.

@diego-gabriele

Any idea when this fix will be released?

@larshesel

I hope soon (a week, maybe two), but I can't promise anything.

@diego-gabriele

Any update?

@larshesel

As soon as it's merged we'll let you know.

dergraf commented Jun 15, 2017

We merged a new feature that enables load balancing of Lua hooks. In the current master, the database Lua wrappers have this feature enabled.
If you want to use this feature in your own Lua script, you can set the global variable num_states = 10 in your script, which tells vmq_diversity to start 10 Lua states and load balance the hook calls among them. See the sketch below.
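
A minimal sketch of what that might look like (only num_states comes from the description above; the hook body here is illustrative):

num_states = 10 -- asks vmq_diversity to start 10 Lua states and load balance hook calls among them

function my_auth_on_register(reg)
  -- the hook body itself is unchanged; load balancing is controlled only by num_states
  return true
end

hooks = {
    auth_on_register = my_auth_on_register
}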

We'll soon release 1.1.0, which contains this improvement.
Could you give it a test run?

@diego-gabriele

Of course I will.

@larshesel

The fixes have been released as part of the VerneMQ 1.1.0 release, so I'm closing this. We'd of course appreciate any feedback you might have.

funwun commented Jun 23, 2017 via email

@larshesel

It's available on the download page: http://vernemq.com/downloads/

funwun commented Jun 26, 2017

wow, GOOOOOOOD
