PubSub client can get overloaded #28

Closed
pmembrey opened this Issue May 31, 2012 · 8 comments

Comments

@pmembrey
Contributor

pmembrey commented May 31, 2012

Hi guys,

I've picked up an issue with the current pub/sub client. Basically, from what I can tell, the process handling the TCP connection waits for the controlling process (the one that acks receipt of a message) to respond before it processes the next message. I guess that's to provide a level of flow control.

In my case though, we received a large batch of messages in one go (around 7,000 or so, I think). This filled up the message queue of the process handling the TCP connection (it's a gen_server), which then continued to grow and grow until I killed it.

I'm not sure how to fix this apart from adding a queue internally to the gen_server so that it serves messages from that rather than keeping messages in the TCP queue...

In short, I'm receiving a burst of messages which fills up the message queue on the gen_server which never manages to recover...

Any ideas?

@knutin

Collaborator

knutin commented May 31, 2012

Hi,

The behaviour you are describing is very surprising to me, as the design choices were made specifically to prevent this exact problem.

eredis_sub_client, which handles the connection, does not wait for a reply from the controlling process before receiving more messages on the socket from Redis. It does wait for an ack before sending the next message to the controlling process, and in the meantime it queues messages from Redis internally. Pushing the queuing down to the TCP layer would force Redis to do the queuing, which can easily kill an instance by running out of memory. Instead, we do the queueing inside eredis, where the user can configure whether to drop messages or kill the client when a max queue size is reached (the defaults are drop and infinity).

The only way for the mailbox in eredis_sub_client to grow too big is if we were using {active, true} on the socket. Maybe the problem is the performance of the queue module when dealing with big queues: I use queue:len/1, which is linear; this could be changed to keep a manual counter. process_info/1 should reveal exactly what is going on; if there are messages in the mailbox, it will show them.

I would be curious to know exactly what happens, as this behaviour is exactly what we don't want.
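To make the intended flow concrete, here is a minimal sketch of the consumer side against the eredis_sub API. The host, port, and channel name are examples, and the start_link/6 argument order (ReconnectSleep, MaxQueueSize, QueueBehaviour at the end) is an assumption based on the version current at the time:

```erlang
%% Sketch: subscribe and ack each message so eredis_sub can send the next one.
%% The final two arguments are assumed to be MaxQueueSize and QueueBehaviour,
%% i.e. what to do with the internal queue when it hits the limit.
start() ->
    {ok, Sub} = eredis_sub:start_link("127.0.0.1", 6379, "", 100, infinity, drop),
    ok = eredis_sub:controlling_process(Sub),
    ok = eredis_sub:subscribe(Sub, [<<"my_channel">>]),
    loop(Sub).

loop(Sub) ->
    receive
        {subscribed, _Channel, Sub} ->
            eredis_sub:ack_message(Sub),
            loop(Sub);
        {message, _Channel, _Payload, Sub} ->
            %% Handle the payload here, then ack; eredis_sub queues further
            %% messages internally until it sees the ack.
            eredis_sub:ack_message(Sub),
            loop(Sub)
    end.
```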


@pmembrey

Contributor

pmembrey commented May 31, 2012

Hi,

Here's the process_info output. Maybe it's not the client itself but something else; I'm afraid I am far from an expert. I've removed the data in the tcp messages as it's confidential, but the rest is intact.

[{current_function,{eredis_parser,do_parse_multibulk,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,runnable},
 {message_queue_len,15142},
 {messages,[{tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,...},
            {tcp,...},
            {...}|...]},
 {links,[<0.103.0>,#Port<0.5638>]},
 {dictionary,[{'$ancestors',[testapp_redis,testapp_sup,
                             <0.52.0>]},
              {'$initial_call',{eredis_sub_client,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.51.0>},
 {total_heap_size,34727220},
 {heap_size,4108475},
 {stack_size,21},
 {reductions,269218575},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,10}]},
 {suspending,[]}]

Hope that makes sense...
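(For anyone else chasing a similar symptom: the two interesting fields above can be fetched on their own with plain erlang:process_info/2. The function below is just a convenience sketch; the name is mine.)

```erlang
%% Sketch: quick look at a process's mailbox depth and what it is running,
%% without dumping the full (possibly confidential) message list.
mailbox_check(Pid) ->
    {message_queue_len, Len} = erlang:process_info(Pid, message_queue_len),
    {current_function, MFA} = erlang:process_info(Pid, current_function),
    io:format("mailbox: ~p messages, in ~p~n", [Len, MFA]),
    Len.
```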


@knutin

Collaborator

knutin commented May 31, 2012

You're right, these are messages from a port filling up the mailbox. Just to rule out something silly, which version of eredis are you using?

I would like to reproduce it. I will try to dump a large number of messages to Redis. How is your consumer behaving? How fast can it ack each message?
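One simple way to produce such a burst is to publish in a tight loop from a regular eredis client (a sketch; the channel name and message count are arbitrary):

```erlang
%% Sketch: publish N messages to a channel as fast as possible,
%% to simulate the burst that overwhelms the subscriber.
burst(N) ->
    {ok, Client} = eredis:start_link(),
    [{ok, _} = eredis:q(Client, ["PUBLISH", "my_channel", integer_to_list(I)])
        || I <- lists:seq(1, N)],
    ok.
```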


@pmembrey

Contributor

pmembrey commented May 31, 2012

Hi,

It's the latest from the GitHub master branch (1.0.1 in the app file).

The receiver should be pretty fast. It pulls the message, spawns a process to handle it and then acks it. That's pretty much it.

In general I've seen it quite happy under load of thousands of messages per second. However it is possible that it might get a "burst" of say 10,000 messages or so. It looks like the sudden burst overwhelms the process to the point where it just doesn't stand a chance.

The app is always under a load of at least a thousand messages, and if it gets backed up a little, the continual load could easily stop it from recovering. gen_server does a selective receive, which makes this much worse: it scans the whole message queue, slowing things down even further while new messages are constantly being added...
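The selective-receive cost is easy to demonstrate in isolation (a self-contained sketch, not eredis code): receiving one specific message that sits behind a long mailbox forces a scan past everything in front of it.

```erlang
%% Sketch: a selective receive must scan past every non-matching message.
%% With large N, the receive below gets measurably slower.
selective_receive_demo(N) ->
    [self() ! {noise, I} || I <- lists:seq(1, N)],
    self() ! wanted,
    {Micros, ok} = timer:tc(fun() -> receive wanted -> ok end end),
    %% Note: the N {noise, _} messages are still in the mailbox afterwards.
    Micros.
```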


@knutin

Collaborator

knutin commented May 31, 2012

Ok. I just put together a small example and I'm able to reproduce the behaviour you are seeing. I will investigate further.



knutin added a commit that referenced this issue May 31, 2012

Fixed issue #28: The sub socket was incorrectly set to resulting in the socket sending more messages than wanted. At large volumes of messages, we get into a situation where we can no longer handle the messages fast enough. Thanks to pmembrey for reporting.
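For context, the standard socket-level flow control in Erlang is {active, once}: the VM delivers a single {tcp, Socket, Data} message, and the owning process re-arms the socket only when it is ready for more, so any backlog stays in the kernel's TCP buffers rather than the process mailbox. A generic sketch of that pattern, not the actual eredis code:

```erlang
%% Sketch: per-message flow control with {active, once}.
%% The socket is assumed to have been opened in {active, once} mode.
recv_loop(Socket) ->
    receive
        {tcp, Socket, Data} ->
            handle(Data),  %% hypothetical handler; process before re-arming
            ok = inet:setopts(Socket, [{active, once}]),
            recv_loop(Socket);
        {tcp_closed, Socket} ->
            ok
    end.

handle(_Data) ->
    ok.  %% placeholder
```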
@knutin

Collaborator

knutin commented May 31, 2012

I found a problem and was able to fix at least the issue I could reproduce. Let me know if you still experience this or run into any other problems. Thanks for reporting.


@knutin knutin closed this May 31, 2012

@pmembrey

Contributor

pmembrey commented May 31, 2012

Deployed the update. I'll know by this time tomorrow whether this resolves the issue I came across. I'll post back as soon as I know.

Thanks for the very fast fix!


@pmembrey

Contributor

pmembrey commented Jun 1, 2012

Hi,

I can confirm that I didn't see any problems today, even though I drastically increased the traffic (i.e. more than doubled it). Historically it would fail a couple of times a day; now, even with the additional pressure, it's solid as a rock.

Guess we can sign this one off :)

