PubSub client can get overloaded #28

Closed
pmembrey opened this Issue · 8 comments

2 participants

@pmembrey

Hi guys,

I've picked up an issue with the current pub/sub client. From what I can tell, the process handling the TCP connection waits for the controlling process (the one that acks receipt of a message) to respond before it processes the next message. I guess that's to provide a level of flow control.

In my case, though, we received a large batch of messages in one go (around 7,000 or so, I think). This filled up the message queue of the process handling the TCP connection (it's a gen_server), which then continued to grow and grow until I killed it.

I'm not sure how to fix this apart from adding a queue internally to the gen_server so that it serves messages from that rather than keeping messages in the TCP queue...

In short, I'm receiving a burst of messages which fills up the message queue on the gen_server which never manages to recover...
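The workaround suggested above, an internal queue that feeds the controller one message per ack, can be sketched as a small gen_server. This is a hypothetical illustration, not eredis code; the module name, the `{message, ...}` shape, and the `ack/1` API are all made up for the sketch:

```erlang
%% Hypothetical sketch: queue socket data internally and forward one
%% message to the controlling process at a time, advancing only on ack.
-module(flow_sketch).
-behaviour(gen_server).
-export([start_link/1, ack/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {controller, queue = queue:new(), awaiting_ack = false}).

start_link(Controller) ->
    gen_server:start_link(?MODULE, Controller, []).

ack(Pid) -> gen_server:cast(Pid, ack).

init(Controller) -> {ok, #state{controller = Controller}}.

%% socket data goes into our own queue, never piling up unprocessed
handle_info({tcp, _Port, Data}, State) ->
    {noreply, maybe_send(State#state{queue = queue:in(Data, State#state.queue)})}.

handle_cast(ack, State) ->
    {noreply, maybe_send(State#state{awaiting_ack = false})}.

handle_call(_Req, _From, State) -> {reply, ok, State}.

%% send the next queued message only when the controller has acked
maybe_send(#state{awaiting_ack = true} = State) -> State;
maybe_send(#state{queue = Q, controller = C} = State) ->
    case queue:out(Q) of
        {{value, Msg}, Q2} ->
            C ! {message, Msg, self()},
            State#state{queue = Q2, awaiting_ack = true};
        {empty, _} ->
            State
    end.
```

A burst then accumulates in the internal queue rather than in the gen_server's mailbox, where it would slow down every receive.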

Any ideas?

@knutin

Hi,

The behaviour you are describing is very surprising to me, as the design choices were made specifically to prevent this exact problem.

eredis_sub_client, which handles the connection, does not wait for a reply from the controlling process before receiving more messages on the socket from Redis. It does wait until it gets back an ack before sending the next message to the controlling process; in the meantime it queues messages from Redis. Pushing the queueing down to the TCP layer would force Redis to do the queueing, which can easily kill an instance by running out of memory. Instead, we do the queueing inside eredis, where the user can configure whether to drop messages or kill the client when a maximum queue size is reached (the defaults are drop and infinity).
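The bounded-queue-with-drop design described here can be illustrated with a small standalone module. This is a sketch of the idea, not the eredis implementation; it drops new items once a maximum size is reached, and it keeps a manual counter so the length check is O(1) (queue:len/1 walks the whole queue):

```erlang
%% Sketch of a bounded queue that drops new items on overflow,
%% mirroring the drop-at-max behaviour described above. The module
%% name and API are invented for this example.
-module(bounded_queue).
-export([new/1, in/2, out/1, len/1]).

%% state is {Queue, Count, Max}; the manual Count avoids queue:len/1
new(Max) -> {queue:new(), 0, Max}.

in(_Item, {_Q, N, Max} = BQ) when Max =/= infinity, N >= Max ->
    {dropped, BQ};
in(Item, {Q, N, Max}) ->
    {ok, {queue:in(Item, Q), N + 1, Max}}.

out({Q, N, Max}) ->
    case queue:out(Q) of
        {{value, Item}, Q2} -> {{value, Item}, {Q2, N - 1, Max}};
        {empty, Q2}         -> {empty, {Q2, N, Max}}
    end.

len({_Q, N, _Max}) -> N.
```

With `Max = infinity` the guard never fires, matching the default of queueing without limit.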

The only way for the mailbox in eredis_sub_client to grow too big is if we were using {active, true} on the socket. Maybe the problem is the performance of the queue module when dealing with big queues: I use queue:len/1, which is linear, and this could be changed to keep a manual counter. process_info/1 should reveal exactly what is going on; if there are messages in the mailbox, process_info will show it.
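To see why {active, once} keeps the mailbox bounded where {active, true} does not, here is a self-contained loopback sketch: the receiver re-arms the socket only after handling each delivery, so at most one {tcp, ...} message sits in the mailbox at a time. The module name and message count are illustrative:

```erlang
%% Sketch: send a burst over a loopback socket and receive it in
%% {active, once} mode, re-arming after every delivery.
-module(active_once_demo).
-export([run/0]).

run() ->
    {ok, LSock} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(LSock),
    Parent = self(),
    spawn_link(fun() ->
        {ok, S} = gen_tcp:connect("127.0.0.1", Port, [binary]),
        [ok = gen_tcp:send(S, <<"m">>) || _ <- lists:seq(1, 1000)],
        ok = gen_tcp:shutdown(S, write),  %% flush, then half-close
        Parent ! sent
    end),
    {ok, Sock} = gen_tcp:accept(LSock),
    receive sent -> ok end,
    ok = inet:setopts(Sock, [{active, once}]),
    loop(Sock, 0).

loop(Sock, N) ->
    receive
        {tcp, Sock, _Data} ->
            %% only one {tcp, ...} message can be queued; re-arm
            %% after we are done with it
            ok = inet:setopts(Sock, [{active, once}]),
            loop(Sock, N + 1);
        {tcp_closed, Sock} ->
            N
    after 2000 ->
        N  %% number of tcp messages delivered (TCP may coalesce sends)
    end.
```

With {active, true} the same burst would be delivered as fast as the kernel hands it over, letting the mailbox grow without bound.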

I would be curious to know exactly what happens, as this behaviour is exactly what we don't want.

@pmembrey

Hi,

Here's the process_info output. Maybe it's not the client itself but something else - I'm afraid I'm far from an expert. I've removed the payloads shown in the tcp messages as they're confidential, but the rest is intact.

[{current_function,{eredis_parser,do_parse_multibulk,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,runnable},
 {message_queue_len,15142},
 {messages,[{tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,...},
            {tcp,...},
            {...}|...]},
 {links,[<0.103.0>,#Port<0.5638>]},
 {dictionary,[{'$ancestors',[testapp_redis,testapp_sup,
                             <0.52.0>]},
              {'$initial_call',{eredis_sub_client,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.51.0>},
 {total_heap_size,34727220},
 {heap_size,4108475},
 {stack_size,21},
 {reductions,269218575},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,10}]},
 {suspending,[]}]
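For watching a process like this over time, process_info/2 with a list of items returns just the interesting fields without dumping the whole mailbox. A hypothetical helper (the module name is made up):

```erlang
%% Check only the fields that matter for a suspected mailbox overload,
%% e.g. for the eredis_sub_client pid.
-module(procwatch).
-export([check/1]).

check(Pid) ->
    erlang:process_info(Pid, [message_queue_len, total_heap_size, status]).
```

A growing message_queue_len across repeated calls is the signature of a consumer that cannot keep up.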

Hope that makes sense...

@knutin

You're right, these are messages from a port filling up the mailbox. Just to rule out something silly, which version of eredis are you using?

I would like to reproduce it. I will try to dump a large amount of messages to Redis. How is your consumer behaving? How fast can it ack the message?

@pmembrey

Hi,

It's the latest from github master branch (1.0.1 in the app file).

The receiver should be pretty fast. It pulls the message, spawns a process to handle it and then acks it. That's pretty much it.

In general I've seen it quite happy under load of thousands of messages per second. However it is possible that it might get a "burst" of say 10,000 messages or so. It looks like the sudden burst overwhelms the process to the point where it just doesn't stand a chance.

The app is always under a load of at least a thousand messages per second, and if it gets backed up a little, the continual load could easily stop it from recovering. gen_server does a selective receive, which makes things much worse (it scans the whole message queue), slowing the process down even more while new messages keep being added...
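The selective-receive cost mentioned here is easy to measure: matching one specific message forces a scan past everything queued ahead of it, so the cost grows with mailbox size. A self-contained sketch (module name invented):

```erlang
%% Measure how long a selective receive takes with N unrelated
%% messages sitting ahead of the one we want.
-module(selrecv_demo).
-export([scan_time/1]).

scan_time(N) ->
    Self = self(),
    %% fill our own mailbox with N junk messages, then the wanted one
    [Self ! {junk, I} || I <- lists:seq(1, N)],
    Self ! wanted,
    {Micros, ok} = timer:tc(fun() ->
        receive wanted -> ok end  %% must skip past all N junk messages
    end),
    flush(),
    Micros.

flush() ->
    receive _ -> flush() after 0 -> ok end.
```

Under sustained load the scan never finishes catching up, which matches the "never manages to recover" symptom described earlier in the thread.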

@knutin
@knutin knutin added a commit that referenced this issue
@knutin knutin Fixed issue #28: The sub socket was incorrectly set to resulting in the socket sending more messages than wanted. At large volumes of messages, we get into a situation where we can no longer handle the messages fast enough. Thanks to pmembrey for reporting.
4d35fcc
@knutin

I found a problem and was able to fix at least the issue I could reproduce. Let me know if you still experience problems with this or run into anything else. Thanks for reporting.

@knutin knutin closed this
@pmembrey

Deployed the update. I'll know by this time tomorrow whether this resolves the issue I came across. I'll post back as soon as I know.

Thanks for the very fast fix!

@pmembrey

Hi,

I can confirm that I didn't see any problems today, even though I drastically increased the traffic (i.e. more than doubled it). Historically it would fail a couple of times a day; now, even with the additional pressure, it's solid as a rock.

Guess we can sign this one off :)
