Lengthy throughput drop when flushing queue index journal #227

Closed
videlalvaro opened this Issue Jul 21, 2015 · 12 comments


@videlalvaro
Member

What needs to be optimized is the rabbit_queue_index:publish/6 case where rabbit_queue_index:maybe_flush_journal/2 is called at the end.

The problem here is: we have a use case where the consumers have been stopped for whatever reason, and messages start accumulating in the queue, along with lots of journal entries in the index. At some point the journal has to be flushed to disk, and at that point performance drops dramatically. We are talking about going from 20k 1kB msgs/sec down to a few hundred per second.

Yes, we can say "don't publish faster than you can consume", and that's fine, but in the few tests I ran on my Mac with an SSD and 16GB of RAM, broker performance drops significantly after two minutes or so because the journal must be flushed. I think this has to be optimized. We can't escape flushing what's in memory to disk, but we have to find a strategy that doesn't do it so eagerly and doesn't expect the world to stop while it happens.
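
To make the blocking behaviour concrete, here is a hypothetical, much-simplified Erlang sketch of the shape of that publish path (module, function, and record names are illustrative, not the actual rabbit_queue_index internals):

    %% every publish appends to the in-memory journal; once the journal is
    %% "full" the same call flushes it synchronously, blocking the queue
    %% process for the duration of the flush
    -module(qi_sketch).
    -export([publish/3]).

    -record(state, {journal = [], dirty_count = 0, max_entries = 65536}).

    publish(SeqId, Entry, State = #state{journal = J, dirty_count = N}) ->
        maybe_flush_journal(State#state{journal = [{SeqId, Entry} | J],
                                        dirty_count = N + 1}).

    maybe_flush_journal(State = #state{dirty_count = N, max_entries = Max})
      when N > Max ->
        flush_journal(State);                 %% blocking: walks the whole journal
    maybe_flush_journal(State) ->
        State.

    flush_journal(State = #state{journal = J}) ->
        ok = write_segments(lists:reverse(J)),
        State#state{journal = [], dirty_count = 0}.

    write_segments(_Entries) ->
        ok.                                   %% stand-in for the segment-writing code

Because the flush runs inside the queue process, every publish that arrives during a 65k-entry flush has to wait, which is the throughput cliff described above.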

To test this case I used the following:

./runjava.sh com.rabbitmq.examples.PerfTest -x 2 -y 0 -u bench -s 1000 -z 360

As you can see, the message size is 1000 bytes (1kB, depending on who you ask), so messages are embedded in the queue index. See queue_index_embed_msgs_below.
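
For reference, this threshold is a rabbit application setting; a minimal rabbitmq.config sketch (4096 bytes is the default, so the 1000-byte payloads above fall below it):

    %% embed messages smaller than this many bytes directly in the queue index
    [{rabbit, [{queue_index_embed_msgs_below, 4096}]}].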

It might be worth investigating the relation between the ingress rate and the queue_index_max_journal_entries config setting.

Also be aware that many of these Journal entries/Segments end up stored in the process dictionary when they are passed down to the FHC.

@gmr
gmr commented Aug 14, 2015

I'm thinking this might be even more significant the larger the amount of data. In my production issue today, my consumers could not keep up with the publishing velocity. Messages are generally >50KB, but can be larger. Messages would only be delivered to consumers at roughly the disk read rate, yet there was no iowait on any node in the cluster.

For example, where I might normally be able to consume at 1k messages per second, when the transient message store was around 20GB in size it would only deliver at 10 messages per second.

When it was ~1GB on disk, it started to go back to normal speeds.

It seems that the larger the transient message store becomes, the more disproportionately the disk read rate slows down, all while there are no indicators at the OS level as to why.

Big data, slow performance

@videlalvaro
Member

Thanks for taking the time to investigate this as well. For me this is a bit embarrassing to be honest :(

@michaelklishin
Member

@gmr adding to our discussion yesterday, can you try reducing rabbit.queue_index_max_journal_entries (by 2 to 4 times, for example) and see what kind of difference that makes?
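
For reference, the default is 65536 entries, so a 4x reduction would look something like this in rabbitmq.config:

    %% flush the queue index journal to disk after this many entries (default: 65536)
    [{rabbit, [{queue_index_max_journal_entries, 16384}]}].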

@gmr
gmr commented Aug 19, 2015

Here's a quick in-progress update on what I've found as I've been working to validate/invalidate #226 as a fix, along with other settings changes.

tl;dr version: after updating to RabbitMQ 3.5.4.60818, turning off FHC buffering, and setting queue_index_embed_msgs_below to 0, RabbitMQ's memory usage is much better and I have to work a lot harder to get messages to page to disk in the first place.

I am evaluating the changes on the same RabbitMQ cluster I ran into the problem with, using the nightly build from the 18th (3.5.4.60818).

This is on our "production" cluster for a new service running in EC2 on 4 i2.2xlarge instances.

I am using CoreOS and RabbitMQ is running inside one Docker container on each node. The /var/lib/rabbitmq directory is mounted at the OS level so that data persists across restarts of my Docker containers.

In the issue from last week, once the queue grew above ~12GB (with almost all messages paged to disk at that point), I started to run into the problem.

The queue is HA with the following settings applied via policy:

ha-mode: exactly
ha-params: 3
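
For context, a policy like this would typically be applied with something along these lines (the policy name and queue pattern here are placeholders):

    rabbitmqctl set_policy --apply-to queues ha-exactly-3 "^my-queue$" '{"ha-mode":"exactly","ha-params":3}'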

In trying to address the problem and validate tuning changes, I added the following entries to my config (these values were previously left at their non-configured defaults):

{fhc_read_buffering, false},
{fhc_write_buffering, false},
{queue_index_embed_msgs_below, 0}
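
For anyone reproducing this, these keys all sit under the rabbit application in a classic rabbitmq.config, roughly:

    %% sketch of the surrounding config file structure
    [{rabbit, [{fhc_read_buffering,           false},
               {fhc_write_buffering,          false},
               {queue_index_embed_msgs_below, 0}]}].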

I am having a much harder time getting to memory pressure... I'm currently sitting on 19GB of messages in the queue, with 14GB in memory and only 5GB on disk. My plan is to keep pushing until I have 19GB on disk before I try to dequeue and see if I can maintain a sustained rate above 2k messages per second per consumer, which is what I get when I don't have any issues.

I'm currently rolling out a change to set:

{queue_index_max_journal_entries, 16384}

and will resume filling the queue once I've updated all the nodes and resynced the queue so it's back up on three nodes. More updates as I have more findings.

@michaelklishin
Member

We have a few updates on this that other team members will post.

One outcome of this is that the default QI journal size is quite excessive. We should lower it, as doing so makes throughput much more predictable. It would also shrink the time window in which messages embedded into the QI may stay only in RAM (but let's not digress).

As a result, the plan for 3.5.5 is to reduce the QI journal size and to make another related value configurable (and lower it). For 3.6.0, we are looking into various ways to parallelise folding over the sparse array of entries, or to avoid it entirely if we can.

@videlalvaro videlalvaro added a commit that referenced this issue Aug 21, 2015
@videlalvaro videlalvaro Converts journal entries in place.
Instead of folding over the whole journal when we flush it to disk, it
might be better to convert the journal entries to their final disk
format, to avoid folding over 65k journal entries. Of course this
approach uses more memory.

See #227
6dba4fa
@videlalvaro
Member

Some updates:

After running a couple of fprof sessions around flush_journal/1, the main culprit seems to be this line: https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_queue_index.erl#L709. That is totally understandable, since with the default config values array:sparse_foldr/3 iterates over 65k elements. Also, there doesn't seem to be anything suspicious in entry_to_segment/3, so it's hard to say what to optimize there.
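
For illustration only (this is not the server code), folding over a journal-sized sparse array in the shell shows the shape of the work every flush has to do with the default queue_index_max_journal_entries of 65536:

    %% build a 65536-slot sparse array and fold over it, as flush_journal
    %% effectively does on every flush with the default settings
    Journal = lists:foldl(fun(I, A) -> array:set(I, {pub, I}, A) end,
                          array:new(65536), lists:seq(0, 65535)),
    Entries = array:sparse_foldr(fun(SeqId, E, Acc) -> [{SeqId, E} | Acc] end,
                                 [], Journal).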

What could be fixed/improved:

  • Parallelize the call to segment_fold, with each segment having its own worker flushing contents to disk. See worker_pool.erl and gatherer.erl, which are already used in the rabbit_queue_index module to load the index during startup/recovery. This approach could benefit from allowing users to set SEGMENT_ENTRY_COUNT, as explained here: https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_queue_index.erl#L124
  • Completely avoid the array:sparse_foldr/3 bit by converting the journal entries to their disk format as they enter the index (see the sketch after this list). This approach uses more memory of course, but avoids blocking the process while folding over the whole journal, and reduces flush_journal times from 1.3 seconds to 0.3 seconds according to fprof. The change can be tested in the branch https://github.com/rabbitmq/rabbitmq-server/tree/rabbitmq-server-227
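
A hypothetical, much-simplified sketch of that second idea (module, function, and record names are illustrative, not the code in the branch):

    %% keep a pre-encoded iodata copy of the journal, built incrementally at
    %% publish time, so the flush no longer folds over the whole journal
    -module(qi_preencode_sketch).
    -export([add_to_journal/3, flush_journal/2]).

    -record(state, {journal = [], journal_bin = []}).

    add_to_journal(SeqId, Entry, State = #state{journal = J, journal_bin = Bin}) ->
        EntryBin = entry_to_disk_format(SeqId, Entry),    %% encoded per publish
        State#state{journal     = [{SeqId, Entry} | J],
                    journal_bin = [Bin, EntryBin]}.       %% iodata: O(1) append

    flush_journal(SegmentPath, State = #state{journal_bin = Bin}) ->
        ok = file:write_file(SegmentPath, Bin),           %% no per-entry fold here
        State#state{journal = [], journal_bin = []}.

    %% stand-in for the real entry_to_segment/3-style encoding
    entry_to_disk_format(SeqId, Entry) ->
        term_to_binary({SeqId, Entry}).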

@gmr and everyone else, whenever you have time, please run some tests with the branch I've mentioned above. Note that my tests were done with the Java client and the default RabbitMQ configuration.

@videlalvaro
Member

rabbitmq-test's make full passes without problems.

@michaelklishin michaelklishin added a commit that referenced this issue Aug 21, 2015
@michaelklishin michaelklishin Reduce default QI journal size
References #279, see #227 for background info.
9e588f6
@videlalvaro
Member

The segment entries cache from the branch I've pushed might need to take this into account: https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_queue_index.erl#L1035

I have a possible solution already, using arrays for the cache instead of a list.

@michaelklishin michaelklishin changed the title from rabbit_queue_index:maybe_flush_journal/2 significantly reduces performance when called. to Lengthy throughput drop when flushing queue index journal Aug 24, 2015
@michaelklishin
Member

Although #283 has been merged, we have a few other optimisations coming, so we'll leave this open for a while longer.

@michaelklishin michaelklishin added this to the 3.5.5 milestone Aug 28, 2015
@videlalvaro videlalvaro added a commit that closed this issue Aug 31, 2015
@videlalvaro videlalvaro Converts journal entries in place.
Instead of folding over the whole journal when we flush it to disk, it
might be better to convert the journal entries to their final disk
format, to avoid folding over 65k journal entries. Of course this
approach uses more memory.

Fixes #227
3832456
@michaelklishin
Member

Note that this issue is closed, but we have a bunch of other issues (primarily #289) that are related and continue improving things in the same area. They too will be in 3.5.5.

@michaelklishin
Member

Related: #291 (will be in 3.5.5).
