CQ shared message store improvements #6090
Conversation
I don't usually push commits with just my thoughts about the changes I'm currently making, but when I do, I do it in style and break compilation.
After all this is done, the PR will be in a good enough state to merge.
This commit replaces file combining with single-file compaction: data is moved towards the beginning of the file before the index entries are updated. The file is then truncated once all existing readers are gone. This allows removing the lock that existed before and enables reading multiple messages at once from the shared files. It also lets us avoid many ets operations and greatly simplifies the code.

This commit still has some issues: reading a single message is currently slow due to the removal of FHC in the client code. This will be resolved by implementing read buffering in a way similar to FHC, but without keeping files open longer than necessary. The dirty recovery code also likely has a number of issues because of the compaction changes.
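To make the compaction idea concrete, here is a minimal Erlang sketch. The helpers `plan_moves/1`, `update_index_entry/2` and `defer_truncate/1` are hypothetical stand-ins, not the actual `rabbit_msg_store` API:

```erlang
%% Minimal sketch of single-file compaction: copy live data from the
%% end of the file into holes near the beginning, update the index,
%% and defer truncation until no reader can hold the old offsets.
compact_file(File) ->
    {ok, Fd} = file:open(File, [read, write, raw, binary]),
    %% Hypothetical: decide which live messages near the end should be
    %% copied into holes near the beginning of the file.
    Moves = plan_moves(Fd),
    lists:foreach(
        fun({MsgId, OldOffset, NewOffset, Size}) ->
            {ok, Data} = file:pread(Fd, OldOffset, Size),
            ok = file:pwrite(Fd, NewOffset, Data),
            %% Only after the data is safely in its new place do we
            %% point the index entry at the new offset, so readers
            %% always find valid data at whichever offset they see.
            ok = update_index_entry(MsgId, NewOffset)
        end, Moves),
    ok = file:close(Fd),
    %% Hypothetical: schedule truncation for when all existing
    %% readers of the old offsets are gone.
    defer_truncate(File).
```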
We no longer use FHC there and don't keep FDs open after reading.
This allows simplifying a bunch of things.
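For illustration, this is roughly what a read without FHC looks like: open, positional read, close, with no FD retained afterwards (a sketch, not the actual client code):

```erlang
%% Read one message from a shared store file without going through
%% file_handle_cache; the FD lives only for the duration of the read.
read_msg(File, Offset, Size) ->
    {ok, Fd} = file:open(File, [read, raw, binary]),
    {ok, Data} = file:pread(Fd, Offset, Size),
    ok = file:close(Fd),
    {ok, Data}.
```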
The cache, used to help keep binary references around for fan-out cases, was introduced in 2009 and removed in 2011. It's no longer relevant...
So in 2009 it was written that combining helped performance. But I doubt we ever get to a scenario today where reducing the number of file_summary table entries matters. There have been plenty of ets table optimisations since, this branch reduces the number of fields in entries anyway, and we don't go over the whole table as often as before. See 30bc61f.
The first is not going to be super useful. The second is not possible because we already have a check on the file_summary table.
Instead of doing complicated +1/-1 accounting, we do an update_counter on an integer value using 2^n values. We always know exactly which state we are in when looking at the ets table. We can also avoid some ets operations as a result, although the performance improvements are minimal.
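Roughly, the idea looks like this; the define names and values below are illustrative, not the actual flags used in the PR:

```erlang
-define(FLYING_WRITE,       1).  %% 2^0: write requested
-define(FLYING_WRITE_DONE,  2).  %% 2^1: write handled by the store
-define(FLYING_REMOVE,      4).  %% 2^2: remove requested
-define(FLYING_REMOVE_DONE, 8).  %% 2^3: remove handled by the store

%% Each event adds a distinct power of two, so the counter value alone
%% identifies exactly which combination of events has happened; with
%% +1/-1 accounting, different histories can collide on the same value.
mark_write_done(Tab, MsgId) ->
    %% Inserts a fresh zero counter if no entry exists yet, then adds
    %% the flag in a single atomic ets operation.
    ets:update_counter(Tab, MsgId, ?FLYING_WRITE_DONE, {MsgId, 0}).
```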
Let's see if this helps performance of single reads.
Also use defines.
This is ready for review/merge into

The main performance improvement comes when reading long queues that have many messages in the store: the queue will now read multiple messages at once, just like in the CQv2 embedded store. This is all done from within the queue process itself as well.

To make this possible, the store's compaction mechanism had to be changed so that it never overwrites data that may be accessed by a queue. Instead of combining two files together (and deleting the old data), the store now compacts a single file, moving data from the end of the file into the holes at the beginning (where messages were removed). Truncation happens later, once we know there are no queues reading from the file (we track when queues access the file to know this). We also avoid hard locks in the process. A sketch of the multi-message read follows below.

I have also reworked the flying message mechanism in the hope that it will allow further optimisations, but I don't think I can do those until the message store no longer uses gen_server2. So this feels like a good time to stop and merge what was done (early, to get a lot of testing) and build further work on top of it in the future.
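As an illustration of the multi-message read, here is a hedged sketch; the `{MsgId, Offset, Size}` shape and function names are assumptions, not the store's actual types. Given the locations of several messages in the same file, sorted by offset, a single pread can cover the whole span:

```erlang
%% Read several messages from one shared-store file in a single pread,
%% then slice each message out of the span. Assumes Msgs is a non-empty
%% list of {MsgId, Offset, Size} sorted by Offset, all in this file.
read_many(File, Msgs = [{_, FirstOffset, _} | _]) ->
    {_, LastOffset, LastSize} = lists:last(Msgs),
    SpanSize = LastOffset + LastSize - FirstOffset,
    {ok, Fd} = file:open(File, [read, raw, binary]),
    {ok, Span} = file:pread(Fd, FirstOffset, SpanSize),
    ok = file:close(Fd),
    %% Cut each message's bytes out of the span we just read.
    [{MsgId, binary:part(Span, Offset - FirstOffset, Size)}
     || {MsgId, Offset, Size} <- Msgs].
```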
The only regression I found is queue deletion time. Deleting a queue with 1M 5kb messages (no other queues in the system) takes 2s on
On
I repeated the test with 15 queues. That's hopefully very much corner-case territory, putting 15M messages in the message store (and deleting them), but I was interested in how these numbers change with scale.
I'm not sure how much time we want to spend on this issue, but let's have a chat about it before we merge (or we can merge and then perhaps improve upon it).
We don't need it and it slows down queue deletion far too much.
With 59259b2, the results are now similar to
I observe similar improvements with messages of 8192 bytes in size, with a backlog of messages across N CQs (CQv2 specifically) and 32 fast consumers.
Great to see that this PR adds fewer lines than it removes.
Now that 3.12.0 has shipped, we can merge this.
Great results for consuming long queues.