
CQ shared message store improvements #6090

Merged — 23 commits merged into main from lh-msg-store on Jun 1, 2023

Conversation

@lhoguin (Contributor) commented Oct 11, 2022

  • Performance improvements!!

Great results for consuming long queues.

@lhoguin changed the title from "Old per-vhost message store improvements" to "WIP: Old per-vhost message store improvements" on Oct 11, 2022
A comment from @mkuratczyk was marked as outdated.

@lhoguin (Contributor, Author) commented Oct 28, 2022

I don't usually push commits with just my thoughts about the changes I am currently doing but when I do I do it in style and break compilation.

@lhoguin (Contributor, Author) left a comment:

After all this is done the PR will be in a good enough state to merge.

Review threads (outdated, resolved) on:
deps/rabbit/src/rabbit_memory_monitor.erl
deps/rabbit/src/rabbit_msg_file.erl (×3)
deps/rabbit/src/rabbit_msg_store_ets_index.erl
deps/rabbit/src/rabbit_msg_store.erl (×5)
@essen force-pushed the lh-msg-store branch 3 times, most recently from fffe717 to 3868da6 on December 23, 2022 11:50
@essen force-pushed the lh-msg-store branch 2 times, most recently from 226ea78 to fac8c5d on March 6, 2023 13:40
@essen force-pushed the lh-msg-store branch 3 times, most recently from 20d2b70 to e8e8986 on May 26, 2023 11:03
Review threads (outdated, resolved) on:
deps/rabbit/src/rabbit_msg_store.erl (×9)
deps/rabbit/src/rabbit_variable_queue.erl
lhoguin added 15 commits May 30, 2023 11:19
This commit replaces file combining with single-file compaction
where data is moved near the beginning of the file before
updating the index entries. The file is then truncated when
all existing readers are gone. This allows removing the lock
that existed before and enables reading multiple messages at
once from the shared files. This also helps us avoid many
ets operations and simplify the code greatly.

This commit still has some issues: reading a single message
is currently slow due to the removal of FHC in the client
code. This will be resolved by implementing read buffering
in a similar way to FHC, but without keeping files open
longer than necessary.

The dirty recovery code also likely has a number of issues
because of the compaction changes.
We no longer use FHC there and don't keep FDs open
after reading.
This allows simplifying a bunch of things.
The cache used to help keep binary references around
for fan-out cases, was introduced in 2009 and removed
in 2011. It's no longer relevant...
Back in 2009 it was written that combining helped performance,
but I doubt we ever hit a scenario today where reducing the
number of file_summary table entries matters. There have been
plenty of ets table optimisations since, this branch reduces
the number of fields in entries anyway, and we don't go over
the whole table as often as before.

See 30bc61f
The first is not going to be super useful.

The second is not possible because we already have a check
on the file_summary table.
Instead of doing a complicated +1/-1 we do an update_counter
on an integer value using 2^n values. We always know exactly
which state we are in when looking at the ets table. We can
also avoid some ets operations as a result, although the
performance improvements are minimal.
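
The commit message above sketches the reworked flying-message tracking: each state transition adds its own power-of-two flag to a single integer counter, so reading the integer back always identifies the exact combination of pending and handled operations. Below is a minimal illustration of the idea in plain Erlang; it is not the rabbit_msg_store code, and the module name, bit assignments and API are hypothetical.

-module(flying_sketch).
-export([new/0, queue_write/2, handle_write/2,
         queue_remove/2, handle_remove/2, state/2]).

%% Hypothetical flag values, for illustration only.
-define(WRITE_QUEUED,   1).
-define(WRITE_HANDLED,  2).
-define(REMOVE_QUEUED,  4).
-define(REMOVE_HANDLED, 8).

new() ->
    ets:new(flying_sketch, [set, public]).

%% Each transition sets its bit with a single ets:update_counter/4 call;
%% the default tuple creates the entry at 0 if it does not exist yet.
queue_write(Tab, Key)   -> ets:update_counter(Tab, Key, ?WRITE_QUEUED,   {Key, 0}).
handle_write(Tab, Key)  -> ets:update_counter(Tab, Key, ?WRITE_HANDLED,  {Key, 0}).
queue_remove(Tab, Key)  -> ets:update_counter(Tab, Key, ?REMOVE_QUEUED,  {Key, 0}).
handle_remove(Tab, Key) -> ets:update_counter(Tab, Key, ?REMOVE_HANDLED, {Key, 0}).

%% Reading the integer back tells us the exact combination of pending
%% and handled operations, e.g. 3 = write queued and write handled.
state(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{_, N}] -> N;
        []       -> 0
    end.

In this sketch, a queue bumps the "queued" bit when it sends a request and the store bumps the "handled" bit when it processes it; because each state is a distinct power of two, no +1/-1 bookkeeping is needed to tell the states apart.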
@lhoguin changed the title from "WIP: Old per-vhost message store improvements" to "CQ shared message store improvements" on May 30, 2023
@lhoguin (Contributor, Author) commented May 30, 2023

This is ready for review/merge into main for inclusion in 3.13.

The main performance improvement comes when reading long queues that have many messages in the store: the queue will now read multiple messages at once, just like in the CQv2 embedded store, and this is done from within the queue process itself. To make this possible, the store's compaction mechanism had to be changed so that it never overwrites data that may still be accessed by a queue. Instead of combining two files together (and deleting the old data), the store now compacts a single file: data at the end of the file is moved into the holes near the beginning of the file (where messages were removed). Truncation happens later, once we know there are no queues reading from the file (we track when queues access the file to know this). We also avoid hard locks in the process.

I have also reworked the flying message mechanism in the hope that it will allow further optimisations, but I don't think I will be able to do those until the message store is converted away from gen_server2. So this feels like a good time to stop, merge what was done (early, to get a lot of testing), and build further work on top of this later.
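
As a rough illustration of the compaction approach described above (not the actual rabbit_msg_store implementation; the module name, index shape and reader check are simplified assumptions), live messages from the tail of a store file are packed into the holes at its front, the index is updated with the new offsets, and the file is only truncated once no reader can still rely on the old layout:

-module(compact_sketch).
-export([compact/3]).

%% Index :: #{MsgId => {Offset, Size}} for a single store file.
%% NoReadersLeftFun() -> boolean() stands in for the store's tracking of
%% which queues still read from the file.
compact(File, Index, NoReadersLeftFun) ->
    {ok, Fd} = file:open(File, [read, write, raw, binary]),
    %% Walk entries from the lowest offset up, packing them tightly.
    Entries = lists:keysort(2, maps:to_list(Index)),
    {NewIndex, EndOffset} = pack(Fd, Entries, #{}, 0),
    %% Truncation is deferred until all readers of the old layout are gone.
    case NoReadersLeftFun() of
        true ->
            {ok, _} = file:position(Fd, EndOffset),
            ok = file:truncate(Fd);
        false ->
            ok %% a real store would retry the truncate later
    end,
    ok = file:close(Fd),
    {NewIndex, EndOffset}.

pack(_Fd, [], NewIndex, Offset) ->
    {NewIndex, Offset};
pack(Fd, [{MsgId, {Offset, Size}} | Rest], NewIndex, Offset) ->
    %% Already in place; nothing to move.
    pack(Fd, Rest, NewIndex#{MsgId => {Offset, Size}}, Offset + Size);
pack(Fd, [{MsgId, {OldOffset, Size}} | Rest], NewIndex, Offset) ->
    %% Copy the message into the hole at the current pack offset and
    %% record its new location in the rebuilt index.
    {ok, Data} = file:pread(Fd, OldOffset, Size),
    ok = file:pwrite(Fd, Offset, Data),
    pack(Fd, Rest, NewIndex#{MsgId => {Offset, Size}}, Offset + Size).

Because data is only ever copied into already-freed regions and truncation waits for readers to go away, queues can keep reading batches of messages from the file without the store taking a hard lock, which is what enables the multi-message reads described above.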

@lhoguin marked this pull request as ready for review on May 30, 2023 10:31
@lhoguin requested a review from mkuratczyk on May 30, 2023 10:31
@michaelklishin added this to the 3.13.0 milestone on May 30, 2023
@mkuratczyk (Contributor) commented:

Consumption of a long queue with different message sizes, one consumer, compared to main:
[Screenshot: 2023-05-30 at 15:30:43]

Similar, but with a publisher still present: it creates a backlog without consumers, then 10 consumers start (after a short burst, the three fastest environments stabilize, since they have caught up and now consume the 10k msg/s that the publisher sends).
[Screenshot: 2023-05-30 at 15:34:53]

@mkuratczyk (Contributor) commented:

The only regression I found is queue deletion time. Deleting a queue with 1M 5 kB messages (no other queues in the system) takes 2 s on main and 14 s on this branch. It gets even worse with many queues. If I create 5 queues with 1M 5 kB messages each (perf-test -ad false -f persistent -u test -C 1000000 -s 5000 -c 100 -y 0 -x 5 -qp test-%d -qpf 1 -qpt 5), then on main I can delete each in about 2 seconds. On this branch, however, it can take up to 2 minutes.

> for i in (seq 5); /usr/bin/time --format '%e seconds' rabbitmqctl delete_queue test-$i; end
Deleting queue 'test-1' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
34.95 seconds
Deleting queue 'test-2' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
35.07 seconds
Deleting queue 'test-3' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
90.25 seconds
Deleting queue 'test-4' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
112.68 seconds
Deleting queue 'test-5' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
117.41 seconds

On main, each of these takes about 2 seconds.

@mkuratczyk (Contributor) commented:

I repeated the test with 15 queues. That's hopefully very much corner-case territory, putting 15M messages in the message store (and deleting them), but I was interested in how these numbers change with scale. main was still able to delete each of the 15 queues in about 2 seconds; this branch got worse, as expected:

> for i in (seq 15); /usr/bin/time --format '%e seconds' rabbitmqctl delete_queue test-$i; end
Deleting queue 'test-1' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
77.90 seconds
Deleting queue 'test-2' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
74.58 seconds
Deleting queue 'test-3' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
77.27 seconds
Deleting queue 'test-4' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
78.41 seconds
Deleting queue 'test-5' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
90.21 seconds
Deleting queue 'test-6' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
75.75 seconds
Deleting queue 'test-7' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
143.39 seconds
Deleting queue 'test-8' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
183.35 seconds
Deleting queue 'test-9' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
188.34 seconds
Deleting queue 'test-10' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
188.41 seconds
Deleting queue 'test-11' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
183.54 seconds
Deleting queue 'test-12' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
193.82 seconds
Deleting queue 'test-13' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
193.40 seconds
Deleting queue 'test-14' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
191.92 seconds
Deleting queue 'test-15' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
188.38 seconds

I'm not sure how much time we want to spend on this issue, but let's have a chat about it before we merge (or we can merge and then improve on it afterwards).

We don't need it and it slows down queue deletion far too much.
@mkuratczyk (Contributor) commented:

With 59259b2, the results are now similar to main:

$ for i in (seq 5); /usr/bin/time --format '%e seconds' rabbitmqctl delete_queue test-$i; end
Deleting queue 'test-1' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
2.23 seconds
Deleting queue 'test-2' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
2.21 seconds
Deleting queue 'test-3' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
2.22 seconds
Deleting queue 'test-4' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
2.21 seconds
Deleting queue 'test-5' on vhost '/' ...
Queue was successfully deleted with 1000000 ready messages
2.62 seconds

@michaelklishin (Member) left a comment:

I observe similar improvements with messages of 8192 bytes in size, with a backlog of messages across N CQs (CQv2 specifically) and 32 fast consumers.

Great to see that this PR adds fewer lines than it removes.

@michaelklishin (Member) commented:

Now that 3.12.0 has shipped, we can merge this.

@michaelklishin merged commit 46d4279 into main on Jun 1, 2023
16 checks passed
@michaelklishin deleted the lh-msg-store branch on June 1, 2023 23:36