
CQ: Fix entry missing from cache leading to crash on read #11288

Merged 1 commit into main on May 21, 2024

Conversation

lhoguin
Contributor

@lhoguin lhoguin commented May 21, 2024

The issue comes from a mechanism that allows us to avoid writing to disk when a message has already been consumed. It works fine in normal circumstances, but fan-out makes things trickier.

When multiple queues write and read the same message, we could get a crash. Let's say queues A and B both handle message Msg.

  • Queue A asks store to write Msg
  • Queue B asks store to write Msg
  • Queue B asks store to delete Msg (message was immediately consumed)
  • Store processes Msg write from queue A
    • Store writes Msg to current file
  • Store processes Msg write from queue B
    • Store notices queue B doesn't need Msg anymore; doesn't write
    • Store clears Msg from the cache
  • Queue A tries to read Msg
    • Msg is missing from the cache
    • Queue A tries to read from disk
    • Msg is in the current write file and may not be on disk yet
    • Crash

The problem is that the store clears Msg from the cache. All messages written to the current file must remain in the cache, because we cannot guarantee the data is on disk by the time it is read. Entries can only be dropped once we roll over to the next file.
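The retention rule above can be sketched as a toy model. This is Python for brevity and is purely illustrative: the real store is Erlang (rabbit_msg_store), and every name below is hypothetical, not the actual API.

```python
# Toy model of the cache-retention rule: entries located in the current
# write file must stay cached until rollover, because their bytes may
# not have reached disk yet. All names here are illustrative.

class MsgStore:
    def __init__(self):
        self.cache = {}          # msg_id -> payload, covers unflushed data
        self.current_file = 0    # file currently being appended to
        self.locations = {}      # msg_id -> file number (the "index")

    def write(self, msg_id, payload):
        self.cache[msg_id] = payload
        self.locations[msg_id] = self.current_file

    def maybe_evict(self, msg_id, fixed=True):
        # Buggy behaviour (fixed=False): always evict when a writer no
        # longer needs the message. Fixed behaviour: keep entries that
        # live in the current write file.
        if not fixed or self.locations.get(msg_id) != self.current_file:
            self.cache.pop(msg_id, None)

    def roll_over(self):
        # On rollover the previous file is flushed, so its cache
        # entries can safely be dropped.
        flushed = self.current_file
        self.current_file += 1
        for msg_id, f in list(self.locations.items()):
            if f == flushed:
                self.cache.pop(msg_id, None)

    def read(self, msg_id):
        if msg_id in self.cache:
            return self.cache[msg_id]
        if self.locations.get(msg_id) == self.current_file:
            # Data may still be in the OS write buffer, not on disk.
            raise RuntimeError("read from unflushed current file: crash")
        return "<read from disk>"
```

With `fixed=False` the interleaving from the bullet list reproduces the crash (queue B's eviction removes the entry queue A still needs); with `fixed=True` the entry survives until rollover.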

The root cause was an incorrect pattern match: instead of matching the single location returned by the index, the code matched against a list. The error had been present for almost 13 years, since commit 2ef30dc.
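The shape of that match bug can be sketched as follows (Python for brevity; the actual code is an Erlang pattern match, and these function and field names are hypothetical, not the real rabbit_msg_store API):

```python
# Illustrative sketch of the bug class: the index lookup returns a
# single location record, but the caller tested against a list, so the
# "keep in cache" branch could never match a real record.

from dataclasses import dataclass

@dataclass
class MsgLocation:
    msg_id: str
    file: int

def index_lookup(index, msg_id):
    # Returns one location record or None -- never a list.
    return index.get(msg_id)

def should_keep_in_cache_buggy(index, msg_id, current_file):
    loc = index_lookup(index, msg_id)
    # Bug: matches a list shape, which a single record never satisfies,
    # so the entry is always (wrongly) considered evictable.
    return isinstance(loc, list) and loc[0].file == current_file

def should_keep_in_cache_fixed(index, msg_id, current_file):
    loc = index_lookup(index, msg_id)
    return loc is not None and loc.file == current_file
```

For an entry in the current write file, the buggy predicate returns False (entry wrongly evicted) while the fixed one returns True (entry kept). Because the bad branch simply never matched, the code ran without error for years and only misbehaved under this fan-out interleaving.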

@lhoguin
Contributor Author

lhoguin commented May 21, 2024

The following command should trigger the bug before this patch, and not trigger the bug after this patch:

while true; do perf-test -e amq.fanout -t fanout -y 49 -c 1 -qp q-%d -qpf 1 -qpt 50 \
-qa x-queue-version=2 -ad false -f persistent -csd 2 -s 5000 -z 10; done

The command may need adjusting for shells other than zsh.

The bug can be seen when looking at the logs at the same time (tail -F /path/to/log).

@lhoguin
Contributor Author

lhoguin commented May 21, 2024

While the bug does occur in v3.12.x and earlier, it doesn't result in a crash on read there, because those versions deferred the read to the message store process when a queue couldn't read directly. We removed that fallback in v3.13.x by ensuring queues can always read from the cache or from disk; we didn't anticipate this bug surfacing as a result.

@michaelklishin
Member

I had to run PerfTest from the binary distribution like so (note the different -e argument to avoid a precondition_failed due to different built-in exchange properties):

while true; do ./bin/runjava com.rabbitmq.perf.PerfTest -f persistent -e server.11288.fanout -t fanout -y 49 -c 1 -qp q-%d -qpf 1 -qpt 50 -qa x-queue-version=2 -ad false -f persistent -csd 2 -s 5000 -z 10; done

Member

@michaelklishin michaelklishin left a comment


I could reproduce the exception quickly against a 3.13.2 node but not this PR.

This is the type of issue Dialyzer could have caught, but index_lookup/2 in this module delegates to another module that is unknown to Dialyzer.

Contributor

@mkuratczyk mkuratczyk left a comment


Great. Can't reproduce anymore.

@lhoguin lhoguin marked this pull request as ready for review May 21, 2024 16:08
@michaelklishin michaelklishin merged commit 6047583 into main May 21, 2024
17 of 18 checks passed
@michaelklishin michaelklishin deleted the loic-fix-fan-out-pread-crash-main branch May 21, 2024 17:25
michaelklishin added a commit that referenced this pull request May 21, 2024
CQ: Fix entry missing from cache leading to crash on read (backport #11288)

3 participants