
CORE-1722: compression: Use preallocated decompression buffers for lz4 #17385

Merged

Conversation


@abhijat abhijat commented Mar 26, 2024

LZ4 decompression uses malloc to allocate memory for temporary buffers. The allocated size depends on the input and can reach up to two buffers of slightly over 4 MiB each per decompression.

In cases where a contiguous space of 4 MiB is not available after redpanda has been running for a while, this can result in an OOM. This is usually seen in the context of compaction, where a segment needs to be read through to discard old values for a key, requiring decompression of the record batches.

This PR switches LZ4 to use a custom memory allocator and a pair of preallocated 4 MiB + 128 KiB buffers per shard. The custom memory allocator intercepts calls to malloc and returns the preallocated buffers if the allocation size is larger than a threshold. In the corresponding calls to free, the state of the managed buffers is updated.

A caller must reserve buffers before starting decompression. The current decompression operation is synchronous, so once a shard begins decompressing a record batch it cannot suspend the operation, and the buffers are implicitly protected against parallel access; the buffers in this PR are nevertheless accessed only after reserving a semaphore unit, to protect against future changes to the decompression routine.

While this PR only allows for one decompression operation per shard, if we move to the object_pool utility and support multiple buffer pairs per shard, i.e. multiple decompression ops per shard in parallel, then the current reserve/release methods will serve as a base to control concurrency.

The following new settings are introduced to control the operation:

  1. lz4_decompress_buffer_size_bytes: size per buffer (two are allocated per shard)
  2. lz4_decompress_malloc_min_threshold: minimum threshold below which preallocated buffers are not used and allocation passes through to malloc
  3. lz4_decompress_reusable_buffers_enabled: allows turning preallocated buffers off; when disabled, all calls via lz4 pass through to malloc

FIXES #16179

force push - removes semaphore, removes settings for sizes

force push - fixes semantics of deallocation order and state transition

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from b50cd6f to 7fda580 Compare March 26, 2024 08:54
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 8 times, most recently from a505fe9 to a08b80b Compare March 28, 2024 08:07
@abhijat abhijat marked this pull request as draft March 28, 2024 08:07
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 4 times, most recently from 628788e to 1605681 Compare April 1, 2024 05:48
@abhijat abhijat mentioned this pull request Apr 1, 2024
6 tasks
@abhijat abhijat changed the title WIP: decompression buffers compression: Use preallocated decompression buffers for lz4 Apr 1, 2024
@abhijat abhijat marked this pull request as ready for review April 1, 2024 06:39
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 2 times, most recently from 71dba19 to dedb0ad Compare April 1, 2024 12:30
@abhijat (Contributor Author) commented Apr 1, 2024

test failures seem related to changes in this PR

@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 4 times, most recently from 64d43b8 to 2eb41db Compare April 3, 2024 08:47
"Size of buffers preallocated for LZ4 decompression. Two buffers "
"allocated per shard, so twice of this value is reserved at startup",
{.needs_restart = needs_restart::yes, .visibility = visibility::tunable},
4224_KiB)
Contributor:

nit: is this 4_Mib + 8?
maybe 4_Mib + sizeof(uint64_t) would give a bit more context

Contributor Author:

This is 4 MiB + 128 KiB which is the maximum lz4 allocates during decompression (https://github.com/lz4/lz4/blob/dev/lib/lz4frame.c#L1678-L1687)

Contributor Author:

Would it be clearer to write the two quantities separately, i.e. 4_MiB + ...?

Contributor:

ah, I see. Up to you; I feel it would be less of a "magic constant" if it was broken into 4_MiB + 128_KiB.
Only if there are new commits to add.

Contributor Author:

changed
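The split settled on in this thread can be written out with plain constants (illustrative only; the codebase uses its own `_MiB`/`_KiB` literals and the name below is hypothetical):

```cpp
#include <cstddef>

// 4 MiB for the largest block size plus 128 KiB of lz4 overhead during
// decompression, matching the 4224_KiB default discussed above.
constexpr std::size_t lz4_decompress_buffer_size = 4 * 1024 * 1024 + 128 * 1024;
static_assert(lz4_decompress_buffer_size == 4224 * 1024, "equals 4224 KiB");
```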

@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from c40169e to 323e9d8 Compare April 5, 2024 08:24
@andrwng (Contributor) left a comment:

Pretty much LGTM. One substantive question about whether we should expose the buffer size. Otherwise, nothing major from me. Nice work!

Comment on lines 107 to 138
inline lz4_decompression_buffers::alloc_ctx::allocation_state operator|(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) | t(b));
}

inline void operator|=(
  lz4_decompression_buffers::alloc_ctx::allocation_state& a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    a = (a | b);
}

inline lz4_decompression_buffers::alloc_ctx::allocation_state operator&(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) & t(b));
}

inline void operator&=(
  lz4_decompression_buffers::alloc_ctx::allocation_state& a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    a = (a & b);
}

inline lz4_decompression_buffers::alloc_ctx::allocation_state
operator~(lz4_decompression_buffers::alloc_ctx::allocation_state a) {
    return lz4_decompression_buffers::alloc_ctx::allocation_state(
      ~std::underlying_type_t<
        lz4_decompression_buffers::alloc_ctx::allocation_state>(a));
}
Contributor:

Sounds good. FWIW now that I understand it, it's not as odd as I thought, but perhaps it's worth adding a comment to the allocation state definition about it (e.g. to discourage folks from treating it as a regular enum)

Comment on lines +203 to +221
case both_buffers_allocated:
st->deallocated();
Contributor:

Just curious, does lz4 always call free_lz4_obj, even if the decompression failed? And do the buffers need to be cleared at all?

Contributor Author:

lz4 does not guarantee to free memory; the user has to free it using the function LZ4F_freeDecompressionContext. In our case it is guaranteed because we wrap the decompression context in a unique_ptr with a custom deleter. The deleter calls LZ4F_freeDecompressionContext when the unique_ptr is destroyed, so free_lz4_obj will always be called via LZ4F_freeDecompressionContext -> LZ4_Free -> free_lz4_obj
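The guarantee described here is plain RAII. A minimal self-contained sketch of the pattern, where the stub type and free function are stand-ins (assumptions) for LZ4F_dctx and LZ4F_freeDecompressionContext from lz4frame.h:

```cpp
#include <memory>

struct lz4f_dctx_stub {};

static int frees = 0; // counts deleter invocations for demonstration

// Stand-in for LZ4F_freeDecompressionContext; in the real code this
// routes through LZ4_Free and ultimately free_lz4_obj.
void stub_free_decompression_context(lz4f_dctx_stub* ctx) {
    ++frees;
    delete ctx;
}

struct dctx_deleter {
    void operator()(lz4f_dctx_stub* ctx) const noexcept {
        stub_free_decompression_context(ctx);
    }
};

using dctx_ptr = std::unique_ptr<lz4f_dctx_stub, dctx_deleter>;
```

The deleter runs whenever the unique_ptr goes out of scope, including on early return or exception, which is why free_lz4_obj is always reached regardless of whether decompression succeeded.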

Comment on lines +135 to +138
void* alloc_lz4_obj(void* state, size_t size) {
auto* st = static_cast<compression::lz4_decompression_buffers*>(state);
Contributor:

> do we want to expose the buffer size as a config parameter

Right, I would lean towards not exposing it, other than enabling/disabling the extra buffers. Otherwise it seems too easy to get Redpanda to crash

@dotnwat (Member) left a comment:

I think we should give the restart_required::no configuration option a thought; it doesn't seem like we can reliably turn this on after the system has been running for a while, since it cannot make a big allocation.

With a 128 KiB max allocation size, what is the point in having the small allocation code path? It seems like that might be simplified?

A few other nits / suggestions, primarily related to simplifying the implementation, but overall it looks like it'll work ok.

Comment on lines 2516 to 2522
, lz4_decompress_malloc_min_threshold(
*this,
"lz4_decompress_malloc_min_threshold",
"Threshold below which preallocated buffers will not be used during "
"LZ4 decompression, if preallocated buffer use is enabled",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
128_KiB + 1)
Member:

is there some advantage to not using pre-allocated buffers? if they are there anyway, why not always use them?

"Enable reusable preallocated buffers for LZ4 decompression, helps "
"prevent OOM errors by avoiding allocation of large memory blocks "
"during decompression",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
Member:

is it problematic to have this option be needs_restart::no? After the system runs for a while, it's likely to become quite difficult/unlikely that a 4 MiB allocation would succeed.

Contributor Author:

changed this to needs_restart::yes

_buffers = {
.input_buffer = ss::allocate_aligned_buffer<char>(buffer_size, 8),
.output_buffer = ss::allocate_aligned_buffer<char>(buffer_size, 8),
.state = alloc_ctx::allocation_state::no_buffers_allocated,
Member:

> Here the allocated refers to given out by alloc_lz4_obj to the caller.

It would be helpful if, in the header for alloc_ctx, you defined what the flags are / what the states are used for in bookkeeping.
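A sketch of the kind of documentation being asked for. The value assignments and names below are illustrative assumptions, not the actual Redpanda definitions:

```cpp
#include <cstdint>

// allocation_state is a bit-flag enum, not a plain enumeration: values
// are combined with the overloaded |, &, and ~ operators as buffers are
// handed out by alloc_lz4_obj and returned via free_lz4_obj.
enum class allocation_state : std::uint8_t {
    no_buffers_allocated = 0,    // neither buffer handed out to lz4
    input_buffer_allocated = 1,  // input buffer currently held by lz4
    output_buffer_allocated = 2, // output buffer currently held by lz4
    both_buffers_allocated = 3,  // input | output
};
```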

Comment on lines 49 to 53
bool lz4_decompression_buffers::has_reserved_buffers() const {
return _reservation.current() == 0;
}
Member:

> Let me know if there are any suggestions.

is it being used like a lock/mutex?

Comment on lines 55 to 61
if (_disabled) {
co_return std::nullopt;
}

co_return co_await ss::get_units(_reservation, 1);
Member:

Yeah, if there is no contention then there is no overhead and no need for a fast path. Also, if there is no contention, there is no need for a lock?

Comment on lines 124 to 125
inline lz4_decompression_buffers::alloc_ctx::allocation_state operator&(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) & t(b));
}
Member:

Instead of writing this out explicitly, why not use bitset<4>? You can even assign names, like this:

enum class states { a, b, c, d, num_flags };
std::bitset<static_cast<size_t>(states::num_flags)> flags;

and then all these operations can be done with bitset operations?

Contributor Author:

I started exploring the use of a bitset because it looked like a good fit, but then I realized it contains just plain ints, so we lose the type safety of the enum class, and the compiler warnings from incomplete switch statements etc. that enum class provides.

I did move the bitwise operators to an anonymous namespace in the source file so they are not part of the public API.

Member:

This makes sense. No need to swap out for bitset.

Comment on lines 117 to 151
return malloc(size);
}

vassert(
st->has_reserved_buffers(),
"must reserve space with lz4_decompression_buffers::reserve_buffers() "
"before allocation");

auto& bufs = st->buffers();
Member:

If we already have a chunk of 4 MB memory, wouldn't using malloc be strictly more work than using the arena? What's the advantage to allocating more memory when it's small?

Contributor Author:

When we opt in to use this custom memory manager with lz4 it will route all calls to malloc and free via our provided functions. So if there are any calls to malloc made by lz4 during decompression other than the ones for the two buffers we care about, those should still be serviceable. The idea behind falling back to malloc for small buffers is to service those calls. We don't want to use the preallocated buffers for those calls.

On the other hand, I looked through the usages of LZ4F_malloc and did not find any calls to it other than those for these two buffers in the decompression code path. So it might work, but I'm not sure it is guaranteed that the calls will remain the same within lz4.

Member:

> When we opt in to use this custom memory manager with lz4 it will route all calls to malloc and free via our provided functions. So if there are any calls to malloc made by lz4 during decompression other than the ones for the two buffers we care about, those should still be serviceable. The idea behind falling back to malloc for small buffers is to service those calls. We don't want to use the preallocated buffers for those calls.

I see. This seems very dangerous, right?

> We don't want to use the preallocated buffers for those calls.

Specifically, you can't use preallocated buffers for those calls unless there is exactly one such call, right? IIUC you are depending on lz4 making arbitrary malloc/free calls, where exactly one of those malloc/free calls is a call that has exclusive access to the pre-allocated region?

Contributor Author:

It should be safe. Right now we can describe the calls to malloc during decompression as follows:

  • all calls to malloc <= 128_KiB go to the system malloc implementation
  • all calls to malloc > 128_KiB go to our two buffers, which are allocated in sequence.

We know from the source code of LZ4 that during decompression it makes exactly two large sized malloc calls, one for input buffer and one for output buffer.

Both are relatively similar in size, and depend on block size. So they can be 64 KiB, 256 KiB, 1 MiB or 4 MiB each. In all but the first case our buffers will be allocated out. For all other smaller sized malloc calls during decompression system malloc will be used.

Note that the malloc calls are made when the decompression context is in the init stage which, per a review of the decompression code path, only happens once.

If we look at the free calls they are a lot simpler:

  • For addresses belonging to our two preallocated buffers, we just update the state in our free implementation.
  • For all other addresses we fall back to the system free impl.

Also, if the lz4 library changes in the future and more mallocs are added, our unit test will start failing, because we assert exact call counts in the decompression code path test.

Contributor Author:

> IIUC you are depending on lz4 making arbitrary malloc/free calls, where exactly one of those malloc/free calls is a call that has exclusive access to the pre-allocated region?

Yes, there are two expected calls to malloc which will use our buffers; all others should go to system malloc. In the current lz4 implementation, during decompression, there are only two malloc calls, and these are expected to result in allocation of our two buffers.
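The dispatch described in this thread can be sketched as below. This is a simplified, hypothetical version of the PR's alloc_lz4_obj/free_lz4_obj: the struct, function names, and boolean bookkeeping are assumptions standing in for the real allocation_state machinery, and no reservation asserts are shown.

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t buffer_size = 4 * 1024 * 1024 + 128 * 1024;
constexpr std::size_t malloc_min_threshold = 128 * 1024 + 1;

struct prealloc_state {
    char* input;
    char* output;
    bool input_given = false;
    bool output_given = false;
};

// Large allocations are served from the two preallocated buffers in
// sequence; everything below the threshold passes through to malloc.
void* alloc_lz4(prealloc_state& st, std::size_t size) {
    if (size < malloc_min_threshold) {
        return std::malloc(size);
    }
    if (!st.input_given) {
        st.input_given = true;
        return st.input;
    }
    if (!st.output_given) {
        st.output_given = true;
        return st.output;
    }
    // lz4 is expected to make exactly two large allocations per decompression
    return nullptr;
}

// Frees of managed addresses only flip bookkeeping state; all other
// addresses fall back to the system free.
void free_lz4(prealloc_state& st, void* p) {
    if (p == st.input) {
        st.input_given = false;
    } else if (p == st.output) {
        st.output_given = false;
    } else {
        std::free(p);
    }
}
```

The key property, matching the discussion above, is that small allocations never touch the preallocated region, and the two large allocations map one-to-one onto the input and output buffers.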

@dotnwat dotnwat changed the title compression: Use preallocated decompression buffers for lz4 CORE-1722: compression: Use preallocated decompression buffers for lz4 Apr 8, 2024
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from 323e9d8 to 755a6a8 Compare April 10, 2024 08:39
A configuration setting is added to disable lz4 buffer preallocation.
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from 755a6a8 to 9b2a03d Compare April 10, 2024 11:35
@abhijat abhijat requested review from andrwng and dotnwat April 10, 2024 15:01
@abhijat (Contributor Author) commented Apr 10, 2024

@dotnwat @andrwng I believe most feedback is addressed, other than swapping out the enum class for bitset. Let me know if this looks good or if we need to revisit using the bitset.

An object which holds a pair of preallocated buffers is added. The buffers are char arrays allocated at startup. The state of allocation is maintained along with the buffers, to ensure correctness when using these buffers with malloc and free calls. The C API is introduced in the next commit.

The custom allocator wraps the managed buffers into a C struct which is used by lz4 to route malloc and free calls. If the feature is disabled, the struct is constructed such that calls are sent directly to malloc and free.

An optional block size parameter is added so that the compression op can set a large block size; this enables testing the newly added preallocated buffers with a large block size.
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from 9b2a03d to 34860a9 Compare April 10, 2024 15:19
@dotnwat (Member) left a comment:

🔥

@dotnwat dotnwat merged commit 83973db into redpanda-data:dev Apr 11, 2024
18 checks passed
@vbotbuildovich (Collaborator):

/backport v23.3.x

@vbotbuildovich (Collaborator):

/backport v23.2.x

@vbotbuildovich (Collaborator):

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17385-v23.3.x-870 remotes/upstream/v23.3.x
git cherry-pick -x 3ec7f4e25af99c06fe141a80b4d6df26b4220e14 cf44e60391ec6406307b618a2ed4fda0f8605bc7 91168752e60c77c8006a382b9cf62ef91e78470b edfba3dd08bee8b63369f3dd8b58e935ec0d43b4 5b3d7a31dec0e75b56b14ae617499441a99a2d3b 34860a98f53c9276775ab45885ab4fa869bb6fa6

Workflow run logs.

@vbotbuildovich (Collaborator):

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17385-v23.2.x-61 remotes/upstream/v23.2.x
git cherry-pick -x 3ec7f4e25af99c06fe141a80b4d6df26b4220e14 cf44e60391ec6406307b618a2ed4fda0f8605bc7 91168752e60c77c8006a382b9cf62ef91e78470b edfba3dd08bee8b63369f3dd8b58e935ec0d43b4 5b3d7a31dec0e75b56b14ae617499441a99a2d3b 34860a98f53c9276775ab45885ab4fa869bb6fa6

Workflow run logs.

dotnwat added a commit to dotnwat/redpanda that referenced this pull request Apr 12, 2024
This is needed after
redpanda-data#17385 where the shared
library version will no longer work.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

Successfully merging this pull request may close these issues.

oversized allocation in lz4_frame_compressor::compress
7 participants