
CORE-1722: compression: Use preallocated decompression buffers for lz4 #17385

Merged

Conversation


@abhijat abhijat commented Mar 26, 2024

LZ4 decompression uses malloc to allocate memory for temporary buffers. The allocated size depends on the input and can reach up to two buffers of slightly over 4 MiB each per decompression.

In cases where a contiguous space of 4 MiB is not available after redpanda has been running for a while, this can result in an OOM. This is usually seen in the context of compaction, where a segment needs to be read through to discard old values for a key, requiring decompression of the record batches.

This PR switches LZ4 to use a custom memory allocator and a pair of preallocated 4 MiB + 128 KiB buffers per shard. The custom memory allocator intercepts calls to malloc and returns the preallocated buffers if the allocation size is larger than a threshold. In the corresponding calls to free, the state of the managed buffers is updated.

A caller must reserve buffers before starting decompression. The current decompression operation is synchronous, so once a shard begins decompressing a record batch it cannot suspend the operation, and the buffers are implicitly protected against parallel access; the buffers in this PR are nevertheless accessed only after reserving a semaphore unit, to protect against future changes to the decompression routine.

While this PR only allows for one decompression operation per shard, if we move to the object_pool utility and support multiple buffer pairs per shard, i.e. multiple decompression ops per shard in parallel, then the current reserve/release methods will serve as a base to control concurrency.

The following new settings are introduced to control the operation:

  1. lz4_decompress_buffer_size_bytes: size per buffer (two are allocated per shard)
  2. lz4_decompress_malloc_min_threshold: minimum threshold below which preallocated buffers are not used and allocation passes through to malloc
  3. lz4_decompress_reusable_buffers_enabled: allows turning preallocated buffers off; when disabled, all calls via lz4 pass through to malloc

FIXES #16179

force push - removes semaphore, removes settings for sizes

force push - fixes semantics of deallocation order and state transition

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from b50cd6f to 7fda580 Compare March 26, 2024 08:54
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 8 times, most recently from a505fe9 to a08b80b Compare March 28, 2024 08:07
@abhijat abhijat marked this pull request as draft March 28, 2024 08:07
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 4 times, most recently from 628788e to 1605681 Compare April 1, 2024 05:48
@abhijat abhijat mentioned this pull request Apr 1, 2024
6 tasks
@abhijat abhijat changed the title WIP: decompression buffers compression: Use preallocated decompression buffers for lz4 Apr 1, 2024
@abhijat abhijat marked this pull request as ready for review April 1, 2024 06:39
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 2 times, most recently from 71dba19 to dedb0ad Compare April 1, 2024 12:30
@abhijat (Contributor Author) commented Apr 1, 2024

test failures seem related to changes in this PR

@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch 4 times, most recently from 64d43b8 to 2eb41db Compare April 3, 2024 08:47
"Size of buffers preallocated for LZ4 decompression. Two buffers "
"allocated per shard, so twice of this value is reserved at startup",
{.needs_restart = needs_restart::yes, .visibility = visibility::tunable},
4224_KiB)
Contributor:

nit: is this 4_Mib + 8?
maybe 4_Mib + sizeof(uint64_t) would give a bit more context

Contributor Author:

This is 4 MiB + 128 KiB which is the maximum lz4 allocates during decompression (https://github.com/lz4/lz4/blob/dev/lib/lz4frame.c#L1678-L1687)

Contributor Author:

Would it be clearer to write the two quantities separately, i.e. 4_MiB + ...?

Contributor:

ah, I see. Up to you; I feel it would be less of a "magic constant" if it was broken into 4_MiB + 128_KiB.
Only if there are new commits to add.

Contributor Author:

changed
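The split settled on in this thread can be written out with plain constants (illustrative only; the codebase uses its own `_MiB`/`_KiB` literals and the name below is hypothetical):

```cpp
#include <cstddef>

// 4 MiB for the largest block size plus 128 KiB of lz4 overhead during
// decompression, matching the 4224_KiB default discussed above.
constexpr std::size_t lz4_decompress_buffer_size = 4 * 1024 * 1024 + 128 * 1024;
static_assert(lz4_decompress_buffer_size == 4224 * 1024, "equals 4224 KiB");
```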

@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from c40169e to 323e9d8 Compare April 5, 2024 08:24
@andrwng (Contributor) left a comment:

Pretty much LGTM. One substantive question about whether we should expose the buffer size. Otherwise, nothing major from me. Nice work!

Comment on lines 107 to 138
inline lz4_decompression_buffers::alloc_ctx::allocation_state operator|(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) | t(b));
}

inline void operator|=(
  lz4_decompression_buffers::alloc_ctx::allocation_state& a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    a = (a | b);
}

inline lz4_decompression_buffers::alloc_ctx::allocation_state operator&(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) & t(b));
}

inline void operator&=(
  lz4_decompression_buffers::alloc_ctx::allocation_state& a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    a = (a & b);
}

inline lz4_decompression_buffers::alloc_ctx::allocation_state
operator~(lz4_decompression_buffers::alloc_ctx::allocation_state a) {
    return lz4_decompression_buffers::alloc_ctx::allocation_state(
      ~std::underlying_type_t<
        lz4_decompression_buffers::alloc_ctx::allocation_state>(a));
}
Contributor:

Sounds good. FWIW now that I understand it, it's not as odd as I thought, but perhaps it's worth adding a comment to the allocation state definition about it (e.g. to discourage folks from treating it as a regular enum)

Comment on lines +203 to +221
case both_buffers_allocated:
st->deallocated();
Contributor:

Just curious, does lz4 always call free_lz4_obj, even if the decompression failed? And do the buffers need to be cleared at all?

Contributor Author:

lz4 does not guarantee to free memory; the user has to free it using the function LZ4F_freeDecompressionContext. In our case it is guaranteed because we wrap the decompression context in a unique_ptr with a custom deleter. The deleter calls LZ4F_freeDecompressionContext when the unique_ptr is destroyed, so free_lz4_obj will always be called via LZ4F_freeDecompressionContext -> LZ4_Free -> free_lz4_obj
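The guarantee described here is plain RAII. A minimal self-contained sketch of the pattern, where the stub type and free function are stand-ins (assumptions) for LZ4F_dctx and LZ4F_freeDecompressionContext from lz4frame.h:

```cpp
#include <memory>

struct lz4f_dctx_stub {};

static int frees = 0; // counts deleter invocations for demonstration

// Stand-in for LZ4F_freeDecompressionContext; in the real code this
// routes through LZ4_Free and ultimately free_lz4_obj.
void stub_free_decompression_context(lz4f_dctx_stub* ctx) {
    ++frees;
    delete ctx;
}

struct dctx_deleter {
    void operator()(lz4f_dctx_stub* ctx) const noexcept {
        stub_free_decompression_context(ctx);
    }
};

using dctx_ptr = std::unique_ptr<lz4f_dctx_stub, dctx_deleter>;
```

The deleter runs whenever the unique_ptr goes out of scope, including on early return or exception, which is why free_lz4_obj is always reached regardless of whether decompression succeeded.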

Comment on lines +135 to +138
void* alloc_lz4_obj(void* state, size_t size) {
auto* st = static_cast<compression::lz4_decompression_buffers*>(state);
Contributor:

> do we want to expose the buffer size as a config parameter

Right, I would lean towards not exposing it, other than enabling/disabling the extra buffers. Otherwise it seems too easy to get Redpanda to crash

@dotnwat (Member) left a comment:

I think we should give the restart_required::no configuration option a thought; it doesn't seem like we can reliably turn this on after the system has been running for a while, since it cannot make a big allocation.

With a 128 KiB max allocation size, what is the point in having the small allocation code path? It seems like that might be simplified?

A few other nits / suggestions, primarily related to simplifying the implementation, but overall it looks like it'll work ok.

Comment on lines 2516 to 2522
, lz4_decompress_malloc_min_threshold(
*this,
"lz4_decompress_malloc_min_threshold",
"Threshold below which preallocated buffers will not be used during "
"LZ4 decompression, if preallocated buffer use is enabled",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
128_KiB + 1)
Member:

is there some advantage to not using pre-allocated buffers? if they are there anyway, why not always use them?

"Enable reusable preallocated buffers for LZ4 decompression, helps "
"prevent OOM errors by avoiding allocation of large memory blocks "
"during decompression",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
Member:

is it problematic to have this option be needs_restart::no? After the system runs for a while, it's likely to become quite difficult/unlikely that a 4 MiB allocation would succeed.

Contributor Author:

changed this to needs_restart::yes

_buffers = {
.input_buffer = ss::allocate_aligned_buffer<char>(buffer_size, 8),
.output_buffer = ss::allocate_aligned_buffer<char>(buffer_size, 8),
.state = alloc_ctx::allocation_state::no_buffers_allocated,
Member:

> Here the allocated refers to given out by alloc_lz4_obj to the caller.

It would be helpful if, in the header for alloc_ctx, you defined what the flags are / what the states are used for in bookkeeping.
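A sketch of the kind of documentation being asked for. The value assignments and names below are illustrative assumptions, not the actual Redpanda definitions:

```cpp
#include <cstdint>

// allocation_state is a bit-flag enum, not a plain enumeration: values
// are combined with the overloaded |, &, and ~ operators as buffers are
// handed out by alloc_lz4_obj and returned via free_lz4_obj.
enum class allocation_state : std::uint8_t {
    no_buffers_allocated = 0,    // neither buffer handed out to lz4
    input_buffer_allocated = 1,  // input buffer currently held by lz4
    output_buffer_allocated = 2, // output buffer currently held by lz4
    both_buffers_allocated = 3,  // input | output
};
```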

Comment on lines 49 to 53
bool lz4_decompression_buffers::has_reserved_buffers() const {
return _reservation.current() == 0;
}
Member:

> Let me know if there are any suggestions.

is it being used like a lock/mutex?

Comment on lines 55 to 61
if (_disabled) {
co_return std::nullopt;
}

co_return co_await ss::get_units(_reservation, 1);
Member:

Yeah, if there is no contention then there is no overhead and no need for a fast path. Also, if there is no contention, there is no need for a lock?

Comment on lines 124 to 125
inline lz4_decompression_buffers::alloc_ctx::allocation_state operator&(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) & t(b));
}
Member:

Instead of writing this out explicitly, why not use bitset<4>? You can even assign names, like this:

enum class states { a, b, c, d, num_flags };
std::bitset<static_cast<size_t>(states::num_flags)> flags;

and then all these operations can be done with bitset operations?

Contributor Author:

I started exploring the use of a bitset because it looked like a good fit, but then I realized it contains just plain ints, so we lose the type safety of the enum class, and the compiler warnings from incomplete switch statements etc. that enum class provides.

I did move the bitwise operators to an anonymous namespace in the source file so they are not part of the public API.

Member:

This makes sense. No need to swap out for bitset.

Comment on lines 117 to 151
return malloc(size);
}

vassert(
st->has_reserved_buffers(),
"must reserve space with lz4_decompression_buffers::reserve_buffers() "
"before allocation");

auto& bufs = st->buffers();
Member:

If we already have a chunk of 4 MB memory, wouldn't using malloc be strictly more work than using the arena? What's the advantage to allocating more memory when it's small?

Contributor Author:

When we opt in to use this custom memory manager with lz4 it will route all calls to malloc and free via our provided functions. So if there are any calls to malloc made by lz4 during decompression other than the ones for the two buffers we care about, those should still be serviceable. The idea behind falling back to malloc for small buffers is to service those calls. We don't want to use the preallocated buffers for those calls.

On the other hand, I looked through the usages of LZ4F_malloc and did not find any calls to it other than those for these two buffers in the decompression code path. So it might work, but I'm not sure it is guaranteed that the calls will remain the same within lz4.

Member:

> When we opt in to use this custom memory manager with lz4 it will route all calls to malloc and free via our provided functions. So if there are any calls to malloc made by lz4 during decompression other than the ones for the two buffers we care about, those should still be serviceable. The idea behind falling back to malloc for small buffers is to service those calls. We don't want to use the preallocated buffers for those calls.

I see. This seems very dangerous, right?

> We don't want to use the preallocated buffers for those calls.

Specifically, you can't use preallocated buffers for those calls unless there is exactly one such call, right? IIUC you are depending on lz4 making arbitrary malloc/free calls, where exactly one of those malloc/free calls is a call that has exclusive access to the pre-allocated region?

Contributor Author:

It should be safe. Right now we can describe the calls to malloc during decompression as follows:

  • all calls to malloc <= 128_KiB go to the system malloc implementation
  • all calls to malloc > 128_KiB go to our two buffers, which are allocated in sequence.

We know from the source code of LZ4 that during decompression it makes exactly two large sized malloc calls, one for input buffer and one for output buffer.

Both are relatively similar in size, and depend on block size. So they can be 64 KiB, 256 KiB, 1 MiB or 4 MiB each. In all but the first case our buffers will be allocated out. For all other smaller sized malloc calls during decompression system malloc will be used.

Note that the malloc calls are made when the decompression context is in the init stage which, per a review of the decompression code path, only happens once.

If we look at the free calls they are a lot simpler:

  • For addresses belonging to our two preallocated buffers, we just update the state in our free implementation.
  • For all other addresses we fall back to the system free impl.

Also, if the lz4 library changes in the future and more mallocs are added, our unit test will start failing, because we assert exact call counts in the decompression code path test.

Contributor Author:

> IIUC you are depending on lz4 making arbitrary malloc/free calls, where exactly one of those malloc/free calls is a call that has exclusive access to the pre-allocated region?

Yes, there are two expected calls to malloc which will use our buffers; all others should go to system malloc. In the current lz4 implementation, during decompression, there are only two malloc calls, and these are expected to result in allocation of our two buffers.
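The dispatch described in this thread can be sketched as below. This is a simplified, hypothetical version of the PR's alloc_lz4_obj/free_lz4_obj: the struct, function names, and boolean bookkeeping are assumptions standing in for the real allocation_state machinery, and no reservation asserts are shown.

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t buffer_size = 4 * 1024 * 1024 + 128 * 1024;
constexpr std::size_t malloc_min_threshold = 128 * 1024 + 1;

struct prealloc_state {
    char* input;
    char* output;
    bool input_given = false;
    bool output_given = false;
};

// Large allocations are served from the two preallocated buffers in
// sequence; everything below the threshold passes through to malloc.
void* alloc_lz4(prealloc_state& st, std::size_t size) {
    if (size < malloc_min_threshold) {
        return std::malloc(size);
    }
    if (!st.input_given) {
        st.input_given = true;
        return st.input;
    }
    if (!st.output_given) {
        st.output_given = true;
        return st.output;
    }
    // lz4 is expected to make exactly two large allocations per decompression
    return nullptr;
}

// Frees of managed addresses only flip bookkeeping state; all other
// addresses fall back to the system free.
void free_lz4(prealloc_state& st, void* p) {
    if (p == st.input) {
        st.input_given = false;
    } else if (p == st.output) {
        st.output_given = false;
    } else {
        std::free(p);
    }
}
```

The key property, matching the discussion above, is that small allocations never touch the preallocated region, and the two large allocations map one-to-one onto the input and output buffers.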

@dotnwat dotnwat changed the title compression: Use preallocated decompression buffers for lz4 CORE-1722: compression: Use preallocated decompression buffers for lz4 Apr 8, 2024
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from 323e9d8 to 755a6a8 Compare April 10, 2024 08:39
A configuration setting is added to disable lz4 buffer preallocation.
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from 755a6a8 to 9b2a03d Compare April 10, 2024 11:35
@abhijat abhijat requested review from andrwng and dotnwat April 10, 2024 15:01
@abhijat (Contributor Author) commented Apr 10, 2024

@dotnwat @andrwng I believe most feedback is addressed, other than swapping out the enum class for bitset. Let me know if this looks good or if we need to revisit using the bitset.

An object which holds a pair of preallocated buffers is added. The buffers are char arrays allocated at startup. The state of allocation is maintained along with the buffers, to ensure correctness when using these buffers with malloc and free calls. The C API is introduced in the next commit.

The custom allocator wraps the managed buffers into a C struct which is used by lz4 to route malloc and free calls. If the feature is disabled, the struct is constructed such that calls are sent directly to malloc and free.

An optional block size parameter is added so that the compression op can set a large block size; this enables testing the newly added preallocated buffers with a large block size.
@abhijat abhijat force-pushed the issue/16179/lz4-oversized-alloc-fix branch from 9b2a03d to 34860a9 Compare April 10, 2024 15:19
@dotnwat (Member) left a comment:

🔥

@dotnwat dotnwat merged commit 83973db into redpanda-data:dev Apr 11, 2024
18 checks passed
@vbotbuildovich (Collaborator):

/backport v23.3.x

@vbotbuildovich (Collaborator):

/backport v23.2.x

@vbotbuildovich (Collaborator):

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17385-v23.3.x-870 remotes/upstream/v23.3.x
git cherry-pick -x 3ec7f4e25af99c06fe141a80b4d6df26b4220e14 cf44e60391ec6406307b618a2ed4fda0f8605bc7 91168752e60c77c8006a382b9cf62ef91e78470b edfba3dd08bee8b63369f3dd8b58e935ec0d43b4 5b3d7a31dec0e75b56b14ae617499441a99a2d3b 34860a98f53c9276775ab45885ab4fa869bb6fa6

Workflow run logs.

@vbotbuildovich (Collaborator):

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17385-v23.2.x-61 remotes/upstream/v23.2.x
git cherry-pick -x 3ec7f4e25af99c06fe141a80b4d6df26b4220e14 cf44e60391ec6406307b618a2ed4fda0f8605bc7 91168752e60c77c8006a382b9cf62ef91e78470b edfba3dd08bee8b63369f3dd8b58e935ec0d43b4 5b3d7a31dec0e75b56b14ae617499441a99a2d3b 34860a98f53c9276775ab45885ab4fa869bb6fa6

Workflow run logs.

dotnwat added a commit to dotnwat/redpanda that referenced this pull request Apr 12, 2024
This is needed after
redpanda-data#17385 where the shared
library version will no longer work.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

Successfully merging this pull request may close these issues.

oversized allocation in lz4_frame_compressor::compress
7 participants