CORE-1722: compression: Use preallocated decompression buffers for lz4 #17385
Conversation
Force-pushed from b50cd6f to 7fda580
Force-pushed from a505fe9 to a08b80b
Force-pushed from 628788e to 1605681
Force-pushed from 71dba19 to dedb0ad
test failures seem related to changes in this PR
Force-pushed from 64d43b8 to 2eb41db
src/v/config/configuration.cc
Outdated
  "Size of buffers preallocated for LZ4 decompression. Two buffers "
  "allocated per shard, so twice of this value is reserved at startup",
  {.needs_restart = needs_restart::yes, .visibility = visibility::tunable},
  4224_KiB)
nit: is this 4_MiB + 8? Maybe 4_MiB + sizeof(uint64_t) would give a bit more context.
This is 4 MiB + 128 KiB, which is the maximum lz4 allocates during decompression (https://github.com/lz4/lz4/blob/dev/lib/lz4frame.c#L1678-L1687)
Would it be clearer to write the two quantities separately, i.e. 4_MiB + ...?
ah, I see. Up to you; I feel it would be less of a "magic constant" if it was broken into 4_MiB + 128_KiB.
Only if there are new commits to add.
changed
Force-pushed from c40169e to 323e9d8
Pretty much LGTM. One substantive question about whether we should expose the buffer size. Otherwise, nothing major from me. Nice work!
inline lz4_decompression_buffers::alloc_ctx::allocation_state operator|(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) | t(b));
}

inline void operator|=(
  lz4_decompression_buffers::alloc_ctx::allocation_state& a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    a = (a | b);
}

inline lz4_decompression_buffers::alloc_ctx::allocation_state operator&(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) & t(b));
}

inline void operator&=(
  lz4_decompression_buffers::alloc_ctx::allocation_state& a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    a = (a & b);
}

inline lz4_decompression_buffers::alloc_ctx::allocation_state
operator~(lz4_decompression_buffers::alloc_ctx::allocation_state a) {
    return lz4_decompression_buffers::alloc_ctx::allocation_state(
      ~std::underlying_type_t<
        lz4_decompression_buffers::alloc_ctx::allocation_state>(a));
}
Sounds good. FWIW now that I understand it, it's not as odd as I thought, but perhaps it's worth adding a comment to the allocation state definition about it (e.g. to discourage folks from treating it as a regular enum)
case both_buffers_allocated:
    st->deallocated();
Just curious, does lz4 always call free_lz4_obj, even if the decompression failed? And do the buffers need to be cleared at all?
lz4 does not guarantee to free memory; the user has to free it via the function LZ4F_freeDecompressionContext. In our case it is guaranteed because we wrap the decompression context in a unique_ptr with a custom deleter; this deleter calls LZ4F_freeDecompressionContext for us when the unique_ptr is destroyed, so free_lz4_obj will always be called via LZ4F_freeDecompressionContext -> LZ4_Free -> free_lz4_obj
This is the relevant type definition https://github.com/redpanda-data/redpanda/blob/dev/src/v/compression/internal/lz4_frame_compressor.cc#L53-L59
void* alloc_lz4_obj(void* state, size_t size) {
    auto* st = static_cast<compression::lz4_decompression_buffers*>(state);
do we want to expose the buffer size as a config parameter?
Right, I would lean towards not exposing it, other than enabling/disabling the extra buffers. Otherwise it seems too easy to get Redpanda to crash
i think we should give the restart_required::no configuration option a thought. Like, it doesn't seem like we can reliably turn this on after the system has been running for a while, since it cannot make a big allocation.
with a 128kb max allocation size, what is the point of having the small allocation code path? It seems like that might be simplified?
a few other nits / suggestions, primarily related to simplifying the implementation. But overall, it looks like it'll work ok.
src/v/config/configuration.cc
Outdated
, lz4_decompress_malloc_min_threshold(
  *this,
  "lz4_decompress_malloc_min_threshold",
  "Threshold below which preallocated buffers will not be used during "
  "LZ4 decompression, if preallocated buffer use is enabled",
  {.needs_restart = needs_restart::no, .visibility = visibility::tunable},
  128_KiB + 1)
is there some advantage to not using pre-allocated buffers? if they are there anyway, why not always use them?
src/v/config/configuration.cc
Outdated
  "Enable reusable preallocated buffers for LZ4 decompression, helps "
  "prevent OOM errors by avoiding allocation of large memory blocks "
  "during decompression",
  {.needs_restart = needs_restart::no, .visibility = visibility::tunable},
is it problematic to have this option be needs_restart::no? After the system runs for a while, it's likely to become quite difficult/unlikely that a 4mb allocation would succeed.
changed this to needs_restart::yes
_buffers = {
  .input_buffer = ss::allocate_aligned_buffer<char>(buffer_size, 8),
  .output_buffer = ss::allocate_aligned_buffer<char>(buffer_size, 8),
  .state = alloc_ctx::allocation_state::no_buffers_allocated,
Here "allocated" refers to the buffer having been given out by alloc_lz4_obj to the caller.
It would be helpful if, in the header for alloc_ctx, you defined what the flags/states are used for in the bookkeeping.
bool lz4_decompression_buffers::has_reserved_buffers() const {
    return _reservation.current() == 0;
}
Let me know if there are any suggestions.
is it being used like a lock/mutex?
if (_disabled) {
    co_return std::nullopt;
}

co_return co_await ss::get_units(_reservation, 1);
yeah, if there is no contention, then there is no overhead and no need for a fast path. Also, if there is no contention, there is no need for a lock?
inline lz4_decompression_buffers::alloc_ctx::allocation_state operator&(
  lz4_decompression_buffers::alloc_ctx::allocation_state a,
  lz4_decompression_buffers::alloc_ctx::allocation_state b) {
    using t = std::underlying_type_t<
      lz4_decompression_buffers::alloc_ctx::allocation_state>;
    return lz4_decompression_buffers::alloc_ctx::allocation_state(t(a) & t(b));
}
Instead of writing this out explicitly, why not use bitset<4>? You can even assign names, like this:

enum class states { a, b, c, d, num_flags };
std::bitset<static_cast<size_t>(states::num_flags)> flags;
and then all these operations can be done with bitset operations?
I started exploring the use of a bitset because it looked like a good fit, but then I realized it contains just plain ints, so we lose the type safety of the enum class, and the compiler warnings from incomplete switch statements etc. that enum class provides.
I did move the bitwise operators to an anonymous namespace in the source file so they are not part of the public API.
This makes sense. No need to swap out for bitset.
    return malloc(size);
}

vassert(
  st->has_reserved_buffers(),
  "must reserve space with lz4_decompression_buffers::reserve_buffers() "
  "before allocation");

auto& bufs = st->buffers();
if we already have a chunk of 4mb memory, wouldn't using malloc be strictly more work than using the arena? What's the advantage of allocating more memory when it's small?
When we opt in to use this custom memory manager with lz4, it will route all calls to malloc and free via our provided functions. So if there are any calls to malloc made by lz4 during decompression other than the ones for the two buffers we care about, those should still be serviceable. The idea behind falling back to malloc for small buffers is to service those calls. We don't want to use the preallocated buffers for those calls.
On the other hand, I looked through the usages of LZ4F_malloc and did not find any calls to it other than those two buffers in the decompression code path. So it might work, but I'm not sure it is guaranteed that the calls will remain the same within lz4.
When we opt in to use this custom memory manager with lz4 it will route all calls to malloc and free via our provided functions. So if there are any calls to malloc made by lz4 during decompression other than the ones for the two buffers we care about, those should still be serviceable. The idea behind falling back to malloc for small buffers is to service those calls. We don't want to use the preallocated buffers for those calls.
I see. This seems very dangerous, right?
We don't want to use the preallocated buffers for those calls.
Specifically, you can't use preallocated buffers for those calls unless there is exactly one such call, right? IIUC you are depending on lz4 making arbitrary malloc/free calls, where exactly one of those malloc/free calls has exclusive access to the pre-allocated region?
It should be safe. Right now we can describe the calls to malloc during decompression as below:
- all calls to malloc <= 128_KiB go to the system malloc implementation
- all calls to malloc > 128_KiB go to our two buffers, which are allocated in sequence.

We know from the source code of LZ4 that during decompression it makes exactly two large-sized malloc calls, one for the input buffer and one for the output buffer. Both are relatively similar in size and depend on block size, so they can be 64 KiB, 256 KiB, 1 MiB or 4 MiB each. In all but the first case our buffers will be allocated out. For all other smaller-sized malloc calls during decompression, system malloc will be used.
Note that the malloc calls are made when the decompression context is in the init stage, which only happens once, per a review of the decompression code path.
If we look at the free calls they are a lot simpler:
- For addresses belonging to our two preallocated buffers, we just update the state in our free implementation.
- For all other addresses we fall back to the system free impl.

Also, if the lz4 library changes in the future and more mallocs are added, our unit test will start failing because we have exact call counts in the decompression code path test.
IIUC you are depending on lz4 making arbitrary malloc/free calls, where exactly one of those malloc/free calls has exclusive access to the pre-allocated region?
Yes, there are two calls to malloc expected which will use our buffers. All others should go to system malloc. In the current lz4 implementation, during decompression, there are only two malloc calls; these are expected to result in allocation of our two buffers.
Force-pushed from 323e9d8 to 755a6a8
A configuration setting is added to disable lz4 buffer preallocation.
Force-pushed from 755a6a8 to 9b2a03d
An object which holds a pair of preallocated buffers is added. The buffers are char arrays allocated at startup. The state of allocation is maintained along with the buffers, to ensure correctness when using these buffers with malloc and free calls. The C API is introduced in the next commit.
The custom allocator wraps the managed buffers into a C struct which is used by lz4 to route malloc and free calls. If the feature is disabled, the struct is constructed such that calls are sent directly to malloc and free.
An optional block size parameter is added so that the compression op can set a large block size; this enables testing the newly added preallocated buffers with a large block size.
Force-pushed from 9b2a03d to 34860a9
🔥
/backport v23.3.x
/backport v23.2.x
Failed to create a backport PR to v23.3.x branch. I tried:
Failed to create a backport PR to v23.2.x branch. I tried:
This is needed after redpanda-data#17385 where the shared library version will no longer work. Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
LZ4 decompression uses malloc to allocate memory for temporary buffers. The size allocated depends on the input and can range up to two buffers of slightly over 4 MiB each per decompression.
In cases where contiguous space of 4 MiB is not available after redpanda has been running for a while, this can result in an OOM. This is usually seen in the context of compaction, where a segment needs to be read through to discard old values for a key, requiring decompression of the record batches.
This PR switches LZ4 to use a custom memory allocator and a pair of preallocated 4 MiB + 128 KiB buffers per shard. The custom memory allocator intercepts calls to malloc and returns the preallocated buffers if the allocation size is larger than a threshold. In corresponding calls to free, the state of the managed buffers is updated.
A caller must reserve buffers before starting decompression. The current decompression operation is synchronous, so once a shard begins decompressing a record batch it cannot suspend the operation, and the buffers are implicitly protected against parallel access; but the buffers in this PR are still accessed after reserving a semaphore unit to protect against future changes to the decompression routine.
While this PR only allows for one decompression operation per shard, if we move to use the object_pool utility and support multiple buffer pairs per shard, i.e. multiple decompression ops per shard in parallel, then the current reserve/release methods will serve as a base to control concurrency.
The following new settings are introduced to control the operation:
- lz4_decompress_buffer_size_bytes: size per buffer (two are allocated per shard)
- lz4_decompress_malloc_min_threshold: minimum threshold below which preallocated buffers are not used and we pass through to malloc for allocation
- lz4_decompress_reusable_buffers_enabled: enables turning off preallocated buffers; all calls via lz4 are passed through to malloc

FIXES #16179
force push - removes semaphore, removes settings for sizes
force push - fixes semantics of deallocation order and state transition