Fix 128K iobuf zero-copy #15778
Conversation
Add details and clarifications around several member functions, especially those that copy or append to the iobuf and which may or may not perform zero-copy operations. Add a general "mutation" caveat to the iobuf class doc and point to it from several member functions to avoid repetition.
When appending one large iobuf to another (or a large temporary_buffer to an iobuf) one would expect that this is mostly a zero-copy operation, as the underlying temporary buffers can be shared and iobuf is designed to accommodate this case.

However, a check in the existing logic which tries to avoid creating many small iobufs also kicks in when we append a series of full-sized fragments to the iobuf, resulting in a full copy of each of them.

This change is a relatively conservative change to this logic where we will now do zero-copy when a full-sized (or larger) temporary buffer is appended. We don't otherwise change the heuristic, though one could argue that a future improvement where we also zero-copy some "large but not totally full" buffers would be beneficial.

I also remove a redundant check in this heuristic: b.size() < available_bytes(), as it is subsumed by the next check.

Partial fix for issue redpanda-data#15769.
Force-pushed from bfea6cc to 3953fb5.
Masterful, thanks! I really appreciate the new comments on iobuf class/methods
// full-sized fragments is in practice a common operation when buffers
// grow beyond the maximum fragment size.
if (
  b.size() <= last_asz
So you kept b.size() <= last_asz instead of b.size() <= available_bytes() to still force linearization even if it does require an extra alloc?
I think b.size() <= available_bytes() is subsumed by b.size() <= last_asz, i.e., when the former is true the latter is also true (last_asz is always >= available_bytes()), so it can simply be deleted without changing the behavior. I.e., this deletion was just a performance optimization.
Not sure if that addresses your question?
Another answer is that using only b.size() < available_bytes() would be a very large change in behavior and would break a key property: when you append a series of buffers, the growth strategy of the iobuf is still doubling (until the 128K cap).
If the check were just b.size() < available_bytes(), then appending N 1-byte buffers when available_bytes() was initially zero would actually result in N 1-byte fragments. iobuf is trying to avoid that (and there are tests that this doesn't happen).
So we want to preserve the "doubling allocations" behavior more or less regardless of the sequence of appended buffers, which means more copying/linearization until (after this change) we hit the cap.
Edited after reading the second reply.
Interesting, I see; that makes sense then. I didn't know iobuf had that property.
I did have further thoughts on it: I think a slightly more aggressive version of this change would use zero-copy in more cases, e.g., if the buffer is more than half the size of the max alloc (128K), or some other threshold if it is somehow "similar" in size to the last allocation (e.g., more than half that size). With the latter type of change, though, you have to avoid it "going linear", which is what happens if you just implement that (or use < last_asz instead of <= last_asz, as I tried originally), because that results in no more doubling.
It seems like the two most important properties are:
1. Keep the doubling growth
2. Do zero-copy "at the limit" of large iobufs

This fixes (2) without impacting (1), and beyond that it's just tuning (though I think we should do the tuning, which is why I did not close #15769).
Even this change is possibly worse on some dimension than the existing behavior for some patterns, e.g., alternating appends of 64K+1 and 128K buffers. The old behavior would linearize them into full 128K fragments, copying every byte but being fully efficient in memory under the "unique owner" assumption (i.e., the source buffers all go away).
The new behavior will result in (for each two appends) 128K-capacity fragments, but with every other one being half empty.
Similarly, with alternating 1-byte and 128K appends, the new behavior will be mostly memory efficient, but with 1-byte fragments alternating with 128K ones (and needing an extra free+alloc to trim down to 1 byte).
I don't think you can totally solve these problems without an oracle though: an oracle that knows both about the future append pattern and also the lifetime of the source buffers (i.e., zero-copy is relatively more favorable if the source buffer lifetime extends beyond the lifetime of this buffer, since then it really saves you memory footprint).
Even without that, though, some of the methods could be smarter: this append(iobuf) method just has a loop that appends the underlying fragments one by one, and within that inner append (which hits the code in this PR) we have to make a local decision about copying and growth. However, the outer append method already knows the full size of the incoming iobuf, so it could just see that it's 400K (or whatever) and skip straight to 128K fragments (assuming linearization is what you wanted), avoiding a bunch of pointless small allocations.
/backport v23.3.x
/backport v23.2.x
@travisdowns awesome work.
Fix iobuf zero-copy for full fragments
When appending one large iobuf to another (or a large temporary_buffer
to an iobuf) one would expect that this is mostly a zero-copy operation,
as the underlying temporary buffers can be shared and iobuf is designed
to accommodate this case.
However, a check in the existing logic which tries to avoid creating
many small iobufs also kicks in when we append a series of full-sized
fragments to the iobuf, resulting in a full copy of each of them.
This change is a relatively conservative change to this logic where
we will now do zero-copy when a full-sized (or larger) temporary
buffer is appended. We don't otherwise change the heuristic though one
could argue that a future improvement where we also zero-copy some
"large but not totally full" buffers would be beneficial.
I also remove a redundant check in this heuristic: b.size() < available_bytes(),
as it is subsumed by the next check.
Partial fix for: #15769
Partial or complete fix for: redpanda-data/core-internal#946