
Fix 128K iobuf zero-copy #15778

Merged

Conversation

@travisdowns travisdowns (Member) commented Dec 19, 2023

Fix iobuf zero-copy for full fragments

When appending one large iobuf to another (or a large temporary_buffer
to an iobuf), one would expect this to be mostly a zero-copy operation,
since the underlying temporary buffers can be shared and iobuf is
designed to accommodate this case.

However, a check in the existing logic which tries to avoid creating
many small fragments also kicks in when we append a series of
full-sized fragments to the iobuf, resulting in a full copy of each of
them.

This is a relatively conservative change to that logic: we now take the
zero-copy path whenever a full-sized (or larger) temporary buffer is
appended. We don't otherwise change the heuristic, though one could
argue that a future improvement which also zero-copies some "large but
not totally full" buffers would be beneficial.

I also remove a redundant check in this heuristic:

 b.size() < available_bytes()

as it is subsumed by the next check.
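A minimal sketch of the resulting copy-vs-zero-copy decision
(illustrative only, not the actual redpanda source; max_frag_bytes and
the should_copy helper are assumptions for this sketch, while last_asz
follows the naming in the diff):

```cpp
#include <cstddef>

// 128K cap on fragment allocations, per the PR title (an assumption
// for this sketch).
constexpr std::size_t max_frag_bytes = 128 * 1024;

// Decide whether an incoming buffer of `size` bytes is copied
// (linearized) into the current fragment or adopted zero-copy.
// `last_asz` is the size of the most recent fragment allocation.
bool should_copy(std::size_t size, std::size_t last_asz) {
    if (size >= max_frag_bytes) {
        // New in this change: a full-sized (or larger) buffer is
        // shared zero-copy rather than copied.
        return false;
    }
    // Smaller buffers are still copied when they fit the existing
    // heuristic, preserving the doubling growth strategy.
    return size <= last_asz;
}
```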

Partial fix for: #15769
Partial or complete fix for: redpanda-data/core-internal#946

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

Add details and clarifications around several member functions,
especially those that copy or append to the iobuf and which may or
may not perform zero-copy operations.

Add a general "mutation" caveat to the iobuf class doc and point there
from several member functions to avoid repeating myself.
@rockwotj rockwotj (Contributor) left a comment


Masterful, thanks! I really appreciate the new comments on the iobuf class/methods.

```cpp
// full-sized fragments is in practice a common operation when buffers
// grow beyond the maximum fragment size.
if (
  b.size() <= last_asz
```
Member
So you kept b.size() <= last_asz instead of b.size() <= available_bytes() to still force linearization even if it does require an extra alloc?

Member Author
I think b.size() <= available_bytes() is subsumed by b.size() <= last_asz: when the former is true, the latter is also true (last_asz is always >= available_bytes()), so it can simply be deleted without changing the behavior. In other words, the deletion was just a performance optimization.
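As a tiny illustration of the subsumption argument (a sketch; it
assumes available_bytes() is the unused tail of the last allocation, so
available <= last_asz always holds):

```cpp
#include <cassert>
#include <cstddef>

void subsumption_example(
  std::size_t b_size, std::size_t available, std::size_t last_asz) {
    assert(available <= last_asz); // invariant assumed from this thread
    if (b_size < available) {
        // b_size < available <= last_asz, so the deleted check could
        // only pass when the remaining check passes anyway.
        assert(b_size <= last_asz);
    }
}
```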

Not sure if that addresses your question?

Member Author
Another answer is that using only b.size() < available_bytes() would be a very large change in behavior and would break a key property: when you append a series of buffers, the growth strategy of the iobuf is still doubling (until the 128K cap).

If the check were just b.size() < available_bytes(), then appending N 1-byte buffers when available_bytes() was initially zero would actually result in N 1-byte fragments. iobuf is trying to avoid that (and there are tests that this doesn't happen).

So we want to preserve the "doubling allocations" behavior more or less regardless of the sequence of appended buffers, which means more copying/linearization until (after this change) we hit the cap.
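A sketch of the doubling growth being protected here (next_alloc_size
is a hypothetical helper and the 512-byte floor is an assumption for
illustration):

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t max_frag_bytes = 128 * 1024; // the 128K cap

// Double the fragment allocation size up to the cap, so N tiny appends
// produce O(log N) fragments rather than N one-byte fragments.
std::size_t next_alloc_size(std::size_t last_asz) {
    return std::min(
      std::max<std::size_t>(512, last_asz * 2), max_frag_bytes);
}
// e.g. 512 -> 1024 -> 2048 -> ... -> 131072, then stays at 131072.
```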

Member
Edited after reading the second reply.

Interesting, I see; that makes sense then. I didn't know iobuf had that property.

Member Author
I did have further thoughts on it. I think a slightly more aggressive version of this change would use zero-copy in more cases: for example, when the buffer is more than half the size of the max alloc (128K), or when it is somehow "similar" in size to the last allocation (e.g., more than half that size). With the latter type of change, though, you have to avoid it "going linear", which is what happens if you implement it naively (or with < last_asz instead of <= last_asz, as I tried originally), because that results in no more doubling.

It seems like the two most important properties are:

  1. Keep the doubling growth
  2. Do zero copy "at the limit" of large iobufs

This fixes (2) without impacting (1); beyond that it's just tuning (though I think we should do the tuning, which is why I did not close #15769).
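A sketch of that more aggressive tuning (an untested idea from this
discussion, not what the PR implements):

```cpp
#include <cstddef>

constexpr std::size_t max_frag_bytes = 128 * 1024;

// Also take the zero-copy path for buffers at least half the maximum
// fragment size, trading some half-empty fragments for fewer copies.
bool should_zero_copy_aggressive(std::size_t size) {
    return size >= max_frag_bytes / 2;
}
```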

Member Author
Even this change is possibly worse on some dimension than the existing behavior for some patterns, e.g., appending 64K+1 buffers alternating with 128K ones. The old behavior would linearize them into full 128K fragments, copying every byte but being fully efficient in memory under the "unique owner" assumption (i.e., the source buffers all go away).

The new behavior will result in (for each two appends) 128K-capacity fragments, but with every other one half-empty.

Similarly with 1-byte and 128K alternating appends, the new behavior will be mostly memory efficient, but with 1-byte buffers alternating with 128K ones (and needing an extra free+alloc to trim down to 1 byte).

I don't think you can totally solve these problems without an oracle, though: one that knows both the future append pattern and the lifetime of the source buffers (i.e., zero-copy is relatively more favorable if the source buffer's lifetime extends beyond the lifetime of this buffer, since then it really saves you memory footprint).

Even without that, some of the methods could be smarter. This append(iobuf) method just has a loop that appends the underlying fragments one by one, and within that append (which hits the code in this PR) we have to make a local decision about copying or not, and about growth. However, the outer append method already knows the full size of the incoming iobuf, so it could see that it's 400K (or whatever) and skip straight to 128K fragments (assuming linearization was what you wanted), avoiding a bunch of pointless small allocations. See the sketch below.
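A sketch of that last idea (everything here is hypothetical; the real
append(iobuf) signature and internals differ): because the outer append
knows the total incoming size up front, it can jump straight to
full-sized fragments:

```cpp
#include <cstddef>

constexpr std::size_t max_frag_bytes = 128 * 1024;

// Hypothetical stand-in for iobuf's growth state.
struct iobuf_sketch {
    std::size_t next_alloc = 512; // assumed starting allocation size
    void set_next_alloc(std::size_t n) { next_alloc = n; }
};

// If the incoming buffer set is already large, pre-size to 128K
// fragments instead of doubling up through small allocations.
void presize_for_append(iobuf_sketch& dst, std::size_t incoming_bytes) {
    if (incoming_bytes >= max_frag_bytes) {
        dst.set_next_alloc(max_frag_bytes);
    }
    // ...then append the source fragments one by one as before.
}
```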

@piyushredpanda piyushredpanda added this to the v23.3.1-rc5 milestone Dec 19, 2023
@travisdowns travisdowns merged commit 50d0446 into redpanda-data:dev Dec 19, 2023
22 checks passed
@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

/backport v23.2.x

@emaxerrno (Contributor)

@travisdowns awesome work.
