
Fix 128K iobuf zero-copy #15778

Merged

Conversation

@travisdowns travisdowns (Member) commented Dec 19, 2023

Fix iobuf zero-copy for full fragments

When appending one large iobuf to another (or a large temporary_buffer
to an iobuf), one would expect this to be mostly a zero-copy operation,
since the underlying temporary buffers can be shared and iobuf is
designed to accommodate this case.

However, a check in the existing logic which tries to avoid creating
many small fragments also kicks in when we append a series of
full-sized fragments to the iobuf, resulting in a full copy of each of
them.

This is a relatively conservative change to that logic: we now take the
zero-copy path whenever a full-sized (or larger) temporary buffer is
appended. We don't otherwise change the heuristic, though one could
argue that a future improvement which also zero-copies some "large but
not totally full" buffers would be beneficial.

I also remove a redundant check in this heuristic:

 b.size() < available_bytes()

as it is subsumed by the next check.
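A minimal sketch of the resulting copy-vs-zero-copy decision
(illustrative only, not the actual redpanda source; max_frag_bytes and
the should_copy helper are assumptions for this sketch, while last_asz
follows the naming in the diff):

```cpp
#include <cstddef>

// 128K cap on fragment allocations, per the PR title (an assumption
// for this sketch).
constexpr std::size_t max_frag_bytes = 128 * 1024;

// Decide whether an incoming buffer of `size` bytes is copied
// (linearized) into the current fragment or adopted zero-copy.
// `last_asz` is the size of the most recent fragment allocation.
bool should_copy(std::size_t size, std::size_t last_asz) {
    if (size >= max_frag_bytes) {
        // New in this change: a full-sized (or larger) buffer is
        // shared zero-copy rather than copied.
        return false;
    }
    // Smaller buffers are still copied when they fit the existing
    // heuristic, preserving the doubling growth strategy.
    return size <= last_asz;
}
```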

Partial fix for: #15769
Partial or complete fix for: redpanda-data/core-internal#946

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

Add details and clarifications around several member functions,
especially those that copy or append to the iobuf and which may or
may not perform zero-copy operations.

Add a general "mutation" caveat to the iobuf class doc and point there
from several member functions to avoid repeating myself.
@rockwotj rockwotj (Contributor) left a comment


Masterful, thanks! I really appreciate the new comments on the iobuf class/methods.

```cpp
// full-sized fragments is in practice a common operation when buffers
// grow beyond the maximum fragment size.
if (
  b.size() <= last_asz
```
Member
So you kept b.size() <= last_asz instead of b.size() <= available_bytes() to still force linearization even if it does require an extra alloc?

Member Author
I think b.size() <= available_bytes() is subsumed by b.size() <= last_asz: when the former is true, the latter is also true (last_asz is always >= available_bytes()), so it can simply be deleted without changing the behavior. In other words, the deletion was just a performance optimization.
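As a tiny illustration of the subsumption argument (a sketch; it
assumes available_bytes() is the unused tail of the last allocation, so
available <= last_asz always holds):

```cpp
#include <cassert>
#include <cstddef>

void subsumption_example(
  std::size_t b_size, std::size_t available, std::size_t last_asz) {
    assert(available <= last_asz); // invariant assumed from this thread
    if (b_size < available) {
        // b_size < available <= last_asz, so the deleted check could
        // only pass when the remaining check passes anyway.
        assert(b_size <= last_asz);
    }
}
```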

Not sure if that addresses your question?

Member Author
Another answer is that using only b.size() < available_bytes() would be a very large change in behavior and would break a key property: when you append a series of buffers, the growth strategy of the iobuf is still doubling (until the 128K cap).

If the check were just b.size() < available_bytes(), then appending N 1-byte buffers when available_bytes() was initially zero would actually result in N 1-byte fragments. iobuf is trying to avoid that (and there are tests that this doesn't happen).

So we want to preserve the "doubling allocations" behavior more or less regardless of the sequence of appended buffers, which means more copying/linearization until (after this change) we hit the cap.
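A sketch of the doubling growth being protected here (next_alloc_size
is a hypothetical helper and the 512-byte floor is an assumption for
illustration):

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t max_frag_bytes = 128 * 1024; // the 128K cap

// Double the fragment allocation size up to the cap, so N tiny appends
// produce O(log N) fragments rather than N one-byte fragments.
std::size_t next_alloc_size(std::size_t last_asz) {
    return std::min(
      std::max<std::size_t>(512, last_asz * 2), max_frag_bytes);
}
// e.g. 512 -> 1024 -> 2048 -> ... -> 131072, then stays at 131072.
```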

Member
Edited after reading the second reply.

Interesting, I see; that makes sense then. I didn't know iobuf had that property.

Member Author
I did have further thoughts on it. I think a slightly more aggressive version of this change would use zero-copy in more cases: for example, when the buffer is more than half the size of the max alloc (128K), or when it is somehow "similar" in size to the last allocation (e.g., more than half that size). With the latter type of change, though, you have to avoid it "going linear", which is what happens if you implement it naively (or with < last_asz instead of <= last_asz, as I tried originally), because that results in no more doubling.

It seems like the two most important properties are:

  1. Keep the doubling growth
  2. Do zero copy "at the limit" of large iobufs

This fixes (2) without impacting (1); beyond that it's just tuning (though I think we should do the tuning, which is why I did not close #15769).
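A sketch of that more aggressive tuning (an untested idea from this
discussion, not what the PR implements):

```cpp
#include <cstddef>

constexpr std::size_t max_frag_bytes = 128 * 1024;

// Also take the zero-copy path for buffers at least half the maximum
// fragment size, trading some half-empty fragments for fewer copies.
bool should_zero_copy_aggressive(std::size_t size) {
    return size >= max_frag_bytes / 2;
}
```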

Member Author
Even this change is possibly worse on some dimension than the existing behavior for some patterns, e.g., appending 64K+1 buffers alternating with 128K ones. The old behavior would linearize them into full 128K fragments, copying every byte but being fully efficient in memory under the "unique owner" assumption (i.e., the source buffers all go away).

The new behavior will result in (for each two appends) 128K-capacity fragments, but with every other one half-empty.

Similarly with 1-byte and 128K alternating appends, the new behavior will be mostly memory efficient, but with 1-byte buffers alternating with 128K ones (and needing an extra free+alloc to trim down to 1 byte).

I don't think you can totally solve these problems without an oracle, though: one that knows both the future append pattern and the lifetime of the source buffers (i.e., zero-copy is relatively more favorable if the source buffer's lifetime extends beyond the lifetime of this buffer, since then it really saves you memory footprint).

Even without that, some of the methods could be smarter. This append(iobuf) method just has a loop that appends the underlying fragments one by one, and within that append (which hits the code in this PR) we have to make a local decision about copying or not, and about growth. However, the outer append method already knows the full size of the incoming iobuf, so it could see that it's 400K (or whatever) and skip straight to 128K fragments (assuming linearization was what you wanted), avoiding a bunch of pointless small allocations. See the sketch below.
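A sketch of that last idea (everything here is hypothetical; the real
append(iobuf) signature and internals differ): because the outer append
knows the total incoming size up front, it can jump straight to
full-sized fragments:

```cpp
#include <cstddef>

constexpr std::size_t max_frag_bytes = 128 * 1024;

// Hypothetical stand-in for iobuf's growth state.
struct iobuf_sketch {
    std::size_t next_alloc = 512; // assumed starting allocation size
    void set_next_alloc(std::size_t n) { next_alloc = n; }
};

// If the incoming buffer set is already large, pre-size to 128K
// fragments instead of doubling up through small allocations.
void presize_for_append(iobuf_sketch& dst, std::size_t incoming_bytes) {
    if (incoming_bytes >= max_frag_bytes) {
        dst.set_next_alloc(max_frag_bytes);
    }
    // ...then append the source fragments one by one as before.
}
```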

@piyushredpanda piyushredpanda added this to the v23.3.1-rc5 milestone Dec 19, 2023
@travisdowns travisdowns merged commit 50d0446 into redpanda-data:dev Dec 19, 2023
22 checks passed
@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

/backport v23.2.x

@emaxerrno (Contributor)

@travisdowns awesome work.
