Limit fragmentation in write_vectored
#1640
Conversation
Force-pushed the write_vectored branch from b79a446 to cc4d000 (compare)
Codecov Report (Attention)

Additional details and impacted files:

@@            Coverage Diff            @@
##              main    #1640    +/-  ##
========================================
  Coverage    95.94%   95.95%
========================================
  Files           81       81
  Lines        18590    18827    +237
========================================
+ Hits         17837    18065    +228
- Misses         753      762      +9

☔ View full report in Codecov by Sentry.
On the topic of allocation/copy: we put a lot of effort into Sozu to handle HTTP messages (parsing and proxying) with zero copy in most cases. Very few dynamic allocations are necessary either, as Sozu preallocates and reuses a pool of buffers.

Analyzing the Rustls codebase, we found that in order to encrypt a payload, the entire plaintext message is copied. As explained in this PR description, aggregating the vector chunks on our side would mean a third allocation and copy. Instead, we placed the responsibility of aggregation on that existing copy.

My remarks are not directly related to this PR, but to the overall discussion on memory allocation. As @Keksoj said, we are eager to contribute and would happily make this change as well.
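Sozu's actual pool is not shown in this thread, but the pattern described above, preallocating fixed-size buffers and reusing them instead of allocating per message, can be sketched minimally like this (all names here are illustrative, not Sozu's real API):

```rust
// Minimal buffer-pool sketch (illustrative only, not Sozu's implementation):
// reuse preallocated fixed-size buffers instead of allocating per message.
pub struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    /// Preallocate `count` buffers of `buf_size` bytes each.
    pub fn new(count: usize, buf_size: usize) -> Self {
        Self {
            free: (0..count).map(|_| vec![0u8; buf_size]).collect(),
            buf_size,
        }
    }

    /// Take a buffer from the pool, falling back to allocation only
    /// when the pool is exhausted.
    pub fn checkout(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    /// Return a buffer so a later checkout can reuse its allocation.
    pub fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        buf.resize(self.buf_size, 0);
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new(2, 16_384);
    let buf = pool.checkout();
    assert_eq!(pool.free.len(), 1);
    pool.release(buf);
    assert_eq!(pool.free.len(), 2);
    println!("buffers are reused, not reallocated");
}
```

The point of the pattern is that steady-state traffic performs no heap allocation at all: every checkout after warm-up reuses a buffer released earlier.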
Thanks for the investigation! We will review your changes shortly -- meanwhile, I just wanted to acknowledge that we'd be happy to review a PR for this, too.
I'm also interested in this change. I've left a couple of comments (mostly nitpicks).
Force-pushed from cc4d000 to 175a26d (compare)
Thanks for the reviews! We appreciate your reception. We reproduced the CI's icount tests locally and found a lot of variance. Should we take them into account?
I did a quick try of that, and it improves our send-direction transfer benchmarks by almost 20% in some cases:
So this is certainly worth doing!
@rustls-benchmarking bench (experimenting with our new benchmarking integration)
Benchmark results

Instruction counts: there are no significant instruction count differences.
Wall-time: significant differences and other differences reported (details collapsed).
Additional information: checkout details (collapsed).
@Keksoj @Wonshtrum I'm still interested in your take on this. #1597 has landed in main since I left this comment. We would be interested in resurrecting this work but it looks like there are changes necessary (in particular in the deframer layer). I'm starting to look through what might need to be adjusted but if you had time to take a look as well your input would be appreciated.
Here's a branch I started to try and rebase this on main: https://github.com/cpu/rustls/tree/cpu-borrowed-payload

I made some minor changes along the way, none of which feel especially substantial:
Most importantly, as it stands now my branch doesn't build. I'm very open to the possibility I'm completely on the wrong track here, but as it stands I'm stuck and could use guidance.
Force-pushed from 175a26d to f31f35d (compare)
Just a shower thought: should we change this?

At the current time, once we do #1723 we would have to implement it there as well.

edit: this is probably not a request for a change to this PR.
Hello @cpu, we finished your attempt at rebasing on main, all the while encountering the same problem that you have:

https://github.com/Wonshtrum/rustls/blob/f31f35d910a40e36c65654dbf97eafa1829b82bb/rustls/src/msgs/message.rs#L412-L427

In both cases, they should be truly unreachable. This stems from how we designed it in the first place.

Coming to this realization, I think we have 4 different options:
In my opinion, options 2 and 3 are the most elegant ones, but it's a lot of work for a feature originally meant to be well delimited.
Hi @Wonshtrum, thanks for taking a look at this.
💡 that does seem like the crux of the issues we were both running into. This is a good insight.
Of the four options presented (well, 3 since I don't think we should give up yet 😆) I think option 2 seems the most reasonable to me, but I'm curious what others think. I think option 1 should be a last resort, and option 3 seems like it might be more complexity than is merited.
Absolutely, you can find us in the "crates" Discord Djc created, in the #rustls channel.
I would also gravitate towards option 2, probably at least as an initial approach. If we find there's a lot of duplication, we can always merge them again into something more like option 3? But I think it will be helpful in terms of understanding, too, to figure out what the needs on the receive side are vs the needs on the send side (not sure between "receive" vs "send" and "inbound" vs "outbound" which is the clearer framing? I have a slight preference for the former).
I also think option 2 is the cleaner one. I quickly looked into the code; it seems there is only one
I agree with you all, and on top of that, option 2 seems relatively doable. As for the naming, I would go with those, but maybe there are other namings available.
@rustls-benchmarking bench
The last push looks good to me 👍 I think the branch needs a
Force-pushed from c1dd941 to 6c2bbed (compare)
Thanks again for picking this branch up and iterating with me. I think we're just about ready to follow up with some other folks for review.
Beyond the comments I left in my review I think we should clean up some of the intermediate work in the commit history. I think the right end state would be two commits:

- The first, doing the `BorrowedPlainMessage` split into `BorrowedPlainMessageInbound`/`BorrowedPlainMessageOutbound`, leaving the existing code as-is as much as possible and just introducing the new trait/types. I think this commit message should touch on the rationale for the change, e.g. that we want to handle the payload differently for one of the two w.r.t. chunks.
- The second, implementing the borrowed payload/chunker. I think this commit message should keep the existing comment from the original first commit in this branch, but also be augmented with a bit more of the context from the PR description. It doesn't need to have all the `strace` output but I think it's important to mention the performance motivation.
Does that make sense to you?
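For illustration, the direction-specific split described above could look roughly like this. Only the type names `BorrowedPlainMessageInbound`/`BorrowedPlainMessageOutbound` come from the discussion; the fields and methods here are assumptions, not rustls' actual API:

```rust
// Illustrative sketch of the proposed split by direction. The fields and
// the len() helper are assumptions for illustration, not the real API.

/// Inbound: the deframer hands the record layer one contiguous slice.
pub struct BorrowedPlainMessageInbound<'a> {
    pub payload: &'a [u8],
}

/// Outbound: the payload may be scattered over several borrowed chunks,
/// e.g. the buffers passed to write_vectored.
pub struct BorrowedPlainMessageOutbound<'a> {
    pub chunks: &'a [&'a [u8]],
}

impl BorrowedPlainMessageOutbound<'_> {
    /// Total payload length across all chunks.
    pub fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }
}

fn main() {
    let inbound = BorrowedPlainMessageInbound { payload: b"hello" };
    let parts: [&[u8]; 2] = [b"hel", b"lo"];
    let outbound = BorrowedPlainMessageOutbound { chunks: &parts };
    // Same logical payload, two different representations:
    assert_eq!(inbound.payload.len(), outbound.len());
    println!("inbound and outbound types carry the same payload");
}
```

The appeal of option 2 is visible even in this toy version: the "payload must be contiguous" assumption lives only in the inbound type, so the outbound path never has to fake it.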
We certainly intend to edit the commit history. We thought of having two commits in the opposite order, but it is more elegant the other way around, as you suggest. Thanks for the review! We're pretty excited about this!
I rebased on main.
I think this diff link is showing the opposite change of what you actually meant. I like the actual change, though!
I don't have a very strong opinion, but removing
Unfortunately, this is the right diff; I changed the
I removed the invariants in two steps.

First I removed the "at least two slices in Multiple" invariant. This simplifies `copy_to_vec` and `split_at`:

```rust
pub fn copy_to_vec(&self, vec: &mut Vec<u8>) {
    match self {
        Self::Single(chunk) => vec.extend_from_slice(chunk),
        Self::Multiple { chunks, start, end } => {
            let mut size = 0;
            let last = chunks.len() - 1;
            for (i, chunk) in chunks.iter().enumerate() {
                let start = if i == 0 { *start } else { 0 };
                let end = if i == last { end - size } else { chunk.len() };
                vec.extend_from_slice(&chunk[start..end]);
                size += chunk.len();
            }
        }
    }
}

pub fn split_at(&self, mid: usize) -> (Self, Self) {
    match self {
        Self::Single(chunk) => {
            if chunk.len() <= mid {
                return (self.clone(), Self::Single(&[]));
            }
            (Self::Single(&chunk[..mid]), Self::Single(&chunk[mid..]))
        }
        Self::Multiple { chunks, start, end } => {
            let mut size = 0;
            for i in 0..chunks.len() {
                if size + chunks[i].len() >= (mid + start) {
                    // Offsets in each half are relative to its sliced chunks.
                    return (
                        Self::Multiple { chunks: &chunks[..=i], start: *start, end: start + mid },
                        Self::Multiple { chunks: &chunks[i..], start: start + mid - size, end: end - size },
                    );
                }
                size += chunks[i].len();
            }
            (self.clone(), Self::new_empty())
        }
    }
}
```

Then I removed the "start must be in the first chunk" and "end must be in the last chunk" invariants. The only invariant this version assumes is that start is smaller than end. This simplifies both functions again:

```rust
pub fn copy_to_vec(&self, vec: &mut Vec<u8>) {
    match self {
        Self::Single(chunk) => vec.extend_from_slice(chunk),
        Self::Multiple { chunks, start, end } => {
            let mut size = 0;
            for chunk in *chunks {
                let psize = size;
                let len = chunk.len();
                size += len;
                if size <= *start || psize >= *end {
                    continue;
                }
                let start = if psize < *start { start - psize } else { 0 };
                let end = if end - psize < len { end - psize } else { len };
                vec.extend_from_slice(&chunk[start..end]);
            }
        }
    }
}

pub fn split_at(&self, mid: usize) -> (Self, Self) {
    match self {
        Self::Single(chunk) => {
            if chunk.len() <= mid {
                return (self.clone(), Self::Single(&[]));
            }
            (Self::Single(&chunk[..mid]), Self::Single(&chunk[mid..]))
        }
        Self::Multiple { chunks, start, end } => (
            Self::Multiple { chunks, start: *start, end: start + mid },
            Self::Multiple { chunks, start: start + mid, end: *end },
        ),
    }
}
```

It's hard to create a representative benchmark for this.
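To sanity-check the relaxed-invariant approach, here is a self-contained sketch. The enum definition is reconstructed from context for illustration (the discussion only shows the methods), and it assumes `start`/`end` are absolute offsets into the concatenation of all chunks, with `start <= end` as the sole invariant:

```rust
// Standalone sketch of the relaxed-invariant chunk container discussed
// above. The enum itself is reconstructed for illustration.
#[derive(Clone)]
enum Chunks<'a> {
    Single(&'a [u8]),
    Multiple { chunks: &'a [&'a [u8]], start: usize, end: usize },
}

impl<'a> Chunks<'a> {
    fn copy_to_vec(&self, vec: &mut Vec<u8>) {
        match self {
            Self::Single(chunk) => vec.extend_from_slice(chunk),
            Self::Multiple { chunks, start, end } => {
                let mut size = 0;
                for chunk in *chunks {
                    let psize = size;
                    let len = chunk.len();
                    size += len;
                    // Skip chunks entirely before start or after end.
                    if size <= *start || psize >= *end {
                        continue;
                    }
                    // Clamp the copied window to this chunk.
                    let lo = if psize < *start { start - psize } else { 0 };
                    let hi = if end - psize < len { end - psize } else { len };
                    vec.extend_from_slice(&chunk[lo..hi]);
                }
            }
        }
    }

    fn split_at(&self, mid: usize) -> (Chunks<'a>, Chunks<'a>) {
        match self {
            Self::Single(chunk) => {
                if chunk.len() <= mid {
                    return (self.clone(), Self::Single(&[]));
                }
                (Self::Single(&chunk[..mid]), Self::Single(&chunk[mid..]))
            }
            // No re-slicing needed: both halves share the chunk list and
            // only the absolute start/end offsets move.
            Self::Multiple { chunks, start, end } => (
                Self::Multiple { chunks: *chunks, start: *start, end: start + mid },
                Self::Multiple { chunks: *chunks, start: start + mid, end: *end },
            ),
        }
    }
}

fn main() {
    let parts: [&[u8]; 3] = [b"GET ", b"/index", b" HTTP"];
    let all = Chunks::Multiple { chunks: &parts, start: 0, end: 15 };
    let (left, right) = all.split_at(6);
    let (mut l, mut r) = (Vec::new(), Vec::new());
    left.copy_to_vec(&mut l);
    right.copy_to_vec(&mut r);
    assert_eq!(l, b"GET /i");
    assert_eq!(r, b"ndex HTTP");
    println!("split works across chunk boundaries");
}
```

Note how `split_at` on `Multiple` is allocation-free and O(1): it never touches the chunk list, which is exactly what dropping the positional invariants buys.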
This sounds good to me! Thanks for diving in some more. Maybe
Force-pushed from 9c8197e to de1fdf7 (compare)
@Wonshtrum one more clippy issue to address! Also looks like our external-types workflow check broke, so we'll need to look into that (but it doesn't need to block this PR, I think).
Force-pushed from de1fdf7 to e9bfe40 (compare)
Sorry about that.
This was fixed in main w/ #1791. I think we probably just need ctz to do another pass on this branch and then it's ready for merge? I suspect based on OOB conversations this will be doable early next week 🤞
Signed-off-by: Eloi DEMOLIS <eloi.demolis@clever-cloud.com>
ConnectionCommon<T>::write_vectored was implemented by processing each chunk, fragmenting them and wrapping each fragment in an OutboundMessage before encrypting and sending it as a separate TLS frame. For very fragmented payloads this generates a lot of very small payloads, with most of the data being TLS headers.

OutboundChunks can contain an arbitrary amount of fragmented chunks. This allows write_vectored to process all its chunks at once, fragmenting them in place if needed and wrapping them in an OutboundMessage. All the chunks are merged into a contiguous vector (taking advantage of an already existing copy) before being encrypted and sent as a single TLS frame.

Signed-off-by: Eloi DEMOLIS <eloi.demolis@clever-cloud.com>
Co-Authored-By: Emmanuel Bosquet <bjokac@gmail.com>
Signed-off-by: Eloi DEMOLIS <eloi.demolis@clever-cloud.com>
Head branch was pushed to by a user without write access
e9bfe40
to
71746d6
Compare
@Wonshtrum thanks for all your work on this!
We, Wonshtrum and myself, work at Clever Cloud on a custom-made reverse proxy called Sōzu. We switched from using both OpenSSL and Rustls to being Rustls-only, one year ago, and we are very happy about it. We appreciate the security orientation of Rustls, as well as its native Rust nature.
Our issue
In Sōzu, our reverse proxy, we strive to optimize system calls and throughput. Here is the syscall we found with strace when transmitting a simple HTTPS response to a client:
What we see here is a number of small pieces of data, mostly headers, being encoded into TLS frames one at a time. The \27\3\3\0 pattern is the beginning of a TLS frame. This repetition is a waste of bandwidth.
Our investigation
The system call above is due to how the Writer's write_vectored method encodes the chunks one by one. In our case, our reverse proxy creates a number of smaller chunks, and half of the throughput is monopolized by TLS headers.
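To see why small records are so costly, consider the fixed per-record overhead. For TLS 1.3 with the common AEAD suites, each record carries a 5-byte header, a 1-byte inner content type, and a 16-byte authentication tag, i.e. 22 bytes regardless of payload size (other suites and protocol versions differ slightly). A rough back-of-the-envelope sketch:

```rust
// Rough estimate: fixed bytes added per TLS 1.3 record with a common
// AEAD suite (5-byte record header + 1-byte inner type + 16-byte tag).
const PER_RECORD_OVERHEAD: usize = 5 + 1 + 16;

/// Fraction of the wire bytes that is overhead when `payload` bytes are
/// split across `records` TLS records.
fn overhead_ratio(payload: usize, records: usize) -> f64 {
    let overhead = records * PER_RECORD_OVERHEAD;
    overhead as f64 / (payload + overhead) as f64
}

fn main() {
    // 200 bytes of HTTP headers sent as 10 tiny records vs one record:
    println!("10 records: {:.0}% overhead", 100.0 * overhead_ratio(200, 10));
    println!(" 1 record:  {:.0}% overhead", 100.0 * overhead_ratio(200, 1));
}
```

With ten tiny records, more than half of the wire bytes are framing, which matches the "half of the throughput is monopolized by TLS headers" observation; aggregating into one record drops the overhead to roughly a tenth.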
Our proposal
We could have solved our issue on the proxy side, by accumulating chunks and calling write() instead of write_vectored, but we seek to avoid dynamic memory allocation and copying.

In this pull request on Rustls, instead of treating the chunks individually in write_vectored, we suggest passing them down together in a BorrowedPayload form, until they are encrypted into a single TLS frame (or several if the payload exceeds MAX_FRAGMENT_SIZE). This new type, BorrowedPayload, does not copy the data but manipulates slices of bytes.

Syscall improvement
This is the same response as above: a simple "hi" and some headers.
We can see that there is only one iov_base (corresponding to one TLS frame), instead of many before. The syscall has diminished in size, from 563 to 145.
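A payload larger than the maximum fragment size still has to be split, but only into roughly len/max records rather than one record per input chunk. A sketch of that arithmetic (16384 is the TLS maximum plaintext fragment length from RFC 8446 §5.1; the constant name is illustrative):

```rust
// TLS caps plaintext fragments at 16 KiB (2^14 bytes, RFC 8446 §5.1).
const MAX_FRAGMENT_LEN: usize = 16_384;

/// Number of TLS records needed to carry `payload_len` bytes once all
/// chunks are aggregated first (at least one record).
fn records_aggregated(payload_len: usize) -> usize {
    payload_len.div_ceil(MAX_FRAGMENT_LEN).max(1)
}

fn main() {
    // A small response (e.g. 500 bytes spread over many chunks) fits
    // in a single record after aggregation:
    assert_eq!(records_aggregated(500), 1);
    // A 100 KiB payload still needs 7 records:
    assert_eq!(records_aggregated(102_400), 7);
    println!("record count depends on total length, not on chunk count");
}
```

This is why the syscall above shrinks to one iovec for a small response: the record count now scales with total payload length, not with how fragmented the input happened to be.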
Benchmarks
Our modification reduces bandwidth usage significantly when benchmarking our reverse proxy.
These benchmarks run on a simple desktop machine, with 4 backends to route HTTPS traffic to. The backends reply with a simple "hi" and some headers.
Sōzu + Rustls 0.21.9
Sōzu + Rustls (this branch)
Discussion
The strace and the benchmark suggest that our addition to the rustls codebase diminishes bandwidth use and improves overall performance in a meaningful way. Requests per second, and latency, are the key metrics for us.
Diving into Rustls, we found a number of memory allocations that could be avoided. If you would like, we would be more than happy to work with you in this direction.
(edited for clarity)