
Improved performance and benchmarks #1

Open · wants to merge 4 commits into main

Conversation

@auterium commented May 7, 2021

The current implementation has a few downsides:

  • It uses an unsafe block to skip UTF-8 validation of the incoming bytes. Although a reasonable thing to do, it's avoidable.
  • It enables the full feature set of its dependencies. This is a minor thing, but it causes longer compile times and a larger resulting binary.
  • Filtering causes allocations. As the filter deals with the data as strings and returns a String, it requires allocations that could be avoided.
  • It is limited to 8 KB buffers, which also constrains the maximum size of the datagrams.
  • The benchmark is inappropriate. It depends on a Python script that could produce data more slowly than what the Rust application can really handle, yielding inaccurate results.
  • The multi-threading is naive and causes more overhead than potential benefit. Other forms of multi-threading would be better suited than the proposed one; more to the point, the filtering algorithm can be improved enough to make multi-threading unnecessary.

Although not all of the downsides are addressed and further improvements are possible, this PR proposes the following improvements:

  • Use pure byte comparison so no UTF-8 validation/conversion is ever required, which removes the unsafe block. This also removes the allocation from the (unnecessary) conversion to String.
  • Switch the Tokio dependency from full features to only the ones needed, to improve compile times and binary size.
  • Use tokio_util to build a codec and use UdpFramed to make the code more ergonomic (see the sketch after this list).
  • Build a codec that processes the BytesMut buffer (provided by tokio_util) to drop the unwanted byte slices and concatenate the desired ones.
  • By using the codec approach from tokio_util, a BytesMut buffer that can grow automatically when needed is provided "for free" (as in: no need to manually manage buffers).
  • Include micro-benchmarks of the filtering implementations with Criterion to reliably compare their speeds. Spoiler: the new implementation is almost 3 times faster (~182 ns/iter vs ~465 ns/iter).
  • Use a BTreeSet<Bytes> instead of a Vec<String> for faster comparisons. This comes at the expense of only working for exact matches, but it could be changed to use regex.
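
For illustration, a minimal sketch of what such a Decoder could look like, assuming newline-delimited payloads and a Vec<Bytes> block list (the names and details here are illustrative, not the PR's actual code):

```rust
use bytes::{Bytes, BytesMut};
use tokio_util::codec::Decoder;

// One datagram in, the filtered (concatenated) lines out.
struct FilterCodec {
    block_list: Vec<Bytes>,
}

impl Decoder for FilterCodec {
    type Item = Bytes;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Bytes>, Self::Error> {
        if src.is_empty() {
            return Ok(None);
        }
        // Take the whole datagram out of the buffer.
        let data = src.split_to(src.len());
        let mut kept = BytesMut::with_capacity(data.len());
        'outer: for line in data[..].split(|&b| b == b'\n') {
            if line.is_empty() {
                continue;
            }
            for prefix in &self.block_list {
                if line.starts_with(prefix) {
                    continue 'outer; // drop blocked lines
                }
            }
            kept.extend_from_slice(line);
            kept.extend_from_slice(b"\n");
        }
        Ok(Some(kept.freeze()))
    }
}
```

The codec would then be plugged into tokio_util::udp::UdpFramed::new(socket, codec), which manages the receive buffer automatically.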

@askldjd commented May 7, 2021

Thanks for the feedback! I love this iteration. Let me review this a bit deeper this weekend.

@auterium (Author) commented May 8, 2021

After a review of my code I must admit I made a mistake that mixed up the results, leading to a claim of a ~3x improvement when in reality it was ~3x slower; I apologize for that. I spent a bit more time refining this and playing with different approaches and was able to get it down to a ~5% difference. At the same time, I tried the same new logic in a separate function called filter_2 that could replace your current one almost "as is" and got a whopping ~25% improvement (YMMV) in speed there:
[screenshot: Criterion benchmark results]

Of course, the comparison might not be entirely fair, as the proposed changes are not only in the filtering function but in other parts as well, so an interesting comparison would be to check the results with the full running server. I'm worried that the Python runner adds so much overhead that the comparison of the two Rust solutions might not be realistic, so perhaps a fully Rust-based runner would be a much fairer comparison, as it would introduce much less overhead, don't you think?
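
For reference, a Criterion comparison along these lines can be wired up as follows (a sketch; the filter stand-in and its signature are assumptions, not the crate's real code):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in with an assumed signature; the crate's real filter/filter_2
// functions would be benchmarked instead.
fn filter(input: &[u8], block_list: &[Vec<u8>]) -> Vec<u8> {
    input
        .split(|&b| b == b'\n')
        .filter(|line| !block_list.iter().any(|p| line.starts_with(p.as_slice())))
        .flat_map(|line| line.iter().copied().chain(std::iter::once(b'\n')))
        .collect()
}

fn bench_filters(c: &mut Criterion) {
    let input = b"foo.allowed:1|c\nbar.blocked:1|c\nbaz.allowed:1|c".to_vec();
    let block_list = vec![b"bar.".to_vec()];
    c.bench_function("filter", |b| {
        b.iter(|| filter(black_box(&input), black_box(&block_list)))
    });
    // Register filter_2 the same way to compare both in one report.
}

criterion_group!(benches, bench_filters);
criterion_main!(benches);
```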

@auterium (Author) commented May 8, 2021

I've pushed an integration test that spawns 4 threads, each sending 1000 messages, and tried 3 options:

  1. Original code without multi-threading
  2. Original code with new filter function, without multi-threading
  3. Codec-based new code (the one being proposed)

I'm not an expert in UDP, so I'm not sure why I couldn't send more than 1000 messages per thread. I tried 2k (though nothing between 1k and 2k) and it hung, so I'm not sure if this is saturating the UDP port or something.
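
For context, the sender side of such a test can be as small as this sketch (the target address and the message format are assumptions, not the repo's actual test code):

```rust
use std::net::UdpSocket;
use std::thread;

// Spawn 4 threads that each fire 1000 datagrams at the server under test.
fn main() {
    let handles: Vec<_> = (0..4)
        .map(|t| {
            thread::spawn(move || {
                let socket = UdpSocket::bind("127.0.0.1:0").expect("bind sender socket");
                for i in 0..1000 {
                    let msg = format!("thread{}.metric{}:1|c", t, i);
                    socket
                        .send_to(msg.as_bytes(), "127.0.0.1:8125")
                        .expect("send datagram");
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```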

Here are the results on my machine (YMMV):

[Filter classic] Processed 4000 messages in 276.0672ms | 69.016µs/msg
[Filter 2] Processed 4000 messages in 237.6573ms | 59.414µs/msg
[Codec] Processed 4000 messages in 204.9502ms | 51.237µs/msg

I think that the proposed code has not yet reached its full potential, as it's still allocating memory for a secondary buffer.

for prefix in block_list {
    if line.starts_with(prefix) {
        continue 'outer;
    }
}
A reviewer commented on the code above:

The current impl (as well as the initial one) runs in time linear in the block-list size.
Since this is the main functionality, it makes sense to optimize it. It is possible to make it constant-time with respect to the block-list size, for example by using a trie (prefix tree) data structure.
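
For illustration, a minimal byte-level trie along those lines might look like this (a sketch, not tied to the PR's types):

```rust
use std::collections::HashMap;

// Byte-level trie: lookup cost depends on the line length, not on the
// number of blocked prefixes.
#[derive(Default)]
struct Trie {
    children: HashMap<u8, Trie>,
    terminal: bool, // a blocked prefix ends at this node
}

impl Trie {
    fn insert(&mut self, prefix: &[u8]) {
        let mut node = self;
        for &b in prefix {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    // True if any stored prefix is a prefix of `line`.
    fn blocks(&self, line: &[u8]) -> bool {
        let mut node = self;
        for &b in line {
            if node.terminal {
                return true;
            }
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.terminal
    }
}
```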

@auterium (Author) replied:

On my first commit I was actually using a BTreeSet to store the keys, but it required exact matches and was slower than the Vec + starts_with() approach, though I think that was due to other slow areas. I've read about tries before, but I'm not experienced with them, so feel free to propose some samples if you have any 😃

I'm currently focusing on a way to remove the extra BytesMut allocation, which I think is entirely possible, leaving the prefix match as the only remaining bottleneck.
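
One possible direction for that (a sketch assuming newline-delimited data; not the PR's actual code) is to compact the kept lines in place and truncate, so no secondary buffer is needed:

```rust
use bytes::{Bytes, BytesMut};

// Filter without a secondary buffer: copy kept lines toward the front of
// the same BytesMut, then cut off the tail.
fn filter_in_place(buf: &mut BytesMut, block_list: &[Bytes]) {
    let mut write = 0;
    let mut read = 0;
    while read < buf.len() {
        // End of the current line, including its newline if present.
        let end = buf[read..]
            .iter()
            .position(|&b| b == b'\n')
            .map(|p| read + p + 1)
            .unwrap_or(buf.len());
        let blocked = {
            let line = &buf[read..end];
            block_list.iter().any(|p| line.starts_with(&p[..]))
        };
        if !blocked {
            buf.copy_within(read..end, write);
            write += end - read;
        }
        read = end;
    }
    buf.truncate(write);
}
```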
