
Improved performance and benchmarks #1

Open · wants to merge 4 commits into main

Conversation

@auterium commented May 7, 2021

The current implementation has a few downsides:

  • It uses an unsafe block to skip UTF-8 validation of the incoming bytes. Although a reasonable thing to do, it's avoidable.
  • It enables the full feature set of its dependencies. This is a minor thing, but it causes longer compile times and a larger resulting binary.
  • Filtering causes allocations. As the filter deals with the data as strings and returns a String, it requires allocations that could be avoided.
  • It is limited to 8 KB buffers, which also constrains the maximum size of the datagrams.
  • The benchmark is inappropriate. It depends on a Python script that could produce data more slowly than what the Rust application can really handle, yielding inaccurate results.
  • The multi-threading is naive and causes more overhead than potential benefit. Other forms of multi-threading would be better suited than the proposed one; more to the point, the filtering algorithm can be improved enough to make multi-threading unnecessary.

Although not all of the downsides are addressed and further improvements are possible, this PR proposes the following improvements:

  • Use pure byte comparison so no UTF-8 validation/conversion is ever required, which removes the unsafe block. This also removes the allocation from the (unnecessary) conversion to String.
  • Switch the Tokio dependency from full features to only the ones needed, to improve compile times and binary size.
  • Use tokio_util to build a codec and use UdpFramed to make the code more ergonomic (see the sketch after this list).
  • Build a codec that processes the BytesMut buffer (provided by tokio_util) to drop the unwanted byte slices and concatenate the desired ones.
  • By using the codec approach from tokio_util, a BytesMut buffer that can grow automatically when needed is provided "for free" (as in: no need to manually manage buffers).
  • Include micro-benchmarks of the filtering implementations with Criterion to reliably compare their speeds. Spoiler: the new implementation is almost 3 times faster (~182 ns/iter vs ~465 ns/iter).
  • Use a BTreeSet<Bytes> instead of a Vec<String> for faster comparisons. This comes at the expense of only working for exact matches, but it could be changed to use regex.
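
For illustration, a minimal sketch of what such a Decoder could look like, assuming newline-delimited payloads and a Vec<Bytes> block list (the names and details here are illustrative, not the PR's actual code):

```rust
use bytes::{Bytes, BytesMut};
use tokio_util::codec::Decoder;

// One datagram in, the filtered (concatenated) lines out.
struct FilterCodec {
    block_list: Vec<Bytes>,
}

impl Decoder for FilterCodec {
    type Item = Bytes;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Bytes>, Self::Error> {
        if src.is_empty() {
            return Ok(None);
        }
        // Take the whole datagram out of the buffer.
        let data = src.split_to(src.len());
        let mut kept = BytesMut::with_capacity(data.len());
        'outer: for line in data[..].split(|&b| b == b'\n') {
            if line.is_empty() {
                continue;
            }
            for prefix in &self.block_list {
                if line.starts_with(prefix) {
                    continue 'outer; // drop blocked lines
                }
            }
            kept.extend_from_slice(line);
            kept.extend_from_slice(b"\n");
        }
        Ok(Some(kept.freeze()))
    }
}
```

The codec would then be plugged into tokio_util::udp::UdpFramed::new(socket, codec), which manages the receive buffer automatically.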

@askldjd commented May 7, 2021

Thanks for the feedback! I love this iteration. Let me review this a bit deeper this weekend.

@auterium (Author) commented May 8, 2021

After a review of my code I must admit I made a mistake that mixed up the results, leading to a claim of a ~3x improvement when in reality it was ~3x slower; I apologize for that. I spent a bit more time refining this and playing with different approaches and was able to get it down to a ~5% difference. At the same time, I tried the same new logic in a separate function called filter_2 that could replace your current one almost "as is" and got a whopping ~25% improvement (YMMV) in speed there:
[screenshot: Criterion benchmark results]

Of course, the comparison might not be entirely fair, as the proposed changes are not only in the filtering function but in other parts as well, so an interesting comparison would be to check the results with the full running server. I'm worried that the Python runner adds so much overhead that the comparison of the two Rust solutions might not be realistic, so perhaps a fully Rust-based runner would be a much fairer comparison, as it would introduce much less overhead, don't you think?
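
For reference, a Criterion comparison along these lines can be wired up as follows (a sketch; the filter stand-in and its signature are assumptions, not the crate's real code):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in with an assumed signature; the crate's real filter/filter_2
// functions would be benchmarked instead.
fn filter(input: &[u8], block_list: &[Vec<u8>]) -> Vec<u8> {
    input
        .split(|&b| b == b'\n')
        .filter(|line| !block_list.iter().any(|p| line.starts_with(p.as_slice())))
        .flat_map(|line| line.iter().copied().chain(std::iter::once(b'\n')))
        .collect()
}

fn bench_filters(c: &mut Criterion) {
    let input = b"foo.allowed:1|c\nbar.blocked:1|c\nbaz.allowed:1|c".to_vec();
    let block_list = vec![b"bar.".to_vec()];
    c.bench_function("filter", |b| {
        b.iter(|| filter(black_box(&input), black_box(&block_list)))
    });
    // Register filter_2 the same way to compare both in one report.
}

criterion_group!(benches, bench_filters);
criterion_main!(benches);
```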

@auterium (Author) commented May 8, 2021

I've pushed an integration test that spawns 4 threads, each sending 1000 messages, and tried 3 options:

  1. Original code without multi-threading
  2. Original code with new filter function, without multi-threading
  3. Codec-based new code (the one being proposed)

I'm not an expert in UDP, so I'm not sure why I couldn't send more than 1000 messages per thread. I tried 2k (though nothing between 1k and 2k) and it hung, so I'm not sure if this is saturating the UDP port or something.
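
For context, the sender side of such a test can be as small as this sketch (the target address and the message format are assumptions, not the repo's actual test code):

```rust
use std::net::UdpSocket;
use std::thread;

// Spawn 4 threads that each fire 1000 datagrams at the server under test.
fn main() {
    let handles: Vec<_> = (0..4)
        .map(|t| {
            thread::spawn(move || {
                let socket = UdpSocket::bind("127.0.0.1:0").expect("bind sender socket");
                for i in 0..1000 {
                    let msg = format!("thread{}.metric{}:1|c", t, i);
                    socket
                        .send_to(msg.as_bytes(), "127.0.0.1:8125")
                        .expect("send datagram");
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```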

Here are the results on my machine (YMMV):

[Filter classic] Processed 4000 messages in 276.0672ms | 69.016µs/msg
[Filter 2] Processed 4000 messages in 237.6573ms | 59.414µs/msg
[Codec] Processed 4000 messages in 204.9502ms | 51.237µs/msg

I think that the proposed code has not yet reached its full potential, as it's still allocating memory for a secondary buffer.

for prefix in block_list {
    if line.starts_with(prefix) {
        continue 'outer;
    }
}
A reviewer commented on the code above:

The current impl (as well as the initial one) runs in time linear in the block-list size.
Since this is the main functionality, it makes sense to optimize it. It is possible to make it constant-time with respect to the block-list size, for example by using a trie (prefix tree) data structure.
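
For illustration, a minimal byte-level trie along those lines might look like this (a sketch, not tied to the PR's types):

```rust
use std::collections::HashMap;

// Byte-level trie: lookup cost depends on the line length, not on the
// number of blocked prefixes.
#[derive(Default)]
struct Trie {
    children: HashMap<u8, Trie>,
    terminal: bool, // a blocked prefix ends at this node
}

impl Trie {
    fn insert(&mut self, prefix: &[u8]) {
        let mut node = self;
        for &b in prefix {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    // True if any stored prefix is a prefix of `line`.
    fn blocks(&self, line: &[u8]) -> bool {
        let mut node = self;
        for &b in line {
            if node.terminal {
                return true;
            }
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.terminal
    }
}
```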

@auterium (Author) replied:

On my first commit I was actually using a BTreeSet to store the keys, but it required exact matches and was slower than the Vec + starts_with() approach, though I think that was due to other slow areas. I've read about tries before, but I'm not experienced with them, so feel free to propose some samples if you have any 😃

I'm currently focusing on a way to remove the extra BytesMut allocation, which I think is entirely possible, leaving the prefix match as the only remaining bottleneck.
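
One possible direction for that (a sketch assuming newline-delimited data; not the PR's actual code) is to compact the kept lines in place and truncate, so no secondary buffer is needed:

```rust
use bytes::{Bytes, BytesMut};

// Filter without a secondary buffer: copy kept lines toward the front of
// the same BytesMut, then cut off the tail.
fn filter_in_place(buf: &mut BytesMut, block_list: &[Bytes]) {
    let mut write = 0;
    let mut read = 0;
    while read < buf.len() {
        // End of the current line, including its newline if present.
        let end = buf[read..]
            .iter()
            .position(|&b| b == b'\n')
            .map(|p| read + p + 1)
            .unwrap_or(buf.len());
        let blocked = {
            let line = &buf[read..end];
            block_list.iter().any(|p| line.starts_with(&p[..]))
        };
        if !blocked {
            buf.copy_within(read..end, write);
            write += end - read;
        }
        read = end;
    }
    buf.truncate(write);
}
```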
