
Deploy copy-on-write for more efficient messaging #38

Merged
merged 4 commits into from Mar 27, 2019

Conversation

Neverlord
Member

This is a breaking API change.

Sorry for the long delay. I had this in the pipeline for quite a while, but I didn't have the cycles since we were working non-stop on releasing our first version for Tenzir.

Summary

One of the outcomes of the performance evaluation we did for BroCon 2018 was that Broker wastes a lot of cycles by making unnecessary copies. This is because CAF assumes that individual elements in a stream are cheap to copy, but broker::data is not.

To fix this, we wrap the topic/data pairs that travel through Broker's pub/sub channels into copy-on-write optimized tuples. This turns the previous deep copies for each individual pair into a simple reference increase.

Is it worth it? I think it's a crucial step in the right direction and drastically improves performance for multiple local subscribers, but Broker's performance in a distributed setup isn't affected as much as I'd hoped.

API Changes

From a user perspective, Broker now returns a data_message (please see the new message.hh header) wherever it used to return a std::pair<topic, data>. I've added convenience functions to interact with the new COW tuple in a more descriptive manner than using get<0>(...) and get<1>(...):

  auto sub = ep.make_subscriber({"/topic/test"});
  auto msg = sub.get();
- auto topic = msg.first;
- auto data_ = msg.second;
+ auto topic = get_topic(msg);
+ auto data_ = get_data(msg);

Please note that I did not look into the Python bindings. Maybe the C++ changes can even be hidden in the Python API without loss of efficiency.

Evaluation

I've used my personal MacMini for the evaluation (Late 2012, 4 cores), so by all means please feel free to re-run this on more sophisticated hardware.

broker-stream-benchmark

This benchmark uses endpoint::subscribe and endpoint::publish_all. When using the both mode to run everything in one process, it gets as close to the underlying stream processing as possible. Hence, this benchmark should show the biggest improvement.

First up is master, hitting CTRL+C after the first 10 messages.

[master] $ broker-stream-benchmark -m both
132055 msgs/s
181842 msgs/s
198092 msgs/s
190648 msgs/s
185640 msgs/s
189253 msgs/s
173780 msgs/s
150490 msgs/s
133849 msgs/s
106028 msgs/s

106k to 198k messages per second.

[PR] $ broker-stream-benchmark -m both
3495604 msgs/s
2719193 msgs/s
1065131 msgs/s
1466757 msgs/s
2632291 msgs/s
2253084 msgs/s
1112171 msgs/s
2491704 msgs/s
1492858 msgs/s
791025 msgs/s

We can see that the values fluctuate more heavily this time around. However, we now get 791k to 3.5M messages per second out of Broker. The worst case of the new implementation is a 4x speedup over the previous best case.

BroCon18 Throughput Setup

I've also revisited the BroCon18 setup for the throughput measurement. The scripts can be found in our events repository. I tweaked it slightly to stop at 1k payload and only used 1-2 nodes.

This time around, we are measuring Broker in a distributed setup: Publisher -> Node_1 -> ... -> Node_n -> Subscriber with varying payloads. Because we are dealing with serialization and deserialization this time around, we won't see a similar jump in performance.

At BroCon, we've seen that serialization overhead starts dominating at 1k payloads. So I've only looked at payloads of 100.

First up is master again. At the beginning of each run there are a couple of 0 entries in the CSV, so we take the last 10 measurements.

$ grep "2,100," MacMiniMaster.csv | tail -n 10 | awk -F , '{ print $3 }'
55574
42143
43507
45365
45128
49288
47180
37242
47334
48977

42k to 55k messages per second with two nodes between Ping and Pong. How is the topic branch performing?

$ grep "2,100," MacMiniBranch.csv | tail -n 10 | awk -F , '{ print $3 }'
76578
74185
69911
61505
64530
68059
64589
63636
68188
67829

No 4x speedup this time, but 61k to 76k is still an increase of more than 10% over the previous best case. Keep in mind that this is purely based on pruning unnecessary copies.

Neverlord and others added 4 commits December 13, 2018 14:49
Benchmarking Broker revealed high copying overhead, because Broker
internally copies topics and data frequently in the streams between
actors. Using a copy-on-write tuple over topic and data instead of a
pair avoids copies as long as messages are accessed read-only.
@Neverlord
Member Author

A quick follow-up on the serialization/deserialization topic. We're already working on improving the performance of CAF applications by pushing deserialization out of the critical path in the I/O loop: actor-framework/actor-framework#821.

Simply switching to the CAF topic branch is not enough, though, since Broker gives the scheduler too few threads to play with. However, I ran a quick experiment by recompiling against the topic/basp-worker branch of CAF and using this config:

[scheduler]
max-threads=4
[middleman]
num-workers=2

Here's what I get:

$ grep "2,100," throughput.csv | tail -n 10 | awk -F , '{ print $3 }'
96803
98710
97163
100201
102045
102476
103044
103067
102602
98722

Throughput around 100k messages per second with two relays. Almost double what I get with current CAF+Broker master.

@rsmmr
Member

rsmmr commented Mar 18, 2019

Looks promising. The API breakage is unfortunate, but hard to avoid; and overall it makes the differentiation between the different message types clearer.

Can we move the accessors for topic and data into methods? That would need a new class deriving from cow_tuple, but it should work, no? Free functions for such specific functionality feel unintuitive to me (same for the other new free functions, is_data_message, etc.).

@jsiwek
Contributor

jsiwek commented Mar 19, 2019

I don't like the API breakage in principle, and it seems technically possible to avoid. But the affected entry points are the main ones of the subscriber and endpoint classes: make_subscriber(), subscriber(), and publish(). We want people using the fast path for those right away, not convoluted by alternate legacy classes/methods. So I'm inclined to go with it: it's early on enough, we don't plan to make it a habit, and it won't change the Zeek interface itself, just affect the smaller subset of people using Broker directly.

I can work on finalizing a merge this week or next.

@jsiwek jsiwek merged commit a7e61a7 into master Mar 27, 2019
@jsiwek
Contributor

jsiwek commented Mar 27, 2019

Thanks for the PR. I updated the Python bindings and merged to master. I decided to keep the free functions in the API, which is consistent with the pre-existing API for data access (I don't see that changing). I also found that since cow_tuple already uses the name data for other purposes, member accessors might get confusing/troublesome if names accidentally start hiding each other.

jsiwek added a commit to zeek/zeek that referenced this pull request Mar 27, 2019
@Neverlord
Member Author

Sorry for the silence on my end. I was ill last week and I'm still catching up.

I'm glad that you share the view that a clean break now is better than accumulating inefficient APIs.

Can we move the accessors for topic and data into methods?

It's already merged, but I still would like to share some thoughts. The C++ community as a whole is moving towards more free functions (e.g., see std::begin, std::end) and a more functional style in general. I lean towards free functions as well, but here it's a bit more than "just" preference: classes such as cow_tuple, std::shared_ptr, or std::vector are not meant as base types and they don't have a virtual destructor. These types offer rudimentary building blocks that are better extended with free functions. Technically, you can "get away" with using them as base types as long as you don't add any member variables, but extending the interface with free functions really is the way to go.

Plus consistency, as pointed out by @jsiwek. 🙂

I updated the Python bindings

Thanks a lot!

@jsiwek
Contributor

jsiwek commented Mar 27, 2019

Thanks for the shared thoughts, I had some additional ones just for completeness:

  • The one benefit of member methods I like relates to discoverability: it's easier to see/find the collection of functions that are meant to be associated with a given class/struct.

  • I didn't even think about the lack of a virtual dtor here, good point. But I do have the general notion that C++ APIs meant for public consumption don't always anticipate people subclassing in all sorts of crazy ways.

  • Instead of subclassing, we could have wrapped it in our own class/struct and added methods to that, but IMO it's more worth it to just keep things simple and direct.

@Neverlord Neverlord deleted the topic/neverlord/cow-tuple branch August 14, 2019 08:29