
Add new ALM transport #89

Closed
wants to merge 106 commits

Conversation

Neverlord
Member

@Neverlord Neverlord commented Feb 16, 2020

Currently, Broker requires its users to provide loop-free deployments. The new ALM transport makes it much easier to deploy Broker, since Broker no longer simply floods published data but instead discovers and manages available paths.

I've added a new doc/devs.rst file to the documentation to cover how Broker works internally as well as how the (new) code is structured. At this point, feedback on this would be most welcome.

Remaining ToDos:

  • Extend the routing table to store paths rather than distances
  • Implement source routing
  • Add path revocation to propagate loss of connectivity
  • Implement ordering & loss detection (for data stores)
  • Implement re-attaching of clones on message loss (not essential, implement as follow-up)
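To illustrate the first three items, here is a minimal, hypothetical sketch of a routing table that stores full paths instead of distances, picks a path for source routing, and revokes paths on lost connectivity. All names and types here are assumptions for illustration; Broker's actual routing table differs in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical endpoint ID; Broker uses its own endpoint ID type.
using endpoint_id = uint64_t;

// One entry per reachable peer: all known paths to it, instead of a
// single distance value.
struct routing_table {
  std::map<endpoint_id, std::vector<std::vector<endpoint_id>>> paths;

  void add_path(endpoint_id dst, std::vector<endpoint_id> hops) {
    auto& known = paths[dst];
    if (std::find(known.begin(), known.end(), hops) == known.end())
      known.push_back(std::move(hops));
  }

  // Source routing: pick the shortest known path to dst.
  const std::vector<endpoint_id>* shortest_path(endpoint_id dst) const {
    auto i = paths.find(dst);
    if (i == paths.end() || i->second.empty())
      return nullptr;
    return &*std::min_element(i->second.begin(), i->second.end(),
                              [](const auto& a, const auto& b) {
                                return a.size() < b.size();
                              });
  }

  // Path revocation: drop every path that traverses the lost hop.
  void revoke(endpoint_id lost_hop) {
    for (auto& [dst, known] : paths)
      known.erase(std::remove_if(known.begin(), known.end(),
                                 [&](const auto& p) {
                                   return std::find(p.begin(), p.end(),
                                                    lost_hop) != p.end();
                                 }),
                  known.end());
    paths.erase(lost_hop);
  }
};
```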

@rsmmr
Member

rsmmr commented Feb 24, 2020

I'm seeing compiler errors:

```
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/sec_strings.cpp:105:15: error: no member named 'remote_lookup_failed' in
'caf::sec'; did you mean 'remote_linking_failed'?
case sec::remote_lookup_failed:
~~~~~^~~~~~~~~~~~~~~~~~~~
remote_linking_failed
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/caf/sec.hpp:92:3: note: 'remote_linking_failed' declared here
remote_linking_failed,
^
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/sec_strings.cpp:105:10: error: duplicate case value 'remote_linking_failed'
case sec::remote_lookup_failed:
^
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/sec_strings.cpp:69:10: note: previous case defined here
case sec::remote_linking_failed:
^
2 errors generated.
```

Submodules are up to date, am I missing something else?

@rsmmr
Member

rsmmr commented Feb 24, 2020

Focused on the new dev guide for now. That's nice, certainly very helpful. A couple of suggestions:

  • I'd turn the sections around and start with the architecture; the ALM section can then refer back to it as needed.

  • A couple diagrams would be helpful: one for the architecture showing the main components and data flows (core actor, mixins, etc.; also where it goes down into CAF); and one (or two) for the flow of messages, ideally with an example of what happens as they get forwarded.

I'll leave some more smaller comments/questions on the RST code.

@Neverlord
Member Author

Thanks for the feedback! Just a quick note (content updates in progress):

Submodules are up to date, am I missing something else?

Sorry, the 3rdparty submodule was lagging behind. Should work now.

@Neverlord
Member Author

@rsmmr I've integrated your feedback on the devs section. Let me know what you think of the additions / changes and how I can improve the section further. 🙂

@rsmmr
Member

rsmmr commented Mar 18, 2020

Nice, thanks for the updates to the devs section. The ALM description all sounds good, and the architecture section is quite helpful - I feel like I'm finally starting to understand Broker ;-)

@rsmmr
Member

rsmmr commented Mar 18, 2020

(... and no further particular comments)

@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from 8bc3796 to d8d1db8 on March 27, 2020 14:03
@Neverlord
Member Author

A quick update I wanted to share: while some unit tests still fail (working on it), the ALM branch runs stably enough for a first quick performance comparison with current master. I used the broker-benchmark tool with Broker compiled in Release mode:

broker-benchmark -r 100000000 -s 1 -t 2 localhost:9090

For a quick overview, I let it run for some time, then took 30 samples (the server's output of received messages per second) for each branch. Here are the results:

Branch Median Average
master 31,588.5 30,992.6
ALM 32,471.0 32,368.5

I wouldn't read too much into this, since the values fluctuate. However, the takeaway is that the new source routing seems to have no negative effect on performance.

Here is the raw data I've compiled this from: 2020-03-28 Broker ALM branch comparison.txt.

Of course, most insights will come when looking at a full cluster. Once I have ported the new cluster benchmark, I'll do a more thorough comparison.

@Neverlord
Member Author

@rsmmr I was thinking about a path forward for this PR. As it stands, I think it is hard to review / integrate in its entirety. The branch contains several big changes to the entire code base plus a new communication backend. I see two options we could choose: 1) do some incremental reviews (you've already done one) and eventually merge everything at once, or 2) separate refactoring from the actual new features with individual PRs.

For option two, I would factor out at least these two steps as separate PRs:

  • The switch to a mixin-based design. This encompasses the first chunk of commits minus the routing table.
  • Extending status with the new codes endpoint_discovered and endpoint_unreachable. This affected more code than I originally thought (almost done with that refactoring). While they'd be unused in master for a while, I think we could still merge this change ahead of the actual ALM implementation.

Personally, I favor option 2. Aside from making reviews more manageable and focused, merging individual parts earlier also helps avoid this PR drifting out of sync with master.
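For illustration, a hedged sketch of what such a status-code extension might look like as an enum; only endpoint_discovered and endpoint_unreachable come from the discussion above, while the surrounding values and the to_string helper are assumptions (Broker's real status codes carry more values and live elsewhere).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch only: not Broker's actual status-code enum.
enum class sc : uint8_t {
  unspecified,          // assumed pre-existing code
  peer_added,           // assumed pre-existing code
  peer_removed,         // assumed pre-existing code
  endpoint_discovered,  // new: a (possibly non-adjacent) endpoint became reachable
  endpoint_unreachable, // new: the last known path to an endpoint vanished
};

// Render the two new codes for logging; everything else is collapsed here.
std::string to_string(sc code) {
  switch (code) {
    case sc::endpoint_discovered:
      return "endpoint_discovered";
    case sc::endpoint_unreachable:
      return "endpoint_unreachable";
    default:
      return "other";
  }
}
```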

@rsmmr
Member

rsmmr commented Apr 28, 2020

Yeah, let's go with Option 2, and merge the refactoring & static notifications first.

If you can split things out further into more, smaller chunks, that would be worth the effort; both for the refactoring and then for the new functionality. We'll review each to the degree we can, with a particular focus on not breaking existing cluster topologies.

Also, let's remain flexible on timeline: The closer we get to the release, the more risky a merge of complex changes will be. Depending on how things progress, a viable model could be getting the foundational commits in before 2.2, and then the new logic after the release early in the next cycle. Let's see.

@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from 94c4c4b to 7417b33 on May 31, 2020 07:09
@Neverlord Neverlord added this to the Publish Broker events over ALM milestone Jun 12, 2020
@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from d227221 to bdc21b5 on June 14, 2020 08:54
@Neverlord
Member Author

While porting the data store actors to the new channels, I've also streamlined the communication between master and clone. So this PR closes #99 and closes #125.

@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from 7669595 to 059cc4f on February 23, 2021 14:40
By using a common prefix for all actors in Broker, we can more easily
enable metrics collection and also make it easier to find relevant
actors in log output.
@Neverlord
Member Author

After re-integrating the upstream changes, I did some benchmarking with broker-benchmark today.

As a baseline, here are 5 minutes of the current master version of Broker. It sits at around ~150k events per second on my local machine:

[Screenshot: 2021-03-14 at 13 43 21]

With the ALM branch, I can see performance go down to ~80k events per second:

[Screenshot: 2021-03-14 at 13 43 42]

The "Message Processing Time" panel is a heat map. With metrics enabled, CAF measures how long an actor takes to process its input messages and then puts each sample into its (configurable) bucket.

On the master branch, we can see that the core actor in Broker processes most messages (a batch is counted as a single message) in under 10µs. In the ALM branch, most messages now take between 10µs and 100µs instead.
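The bucketing can be sketched as follows; the bucket bounds here are made up for illustration (CAF's are configurable), and this is not CAF's actual histogram implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal sketch of a bucketed processing-time histogram like the one
// behind the heat map. Bounds are upper limits in microseconds; the last
// counter collects everything above the largest bound.
struct processing_time_histogram {
  static constexpr std::array<int64_t, 4> bounds{1, 10, 100, 1000};
  std::array<uint64_t, 5> counts{}; // one per bound, plus overflow

  void observe(int64_t us) {
    for (std::size_t i = 0; i < bounds.size(); ++i)
      if (us <= bounds[i]) {
        ++counts[i];
        return;
      }
    ++counts.back(); // sample exceeded all bounds
  }
};
```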

Some computational overhead is unavoidable with an overlay, but I'll look into where Broker spends most of the extra time to see whether we can reduce the cost by choosing different data structures or algorithms.

The measurements above used the default message type in broker-benchmark, which is just a vector containing some small dummy data. As a comparison, here are two runs with message type 2, which tries to resemble Zeek log lines.

Due to the much larger data per element, master sits at around 40k events per second:

[Screenshot: 2021-03-14 at 14 21 56]

The ALM branch is still slower, hovering above 30k events per second:

[Screenshot: 2021-03-14 at 14 13 19]

The gap isn't as big this time, since the per-element overhead is smaller compared to the overall cost of sending larger elements. Even in this version, we can see that a larger percentage of elements lands in the >10µs bucket.

Since a real Zeek cluster is probably closer to the second run, the ALM version of Broker wouldn't cut peak performance in half. This benchmark only considers a one-to-one setup at the moment in order to look at peak performance. Ideally, we get the ALM performance a bit closer to the current Broker version, or at least better understand the tradeoffs we are making (i.e., where we lose the time and whether it's worth it), before merging / integrating into Zeek.

@rsmmr rsmmr modified the milestones: Zeek 4.0, Zeek 4.1 Mar 30, 2021
@Neverlord
Member Author

With the ALM branch, I can see performance go down to ~80k events per second:

Broker now contains a new micro-benchmark suite that measures individual building blocks of Broker, such as serialization overhead and "raw" streaming performance. With the new performance data, I could pinpoint where the drop in performance came from: it turned out to be the extra data we need to attach to each Broker event for source routing (i.e., a multipath).

Just optimizing memory allocation strategies and merging the receiver list into the multipath got me back to slightly above 100k events per second.

@rsmmr suggested that maybe Broker could do some clever batching. Broker already batches implicitly via CAF streams, so I'd be cautious about adding another batching layer on top (especially since Zeek already does another level of batching on top of Broker). However, the suggestion got me thinking about where the performance benefits of batching done by Broker might come from. Essentially, Broker could avoid redundant data on the wire by first sending the source routing information and then the data. A Broker batch looks somewhat like this (ignoring command messages for simplicity):

Topic Data Path
t1 v1 p1
t1 v2 p2
t2 v3 p1

To CAF, a batch is simply a std::vector<broker::node_message>. The default serialization puts a lot of redundant information on the wire if events share paths or topics. Besides the redundancy on the wire, we also waste time by serializing and deserializing identical objects multiple times.

So I've added a custom type inspection specialization for std::vector<broker::node_message> that splits the data into three parts: topics, paths, and data. The last part then references topic and path only by "key" (really just the index into the topics/paths lists). With this optimization, I see ~130k events per second in the Broker benchmark.

Of course, the benchmark is the absolute best case for Broker, since all events share the same topic and the same path. In the worst case (no shared topics and no shared paths), this new encoding would actually add overhead. However, I think it's unlikely in practice that there is no sharing at all, except for very small batches with just a couple of events that Broker/CAF sends after the maximum batch delay. We don't need to optimize for that case, because it means the system is mostly idle.
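A minimal sketch of this index-based batch encoding. The node_message layout and all names here are simplified assumptions for illustration; Broker's real node_message and the CAF inspector API differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flattened message; Broker's node_message carries more state.
struct node_message {
  std::string topic;
  std::string path; // stand-in for the serialized multipath
  std::string data;
};

// Deduplicated wire format: topics and paths are stored once, entries
// reference them by index.
struct encoded_batch {
  std::vector<std::string> topics;
  std::vector<std::string> paths;
  struct entry {
    uint32_t topic_index;
    uint32_t path_index;
    std::string data;
  };
  std::vector<entry> entries;
};

// Return the index of x in pool, appending it if not present.
static uint32_t intern(std::vector<std::string>& pool, const std::string& x) {
  for (uint32_t i = 0; i < pool.size(); ++i)
    if (pool[i] == x)
      return i;
  pool.push_back(x);
  return static_cast<uint32_t>(pool.size() - 1);
}

encoded_batch encode(const std::vector<node_message>& batch) {
  encoded_batch result;
  for (const auto& msg : batch)
    result.entries.push_back({intern(result.topics, msg.topic),
                              intern(result.paths, msg.path), msg.data});
  return result;
}
```

For the three-row batch shown above, this yields two topics and two paths on the wire instead of three of each, and each repeated object is serialized only once.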

@Neverlord
Member Author

As discussed with @rsmmr today, we are going to make one more change before this is ready to merge: get rid of the remote actor indirections and instead have Broker exchange messages directly via a custom wire protocol. Closing until then.

@Neverlord Neverlord closed this May 25, 2021
@Neverlord Neverlord deleted the topic/neverlord/multi-hop-routing branch November 30, 2022 13:29
Successfully merging this pull request may close these issues:

  • Ordered, reliable channels for store-to-store communication
  • Consider sending only put messages to clones