
Add new ALM transport #89

Closed
wants to merge 106 commits

Conversation

Neverlord
Member

@Neverlord Neverlord commented Feb 16, 2020

Currently, Broker requires its users to provide loop-free deployments. The new ALM transport makes it much easier to deploy Broker, since Broker no longer simply floods published data but instead discovers and manages available paths.

I've added a new doc/devs.rst file to the documentation to cover how Broker works internally as well as how the (new) code is structured. At this point, feedback on this would be most welcome.

Remaining ToDos:

  • Extend the routing table to store paths rather than distances
  • Implement source routing
  • Add path revocation to propagate loss of connectivity
  • Implement ordering & loss detection (for data stores)
  • Implement re-attaching of clones on message loss (not essential, implement as follow-up)
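To illustrate the first three items, here is a minimal, hypothetical sketch of a routing table that stores full paths instead of distances, picks a path for source routing, and revokes paths on lost connectivity. All names and types here are assumptions for illustration; Broker's actual routing table differs in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical endpoint ID; Broker uses its own endpoint ID type.
using endpoint_id = uint64_t;

// One entry per reachable peer: all known paths to it, instead of a
// single distance value.
struct routing_table {
  std::map<endpoint_id, std::vector<std::vector<endpoint_id>>> paths;

  void add_path(endpoint_id dst, std::vector<endpoint_id> hops) {
    auto& known = paths[dst];
    if (std::find(known.begin(), known.end(), hops) == known.end())
      known.push_back(std::move(hops));
  }

  // Source routing: pick the shortest known path to dst.
  const std::vector<endpoint_id>* shortest_path(endpoint_id dst) const {
    auto i = paths.find(dst);
    if (i == paths.end() || i->second.empty())
      return nullptr;
    return &*std::min_element(i->second.begin(), i->second.end(),
                              [](const auto& a, const auto& b) {
                                return a.size() < b.size();
                              });
  }

  // Path revocation: drop every path that traverses the lost hop.
  void revoke(endpoint_id lost_hop) {
    for (auto& [dst, known] : paths)
      known.erase(std::remove_if(known.begin(), known.end(),
                                 [&](const auto& p) {
                                   return std::find(p.begin(), p.end(),
                                                    lost_hop) != p.end();
                                 }),
                  known.end());
    paths.erase(lost_hop);
  }
};
```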

@rsmmr
Member

rsmmr commented Feb 24, 2020

I'm seeing compiler errors:

```
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/sec_strings.cpp:105:15: error: no member named 'remote_lookup_failed' in
'caf::sec'; did you mean 'remote_linking_failed'?
case sec::remote_lookup_failed:
~~~~~^~~~~~~~~~~~~~~~~~~~
remote_linking_failed
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/caf/sec.hpp:92:3: note: 'remote_linking_failed' declared here
remote_linking_failed,
^
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/sec_strings.cpp:105:10: error: duplicate case value 'remote_linking_failed'
case sec::remote_lookup_failed:
^
/Users/robin/bro/master/aux/broker/build/caf-build/libcaf_core/sec_strings.cpp:69:10: note: previous case defined here
case sec::remote_linking_failed:
^
2 errors generated.
```

Submodules are up to date, am I missing something else?

@rsmmr
Member

rsmmr commented Feb 24, 2020

Focused on the new dev guide for now. That's nice, certainly very helpful. A couple of suggestions:

  • I'd turn the sections around and start with the architecture; the ALM section can then refer back to it as needed.

  • A couple diagrams would be helpful: one for the architecture showing the main components and data flows (core actor, mixins, etc.; also where it goes down into CAF); and one (or two) for the flow of messages, ideally with an example of what happens as they get forwarded.

I'll leave some more smaller comments/questions on the RST code.

@Neverlord
Member Author

Thanks for the feedback! Just a quick note (content updates in progress):

Submodules are up to date, am I missing something else?

Sorry, the 3rdparty submodule was lagging behind. Should work now.

@Neverlord
Member Author

@rsmmr I've integrated your feedback on the devs section. Let me know what you think of the additions / changes and how I can improve the section further. 🙂

@rsmmr
Member

rsmmr commented Mar 18, 2020

Nice, thanks for the updates to the devs section. The ALM description all sounds good, and the architecture section is quite helpful - I feel like I'm finally starting to understand Broker ;-)

@rsmmr
Member

rsmmr commented Mar 18, 2020

(... and no further particular comments)

@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from 8bc3796 to d8d1db8 on March 27, 2020 14:03
@Neverlord
Member Author

A quick update I wanted to share: while some unit tests still fail (working on it), the ALM branch runs stably enough for a first quick performance comparison with current master. I used the broker-benchmark tool with Broker compiled in Release mode:

broker-benchmark -r 100000000 -s 1 -t 2 localhost:9090

For a quick overview, I let it run for some time, then took 30 samples (the server's output of received messages per second) for each branch. Here are the results:

Branch Median Average
master 31,588.5 30,992.6
ALM 32,471.0 32,368.5

I wouldn't read too much into this, since the values fluctuate. However, the takeaway is that the new source routing seems to have no negative effect on performance.

Here is the raw data I've compiled this from: 2020-03-28 Broker ALM branch comparison.txt.

Of course, most insights will come when looking at a full cluster. Once I have ported the new cluster benchmark, I'll do a more thorough comparison.

@Neverlord
Member Author

@rsmmr I was thinking about a path forward for this PR. As it stands, I think it is hard to review / integrate in its entirety. The branch contains several big changes to the entire code base plus a new communication backend. I see two options we could choose: 1) do some incremental reviews (you've already done one) and eventually merge everything at once, or 2) separate refactoring from the actual new features with individual PRs.

For option two, I would factor out at least these two steps as separate PRs:

  • The switch to a mixin-based design. This encompasses the first chunk of commits minus the routing table.
  • Extending status with the new codes endpoint_discovered and endpoint_unreachable. This affected more code than I originally thought (almost done with that refactoring). While they'd be unused in master for a while, I think we could still merge this change ahead of the actual ALM implementation.

Personally, I favor option 2. Aside from making reviews more manageable and focused, merging individual parts earlier also helps avoid this PR drifting out of sync with master.
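For illustration, a hedged sketch of what such a status-code extension might look like as an enum; only endpoint_discovered and endpoint_unreachable come from the discussion above, while the surrounding values and the to_string helper are assumptions (Broker's real status codes carry more values and live elsewhere).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch only: not Broker's actual status-code enum.
enum class sc : uint8_t {
  unspecified,          // assumed pre-existing code
  peer_added,           // assumed pre-existing code
  peer_removed,         // assumed pre-existing code
  endpoint_discovered,  // new: a (possibly non-adjacent) endpoint became reachable
  endpoint_unreachable, // new: the last known path to an endpoint vanished
};

// Render the two new codes for logging; everything else is collapsed here.
std::string to_string(sc code) {
  switch (code) {
    case sc::endpoint_discovered:
      return "endpoint_discovered";
    case sc::endpoint_unreachable:
      return "endpoint_unreachable";
    default:
      return "other";
  }
}
```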

@rsmmr
Member

rsmmr commented Apr 28, 2020

Yeah, let's go with Option 2, and merge the refactoring & static notifications first.

If you can split things out further into more, smaller chunks, that would be worth the effort; both for the refactoring and then for the new functionality. We'll review each to the degree we can, with a particular focus on not breaking existing cluster topologies.

Also, let's remain flexible on timeline: The closer we get to the release, the more risky a merge of complex changes will be. Depending on how things progress, a viable model could be getting the foundational commits in before 2.2, and then the new logic after the release early in the next cycle. Let's see.

@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from 94c4c4b to 7417b33 on May 31, 2020 07:09
@Neverlord Neverlord added this to the Publish Broker events over ALM milestone Jun 12, 2020
@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from d227221 to bdc21b5 on June 14, 2020 08:54
@Neverlord
Member Author

While porting the data store actors to the new channels, I've also streamlined the communication between master and clone. So this PR closes #99 and closes #125.

@Neverlord Neverlord force-pushed the topic/neverlord/multi-hop-routing branch from 7669595 to 059cc4f on February 23, 2021 14:40
By using a common prefix for all actors in Broker, we can more easily
enable metrics collection and also make it easier to find relevant
actors in log output.
@Neverlord
Member Author

After re-integrating the upstream changes, I did some benchmarking with broker-benchmark today.

As a baseline, here are 5 minutes of the current master version of Broker. It sits at around ~150k events per second on my local machine:

[Screenshot: 2021-03-14 at 13 43 21]

With the ALM branch, I can see performance go down to ~80k events per second:

[Screenshot: 2021-03-14 at 13 43 42]

The "Message Processing Time" panel is a heat map. With metrics enabled, CAF measures how long an actor takes to process its input messages and then puts each sample into its (configurable) bucket.

On the master branch, we can see that the core actor in Broker processes most messages (a batch is counted as a single message) in under 10µs. In the ALM branch, most messages now take between 10µs and 100µs instead.
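The bucketing can be sketched as follows; the bucket bounds here are made up for illustration (CAF's are configurable), and this is not CAF's actual histogram implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal sketch of a bucketed processing-time histogram like the one
// behind the heat map. Bounds are upper limits in microseconds; the last
// counter collects everything above the largest bound.
struct processing_time_histogram {
  static constexpr std::array<int64_t, 4> bounds{1, 10, 100, 1000};
  std::array<uint64_t, 5> counts{}; // one per bound, plus overflow

  void observe(int64_t us) {
    for (std::size_t i = 0; i < bounds.size(); ++i)
      if (us <= bounds[i]) {
        ++counts[i];
        return;
      }
    ++counts.back(); // sample exceeded all bounds
  }
};
```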

Some computational overhead is unavoidable with an overlay, but I'll look into where Broker spends most of the extra time to see whether we can reduce the cost by choosing different data structures or algorithms.

The measurements above used the default message type in broker-benchmark, which is just a vector containing some small dummy data. As a comparison, here are two runs with message type 2, which tries to resemble Zeek log lines.

Due to the much larger data per element, master sits at around 40k events per second:

[Screenshot: 2021-03-14 at 14 21 56]

The ALM branch is still slower, hovering above 30k events per second:

[Screenshot: 2021-03-14 at 14 13 19]

The gap isn't as big this time, since the per-element overhead is smaller compared to the overall cost of sending larger elements. Even in this version, we can see that a larger percentage of elements lands in the >10µs bucket.

Since a real Zeek cluster is probably closer to the second run, the ALM version of Broker wouldn't cut peak performance in half. This benchmark only considers a one-to-one setup at the moment in order to look at peak performance. Ideally, we get the ALM performance a bit closer to the current Broker version, or at least better understand the tradeoffs we are making (i.e., where we lose the time and whether it's worth it), before merging / integrating into Zeek.

@rsmmr rsmmr modified the milestones: Zeek 4.0, Zeek 4.1 Mar 30, 2021
@Neverlord
Member Author

With the ALM branch, I can see performance go down to ~80k events per second:

Broker now contains a new micro-benchmark suite that measures individual building blocks of Broker, such as serialization overhead and "raw" streaming performance. With the new performance data, I could pinpoint where the drop in performance came from: it turned out to be the extra data we need to attach to each Broker event for source routing (i.e., a multipath).

Just optimizing memory allocation strategies and merging the receiver list into the multipath got me back to slightly above 100k events per second.

@rsmmr suggested that maybe Broker could do some clever batching. Broker already batches implicitly via CAF streams, so I'd be cautious about adding another batching layer on top (especially since Zeek already does another level of batching on top of Broker). However, the suggestion got me thinking about where the performance benefits of batching done by Broker might come from. Essentially, Broker could avoid redundant data on the wire by first sending the source routing information and then the data. A Broker batch looks somewhat like this (ignoring command messages for simplicity):

Topic Data Path
t1 v1 p1
t1 v2 p2
t2 v3 p1

To CAF, a batch is simply a std::vector<broker::node_message>. The default serialization puts a lot of redundant information on the wire if events share paths or topics. Besides the redundancy on the wire, we also waste time by serializing and deserializing identical objects multiple times.

So I've added a custom type inspection specialization for std::vector<broker::node_message> that splits the data into three parts: topics, paths, and data. The last part then references topic and path only by "key" (really just the index into the topics/paths lists). With this optimization, I see ~130k events per second in the Broker benchmark.

Of course, the benchmark is the absolute best case for Broker, since all events share the same topic and the same path. In the worst case (no shared topics and no shared paths), this new encoding would actually add overhead. However, I think it's unlikely in practice that there is no sharing at all, except for very small batches with just a couple of events that Broker/CAF sends after the maximum batch delay. We don't need to optimize for that case, because it means the system is mostly idle.
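A minimal sketch of this index-based batch encoding. The node_message layout and all names here are simplified assumptions for illustration; Broker's real node_message and the CAF inspector API differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flattened message; Broker's node_message carries more state.
struct node_message {
  std::string topic;
  std::string path; // stand-in for the serialized multipath
  std::string data;
};

// Deduplicated wire format: topics and paths are stored once, entries
// reference them by index.
struct encoded_batch {
  std::vector<std::string> topics;
  std::vector<std::string> paths;
  struct entry {
    uint32_t topic_index;
    uint32_t path_index;
    std::string data;
  };
  std::vector<entry> entries;
};

// Return the index of x in pool, appending it if not present.
static uint32_t intern(std::vector<std::string>& pool, const std::string& x) {
  for (uint32_t i = 0; i < pool.size(); ++i)
    if (pool[i] == x)
      return i;
  pool.push_back(x);
  return static_cast<uint32_t>(pool.size() - 1);
}

encoded_batch encode(const std::vector<node_message>& batch) {
  encoded_batch result;
  for (const auto& msg : batch)
    result.entries.push_back({intern(result.topics, msg.topic),
                              intern(result.paths, msg.path), msg.data});
  return result;
}
```

For the three-row batch shown above, this yields two topics and two paths on the wire instead of three of each, and each repeated object is serialized only once.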

@Neverlord
Member Author

As discussed with @rsmmr today, we are going to make one more change before this is ready to merge: get rid of the remote actor indirections and instead have Broker exchange messages directly via a custom wire protocol. Closing until then.

@Neverlord Neverlord closed this May 25, 2021
@Neverlord Neverlord deleted the topic/neverlord/multi-hop-routing branch November 30, 2022 13:29
Successfully merging this pull request may close these issues:

  • Ordered, reliable channels for store-to-store communication
  • Consider sending only put messages to clones