Implement a generic aggregation transform step #2076
Awesome to see the Arrow Compute API being leveraged!
`time-window: 10 seconds`
Can we rename this to something less confusing? E.g., `timestamp-granularity` and `duration-granularity`? Or `s/granularity/resolution/`?
Fine with me! Made a minor comment, also assuming you're going to repair the build 🗡️
Digging through the Arrow APIs, I found https://arrow.apache.org/docs/dev/cpp/compute.html#grouped-aggregations-group-by, which may one day become an off-the-shelf way to do this without explicitly handling the grouping ourselves.
Great to see that generic and powerful aggregation plugin 👍
@dispanser the main issue is that this only supports grouping on a single column, which is insufficient for our use cases.
This implements a generic aggregation transform step plugin. E.g., to configure it for flow aggregation on export, the following may be used:

```yaml
vast:
  transforms:
    suricata-flow-aggregate:
      - aggregate:
          time-window: 10 seconds
          group-by:
            - timestamp
            - src_ip
            - dest_ip
            - dest_port
            - proto
            - event_type
          sum:
            - pcap_cnt
            - flow.pkts_toserver
            - flow.pkts_toclient
            - flow.bytes_toserver
            - flow.bytes_toclient
          min:
            - flow.start
          max:
            - flow.end
          any:
            - flow.alerted
          # all:
          layout-name: suricata.aggregated_flow
      - replace:
          field: event_type
          value: aggregated_flow
  transform-triggers:
    export:
      - transform: suricata-flow-aggregate
        location: client
        events:
          - suricata.flow
          - suricata.aggregated_flow
```
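To make the options in the configuration above concrete, here is a minimal Python sketch of one plausible reading of them. It is not the plugin's implementation (which is C++ and operates on Arrow record batches). It assumes that `time-window` rounds time values in the `group-by` columns down to a multiple of the window, and that `sum`, `min`, `max`, and `any` fold the listed columns within each group. The names `CONFIG`, `FOLDS`, `floor_time`, and `aggregate`, as well as the shortened column names, are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical, simplified mirror of the YAML options (shortened column names).
CONFIG = {
    "time-window": timedelta(seconds=10),
    "group-by": ["timestamp", "src_ip"],
    "sum": ["pkts"],
    "min": ["start"],
    "max": ["end"],
    "any": ["alerted"],
}

# One binary fold per aggregation option.
FOLDS = {
    "sum": lambda a, b: a + b,
    "min": min,
    "max": max,
    "any": lambda a, b: a or b,
}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def floor_time(ts, window):
    """Round a timezone-aware timestamp down to a multiple of the window.

    Assumed semantics of `time-window`; the source does not spell them out.
    """
    return ts - (ts - EPOCH) % window


def aggregate(rows, config):
    """Group rows (dicts) by the group-by columns and fold the other columns."""
    window = config["time-window"]
    groups = {}  # group key -> accumulated output row, in first-seen order
    for row in rows:
        key = tuple(
            floor_time(row[col], window) if isinstance(row[col], datetime) else row[col]
            for col in config["group-by"]
        )
        acc = groups.get(key)
        if acc is None:
            # First row of a group: seed the accumulator with its values.
            acc = dict(zip(config["group-by"], key))
            for op in FOLDS:
                for col in config.get(op, []):
                    acc[col] = row[col]
            groups[key] = acc
        else:
            for op, fold in FOLDS.items():
                for col in config.get(op, []):
                    acc[col] = fold(acc[col], row[col])
    return list(groups.values())
```

Under these assumptions, two flows from the same `src_ip` whose timestamps fall into the same 10-second bucket collapse into one row, with packet counts added up and `alerted` set if either flow alerted.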
This is a rewrite of the aggregate transform step and its underlying algorithm to have two passes, each of which can be disabled: an eager pass when adding record batches, and a lazy pass when retrieving the results. Additionally, it refactors the per-layout aggregation into a separate data structure that is easier to maintain, and adds explanatory comments throughout the file.
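The two-pass structure described in this commit message can be illustrated with a small Python sketch. It is only a guess at the shape of the algorithm, not the plugin's code: it assumes the eager pass aggregates within each incoming batch and the lazy pass merges across all buffered batches when results are requested. The class `TwoPassSum`, its flags, and the representation of a batch as a list of `(key, value)` pairs are hypothetical.

```python
from collections import defaultdict


class TwoPassSum:
    """Sums values per key, with an optional eager and an optional lazy pass."""

    def __init__(self, eager=True, lazy=True):
        self.eager = eager
        self.lazy = lazy
        self.buffered = []  # batches held until finish() is called

    def add(self, batch):
        # Eager pass (assumed): shrink each batch on arrival so less is buffered.
        self.buffered.append(self._combine([batch]) if self.eager else batch)

    def finish(self):
        # Lazy pass (assumed): merge across all buffered batches on retrieval.
        result = [self._combine(self.buffered)] if self.lazy else self.buffered
        self.buffered = []
        return result

    @staticmethod
    def _combine(batches):
        # Sum the values of equal keys across the given batches.
        totals = defaultdict(int)
        for batch in batches:
            for key, value in batch:
                totals[key] += value
        return sorted(totals.items())


agg = TwoPassSum()
agg.add([("a", 1), ("a", 2), ("b", 3)])
agg.add([("a", 4)])
merged = agg.finish()  # a single batch with one row per key
```

With `lazy=False`, each batch is only aggregated on its own, so the same key may still appear in several output batches. With both passes disabled, batches pass through unchanged.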
This fixes some minor issues in the aggregate plugin noticed during a review session.
Looks great, thanks!
This implements a generic aggregation transform step plugin. See the attached README file for more information.
📝 Checklist
🎯 Review Instructions
File-by-file. Go over the algorithm in a call with me. Test locally alongside compaction.