Skip to content

Implement a generic aggregation transform step #2076

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Feb 15, 2022

Conversation

dominiklohmann
Copy link
Member

@dominiklohmann dominiklohmann commented Feb 9, 2022

This implements a generic aggregation transform step plugin. See the attached README file for more information.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

File-by-file. Go over the algorithm in a call with me. Test locally alongside compaction.

@dominiklohmann dominiklohmann added the feature New functionality label Feb 9, 2022
@dominiklohmann dominiklohmann requested a review from lava February 9, 2022 20:45
@dominiklohmann dominiklohmann force-pushed the topic/generic-aggregation-plugin branch from b47fbc8 to 19f4bdc Compare February 9, 2022 20:50
@mavam
Copy link
Member

mavam commented Feb 10, 2022

Awesome to see the Arrow Compute API being leveraged!

@dominiklohmann dominiklohmann force-pushed the topic/generic-aggregation-plugin branch 2 times, most recently from 0dbf8a2 to 1ae64b0 Compare February 10, 2022 15:55
Copy link
Member

@mavam mavam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

          time-window: 10 seconds

Can we rename this to something less confusing? E.g., timestamp-granularity and duration-granularity? Or s/granularity/resolution/?

Copy link
Contributor

@dispanser dispanser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me! Made a minor comment, also assuming you're going to repair the build 🗡️

Digging through the arrow APIs, I found https://arrow.apache.org/docs/dev/cpp/compute.html#grouped-aggregations-group-by which may one day become an off-the-shelf way to do this without explicitly handling the grouping ourselves.

Great to see that generic and powerful aggregation plugin 👍

@dominiklohmann
Copy link
Member Author

Digging through the arrow APIs, I found https://arrow.apache.org/docs/dev/cpp/compute.html#grouped-aggregations-group-by which may one day become an off-the-shelf way to do this without explicitly handling the grouping ourselves.

@dispanser the main issue is that this only supports grouping on a single column, which is insufficient for our use cases.

This implements a generic aggregation transform step plugin. E.g., to configure
it for flow aggregation on export, the following may be used:

```yaml
vast:
  transforms:
    suricata-flow-aggregate:
      - aggregate:
          time-window: 10 seconds
          group-by:
            - timestamp
            - src_ip
            - dest_ip
            - dest_port
            - proto
            - event_type
          sum:
            - pcap_cnt
            - flow.pkts_toserver
            - flow.pkts_toclient
            - flow.bytes_toserver
            - flow.bytes_toclient
          min:
            - flow.start
          max:
            - flow.end
          any:
            - flow.alerted
          # all:
          layout-name: suricata.aggregated_flow
      - replace:
          field: event_type
          value: aggregated_flow
  transform-triggers:
    export:
      - transform: suricata-flow-aggregate
        location: client
        events:
          - suricata.flow
          - suricata.aggregated_flow
```
This is a rewrite of the aggregate transform step and its underlying algorithm
to have two passes, each of which can be disabled: An eager pass when adding
record batches, and a lazy pass when retrieving the results. Additionally, it
refactors the per-layout aggregation into a separate data structure that is
easier to maintain, and adds explanatory comments throughout the file.
This fixes some minor issues in the aggregate plugin noticed during a review
session.
@dominiklohmann dominiklohmann force-pushed the topic/generic-aggregation-plugin branch from 11893a9 to 1a8ab52 Compare February 14, 2022 15:16
Copy link
Member

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@dominiklohmann dominiklohmann force-pushed the topic/generic-aggregation-plugin branch from 0fb7540 to 0c98662 Compare February 15, 2022 11:06
@dominiklohmann dominiklohmann merged commit bcf21cd into master Feb 15, 2022
@dominiklohmann dominiklohmann deleted the topic/generic-aggregation-plugin branch February 15, 2022 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants