Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a generic aggregation transform step #2076

Merged
merged 16 commits into from Feb 15, 2022

Conversation

dominiklohmann
Copy link
Member

@dominiklohmann dominiklohmann commented Feb 9, 2022

This implements a generic aggregation transform step plugin. See the attached README file for more information.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

File-by-file. Go over the algorithm in a call with me. Test locally alongside compaction.

@dominiklohmann dominiklohmann added the feature New functionality label Feb 9, 2022
@mavam
Copy link
Member

mavam commented Feb 10, 2022

Awesome to see the Arrow Compute API being leveraged!

@dominiklohmann dominiklohmann force-pushed the topic/generic-aggregation-plugin branch 2 times, most recently from 0dbf8a2 to 1ae64b0 Compare February 10, 2022 15:55
plugins/aggregate/aggregate.cpp Outdated Show resolved Hide resolved
plugins/aggregate/aggregate.cpp Outdated Show resolved Hide resolved
plugins/aggregate/aggregate.cpp Outdated Show resolved Hide resolved
Copy link
Member

@mavam mavam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

          time-window: 10 seconds

Can we rename this to something less confusing? E.g., timestamp-granularity and duration-granularity? Or s/granularity/resolution/?

Copy link
Contributor

@dispanser dispanser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me! Made a minor comment, also assuming you're going to repair the build 🗡️

Digging through the arrow APIs, I found https://arrow.apache.org/docs/dev/cpp/compute.html#grouped-aggregations-group-by which may one day become an off-the-shelf way to do this without explicitly handling the grouping ourselves.

Great to see that generic and powerful aggregation plugin 👍

plugins/aggregate/aggregate.cpp Outdated Show resolved Hide resolved
@dominiklohmann
Copy link
Member Author

Digging through the arrow APIs, I found https://arrow.apache.org/docs/dev/cpp/compute.html#grouped-aggregations-group-by which may one day become an off-the-shelf way to do this without explicitly handling the grouping ourselves.

@dispanser the main issue is that this only supports grouping on a single column, which is insufficient for our use cases.

This implements a generic aggregation transform step plugin. E.g., to configure
it for flow aggregation on export, the following may be used:

```yaml
vast:
  transforms:
    suricata-flow-aggregate:
      - aggregate:
          time-window: 10 seconds
          group-by:
            - timestamp
            - src_ip
            - dest_ip
            - dest_port
            - proto
            - event_type
          sum:
            - pcap_cnt
            - flow.pkts_toserver
            - flow.pkts_toclient
            - flow.bytes_toserver
            - flow.bytes_toclient
          min:
            - flow.start
          max:
            - flow.end
          any:
            - flow.alerted
          # all:
          layout-name: suricata.aggregated_flow
      - replace:
          field: event_type
          value: aggregated_flow
  transform-triggers:
    export:
      - transform: suricata-flow-aggregate
        location: client
        events:
          - suricata.flow
          - suricata.aggregated_flow
```
This is a rewrite of the aggregate transform step and its underlying algorithm
to have two passes, each of which can be disabled: An eager pass when adding
record batches, and a lazy pass when retrieving the results. Additionally, it
refactors the per-layout aggregation into a separate data structure that is
easier to maintain, and adds explanatory comments throughout the file.
This fixes some minor issues in the aggregate plugin noticed during a review
session.
Copy link
Member

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

libvast/src/transform_steps/rename.cpp Outdated Show resolved Hide resolved
plugins/aggregate/README.md Outdated Show resolved Hide resolved
plugins/aggregate/aggregate.cpp Show resolved Hide resolved
plugins/aggregate/tests/aggregate.cpp Outdated Show resolved Hide resolved
.github/workflows/vast.yaml Outdated Show resolved Hide resolved
plugins/aggregate/README.md Outdated Show resolved Hide resolved
plugins/aggregate/README.md Outdated Show resolved Hide resolved
@dominiklohmann dominiklohmann merged commit bcf21cd into master Feb 15, 2022
@dominiklohmann dominiklohmann deleted the topic/generic-aggregation-plugin branch February 15, 2022 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality
Projects
None yet
4 participants