Skip to content

Summarize operator with pluggable aggregation functions #2417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jul 11, 2022

Conversation

dominiklohmann
Copy link
Member

@dominiklohmann dominiklohmann commented Jul 9, 2022

This rewrites the summarize pipeline operator to have pluggable aggregation functions through the new aggregation_function_plugin plugin type, which each implement a simple virtual aggregation_function interface for incremental aggregation.

In addition, this converts both the summarize plugin and the existing aggregation functions to native plugins, making them always available when using VAST.

The configuration for the aggregation also changed, allowing users to specify output field names explicitly and aggregating multiple columns into one. This is the new configuration that I primarily used for testing with suricata.flow events:

summarize:
  group-by:
    - timestamp
    - proto
    - event_type
  time-resolution: 1 hour
  aggregate:
    timestamp_min:
      min: timestamp
    timestamp_max:
      max: timestamp
    pkts_toserver: sum
    pkts_toclient: sum
    bytes_toserver: sum
    bytes_toclient: sum
    start: min
    end: max
    alerted: any
    ips:
      distinct:
        - src_ip
        - dest_ip

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on vast.io, if necessary.
    • Document summarize changes.
    • Document aggregation_function_plugin plugin type.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

I recommend going file-by-file, and testing locally with compaction. The following order is ideal:

  1. Review the API of aggregation_function_plugin.
  2. Review the API of aggregation_function
  3. Review the summarize pipeline operator implementation.
  4. Review the summarize configuration, and how it's bound to a schema to form an aggregation.
  5. Review the individual aggregation function plugins.

This PR also includes the work from #2391, which I've closed accordingly.

@dominiklohmann dominiklohmann added the feature New functionality label Jul 9, 2022
@dominiklohmann dominiklohmann force-pushed the story/sc-35104/next-gen-summarize branch from 02ea344 to 9723975 Compare July 9, 2022 10:33
@dominiklohmann dominiklohmann force-pushed the story/sc-35104/next-gen-summarize branch 2 times, most recently from 754785a to fb30b6f Compare July 9, 2022 11:52
This rewrites the `summarize` pipeline operator to have pluggable aggregation
functions through the new `aggregation_function_plugin` plugin type, which each
implement a simple virtual `aggregation_function` interface for incremental
aggregation.

In addition, this converts both the summarize plugin and the existing
aggregation functions to native plugins, making them always available when using
VAST.

The configuration for the aggregation also changed, allowing users to specify
output field names explicitly and aggregating multiple columns into one. This is
the new configuration that I primarily used for testing with `suricata.flow`
events:

```yaml
summarize:
  group-by:
    - timestamp
    - proto
    - event_type
  time-resolution: 1 hour
  aggregate:
    timestamp_min:
      min: timestamp
    timestamp_max:
      max: timestamp
    pkts_toserver: sum
    pkts_toclient: sum
    bytes_toserver: sum
    bytes_toclient: sum
    start: min
    end: max
    alerted: any
    ips:
      distinct:
        - src_ip
        - dest_ip
```
@dominiklohmann dominiklohmann force-pushed the story/sc-35104/next-gen-summarize branch from fb30b6f to 0973775 Compare July 9, 2022 12:05
@mavam
Copy link
Member

mavam commented Jul 9, 2022

Just wondering: should sample take an argument n that specifies the sample size? Right now, we have n = 1 hard-coded. To me, a sample function suggests that it's configurable in its desired output length.

@dominiklohmann
Copy link
Member Author

Just wondering: should sample take an argument n that specifies the sample size? Right now, we have n = 1 hard-coded. To me, a sample function suggests that it's configurable in its desired output length.

I don't think so to be honest, at least not for the foreseeable future while we still use the YAML based configuration. It's hard to fit additional parametrization for the aggregations functions in it, and I'd want to wait with doing so until we have an actual use case or request for it.

@dominiklohmann dominiklohmann requested a review from dispanser July 11, 2022 10:11
@dominiklohmann dominiklohmann marked this pull request as ready for review July 11, 2022 10:11
Copy link
Contributor

@dispanser dispanser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked over this together, minor comments inline

This did concatenation, which was certainly not expected.
@dominiklohmann dominiklohmann enabled auto-merge July 11, 2022 13:50
@dominiklohmann dominiklohmann merged commit 1da1c41 into master Jul 11, 2022
@dominiklohmann dominiklohmann deleted the story/sc-35104/next-gen-summarize branch July 11, 2022 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants