
Allow for inline remap and filter options #3463

Closed
binarylogic opened this issue Aug 14, 2020 · 3 comments
Labels
domain: config - Anything related to configuring Vector
domain: processing - Anything related to processing Vector's events (parsing, merging, reducing, etc.)
domain: setup - Anything related to setting up or installing Vector
meta: idea - Anything in the idea phase. Needs further discussion and consensus before work can begin.
needs: approval - Needs review & approval before work can begin.
type: enhancement - A value-adding code change that enhances its existing functionality.

Comments

@binarylogic
Contributor

binarylogic commented Aug 14, 2020

I'm submitting this as a mini RFC since I think it helps to express the problem best. An idea that has some overlap with #257 and #406 is the ability to express inline filtering and remapping.

Motivation

Building pipelines in Vector can be very verbose, largely because of the need to perform basic operations like filtering and remapping. For example, the following pipeline attempts to add a type field based on the file the event originated from:

[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]

[transforms.split-by-file]
  type = "swimlanes"
  inputs = ["log-files"]

  [transforms.split-by-file.lanes.nginx-logs]
    "file.eq" = "/var/log/nginx.log"

  [transforms.split-by-file.lanes.postgres-logs]
    "file.eq" = "/var/log/postgres.log"

  [transforms.split-by-file.lanes.app-logs]
    "file.eq" = "/var/log/app.log"

[transforms.add-nginx-type]
  type = "add_fields"
  inputs = ["nginx-logs"]
  fields.type = "nginx"

[transforms.add-postgres-type]
  type = "add_fields"
  inputs = ["postgres-logs"]
  fields.type = "postgres"

[transforms.add-app-type]
  type = "add_fields"
  inputs = ["app-logs"]
  fields.type = "app"

# ...

This config is very verbose.

Proposal

I propose that we allow for filtering and remapping directly within components. We've set a precedent for this with the coerce transform being embedded as a types option within various parsing transforms. We should do something similar with filter and remap:
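(For reference, the existing precedent looks roughly like this. The transform name, regex, and field names below are illustrative, not taken from the issue:)

```toml
# Hypothetical sketch of the existing precedent: coercion embedded as a
# `types` option on a parsing transform, rather than a separate coerce
# transform downstream. Names and regex are made up for the example.
[transforms.parse-status]
  type = "regex_parser"
  inputs = ["log-files"]
  regex = '^(?P<host>[\w\.]+) .* (?P<status>\d+)$'

  # Inline coercion of the captured fields:
  [transforms.parse-status.types]
    status = "int"
    host = "string"
```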

Inlining the filter option

[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]

[transforms.add-nginx-type]
  type = "add_fields"
  inputs = ["log-files"]
  fields.type = "nginx"
  filter."file.eq" = "/var/log/nginx.log"

[transforms.add-postgres-type]
  type = "add_fields"
  inputs = ["log-files"]
  fields.type = "postgres"
  filter."file.eq" = "/var/log/postgres.log"

[transforms.add-app-type]
  type = "add_fields"
  inputs = ["log-files"]
  fields.type = "app"
  filter."file.eq" = "/var/log/app.log"

# ...

Inlining the remap option

And inlining a remap option reduces it even further:

[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]
  remap = """
if .file == "/var/log/nginx.log" {
  .type = "nginx"
} else if .file == "/var/log/postgres.log" {
  .type = "postgres"
} else if .file == "/var/log/app.log" {
  .type = "app"
}
"""

# ...

Rationale

  1. It reduces the boilerplate required to build Vector pipelines, making them easier to read and understand.
  2. It would ease the transition from Logstash pipelines, since Logstash has a similar concept with its conditionals.

Drawbacks

  1. There are now two ways to do the same thing, which could make Vector's config files harder to understand and follow.

Open Questions

  1. Do we want to generalize this solution for all transforms, or continue to cherry-pick specific transforms that we know have value? It's interesting to think about this as a succinct syntax for describing pipelines instead of options.
  2. Are we fighting with TOML? Would we be better off adopting another config syntax that allows for succinct, easier-to-read pipelines?
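(To make the second question concrete, here is roughly how the inline-remap source from above might read in a YAML config. This is purely illustrative of an alternative syntax, not something the issue proposes or that Vector supported at the time:)

```yaml
# Illustrative only: the same inline-remap source sketched in YAML, where
# multi-line strings and nesting arguably read more naturally than in TOML.
sources:
  log-files:
    type: file
    include:
      - /var/log/*.log
    remap: |
      if .file == "/var/log/nginx.log" {
        .type = "nginx"
      } else if .file == "/var/log/postgres.log" {
        .type = "postgres"
      } else if .file == "/var/log/app.log" {
        .type = "app"
      }
```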
binarylogic added the labels type: enhancement, meta: idea, domain: config, needs: approval, domain: setup, domain: processing on Aug 14, 2020
@MOZGIII
Contributor

MOZGIII commented Aug 27, 2020

I like the idea! 👍

Here's a potential additional use case: with Logstash, I used the add_tag functionality a lot to annotate event propagation through the pipeline for later debugging.

I'm talking about something like this:

[transforms.parse_json]
  type = "json_parser"
  inputs = ["my-source-or-transform-id"]
  drop_invalid = false
  remap = """
  if status.is_valid {
    .json_parsed = "yes"
  } else {
    .json_parsed = "no"
  }
  """

I would then look for the events where json_parsed was no and investigate the code that emits invalid JSON.


  1. Do we want to generalize this solution for all transforms or continue to cherry-pick specific transforms that we know have value? It's interesting to think about this as a succinct syntax for describing pipelines instead of options.

Both of those seem to work at the LogEvent (or even Event) level. IMO, it's quite unlikely we'll find any meaningful optimization by hand-writing the implementations at each application site, so I'm for generalization.

In addition, I think it might be worth thinking about other possible enhancements like these two. I feel like this is a very powerful way of composing functionality, a whole extra dimension. It might be useful to revisit the question "how do we want to configure our topology and event pipelines?" once again to get a more coherent view of the design space.

  1. Are we fighting with TOML? Would we be better off adopting another config syntax that allows for succinct, easier-to-read pipelines?

I often feel that I'm fighting TOML once in a while, and not just in Vector. Maybe it's just me, but I think TOML is a complicated format. Nonetheless, it works for us, and it is mostly fine. Adopting another syntax seems like a good idea, especially if we had the ability to mix multiple syntaxes in one multi-file config.

@lukesteensen
Member

I'm pretty sure your example can be simplified to just one remap transform even without allowing any inlining. Does that seem right? If so, the benefit in that case is really just avoiding the overhead of naming the transform and stringing the inputs together properly.
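(For reference, the single-transform version being alluded to would look something like this, sketched against the standalone remap transform; the transform name is illustrative and the syntax is approximate:)

```toml
# Sketch: one named remap transform replaces the swimlanes transform plus
# the three add_fields transforms from the motivating example.
[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]

[transforms.add-type]
  type = "remap"
  inputs = ["log-files"]
  source = """
if .file == "/var/log/nginx.log" {
  .type = "nginx"
} else if .file == "/var/log/postgres.log" {
  .type = "postgres"
} else if .file == "/var/log/app.log" {
  .type = "app"
}
"""
```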

@binarylogic
Contributor Author

binarylogic commented Aug 31, 2020

That's true. I'm happy to start there as a middle ground and see where we get. Closing as a result.
