
Allow for inline remap and filter options #3463

Closed
binarylogic opened this issue Aug 14, 2020 · 3 comments
Labels
domain: config - Anything related to configuring Vector
domain: processing - Anything related to processing Vector's events (parsing, merging, reducing, etc.)
domain: setup - Anything related to setting up or installing Vector
meta: idea - Anything in the idea phase. Needs further discussion and consensus before work can begin.
needs: approval - Needs review & approval before work can begin.
type: enhancement - A value-adding code change that enhances its existing functionality.

Comments

@binarylogic
Contributor

binarylogic commented Aug 14, 2020

I'm submitting this as a mini RFC since I think it helps to express the problem best. An idea that has some overlap with #257 and #406 is the ability to express inline filtering and remapping.

Motivation

Building pipelines in Vector can be very verbose, largely because of the need to perform basic operations like filtering and remapping. For example, the following pipeline attempts to add a type field based on the file the event originated from:

[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]

[transforms.split-by-file]
  type = "swimlanes"
  inputs = ["log-files"]

  [transforms.split-by-file.lanes.nginx-logs]
    "file.eq" = "/var/log/nginx.log"

  [transforms.split-by-file.lanes.postgres-logs]
    "file.eq" = "/var/log/postgres.log"

  [transforms.split-by-file.lanes.app-logs]
    "file.eq" = "/var/log/app.log"

[transforms.add-nginx-type]
  type = "add_fields"
  inputs = ["nginx-logs"]
  fields.type = "nginx"

[transforms.add-postgres-type]
  type = "add_fields"
  inputs = ["postgres-logs"]
  fields.type = "postgres"

[transforms.add-app-type]
  type = "add_fields"
  inputs = ["app-logs"]
  fields.type = "app"

# ...

This config is very verbose.

Proposal

I propose that we allow for filtering and remapping directly within components. We've set a precedent for this with the coerce transform being embedded as a types option within various parsing transforms. We should do something similar with filter and remap:
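(For reference, the existing precedent looks roughly like this. The transform name, regex, and field names below are illustrative, not taken from the issue:)

```toml
# Hypothetical sketch of the existing precedent: coercion embedded as a
# `types` option on a parsing transform, rather than a separate coerce
# transform downstream. Names and regex are made up for the example.
[transforms.parse-status]
  type = "regex_parser"
  inputs = ["log-files"]
  regex = '^(?P<host>[\w\.]+) .* (?P<status>\d+)$'

  # Inline coercion of the captured fields:
  [transforms.parse-status.types]
    status = "int"
    host = "string"
```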

Inlining the filter option

[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]

[transforms.add-nginx-type]
  type = "add_fields"
  inputs = ["log-files"]
  fields.type = "nginx"
  filter."file.eq" = "/var/log/nginx.log"

[transforms.add-postgres-type]
  type = "add_fields"
  inputs = ["log-files"]
  fields.type = "postgres"
  filter."file.eq" = "/var/log/postgres.log"

[transforms.add-app-type]
  type = "add_fields"
  inputs = ["log-files"]
  fields.type = "app"
  filter."file.eq" = "/var/log/app.log"

# ...

Inlining the remap option

And inlining a remap option reduces it even further:

[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]
  remap = """
if .file == "/var/log/nginx.log" {
  .type = "nginx"
} else if .file == "/var/log/postgres.log" {
  .type = "postgres"
} else if .file == "/var/log/app.log" {
  .type = "app"
}
"""

# ...

Rationale

  1. It reduces the boilerplate required to build Vector pipelines, making them easier to read and understand.
  2. It would ease the transition from Logstash pipelines, since Logstash has a similar concept with its conditionals.

Drawbacks

  1. There are now two ways to do the same thing, which could make Vector's config files harder to understand and follow.

Open Questions

  1. Do we want to generalize this solution for all transforms, or continue to cherry-pick specific transforms that we know have value? It's interesting to think about this as a succinct syntax for describing pipelines instead of options.
  2. Are we fighting with TOML? Would we be better off adopting another config syntax that allows for succinct, easier-to-read pipelines?
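(To make the second question concrete, here is roughly how the inline-remap source from above might read in a YAML config. This is purely illustrative of an alternative syntax, not something the issue proposes or that Vector supported at the time:)

```yaml
# Illustrative only: the same inline-remap source sketched in YAML, where
# multi-line strings and nesting arguably read more naturally than in TOML.
sources:
  log-files:
    type: file
    include:
      - /var/log/*.log
    remap: |
      if .file == "/var/log/nginx.log" {
        .type = "nginx"
      } else if .file == "/var/log/postgres.log" {
        .type = "postgres"
      } else if .file == "/var/log/app.log" {
        .type = "app"
      }
```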
binarylogic added the labels type: enhancement, meta: idea, domain: config, needs: approval, domain: setup, domain: processing on Aug 14, 2020
@MOZGIII
Contributor

MOZGIII commented Aug 27, 2020

I like the idea! 👍

Here's a potential additional use case: with Logstash, I used the add_tag functionality a lot to annotate event propagation through the pipeline for later debugging.

I'm talking about something like this:

[transforms.parse_json]
  type = "json_parser"
  inputs = ["my-source-or-transform-id"]
  drop_invalid = false
  remap = """
  if status.is_valid {
    .json_parsed = "yes"
  } else {
    .json_parsed = "no"
  }
  """

I would then look for the events where json_parsed was no and investigate the code that emits invalid JSON.


  1. Do we want to generalize this solution for all transforms or continue to cherry-pick specific transforms that we know have value? It's interesting to think about this as a succinct syntax for describing pipelines instead of options.

Both of those seem to work at the LogEvent (or even Event) level. IMO, it's quite unlikely we'll find any meaningful optimization by hand-writing the implementations at each application site, so I'm for generalization.

In addition, I think it might be worth thinking about other possible enhancements like these two. I feel like this is a very powerful way of composing functionality, a whole extra dimension. It might be useful to revisit the question "how do we want to configure our topology and event pipelines?" once again to get a more coherent view of the design space.

  1. Are we fighting with TOML? Would we be better off adopting another config syntax that allows for succinct, easier-to-read pipelines?

I often feel that I'm fighting TOML once in a while, and not just in Vector. Maybe it's just me, but I think TOML is a complicated format. Nonetheless, it works for us, and it is mostly fine. Adopting another syntax seems like a good idea, especially if we had the ability to mix multiple syntaxes in one multi-file config.

@lukesteensen
Member

I'm pretty sure your example can be simplified to just one remap transform even without allowing any inlining. Does that seem right? If so, the benefit in that case is really just avoiding the overhead of naming the transform and stringing the inputs together properly.
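(For reference, the single-transform version being alluded to would look something like this, sketched against the standalone remap transform; the transform name is illustrative and the syntax is approximate:)

```toml
# Sketch: one named remap transform replaces the swimlanes transform plus
# the three add_fields transforms from the motivating example.
[sources.log-files]
  type = "file"
  include = ["/var/log/*.log"]

[transforms.add-type]
  type = "remap"
  inputs = ["log-files"]
  source = """
if .file == "/var/log/nginx.log" {
  .type = "nginx"
} else if .file == "/var/log/postgres.log" {
  .type = "postgres"
} else if .file == "/var/log/app.log" {
  .type = "app"
}
"""
```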

@binarylogic
Contributor Author

binarylogic commented Aug 31, 2020

That's true. I'm happy to start there as a middle ground and see where we get. Closing as a result.
