Support a new reusable pipelines directive when configuring Vector #1447

Closed
binarylogic opened this issue Dec 27, 2019 · 10 comments

Comments

@binarylogic
Contributor

binarylogic commented Dec 27, 2019

Following up on #1327 (comment), I'd like to consider a new pipelines directive in Vector's configuration files that would make a component's inputs key optional.

Examples

Below are a few examples to show how this would work in practice:

Simple

[pipelines]
  nginx_logs = ["nginx_logs", "parse_json", "print"]

[sources.nginx_logs]
  type = "file"
  include = ["/var/log/nginx/*.log"]

[transforms.parse_json]
  type = "json_parser"

[sinks.print]
  type = "console"

Default components

Taking the above example, a user could simply specify the component type if they do not want to customize it further.

[pipelines]
  nginx_logs = ["nginx_logs", "json_parser", "console"] # notice the use of direct component names

[sources.nginx_logs]
  type = "file"
  include = ["/var/log/nginx/*.log"]

This is very close to @lukesteensen's bash syntax idea 😄.

Chaining pipelines

It should also be possible to chain pipelines together. This is very similar to #1061, and is probably the solution to that.

/etc/vector/scrub_sensitive_data.toml:

[pipelines]
  scrub_sensitive_data = ["drop_sensitive_fields", "scrub_message"]

[transforms.drop_sensitive_fields]
  type = "remove_fields"
  fields = ["user_name", "user_email", "last_4"]

[transforms.scrub_message]
  type = "lua"
  source = """
  -- special code to scrub sensitive info from the message field
  """

/etc/vector/vector.toml:

[pipelines]
  main = ["app_logs", "scrub_sensitive_data", "archive"]

[sources.app_logs]
  type = "file"
  include = ["/var/log/my_app/*.log"]

[sinks.archive]
  type = "aws_s3"
  bucket = "my-bucket"
@binarylogic binarylogic added type: enhancement A value-adding code change that enhances its existing functionality. domain: config Anything related to configuring Vector needs: approval Needs review & approval before work can begin. needs: requirements Needs a list of requirements before work can begin labels Dec 27, 2019
@binarylogic binarylogic changed the title Support a new pipelines directive when configuring Vector Support a new reusable pipelines directive when configuring Vector Dec 27, 2019
@ghost

ghost commented Dec 27, 2019

I was recently thinking that our components can be seen as functions (even though some of them, such as sources or sinks, are not always pure). This pipeline approach fits nicely in this mental framework, as a pipeline can be seen as a composition of these functions.

I'm wondering whether we could experiment with new approaches to configs like this, without modifying Vector core, by writing an external transpiler script (in some scripting language) which would take a config in the new format and convert it to a config in the current format.
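
As a rough illustration of that idea, below is a minimal, hypothetical transpiler sketch in Python (not part of Vector; the file handling and names are assumptions). It reads a config that uses the proposed [pipelines] table and prints an equivalent config in the current inputs-based format, using the "shared instance" interpretation in which a component referenced by several pipelines collects all of its upstream neighbors into a single inputs list:

# Hypothetical stand-alone transpiler sketch -- not part of Vector.
# Assumes the "shared instance" interpretation and handles only simple
# scalar/array option values.
import json
import sys
import tomllib  # Python 3.11+; older versions could use the third-party `toml` package

def transpile(path):
    with open(path, "rb") as f:
        cfg = tomllib.load(f)

    # Collect the upstream components of every name that appears in a pipeline.
    inputs = {}
    for chain in cfg.pop("pipelines", {}).values():
        for upstream, downstream in zip(chain, chain[1:]):
            inputs.setdefault(downstream, set()).add(upstream)

    # Re-emit sources, transforms, and sinks with explicit inputs lists.
    for kind in ("sources", "transforms", "sinks"):
        for name, options in cfg.get(kind, {}).items():
            print(f"[{kind}.{name}]")
            if kind != "sources" and name in inputs:
                print(f"  inputs = {json.dumps(sorted(inputs[name]))}")
            for key, value in options.items():
                print(f"  {key} = {json.dumps(value)}")
            print()

if __name__ == "__main__":
    transpile(sys.argv[1])

Run against the "Simple" example above, it would print the same three components, with inputs = ["nginx_logs"] filled in for parse_json and inputs = ["parse_json"] for print.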

@Jeffail
Contributor

Jeffail commented Dec 28, 2019

Yeah, I like this. It would be interesting to gather some large configs used in the wild and reproduce them as pipelines. A diamond-shaped filter config might look like this (omitting the actual transforms):

[pipelines]
  process_common = [
    "this_common_thing",
    "another_common_thing",
    "console",
  ]
  process_foos = [
    "nginx_logs",
    "filter_just_foos",
    "do_thing",
    "do_another_thing",
    "process_common",
  ]
  process_bars = [
    "nginx_logs",
    "filter_just_bars",
    "do_something_else",
    "and_then_this",
    "process_common",
  ]

I'm also in favor of transpiling; I think at least internally we should only ever have one config format and one way of defining a topology.

However, another thing to consider is whether this is a syntax for advanced users or beginners. If it's intended for beginners, then we ought to emphasize it within the first few pages of our docs; it may well be the first syntax that new users experiment with, and therefore it's more likely to become the (adopted) standard config format for Vector configs.

This might become an issue if the underlying "canonical" config differs. For example, which format do we expose with #1039, the canonical form, or the one the user provided?

@binarylogic
Contributor Author

I think at least internally we should only ever have one config format, and one way of defining a topology.

👍

However, another thing to consider is whether this is a syntax for advanced users or beginners.

I'm leaning towards making this the single/only way to define a topology and deprecating the inputs field, but I want to see how this comes together first.

emphasize it within the first few pages of our docs

We'll update all of the docs

For example, which format do we expose with #1039, the canonical form, or the one the user provided?

I think we expose our internal normalized format. In other words, the topology map encoded as JSON. Happy to hear reasons to not do this.

@ghost

ghost commented Dec 29, 2019

I'm leaning towards making this the single/only way to define a topology and deprecating the inputs field, but I want to see how this comes together first.

Note that the pipelines syntax as discussed here allows defining only a subset of the currently possible topologies, because it doesn't seem possible to define multiple inputs for a single transform or sink with it.
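
For example (a hypothetical config in the current syntax, reusing names from the examples above), a fan-in topology where one transform reads from two sources has no obvious representation as a single linear pipeline list:

[sources.nginx_logs]
  type = "file"
  include = ["/var/log/nginx/*.log"]

[sources.apache_logs]
  type = "file"
  include = ["/var/log/apache2/*.log"]

[transforms.parse_json]
  type = "json_parser"
  inputs = ["nginx_logs", "apache_logs"] # two inputs feeding a single transform

[sinks.archive]
  type = "aws_s3"
  inputs = ["parse_json"]
  bucket = "my-bucket"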

@Jeffail
Contributor

Jeffail commented Dec 29, 2019

It depends on how it's implemented; my initial assumption with the syntax was that:

[pipelines]
  first = [
    "foo",
    "common_proc",
    "baz",
  ]
  second = [
    "bar",
    "common_proc",
    "baz",
  ]

Would translate to:

[sources.foo]
  type = "todo"

[sources.bar]
  type = "todo"

[transforms.common_proc]
  inputs = [ "foo", "bar" ]
  type = "todo"

[sinks.baz]
  inputs = [ "common_proc" ]
  type = "todo"

That assumption is a stretch, as it requires the knowledge that Vector would create a single instance of common_proc and hook it up within two separate pipelines. It's not intuitive behavior based on the pipeline syntax itself, where we might assume we have two isolated event feeds, but with the already established understanding of how Vector components work (with inputs) we might make the leap.

In that sense we're relying on the fact that transforms (currently) in practice aren't pure functions as they're wrapped with our message routing mechanisms (consuming events from a list of input components) and therefore have managed state.

However, if we were to rely on this pipeline syntax as the exclusive way to define topologies then there's no longer any reason to assume a transform would be shared across pipelines. In which case I would assume the pipeline config above to translate to:

[sources.foo]
  type = "todo"

[sources.bar]
  type = "todo"

[transforms.common_proc_1]
  inputs = [ "foo" ]
  type = "todo"

[transforms.common_proc_2]
  inputs = [ "bar" ]
  type = "todo"

[sinks.baz_1]
  inputs = [ "common_proc_1" ]
  type = "todo"

[sinks.baz_2]
  inputs = [ "common_proc_2" ]
  type = "todo"

For stateless transformations this difference is inconsequential, as the composition is the same. However, we're already starting to propose transforms that carry state (#1200), so it's entirely possible that this becomes a very confusing issue. Naming collisions are also a problem: if these are isolated components, then their metrics and logging should also be separate.

I personally think that if we were to lean fully into a hierarchical config format (instead of our current flattened syntax), there's a lot more to consider. We'd be trading off one of Vector's strengths, making it very clean to structure complex topologies, in favor of making it more concise to express much simpler serial pipelines.

Interestingly, I've gone through the process of taking Benthos from the opposite direction. I started from a hierarchical config structure where my equivalent of a transform (a processor) is a pure function and you can simply list them:

pipeline:
  processors:
    - jmespath:
        query: '{ nested: @, links: len(urls) }'
    - if:
        operator: im_bored
        then:
          - text:
              operator: to_upper

Then gradually added support for flattening the structure with named global processors that you compose:

pipeline:
  processors:
    - resource: foo
    - if:
        operator: im_bored
        then:
          - resource: bar

resources:
  processors:
    foo:
      jmespath:
        query: '{ nested: @, links: len(urls) }'
    bar:
      text:
        operator: to_upper

The former is better for simpler pipelines, where the structure is simple enough to express as a list of steps. The latter is better for large and heavily structured pipelines, where it's preferable to break certain functions out and configure them outside of the structure definition.

@ghost

ghost commented Dec 29, 2019

However, if we were to rely on this pipeline syntax as the exclusive way to define topologies then there's no longer any reason to assume a transform would be shared across pipelines.

Yeah, that's how I understood it initially! I assumed that reusing a named component without the inputs field between different pipelines would create isolated copies of that component for each pipeline.

I think if the opposite approach is used, so that there is only one instance of each named component shared between pipelines, it could lead to some unexpected behavior.

For example, this pipeline config

[pipelines]
nginx_logs = ["nginx_file_source", "generic_http_logs_parser", "s3_sink_1"]
apache_logs = ["apache_file_source", "generic_http_logs_parser", "s3_sink_2"]

would write both NGINX and Apache logs to each of the sinks, which is probably not what the user expects.
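
Under the shared-instance interpretation, that config would translate to roughly the following wiring (options omitted, names taken from the example above), which makes the unintended fan-out explicit:

[transforms.generic_http_logs_parser]
  inputs = ["nginx_file_source", "apache_file_source"]

[sinks.s3_sink_1]
  inputs = ["generic_http_logs_parser"] # also receives Apache logs

[sinks.s3_sink_2]
  inputs = ["generic_http_logs_parser"] # also receives NGINX logs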

The former is better for simplified pipelines, where the structure is simple enough to express as a list of steps. The latter is better for large and heavily structured pipelines where it's preferential to break out certain functions and configure them outside of the structure definition.

I support the idea of having two different syntaxes: a "common" one which is easy to use and works for most simple topologies, and an "advanced" one which might be harder to use but works for all possible topologies. However, it might be useful to have some kind of evidence that the "common" syntax is actually easier for new users to grasp than the "advanced" one, but I'm not sure how to collect it.

@binarylogic
Contributor Author

binarylogic commented Dec 29, 2019

These are excellent points. I'd like to take a step back and share a few UX rules we've decided on as part of Vector's configuration design:

  1. Config topology != Vector's internal topology

    The graph in a user's Vector configuration does not have to match exactly the graph Vector constructs in memory (if that makes sense). In other words, the user specifies the flow of data and we optimize that flow without violating it, meaning concurrency and other performance-related tactics (where possible) are entirely up to us. This is a much simpler UX, and I do not want the user thinking about performance/concurrency/etc.

  2. IDs are for humans also

    The component IDs set by the user should be reflected in all observability data. If we decide that cloning components is the way to go, then that should be hidden from the user. The IDs in the logs should not contain surprise suffixes like _1, _2, etc.

  3. State is localized to the component, not the pipeline.

    For example, in Ashley's first pipeline example, the common_proc transform would have shared state that both inputs foo and bar would operate on. If this is not desired, the user can explicitly define 2 separate common_proc components (see the sketch after this list). This makes the most sense to me since it relates to the data flow and processing. If a transform is using state, the user will know that and can make those decisions accordingly.

  4. Happy new year!
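
A minimal sketch of what rule 3 implies, reusing the names from Ashley's earlier example (the _for_foo/_for_bar suffixes are hypothetical names chosen by the user, not generated by Vector): if isolated state is wanted, the user defines two separate transforms explicitly rather than relying on implicit cloning.

[pipelines]
  first = ["foo", "common_proc_for_foo", "baz"]
  second = ["bar", "common_proc_for_bar", "baz"]

[transforms.common_proc_for_foo]
  type = "todo"

[transforms.common_proc_for_bar]
  type = "todo"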

@lukesteensen
Member

There's a lot to think about here. Some initial thoughts:

  1. The syntax is less general than inputs, so I don't think it will work as the exclusive syntax or our internal representation.

  2. The example @a-rodin brought up is a real doozy. Sending events from both sources to both sinks seems obviously wrong, but so does silently cloning a potentially-stateful transform that the user has identified by name. This is fundamentally a mismatch between the pipelines model and our inputs representation, and I'm really not sure how we'd resolve it in a way that doesn't create a bunch of surprising edge cases.

  3. The reusability aspect seems to imply cloning, which opens up a lot of questions (similar to above). We'd need to come up with internal names for the clones, and we'd need to decide how to expose that to users. It would be confusing to combine logs and metrics from separate components under one name, but we also don't want to confuse the users with opaque _1, _2 names. We'd probably want something like nested names based on the pipeline name.

  4. The snippets use case is compelling, but we'd need to detect and error on recursive definitions.

  5. I've also had the idea of components as functions (or at least component definition as function invocation). If you had something like my_stateful_transform(config: foo), it would be more obvious when a new instance is being constructed vs a reference to a named instance. You could also inline those definitions so your pipeline could look more like [tcp(port: 9092), syslog_parser(drop_invalid: false), http(encoding: json, endpoint: foo)]. That's obviously not toml, but could work in something like jsonnet/dhall/starlark (which can output json/toml). Functions are an obvious solution to "snippets" as well.

@binarylogic
Contributor Author

I'm realizing I didn't fully understand @a-rodin's edge case :(. That is unfortunate and demonstrates @lukesteensen's point around surprising edge-cases.

Even with its downsides, this syntax, in my opinion, is quite a bit easier to write, read, and manage. I'd like to see if we can push through and come up with a similarly clear syntax or rules that eliminate edge cases for this one. For example:

  1. We could raise a topology error for @a-rodin's edge case.
  2. We could support the ability to split streams, which could be used to direct output properly, avoiding @a-rodin's edge case.

We'll probably benefit from taking a step back and thinking about this holistically in the context of the Configuration Development project.

@Jeffail
Contributor

Jeffail commented Feb 3, 2020

Superseded by #1679
