Support a new reusable pipelines directive when configuring Vector #1447

Closed
binarylogic opened this issue Dec 27, 2019 · 10 comments

Comments

@binarylogic
Contributor

binarylogic commented Dec 27, 2019

Following up on #1327 (comment), I'd like to consider a new pipelines directive in Vector's configuration files that would make a component's inputs key optional.

Examples

Below are a few examples to show how this would work in practice:

Simple

[pipelines]
  nginx_logs = ["nginx_logs", "parse_json", "print"]

[sources.nginx_logs]
  type = "file"
  include = ["/var/log/nginx/*.log"]

[transforms.parse_json]
  type = "json_parser"

[sinks.print]
  type = "console"

Default components

Taking the above example, a user could simply specify the component type if they do not want to customize it further.

[pipelines]
  nginx_logs = ["nginx_logs", "json_parser", "console"] # notice the use of direct component names

[sources.nginx_logs]
  type = "file"
  include = ["/var/log/nginx/*.log"]

This is very close to @lukesteensen's bash syntax idea 😄.

Chaining pipelines

It should also be possible to chain pipelines together. This is very similar to #1061, and is probably the solution to that.

/etc/vector/scrub_sensitive_data.toml:

[pipelines]
  scrub_sensitive_data = ["drop_sensitive_fields", "scrub_message"]

[transforms.drop_sensitive_fields]
  type = "remove_fields"
  fields = ["user_name", "user_email", "last_4"]

[transforms.scrub_message]
  type = "lua"
  source = """
  -- special code to scrub sensitive info from the message field
  """

/etc/vector/vector.toml:

[pipelines]
  main = ["app_logs", "scrub_sensitive_data", "archive"]

[sources.app_logs]
  type = "file"
  include = ["/var/log/my_app/*.log"]

[sinks.archive]
  type = "aws_s3"
  bucket = "my-bucket"
@binarylogic binarylogic added type: enhancement A value-adding code change that enhances its existing functionality. domain: config Anything related to configuring Vector needs: approval Needs review & approval before work can begin. needs: requirements Needs a list of requirements before work can begin labels Dec 27, 2019
@binarylogic binarylogic changed the title Support a new pipelines directive when configuring Vector Support a new reusable pipelines directive when configuring Vector Dec 27, 2019
@ghost

ghost commented Dec 27, 2019

I was recently thinking that our components can be seen as functions (even though some of them, such as sources or sinks, are not always pure). This pipeline approach fits nicely in this mental framework, as a pipeline can be seen as a composition of these functions.

I'm wondering whether we could experiment with new approaches to configs like this, without modifying Vector core, by writing an external transpiler script (in some scripting language) which would take a config in the new format and convert it to a config in the current format.
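
As a rough illustration of that idea, below is a minimal, hypothetical transpiler sketch in Python (not part of Vector; the file handling and names are assumptions). It reads a config that uses the proposed [pipelines] table and prints an equivalent config in the current inputs-based format, using the "shared instance" interpretation in which a component referenced by several pipelines collects all of its upstream neighbors into a single inputs list:

# Hypothetical stand-alone transpiler sketch -- not part of Vector.
# Assumes the "shared instance" interpretation and handles only simple
# scalar/array option values.
import json
import sys
import tomllib  # Python 3.11+; older versions could use the third-party `toml` package

def transpile(path):
    with open(path, "rb") as f:
        cfg = tomllib.load(f)

    # Collect the upstream components of every name that appears in a pipeline.
    inputs = {}
    for chain in cfg.pop("pipelines", {}).values():
        for upstream, downstream in zip(chain, chain[1:]):
            inputs.setdefault(downstream, set()).add(upstream)

    # Re-emit sources, transforms, and sinks with explicit inputs lists.
    for kind in ("sources", "transforms", "sinks"):
        for name, options in cfg.get(kind, {}).items():
            print(f"[{kind}.{name}]")
            if kind != "sources" and name in inputs:
                print(f"  inputs = {json.dumps(sorted(inputs[name]))}")
            for key, value in options.items():
                print(f"  {key} = {json.dumps(value)}")
            print()

if __name__ == "__main__":
    transpile(sys.argv[1])

Run against the "Simple" example above, it would print the same three components, with inputs = ["nginx_logs"] filled in for parse_json and inputs = ["parse_json"] for print.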

@Jeffail
Contributor

Jeffail commented Dec 28, 2019

Yeah, I like this. It would be interesting to gather some large configs used in the wild and reproduce them as pipelines. A diamond-shaped filter config might look like this (omitting the actual transforms):

[pipelines]
  process_common = [
    "this_common_thing",
    "another_common_thing",
    "console",
  ]
  process_foos = [
    "nginx_logs",
    "filter_just_foos",
    "do_thing",
    "do_another_thing",
    "process_common",
  ]
  process_bars = [
    "nginx_logs",
    "filter_just_bars",
    "do_something_else",
    "and_then_this",
    "process_common",
  ]

I'm also in favor of transpiling; I think at least internally we should only ever have one config format and one way of defining a topology.

However, another thing to consider is whether this is a syntax for advanced users or beginners. If it's intended for beginners, then we ought to emphasize it within the first few pages of our docs; it may well be the first syntax that new users experiment with, and therefore it's more likely to become the (adopted) standard config format for Vector configs.

This might become an issue if the underlying "canonical" config differs. For example, which format do we expose with #1039, the canonical form, or the one the user provided?

@binarylogic
Contributor Author

I think at least internally we should only ever have one config format, and one way of defining a topology.

👍

However, another thing to consider is whether this is a syntax for advanced users or beginners.

I'm leaning towards making this the single/only way to define a topology and deprecating the inputs field, but I want to see how this comes together first.

emphasize it within the first few pages of our docs

We'll update all of the docs

For example, which format do we expose with #1039, the canonical form, or the one the user provided?

I think we expose our internal normalized format. In other words, the topology map encoded as JSON. Happy to hear reasons to not do this.

@ghost

ghost commented Dec 29, 2019

I'm leaning towards making this the single/only way to define a topology and deprecating the inputs field, but I want to see how this comes together first.

Note that the pipelines syntax as discussed here allows defining only a subset of the currently possible topologies, because it doesn't seem possible to define multiple inputs for a single transform or sink with it.
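
For example (a hypothetical config in the current syntax, reusing names from the examples above), a fan-in topology where one transform reads from two sources has no obvious representation as a single linear pipeline list:

[sources.nginx_logs]
  type = "file"
  include = ["/var/log/nginx/*.log"]

[sources.apache_logs]
  type = "file"
  include = ["/var/log/apache2/*.log"]

[transforms.parse_json]
  type = "json_parser"
  inputs = ["nginx_logs", "apache_logs"] # two inputs feeding a single transform

[sinks.archive]
  type = "aws_s3"
  inputs = ["parse_json"]
  bucket = "my-bucket"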

@Jeffail
Contributor

Jeffail commented Dec 29, 2019

It depends on how it's implemented; my initial assumption with the syntax was that:

[pipelines]
  first = [
    "foo",
    "common_proc",
    "baz",
  ]
  second = [
    "bar",
    "common_proc",
    "baz",
  ]

Would translate to:

[sources.foo]
  type = "todo"

[sources.bar]
  type = "todo"

[transforms.common_proc]
  inputs = [ "foo", "bar" ]
  type = "todo"

[sinks.baz]
  inputs = [ "common_proc" ]
  type = "todo"

That assumption is a stretch, as it requires the knowledge that Vector would create a single instance of common_proc and hook it up within two separate pipelines. It's not intuitive behavior based on the pipeline syntax itself, where we might assume we have two isolated event feeds, but with the already established understanding of how Vector components work (with inputs) we might make the leap.

In that sense we're relying on the fact that transforms (currently) in practice aren't pure functions as they're wrapped with our message routing mechanisms (consuming events from a list of input components) and therefore have managed state.

However, if we were to rely on this pipeline syntax as the exclusive way to define topologies then there's no longer any reason to assume a transform would be shared across pipelines. In which case I would assume the pipeline config above to translate to:

[sources.foo]
  type = "todo"

[sources.bar]
  type = "todo"

[transforms.common_proc_1]
  inputs = [ "foo" ]
  type = "todo"

[transforms.common_proc_2]
  inputs = [ "bar" ]
  type = "todo"

[sinks.baz_1]
  inputs = [ "common_proc_1" ]
  type = "todo"

[sinks.baz_2]
  inputs = [ "common_proc_2" ]
  type = "todo"

For stateless transformations this difference is inconsequential, as the composition is the same. However, we're already starting to propose transforms that carry state (#1200), so it's entirely possible that this becomes a very confusing issue. Naming collisions are also a problem: if these are isolated components, then their metrics and logging should also be separate.

I personally think that if we were to lean fully into a hierarchical config format (instead of our current flattened syntax), there's a lot more to consider. We'd be trading off one of Vector's strengths, making it very clean to structure complex topologies, in favor of making it more concise to express much simpler serial pipelines.

Interestingly, I've gone through the process of taking Benthos from the opposite direction. I started from a hierarchical config structure where my equivalent of a transform (a processor) is a pure function and you can simply list them:

pipeline:
  processors:
    - jmespath:
        query: '{ nested: @, links: len(urls) }'
    - if:
        operator: im_bored
        then:
          - text:
              operator: to_upper

Then gradually added support for flattening the structure with named global processors that you compose:

pipeline:
  processors:
    - resource: foo
    - if:
        operator: im_bored
        then:
          - resource: bar

resources:
  processors:
    foo:
      jmespath:
        query: '{ nested: @, links: len(urls) }'
    bar:
      text:
        operator: to_upper

The former is better for simpler pipelines, where the structure is simple enough to express as a list of steps. The latter is better for large and heavily structured pipelines, where it's preferable to break certain functions out and configure them outside of the structure definition.

@ghost

ghost commented Dec 29, 2019

However, if we were to rely on this pipeline syntax as the exclusive way to define topologies then there's no longer any reason to assume a transform would be shared across pipelines.

Yeah, that's how I understood it initially! I assumed that reusing a named component without the inputs field between different pipelines would create isolated copies of that component for each pipeline.

I think if the opposite approach is used, so that there is only one instance of each named component shared between pipelines, it could lead to some unexpected behavior.

For example, this pipeline config

[pipelines]
nginx_logs = ["nginx_file_source", "generic_http_logs_parser", "s3_sink_1"]
apache_logs = ["apache_file_source", "generic_http_logs_parser", "s3_sink_2"]

would write both NGINX and Apache logs to each of the sinks, which is probably not what the user expects.
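
Under the shared-instance interpretation, that config would translate to roughly the following wiring (options omitted, names taken from the example above), which makes the unintended fan-out explicit:

[transforms.generic_http_logs_parser]
  inputs = ["nginx_file_source", "apache_file_source"]

[sinks.s3_sink_1]
  inputs = ["generic_http_logs_parser"] # also receives Apache logs

[sinks.s3_sink_2]
  inputs = ["generic_http_logs_parser"] # also receives NGINX logs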

The former is better for simplified pipelines, where the structure is simple enough to express as a list of steps. The latter is better for large and heavily structured pipelines where it's preferential to break out certain functions and configure them outside of the structure definition.

I support the idea of having two different syntaxes: a "common" one which is easy to use and works for most simple topologies, and an "advanced" one which might be harder to use but works for all possible topologies. However, it might be useful to have some kind of evidence that the "common" syntax is actually easier for new users to grasp than the "advanced" one, but I'm not sure how to collect it.

@binarylogic
Contributor Author

binarylogic commented Dec 29, 2019

These are excellent points. I'd like to take a step back and share a few UX rules we've decided on as part of Vector's configuration design:

  1. Config topology != Vector's internal topology

    The graph in a user's Vector configuration does not have to match exactly the graph Vector constructs in memory (if that makes sense). In other words, the user specifies the flow of data and we optimize that flow without violating it, meaning concurrency and other performance-related tactics (where possible) are entirely up to us. This is a much simpler UX, and I do not want the user thinking about performance/concurrency/etc.

  2. IDs are for humans also

    The component IDs set by the user should be reflected in all observability data. If we decide that cloning components is the way to go, then that should be hidden from the user. The IDs in the logs should not contain surprise suffixes like _1, _2, etc.

  3. State is localized to the component, not the pipeline.

    For example, in Ashley's first pipeline example, the common_proc transform would have shared state that both inputs foo and bar would operate on. If this is not desired, the user can explicitly define 2 separate common_proc components (see the sketch after this list). This makes the most sense to me since it relates to the data flow and processing. If a transform is using state, the user will know that and can make those decisions accordingly.

  4. Happy new year!
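
A minimal sketch of what rule 3 implies, reusing the names from Ashley's earlier example (the _for_foo/_for_bar suffixes are hypothetical names chosen by the user, not generated by Vector): if isolated state is wanted, the user defines two separate transforms explicitly rather than relying on implicit cloning.

[pipelines]
  first = ["foo", "common_proc_for_foo", "baz"]
  second = ["bar", "common_proc_for_bar", "baz"]

[transforms.common_proc_for_foo]
  type = "todo"

[transforms.common_proc_for_bar]
  type = "todo"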

@lukesteensen
Member

There's a lot to think about here. Some initial thoughts:

  1. The syntax is less general than inputs, so I don't think it will work as the exclusive syntax or our internal representation.

  2. The example @a-rodin brought up is a real doozy. Sending events from both sources to both sinks seems obviously wrong, but so does silently cloning a potentially-stateful transform that the user has identified by name. This is fundamentally a mismatch between the pipelines model and our inputs representation, and I'm really not sure how we'd resolve it in a way that doesn't create a bunch of surprising edge cases.

  3. The reusability aspect seems to imply cloning, which opens up a lot of questions (similar to above). We'd need to come up with internal names for the clones, and we'd need to decide how to expose that to users. It would be confusing to combine logs and metrics from separate components under one name, but we also don't want to confuse the users with opaque _1, _2 names. We'd probably want something like nested names based on the pipeline name.

  4. The snippets use case is compelling, but we'd need to detect and error on recursive definitions.

  5. I've also had the idea of components as functions (or at least component definition as function invocation). If you had something like my_stateful_transform(config: foo), it would be more obvious when a new instance is being constructed vs a reference to a named instance. You could also inline those definitions so your pipeline could look more like [tcp(port: 9092), syslog_parser(drop_invalid: false), http(encoding: json, endpoint: foo)]. That's obviously not toml, but could work in something like jsonnet/dhall/starlark (which can output json/toml). Functions are an obvious solution to "snippets" as well.

@binarylogic
Contributor Author

I'm realizing I didn't fully understand @a-rodin's edge case :(. That is unfortunate and demonstrates @lukesteensen's point around surprising edge-cases.

Even with its downsides, this syntax, in my opinion, is quite a bit easier to write, read, and manage. I'd like to see if we can push through and come up with a similarly clear syntax or rules that eliminate edge cases for this one. For example:

  1. We could raise a topology error for @a-rodin's edge case.
  2. We could support the ability to split streams, which could be used to direct output properly, avoiding @a-rodin's edge case.

We'll probably benefit from taking a step back and thinking about this holistically in the context of the Configuration Development project.

@Jeffail
Contributor

Jeffail commented Feb 3, 2020

Superseded by #1679
