Support a new reusable pipelines directive when configuring Vector #1447
Comments
I was recently thinking that our components can be seen as functions (even though some of them, such as sources or sinks, are not always pure). This pipeline approach fits nicely into that mental framework, as a pipeline can be seen as a composition of these functions. I wonder whether we could experiment with new config approaches like this without modifying Vector core, just by writing an external transpiler script (in some scripting language) that takes a config in the new format and converts it to a config in the current format.
Yeah, I like this. It would be interesting to gather some large configs used in the wild and reproduce them as pipelines. A diamond-shaped filter config might look like this (omitting the actual transforms):
[pipelines]
process_common = [
"this_common_thing",
"another_common_thing",
"console",
]
process_foos = [
"nginx_logs",
"filter_just_foos",
"do_thing",
"do_another_thing",
"process_common",
]
process_bars = [
"nginx_logs",
"filter_just_bars",
"do_something_else",
"and_then_this",
"process_common",
]
I'm also in favor of transpiling. I think that, at least internally, we should only ever have one config format and one way of defining a topology. However, another thing to consider is whether this is a syntax for advanced users or beginners. If it's intended for beginners, then we ought to emphasize it within the first few pages of our docs. It may well be the first syntax that new users experiment with, and it is therefore more likely to become the adopted standard format for Vector configs. This might become an issue if the underlying "canonical" config differs. For example, which format do we expose with #1039: the canonical form, or the one the user provided?
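A minimal sketch, in Python, of how a transpiler might flatten the diamond config above. It assumes (this is not confirmed anywhere in the thread) that a step naming another pipeline, such as "process_common", is inlined as that pipeline's steps. The function name `expand` and the dictionary shape are hypothetical.

```python
def expand(pipelines, name, _seen=()):
    """Return the steps of pipeline `name`, inlining referenced pipelines.

    Assumption: a step that matches a pipeline name is a reference to that
    pipeline; any other step is a component name.
    """
    if name in _seen:
        raise ValueError(f"pipeline cycle through {name!r}")
    steps = []
    for step in pipelines[name]:
        if step in pipelines:
            # Nested pipeline: splice its (recursively expanded) steps in place.
            steps.extend(expand(pipelines, step, _seen + (name,)))
        else:
            steps.append(step)
    return steps


pipelines = {
    "process_common": ["this_common_thing", "another_common_thing", "console"],
    "process_foos": ["nginx_logs", "filter_just_foos", "do_thing",
                     "do_another_thing", "process_common"],
    "process_bars": ["nginx_logs", "filter_just_bars", "do_something_else",
                     "and_then_this", "process_common"],
}

foos = expand(pipelines, "process_foos")
bars = expand(pipelines, "process_bars")
```

Both expanded lists start at "nginx_logs" and end at "console", which is where the diamond shape comes from.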
👍
I'm leaning towards making this the single/only way to define a topology and deprecating the
We'll update all of the docs
I think we expose our internal normalized format. In other words, the topology map encoded as JSON. Happy to hear reasons not to do this.
Note that the pipelines syntax as discussed here allows defining only a subset of the currently possible topologies, because it doesn't seem possible to define multiple inputs for a single transform or sink with it.
Depends on how it's implemented. My initial assumption with the syntax was that:
[pipelines]
first = [
"foo",
"common_proc",
"baz",
]
second = [
"bar",
"common_proc",
"baz",
]
Would translate to:
[sources.foo]
type = "todo"
[sources.bar]
type = "todo"
[transforms.common_proc]
inputs = [ "foo", "bar" ]
type = "todo"
[sinks.baz]
inputs = [ "common_proc" ]
type = "todo"
That assumption is a stretch, as it requires the knowledge that Vector would create one single instance of
In that sense we're relying on the fact that transforms (currently) in practice aren't pure functions, as they're wrapped with our message routing mechanisms (consuming events from a list of input components) and therefore have managed state. However, if we were to rely on this pipeline syntax as the exclusive way to define topologies, then there's no longer any reason to assume a transform would be shared across pipelines. In that case I would assume the pipeline config above to translate to:
[sources.foo]
type = "todo"
[sources.bar]
type = "todo"
[transforms.common_proc_1]
inputs = [ "foo" ]
type = "todo"
[transforms.common_proc_2]
inputs = [ "bar" ]
type = "todo"
[sinks.baz_1]
inputs = [ "common_proc_1" ]
type = "todo"
[sinks.baz_2]
inputs = [ "common_proc_2" ]
type = "todo"
For stateless transformations this difference is inconsequential, as the composition is the same. However, we're already starting to propose transforms that carry state (#1200), so it's entirely possible that this becomes a very confusing issue. Naming collisions are also a problem: if these are isolated components, then their metrics and logging should also be separate.
I personally think that if we were to lean fully into a hierarchical config format (instead of our current flattened syntax), then there's a lot more to consider. We'd be trading off some of Vector's strengths in making it very clean to structure complex topologies in favor of making it more concise to express much simpler serial pipelines.
Interestingly, I've taken Benthos through this process from the opposite direction. I started from a hierarchical config structure where my equivalent of a transform (a processor) is a pure function and you can simply list them:
pipeline:
  processors:
  - jmespath:
      query: '{ nested: @, links: len(urls) }'
  - if:
      operator: im_bored
      then:
      - text:
          operator: to_upper
Then I gradually added support for flattening the structure with named global processors that you compose:
pipeline:
  processors:
  - resource: foo
  - if:
      operator: im_bored
      then:
      - resource: bar
resources:
  processors:
    foo:
      jmespath:
        query: '{ nested: @, links: len(urls) }'
    bar:
      text:
        operator: to_upper
The former is better for simplified pipelines, where the structure is simple enough to express as a list of steps. The latter is better for large and heavily structured pipelines, where it's preferable to break out certain functions and configure them outside of the structure definition.
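The two translations of the first/second pipelines discussed earlier in this comment can be sketched in Python as follows. This is an illustration under stated assumptions, not Vector's implementation: the function names are hypothetical, the first step of each pipeline is assumed to be a source that keeps its name, and the `_1`/`_2` suffixes simply follow the pipeline order as in the hand-written translation above.

```python
def shared_instances(pipelines):
    """One instance per component name; inputs merge across pipelines."""
    inputs = {}
    for steps in pipelines.values():
        # Pair every step with its predecessor (None for the first step).
        for prev, name in zip([None] + steps, steps):
            wired = inputs.setdefault(name, [])
            if prev is not None and prev not in wired:
                wired.append(prev)
    return inputs


def isolated_instances(pipelines):
    """One instance per component per pipeline, suffixed by pipeline index.

    Assumption: the first step is a source and keeps its plain name.
    """
    inputs = {}
    for index, steps in enumerate(pipelines.values(), start=1):
        prev = None
        for position, name in enumerate(steps):
            instance = name if position == 0 else f"{name}_{index}"
            wired = inputs.setdefault(instance, [])
            if prev is not None:
                wired.append(prev)
            prev = instance
    return inputs


pipelines = {
    "first": ["foo", "common_proc", "baz"],
    "second": ["bar", "common_proc", "baz"],
}

shared = shared_instances(pipelines)
isolated = isolated_instances(pipelines)
```

Under the shared reading, `common_proc` receives both `foo` and `bar`. Under the isolated reading, each pipeline gets its own `common_proc_N` and `baz_N`, which is where the naming and metrics questions come from.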
Yeah, that's how I understood it initially! I assumed that reusing a named component without the
I think if the opposite approach is used, so that there is only one instance of each named component shared between pipelines, it could lead to some unexpected behavior. For example, this pipeline config
[pipelines]
nginx_logs = ["nginx_file_source", "generic_http_logs_parser", "s3_sink_1"]
apache_logs = ["apache_file_source", "generic_http_logs_parser", "s3_sink_2"]
would write both NGINX and Apache logs to each of the sinks, which is probably not what the user expects.
I support the idea of having two different syntaxes: a "common" one, which is easy to use and works for most simple topologies, and an "advanced" one, which might be harder to use but works for all possible topologies. However, it would be useful to have some evidence that the "common" syntax is actually easier for new users to grasp than the "advanced" one, and I'm not sure how to collect it.
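The fan-out in the nginx_logs/apache_logs example above can be checked with a short Python sketch. It assumes the shared-instance reading (one instance per component name) and an acyclic topology; the function name `sources_per_sink` is hypothetical.

```python
def sources_per_sink(pipelines):
    """Map each sink (last step) to the set of sources that reach it.

    Assumptions: one shared instance per component name, no cycles.
    """
    inputs = {}
    for steps in pipelines.values():
        for prev, name in zip([None] + steps, steps):
            wired = inputs.setdefault(name, set())
            if prev is not None:
                wired.add(prev)

    def sources_of(name):
        # A component with no inputs is treated as a source.
        if not inputs[name]:
            return {name}
        found = set()
        for upstream in inputs[name]:
            found |= sources_of(upstream)
        return found

    sinks = {steps[-1] for steps in pipelines.values()}
    return {sink: sources_of(sink) for sink in sinks}


pipelines = {
    "nginx_logs": ["nginx_file_source", "generic_http_logs_parser", "s3_sink_1"],
    "apache_logs": ["apache_file_source", "generic_http_logs_parser", "s3_sink_2"],
}

result = sources_per_sink(pipelines)
```

Because both pipelines pass through the same `generic_http_logs_parser` instance, each sink ends up downstream of both file sources.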
These are excellent points. I'd like to take a step back and share a few UX rules we've decided on as part of Vector's configuration design:
There's a lot to think about here. Some initial thoughts:
I'm realizing I didn't fully understand @a-rodin's edge case :(. That is unfortunate and demonstrates @lukesteensen's point around surprising edge-cases. Even with its downsides, this syntax, in my opinion, is quite a bit easier to write, read, and manage. I'd like to see if we can push through and come up with a similarly clear syntax or rules that eliminate edge cases for this one. For example:
We'll probably benefit from taking a step back and thinking about this holistically in the context of the Configuration Development project.
Superseded by #1679
Following up on #1327 (comment), I'd like to consider a new pipelines directive in Vector's configuration files that would make a component's inputs key optional.
Examples
Below are a few examples to show how this would work in practice:
Simple
Default components
Taking the above example, a user could simply specify the component type if they do not want to customize it further.
This is very close to @lukesteensen's bash syntax idea 😄.
Chaining pipelines
It should also be possible to chain pipelines together. This is very similar to #1061, and is probably the solution to that.
/etc/vector/scrub_sensitive_data.toml:
/etc/vector/vector.toml: