We've already had a proposal for a new pipelines directive (#1447) which allows a user to simply list component names in order to create a topology:
[pipelines]
p1 = ["tfm1", "tfm2"]
p2 = ["src1", "p1", "tfm3", "snk1"]
p3 = ["src2", "tfm1", "snk2"]
And based on these pipelines the
With the original spec the above snippet would create unexpected side effects:
One of the key strengths of our current spec is that it supports a wide range of topologies whilst retaining a flat configuration spec. If a new spec is to replace the current
[pipelines]
p1 = ["src1", "tfm1", "tfm2"]
p2 = ["src2", "p1", "snk1"]
It would make sense for Vector to construct the following topology:
But it could also look like this:
Even if we have a clear spec, can we expect a user seeing this config for the first time to grok it? Multiple sinks have the same issue.
Sub Proposal 1
Based on the simpler spec for a compose transform (#1653) we introduce a concept of component copies (versus references). This allows us to refer to the configuration of a component but create a copy for our pipeline rather than mutate the
Next, we expand the
If a pipeline does not follow this spec then we are able to deliver a clear error message.
With this spec I'm fairly confident that we can support all of the same topologies as we currently do without any unintended side effects. However, if we decide to go ahead with this proposal we need to investigate further.
I still feel as though this spec is somewhat hostile to new users. This is mostly down to the fact that you're looking at a linear list of component names as if they're all equivalent:
[pipelines]
p1 = [ "src2", "tfm1", "tfm2" ]
p2 = [ "src1", "p1", "tfm3", "snk1" ]
Whereas in reality you're looking at a combined list of three different element types, more clearly represented as:
[pipelines.p1]
inputs = [ "src2" ]
transforms = [ "tfm1", "tfm2" ]

[pipelines.p2]
inputs = [ "src1", "p1" ]
transforms = [ "tfm3" ]
outputs = [ "snk1" ]
(NOTE: I'm NOT suggesting this as a spec)
We're pushing the spec in favor of writing speed at the cost of readability, and I'm not 100% convinced the sacrifices we're making aren't going to sting new users trying to grok Vector configs (i.e. does
Sub Proposal 2
Exactly the same as proposal 1 except we make it more explicit:
The change being that source and sink lists within a pipeline must themselves be in an array. The purpose of this requirement is purely for the sake of distinguishing the tiers:
[pipelines]
p1 = [ [ "src2" ], "tfm1", "tfm2" ]
p2 = [ [ "src1", "p1" ], "tfm3", [ "snk1" ] ]
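To make the tiering rule a bit more concrete, here is a rough Python sketch of how a parser might split a pipeline written in this syntax. The function name `split_tiers` and the error message are hypothetical. It assumes a leading array is the input list, a trailing array is the output list, and everything in between must be a plain transform name.

```python
def split_tiers(pipeline):
    """Split a bracketed pipeline into (inputs, transforms, outputs).

    Hypothetical helper, not part of Vector.
    """
    items = list(pipeline)
    inputs, outputs = [], []
    # Assumption: a leading array is the list of inputs (references).
    # A pipeline made of a single array is ambiguous; this sketch
    # arbitrarily treats that array as inputs.
    if items and isinstance(items[0], list):
        inputs = list(items.pop(0))
    # Assumption: a trailing array is the list of outputs.
    if items and isinstance(items[-1], list):
        outputs = list(items.pop())
    # Everything left over must be a plain transform name.
    for item in items:
        if not isinstance(item, str):
            raise ValueError(
                f"arrays are only allowed at the start or end: {item!r}"
            )
    return inputs, items, outputs


pipelines = {
    "p1": [["src2"], "tfm1", "tfm2"],
    "p2": [["src1", "p1"], "tfm3", ["snk1"]],
}
tiers = {name: split_tiers(spec) for name, spec in pipelines.items()}
```

With the two pipelines above, `tiers["p2"]` comes out as `(["src1", "p1"], ["tfm3"], ["snk1"])`, and an array in the middle of a pipeline is rejected with an error.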
One key technical advantage over proposal 1 is that because we are explicitly declaring which components are inputs and which are simply transformations of the pipeline, we are now able to specify a transform as an input (and therefore a reference). This makes it possible to add the pipeline syntax into existing configs with transforms in the topology.
From the usability perspective a user familiar with the
This syntax still doesn't provide a full picture, but merely a hint of what's going on. Adding brackets also adds more opportunities for typos to break the topology.
There's also the (unlikely) problem of pipelines that are only a list of sinks. Imagine if we were to create a group of sinks that all want to consume data from the same range of pipelines. For convenience we might group them in their own pipeline with something like:
[pipelines]
p1 = [ [ "src1" ], "tfm1", "tfm2" ]
p2 = [ [ "src2" ], "tfm3" ]
p3 = [ [ "snk1", "snk2", "snk3" ] ]
p4 = [ [ "p1" ], [ "p3" ] ]
p5 = [ [ "p2" ], "tfm4", [ "p3" ] ]
There's a more concise way of expressing these pipelines, but assuming that this were the best way to structure it then
Sub Proposal 3
Roughly the same as sub proposal 2 except in a structured format, with three fields:
[pipelines.p1]
inputs = [ "src1" ]
pipe = [ "tfm1", "tfm2" ]
outputs = [ "snk1" ]

[pipelines]
p1.pipe = [ "tfm1", "tfm2" ]
p2.inputs = [ "src1" ]
p2.pipe = [ "tfm3", "tfm4" ]
p3.inputs = [ "src2", "p2" ]
p3.pipe = [ "p1", "tfm5" ]
p3.outputs = [ "snk1" ]
Other name candidates are
This has all of the advantages of proposal 2 along with clear naming in order to distinguish the three tiers of the pipeline even further. A new user not necessarily familiar with pipeline syntax is likely able to fully comprehend the topology expressed here.
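As a rough illustration of how little interpretation the structured form needs, here is a Python sketch that turns one such pipeline into a list of edges. The helper `pipeline_edges` is hypothetical. It assumes every input feeds the first `pipe` entry, the `pipe` entries are chained in order, and the last one feeds every output. It ignores copy/reference semantics and does not expand pipeline names that appear inside another pipeline.

```python
ALLOWED_FIELDS = {"inputs", "pipe", "outputs"}


def pipeline_edges(name, spec):
    """Return (upstream, downstream) pairs for one structured pipeline."""
    unknown = set(spec) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(
            f"pipeline {name!r} has unknown fields: {sorted(unknown)}"
        )
    inputs = spec.get("inputs", [])
    pipe = spec.get("pipe", [])
    outputs = spec.get("outputs", [])

    edges = []
    heads = list(inputs)  # components feeding the next step
    for step in pipe:
        edges.extend((head, step) for head in heads)
        heads = [step]
    # Whatever is at the end of the chain feeds every output.
    edges.extend((head, out) for head in heads for out in outputs)
    return edges


p1 = {"inputs": ["src1"], "pipe": ["tfm1", "tfm2"], "outputs": ["snk1"]}
edges = pipeline_edges("p1", p1)
```

For the `p1` example this gives the chain `src1 -> tfm1 -> tfm2 -> snk1`, and a misspelled field name fails immediately with an error instead of silently changing the topology.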
It's more words.
Sub Proposal 4
I'd like to throw another proposal into the mix. One that explicitly uses
[pipelines]
p1 = ["tfm1", ["tfm2", "tfm3"], "tfm4"]
p2 = ["&src1", "p1", "tfm3", "&snk1"]
p3 = ["&src2", "tfm1", "&snk2"]
Identifiers and observability
It's worth noting that a copied component will get a unique ID that is used in logs, metrics, etc.
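Here is a small Python sketch of what that could look like. It reads the `&` prefix as "reference the global component" and a bare name as "copy it into this pipeline", which is my interpretation of the example above. The ID scheme (`<pipeline>.<component>`, with a `#n` suffix for repeats) and the helper name `resolve_ids` are made up purely for illustration. Nested arrays and pipeline names inside a pipeline are not handled.

```python
def resolve_ids(pipeline_name, components):
    """Map each entry of a flat pipeline list to the ID it would report under."""
    resolved = []
    seen = {}  # how many times each copied component has appeared so far
    for name in components:
        if name.startswith("&"):
            # Reference: keep the global component's own ID.
            resolved.append(name[1:])
        else:
            # Copy: derive a new ID scoped to this pipeline (hypothetical scheme).
            count = seen.get(name, 0)
            seen[name] = count + 1
            suffix = f"#{count}" if count else ""
            resolved.append(f"{pipeline_name}.{name}{suffix}")
    return resolved


ids = resolve_ids("p3", ["&src2", "tfm1", "&snk2"])
```

Under these assumptions `ids` is `["src2", "p3.tfm1", "snk2"]`, so logs and metrics for the copied `tfm1` would be distinguishable from those of any other copy.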
I dislike exposing the pointer/copy syntax at all to the user, but these are developers and I don't think this concept is too advanced. Alternatively, we could just "make it work" by assuming users want to copy transforms and reference sources/sinks.
These are all really interesting! I appreciate the time and effort spent trying to munge TOML into a useful graph language.
My biggest question around all of these proposals is whether we're making our TOML complex enough that we lose the benefits of using TOML in the first place (simplicity, familiarity, etc). Because if that's the case, we'll end up with the worst of both worlds: an awkward and unnatural language for expressing graphs, and a config format that's difficult for new users to pick up.
I know writing our own config language is the nuclear option, but it is at least a valuable strawman to compare these proposals against.
Ideally, I'd like to avoid conflating the format of our config (TOML, YAML, DOT, custom, etc) with the structure (
For example, we could explore DOT (#1699) as an alternative to pipelines. However, in terms of structure it actually puts us in the same situation as the original pipelines spec, where we need to add more syntax (or assumptions) on top in order to distinguish between references/copies of a subgraph, otherwise we can't support snippet reuse.
This digression leads us into the exploration of syntax alone which I don't think is helpful unless we're committed to a certain structure. Vector components aren't generalised nodes on a graph, they have different types (source, transform, sink), which each have their own rules. So when we create a structure for expressing chains of components we need to take that into account somehow. We also want to support snippet reuse without causing unexpected side effects.
If we can defer the decision of our config format then it allows us to choose the right structure for Vector, and then afterwards select a format that suits it well, instead of confusing the two and using one as a crutch for the other.
With that said I think it's worth doing a review of the structure concepts we currently have so that we're not comparing apples with oranges. I'm picking arbitrary names for these:
This is what we currently have. Each component is defined globally and selects the global siblings it wishes to consume from. This results in a flat list of components where the way in which they interact isn't immediately clear, and changing that often requires editing multiple places, giving ample opportunity for errors.
The compose transform proposal (#1653) is an attempt to mitigate some of the pain points of writing and maintaining lists of transforms with this structure, but is a complement to the spec rather than a solution.
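For comparison with the other structures, the current flat structure can be sketched in Python as a set of global components where each transform and sink names its upstream siblings through `inputs`. The helper `flat_edges` is hypothetical and the `type` values are only illustrative.

```python
def flat_edges(config):
    """Derive (upstream, downstream) edges from a flat component config."""
    known = {name for section in config.values() for name in section}
    edges = []
    for section in ("transforms", "sinks"):
        for name, component in config.get(section, {}).items():
            for upstream in component.get("inputs", []):
                if upstream not in known:
                    raise ValueError(
                        f"{name!r} consumes unknown component {upstream!r}"
                    )
                edges.append((upstream, name))
    return edges


config = {
    "sources": {"src1": {"type": "stdin"}},
    "transforms": {"tfm1": {"type": "json_parser", "inputs": ["src1"]}},
    "sinks": {"snk1": {"type": "console", "inputs": ["tfm1"]}},
}
edges = flat_edges(config)
```

The topology only emerges after walking every component's `inputs`, which is the readability problem: inserting a transform between `tfm1` and `snk1` means adding the new component and also editing `snk1`.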
This stems from the pipelines proposals and takes a lot of inspiration from graph syntaxes. Topologies are defined as linkable lists of component names. This allows the definition of complex graphs from linear arrays, making them easy to parse for both humans and machines.
This is something we haven't really explored yet as it's pretty much the opposite of the existing flattened structure, and is therefore the most extreme change. In a hierarchical structure there aren't necessarily any global components, just pipelines themselves, where each one specifies its sources, transforms, and sinks:
pipeline:
  sources:
    - type: foo
      some: field
    - type: bar
      some_other: dumb_field
  transforms:
    - type: a_thing
      do_it: "like this"
    - type: a_fork
      if: "field.type in [ doc, article, comment ]"
      then:
        - type: do_this
          wat: "this is another transform"
      else:
        - type: do_this_instead
          huh: "this is yet another transform"
  sinks:
    - type: baz
    - type: shared_channel
      called: foo
Pipelines can be linked to each other, which is how we might decide to handle content based multiplexing:
pipeline:
  sources:
    - type: shared_channel
      which: foo
  transforms:
    - type: remove_stuff_i_dont_want
      like: "field.message contains 'nah m8'"
  sinks:
    - type: boo
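The linking itself could be resolved with something like the following Python sketch. It matches each `shared_channel` sink (by its `called` field) to every `shared_channel` source whose `which` field names the same channel. The pipeline names `first` and `second` and the helper `link_pipelines` are hypothetical, since the examples above leave pipelines unnamed.

```python
def link_pipelines(pipelines):
    """Return (producer, consumer, channel) triples for shared channels."""
    producers = {}
    for name, pipeline in pipelines.items():
        for sink in pipeline.get("sinks", []):
            if sink.get("type") == "shared_channel":
                producers.setdefault(sink["called"], []).append(name)

    links = []
    for name, pipeline in pipelines.items():
        for source in pipeline.get("sources", []):
            if source.get("type") != "shared_channel":
                continue
            channel = source["which"]
            if channel not in producers:
                raise ValueError(
                    f"pipeline {name!r} reads from undefined channel {channel!r}"
                )
            links.extend((producer, name, channel) for producer in producers[channel])
    return links


pipelines = {
    "first": {
        "sources": [{"type": "foo"}],
        "sinks": [{"type": "baz"}, {"type": "shared_channel", "called": "foo"}],
    },
    "second": {
        "sources": [{"type": "shared_channel", "which": "foo"}],
        "sinks": [{"type": "boo"}],
    },
}
links = link_pipelines(pipelines)
```

Here `links` contains a single entry connecting `first` to `second` over channel `foo`, and a source that reads from a channel no pipeline writes to is reported as an error.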
Note that this may seem very similar to sub proposal 3, but in fact it also requires the ability to inline transforms in order to have forked processing. This also means transforms themselves as part of their spec need to be able to define their children, so in reality this is still a far cry from pipelines.
Just noting that we've decided to defer this change once again, because it isn't obvious that it's a clear win. A couple of reasons:
It'll be obvious a few months from now if we want to do this. It should continue to pop up in conversations.