You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When we discussed implementation details of counters we came across that they may be implemented as side-output from each operator. Details in issue #31. But it seems side-outputs is more general concept deserving separate issue.
The purpose of this issue is to think about implementing similar concept to Apache Beam additional outputs or Apache Flink side outputs
We could design a parallel outputs to the main output, by naming them (and therefore construct what looks like counter, but essentially is nothing more than another Dataset). The code would then look like this:
Flowflow = Flow.create();
Dataset<Integer> input = ...;
Dataset<Integer> output = FlatMap.of(input)
.using((in, ctx) -> {
ctx.collect( /* do whatever transformation of `in` */ );
ctx.collect("input-elements", 1L);
})
.output();
Dataset<Long> inputElements = flow.getNamedStream("input-elements");
// now I can do whatever i want with this stream, I can window it as I wish, aggregate by a function// of my choice and so on, and finally, persist the dataset where I wish
So,
in the example above I used strings to identify the corresponding outputs, but of course, it was just an example - this would need to be modified a little to incorporate strong typing of the output Datasets - this goes in the direction of tags in the sense of Beam
the output would probably be neither keyed nor windowed, it would, ofcourse, carry timestamp
the output would be available via the Flow
the executor can know if an output is not used, because user code has to read it (via getTaggedStream)
what you do with the output Dataset is left on the user code, so you can use it for bussiness logic (joining it with some "main" dataset) or monitoring and debugging (storing it into appropriate sink - e.g. elastic search)
A little modified example, which covers the above topics:
Flowflow = Flow.create();
Dataset<Integer> input = ...;
NamedTag<Long> elementsTag = NamedTag.named("input-elements").typed(Long.class);
Dataset<Integer> output = FlatMap.of(input)
.using((in, ctx) -> {
ctx.collect( /* do whatever transformation of `in` */ );
ctx.collect(elementsTag, 1L);
})
.withNamedTags(elementsTag)
.output();
Dataset<Long> inputElements = flow.getTaggedStream(elementsTag);
// now I can do whatever i want with this stream, I can window it as I wish, aggregate by a function// of my choice and so on, and finally, persist the dataset where I wish
The text was updated successfully, but these errors were encountered:
When we discussed implementation details of counters we came across that they may be implemented as side-output from each operator. Details in issue #31. But it seems side-outputs is more general concept deserving separate issue.
The purpose of this issue is to think about implementing similar concept to Apache Beam additional outputs or Apache Flink side outputs
According to @je-ik:
We could design a parallel outputs to the main output, by naming them (and therefore construct what looks like counter, but essentially is nothing more than another
Dataset
). The code would then look like this:So,
Dataset
s - this goes in the direction oftags
in the sense of BeamFlow
getTaggedStream
)Dataset
is left on the user code, so you can use it for bussiness logic (joining it with some "main" dataset) or monitoring and debugging (storing it into appropriate sink - e.g. elastic search)A little modified example, which covers the above topics:
The text was updated successfully, but these errors were encountered: