Support scatter/gather operations with unknown number of splits #20
Support Scatter operation: Produce more than one input per task (sub-stream)
Support gather operation: Receiving more than one input before triggering task (sub-stream) #19
Note that scatter with known number of splits is already possible, using parameter channels.
changed the title from "Support scatter operation: Produce more than one input per task (sub-stream)" to "Support scatter/gather operations with unknown number of splits" on Apr 22, 2016
Excerpts from notes saved in my journal today:
I'm just reading your notes here and I find them very interesting. At the moment I'm trying to figure out how to solve a very similar problem: how to integrate a reduce operation naturally into an FBP data processing pipeline in Go.
Something that I'm considering is to specify, in each IP belonging to the same reduce group, a reference to a channel that is used to perform the reduce operation. The tricky part is that the process in charge of the reduce operation has to maintain state during the operation (it cannot be shared) and reset itself or die once it completes and sends the result back to the network. This would mean creating processes on demand per reduce operation, or per group of IPs to be reduced, and I'm not sure how all this would fit within the FBP paradigm.
The proposal is still not clear in my head, so this is just a vague idea that came up while reading your comments. I'll try to think more about this so I can propose something more elaborate, but I'd love to discuss this problem further.
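To make the channel-reference idea concrete, here is a minimal sketch in plain Go. It is not SciPipe code: the `IP` struct, its `ReduceTo` field, and the helpers `startReducer` and `sumGroup` are hypothetical names invented for illustration, and summing integers stands in for whatever the real reduce operation would be.

```go
package main

import (
	"fmt"
	"sync"
)

// IP is a hypothetical information packet. ReduceTo points at the reducer
// responsible for this packet's group (an assumption made for this sketch).
type IP struct {
	Value    int
	ReduceTo chan<- int
}

// startReducer launches one on-demand reducer goroutine. Its state (the
// running sum) stays private, and it sends a single result once its input
// channel is closed, after which the goroutine ends.
func startReducer() (chan<- int, <-chan int) {
	in := make(chan int)
	out := make(chan int, 1)
	go func() {
		sum := 0
		for v := range in {
			sum += v
		}
		out <- sum
		close(out)
	}()
	return in, out
}

// sumGroup creates a reducer for one group, tags every IP with the reducer's
// channel, and lets a downstream process forward each IP to the reducer the
// IP itself names.
func sumGroup(values []int) int {
	reduceIn, reduceOut := startReducer()
	ips := make(chan IP)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // downstream process: knows nothing about the reducer up front
		defer wg.Done()
		for ip := range ips {
			ip.ReduceTo <- ip.Value
		}
	}()

	for _, v := range values {
		ips <- IP{Value: v, ReduceTo: reduceIn}
	}
	close(ips)
	wg.Wait()       // every IP of the group has reached the reducer
	close(reduceIn) // signals "group complete" to the reducer
	return <-reduceOut
}

func main() {
	fmt.Println(sumGroup([]int{1, 2, 3})) // prints 6
}
```

The sketch also shows the awkward part mentioned above: something outside the reducer has to know when the group is complete and close the reducer's input.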
Hi @monkeybutter and thanks for the very interesting input!
It is interesting that you propose using a channel for the reduce operation. I had not even thought about that option, so I will need to think more about this :)
Offhand, though, it seems that specifying at an early stage, in the IPs, how to perform the reduce operation somewhat defeats the idea of keeping processes self-contained. That is, you would want to keep all the logic for a particular reduce operation fully contained within one process. Therefore, I think I'm leaning towards the key idea so far, since keys for data don't explicitly define how to do a reduce operation, but only provide the raw data that can be used for it. Please correct me if I misunderstood!
Regarding multiple reduce operations, and what you call "creating processes on demand": this is, as far as I can see, what the FBP concept of sub-networks is for, tightly coupled with sub-streams. That is, given some kind of delimiter for sub-streams (in J. Paul Morrison's FBP book, these are special open/close bracket IPs), a new sub-network is "launched" (exactly what you are getting at) once a sub-stream is opened. It then runs until the sub-stream is closed, whereupon it is terminated.
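A minimal sketch of the key-based alternative, in plain Go rather than SciPipe's API: the IPs carry only a key and raw data, and the reduce logic lives entirely inside one process. The `IP` struct, its `Key` field, and `reduceByKey` are hypothetical names, and summation is just a placeholder reduce operation.

```go
package main

import "fmt"

// IP is a hypothetical packet carrying raw data plus a grouping key.
// Nothing in the packet says how the reduce is performed.
type IP struct {
	Key   string
	Value int
}

// reduceByKey is a self-contained reduce process: it groups incoming IPs by
// key and sums their values. It can only return once the input channel is
// closed, because the number of IPs per key is not known in advance.
func reduceByKey(in <-chan IP) map[string]int {
	sums := make(map[string]int)
	for ip := range in {
		sums[ip.Key] += ip.Value
	}
	return sums
}

func main() {
	in := make(chan IP)
	go func() {
		defer close(in)
		for _, ip := range []IP{{"a", 1}, {"b", 10}, {"a", 2}} {
			in <- ip
		}
	}()
	sums := reduceByKey(in)
	fmt.Println(sums["a"], sums["b"]) // prints 3 10
}
```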
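The bracket-IP mechanism can be sketched roughly as follows. This is an illustration in plain Go, not SciPipe or any FBP library: `Kind`, `IP`, `gather`, `collect`, and `run` are hypothetical names, nested brackets are not handled, and the "sub-network" is reduced to a single collecting goroutine.

```go
package main

import "fmt"

// Kind marks an IP as ordinary data or as a sub-stream delimiter.
type Kind int

const (
	Data Kind = iota
	Open
	Close
)

// IP is a hypothetical information packet.
type IP struct {
	Kind  Kind
	Value string
}

// collect stands in for a sub-network: it lives for exactly one sub-stream
// and hands back everything it received when its input is closed.
func collect(in <-chan string, done chan<- []string) {
	var items []string
	for v := range in {
		items = append(items, v)
	}
	done <- items
}

// gather launches a collector on every Open bracket, feeds it the data IPs
// that follow, and emits its result on the matching Close bracket.
func gather(in <-chan IP, out chan<- []string) {
	defer close(out)
	var sub chan string
	var done chan []string
	for ip := range in {
		switch ip.Kind {
		case Open:
			sub = make(chan string)
			done = make(chan []string, 1)
			go collect(sub, done)
		case Data:
			if sub != nil { // data outside any bracket is dropped here
				sub <- ip.Value
			}
		case Close:
			if sub != nil {
				close(sub)
				out <- <-done
				sub = nil
			}
		}
	}
	if sub != nil { // stream ended without a Close bracket: flush anyway
		close(sub)
		out <- <-done
	}
}

// run pushes a slice of IPs through gather and returns the gathered groups.
func run(ips []IP) [][]string {
	in := make(chan IP)
	out := make(chan []string)
	go func() {
		defer close(in)
		for _, ip := range ips {
			in <- ip
		}
	}()
	go gather(in, out)
	var groups [][]string
	for g := range out {
		groups = append(groups, g)
	}
	return groups
}

func main() {
	stream := []IP{
		{Open, ""}, {Data, "a"}, {Data, "b"}, {Close, ""},
		{Open, ""}, {Data, "c"}, {Close, ""},
	}
	fmt.Println(run(stream)) // prints [[a b] [c]]
}
```

Neither the producer nor the gatherer needs to know the number of splits in advance; the brackets carry that information in the stream itself.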
Many thanks for the input! This issue is my biggest concern at the moment, so I'm very eager to find a generic and elegant solution to it very soon. Let's keep thinking! :)
I agree that having to define the reduce operation as part of an IP defeats the idea of FBP. It's kind of introducing state in IPs instead of just keeping state at the network level.
I've been thinking more about the concept of gather/reduce operations inside networks. My first approach was to think of networks as long-lived processes in which IPs from different "tasks" (groups of IPs belonging to one reduce operation) arrive asynchronously. This is why I was thinking that IPs had to carry some kind of identification for the reduce operation. It is also where my previous proposal of specifying a channel to perform the reduce operation came from.
After some more thought, I've realised that if networks are conceived as short-lived processes and are dynamically created per "task", the reduce operation happens naturally as another process in the network. A reduce operation only produces an output when it has gone through all the IPs that constitute a "task" and the whole network terminates after this.
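As a rough illustration of the short-lived-network idea, the sketch below builds a throwaway three-stage network per task using plain Go channels and goroutines. `runTask` and the stages inside it are hypothetical, and upper-casing and joining strings are placeholders for real per-item and reduce work.

```go
package main

import (
	"fmt"
	"strings"
)

// runTask builds a network for exactly one task: source -> upper -> reduce.
// The reduce stage needs no group identifiers, because every IP it sees
// belongs to this task; it emits its single output when its input closes,
// and all goroutines of the network end after that.
func runTask(words []string) string {
	src := make(chan string)
	upper := make(chan string)
	result := make(chan string, 1)

	go func() { // source: emits the task's IPs, then closes
		defer close(src)
		for _, w := range words {
			src <- w
		}
	}()
	go func() { // per-item process
		defer close(upper)
		for w := range src {
			upper <- strings.ToUpper(w)
		}
	}()
	go func() { // reduce process: output only after all input is consumed
		var parts []string
		for w := range upper {
			parts = append(parts, w)
		}
		result <- strings.Join(parts, " ")
	}()

	return <-result
}

func main() {
	// One fresh network per task.
	for _, task := range [][]string{{"a", "b"}, {"c"}} {
		fmt.Println(runTask(task))
	}
	// prints:
	// A B
	// C
}
```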
I don't know how this short-lived network concept fits with your SciPipe design. As far as I know, instantiating a network seems to be very lightweight in your design: it just requires instantiating a few channels, structs and goroutines. I wonder how closely this is related to the concept of "sub-networks" that you mention in your comment. Reading your description of them, it seems quite related to the idea of short-lived networks that I'm trying to describe.
Anyway, I'm looking forward to reading your comments on this. I don't know if this is helping you figure out a solution to your original problem, "scatter/gather with unknown number of splits", but it has helped me a lot in thinking about how reduce operations can be introduced in an FBP network. Thanks for that.
Yes, I think your "short-lived networks" and FBP sub-networks overlap quite exactly.
In SciPipe, I have example code on how to wire up a sub-network (workflow), in the factory method of a "subnetwork process" here. But it does not yet support sub-streams, which is a big deal, and this is what I hope to solve with a sub-stream channel field in the IPs as described above. I just hope I'm not introducing state in IPs that way somehow :)
Ideally, we would like to support most of these operations using the standard SciProcess type. With this in mind, this issue depends on #30, and maybe on another feature: creating parameter sweeps by taking parameters via channels instead of as fixed values (so that a cross-product of all parameter combinations can be produced).
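One way to picture a parameter sweep fed by channels is the sketch below. It is plain Go and does not use the SciProcess type or any existing SciPipe API: `Combo`, `feed`, and `crossProduct` are hypothetical names. It also assumes the second parameter stream is finite and small enough to buffer in memory, since a cross-product cannot be emitted before at least one side is fully known.

```go
package main

import "fmt"

// Combo is one hypothetical parameter combination.
type Combo struct {
	A, B string
}

// feed turns a fixed list of values into a channel that is closed at the end.
func feed(vals ...string) <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch)
		for _, v := range vals {
			ch <- v
		}
	}()
	return ch
}

// crossProduct buffers everything from bs, then pairs each value arriving on
// as with every buffered b. The two inputs must be fed by independent
// goroutines, otherwise the producer of as could block while bs is drained.
func crossProduct(as, bs <-chan string) <-chan Combo {
	out := make(chan Combo)
	go func() {
		defer close(out)
		var buffered []string
		for b := range bs {
			buffered = append(buffered, b)
		}
		for a := range as {
			for _, b := range buffered {
				out <- Combo{A: a, B: b}
			}
		}
	}()
	return out
}

func main() {
	for c := range crossProduct(feed("x", "y"), feed("1", "2")) {
		fmt.Println(c.A, c.B)
	}
	// prints:
	// x 1
	// x 2
	// y 1
	// y 2
}
```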