Implement depth-first traversal for pipelines #47
So once again, I'm back to wondering if I should just go ahead and implement parallelism for each pipeline. This was considered in #42, and now it's come up again less than a week later. The thought process is that if we're going to support depth-first traversal, we're already going to confuse the tracing output, and it probably makes sense to do depth-first always for consistency and to minimize confusion. At that point, there's really nothing holding us back from making the whole thing async. Currently investigating the TPL Dataflow library for this.

One big change (regardless of whether we go async or synchronous depth-first) would be that modules could no longer access the full set of documents from their own pipeline. For example, if a blog post accesses other blog posts to display the next and previous ones, that wouldn't work, because the first post would get all the way to the end of the pipeline before the second post is even processed. The mitigation would be to read enough metadata for all posts in one pipeline, then continue processing in a second pipeline for layout. The switch between pipelines would act as a chokepoint, letting all posts get processed for metadata before continuing to layout and output.
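To make the difference concrete, here's a minimal sketch (in Python rather than the project's C#, with made-up one-in/many-out modules) contrasting the two traversal orders. With pure per-document modules both orders produce the same outputs, but under breadth-first a module runs against the whole document set at once, while under depth-first the first document reaches the end of the pipeline before the second is even started:

```python
# Sketch only: modules are functions taking one document and returning a
# list of output documents.

def breadth_first(modules, documents):
    """Each module processes the full document set before the next module runs."""
    for module in modules:
        next_docs = []
        for doc in documents:
            next_docs.extend(module(doc))
        documents = next_docs
    return documents

def depth_first(modules, documents):
    """Each document runs through the whole module chain before the next starts."""
    results = []

    def run(doc, remaining):
        if not remaining:
            results.append(doc)
            return
        for out in remaining[0](doc):
            run(out, remaining[1:])

    for doc in documents:
        run(doc, modules)
    return results

# Two toy per-document modules (hypothetical stand-ins for real modules).
upper = lambda d: [d.upper()]
suffix = lambda d: [d + "!"]

print(breadth_first([upper, suffix], ["a", "b"]))  # ['A!', 'B!']
print(depth_first([upper, suffix], ["a", "b"]))    # ['A!', 'B!']
```

This is also why a module that inspects sibling documents (like the next/previous post example) only works breadth-first: in `depth_first`, when `"a"` is being processed there is no `"b"` anywhere in flight yet.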
Another problem: how to deal with modules that operate on all input documents as a unit, i.e., a (hypothetical) aggregate module? In Dataflow, there's no multiple-input, multiple-output; each block operates on a single input, regardless of the number of outputs. It's also not clear, when returning multiple outputs in Dataflow, how to make the iteration lazy, that is, how to avoid materializing everything up front if a module returns several outputs (such as […]).

Current thinking now is to implement a custom asynchronous pipeline. An internal class will wrap the module and provide […].
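As a rough illustration of the aggregate-module tension (a sketch with hypothetical module shapes, not project code): in a lazy depth-first stream, a per-document module can pull and yield one input at a time, but an aggregate module has to consume its entire upstream before it can emit anything, which turns it into exactly the kind of chokepoint described above:

```python
# Per-document module: lazy, handles one input at a time.
def per_document(docs):
    for doc in docs:
        yield doc.upper()

# Aggregate module: must materialize every upstream document before
# producing its single combined output, defeating laziness.
def aggregate(docs):
    materialized = list(docs)
    yield " ".join(materialized)

stream = aggregate(per_document(iter(["a", "b", "c"])))
print(next(stream))  # 'A B C'
```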
After attempting to implement both asynchronous pipeline processing and then synchronous depth-first (by relying on lazy iteration), there are just too many compromises in both cases. In addition to the loss of easily understood sequential trace output, there are complications with ensuring full iteration, synchronizing metadata access (in the case of asynchronous processing), dealing with aggregate modules (as described above), and dealing with modules like […].

Instead, I'd like to continue using the breadth-first synchronous processing model that was originally designed. That said, there certainly is a need to process documents one at a time for use cases like processing multiple large images. This will hopefully be the exception, so the support can be opt-in. I've created a new module, […].
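A minimal sketch of that opt-in idea (the module and helper names here are hypothetical, since the actual module name isn't given above): a wrapper module that runs its child modules once per input document instead of once over the whole batch, so only one document is in flight at a time while the outer pipeline stays breadth-first:

```python
# Sketch only: a module is a function from a list of documents to a list
# of documents, and a pipeline is just a sequence of modules.

def run_pipeline(modules, documents):
    for module in modules:
        documents = module(documents)
    return list(documents)

def for_each(*children):
    """Hypothetical wrapper module: run the child modules against each
    input document individually, keeping memory use to one document."""
    def module(documents):
        outputs = []
        for doc in documents:
            outputs.extend(run_pipeline(list(children), [doc]))
        return outputs
    return module

# Example: wrap a memory-heavy image step so it sees one document at a time.
load = lambda docs: [d + ":loaded" for d in docs]
resize = lambda docs: [d + ":resized" for d in docs]

print(run_pipeline([for_each(load, resize)], ["img1", "img2"]))
# ['img1:loaded:resized', 'img2:loaded:resized']
```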
See discussion in #25