Implement depth-first traversal for pipelines #47

Closed
daveaglick opened this issue Aug 7, 2015 · 3 comments

@daveaglick
Member

See discussion in #25

@daveaglick
Member Author

So once again, I'm back to wondering if I should just go ahead and implement parallelism for each pipeline. This was considered in #42 and now it's come up again, less than a week later. The thought process is that if we're going to support depth-first traversal, we're already going to confuse the tracing output, and it probably makes sense to just do depth-first always for consistency and to minimize confusion. At that point, there's really nothing holding us back from just making the whole thing async.

Currently investigating the TPL Dataflow library for this:
https://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx
http://www.michaelfcollins3.me/blog/2013/07/18/introduction-to-the-tpl-dataflow-framework.html
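
To get a feel for the shape of it, here's a minimal sketch of how a pipeline of modules might map onto Dataflow blocks. The TransformManyBlock/ActionBlock types and linking options are the real Dataflow API; IDocument and the Execute* methods are placeholders standing in for our modules, not anything that exists today:

// Minimal sketch: each module becomes a TransformManyBlock that takes one
// document in and produces zero or more documents out. IDocument and the
// Execute* methods below are placeholders, not real types from this project.
using System.Collections.Generic;
using System.Threading.Tasks.Dataflow;

interface IDocument { }
class Document : IDocument { }

class DataflowSketch
{
    static void Main()
    {
        var readFiles = new TransformManyBlock<IDocument, IDocument>(doc => ExecuteReadFiles(doc));
        var markdown = new TransformManyBlock<IDocument, IDocument>(doc => ExecuteMarkdown(doc));
        var writeFiles = new ActionBlock<IDocument>(doc => ExecuteWriteFiles(doc));

        // Link the blocks so outputs flow downstream and completion propagates.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        readFiles.LinkTo(markdown, link);
        markdown.LinkTo(writeFiles, link);

        // Seed the pipeline with an initial empty document and wait for the end.
        readFiles.Post(new Document());
        readFiles.Complete();
        writeFiles.Completion.Wait();
    }

    // Placeholder module implementations.
    static IEnumerable<IDocument> ExecuteReadFiles(IDocument input) { yield return input; }
    static IEnumerable<IDocument> ExecuteMarkdown(IDocument input) { yield return input; }
    static void ExecuteWriteFiles(IDocument input) { }
}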

One big change (regardless of whether we go async or just depth-first) would be that modules can no longer access the set of documents from their own pipeline. For example, if a blog post accesses other blog posts to display the next and previous ones, that wouldn't work because the first post would get all the way to the end of the pipeline before the second post is processed. The mitigation would be to read enough metadata for all posts in one pipeline, then continue processing in a second pipeline for layout. The switch in pipelines would act as a chokepoint, letting all posts get processed for metadata before continuing to layout and output.
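
To make the chokepoint idea concrete, a config might end up looking something like the rough sketch below. The two Pipelines.Add calls follow the existing config style, but ExtractFrontMatter, DocumentsFrom, and RenderLayout are hypothetical placeholders for whatever actually gathers metadata and renders layouts:

// First pipeline: read every post and populate its metadata. Nothing here
// depends on other posts, so depth-first processing order doesn't matter.
Pipelines.Add("PostMetadata",
    ReadFiles("posts/*.md"),
    ExtractFrontMatter()  // hypothetical module that reads front matter into metadata
);

// Second pipeline: by the time this runs, every post from the first pipeline
// already has its metadata, so a layout can safely look up next/previous posts.
Pipelines.Add("PostLayout",
    DocumentsFrom("PostMetadata"),  // hypothetical module that pulls documents from the named pipeline
    RenderLayout(),                 // hypothetical layout/template module
    WriteFiles(".html")
);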

@daveaglick
Member Author

Another problem: how to deal with modules that operate on all input documents as a unit? For example, a (hypothetical) Aggregate module that combines all input documents into a single output document (come to think of it, we should probably make this module).
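
Something like the following sketch captures the idea - the IModule signature is only an approximation of how modules currently receive the full list of input documents, and GetNewDocument stands in for whatever document-creation API we'd actually use:

using System.Collections.Generic;
using System.Linq;

// Hypothetical Aggregate module: consumes every input document and yields a
// single combined document. Signature and document creation are approximations.
public class Aggregate : IModule
{
    public IEnumerable<IDocument> Execute(IReadOnlyList<IDocument> inputs, IExecutionContext context)
    {
        // Concatenate the content of all inputs into one output document.
        string combined = string.Concat(inputs.Select(x => x.Content));
        yield return context.GetNewDocument(combined);
    }
}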

In Dataflow, there's no multiple-input, multiple-output - each block operates on a single input, regardless of the number of outputs. It's also not clear how to make the iteration lazy when a block returns multiple outputs - that is, if a module returns several outputs (such as ReadFiles), the corresponding Dataflow block would materialize all the file documents at once and only then send them, one at a time, to the next module block.

Current thinking is to implement a custom asynchronous pipeline. An internal class will wrap each module and provide BlockingCollection<IDocument> collections for both input and output. Add a new IAsyncModule interface that has an async Execute(...) method. The wrapper should lazily evaluate the enumerable returned from the module and add items to the output BlockingCollection as they're available (at which point the next module will pick up and go). To satisfy the aggregate use case above, modules should be able to either block until the input BlockingCollection reports IsCompleted or signal in some other way that they need all the documents to be available before executing (perhaps with another interface?). Likewise, the wrapper should be sure to call CompleteAdding() on its output collection when enumeration of the results from the module is done.
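
In rough code, the wrapper might look something like this - a sketch only, where IAsyncModule, the wrapper class, and the exact Execute signature are all up for grabs, and IDocument stands in for the real document interface:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical async module contract: takes a lazily-consumable sequence of
// input documents and produces a sequence of output documents.
public interface IAsyncModule
{
    Task<IEnumerable<IDocument>> ExecuteAsync(IEnumerable<IDocument> inputs);
}

// Wraps a module with blocking input/output collections so modules can run
// concurrently while still handing documents downstream one at a time.
internal class AsyncModuleWrapper
{
    private readonly IAsyncModule _module;

    public BlockingCollection<IDocument> Input { get; } = new BlockingCollection<IDocument>();
    public BlockingCollection<IDocument> Output { get; } = new BlockingCollection<IDocument>();

    public AsyncModuleWrapper(IAsyncModule module)
    {
        _module = module;
    }

    public async Task RunAsync()
    {
        // GetConsumingEnumerable blocks until documents arrive and completes
        // once the upstream wrapper calls CompleteAdding on our Input.
        IEnumerable<IDocument> results = await _module.ExecuteAsync(Input.GetConsumingEnumerable());

        // Lazily enumerate the module's results, pushing each document
        // downstream as soon as it becomes available.
        foreach (IDocument result in results)
        {
            Output.Add(result);
        }

        // Signal the next module that no more documents are coming.
        Output.CompleteAdding();
    }
}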

@daveaglick
Member Author

After attempting to implement both asynchronous pipeline processing and synchronous depth-first traversal (by relying on lazy iteration), there are just too many compromises in both cases. In addition to the loss of easily understood sequential trace output, there are complications with ensuring full iteration, synchronizing metadata access (in the case of asynchronous processing), dealing with aggregate modules (as described above), dealing with modules like Branch and If that potentially require multiple iteration, etc.

Instead, I'd like to continue using the breadth-first synchronous processing model that was originally designed. That said, there is certainly a need to process documents one at a time for use cases like processing multiple large images. This will hopefully be the exception, so the support can be opt-in. I've created a new module, ForEach, that should work in this situation by essentially running its child module sequence multiple times, once for each input document (instead of the normal process of feeding all input documents to the next module at once). In the case of image processing, it should be used like this:

Pipelines.Add("ImageProcessing",
    // ReadFiles will create N new documents with a Stream (but nothing will be read into memory yet)
    ReadFiles("*")
        .Where(x => new []{".png", ".jpg", ".jpeg", ".gif"}.Contains(Path.GetExtension(x))),
    // Each document in N will be individually sent through the sequence of ForEach child pipelines
    ForEach(
        // This will load the *current* document into a MemoryStream (?)
        ImageProcessor()
            .Resize(100,100).ApplyFilters(ImageFilter.GreyScale, ImageFilter.Comic)
            .And.Resize(60, null).Brighten(30)
            .And.Resize(0, 600).Darken(88)
            .And.Constrain(100,100),
        // and this will save the stream to disk, replacing it with a file stream,
        // thus freeing up the memory for the next file
        WriteFiles()
    )
);
