Use streams for document content #42
Comments
I'm toying with the idea of using … See http://blogs.msdn.com/b/pfxteam/archive/2010/04/14/9995613.aspx and https://github.com/slashdotdash/ParallelExtensionsExtras/blob/4df9a0843901d6449ee519a6cad828eb5a54a602/src/CoordinationDataStructures/Pipeline.cs
The more I think about this, I do keep coming back to just using byte arrays (or plain old strings) and buffering the data in one big block from module to module. Consider this scenario: you have an image that needs to be resized to two different sizes. With streams (or some sort of stream-like collection) you'll have to read the entire stream to perform the first resize. Then you'll need to re-read the stream to perform the second resize. That means you're either going back to disk (slow!) or buffering the stream, at which point it would have been more efficient to store and pass the byte array to begin with. Consider also operations like string manipulation, find and replace, etc. It's all easier to code against with single primitive objects. Of course, this isn't without problems (hence the consideration of streams in the first place).
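The buffering trade-off described above can be sketched in C#. The names here are illustrative only (this is not Wyam's actual API): the source is read into memory once, and each consumer, such as the two resize operations, gets its own independent seekable view over the same bytes instead of re-reading from disk.

```csharp
using System.IO;

// Hypothetical sketch: buffer a source stream once, then hand each
// consumer its own read-only MemoryStream over the same byte array.
public static class ContentBuffer
{
    // Read the entire source stream into a byte array a single time.
    public static byte[] ReadAllBytes(Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Each caller gets an independent, seekable view; no second disk read.
    public static Stream OpenView(byte[] content) =>
        new MemoryStream(content, writable: false);
}

// Usage for the two-resize scenario:
//   byte[] image = ContentBuffer.ReadAllBytes(File.OpenRead("photo.jpg"));
//   using (Stream s1 = ContentBuffer.OpenView(image)) { /* first resize */ }
//   using (Stream s2 = ContentBuffer.OpenView(image)) { /* second resize */ }
```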
My questions are:
I've been giving this a lot of thought over the last couple of days and have finally decided on a way forward (thanks, as always, for the input @JimBobSquarePants and @dodyg). Normally I wouldn't go through so much hand-wringing and would just ship, but this is a pretty fundamental aspect of a young project, so I want to make sure to get it right. This is also going to be another long comment because I want to document the decision for my future self.

Wyam was created first and foremost because I saw a lack of static generators that could be used in more sophisticated scenarios with the ability to easily customize the content flow. Other generators are either too focused on a specific use case (like blogs) or require too much complicated up-front work. The concept of easily manipulating string content is fundamental to this design goal, so I'm going to make sure that stays in. I don't want users to have to worry about manipulating streams if all they want to do is a search and replace or some other simple mutation.

That said, there are also very good reasons why using strings under the hood won't be the best long-term solution. There are memory issues to contend with. There's also the matter of sending binary content through the pipeline. And it's been pointed out that encoding will become a factor too. The system has to accommodate streaming data in order to make sure we address all these potential pitfalls. So, here's what I'm going to do:
Great.
…t.ContentStream and additional IDocument.Clone(...) methods, also made Document disposable so it could clean up internally created streams
…d a wrapper for non-seekable streams and reset streams before each module - cleaned up disposable implementations - refactored some namespaces for cleaner separation
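As a rough sketch of the ideas in these commits, wrapping non-seekable streams and resetting streams before each module, the following uses assumed helper names and is not the actual Wyam implementation:

```csharp
using System.IO;

// Hypothetical helpers: make a stream seekable by buffering it if needed,
// and rewind it so each module in the pipeline reads from the beginning.
public static class StreamPipeline
{
    // If the stream can already seek, use it as-is; otherwise copy it
    // into a MemoryStream so later modules can rewind and re-read it.
    public static Stream EnsureSeekable(Stream stream)
    {
        if (stream.CanSeek)
        {
            return stream;
        }
        var buffered = new MemoryStream();
        stream.CopyTo(buffered);
        stream.Dispose();          // the original is consumed; the copy replaces it
        buffered.Position = 0;
        return buffered;
    }

    // Reset position before handing content to the next module so every
    // module sees the full stream regardless of what ran before it.
    public static void ResetForModule(Stream stream) => stream.Position = 0;
}
```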
@dodyg - I still have some work to do to get all the in-built modules to use streams instead of strings, but both …
Going to go ahead and close this out. I decided against converting the other modules over to stream use. The reasoning is that they're either control flow (in which case they don't access content directly anyway) or string-based (…).

Even if I did want to address contiguous memory issues, it's unclear how to do so. Either the seekable stream would have to be chunked (since a simple …).

The important thing is that both …
Instead of strings, use streams for binary content (see discussion in #25). This will allow documents to contain either string or binary content. It should also yield better performance in some cases where transformations are optimized for streaming data (instead of having to read the content into a string). Will need to convert all existing modules over to streams. Should also add convenience getters and setters that convert the stream to and from byte arrays and strings (for getters, handle reading the stream into the array or string, and vice versa for setters).
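A minimal sketch of the convenience accessors this issue describes, assuming UTF-8 and hypothetical names (`StreamDocument` is not Wyam's actual type), might look like this. Encoding is made explicit because, as noted later in the discussion, it matters for round-tripping string content:

```csharp
using System.IO;
using System.Text;

// Hypothetical document whose content lives in a stream, with string and
// byte[] convenience accessors layered on top.
public class StreamDocument
{
    private MemoryStream _content = new MemoryStream();

    public Stream ContentStream => _content;

    // Getter: rewind and read the whole stream into a string.
    public string GetString()
    {
        _content.Position = 0;
        using (var reader = new StreamReader(
            _content, Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true,
            bufferSize: 1024,
            leaveOpen: true))   // keep the underlying stream usable afterward
        {
            return reader.ReadToEnd();
        }
    }

    // Setter: replace the stream's contents with the encoded string.
    public void SetString(string value) =>
        _content = new MemoryStream(Encoding.UTF8.GetBytes(value));

    // Getter: snapshot the stream as a byte array.
    public byte[] GetBytes() => _content.ToArray();

    // Setter: replace the stream's contents with the given bytes.
    public void SetBytes(byte[] value) => _content = new MemoryStream(value);
}
```

One design point worth noting: `GetString` leaves the stream open (`leaveOpen: true`) so a module can still hand the same stream to the next module after inspecting it as a string.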