Automatic buffering for plugins #150
It's a very good idea. As part of it, providing a suggested size for the next read helps reduce unnecessary buffering, which is good for performance.
I think it makes a lot of sense, especially your suggestion about […]

At least the above is true for preferred input sizes, I think. Output sizes are a bit trickier. The actual output size is generally not […]

I guess you could have to be clear about whether using the desired buffered […]

On Sun, Oct 25, 2015 at 12:19 PM, Evan Nemerson notifications@github.com wrote: […]
Most (all?) libraries with a streaming API allow a function to be called multiple times, filling the output buffer repeatedly until it has written everything it wants, before giving it more input. That's how zlib, lzma, bzip2, and density work (I can't remember for lz4f, but it's probably the same). Squash exposes this to the consumer, so it shouldn't create any unnecessary copying. Having to allocate enough space to store the entire decompressed buffer is only really an issue for plugins using the all-in-one interface, and this issue is only about streaming. For plugins which only support the all-in-one interface natively, we already buffer the entire input and output if the consumer uses a streaming API (see https://github.com/quixdb/squash/blob/master/docs/internals.md; also note the difference between the streaming and splicing APIs).
I think that, if the plugin implements that callback, it would be guaranteed that process operations provide that much input at a time. Flush and finish operations can be smaller, but everyone who has a streaming API distinguishes between process, finish, and (if they have it) flush, so I don't see that being an issue.

A more interesting question is what to do when Squash is provided with more than the requested amount of input. Many plugins will buffer the excess input, but from Squash's POV it will seem to have been consumed. When that happens, Squash's view of the plugin will be out of sync and you might end up with an extra buffer (one in the library, one in Squash). It might be better to only provide the requested amount of data to the plugin and buffer the rest (unless there is at least another whole block extra, in which case Squash would just keep sending the input one block at a time).

I don't see how passing along a larger output buffer could hurt, though, so it would probably make sense to just pass that along as-is as long as it is larger than the requested size; if it is smaller, Squash would pass its internal output buffer.
Right, but in the ideal case for compression you often want to supply an output buffer that is large enough to accept the output of a single input chunk. E.g., if I take input in 16K chunks, it may be the case that a 17K output buffer will always be enough to compress a chunk in "one shot". Depending on the caller's implementation, a 10K buffer may never cause additional copying, but it would at least cause multiple calls back into the plugin and the underlying codec, which often has its own overhead.

BTW, when looking at my own codecs, what they really often want to know is the total size of the input, if available. On compression, this is only available if the outer layers can provide it, which in the context of squash means the all-at-once API is used, or the streaming API gives the user a way to provide a size hint. On decompression, it is conceivable that the size is embedded in the compressed format, but putting it at the front of the format is difficult/impossible in a streaming context, since it would mean buffering everything on compression until "finish" occurs, at which point the size is known (OTOH, if the size was provided up front, not only as a hint but as a true size, you could write it immediately).

Perhaps that last part is a bit off topic, but I mention it because I think it relates at least tangentially to the "bigger buffer" question. Providing the entire buffer, if it happens to be available (e.g., because big-bang was used, or because you mmap'd a whole file, e.g., in the benchmark), gives that hint to the compressor - or at least the compressor knows the data will be at least that large. That said, over the design space of all compressors, I can certainly imagine that some of them will want the input size hint to be a "not larger than", even if most compressors are fine with larger input buffers.
Some libraries like to receive input in, and/or write output to, fixed-size chunks. For example, density and lz4f (and snappy-framed, until I removed it). Right now plugins have to add logic to buffer things on their end, which makes writing plugins more difficult. It should be possible to adjust the plugin API and move the buffering logic into Squash.
My initial thought is that we could add a single callback for plugins, something like
If present, Squash would invoke it when a stream is created and create buffers of the requested sizes (`input_size`/`output_size` are out params). Then, whenever `squash_process` is invoked, Squash would just buffer the input until it has `input_size` bytes and pass that to the plugin's `process_stream` callback (unless there is no data buffered and `avail_in` >= `input_size`, in which case we could circumvent the buffer). Similarly, Squash would buffer the output for the plugin (if `avail_out` < `output_size`).

We could also expose this information in the API as a suggestion for the next read size, which could help cut down on unnecessary buffering. That way even plugins which can take arbitrary inputs could benefit from this (e.g., bzip2).
Of course, this also makes writing plugins for certain libraries much simpler.
@gpnuma, @Cyan4973, @travisdowns, maybe you have some thoughts on this?