Implemented total_in() and total_out() for GzDecoder, GzEncoder and MultiGzDecoder #382

Open · wants to merge 1 commit into main

Conversation

@fedy-cz commented Oct 15, 2023

(simple forwarding)

@Byron (Member) left a comment

Thanks a lot!

As mentioned in the issue, let's have tests for this to ensure there is no unexpected reason why it wasn't forwarded before. There should be tests somewhere in the codebase that exercise these, I'd hope.

Thank you.

@fedy-cz (Author) commented Oct 15, 2023

Well, I adapted one of the tests from zlib:

    #[test]
    fn total_in() {
        let mut real = Vec::new();
        let mut w = write::GzEncoder::new(Vec::new(), Compression::default());
        let v = crate::random_bytes().take(1024).collect::<Vec<_>>();
        for _ in 0..200 {
            let to_write = &v[..thread_rng().gen_range(0..v.len())];
            real.extend(to_write.iter().copied());
            w.write_all(to_write).unwrap();
        }
        let mut result = w.finish().unwrap();

        let result_len = result.len();

        for _ in 0..200 {
            result.extend(v.iter().copied());
        }

        let mut r = read::GzDecoder::new(&result[..]);
        let mut ret = Vec::new();
        r.read_to_end(&mut ret).unwrap();
        assert_eq!(ret, real);
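        // `result_len` is the length of the complete gzip stream (header, deflate
        // data, and trailer); the garbage bytes appended above must not be counted.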
        assert_eq!(r.total_in(), result_len as u64);
    }

... and sure enough it fails.

The question is whether its initial assumptions are correct. Should total_in() report the current position in the underlying
Reader, or the position within the compressed data stream itself (without the CRC)? The doc comment is unclear:

    /// Returns the number of bytes that have been read into this compressor.
    ///
    /// Note that not all bytes read from the underlying object may be accounted
    /// for, there may still be some active buffering.

We should decide, document it, and then we can write the relevant tests and adapt the code if necessary.

@fedy-cz (Author) commented Oct 15, 2023

If we decide to report the BufRead position (to make the test posted above pass as is), then GzHeaderParser would need to store the header size somewhere, which it currently does not.

@Byron (Member) commented Oct 16, 2023

Thanks for giving it a shot!

I would have expected total_in and total_out to be solely forwarded from the underlying implementation, which should keep track of them all by itself; I would not expect them to be emulated by asking the input stream for its position instead.

Recalling my initial question in the related issue #381 as to why these methods are missing, maybe this is the answer: straightforward forwarding, for some yet-to-be-discovered reason, isn't possible or doesn't work as expected.
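
Roughly, the forwarding I would expect is just delegation to the deflate layer, which already keeps these counters itself. A minimal sketch (the wrapper type and field name are purely illustrative, not flate2's actual internals; DeflateDecoder::total_in()/total_out() are the existing methods being forwarded):

    use flate2::bufread::DeflateDecoder;
    use std::io::BufRead;

    // Illustrative only: a standalone wrapper that forwards total_in/total_out
    // from the deflate decoder it owns, analogous to what the PR does for the
    // Gz* types. Type and field names here are hypothetical.
    struct ForwardingDecoder<R: BufRead> {
        inner: DeflateDecoder<R>,
    }

    impl<R: BufRead> ForwardingDecoder<R> {
        // Compressed bytes consumed by the deflate stream so far; a gzip
        // header or CRC32/ISIZE trailer would not be included in this count.
        fn total_in(&self) -> u64 {
            self.inner.total_in()
        }

        // Decompressed bytes produced so far.
        fn total_out(&self) -> u64 {
            self.inner.total_out()
        }
    }

Whether that plain delegation is the behavior callers expect is exactly the question raised above.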

@fedy-cz (Author) commented Oct 16, 2023

As I described in the note above (#382 (comment)), it's not all hopeless 😄

First let me state a few facts:

  1. It is (so far) not an existing API for the Gz* instances and not part of any generic interface (trait), so there is no risk of breaking existing code.
  2. That being said, it is probably not wise to introduce widely different behavior for similarly named methods within one library.
  3. While (in my opinion) it is not currently fully stated in the documentation, the current implementations of those functions for both zlib::* and deflate::* represent the (relative) position increase in the input/output stream.
  4. Ideally you would be able to get this info without depending on this library at all. For example, the commonly used BufReader implements the Seek trait with its stream_position method, but due to some unfortunate decisions in its design it actually depends on the seek() method by default and therefore requires a mutable borrow 😞 (some initial attempts to rectify that here: Tracking issue for Seek::{stream_len, stream_position} (feature seek_convenience) rust#59359 (comment)). While nothing prevents you from implementing an adapter that implements the Read trait and counts the data passing through yourself, it's another layer of indirection and not very convenient (a minimal sketch of such an adapter follows this list).
  5. While this simple forwarding implementation doesn't meet the expectations set by the "contracts" offered by the Zlib and Deflate variants, it could still be immensely useful, and one could argue it actually offers a better estimate of the compression ratio (by not including the header & footer).
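
To make point 4 concrete, such a counting adapter could look roughly like the sketch below (illustrative only; the names are not part of flate2):

    use std::io::{self, Read};

    // Hypothetical adapter from point 4: counts every byte handed out by read().
    struct CountingReader<R> {
        inner: R,
        bytes_read: u64,
    }

    impl<R: Read> CountingReader<R> {
        fn new(inner: R) -> Self {
            CountingReader { inner, bytes_read: 0 }
        }

        // Total number of bytes read through this adapter so far.
        fn bytes_read(&self) -> u64 {
            self.bytes_read
        }
    }

    impl<R: Read> Read for CountingReader<R> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            let n = self.inner.read(buf)?;
            self.bytes_read += n as u64;
            Ok(n)
        }
    }

Wrapping the compressed source in it before handing it to read::GzDecoder reports the raw reader position, but, as the quoted doc comment warns, the decoder's internal buffering means this count can run ahead of what has actually been consumed by the decompressor.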

What I would suggest:

  1. Expose these (forwarded) values under some new (descriptive and documented) function names.
    Alternatively: Expose a borrow of the contained DeflateDecoder / DeflateEncoder to the users and let them call these functions directly, but that is probably not very clean (it exposes a private member) and is also less discoverable.
  2. Implement Gzip header/footer size tracking/reporting. From initial investigation, at least the header has a dynamic size (it can contain a filename ...); see the sketch after this list.
    Note: It would introduce additional memory requirements to the Decoder / Encoder (probably 2 x u64). Not sure how critical that is.
  3. Implement the relevant tests and the total_in / total_out methods as expected (using the header / footer size data).
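
To illustrate point 2: the footer is always 8 bytes (CRC32 + ISIZE), but the header length depends on the FLG byte, per RFC 1952. A hypothetical helper, purely for illustration and not proposed API, could compute it like this:

    // Length of a gzip header per RFC 1952, or None if `data` is too short.
    // Hypothetical helper for illustration only; not proposed flate2 API.
    fn gzip_header_len(data: &[u8]) -> Option<usize> {
        const FHCRC: u8 = 1 << 1;
        const FEXTRA: u8 = 1 << 2;
        const FNAME: u8 = 1 << 3;
        const FCOMMENT: u8 = 1 << 4;

        let flg = *data.get(3)?;
        // Fixed part: ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS.
        let mut len = 10usize;

        if flg & FEXTRA != 0 {
            let xlen = u16::from_le_bytes([*data.get(len)?, *data.get(len + 1)?]) as usize;
            len += 2 + xlen;
        }
        if flg & FNAME != 0 {
            // NUL-terminated original file name.
            len += data.get(len..)?.iter().position(|&b| b == 0)? + 1;
        }
        if flg & FCOMMENT != 0 {
            // NUL-terminated comment.
            len += data.get(len..)?.iter().position(|&b| b == 0)? + 1;
        }
        if flg & FHCRC != 0 {
            // CRC16 of the header.
            len += 2;
        }
        Some(len)
    }

So the extra bookkeeping really is just a couple of integers per stream, in line with the 2 x u64 estimate above.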

This could be done incrementally, and at least point 1 shouldn't be too controversial.

What do you think?

@Byron (Member) commented Oct 16, 2023

Despite being the most active here right now, and truth be told, I feel very uncomfortable making any API decisions in this crate. Maybe it's possible for you to figure out who wrote what, or maybe even most of it, and reach out to them via GitHub, or via email when a username or linked PR is lacking?

Otherwise I'd actually prefer to just expose the actual compressor (as an answer to point 1), but I also think there was probably a good reason not to expose it.

From my own experience (with gitoxide), I never managed to use the high-level abstractions and went right to the underlying Decompressor instances to steer them directly. The question here is whether a higher-level API could also be used if the documentation were improved or certain utility functions were added.

With that said, to be more comfortable making such a decision, I would definitely have to understand what the new API would be good for. Tests can help, and so could full-blown example programs. Thanks for your understanding.

@Byron added the question label on Apr 26, 2024
@Byron (Member) commented Apr 26, 2024

Is this PR still active?
