Implemented total_in() and total_out() for GzDecoder, GzEncoder and MultiGzDecoder #382

Open · wants to merge 1 commit into main

Conversation

@fedy-cz commented Oct 15, 2023

(simple forwarding)

@Byron (Member) left a comment

Thanks a lot!

As mentioned in the issue, let's have tests for this to ensure there is no unexpected reason why it wasn't forwarded before. There should be tests somewhere in the codebase that exercise these, I'd hope.

Thank you.

@fedy-cz (Author) commented Oct 15, 2023

Well, I adapted one of the tests from zlib:

    #[test]
    fn total_in() {
        let mut real = Vec::new();
        let mut w = write::GzEncoder::new(Vec::new(), Compression::default());
        let v = crate::random_bytes().take(1024).collect::<Vec<_>>();
        for _ in 0..200 {
            let to_write = &v[..thread_rng().gen_range(0..v.len())];
            real.extend(to_write.iter().copied());
            w.write_all(to_write).unwrap();
        }
        let mut result = w.finish().unwrap();

        let result_len = result.len();

        for _ in 0..200 {
            result.extend(v.iter().copied());
        }

        let mut r = read::GzDecoder::new(&result[..]);
        let mut ret = Vec::new();
        r.read_to_end(&mut ret).unwrap();
        assert_eq!(ret, real);
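        // `result_len` is the length of the complete gzip stream (header, deflate
        // data, and trailer); the garbage bytes appended above must not be counted.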
        assert_eq!(r.total_in(), result_len as u64);
    }

... and sure enough it fails.

The question is whether its initial assumptions are correct. Should total_in() report the current position in the underlying
Reader, or the position within the compressed data stream itself (without the CRC)? The doc comment is unclear:

    /// Returns the number of bytes that have been read into this compressor.
    ///
    /// Note that not all bytes read from the underlying object may be accounted
    /// for, there may still be some active buffering.

We should decide, document it, and then we can write the relevant tests and adapt the code if necessary.

@fedy-cz (Author) commented Oct 15, 2023

If we decide to report the BufRead position (to make the test posted above pass as is), then GzHeaderParser would need to store the header size somewhere, which it currently does not.

@Byron (Member) commented Oct 16, 2023

Thanks for giving it a shot!

I would have expected total_in and total_out to be solely forwarded from the underlying implementation, which should keep track of them all by itself; I would not expect them to be emulated by asking the input stream for its position instead.

Recalling my initial question in the related issue #381 as to why these methods are missing, maybe this is the answer: straightforward forwarding, for some yet-to-be-discovered reason, isn't possible or doesn't work as expected.
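
Roughly, the forwarding I would expect is just delegation to the deflate layer, which already keeps these counters itself. A minimal sketch (the wrapper type and field name are purely illustrative, not flate2's actual internals; DeflateDecoder::total_in()/total_out() are the existing methods being forwarded):

    use flate2::bufread::DeflateDecoder;
    use std::io::BufRead;

    // Illustrative only: a standalone wrapper that forwards total_in/total_out
    // from the deflate decoder it owns, analogous to what the PR does for the
    // Gz* types. Type and field names here are hypothetical.
    struct ForwardingDecoder<R: BufRead> {
        inner: DeflateDecoder<R>,
    }

    impl<R: BufRead> ForwardingDecoder<R> {
        // Compressed bytes consumed by the deflate stream so far; a gzip
        // header or CRC32/ISIZE trailer would not be included in this count.
        fn total_in(&self) -> u64 {
            self.inner.total_in()
        }

        // Decompressed bytes produced so far.
        fn total_out(&self) -> u64 {
            self.inner.total_out()
        }
    }

Whether that plain delegation is the behavior callers expect is exactly the question raised above.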

@fedy-cz (Author) commented Oct 16, 2023

As I described in the note above (#382 (comment)), it's not all hopeless 😄

First let me state a few facts:

  1. It is (so far) not an existing API for the Gz* instances and not part of any generic interface (trait), so there is no risk of breaking existing code.
  2. That being said, it is probably not wise to introduce widely different behavior for similarly named methods within one library.
  3. While (in my opinion) it is not currently fully stated in the documentation, the current implementations of those functions for both zlib::* and deflate::* represent the (relative) position increase in the input/output stream.
  4. Ideally you would be able to get this info without depending on this library at all. For example, the commonly used BufReader implements the Seek trait with its stream_position method, but due to some unfortunate decisions in its design it actually depends on the seek() method by default and therefore requires a mutable borrow 😞 (some initial attempts to rectify that here: Tracking issue for Seek::{stream_len, stream_position} (feature seek_convenience) rust#59359 (comment)). While nothing prevents you from implementing an adapter that implements the Read trait and counts the data passing through yourself, it's another layer of indirection and not very convenient (a minimal sketch of such an adapter follows this list).
  5. While this simple forwarding implementation doesn't meet the expectations set by the "contracts" offered by the Zlib and Deflate variants, it could still be immensely useful, and one could argue it actually offers a better estimate of the compression ratio (by not including the header & footer).
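
To make point 4 concrete, such a counting adapter could look roughly like the sketch below (illustrative only; the names are not part of flate2):

    use std::io::{self, Read};

    // Hypothetical adapter from point 4: counts every byte handed out by read().
    struct CountingReader<R> {
        inner: R,
        bytes_read: u64,
    }

    impl<R: Read> CountingReader<R> {
        fn new(inner: R) -> Self {
            CountingReader { inner, bytes_read: 0 }
        }

        // Total number of bytes read through this adapter so far.
        fn bytes_read(&self) -> u64 {
            self.bytes_read
        }
    }

    impl<R: Read> Read for CountingReader<R> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            let n = self.inner.read(buf)?;
            self.bytes_read += n as u64;
            Ok(n)
        }
    }

Wrapping the compressed source in it before handing it to read::GzDecoder reports the raw reader position, but, as the quoted doc comment warns, the decoder's internal buffering means this count can run ahead of what has actually been consumed by the decompressor.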

What I would suggest:

  1. Expose these (forwarded) values under some new (descriptive and documented) function names.
    Alternatively: Expose a borrow of the contained DeflateDecoder / DeflateEncoder to the users and let them call these functions directly, but that is probably not very clean (it exposes a private member) and is also less discoverable.
  2. Implement Gzip header/footer size tracking/reporting. From initial investigation, at least the header has a dynamic size (it can contain a filename ...); see the sketch after this list.
    Note: It would introduce additional memory requirements to the Decoder / Encoder (probably 2 x u64). Not sure how critical that is.
  3. Implement the relevant tests and the total_in / total_out methods as expected (using the header / footer size data).
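
To illustrate point 2: the footer is always 8 bytes (CRC32 + ISIZE), but the header length depends on the FLG byte, per RFC 1952. A hypothetical helper, purely for illustration and not proposed API, could compute it like this:

    // Length of a gzip header per RFC 1952, or None if `data` is too short.
    // Hypothetical helper for illustration only; not proposed flate2 API.
    fn gzip_header_len(data: &[u8]) -> Option<usize> {
        const FHCRC: u8 = 1 << 1;
        const FEXTRA: u8 = 1 << 2;
        const FNAME: u8 = 1 << 3;
        const FCOMMENT: u8 = 1 << 4;

        let flg = *data.get(3)?;
        // Fixed part: ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS.
        let mut len = 10usize;

        if flg & FEXTRA != 0 {
            let xlen = u16::from_le_bytes([*data.get(len)?, *data.get(len + 1)?]) as usize;
            len += 2 + xlen;
        }
        if flg & FNAME != 0 {
            // NUL-terminated original file name.
            len += data.get(len..)?.iter().position(|&b| b == 0)? + 1;
        }
        if flg & FCOMMENT != 0 {
            // NUL-terminated comment.
            len += data.get(len..)?.iter().position(|&b| b == 0)? + 1;
        }
        if flg & FHCRC != 0 {
            // CRC16 of the header.
            len += 2;
        }
        Some(len)
    }

So the extra bookkeeping really is just a couple of integers per stream, in line with the 2 x u64 estimate above.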

This could be done incrementally, and at least point 1 shouldn't be too controversial.

What do you think?

@Byron (Member) commented Oct 16, 2023

Despite being the most active here right now, and truth be told, I feel very uncomfortable making any API decisions in this crate. Maybe it's possible for you to figure out who wrote what, or maybe even most of it, and reach out to them via GitHub, or via email when a username or linked PR is lacking?

Otherwise I'd actually prefer to just expose the actual compressor (as an answer to point 1), but I also think there was probably a good reason not to expose it.

From my own experience (with gitoxide), I never managed to use the high-level abstractions and went right to the underlying Decompressor instances to steer them directly. The question here is whether a higher-level API could also be used if the documentation were improved or certain utility functions were added.

With that said, to be more comfortable making such a decision, I would definitely have to understand what the new API would be good for. Tests can help, and so could full-blown example programs. Thanks for your understanding.

@Byron added the question label on Apr 26, 2024
@Byron (Member) commented Apr 26, 2024

Is this PR still active?
