This repository has been archived by the owner on Jun 2, 2024. It is now read-only.

WIP: API Design #156

Closed
Plecra opened this issue Jun 21, 2020 · 14 comments

Comments

@Plecra
Member

Plecra commented Jun 21, 2020

After sifting through the specification and the issues left in the repository, I've mocked up what I think would be a good API to work towards. It's incomplete, and I mostly bring it up as food for better ideas. Anyway, goals...

  • Lossless modification: We want to be able to reliably modify archives we can't fully parse
  • In-place modification: Interpreting and editing the contents of an existing ZIP file
    • It's what the format was originally designed to be good at
    • ...but also less useful for today's purposes. I just think this would be fun to pull off.
  • It would give many uses a significant speed boost, since we'd need to make fewer reads and writes
  • Guiding users towards correct implementations
    • There are lots of warts in the spec, and we should let as few of them leak through our API as possible
    • These include text encoding, the underspecified encryption, time handling, and whatever else I've missed
  • Descriptive documentation!
    • This is something that we can address in the shorter term, no need to wait for a new API
mod zip {
    use std::{fmt, fs, io, path, time};
    
    pub struct Archive<S> {
        inner: S
    }
    impl<S: io::Read + io::Seek> Archive<S> {
        pub fn open(source: S) -> io::Result<Archive<S>>;
    }
    impl<S> Archive<S> {
        pub fn get(&mut self, path: impl Eq<PathBuf>) -> Option<File<S>>; 
        pub fn find(&mut self, predicate: impl FnMut(&File<S>) -> bool) -> Option<File<S>>;
    }
    impl<S: io::Write> Archive<S> {
        pub fn create(sink: S) -> Archive<S>;
        pub fn insert(&mut self, path: PathBuf) -> FileMut<S>;
    }
    pub enum CompressionMethod {
        Store,
        Shrink,
        Reduce1,
        Reduce2,
        Reduce3,
        Reduce4,
        Implode,
        Deflate,
        Deflate64,
        PkwareImplode,
        Bzip2,
        Lzma,
        IbmZosCmpsc,
        IbmTerse,
        Lz77,
        Zstd,
        Mp3,
        Xz,
        Jpeg,
        WavPack,
        PPMd,
        
        /// The data is stored in a proprietary PKWARE format
        Reserved,
    }
    /// Text from a ZIP file
    ///
    /// While ZIP files should contain either CP437 or UTF-8 encoded text, many
    /// tools do not properly convert their file names, or sometimes even
    /// comments.
    ///
    /// To allow lossless modification of ZIP files, this can be treated as an
    /// opaque type and placed into a new archive.
    ///
    /// Otherwise, you can convert the contents to UTF-8 with
    /// [`String::from`]. Keep in mind that the text may still have been
    /// encoded incorrectly.
    pub struct Text;
    pub struct PathBuf;
    pub struct Attributes;
    impl Attributes {
        pub fn readonly(&self) -> bool;
        pub fn from_unix_mode(mode: u32) -> Self;
    }
    impl Eq<PathBuf> for &str {}
    impl From<fs::Permissions> for Attributes {}
    impl From<Text> for String {}
    impl From<PathBuf> for path::PathBuf {}
    impl fmt::Display for Text {}
    impl fmt::Debug for Text {}
    impl Text {
        /// Yields a [`&str`] if the text can be interpreted as UTF-8
        pub fn to_str(&self) -> Option<&str>;
    }
    impl<S> File<'_, S> {
        pub fn path(&self) -> &PathBuf;
        pub fn comment(&self) -> &Text;
        pub fn last_modified(&self) -> &time::SystemTime;
        pub fn attributes(&self) -> &Attributes;
    }
    impl<S: io::Write> File<'_, S> {
        pub fn set_path(&mut self, path: PathBuf);
        pub fn set_comment(&mut self, text: Text);
        pub fn set_last_modified(&mut self, time: time::SystemTime);
        pub fn set_attributes(&self, attributes: Attributes);
    }
}

Examples

Normalization

Feeding files through the library lets it strip unnecessary data from the archive, and recompress the files where appropriate. This also gives it a chance to lower the archive's "version needed to extract" field.

fn Archive::rebuild<W: io::Write>(&mut self, sink: W, filter: impl FnMut(&File) -> bool);

// Bundle all the files added to `orig.zip` in the last hour.
// This will ensure the archive is created with the minimum features necessary
// to read it.
zip::Archive::open(File::open("orig.zip")?)?
    .rebuild(File::create("new.zip")?, |file| {
        file
            .last_modified()
            .elapsed()
            .map(|dur| dur < Duration::from_secs(60 * 60))
            .unwrap_or(false)
    })?
    .finish()?
Cons
  • Can't modify the files at all

Edit: I've created the 0.7 branch to explore the API redesign. We can make proposals as PRs to it

@Plecra
Member Author

Plecra commented Jun 23, 2020

There's also the possibility of supporting concurrent/parallel archives. We'd need to use custom traits in place of io::{Read, Write} for the concurrent handling. It's harder to come up with a design for that in advance, but here are some things we'll want to expose:

  • Infallible reads for data that supports it (Vec<u8>)
  • Decompression libraries require io traits
  • Read handle duplication (File::try_clone & Cursor::clone)
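To make the first and third points concrete, here's a minimal sketch of what such a custom trait could look like. The name `ReadAt` and its shape are my own assumptions, not anything from the branch; the idea is that taking `&self` instead of `&mut self` (in the spirit of pread(2) and `File::try_clone`) is what makes concurrent readers possible, and in-memory sources can implement it without ever failing.

```rust
use std::io;

// Hypothetical trait name; a stand-in for whatever would replace
// `io::Read + io::Seek`. Taking `&self` rather than `&mut self` means
// multiple threads can read from the same handle concurrently.
pub trait ReadAt {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize>;
}

// In-memory sources can satisfy it infallibly (the `Vec<u8>` case above).
impl ReadAt for Vec<u8> {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        let start = (offset as usize).min(self.len());
        let n = buf.len().min(self.len() - start);
        buf[..n].copy_from_slice(&self[start..start + n]);
        Ok(n)
    }
}
```

An adapter implementing `io::Read` on top of a `(ReadAt, position)` pair would cover the second point, since decompression libraries only need a sequential view of one file's data.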

@rylev
Contributor

rylev commented Jun 23, 2020

I would also like us to address #147

@Plecra
Member Author

Plecra commented Jun 23, 2020

There's another option which might be able to give the user more control: If we could just emit owned Headers when reading the central directory, the user would be able to handle all the parallelisation issues. It's trickier to isolate the library's responsibilities in this model, but I think it could be a lot more powerful.

let source = File::open("orig.zip")?;
let mut out = zip::Archive::create(File::create("new.zip")?);
zip::headers(source.try_clone()?)
    .par_iter()
    .try_for_each_init(
        || source.try_clone().unwrap(),
        |source, r| r.and_then(|header| match header.last_modified().elapsed() {
            // the file was last modified in the last hour
            Ok(dur) if dur < Duration::from_secs(60 * 60) => header.copy(source, out),
            // This `copy` method would have to be magic to actually work
            _ => Ok(()),
        }),
    )?;

EDIT: I believe this addresses the concerns brought up in #147 - is there any particular use case you'd like to improve?

@rylev
Contributor

rylev commented Jun 23, 2020

I'm worried that this API, while powerful, will be tough to use.

@Plecra
Member Author

Plecra commented Jun 26, 2020

We should be able to draw from #91 to improve the parser's startup time (it should be constant-time for archives without comments)

I don't think there's any way to avoid parsing the file entries in order, since the names and comments have variable lengths, but I imagine it's common not to need to parse the whole central directory
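The constant-time case falls out of the End Of Central Directory record's layout: it's exactly 22 bytes when the archive comment is empty, so a reader can try `len - 22` first and only fall back to a backwards scan when a comment pushes the record earlier. A sketch over an in-memory buffer (my own illustration, not code from the branch):

```rust
// EOCD signature 0x06054b50, stored little-endian.
const EOCD_SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
const EOCD_MIN_LEN: usize = 22;

fn find_eocd(buf: &[u8]) -> Option<usize> {
    if buf.len() < EOCD_MIN_LEN {
        return None;
    }
    // Fast path: no archive comment, so the record sits at the very end.
    let guess = buf.len() - EOCD_MIN_LEN;
    if buf[guess..guess + 4] == EOCD_SIG {
        return Some(guess);
    }
    // Slow path: scan backwards through the (up to 65535-byte) comment.
    (0..guess).rev().find(|&i| buf[i..i + 4] == EOCD_SIG)
}
```

A real implementation would also validate that the comment-length field inside the record matches the distance to the end of the file, since the signature bytes can appear by coincidence inside the comment.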

@Plecra
Member Author

Plecra commented Jul 2, 2020

Time

When working with time, we'd ideally be able to state the moment in time at which something happened. In the case of ZIP files, we want to know when a file was added to the archive.

Unfortunately, ZIP files' last mod date/time fields are... not very helpful in achieving that. In the MS-DOS datetime format, they can only identify every other second between 1980 and 2108. More importantly, the time can be in any time zone, rendering it effectively meaningless.
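For reference, this is all the information the MS-DOS fields carry: the time packs hours/minutes/seconds into one u16 (with seconds halved, hence the two-second resolution), the date packs the year as an offset from 1980 into another, and there is no time zone field at all. A decoding sketch:

```rust
// MS-DOS date/time layout:
//   time: bits 15-11 hours, bits 10-5 minutes, bits 4-0 seconds / 2
//   date: bits 15-9 years since 1980, bits 8-5 month, bits 4-0 day
// Returns (year, month, day, hour, minute, second).
fn decode_dos_datetime(date: u16, time: u16) -> (u16, u8, u8, u8, u8, u8) {
    let year = 1980 + (date >> 9); // 7 bits of year offset: 1980..=2107
    let month = ((date >> 5) & 0x0f) as u8;
    let day = (date & 0x1f) as u8;
    let hour = (time >> 11) as u8;
    let minute = ((time >> 5) & 0x3f) as u8;
    let second = ((time & 0x1f) * 2) as u8; // only even seconds representable
    (year, month, day, hour, minute, second)
}
```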

There are a few ZIP extensions which provide alternatives:

It's possible for the parser to encounter any combination of these, so we'll need to be able to resolve conflicts:

  • What to do when the timestamps disagree?
  • What to do when the last mod time is in another time zone? (indistinguishable from above)
  • An API for the user to assert the time zone of the unspecified dates

@Plecra
Member Author

Plecra commented Aug 24, 2020

io::Seek writers

See #114

When writing a ZIP file, the only reason to seek through the output is to update the crc-32, compressed size, and uncompressed size fields in the local header, and even that is entirely optional:

     Bit 3: If this bit is set, the fields crc-32, compressed 
           size and uncompressed size are set to zero in the 
           local header.  The correct values are put in the 
           data descriptor immediately following the compressed
           data.  (Note: PKZIP version 2.04g for DOS only 
           recognizes this bit for method 8 compression, newer 
           versions of PKZIP recognize this bit for any 
           compression method.)

#114 handles this using a runtime flag set using new_streaming. It works, but brings a type-level abstraction down to runtime, and requires the user to explicitly use the right constructor.

The specialization feature would let us implement this completely transparently, just fixing the header when io::Seek is available. While this makes the API cleaner, it's quite opaque for users who require this behaviour. For example, read_zipfile_from_stream is currently implemented using these fields.
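For the streaming case, the data descriptor that replaces those three fields is a simple fixed layout. A serialization sketch (the signature is technically optional per APPNOTE 4.3.9 but commonly emitted; this is the 32-bit, non-ZIP64 form):

```rust
// Data descriptor: optional signature, then crc-32, compressed size,
// uncompressed size, all little-endian u32s.
fn data_descriptor(crc32: u32, compressed: u32, uncompressed: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&0x08074b50u32.to_le_bytes()); // descriptor signature
    out.extend_from_slice(&crc32.to_le_bytes());
    out.extend_from_slice(&compressed.to_le_bytes());
    out.extend_from_slice(&uncompressed.to_le_bytes());
    out
}
```

Because the signature is optional, readers like read_zipfile_from_stream have to cope with both the 12-byte and 16-byte variants, which is part of why streaming support leaks into the API.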

I'm undecided... Thoughts?

@Plecra
Member Author

Plecra commented Aug 24, 2020

Ooh... I've run into a real problem with the ergonomics of the file API.

So, I'd like to provide something along the lines of the API in the original post:

impl<S: io::Write> Archive<S> {
    pub fn insert(&mut self, path: PathBuf) -> io::Result<File<&mut S>>;
}

io::copy(&mut file, &mut archive.insert(file.path())?)?;

But this API is impossible to implement. At least, without a panicking branch in Drop for File, which we definitely don't want. The problem is that once we are done writing (after io::copy), we need to write the size of the file to S. Whether we call Seek::seek or Write::write, we'll have an io::Result to handle.

The user can "fix" the issue like so:

let mut zip_file = archive.insert(file.path())?;
io::copy(&mut file, &mut zip_file)?;
zip_file.finalize()?; // consumes `zip_file`

...once they've already experienced a crash because they wrote their code the easy, incorrect way.

Tbh I'm stuck on this one. io::Write doesn't signal EOF, and we can't prevent Drop at compile time. Afaict, the best we've got is a Warnings section in the documentation

Aha! Thinking back to the current design, we could have the Archive type do some bookkeeping to ensure the file was left in a reasonable state when it's next written to. This will have a small performance impact, and depends on our answer to #184. It'd be for the best imo
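To make the bookkeeping idea concrete, here's a toy sketch (every name and the "trailer" format are made up; a real entry would end with a data descriptor, not a single byte). The archive remembers whether an entry is still open and finishes it lazily before starting the next one, so a dropped handle can't leave the stream broken and the deferred io::Result surfaces at the next call instead of panicking in Drop:

```rust
use std::io::{self, Write};

struct Archive<W: Write> {
    sink: W,
    entry_open: bool, // the bookkeeping bit
}

impl<W: Write> Archive<W> {
    fn new(sink: W) -> Self {
        Archive { sink, entry_open: false }
    }

    // Simplified: takes the contents directly instead of handing out a writer.
    fn insert(&mut self, name: &str, contents: &[u8]) -> io::Result<()> {
        if self.entry_open {
            // Finish the previous entry lazily; any deferred error is
            // reported here rather than in Drop.
            self.finish_entry()?;
        }
        self.sink.write_all(name.as_bytes())?;
        self.sink.write_all(contents)?;
        self.entry_open = true;
        Ok(())
    }

    fn finish_entry(&mut self) -> io::Result<()> {
        self.sink.write_all(b"\0")?; // stand-in for the real trailer
        self.entry_open = false;
        Ok(())
    }
}
```

The per-call overhead is one branch, which is the "small performance impact" mentioned above.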

@Plecra
Member Author

Plecra commented Aug 25, 2020

@rylev Do you have an opinion on #156 (comment) ?

@rylev
Contributor

rylev commented Aug 26, 2020

@Plecra what's the reason for allowing separate writes for both a path and the contents? Wouldn't it make more sense to force the user to provide the path and the contents at the same time?

@Plecra
Member Author

Plecra commented Aug 26, 2020

I'm not sure I understand. Do you mean an API like Archive::insert(path: PathBuf, file_contents: Vec<u8>)? We want to write to an io::Write, so we have to give the user a type that implements it

@rylev
Contributor

rylev commented Aug 26, 2020

I didn't necessarily mean taking an owned buffer, but yes in general, I meant taking a concrete type that represents the contents that we want to write. I understand that allowing writing to an io::Write is much more flexible, but I'm not sure if that flexibility is worth the awkwardness of the API if this is planned to be the primary way to write an archive.

@Plecra
Member Author

Plecra commented Aug 26, 2020

What part of the API do you consider awkward? It's about the same as fs::File at this point.

Since we'll be writing to a generic writer anyway, it seems reasonable to reflect that API ourselves. I mean, we write the file contents through io::Write even with the current design.

@Plecra
Member Author

Plecra commented Feb 1, 2023

This is all still relevant, but will continue development elsewhere (currently in 0.7/0.10 branches)

@Plecra Plecra closed this as completed Feb 1, 2023