This repository has been archived by the owner on Jun 2, 2024. It is now read-only.

WIP: API Design #156

Closed
Plecra opened this issue Jun 21, 2020 · 14 comments

Comments

@Plecra
Member

Plecra commented Jun 21, 2020

After sifting through the specification and the issues left in the repository, I've mocked up what I think would be a good API to work towards. It's incomplete, and I mostly bring it up as food for better ideas. Anyway, goals...

  • Lossless modification: We want to be able to reliably modify archives we can't fully parse
  • In-place modification: Interpreting and editing the contents of an existing ZIP file
    • It's what the format was originally designed to be good at
    • ...but also less useful for today's purposes. I just think this would be fun to pull off.
  • It would give many uses a significant speed boost, since we'd need to make fewer reads and writes
  • Guiding users towards correct implementations
    • There are lots of warts in the spec, and we should let as few of them leak through our API as possible
    • These include text encoding, the underspecified encryption, time handling, and whatever else I've missed
  • Descriptive documentation!
    • This is something that we can address in the shorter term, no need to wait for a new API
mod zip {
    use std::{fmt, fs, io, path, time};
    
    pub struct Archive<S> {
        inner: S
    }
    impl<S: io::Read + io::Seek> Archive<S> {
        pub fn open(source: S) -> io::Result<Archive<S>>;
    }
    impl<S> Archive<S> {
        pub fn get(&mut self, path: impl Eq<PathBuf>) -> Option<File<S>>; 
        pub fn find(&mut self, predicate: impl FnMut(&File<S>) -> bool) -> Option<File<S>>;
    }
    impl<S: io::Write> Archive<S> {
        pub fn create(sink: S) -> Archive<S>;
        pub fn insert(&mut self, path: PathBuf) -> FileMut<S>;
    }
    pub enum CompressionMethod {
        Store,
        Shrink,
        Reduce1,
        Reduce2,
        Reduce3,
        Reduce4,
        Implode,
        Deflate,
        Deflate64,
        PkwareImplode,
        Bzip2,
        Lzma,
        IbmZosCmpsc,
        IbmTerse,
        Lz77,
        Zstd,
        Mp3,
        Xz,
        Jpeg,
        WavPack,
        PPMd,
        
        /// The data is stored in a proprietary PKWARE format
        Reserved,
    }
    /// Text from a ZIP file
    ///
    /// While ZIP files should contain either CP437 or UTF-8 encoded text, many
    /// tools do not properly convert their file names, or sometimes even
    /// comments.
    ///
    /// To allow lossless modification of ZIP files, this can be treated as an
    /// opaque type and placed into a new archive.
    ///
    /// Otherwise, you can convert the contents to UTF-8 with
    /// [`String::from`]. Keep in mind that the text may still have been
    /// encoded incorrectly.
    pub struct Text;
    pub struct PathBuf;
    pub struct Attributes;
    impl Attributes {
        pub fn readonly(&self) -> bool;
        pub fn from_unix_mode(mode: u32) -> Self;
    }
    impl Eq<PathBuf> for &str {}
    impl From<fs::Permissions> for Attributes {}
    impl From<Text> for String {}
    impl From<PathBuf> for path::PathBuf {}
    impl fmt::Display for Text {}
    impl fmt::Debug for Text {}
    impl Text {
        /// Yields a [`&str`] if the text can be interpreted as UTF-8
        pub fn to_str(&self) -> Option<&str>;
    }
    impl<S> File<'_, S> {
        pub fn path(&self) -> &PathBuf;
        pub fn comment(&self) -> &Text;
        pub fn last_modified(&self) -> &time::SystemTime;
        pub fn attributes(&self) -> &Attributes;
    }
    impl<S: io::Write> File<'_, S> {
        pub fn set_path(&mut self, path: PathBuf);
        pub fn set_comment(&mut self, text: Text);
        pub fn set_last_modified(&mut self, time: time::SystemTime);
        pub fn set_attributes(&self, attributes: Attributes);
    }
}

Examples

Normalization

Feeding files through the library lets it strip unnecessary data from the archive, and recompress the files where appropriate. This also gives it a chance to lower the archive's "version needed to extract" field.

fn Archive::rebuild<W: io::Write>(&mut self, sink: W, filter: impl FnMut(&File) -> bool);

// Bundle all the files added to `orig.zip` in the last hour.
// This will ensure the archive is created with the minimum features necessary
// to read it.
zip::Archive::open(File::open("orig.zip")?)?
    .rebuild(File::create("new.zip")?, |file| {
        file
            .last_modified()
            .elapsed()
            .map(|dur| dur < Duration::from_secs(60 * 60))
            .unwrap_or(false)
    })?
    .finish()?
Cons
  • Can't modify the files at all

Edit: I've created the 0.7 branch to explore the API redesign. We can make proposals as PRs to it

@Plecra
Member Author

Plecra commented Jun 23, 2020

There's also the possibility of supporting concurrent/parallel archives. We'd need to use custom traits in place of io::{Read, Write} for the concurrent handling. It's harder to come up with a design for that in advance, but here are some things we'll want to expose:

  • Infallible reads for data that supports it (Vec<u8>)
  • Decompression libraries require io traits
  • Read handle duplication (File::try_clone & Cursor::clone)
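To make the first and third points concrete, here's a minimal sketch of what such a custom trait could look like. The name `ReadAt` and its shape are my own assumptions, not anything from the branch; the idea is that taking `&self` instead of `&mut self` (in the spirit of pread(2) and `File::try_clone`) is what makes concurrent readers possible, and in-memory sources can implement it without ever failing.

```rust
use std::io;

// Hypothetical trait name; a stand-in for whatever would replace
// `io::Read + io::Seek`. Taking `&self` rather than `&mut self` means
// multiple threads can read from the same handle concurrently.
pub trait ReadAt {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize>;
}

// In-memory sources can satisfy it infallibly (the `Vec<u8>` case above).
impl ReadAt for Vec<u8> {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        let start = (offset as usize).min(self.len());
        let n = buf.len().min(self.len() - start);
        buf[..n].copy_from_slice(&self[start..start + n]);
        Ok(n)
    }
}
```

An adapter implementing `io::Read` on top of a `(ReadAt, position)` pair would cover the second point, since decompression libraries only need a sequential view of one file's data.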

@rylev
Contributor

rylev commented Jun 23, 2020

I would also like us to address #147

@Plecra
Member Author

Plecra commented Jun 23, 2020

There's another option which might be able to give the user more control: If we could just emit owned Headers when reading the central directory, the user would be able to handle all the parallelisation issues. It's trickier to isolate the library's responsibilities in this model, but I think it could be a lot more powerful.

let source = File::open("orig.zip")?;
let mut out = zip::Archive::create(File::create("new.zip")?);
zip::headers(source.try_clone()?)
    .par_iter()
    .try_for_each_init(
        || source.try_clone().unwrap(),
        |source, r| r.and_then(|header| match header.last_modified().elapsed() {
            // the file was last modified in the last hour
            Ok(dur) if dur < Duration::from_secs(60 * 60) => header.copy(source, out),
            // This `copy` method would have to be magic to actually work
            _ => Ok(()),
        }),
    )?;

EDIT: I believe this addresses the concerns brought up in #147 - is there any particular use case you'd like to improve?

@rylev
Contributor

rylev commented Jun 23, 2020

I'm worried that this API, while powerful, will be tough to use.

@Plecra
Member Author

Plecra commented Jun 26, 2020

We should be able to draw from #91 to improve the parser's startup time (it should be constant-time for archives without comments)

I don't think there's any way to avoid parsing the file entries in order, since the names and comments have variable lengths, but I imagine it's common not to need to parse the whole central directory
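The constant-time case falls out of the End Of Central Directory record's layout: it's exactly 22 bytes when the archive comment is empty, so a reader can try `len - 22` first and only fall back to a backwards scan when a comment pushes the record earlier. A sketch over an in-memory buffer (my own illustration, not code from the branch):

```rust
// EOCD signature 0x06054b50, stored little-endian.
const EOCD_SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
const EOCD_MIN_LEN: usize = 22;

fn find_eocd(buf: &[u8]) -> Option<usize> {
    if buf.len() < EOCD_MIN_LEN {
        return None;
    }
    // Fast path: no archive comment, so the record sits at the very end.
    let guess = buf.len() - EOCD_MIN_LEN;
    if buf[guess..guess + 4] == EOCD_SIG {
        return Some(guess);
    }
    // Slow path: scan backwards through the (up to 65535-byte) comment.
    (0..guess).rev().find(|&i| buf[i..i + 4] == EOCD_SIG)
}
```

A real implementation would also validate that the comment-length field inside the record matches the distance to the end of the file, since the signature bytes can appear by coincidence inside the comment.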

@Plecra
Member Author

Plecra commented Jul 2, 2020

Time

When working with time, we'd ideally be able to state the moment in time at which something happened. In the case of ZIP files, we want to know when a file was added to the archive.

Unfortunately, ZIP files' last mod date/time fields are... not very helpful in achieving that. In the MS-DOS datetime format, they can only identify every other second between 1980 and 2108. More importantly, the time can be in any time zone, rendering it effectively meaningless.
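For reference, this is all the information the MS-DOS fields carry: the time packs hours/minutes/seconds into one u16 (with seconds halved, hence the two-second resolution), the date packs the year as an offset from 1980 into another, and there is no time zone field at all. A decoding sketch:

```rust
// MS-DOS date/time layout:
//   time: bits 15-11 hours, bits 10-5 minutes, bits 4-0 seconds / 2
//   date: bits 15-9 years since 1980, bits 8-5 month, bits 4-0 day
// Returns (year, month, day, hour, minute, second).
fn decode_dos_datetime(date: u16, time: u16) -> (u16, u8, u8, u8, u8, u8) {
    let year = 1980 + (date >> 9); // 7 bits of year offset: 1980..=2107
    let month = ((date >> 5) & 0x0f) as u8;
    let day = (date & 0x1f) as u8;
    let hour = (time >> 11) as u8;
    let minute = ((time >> 5) & 0x3f) as u8;
    let second = ((time & 0x1f) * 2) as u8; // only even seconds representable
    (year, month, day, hour, minute, second)
}
```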

There are a few ZIP extensions which provide alternatives:

It's possible for the parser to encounter any combination of these, so we'll need to be able to resolve conflicts:

  • What to do when the timestamps disagree?
  • What to do when the last mod time is in another time zone? (indistinguishable from above)
  • An API for the user to assert the time zone of the unspecified dates

@Plecra
Member Author

Plecra commented Aug 24, 2020

io::Seek writers

See #114

When writing a ZIP file, the only reason to seek through the output is to update the crc-32, compressed size, and uncompressed size fields in the local header, and even that is entirely optional:

     Bit 3: If this bit is set, the fields crc-32, compressed 
           size and uncompressed size are set to zero in the 
           local header.  The correct values are put in the 
           data descriptor immediately following the compressed
           data.  (Note: PKZIP version 2.04g for DOS only 
           recognizes this bit for method 8 compression, newer 
           versions of PKZIP recognize this bit for any 
           compression method.)

#114 handles this using a runtime flag set using new_streaming. It works, but brings a type-level abstraction down to runtime, and requires the user to explicitly use the right constructor.

The specialization feature would let us implement this completely transparently, just fixing the header when io::Seek is available. While this makes the API cleaner, it's quite opaque for users who require this behaviour. For example, read_zipfile_from_stream is currently implemented using these fields.
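For the streaming case, the data descriptor that replaces those three fields is a simple fixed layout. A serialization sketch (the signature is technically optional per APPNOTE 4.3.9 but commonly emitted; this is the 32-bit, non-ZIP64 form):

```rust
// Data descriptor: optional signature, then crc-32, compressed size,
// uncompressed size, all little-endian u32s.
fn data_descriptor(crc32: u32, compressed: u32, uncompressed: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&0x08074b50u32.to_le_bytes()); // descriptor signature
    out.extend_from_slice(&crc32.to_le_bytes());
    out.extend_from_slice(&compressed.to_le_bytes());
    out.extend_from_slice(&uncompressed.to_le_bytes());
    out
}
```

Because the signature is optional, readers like read_zipfile_from_stream have to cope with both the 12-byte and 16-byte variants, which is part of why streaming support leaks into the API.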

I'm undecided... Thoughts?

@Plecra
Member Author

Plecra commented Aug 24, 2020

Ooh... I've run into a real problem with the ergonomics of the file API.

So, I'd like to provide something along the lines of the API in the original post:

impl<S: io::Write> Archive<S> {
    pub fn insert(&mut self, path: PathBuf) -> io::Result<File<&mut S>>;
}

io::copy(&mut file, &mut archive.insert(file.path())?)?;

But this API is impossible to implement. At least, without a panicking branch in Drop for File, which we definitely don't want. The problem is that once we are done writing (after io::copy), we need to write the size of the file to S. Whether we call Seek::seek or Write::write, we'll have an io::Result to handle.

The user can "fix" the issue like so:

let mut zip_file = archive.insert(file.path())?;
io::copy(&mut file, &mut zip_file)?;
zip_file.finalize()?; // consumes `zip_file`

...once they've already experienced a crash because they wrote their code the easy, incorrect way.

Tbh I'm stuck on this one. io::Write doesn't signal EOF, and we can't prevent Drop at compile time. Afaict, the best we've got is a Warnings section in the documentation

Aha! Thinking back to the current design, we could have the Archive type do some bookkeeping to ensure the file was left in a reasonable state when it's next written to. This will have a small performance impact, and depends on our answer to #184. It'd be for the best imo
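To make the bookkeeping idea concrete, here's a toy sketch (every name and the "trailer" format are made up; a real entry would end with a data descriptor, not a single byte). The archive remembers whether an entry is still open and finishes it lazily before starting the next one, so a dropped handle can't leave the stream broken and the deferred io::Result surfaces at the next call instead of panicking in Drop:

```rust
use std::io::{self, Write};

struct Archive<W: Write> {
    sink: W,
    entry_open: bool, // the bookkeeping bit
}

impl<W: Write> Archive<W> {
    fn new(sink: W) -> Self {
        Archive { sink, entry_open: false }
    }

    // Simplified: takes the contents directly instead of handing out a writer.
    fn insert(&mut self, name: &str, contents: &[u8]) -> io::Result<()> {
        if self.entry_open {
            // Finish the previous entry lazily; any deferred error is
            // reported here rather than in Drop.
            self.finish_entry()?;
        }
        self.sink.write_all(name.as_bytes())?;
        self.sink.write_all(contents)?;
        self.entry_open = true;
        Ok(())
    }

    fn finish_entry(&mut self) -> io::Result<()> {
        self.sink.write_all(b"\0")?; // stand-in for the real trailer
        self.entry_open = false;
        Ok(())
    }
}
```

The per-call overhead is one branch, which is the "small performance impact" mentioned above.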

@Plecra
Member Author

Plecra commented Aug 25, 2020

@rylev Do you have an opinion on #156 (comment) ?

@rylev
Contributor

rylev commented Aug 26, 2020

@Plecra what's the reason for allowing separate writes for both a path and the contents? Wouldn't it make more sense to force the user to provide the path and the contents at the same time?

@Plecra
Member Author

Plecra commented Aug 26, 2020

I'm not sure I understand. Do you mean an API like Archive::insert(path: PathBuf, file_contents: Vec<u8>)? We want to write to an io::Write, so we have to give the user a type that implements it

@rylev
Contributor

rylev commented Aug 26, 2020

I didn't necessarily mean taking an owned buffer, but yes in general, I meant taking a concrete type that represents the contents that we want to write. I understand that allowing writing to an io::Write is much more flexible, but I'm not sure if that flexibility is worth the awkwardness of the API if this is planned to be the primary way to write an archive.

@Plecra
Member Author

Plecra commented Aug 26, 2020

What part of the API do you consider awkward? It's about the same as fs::File at this point.

Since we'll be writing to a generic writer anyway, it seems reasonable to reflect that API ourselves. I mean, we write the file contents through io::Write even with the current design.

@Plecra
Member Author

Plecra commented Feb 1, 2023

This is all still relevant, but will continue development elsewhere (currently in 0.7/0.10 branches)

@Plecra Plecra closed this as completed Feb 1, 2023