Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More flexibility in zipfile write interface #70227

Closed
takluyver mannequin opened this issue Jan 7, 2016 · 58 comments
Closed

More flexibility in zipfile write interface #70227

takluyver mannequin opened this issue Jan 7, 2016 · 58 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@takluyver
Copy link
Mannequin

takluyver mannequin commented Jan 7, 2016

BPO 26039
Nosy @takluyver, @vadmium, @serhiy-storchaka, @Carreau
Files
  • zipfile-flexibility.patch
  • zipfile-open-w.patch
  • zipinfo-from-file.patch: ZipInfo.from_file() classmethod
  • zipinfo-from-file2.patch: ZipInfo.from_file() + docs + tests
  • zipfile-open-w2.patch: zf.open(... mode='w') + docs + tests
  • zipfile-open-w3.patch: zf.open(... mode='w') after vadmium's review
  • zipinfo-from-file3.patch: ZipInfo.from_file() after vadmium's review
  • zipinfo-from-file4.patch: ZipInfo.from_file() after storchaka's review
  • zipinfo-from-file5.patch: ZipInfo.from_file() after storchaka's 2nd review
  • zipfile-open-w4.patch: zf.open(... mode='w') after vadmium's 2nd review
  • socketpair.py: example of monkey-patching
  • zipfile-open-w5.patch
  • zipfile-open-w6.patch
  • zipfile-open-w7.patch
  • zipfile-open-w8.patch
  • zipfile-open-w9.patch
  • zipfile-open-w-exceptions.patch
  • zipfile-flex-bonus.patch
  • zipfile-flex-bonus2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-05-15.09:28:40.190>
    created_at = <Date 2016-01-07.15:10:51.849>
    labels = ['type-feature', 'library']
    title = 'More flexibility in zipfile write interface'
    updated_at = <Date 2016-05-15.17:27:05.746>
    user = 'https://github.com/takluyver'

    bugs.python.org fields:

    activity = <Date 2016-05-15.17:27:05.746>
    actor = 'takluyver'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-05-15.09:28:40.190>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2016-01-07.15:10:51.849>
    creator = 'takluyver'
    dependencies = []
    files = ['41525', '41626', '41637', '41722', '41726', '41752', '41753', '41754', '41758', '41759', '42311', '42375', '42588', '42597', '42598', '42819', '42850', '42851', '42855']
    hgrepos = []
    issue_num = 26039
    keywords = ['patch']
    message_count = 58.0
    messages = ['257694', '257696', '257697', '258313', '258459', '258949', '258958', '258959', '259022', '259184', '259206', '259209', '259255', '259258', '259553', '259805', '259806', '259833', '260305', '260733', '260734', '260746', '260808', '260810', '260852', '261162', '261168', '261328', '261332', '262534', '262537', '262543', '262557', '262926', '264062', '264070', '264169', '264205', '264242', '264248', '264720', '265232', '265330', '265386', '265388', '265459', '265461', '265532', '265540', '265541', '265544', '265546', '265559', '265580', '265592', '265603', '265605', '265635']
    nosy_count = 6.0
    nosy_names = ['python-dev', 'takluyver', 'martin.panter', 'serhiy.storchaka', 'mbussonn', 'Dhiraj_Mishra']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue26039'
    versions = ['Python 3.6']

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 7, 2016

    I was working recently on some code writing zip files, and needed a bit more flexibility than the interface of zipfile provides. To do what I wanted, I ended up copying the entirety of ZipFile.write() into my own code with modifications. That's ugly, and to make it worse, the code I copied from the Python 3.4 stdlib didn't work properly with Python 3.5.

    Specifically, I want to:

    • Override the timestamp, which write() unconditionally takes from the mtime of the file.
    • Do extra processing on the data (e.g. calculating a checksum) as it is read. Being able to pass a file-like object in to be read, rather than just a filename, would work for this.

    I could do both by using ZipFile.writestr(), but then the entire file must be loaded into memory at once, which I would much rather avoid.

    The patch attached is one proposal which would make it easier to do what I want, but it's intended as a starting point for discussion. I'm not particularly attached to the API.

    • Should this be a new method, or new functionality for either the write or writestr method? I favour a new method, because the existing methods are already quite long, and I think it's nicer to break write() up into two parts like this.

    • If a new method, what should it be called? I opted for writefile()

    • What should its signature be? It's currently writefile(file_obj, zinfo, force_zip64=False), but I can see an argument for aligning it with writestr(zinfo_or_arcname, data, compress_type=None).

    • What to do about ZIP64: the code has to decide whether to use ZIP64 format before writing, because it affects the header size, but we may not know the length before we have read it all. I've used a force_zip64 boolean parameter, and documented that people should pass it True if a file of unknown size could exceed 2 GiB.

    • Are there other points where it could be made more flexible while we're thinking about this? I'd quite like to split out the code for writing a directory entry to a separate method ('writedir'?). I'd also like to allow datetime objects to be passed in for timestamps as well as the 6-tuples currently expected.

    @takluyver takluyver mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jan 7, 2016
    @Carreau
    Copy link
    Mannequin

    Carreau mannequin commented Jan 7, 2016

    from the last blog post of Brett Cannon, I would say that he might be interested in Participating to this discussion (Rewrite zipimport from scratch section)

    Though I'm not sure about the etiquette to ping someone on such an issue.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jan 7, 2016

    bpo-11980 has a patch that adds a method for writing a file into the ZIP archive.

    But more flexible way is to add the support of the "w" mode to ZipFile.open(). This will allow to write a data that doesn't exist in memory at once, nor in disk, but is loaded by chunks from the pipe, or from the socket, or from other archive, or just generated on fly. I'm worked on this, but this feature needed to add support of some other features. For example writing to unseekable file (already implemented) or writing and reading a file without known size (not implemented still).

    bpo-18595 has a proposition to add special methods for creating symlinks, directories etc.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 15, 2016

    Attached is a first go at a patch enabling zipfile.open(blah, mode='w')

    Files must be written sequentially, so you have to close one writing handle before opening another. If you try to open a second one before closing the first, it will raise RuntimeError. I considered doing something where it would write to temporary files and add them to the zip file when they were closed, but it seemed like a bad idea.

    You can almost certainly break this by reading from a zip file while there's an open writing handle. Resolving this is tricky because there's a disconnect in the requirements for reading and writing: writing allows for a non-seekable output stream, but reading assumes that you can seek freely. The simplest fix is to block reading while there is an open file handle. I don't think many people will need to read one file from a zip while writing another, anyway.

    I have used the lock, but I haven't thought carefully about thread safety, so that should still be checked carefully.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 17, 2016

    zipinfo-from-file.patch has an orthogonal but related change: the code in ZipFile.write() to construct a ZipInfo object from a filesystem file is pulled out to a classmethod ZipInfo.from_file().

    Together, these changes make it much easier to control how a file is written to a zip file, like this:

    zi = ZipInfo.from_file(blah)
    # ... manipulate zi...
    with open(blah, 'rb') as src, zf.open(zi, 'w') as dest:
        # copy of the read/write loop - maybe this should be
        # pulled out separately as well?

    If these changes make it in, I might put a backported copy of the module on PyPI so I can start using it without waiting for Python 3.6.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 26, 2016

    Serhiy, any chance you'd have some time to review my patch(es)? Or is there someone else interested in zipfile I might interest? Thanks :-)

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jan 26, 2016

    Your patches are great Thomas! This is just what I want to implement. It will take some time to make a careful review. Besides possible corrections I think these features will be added in 3.6. But new features need tests and documenting.

    @serhiy-storchaka serhiy-storchaka self-assigned this Jan 26, 2016
    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 26, 2016

    Thanks! I will work on docs and tests.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 27, 2016

    The '2' versions of the two different patches include some docs and tests for these new features.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 29, 2016

    I left some review comments.

    Since the rules for read-only and write-only access are so different, would it make more sense to have a separate method? Maybe your writefile() name, or open_write() or something? I am not too fussed, but unless there is a chance of being able to open a read-write random access file, I find these split-personality open() methods a bit “un-Pythonic”. On the other hand, I guess it is superficially consistent with other open() functions.

    Also, perhaps if you guaranteed the write-only option returned a file-like object, you could use shutil.copyfileobj() rather than a custom read-write loop.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 29, 2016

    Here's a new version of the zf.open() patch following Martin's review (thanks Martin!).

    I agree that it feels a bit awkward having two completely different actions for zf.open(), but it is a familiar interface, and since the mode parameter is already there, it requires a minimum of new public API. But I'm happy to add a new method like open_write() or write_handle() if people prefer that.

    The comments on the other patch are minimal, I'll put a new version of that together as well.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 29, 2016

    Thanks Serhiy for review comments.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 30, 2016

    Updated version of the ZipInfo.from_file() patch attached.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Jan 30, 2016

    Thanks Martin for more review; updated open-to-write patch attached.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Feb 4, 2016

    Is there anything more I should be doing with either of these patches? I think I've incorporated all review comments I've seen. Thanks!

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 7, 2016

    New changeset 7fea2cebc604 by Serhiy Storchaka in branch 'default':
    Issue bpo-26039: Added zipfile.ZipInfo.from_file() and zipinfo.ZipInfo.is_dir().
    https://hg.python.org/cpython/rev/7fea2cebc604

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Feb 7, 2016

    Committed zipinfo-from-file5.patch. Now I'm starting to review zipfile-open-w4.patch (I concurred with most Martin's comments for previous patches).

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Feb 8, 2016

    Thanks Serhiy! I'll keep an eye out for comments on the other patch.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Feb 15, 2016

    Hi Serhiy, any more comments on the zf.open() patch?

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Feb 23, 2016

    Ping! ;-)

    @DhirajMishra
    Copy link
    Mannequin

    DhirajMishra mannequin commented Feb 23, 2016

    Please ha Look on bpo-11980

    http://bugs.python.org/issue11980
    Already have been Patched

    @vadmium
    Copy link
    Member

    vadmium commented Feb 23, 2016

    Thanks for the pointer Dhiraj. I prefer the open(mode="w") version proposed here, as being more flexible. This way you could wrap the writer object in e.g. TextIOWrapper. The other patch requires passing in a file reader object.

    Having another look at zipfile-open-w4.patch, I have some thoughts about locking and the writing-while-reading restriction:

    The lock seems to be designed to serialize reads and writes (which operate on the common underlying file object). See revision 4973ccd46e32, and <https://bugs.python.org/issue14099#msg234142\>, although it would be good to document this, or at the minimum add a comment explaining the purpose and scope of the lock.

    Currently, it appears that write() and writestr() acquire the lock, so I presume it is intended that these methods can be called multiple times concurrently, and also while the zip file is being read. With the patch, writestr() still preserves the lock usage, but write() does not because it is now implemented in terms of the new open(mode="w") method.

    I think it would be good to clarify that the lock does _not_ protect concurrent writes via open(mode="w").

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Feb 24, 2016

    My initial patch would have allowed passing a readable file-like object into zipfile. I was persuaded that allowing ZipFile.open() to return a writable object was a more intuitive and flexible API.

    Concurrent writes with zf.open(mode='w') should be impossible, because it only allows one open handle at a time. It still uses the lock inside _ZipWriteFile, so concurrent writes to a single handle should be serialised.

    I would not recommend anyone try to do concurrent access to a single ZipFile object from multiple threads or coroutines. It's quite stateful, there is no mention of concurrency in the docs, and no tests I can see that try concurrent access. The only thing that might be safe is reading from two files inside the zip file (which shouldn't be changed by this), but I wouldn't want to guarantee even that.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Feb 24, 2016

    Oh, I see test_interleaved now, which does test overlapping reads of two files from the same zip file.

    Do you want that clarified in the docs - which don't currently mention the lock at all - or in a comment in the module?

    @vadmium
    Copy link
    Member

    vadmium commented Feb 25, 2016

    When you say concurrent writes should be impossible, I guess that only applies to a single-threaded program. There is no lock protecting the “self._fileRefCnt > 1” check and related manipulation (not that I am saying there should be).

    For serializing concurrent writes to a single handle, if that is intended it should be documented. If it is not intended, maybe it should be removed (my preference)?

    It would be good to wait if Serhiy can explain the purpose of the lock, seeing he was involved in adding it and probably knows a lot more about this module than I.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Mar 3, 2016

    Serhiy, have you had a chance to look at what the zf.open(mode='w') patch does with the lock?

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Mar 3, 2016

    Sorry for the delay Thomas. This is complex and important to me issue and I want to be attentive to it.

    I think we should preserve long existing behavior even if it is not documented (documenting or deprecating it is other issue). Concurrent reading and wring with concurrent reading always (at least since adding ZipFile.open()) worked, but was never documented nor tested. Concurrent writing was added rather as a side effect of bpo-14099. If there is a benefit from getting rid of it, we can break it.

    For preserving current behavior ZipFile.open(mode='w') should acquire the lock and it should be released in _ZipWriteFile.close().

    I have added other comments on Rietveld. The patch needs to be updated to resolve conflicts with committed zipinfo-from-file5.patch.

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Mar 28, 2016

    https://github.com/SpiderOak/ZipStream tries to implement streaming in one more kind.

    Also libarchive have experimental support of streaming write to zip archives in newest version.

    So problem is actual.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Apr 6, 2016

    Sorry for the delay, this fell off my radar because emails from both the bug tracker and Rietveld tend to fall foul of my spam filters, so I have to go and check.

    I have implemented Serhiy's suggestions, but there turns out to be a test (test_write_after_read) that calls writestr while a reading handle is open, and it now fails. This is absolutely expected according to the design we discussed, but if there's a test, I guess that means we need to keep that functionality working?

    In principle, the only thing that's not possible is interleaving writes to multiple files within the zip - because we don't know where to start writing the second file. We should be able to have one writer and n readers going at once, but every time I start looking into that I get mired in complexity. Maybe (hopefully) there's some critical abstraction that hasn't occurred to me.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Apr 23, 2016

    Martin, Serhiy: Am I right that it would be unacceptable to change the test, and I therefore need to find a way to allow writestr while a reading-mode handle is open?

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Apr 23, 2016

    test_write_after_read was added in bpo-14099. It doesn't test intended behavior, it tests existing behavior just in case to not change it unintentionally. It is desirable to preserve it, but if there is no simple way, may be we can change it.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Apr 25, 2016

    It wasn't as bad as I thought (hopefully that doesn't mean I've missed something). zipfile-open-w6.patch allows interleaved reads and writes so long as the underlying file object is seekable.

    You still can't have multiple write handles open (it wouldn't know where to start the second file), or read & write handles on a non-seekable stream (I don't think a non-seekable read-write file ever makes sense, except perhaps for a pty, which is really two separate streams on one file descriptor).

    Tests are passing again.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 26, 2016

    I understand Python’s zipfile module does not allow reading unless seek() is supported (this was one of Марк’s complaints). So it is irrelevant whether we are writing.

    I left a few comments, but it sounds like it is valid read and write at the same time.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Apr 26, 2016

    zipfile-open-w7 adds a test that Martin requested

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented Apr 26, 2016

    zipfile-open-w8 removes the concurrent read-write check for non-seekable files. As Martin points out, reading currently requires seeking, and a streaming read API would look very different from the current API.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 3, 2016

    Any further review comments on this?

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 10, 2016

    Ping? Once this is landed, I intend to make a backport package of it so I can start using it before Python 3.6 comes out.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented May 11, 2016

    I have added comments for tests and documentation. The implementation looks correct. But I prefer simpler solution on this stage. Here is something between zipfile-open-w5.patch and zipfile-open-w8.patch, close to zipfile-open-w5.patch with fixed issues.

    • Writer doesn't use look and seeking.
    • Writing while there are open read handlers is permitted.
    • Reading while there is open write handler is forbidden.

    This works in all cases when current code works. The ability to read while there is open write handler is separate new feature (and I don't know whether there is a need in it).

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 12, 2016

    Thanks Serhiy. I'm quite happy with that set of constraints. I had a look over your patch set, and just spotted one thing in writestr that I have overlooked all along.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented May 12, 2016

    Thank you for your contribution Thomas. I'll push the patch with fixed writestr in short time unless Martin left comments.

    @vadmium
    Copy link
    Member

    vadmium commented May 13, 2016

    Yes this looks okay to me, so go ahead.

    My only concern was other methods affecting the file position while in the middle of writing, but it looks like those methods maintain separate objects like self.filelist, so there is no conflict.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 13, 2016

    New changeset 175c3f1d0256 by Serhiy Storchaka in branch 'default':
    Issue bpo-26039: zipfile.ZipFile.open() can now be used to write data into a ZIP
    https://hg.python.org/cpython/rev/175c3f1d0256

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 14, 2016

    Thanks Serhiy! I am marking this as fixed.

    @takluyver takluyver mannequin closed this as completed May 14, 2016
    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented May 14, 2016

    Let me know when you create a backport package Thomas. I'd like to help.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented May 14, 2016

    There are yet few issues.

    1. Currently RuntimeError is widely used in zipfile. But in most cases ValueError would be more appropriate since it programming error (for example ValueError is raised when try to read or write to closed file). See bpo-24693, but this is backward incompatible change. For new exceptions we are free to use any exception type without breaking backward compatibility. It looks to me, that ValueError is more appropriate is these cases than RuntimeError. What do you think about this Thomas? Here is a patch that changes exceptions types.

    2. ZipInfo.is_dir() is not documented and lacks a docstring.

    3. It would be better to make force_zip64 a keyword-only parameter. This would allow to add new positional parameters without breaking compatibility.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 14, 2016

    1. For the cases like read/write handles conflicting, ValueError doesn't seem quite right to me - I take ValueError to mean that you have passed an invalid value to a function. RuntimeError feels like the closest fit for 'the current state does not allow this operation'. It is rather broad, though.

    If I was writing it anew, I would use ValueError for something like an invalid mode parameter: zf.open('foo', mode='q') . That particular case was already in the code, though, so I think backwards compatibility should take precedence.

    2 & 3. Agreed, I'll prepare a patch

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 14, 2016

    Patch attached for points 2 and 3

    @vadmium
    Copy link
    Member

    vadmium commented May 14, 2016

    The bonus patch looks okay, although I wonder if the directory slash (/) information should be in the RST rather than doc string. Usually the RST has all the details, and doc strings are just summaries.

    Regarding exceptions, I can sympathise with both sides of the argument and don’t have a strong opinion (why does the exception type matter for programmer errors anyway?). But I think it might be better to be locally consistent within the zipfile module, and the module is already heavily documented with RuntimeError for similar programmer errors.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented May 15, 2016

    Agreed that the docstring should be shorten version of the documentation and not vice versa. The directory slash information is implementation detail. I'm not sure it is needed in the documentation.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 15, 2016

    Updated version of the patch with the detail moved to rst.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 15, 2016

    New changeset 29b470b8ab77 by Serhiy Storchaka in branch 'default':
    Issue bpo-26039: Document ZipInfo.is_dir() and make force_zip64 keyword-only.
    https://hg.python.org/cpython/rev/29b470b8ab77

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented May 15, 2016

    LGTM. Thank you Thomas.

    @takluyver
    Copy link
    Mannequin Author

    takluyver mannequin commented May 15, 2016

    The backported package now exists on PyPI as zipfile36. It's passing the tests on 3.4 - I haven't tested further back. I'm trying out Gitlab for hosting it:
    https://gitlab.com/takluyver/zipfile36

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants