Always default to .tar.gz sdists on any platform #748

Closed
wants to merge 1 commit into
base: master
from

4 participants
@dstufft
Member

dstufft commented Aug 19, 2016

It doesn't make much sense to have different sdist defaults for different platforms; instead, we'll always use .tar.gz on every platform for greater consistency between platforms.

Always default to .tar.gz sdists on any platform
It doesn't make much sense to have different sdist defaults for different
platforms, instead we'll always use .tar.gz on every platform for greater
consistency between platforms.
@dstufft

Member

dstufft commented Aug 19, 2016

Note: This PR stems from my desire to have PyPI (eventually) start rejecting all sdists except those that end in .tar.gz.

@anthrotype

Contributor

anthrotype commented Aug 19, 2016

dumb question: why not all zip instead?
Windows can't natively open .tar.gz.

(I know Python can decompress those, but you won't be able to, say, double-click on the archive as with a regular .zip file)

@dstufft

Member

dstufft commented Aug 19, 2016

Currently on PyPI there are 444,338 .tar.gz files uploaded and 58,774 .zip files uploaded, making .tar.gz on PyPI something like 7.6x more popular than .zip. In addition, Windows users have an easier time upgrading Python and setuptools since their OS doesn't ship with either. Non-Windows users tend to get their Python and setuptools from their OS, which makes it more difficult to upgrade without upgrading the entire OS. That means changing Windows's default is more likely to propagate out faster.
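For reference, the ratio quoted above can be checked with a few lines of Python. This is only arithmetic on the two counts given in the comment; the variable names are illustrative and not taken from any PyPI tooling.

```python
# Upload counts as quoted in the comment above.
tar_gz_count = 444_338
zip_count = 58_774

# How many .tar.gz uploads exist per .zip upload.
ratio = tar_gz_count / zip_count

# Share of .tar.gz among the two formats combined.
tar_gz_share = tar_gz_count / (tar_gz_count + zip_count)

print(f"ratio: {ratio:.1f}x, .tar.gz share: {tar_gz_share:.1%}")
```

The ratio rounds to 7.6, matching the figure in the comment, and .tar.gz accounts for roughly 88% of the two formats combined.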

@anthrotype

Contributor

anthrotype commented Aug 19, 2016

My point was just that zip can be decompressed on any OS, including Windows or obscure Linux distros, without needing to install extra software, which is not the case with tar.gz on Windows.

@dstufft

Member

dstufft commented Aug 19, 2016

Sure, and if we were starting from nothing that would probably be enough to push .zip to the "winner", but given we're dealing with an existing ecosystem and .tar.gz is by far the most popular option, there's less disruption in going with the flow. Very few people actually download these files by hand, as per my download stats from PyPI, so the vast bulk of cases will be automated tooling, which can already handle .tar.gz files fine. If someone does want to download them by hand, well, they're likely to still need something to unpack .tar.gz for most projects anyway, since most files on PyPI are already .tar.gz (and Python itself can unpack .tar.gz).

@anthrotype

Contributor

anthrotype commented Aug 19, 2016

OK, I see. Thanks for your reply.
I was curious about this because I've recently added a setup.cfg file to default sdist to formats=zip, because I thought the latter would be more "portable" or Windows-friendly (even though the CI where the sdist is created is running Ubuntu).
I might be wrong.

@jaraco

Member

jaraco commented Aug 19, 2016

> for greater consistency between platforms

+1 for consistency

> Currently on PyPI there are 444,338 .tar.gz files uploaded and 58,774 .zip files uploaded.

I don't think that necessarily speaks to the popularity of the format as much as the popularity of the operating system on which the package was produced.

> changing Windows's default is more likely to propagate out faster.

Good point.

I do prefer tar.gz files as they stream better through a pipe (generally speaking, a zip file must be entirely loaded into memory to be expanded, at least under Python's zipfile module).
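The streaming property mentioned above can be sketched with tarfile's stream mode (`"r|gz"`), which reads the archive strictly front to back and never seeks. This is a minimal sketch, not code from the thread: `ReadOnlyPipe`, `build_archive`, and `read_streamed` are hypothetical helpers, and `ReadOnlyPipe` merely stands in for a real pipe or socket.

```python
import io
import tarfile


def build_archive():
    """Create a small .tar.gz in memory to feed through the 'pipe'."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tgz:
        data = b"hello"
        info = tarfile.TarInfo("pkg/hello.txt")
        info.size = len(data)
        tgz.addfile(info, io.BytesIO(data))
    return buf.getvalue()


class ReadOnlyPipe:
    """Stand-in for a pipe or socket: offers read() only, no seek()/tell()."""

    def __init__(self, payload):
        self._buf = io.BytesIO(payload)

    def read(self, size=-1):
        return self._buf.read(size)


def read_streamed(stream):
    """Read every regular file from a gzip-compressed tar stream, in order."""
    contents = {}
    # "r|gz" selects stream mode: members must be consumed sequentially.
    with tarfile.open(fileobj=stream, mode="r|gz") as tgz:
        for member in tgz:
            if member.isfile():
                contents[member.name] = tgz.extractfile(member).read()
    return contents


streamed = read_streamed(ReadOnlyPipe(build_archive()))
print(streamed)
```

The same pattern should work with `sys.stdin.buffer` as the stream, which is how a tarball piped in from another process would typically be read.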

There are other arguments in favor of zip though:

  • The format explicitly disavows knowledge of permissions, laying it at the feet of the extractor (or the umask) to set the permissions (when relevant). When it comes to sdists, this constraint is liberating.
  • zip files are easier to construct and extract in a Python process. Just the other week, I tried constructing a .tar.gz sdist in memory, but failed and punted, instead creating the sdist on my file system and dumping the file to a python literal string. I couldn't find an example of doing this anywhere. If .tar.gz is to be the preferred format, this use case needs to be solved.
  • Python has built-in support for runnable zip files and importing modules and packages from zip files (but not .tar.gz files), which while unrelated to packaging sdists, speaks to the establishment of that format.
  • Setuptools' own bootstrap module relies on zip for bootstrapping. That format wasn't chosen capriciously, but was switched from .tar.gz because there were issues with the latter (though I don't remember what).
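The point about importing from zip files can be illustrated with a short sketch. It relies only on the standard library's zip import support; the archive name `bundle.zip` and the module name `zipped_demo` are made up for the example.

```python
import importlib
import os
import sys
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "bundle.zip")

    # Write a one-line module straight into a zip archive.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("zipped_demo.py", "GREETING = 'imported from a zip'\n")

    # Putting the archive itself on sys.path makes its contents importable.
    sys.path.insert(0, archive)
    try:
        module = importlib.import_module("zipped_demo")
        greeting = module.GREETING
    finally:
        sys.path.remove(archive)

print(greeting)
```

There is no equivalent path-entry support for .tar.gz archives in the standard import system, which is the asymmetry the bullet above refers to.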

At first, I was ambivalent, but on further consideration, I feel fairly strongly that zip is the better format for sdists.

> changing Windows's default is more likely to propagate out faster.

It seems like this motivation is the primary one. What if instead, PyPI were to display a warning banner for packages publishing the undesirable format? This could instigate the individual package maintainers to use later versions of setuptools or update their project config to match the recommendation (even if they use another packaging mechanism like distutils).

@jaraco

Member

jaraco commented Aug 19, 2016

To be clear, I'm expressing my preference and concerns, but this project will follow whatever consensus is reached in PyPA.

@dstufft

Member

dstufft commented Aug 19, 2016

> > Currently on PyPI there are 444,338 .tar.gz files uploaded and 58,774 .zip files uploaded.

> I don't think that necessarily speaks to the popularity of the format as much as the popularity of the operating system on which the package was produced.

Right, I'm not saying that people are explicitly picking .tar.gz (or explicitly picking .zip) in the general case, but since there are way more .tar.gz files than .zip files, we can make a guess about the relative number of people who are going to observe some kind of difference in behavior. The fewer people who get any change in behavior, the less likely issues are to occur.

> zip files are easier to construct and extract in a Python process. Just the other week, I tried constructing a .tar.gz sdist in memory, but failed and punted, instead creating the sdist on my file system and dumping the file to a python literal string. I couldn't find an example of doing this anywhere. If .tar.gz is to be the preferred format, this use case needs to be solved.

Hmm, here's something I threw together really quickly for creating a tarfile completely in memory using only in memory files. If the files are already on the filesystem then this is even easier since you can just use tgz.add("/path/to/some/file.txt", arcname="example/file.txt") instead of needing to construct a TarInfo.

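import io
import tarfile

# In-memory buffer that will receive the compressed archive.
tarobj = io.BytesIO()

with tarfile.open(fileobj=tarobj, mode="w:gz") as tgz:
    data = b"This is an example file."

    # TarInfo carries the member's metadata; size must match the payload.
    t = tarfile.TarInfo("example/file.txt")
    t.size = len(data)

    tgz.addfile(t, io.BytesIO(data))

# The gzip stream is only complete once the with-block has closed the archive.
with open("example.tar.gz", "wb") as fp:
    fp.write(tarobj.getvalue())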

Similarly, you can extract all of the files of a .tar.gz using something like:

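import io
import tarfile

file_data = {}

tarobj = io.BytesIO(b"... Tar data goes here ...")

# mode="r" detects the compression (gzip here) transparently.
with tarfile.open(fileobj=tarobj, mode="r") as tgz:
    for member in tgz.getmembers():
        # extractfile() returns None for directories and other non-file members.
        if member.isfile():
            file_data[member.name] = tgz.extractfile(member).read()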

> > changing Windows's default is more likely to propagate out faster.

> It seems like this motivation is the primary one. What if instead, PyPI were to display a warning banner for packages publishing the undesirable format? This could instigate the individual package maintainers to use later versions of setuptools or update their project config to match the recommendation (even if they use another packaging mechanism like distutils).

It's one of the primary motivators. Whereas Windows users are generally free to upgrade their Python or setuptools installations more or less at will, Linux/macOS developers have them shipped as part of their OS, which makes it hard to upgrade either one. On macOS, for example, upgrading your setuptools currently requires disabling "rootless" (which requires a reboot and makes your computer less secure), then forcibly upgrading it, then rebooting again to re-enable rootless. On Linux, the system Python (and setuptools) integrates with the entire OS, and by upgrading them you risk breaking the entire OS.

It's also about the number of people who have to change here: the more people who experience a change, the greater the chance that the change will break something (e.g. https://xkcd.com/1172/). I believe that changing to .tar.gz is the safer alternative.

Is .zip theoretically nicer? Maybe. It has some nice properties; it also has some negative properties. For the sdist use case the two formats are largely equivalent, with little reason to pick one over the other (e.g., while zip may be better supported for decompression out of the box on Windows, a tar file is going to compress down smaller, saving more bandwidth on PyPI and getting delivered faster to end users).

Oh, and TIL that anyone who has Python 3.4+ installed on their system, has a tool to create and extract .tar.gz installed (on the command line anyways), you can run python -m tarfile -e foobar.tar.gz to extract foobar.tar.gz to your current directory.
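The command-line extraction mentioned above can be exercised end to end from Python. This is a sketch under a few assumptions: it builds a throwaway `foobar.tar.gz` in a temporary directory, invokes the current interpreter's `tarfile` module with `-e`, and passes an explicit output directory (`out`, a name chosen for the example) instead of extracting into the current directory.

```python
import io
import os
import subprocess
import sys
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "foobar.tar.gz")
    data = b"This is an example file."

    # Build a small archive to extract.
    with tarfile.open(archive, mode="w:gz") as tgz:
        info = tarfile.TarInfo("foobar/file.txt")
        info.size = len(data)
        tgz.addfile(info, io.BytesIO(data))

    dest = os.path.join(tmp, "out")
    os.makedirs(dest)

    # Equivalent to running: python -m tarfile -e foobar.tar.gz out
    result = subprocess.run(
        [sys.executable, "-m", "tarfile", "-e", archive, dest],
        capture_output=True,
        text=True,
    )

    with open(os.path.join(dest, "foobar", "file.txt"), "rb") as fp:
        extracted = fp.read()

print(result.returncode, extracted)
```

Using `sys.executable` keeps the sketch independent of whatever `python` happens to be on the PATH.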

jaraco added a commit that referenced this pull request Aug 19, 2016

@jaraco

Member

jaraco commented Aug 19, 2016

> here's something I threw together really quickly

Thanks for that. It worked like a charm. I don't know what I was doing wrong that I couldn't come up with something like that.

> Is .zip theoretically nicer? Maybe.

If zip is the nicer (optimal) format, I'd prefer we accept the greater challenge of moving to it, rather than explaining in three years why we've managed to move everyone to an inferior format.

@ncoghlan, @qwcode - any opinions on which format is optimal for sdists?

Irrespective of the format chosen, and thinking about the implementation, since this format selection actually exists in distutils, I'd rather see the sanctioned format committed to the stdlib, at which point it's straightforward to add forward compatibility in setuptools.

I think I'd go as far as essentially disabling platform-specific formats by having initialize_options set the default for formats to ['gztar'] or similar, and remove the lookup by os.name.
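The change proposed above can be sketched in plain Python without touching distutils itself. The mapping below is an assumption that reflects the per-platform defaults discussed in this thread (`gztar` on POSIX, `zip` on Windows); the function names are hypothetical and are not the distutils API.

```python
import os

# Assumed per-platform table, mirroring the behaviour discussed in the thread.
PER_PLATFORM_DEFAULT = {"posix": "gztar", "nt": "zip"}


def default_formats_by_platform(os_name=os.name):
    """Old behaviour (sketch): the default format depends on os.name."""
    return [PER_PLATFORM_DEFAULT[os_name]]


def default_formats_fixed(os_name=os.name):
    """Proposed behaviour (sketch): one default, regardless of platform."""
    return ["gztar"]


for name in ("posix", "nt"):
    print(name, default_formats_by_platform(name), default_formats_fixed(name))
```

With the fixed default, a project that wants zip sdists would still have to ask for them explicitly, for example through the `formats` option mentioned earlier in the thread.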

@dstufft

Member

dstufft commented Aug 19, 2016

> If zip is the nicer (optimal) format, I'd prefer we accept the greater challenge of moving to it, rather in three years explaining why we've managed to move everyone to an inferior format.

The thing is, I don't really think it is the optimal format; I think it greatly depends on what you're optimizing for. For instance, a .zip compresses each file individually (possibly each file differently!), which makes it great for random access (something important for things like Python's zip-import), but this means that it compresses to a larger size than something like .tar.gz, which concatenates all of the files into a single stream and then compresses that. If all of the sdists on PyPI were zip files instead of mostly .tar.gz files, then our total bandwidth would be noticeably higher.
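The size difference described above can be observed with a small experiment. This is a sketch on synthetic data: the 200 near-identical "modules" are made up and deliberately favour a single compressed stream, so the gap on real sdists will differ.

```python
import io
import tarfile
import zipfile

# Synthetic source tree: many small, similar files (an assumption for the demo).
files = {
    f"pkg/module_{i:03d}.py": (
        f'"""Module {i}."""\n\ndef value():\n    return {i}\n' * 5
    ).encode()
    for i in range(200)
}


def zip_size(members):
    """Size of a zip in which each member is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_gz_size(members):
    """Size of a tar whose concatenated stream is gzip-compressed as a whole."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tgz:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tgz.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


print("zip:   ", zip_size(files), "bytes")
print("tar.gz:", tar_gz_size(files), "bytes")
```

Because gzip sees the whole tar stream, repetition across files is compressed away, whereas the zip pays per-member headers and cannot share redundancy between members.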

IOW, my "maybe" was really saying that it greatly depends on what you're trying to optimize for. For our specific use case, it's really easy to arbitrarily pick one side or the other and argue for it on its technical merits, because on the technical side they're basically equal for sdists.

@ncoghlan

Member

ncoghlan commented Aug 20, 2016

Changing the default sdist output format in distutils for 3.6 would definitely be possible (albeit needing to be done before the first beta next month), as the standard library has tarfile and gzip support, and the shutil archive operations abstract away the format specific details for basic operations: https://docs.python.org/3/library/shutil.html#archiving-operations
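As a rough illustration of how the shutil archiving operations hide the format details, the sketch below builds the same directory as both a `gztar` and a `zip` archive and unpacks one of them again. The directory and file names (`project`, `README.txt`, `release-...`) are invented for the example.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A tiny "project" directory to archive.
    src = os.path.join(tmp, "project")
    os.makedirs(src)
    with open(os.path.join(src, "README.txt"), "w") as fp:
        fp.write("example\n")

    # Same call for both formats; only the format name changes.
    archives = {
        fmt: shutil.make_archive(
            os.path.join(tmp, "release-" + fmt), fmt, root_dir=src
        )
        for fmt in ("gztar", "zip")
    }
    archive_names = {fmt: os.path.basename(p) for fmt, p in archives.items()}

    # unpack_archive picks the right extractor from the file extension.
    unpacked = os.path.join(tmp, "unpacked")
    shutil.unpack_archive(archives["gztar"], unpacked)
    round_trip_ok = os.path.exists(os.path.join(unpacked, "README.txt"))

print(archive_names, round_trip_ok)
```

The caller never touches `tarfile` or `zipfile` directly, which is why switching a default between the two formats is a small change at this level.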

For sdist, I think tar.gz makes the most sense as the default format, as it really is intended as just a release archive, with it being converted to another format (install directory, wheel, egg, or a downstream format) prior to actual use. Wearing my "Linux distro contributor" hat, I'll also note that there's a lot of existing open source infrastructure tooling built around the notion of upstream projects publishing tarballs as the base unit of a release (git commit hashes are starting to be an acceptable alternative, but tarballs still have a lot of very nice features for the purpose).

I think zip is a better fit for our built formats though, so it makes sense to continue to require that for both wheels and eggs (which is already explicitly the case for wheels, and implicitly the case for eggs).

@jaraco

Member

jaraco commented Aug 20, 2016

I've created http://bugs.python.org/issue27819 and assigned it to myself. Once that's in place, I'll add forward compatibility into Setuptools.

@jaraco jaraco closed this Aug 20, 2016

jaraco added a commit that referenced this pull request Aug 20, 2016

@ncoghlan

Member

ncoghlan commented Aug 21, 2016

Thank you!

@dstufft dstufft deleted the dstufft:default-tar-gz branch Aug 21, 2016
