Change default tar format to modern POSIX 2001 (pax) for better portability/interop, support and standards conformance #80449

CAM-Gerlach opened this issue Mar 12, 2019 · 9 comments
Labels: 3.8 (only security fixes), stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)


BPO 36268
Nosy @gustaebel, @serhiy-storchaka, @CAM-Gerlach
  • bpo-36268: Change default tar format to pax from GNU  #12355
  • bpo-30661: Improve doc for tarfile pax change and effect on shutil #12635
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-03-21.14:46:40.428>
    created_at = <Date 2019-03-12.02:14:28.198>
    labels = ['3.8', 'type-bug', 'library']
    title = 'Change default tar format to modern POSIX 2001 (pax) for better portability/interop, support and standards conformance'
    updated_at = <Date 2019-03-30.17:36:23.905>
    user = ''

    bugs.python.org fields:

    activity = <Date 2019-03-30.17:36:23.905>
    actor = 'CAM-Gerlach'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-03-21.14:46:40.428>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2019-03-12.02:14:28.198>
    creator = 'CAM-Gerlach'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36268
    keywords = ['patch']
    message_count = 9.0
    messages = ['337710', '337860', '337871', '337929', '337951', '338020', '338151', '338546', '338547']
    nosy_count = 3.0
    nosy_names = ['lars.gustaebel', 'serhiy.storchaka', 'CAM-Gerlach']
    pr_nums = ['12355', '12635']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = ''
    versions = ['Python 3.8']

    Member Author

    I propose changing tarfile.DEFAULT_FORMAT to be tarfile.PAX_FORMAT , rather than the legacy tarfile.GNU_FORMAT for Python 3.8. This would offer several benefits:

    • Removes limitations of the old GNU tar format, including limits on max UID/GID values and on the bits available for device major and minor numbers, and is currently the most flexible and feature-rich tar format
    • Encodes all filenames as UTF-8 in a portable way, ensuring consistent and correct handling on all platforms, avoiding errors like this one and generally ensuring expected, sensible defaults
    • Is the current interoperable POSIX standard, used by all modern platforms (Linux, Unix, macOS, and third-party unarchivers on Windows) rather than a vendor-specific extension like GNU tar
    • Backwards compatible with any unarchiver capable of reading ustar format (unlike GNU tar), as the extended pax headers will just be ignored
    • Fixes bpo-30661, support tarfile.PAX_FORMAT in shutil.make_archive (was proposed as a fix to the same, but it was never followed up on and the issue remains open)
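As a rough illustration of the first two points, the sketch below writes a single empty member to an in-memory archive and reads it back. The helper names (`roundtrip`, `make_member`), the file names, and the UID value are made up for the example; only the `tarfile` calls and format constants come from the standard library. It assumes a UTF-8 capable environment for the non-ASCII name.

```python
import io
import tarfile


def roundtrip(member: tarfile.TarInfo, fmt: int) -> tarfile.TarInfo:
    """Write one empty member in the given format and read it back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(member)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getmembers()[0]


def make_member(name: str) -> tarfile.TarInfo:
    """Build a member whose UID does not fit the 7-digit octal header field."""
    info = tarfile.TarInfo(name=name)
    info.uid = 8 ** 7  # one past the largest value the plain octal field holds
    return info


# pax stores both the non-ASCII path and the large UID in an extended header.
restored = roundtrip(make_member("données/naïve.txt"), tarfile.PAX_FORMAT)
print(restored.name, restored.uid)

# Plain ustar has no extension mechanism, so tarfile refuses to write the member.
ustar_error = None
try:
    roundtrip(make_member("plain.txt"), tarfile.USTAR_FORMAT)
except ValueError as exc:
    ustar_error = exc
print("ustar rejected the member:", ustar_error)
```

Passing `format=` explicitly, as done here, keeps the behaviour the same regardless of what `tarfile.DEFAULT_FORMAT` is on the running Python version.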

    This change would have no effect on reading existing archives, only writing new ones, and should be broadly compatible with any remotely modern system, as pax support is included in all the widely used libraries/systems:

    • POSIX 2001 (major Unix vendors), released in 2001 (18 years ago)
    • GNU tar 1.14 (Linux, etc), released in 2004 (15 years ago)
    • bsdtar/libtar ~1.2.51 (BSD, macOS, etc), at least as of 2006 (13 years ago), with significant bug fixes up through 2011 (8 years ago)
    • 7-zip (Windows) at some point before 2011 (>8 years ago), with significant bug fixes up to 2011 (8 years ago)
    • Python 2.6, released in 2008 (11 years ago)

    Furthermore, essentially every existing archiver supports ustar format, which would allow interoperability on very old/exotic platforms that don't support pax for some reason (and would certainly not support GNU). Therefore, it should be more than safe to make the change now, with archivers on the three major platforms supporting the modern standard for nearly a decade, and any esoteric ones at least as likely to support the POSIX standard as the vendor-specific GNU extension.
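One way to see the relationship between the formats is to look at the magic and version fields of the first header block that `tarfile` writes. This is only a sketch: `first_header_magic` and the member name are hypothetical, and matching signatures do not by themselves prove that every ustar reader skips extended headers gracefully.

```python
import io
import tarfile


def first_header_magic(fmt: int) -> bytes:
    """Return the magic + version bytes of the first 512-byte header block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(tarfile.TarInfo(name="hello.txt"))
    # The magic (6 bytes) and version (2 bytes) fields start at offset 257.
    return buf.getvalue()[257:265]


ustar_magic = first_header_magic(tarfile.USTAR_FORMAT)
pax_magic = first_header_magic(tarfile.PAX_FORMAT)
gnu_magic = first_header_magic(tarfile.GNU_FORMAT)

print("ustar:", ustar_magic)
print("pax:  ", pax_magic)
print("gnu:  ", gnu_magic)
```

The pax archive carries the same POSIX ustar signature as a plain ustar archive, while the GNU archive uses a different, GNU-specific one.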

    Is there any particular reason why we shouldn't make this change? Is there a particular group/list I should contact to follow up about seeing this implemented? It seems it should only require a one-line change here, aside from updating the docs and presumably the NEWS file, which I would be willing to do (I would think it should make a fairly straightforward first contribution).

    CAM-Gerlach added the 3.8, stdlib, and type-bug labels on Mar 12, 2019

    Looks reasonable.

    Do you know whether it is supported on OpenBSD and NetBSD? In other popular programming languages?

    Member Author

    In general, pax is a backwards-compatible superset of the standard, portable ustar format, unlike the vendor-specific GNU format, which even GNU tar itself no longer recommends, favoring a switch to pax by default. To my understanding, pax is therefore essentially always the better choice. The only exception would be systems that support GNU tar but not POSIX 2001 and where the limitations of the old ustar must be bypassed, which as far as I'm aware is basically just really old (>10-15 years) GNU/Linux.

    NetBSD and OpenBSD both use bsdtar implementations, which as far as I could find means they support the POSIX 2001-standard pax format, and (unless they use libarchive, which supports all three) likely *don't* support the current GNU format, which is specific to GNU tar. Even if they don't support pax, their ustar support means they can read pax archives as legacy ustar archives (as pax is backwards-compatible), while the same is not necessarily true of GNU tar archives. Therefore, pax is strictly a better choice than GNU or ustar.

    Most other programming languages I could find did not have internal/standard library implementations, instead relying on the aforementioned libraries or varying third party packages:

    • For C/C++, libarchive and GNU tar are the two modern heavy hitters, and they have both supported it for a very long time. Modern versions of old-style bsdtar should as well, but if not, then they don't support GNU tar either. These are commonly used when needed with C/C++, or programmers implement their own bespoke solutions.
    • Libtar (C) does not, but it hasn't been updated for 6 years (and has been in minimal maintenance mode for over 15), so I'm not sure it's really relevant anymore. Virtually any platform will also have one of the previous.
    • The major implementation for Java, Apache Commons Compress, added support for both pax and GNU in its 1.2 version, back in 2011 (8 years ago)
    • R uses the system's tar executable (or a bundled modern tar), so it will have the same support as that (i.e. any remotely modern system should be compatible). Their documentation explicitly recommends against GNU tar in favor of pax or ustar for portability.
    • git-archive uses pax exclusively
    • PHP supports ustar only, not pax or GNU; in that case pax is generally the more compatible of the two extended formats
    • The node-tar library, the apparent standard for JavaScript, supports it
    • The standard tar package for Go supports it
    • What seems to be the major current implementation for C#, SharpZipLib, supports it
    • Ruby has no apparent standard implementation; a few third-party libraries have a mix of support


    Would you mind creating a PR?

    Member Author

    Sure, working on it now. It's my first contribution to CPython, so bear with me. I presume this is too trivial to go in the What's New in Python article, but does merit a NEWS entry so users are aware of the change? Aside from changing this line, updating the documentation to reflect the change, and possibly adding a NEWS entry, is there anything else that needs to be done? Thanks.

    Member Author

    PR is up with CI checks green as GH-12355. I also had to fix one test which implicitly assumed that DEFAULT_FORMAT == GNU_FORMAT.
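For code or tests that might carry a similar implicit assumption, one option is to assert on archive contents rather than on raw bytes, and to pass `format=` explicitly wherever a specific on-disk layout matters. The sketch below is illustrative only: `write_archive`, `member_names`, and `LONG_NAME` are hypothetical names, not taken from the CPython test suite.

```python
import io
import tarfile
from typing import List, Optional

# 308 characters: too long for plain ustar, so an extended format is required.
LONG_NAME = "d/" * 150 + "file.txt"


def write_archive(fmt: Optional[int] = None) -> bytes:
    """Write one empty member; fmt=None uses the interpreter's default format."""
    buf = io.BytesIO()
    kwargs = {} if fmt is None else {"format": fmt}
    with tarfile.open(fileobj=buf, mode="w", **kwargs) as tar:
        tar.addfile(tarfile.TarInfo(name=LONG_NAME))
    return buf.getvalue()


def member_names(data: bytes) -> List[str]:
    """Read member names back; the reader detects the format on its own."""
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        return tar.getnames()


# Content-level checks hold whichever format wrote the archive.
for fmt in (None, tarfile.GNU_FORMAT, tarfile.PAX_FORMAT):
    assert member_names(write_archive(fmt)) == [LONG_NAME]

# Byte-level comparisons do depend on the format, so pin it when they matter.
print(write_archive(tarfile.GNU_FORMAT) == write_archive(tarfile.PAX_FORMAT))
print("default format is pax:", tarfile.DEFAULT_FORMAT == tarfile.PAX_FORMAT)
```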

    Member Author

    Also, one additional minor note (since I apparently can't edit comments here). Windows 10 (since the April 2018 update a year ago) now includes libarchive-based bsdtar built-in by default and accessible from the standard command prompt, which as mentioned fully supports pax.

    Therefore, all modern platforms should support extracting pax archives out of the box. The exceptions are Windows 7/Server 2008, for which extended support will end within two months of Python 3.8's initial release; Windows 10 pre-1803, for which enterprise support will end a few months after that; and Windows 8.1/Server 2012, which will be in extended support for a few more years but has very low enterprise/developer/power user adoption. Of course, these don't include any built-in tar support at all anyway.


    New changeset e680c3d by Serhiy Storchaka (CAM Gerlach) in branch 'master':
    bpo-36268: Change default tar format to pax from GNU. (GH-12355)


    Thank you for your contribution!
