Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tarfile add uses random order #74878

Closed
bmwiedemann mannequin opened this issue Jun 18, 2017 · 23 comments
Closed

tarfile add uses random order #74878

bmwiedemann mannequin opened this issue Jun 18, 2017 · 23 comments
Labels
3.7 stdlib type-bug

Comments

@bmwiedemann
Copy link
Mannequin

@bmwiedemann bmwiedemann mannequin commented Jun 18, 2017

BPO 30693
Nosy @rhettinger, @gustaebel, @vstinner, @tiran, @ned-deily, @bitdancer, @serhiy-storchaka, @bmwiedemann
PRs
  • #2263
  • #5557
  • #5567
  • #31713
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-02-06.18:38:13.647>
    created_at = <Date 2017-06-18.02:03:30.834>
    labels = ['3.7', 'type-bug', 'library']
    title = 'tarfile add uses random order'
    updated_at = <Date 2022-03-06.22:48:55.393>
    user = 'https://github.com/bmwiedemann'

    bugs.python.org fields:

    activity = <Date 2022-03-06.22:48:55.393>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-02-06.18:38:13.647>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2017-06-18.02:03:30.834>
    creator = 'bmwiedemann'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 30693
    keywords = ['patch']
    message_count = 23.0
    messages = ['296251', '296280', '296308', '310164', '310166', '310167', '310169', '310170', '310171', '310173', '310205', '310337', '310342', '311318', '311320', '311321', '311324', '311325', '311387', '311605', '311685', '311736', '311738']
    nosy_count = 9.0
    nosy_names = ['rhettinger', 'lars.gustaebel', 'vstinner', 'christian.heimes', 'ned.deily', 'r.david.murray', 'python-dev', 'serhiy.storchaka', 'bmwiedemann']
    pr_nums = ['2263', '5557', '5567', '31713']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue30693'
    versions = ['Python 3.7']

    @bmwiedemann
    Copy link
    Mannequin Author

    @bmwiedemann bmwiedemann mannequin commented Jun 18, 2017

    Filesystems do not give any guarantees about ordering of files returned in directory listings, thus tarfile.add adds files in random order, when using os.listdir in recursion.

    See also https://reproducible-builds.org/docs/stable-inputs/ on that topic.

    @bmwiedemann bmwiedemann mannequin added 3.7 stdlib type-bug labels Jun 18, 2017
    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Jun 18, 2017

    The patch for similar issue with the glob module was rejected recently since it is easy to sort the result of glob.glob() (see bpo-30461). This issue looks similar, but there are differences. On one side, the command line tar utility doesn't have the option for sorting file names and seems don't sort them by default (I didn't checked). It is possible to use external sorting with the tarfile module as with the tar utility (generate the list of all files and directories, sort it, and pass every item to TarFile.add with the option recursive=False). But on other side, this is not so easy as for glob.glob(). And the overhead of the sorting is expected to be smaller than for glob.glob(). This may be considered as additional arguments for approving the patch.

    If this approach will be approved, it should be applied also to the ZIP archives.

    FYI the order of archived files can affect the compression ratio of the compressed tar archive. For example the 7-Zip archiver sorts files by extensions, this increases the chance that files of the same type (text, multimedia, spreadsheet, executables, etc) are grouped together and use the common dictionary for global compression. This isn't directly related to this issue, just a material for possible future enhancement.

    @bmwiedemann
    Copy link
    Mannequin Author

    @bmwiedemann bmwiedemann mannequin commented Jun 19, 2017

    note: recent GNU tar versions (1.28?) added an option --sort=name

    and the overhead of sorting (e.g. I measured 4ms for 10000 files) is negligible compared to the other processing done on the files here.

    @bitdancer
    Copy link
    Member

    @bitdancer bitdancer commented Jan 17, 2018

    Given the reproducible builds angle, I'd say this was worth doing.

    @tiran
    Copy link
    Member

    @tiran tiran commented Jan 17, 2018

    +1 from me

    In my opinion it's both a good idea to not sort the result of glob.glob() and make the order of tar and zip module content ordered. The glob module is low level and it makes sense to expose the file system sort order.

    On the other hand tar and zip modules are on a higher level. Without sorting it's impossible to create reproducible archives. The performance impact is irrelevant. I/O and compression dominant performance.

    @tiran
    Copy link
    Member

    @tiran tiran commented Jan 17, 2018

    PS: I'm -0 to backport the change to 3.6 and 2.7. 3.5 is in security fix mode and therefore completely out of scope.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 17, 2018

    Since we currently don't warranty *anything* about ordering, I like the idea of *fixing* Python 2.7 and 3.6 as well. +1 for fix it in 2.7, 3.6 and master.

    @bitdancer
    Copy link
    Member

    @bitdancer bitdancer commented Jan 17, 2018

    Ah, I was just going to ask about that. I guess I'm -0 on the backport as well. The other reproducible build stuff is only going to land in 3.7. However, this is in a more general category than the pyc stuff, so I can see the argument for backporting it.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 17, 2018

    The only warranty in that TarFile.getmembers(), TarFile.getnames() and ZipFile.infolist() returns members/names "in the same order as the members in the archive".

    Currently, there is no warranty when packing, only on unpack.

    @tiran
    Copy link
    Member

    @tiran tiran commented Jan 17, 2018

    The patch changes behavior. It's fine for 3.7 but not for 3.6/2.7. Somebody may depend on filesystem order.

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented Jan 17, 2018

    This doesn't seem appropriate to me for backporting to existing releases (3.6. and 2.7). AFAIK, the current file-system-order behavior has never been identified as a bug. Unless there is a stronger case for changing the existing 3.6.x behavior, I am -1 on backporting.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Jan 20, 2018

    If make this change you need to make similar changes in other places that recursively add files to archives: shutil, zipapp, distutils, and maybe more.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 20, 2018

    I now agree to leave Python 2.7 and 3.6 unchanged.

    @tiran
    Copy link
    Member

    @tiran tiran commented Jan 31, 2018

    We missed beta freeze deadline. :/

    Ned,
    can we get this change into beta 2? It's low risk change to make the tarballs and other archives have a stable sort order. We even considered to backport the change to 3.6 and 2.7.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 31, 2018

    New changeset 8452104 by Victor Stinner (Bernhard M. Wiedemann) in branch 'master':
    bpo-30693: zip+tarfile: sort directory listing (bpo-2263)
    8452104

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 31, 2018

    We missed beta freeze deadline. :/

    I merged the PR. We will have to create a cherry-pick request once the 3.7 branch will be created. If Ned rejects it, we have to change the version number of documentation.

    https://mail.python.org/pipermail/python-dev/2018-January/152012.html

    IMHO the change is very safe to be merged into 3.7b2.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Jan 31, 2018

    I requested additional changes in msg310337.

    @bmwiedemann
    Copy link
    Mannequin Author

    @bmwiedemann bmwiedemann mannequin commented Jan 31, 2018

    @serhiy IMHO, just because we fix one problem, we do not have to fix all other problems at the same time. You can still open a pull-request for the others, but I know too little about those to test them.
    And having commits pending for 7 months is not exactly energizing either.

    For my use-case I just needed a trivial 1 line fix in tarfile.py and already ended up with a diffstat of
    7 files changed, 39 insertions(+), 6 deletions(-)

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented Jan 31, 2018

    New changeset 57750be by Ned Deily (Bernhard M. Wiedemann) in branch '3.7':
    bpo-30693: zip+tarfile: sort directory listing (bpo-2263)
    57750be

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Feb 4, 2018

    Tests are failing on Windows.

    ======================================================================
    ERROR: test_ordered_recursion (test.test_tarfile.Bz2WriteTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\py\cpython3.7\lib\unittest\mock.py", line 1191, in patched
        return func(*args, **keywargs)
      File "C:\py\cpython3.7\lib\test\test_tarfile.py", line 1152, in test_ordered_recursion
        support.unlink(os.path.join(path, "1"))
      File "C:\py\cpython3.7\lib\test\support\__init__.py", line 394, in unlink
        _unlink(filename)
      File "C:\py\cpython3.7\lib\test\support\__init__.py", line 344, in _unlink
        _waitfor(os.unlink, filename)
      File "C:\py\cpython3.7\lib\test\support\__init__.py", line 341, in _waitfor
        RuntimeWarning, stacklevel=4)
    RuntimeWarning: tests may fail, delete still pending for C:\py\cpython3.7\build\test_python_8504\@test_8504_tmp-tardir\directory\1

    ======================================================================
    ERROR: test_directory_size (test.test_tarfile.GzipWriteTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\py\cpython3.7\lib\test\test_tarfile.py", line 1121, in test_directory_size
        os.mkdir(path)
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\py\\cpython3.7\\build\\test_python_8504\\@test_8504_tmp-tardir\\directory'

    ======================================================================
    ERROR: test_ordered_recursion (test.test_tarfile.GzipWriteTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\py\cpython3.7\lib\unittest\mock.py", line 1191, in patched
        return func(*args, **keywargs)
      File "C:\py\cpython3.7\lib\test\test_tarfile.py", line 1137, in test_ordered_recursion
        os.mkdir(path)
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\py\\cpython3.7\\build\\test_python_8504\\@test_8504_tmp-tardir\\directory'

    ======================================================================
    ERROR: test_directory_size (test.test_tarfile.LzmaWriteTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\py\cpython3.7\lib\test\test_tarfile.py", line 1121, in test_directory_size
        os.mkdir(path)
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\py\\cpython3.7\\build\\test_python_8504\\@test_8504_tmp-tardir\\directory'

    ======================================================================
    ERROR: test_ordered_recursion (test.test_tarfile.LzmaWriteTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\py\cpython3.7\lib\unittest\mock.py", line 1191, in patched
        return func(*args, **keywargs)
      File "C:\py\cpython3.7\lib\test\test_tarfile.py", line 1137, in test_ordered_recursion
        os.mkdir(path)
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\py\\cpython3.7\\build\\test_python_8504\\@test_8504_tmp-tardir\\directory'

    ======================================================================
    ERROR: test_directory_size (test.test_tarfile.WriteTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\py\cpython3.7\lib\test\test_tarfile.py", line 1121, in test_directory_size
        os.mkdir(path)
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\py\\cpython3.7\\build\\test_python_8504\\@test_8504_tmp-tardir\\directory'

    ======================================================================
    ERROR: test_ordered_recursion (test.test_tarfile.WriteTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\py\cpython3.7\lib\unittest\mock.py", line 1191, in patched
        return func(*args, **keywargs)
      File "C:\py\cpython3.7\lib\test\test_tarfile.py", line 1137, in test_ordered_recursion
        os.mkdir(path)
    FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\py\\cpython3.7\\build\\test_python_8504\\@test_8504_tmp-tardir\\directory'

    @bmwiedemann
    Copy link
    Mannequin Author

    @bmwiedemann bmwiedemann mannequin commented Feb 5, 2018

    Serhiy, can you test #5557

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Feb 6, 2018

    New changeset 4ad703b by Serhiy Storchaka (Bernhard M. Wiedemann) in branch 'master':
    bpo-30693: Fix tarfile test cleanup on MSWindows (bpo-5557)
    4ad703b

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Feb 6, 2018

    New changeset 2c6f668 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.7':
    bpo-30693: Fix tarfile test cleanup on MSWindows (GH-5557) (GH-5567)
    2c6f668

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 stdlib type-bug
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants