tarfile: undocumented (and potentially surprising) performance slowdown when adding lots of files with w:xz mode #132994

@mgorny

Documentation

Consider the following reproducer:

import io
import sys
import tarfile


with open("/dev/urandom", "rb") as f:
    data = io.BytesIO(f.read(3849))
size = len(data.getbuffer())

if sys.argv[1] == "stream":
    kwargs = {
        "mode": "w|xz",
        "compresslevel": 9,
    }
else:
    kwargs = {
        "mode": "w:xz",
        "preset": 9,
    }

with tarfile.open("test.tar.xz", format=tarfile.GNU_FORMAT, **kwargs) as tarf:
    for x in range(50000):
        data.seek(0)
        tinfo = tarfile.TarInfo(f"{x}.txt")
        tinfo.size = size
        tarf.addfile(tinfo, data)

It simulates, in simplified form, adding lots of small files to an xz-compressed archive. Any argument other than "stream" (here, "normal") selects the regular w:xz mode. Consider the timings:

$ time python3.13 test.py normal

real	0m10,316s
user	0m9,716s
sys	0m0,568s
$ time python3.13 test.py stream

real	0m9,115s
user	0m8,999s
sys	0m0,090s

The stream mode (w|xz) is noticeably faster than the regular mode (w:xz) here: about 9.1 s versus 10.3 s of wall-clock time, or roughly 12% less. For example, when the problem was reported to pycargoebuild, I found that repacking the uv-0.6.17 crates takes roughly 3 min 35 s in regular mode and 3 min 15 s in stream mode.

I presume the differences are by design, but I think it would be useful to document them more clearly. Currently, the documentation indicates that:

For special purposes, there is a second format for mode: filemode|[compression]. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file. If given, fileobj may be any object that has a read() or write() method (depending on the mode) that works with bytes. bufsize specifies the blocksize and defaults to 20 * 512 bytes. Use this variant in combination with e.g. sys.stdin.buffer, a socket file object or a tape device. […]

(https://docs.python.org/3/library/tarfile.html)
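
For readers unfamiliar with the second mode format, here is a minimal sketch of writing and reading an xz-compressed archive purely as a stream. The helper names `pack_stream` and `list_stream` are hypothetical and exist only for illustration; an in-memory `io.BytesIO` stands in for a pipe or socket.

```python
import io
import tarfile


def pack_stream(members):
    """Pack a {name: bytes} mapping into an xz-compressed tar using stream mode."""
    buf = io.BytesIO()  # stand-in for a pipe or socket (assumption for this sketch)
    with tarfile.open(fileobj=buf, mode="w|xz") as tarf:
        for name, payload in members.items():
            tinfo = tarfile.TarInfo(name)
            tinfo.size = len(payload)
            tarf.addfile(tinfo, io.BytesIO(payload))
    # The caller-supplied file object is left open, so its contents can be read here.
    return buf.getvalue()


def list_stream(blob):
    """Return member names by reading the archive sequentially, without seeking."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r|xz") as tarf:
        return [member.name for member in tarf]


if __name__ == "__main__":
    blob = pack_stream({"a.txt": b"hello", "b.txt": b"world!"})
    print(list_stream(blob))
```

Members of an archive opened with `r|xz` have to be consumed in order, which is the trade-off the quoted documentation describes.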

This suggests that you'd only use the stream mode in special cases, in particular when the underlying file doesn't support random access. However, this experiment suggests that the stream mode is faster in general, and particularly when dealing with lots of files. Therefore, I think the documentation could be updated to indicate that the stream mode is faster when adding lots of files, or perhaps that it should be preferred in general unless random access is actually necessary.
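
To check the comparison on another machine without shelling out to `time`, something like the following in-process sketch may help. It is not the reproducer above: it writes to memory rather than to test.tar.xz, uses `os.urandom` instead of /dev/urandom, leaves compression settings at their defaults in both modes, and defaults to fewer members. The function names `time_mode` and `compare` are hypothetical.

```python
import io
import os
import tarfile
import time


def time_mode(mode, count=2000, size=3849):
    """Return the seconds taken to add `count` random members using `mode`."""
    payload = os.urandom(size)
    out = io.BytesIO()  # in-memory target (assumption; the reproducer writes to disk)
    start = time.perf_counter()
    with tarfile.open(fileobj=out, mode=mode, format=tarfile.GNU_FORMAT) as tarf:
        for x in range(count):
            tinfo = tarfile.TarInfo(f"{x}.txt")
            tinfo.size = size
            tarf.addfile(tinfo, io.BytesIO(payload))
    return time.perf_counter() - start


def compare(count=2000):
    """Time the regular (w:xz) and stream (w|xz) modes with default compression."""
    return {mode: time_mode(mode, count) for mode in ("w:xz", "w|xz")}


if __name__ == "__main__":
    for mode, seconds in compare().items():
        print(f"{mode}: {seconds:.3f} s")
```

Because both modes run with default compression settings here, the absolute numbers will differ from the timings reported above; only the relative difference between the two modes is of interest.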
