tarfile: undocumented (and potentially surprising) performance slowdown when adding lots of files with `w:xz` mode

# Documentation

Consider the following reproducer:

```python
import io
import sys
import tarfile


with open("/dev/urandom", "rb") as f:
    data = io.BytesIO(f.read(3849))
size = len(data.getbuffer())

if sys.argv[1] == "stream":
    kwargs = {
        "mode": "w|xz",
        "compresslevel": 9,
    }
else:
    kwargs = {
        "mode": "w:xz",
        "preset": 9,
    }

with tarfile.open("test.tar.xz", format=tarfile.GNU_FORMAT, **kwargs) as tarf:
    for x in range(50000):
        data.seek(0)
        tinfo = tarfile.TarInfo(f"{x}.txt")
        tinfo.size = size
        tarf.addfile(tinfo, data)
```

It simulates, in simplified form, adding lots of small files to an xz-compressed archive. The timings are:

```console
$ time python3.13 test.py normal

real	0m10,316s
user	0m9,716s
sys	0m0,568s
$ time python3.13 test.py stream

real	0m9,115s
user	0m8,999s
sys	0m0,090s
```

The stream mode (`w|xz`) is noticeably faster than the regular mode (`w:xz`) here. For example, when [the problem was reported to pycargoebuild](https://github.com/projg2/pycargoebuild/issues/39), I found that repacking the `uv-0.6.17` crates takes roughly 3 min 35 s in regular mode and 3 min 15 s in stream mode.
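For anyone who wants to compare the two modes without touching the disk, here is a rough in-memory variant of the reproducer. It is only a sketch, not the script that produced the numbers above: `build_archive` and its parameters are names I made up for illustration, the file count is much smaller, and I deliberately pass no `preset`/`compresslevel` so that both modes fall back to the library's default compression settings.

```python
import io
import os
import tarfile
import time


def build_archive(mode, n_files=300, payload_size=3849):
    """Write n_files small members to an in-memory archive.

    Hypothetical helper for this comparison; returns (seconds, archive_bytes).
    """
    payload = os.urandom(payload_size)
    out = io.BytesIO()
    start = time.perf_counter()
    # No preset/compresslevel: both modes use the default compression settings.
    with tarfile.open(fileobj=out, mode=mode, format=tarfile.GNU_FORMAT) as tarf:
        for x in range(n_files):
            tinfo = tarfile.TarInfo(f"{x}.txt")
            tinfo.size = payload_size
            tarf.addfile(tinfo, io.BytesIO(payload))
    elapsed = time.perf_counter() - start
    return elapsed, out.getvalue()


if __name__ == "__main__":
    for mode in ("w:xz", "w|xz"):
        elapsed, blob = build_archive(mode)
        print(f"{mode}: {elapsed:.3f} s, {len(blob)} bytes")
```

The absolute numbers will depend on the machine and the Python version, so treat this as a way to check the trend rather than as a benchmark.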

I presume the differences are by design, but I think it would be useful to document them more clearly. Currently, the documentation indicates that:

> For special purposes, there is a second format for mode: `filemode|[compression]`. [tarfile.open()](https://docs.python.org/3/library/tarfile.html#tarfile.open) will return a [TarFile](https://docs.python.org/3/library/tarfile.html#tarfile.TarFile) object that processes its data as a stream of blocks. No random seeking will be done on the file. If given, fileobj may be any object that has a [read()](https://docs.python.org/3/library/io.html#io.RawIOBase.read) or [write()](https://docs.python.org/3/library/io.html#io.RawIOBase.write) method (depending on the mode) that works with bytes. bufsize specifies the blocksize and defaults to 20 * 512 bytes. Use this variant in combination with e.g. `sys.stdin.buffer`, a socket [file object](https://docs.python.org/3/glossary.html#term-file-object) or a tape device. […]
>
> (https://docs.python.org/3/library/tarfile.html)

This suggests that you'd only use the stream mode in special cases, in particular when the underlying file doesn't support random access. However, the experiment above indicates that the stream mode is faster in general, and particularly when dealing with lots of files. Therefore, I think the documentation could be updated to say that the stream mode is faster when adding lots of files, or perhaps that it should be preferred in general unless random access is actually necessary.
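To illustrate the "special purposes" case the documentation describes, here is a minimal sketch of the stream mode writing to an object that offers nothing but `write()`. `WriteOnlySink` is a hypothetical stand-in for a socket or pipe, not something from `tarfile` itself.

```python
import io
import tarfile


class WriteOnlySink:
    """Hypothetical non-seekable sink: exposes write() and nothing else."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


sink = WriteOnlySink()
payload = b"hello\n"

# Stream mode: the archive is emitted as a sequence of blocks, no seeking.
with tarfile.open(fileobj=sink, mode="w|xz") as tarf:
    tinfo = tarfile.TarInfo("hello.txt")
    tinfo.size = len(payload)
    tarf.addfile(tinfo, io.BytesIO(payload))

# Reassemble the written blocks and read the archive back to verify it.
archive = b"".join(sink.chunks)
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:xz") as tarf:
    names = tarf.getnames()
    content = tarf.extractfile("hello.txt").read()
```

The same call works when the target is an ordinary file on disk, which is the case where the timings above suggest the stream mode may still be worth using.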
