bpo-35227: Add support for file objects of unknown size to tarfile by remilapeyre · Pull Request #10714 · python/cpython

remilapeyre · 2018-11-26T10:31:22Z

This commit adds a new method to TarFile to support adding file objects
whose size is not known beforehand:

import tarfile
import urllib.request
tf = tarfile.open('tarfile.tar', 'w')
response = urllib.request.urlopen('http://www.example.com/huge-page.html')
tarinfo = tf.gettarinfo(name='foo', fileobj=response)
tf.addbuffer(tarinfo, response)
tf.close()

Because the tar header is written before the data in a tar archive, this
requires writing a provisional header without knowing the size, then seeking
back to overwrite it once all data has been written and the size is known.

This method therefore cannot be used on compressed archives or unseekable
media.
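
The seek-back approach described above can be sketched outside of TarFile. This is a minimal illustration, not the PR's implementation: append_stream is a hypothetical helper, it writes straight to a seekable binary file, and it assumes GNU_FORMAT so that the header has the same length on both writes.

```python
import io
import tarfile

BLOCKSIZE = 512  # tar headers and data are padded to 512-byte blocks


def append_stream(archive, tarinfo, stream, bufsize=16 * 1024):
    """Append `stream` to a seekable binary file as one regular tar member.

    Hypothetical helper for illustration; this is not the PR's addbuffer().
    """
    header_pos = archive.tell()
    tarinfo.size = 0
    # Provisional header: the real size is not known yet.
    archive.write(tarinfo.tobuf(tarfile.GNU_FORMAT))

    size = 0
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        archive.write(chunk)
        size += len(chunk)

    # Pad the data up to the next block boundary.
    if size % BLOCKSIZE:
        archive.write(b"\0" * (BLOCKSIZE - size % BLOCKSIZE))
    end_pos = archive.tell()

    # Seek back and overwrite the header now that the size is known.
    tarinfo.size = size
    archive.seek(header_pos)
    archive.write(tarinfo.tobuf(tarfile.GNU_FORMAT))
    archive.seek(end_pos)
    return size


raw = io.BytesIO()
written = append_stream(raw, tarfile.TarInfo(name="foo"), io.BytesIO(b"x" * 1000))
raw.write(b"\0" * (2 * BLOCKSIZE))  # end-of-archive marker: two zero blocks
raw.seek(0)
with tarfile.open(fileobj=raw, mode="r") as tf:
    member = tf.getmember("foo")
    payload = tf.extractfile(member).read()
```

The archive.seek(header_pos) call is the step that rules out compressed archives and unseekable media: neither lets the writer go back and patch bytes it has already emitted.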

A change has been made in TarFile.gettarinfo so that it no longer replaces the
name argument with fileobj.name, because in-memory buffers do not have one. This
may need to be removed and replaced by something like:

if fileobj is not None:
    try:
        name = fileobj.name
    except AttributeError:
        # one of name or fileobj.name should be set
        if name is None:
            raise

if backward compatibility for this behavior is wanted.
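
As a standalone sketch of that fallback (resolve_name and NamedStream are hypothetical names used only for illustration, not part of tarfile), the behaviour would be: fileobj.name wins when it exists, an explicit name is used when the file object has none, and the AttributeError propagates when neither is available.

```python
import io


def resolve_name(name, fileobj):
    """Hypothetical helper mirroring the fallback snippet above."""
    if fileobj is not None:
        try:
            name = fileobj.name
        except AttributeError:
            # One of name or fileobj.name must be set.
            if name is None:
                raise
    return name


class NamedStream(io.BytesIO):
    """In-memory stream that, unlike plain BytesIO, carries a name."""
    name = "from-fileobj"


explicit = resolve_name("foo", io.BytesIO())    # BytesIO has no .name
overridden = resolve_name("foo", NamedStream())  # fileobj.name takes priority
```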

https://bugs.python.org/issue35227


mgorny · 2018-11-26T22:13:12Z

I've just tested it on top of subprocess.Popen(...).stdout, and it works just fine! Good work, thanks!

vadmium · 2018-11-27T08:47:57Z

Doc/library/tarfile.rst

      The *name* parameter accepts a :term:`path-like object`.

+   .. versionchanged:: 3.8
+      The *fileobj* attribute can now be an in-memory buffer


Attribute or parameter? Neither TarInfo nor TarFile seems to have a documented fileobj attribute.

vadmium · 2018-11-27T08:48:03Z

Doc/library/tarfile.rst

+.. method:: TarFile.addbuffer(tarinfo, buf)
+
+   Add the :class:`TarInfo` object *tarinfo* to the archive reading data from
+   *buf*. The size of *buf* needs not to be known beforehand and *tarinfo.size*


First sentence would be easier to read with a comma: Add . . . to the archive, reading data . . .

What kind of object is buf supposed to be? If it is a file object, say what API it should support (e.g. BufferedIOBase.read method, or the entire readable API of BufferedIOBase). The name buf suggests an in-memory buffer like bytes, which doesn’t match the verb reading well. If this is the case, I suggest changing the parameter name.

If you mean that knowing the amount of data beforehand is allowed but optional, change "needs not to be known" to "need not be known".

Thanks for all your input. buf needs to support RawIOBase.read and RawIOBase.tell. I'm not sure how to say so properly in the documentation.

The name is not appropriate, as @mgorny said. addstream, with every buf renamed to stream, might be better.

Would it be hard to make it work without .tell()? I suppose you could just count the data you read.

I wanted to reuse copyfileobj, which relies on shutil.copyfileobj and does not expose this, but if I remove the dependency on copyfileobj I could do this, which would be better.

I suspect tell is not needed. It looks like you only use tell on the tar file object, not the file being added to the tar file. (If tell is available, then SEEK_END probably also works and you can get the file size in advance.)

My suggested wording: “The fileobj argument should implement RawIOBase.read, which is called until the end of the stream is reached.”
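
A minimal sketch of counting bytes while copying, so that the source only needs read(). Both copy_counting and ReadOnlyStream are hypothetical names for illustration; the PR's actual copy loop may differ.

```python
import io


class ReadOnlyStream:
    """Hypothetical stand-in for a pipe or socket: only read() is available."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, n=-1):
        return self._inner.read(n)


def copy_counting(src, dst, bufsize=16 * 1024):
    """Copy src to dst until end of stream and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(bufsize)
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)


sink = io.BytesIO()
copied = copy_counting(ReadOnlyStream(b"abc" * 10_000), sink)
```

Because the count comes from the length of each chunk, neither tell() nor seek() is ever called on the source.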

vadmium · 2018-11-27T08:54:00Z

Doc/library/tarfile.rst

   :attr:`~io.FileIO.name` attribute, or the *name* argument.  The name
-   should be a text string.
+   should be a text string. If *fileobj* is an in-memory buffer, a default file
+   status will be used.


I think in-memory buffer should be clarified. If you mean something like an OS pipe, the behaviour should already be obvious (result of stat). But I think you mean some other kind of object.

This method can be called on any object that has tell and read methods. I'm not sure how to express this properly in the documentation.

vadmium · 2018-11-27T09:25:18Z

Lib/tarfile.py

+        start_pos = dst.tell()
        shutil.copyfileobj(src, dst, bufsize)
-        return
+        return dst.tell() - start_pos


If this branch is only used by your new code, it would be easier to understand if you moved it directly into the call site. Then the rest of the function does not need changing, and it only has one personality.

This is indeed used only by my code. From what I could gather, copyfileobj is never called with length=None, and this is not a documented interface.

If we inline this part of the code in addbuffer I think we should remove the support for length=None and raise an error here to avoid mistakes.

vadmium · 2018-11-27T09:31:54Z

Lib/tarfile.py

        return [tarinfo.name for tarinfo in self.getmembers()]

+    def _getdefaultstat(self):
+        time = int(datetime.datetime.now().timestamp())


Why not time.time() like the existing _Stream.init_write_gz method?

This is a mistake, thanks.
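
For reference, a sketch of building default metadata with time.time(). The helper default_tarinfo is hypothetical, and the PR's _getdefaultstat may set different fields.

```python
import tarfile
import time


def default_tarinfo(name):
    """Hypothetical helper: metadata for a stream that has no file status."""
    info = tarfile.TarInfo(name=name)
    info.mtime = int(time.time())  # whole seconds since the epoch
    info.mode = 0o644
    return info


info = default_tarinfo("foo")
```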

vadmium · 2018-11-27T10:06:43Z

Lib/tarfile.py

+
+        tarinfo = copy.copy(tarinfo)
+        # we record the stream as a plain file
+        tarinfo.type = REGTYPE


Wouldn’t it be better to require the caller to set up tarinfo correctly, or for addbuffer to support the other kinds of entries?

The aim of the method is to write data from fileobj and store it as a plain file. Not overriding tarinfo.type causes an issue when fileobj is a pipe, because tarinfo.type is set to FIFOTYPE, which is not supposed to have data associated with it in the tar archive, as far as I can tell.

Do you think changing the signature from addbuffer(self, tarinfo, fileobj) to addbuffer(self, tarinfo, fileobj, type=REGTYPE) would be acceptable?

I don’t see the benefit of an additional type argument. What would happen if it disagreed with tarinfo.type?

If the new method was documented for only adding “regular” files, I think setting REGTYPE might be okay.
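
The pipe case discussed above can be reproduced with the existing API. This sketch assumes a POSIX system, where os.fstat reports a pipe as a FIFO. pipe_tarinfo is a hypothetical helper, and arcname is passed explicitly because the pipe's name attribute is an integer file descriptor.

```python
import copy
import io
import os
import tarfile


def pipe_tarinfo():
    """Return the TarInfo that gettarinfo() builds for the read end of a pipe."""
    read_fd, write_fd = os.pipe()
    with os.fdopen(read_fd, "rb") as reader, os.fdopen(write_fd, "wb"):
        with tarfile.open(fileobj=io.BytesIO(), mode="w") as tf:
            return tf.gettarinfo(arcname="foo", fileobj=reader)


piped = pipe_tarinfo()

# Work on a copy so the caller's object keeps its original type.
regular = copy.copy(piped)
regular.type = tarfile.REGTYPE
```

Under that assumption piped.isfifo() is true and its size is 0, which is why any data copied after such a header would not belong to a valid member unless the type is changed to REGTYPE.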

vadmium · 2018-11-27T10:09:34Z

Lib/tarfile.py

+        self._write_header(tarinfo)
+
+        bufsize = self.copybufsize
+        tarinfo.size = copyfileobj(fileobj, self.fileobj, bufsize=bufsize)


This is an internal copy of tarinfo, not the copy passed by the caller. Seems to contradict your documentation.

I forgot this, thanks.

github-actions · 2025-04-13T06:03:40Z

This PR is stale because it has been open for 30 days with no activity.

the-knights-who-say-ni added the CLA signed label Nov 26, 2018

bedevere-bot added the awaiting review label Nov 26, 2018

Rémi Lapeyre added 3 commits November 26, 2018 12:10

Use defaults uid and gid on Windows

0bd71a4

Use non seekable fd in tarfile test_buffer_write

0b26391

Use more descriptive name when seeking in TarFile.addbuffer

cdcd38b

Raise ValueError when using TarFile.addbuffer with pax headers

11ca0f0

vadmium reviewed Nov 27, 2018


mgorny mannequin mentioned this pull request Apr 10, 2022

[RFE] tarfile: support adding file objects without prior known size #79408

Open

ezio-melotti removed the CLA signed label Jul 13, 2022

github-actions bot added the stale Stale PR or inactive for long period of time. label Apr 13, 2025
