Skip to content

bpo-35227: Add support for file objects of unknown size to tarfile#10714

Open
remilapeyre wants to merge 5 commits intopython:mainfrom
remilapeyre:add-streams-support-to-tarfile
Open

bpo-35227: Add support for file objects of unknown size to tarfile#10714
remilapeyre wants to merge 5 commits intopython:mainfrom
remilapeyre:add-streams-support-to-tarfile

Conversation

@remilapeyre
Copy link

@remilapeyre remilapeyre commented Nov 26, 2018

This commit adds a new method to TarFile to support adding file object
whose size is unknown beforehand:

import tarfile
import urllib.request
tf = tarfile.open('tarfile.tar', 'w')
response = urllib.request.urlopen('http://www.example.com/huge-page.html')
tarinfo = tf.gettarinfo(name='foo', fileobj=response)
tf.addbuffer(tarinfo, response)
tf.close()

The Tar header being written before the data in Tar archive, this requires
to write a first header without knowing the size and then seek back to
overwrite it when all data has been written and the size is known.

This method therefore cannot be used on compressed archives or unseekable
media.

A change has been made in TarFile.gettarinfo to not replace the name
argument by fileobj.name as in memory buffer do not have one. This may
need to be removed and replaced by something like:

if fileobj is not None:
    try:
        name = fileobj.name
    except AttributeError:
        # one of name or fileobj.name should be set
        if name is None:
            raise

if backward compatibility for this behavior is wanted.

https://bugs.python.org/issue35227

This commit adds a new method to TarFile to support adding file object
whose size is unknown beforehand:

    import tarfile
    import urllib.request
    tf = tarfile.open('tarfile.tar', 'w')
    response = urllib.request.urlopen('http://www.example.com/huge-page.html')
    tarinfo = tf.gettarinfo(name='foo', fileobj=response)
    tf.addbuffer(tarinfo, buf)
    tf.close()

The Tar header being written before the data in Tar archive, this requires
to write a first header without knowing the size and then seek back to
overwrite it when all data has been written and the size is known.

This method therefore cannot be used on compressed archives or unseekable
media.

A change has been made in TarFile.gettarinfo to not replace the `name`
argument by `fileobj.name` as in memory buffer do not have one. This may
need to be removed and replaced by something like:

    if fileobj is not None:
        try:
            name = fileobj.name
        except AttributeError:
            # one of name or fileobj.name should be set
            if name is None:
                raise

if backward compatibility for this behavior is wanted.
@mgorny
Copy link
Contributor

mgorny commented Nov 26, 2018

I've just tested it on top of subprocess.Popen(...).stdout, and it works just fine! Good work, thanks!

The *name* parameter accepts a :term:`path-like object`.

.. versionchanged:: 3.8
The *fileobj* attribute can now be an in-memory buffer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attribute or parameter? Neither TarInfo nor TarFile seem to have a documented fileobj attribute.

.. method:: TarFile.addbuffer(tarinfo, buf)

Add the :class:`TarInfo` object *tarinfo* to the archive reading data from
*buf*. The size of *buf* needs not to be known beforehand and *tarinfo.size*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First sentence would be easier to read with a comma: Add . . . to the archive, reading data . . .

What kind of object is buf supposed to be? If it is a file object, say what API it should support (e.g. BufferedIOBase.read method, or the entire readable API of BufferedIOBase). The name buf suggests an in-memory buffer like bytes, which doesn’t match the verb reading well. If this is the case, I suggest changing the parameter name.

If you mean that knowing the amount of data beforehand is allowed but optional, change needs not to be known to need not be known.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your inputs, buf needs to support RawIOBase.read, RawIOBase.tell. I'm not sure how to say so properly in the documentation.

The name is not appropriate, as @mgorny saidaddstream and renaming all buf to stream might be better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be hard to make it work without .tell()? I suppose you could just count the data you read.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to reuse copyfileobj that makes use of shutil.copyfileobj which does not expose this but if I remove the dependency on copyfileobj and I could do this which would be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect tell is not needed. It looks like you only use tell on the tar file object, not the file being added to the tar file. (If tell is available, then SEEK_END probably also works and you can get the file size in advance.)

My suggested wording: “The fileobj argument should implement RawIOBase.read, which is called until the end of the stream is reached.”

:attr:`~io.FileIO.name` attribute, or the *name* argument. The name
should be a text string.
should be a text string. If *fileobj* is an in-memory buffer, a default file
status will be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in-memory buffer should be clarified. If you mean something like an OS pipe, the behaviour should already be obvious (result of stat). But I think you mean some other kind of object.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be called on any object having a tell and read method, I'm not sure how to properly express this in the documentation.

start_pos = dst.tell()
shutil.copyfileobj(src, dst, bufsize)
return
return dst.tell() - start_pos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this branch is only used by your new code, it would be easier to understand if you moved it directly into the call site. Then the rest of the function does not need changing, and it only has one personality.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed used only by my code, from what I could gather calling copyfileobj with length=None is never done and this is not a documented interface.

If we inline this part of the code in addbuffer I think we should remove the support for length=None and raise an error here to avoid mistakes.

return [tarinfo.name for tarinfo in self.getmembers()]

def _getdefaultstat(self):
time = int(datetime.datetime.now().timestamp())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not time.time() like the existing _Stream.init_write_gz method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mistake, thanks.


tarinfo = copy.copy(tarinfo)
# we record the stream as a plain file
tarinfo.type = REGTYPE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn’t it be better to require the caller to set up tarinfo correctly, or for addbuffer to support the other kinds of entries?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This aim of the method is to write data from fileobj and store it as a plain file. Not overriding tarinfo.type cause issue when fileobj is a pipe because tarinfo.pipe is set as FIFOTYPE wich is not supposed to have data associated to it in the Tar archive as far as I can tell.

Do you think changing the signature from addbuffer(self, tarinfo, fileobj) to addbuffer(self, tarinfo, fileobj, type=REGTYPE) would be acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t see the benefit of an additional type argument. What would happen if it disagreed with tarinfo.type?

If the new method was documented for only adding “regular” files, I think setting REGTYPE might be okay.

self._write_header(tarinfo)

bufsize = self.copybufsize
tarinfo.size = copyfileobj(fileobj, self.fileobj, bufsize=bufsize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an internal copy of tarinfo, not the copy passed by the caller. Seems to contradict your documentation.

Copy link
Author

@remilapeyre remilapeyre Nov 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot this, thanks.

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Apr 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review stale Stale PR or inactive for long period of time.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants