bpo-35227: Add support for file objects of unknown size to tarfile#10714
bpo-35227: Add support for file objects of unknown size to tarfile#10714remilapeyre wants to merge 5 commits intopython:mainfrom
Conversation
This commit adds a new method to TarFile to support adding file object
whose size is unknown beforehand:
import tarfile
import urllib.request
tf = tarfile.open('tarfile.tar', 'w')
response = urllib.request.urlopen('http://www.example.com/huge-page.html')
tarinfo = tf.gettarinfo(name='foo', fileobj=response)
tf.addbuffer(tarinfo, buf)
tf.close()
The Tar header being written before the data in Tar archive, this requires
to write a first header without knowing the size and then seek back to
overwrite it when all data has been written and the size is known.
This method therefore cannot be used on compressed archives or unseekable
media.
A change has been made in TarFile.gettarinfo to not replace the `name`
argument by `fileobj.name` as in memory buffer do not have one. This may
need to be removed and replaced by something like:
if fileobj is not None:
try:
name = fileobj.name
except AttributeError:
# one of name or fileobj.name should be set
if name is None:
raise
if backward compatibility for this behavior is wanted.
|
I've just tested it on top of |
| The *name* parameter accepts a :term:`path-like object`. | ||
|
|
||
| .. versionchanged:: 3.8 | ||
| The *fileobj* attribute can now be an in-memory buffer |
There was a problem hiding this comment.
Attribute or parameter? Neither TarInfo nor TarFile seem to have a documented fileobj attribute.
| .. method:: TarFile.addbuffer(tarinfo, buf) | ||
|
|
||
| Add the :class:`TarInfo` object *tarinfo* to the archive reading data from | ||
| *buf*. The size of *buf* needs not to be known beforehand and *tarinfo.size* |
There was a problem hiding this comment.
First sentence would be easier to read with a comma: Add . . . to the archive, reading data . . .
What kind of object is buf supposed to be? If it is a file object, say what API it should support (e.g. BufferedIOBase.read method, or the entire readable API of BufferedIOBase). The name buf suggests an in-memory buffer like bytes, which doesn’t match the verb reading well. If this is the case, I suggest changing the parameter name.
If you mean that knowing the amount of data beforehand is allowed but optional, change needs not to be known to need not be known.
There was a problem hiding this comment.
Thanks for all your inputs, buf needs to support RawIOBase.read, RawIOBase.tell. I'm not sure how to say so properly in the documentation.
The name is not appropriate, as @mgorny saidaddstream and renaming all buf to stream might be better.
There was a problem hiding this comment.
Would it be hard to make it work without .tell()? I suppose you could just count the data you read.
There was a problem hiding this comment.
I wanted to reuse copyfileobj that makes use of shutil.copyfileobj which does not expose this but if I remove the dependency on copyfileobj and I could do this which would be better.
There was a problem hiding this comment.
I suspect tell is not needed. It looks like you only use tell on the tar file object, not the file being added to the tar file. (If tell is available, then SEEK_END probably also works and you can get the file size in advance.)
My suggested wording: “The fileobj argument should implement RawIOBase.read, which is called until the end of the stream is reached.”
| :attr:`~io.FileIO.name` attribute, or the *name* argument. The name | ||
| should be a text string. | ||
| should be a text string. If *fileobj* is an in-memory buffer, a default file | ||
| status will be used. |
There was a problem hiding this comment.
I think in-memory buffer should be clarified. If you mean something like an OS pipe, the behaviour should already be obvious (result of stat). But I think you mean some other kind of object.
There was a problem hiding this comment.
This method can be called on any object having a tell and read method, I'm not sure how to properly express this in the documentation.
| start_pos = dst.tell() | ||
| shutil.copyfileobj(src, dst, bufsize) | ||
| return | ||
| return dst.tell() - start_pos |
There was a problem hiding this comment.
If this branch is only used by your new code, it would be easier to understand if you moved it directly into the call site. Then the rest of the function does not need changing, and it only has one personality.
There was a problem hiding this comment.
This is indeed used only by my code, from what I could gather calling copyfileobj with length=None is never done and this is not a documented interface.
If we inline this part of the code in addbuffer I think we should remove the support for length=None and raise an error here to avoid mistakes.
| return [tarinfo.name for tarinfo in self.getmembers()] | ||
|
|
||
| def _getdefaultstat(self): | ||
| time = int(datetime.datetime.now().timestamp()) |
There was a problem hiding this comment.
Why not time.time() like the existing _Stream.init_write_gz method?
|
|
||
| tarinfo = copy.copy(tarinfo) | ||
| # we record the stream as a plain file | ||
| tarinfo.type = REGTYPE |
There was a problem hiding this comment.
Wouldn’t it be better to require the caller to set up tarinfo correctly, or for addbuffer to support the other kinds of entries?
There was a problem hiding this comment.
This aim of the method is to write data from fileobj and store it as a plain file. Not overriding tarinfo.type cause issue when fileobj is a pipe because tarinfo.pipe is set as FIFOTYPE wich is not supposed to have data associated to it in the Tar archive as far as I can tell.
Do you think changing the signature from addbuffer(self, tarinfo, fileobj) to addbuffer(self, tarinfo, fileobj, type=REGTYPE) would be acceptable?
There was a problem hiding this comment.
I don’t see the benefit of an additional type argument. What would happen if it disagreed with tarinfo.type?
If the new method was documented for only adding “regular” files, I think setting REGTYPE might be okay.
| self._write_header(tarinfo) | ||
|
|
||
| bufsize = self.copybufsize | ||
| tarinfo.size = copyfileobj(fileobj, self.fileobj, bufsize=bufsize) |
There was a problem hiding this comment.
This is an internal copy of tarinfo, not the copy passed by the caller. Seems to contradict your documentation.
|
This PR is stale because it has been open for 30 days with no activity. |
This commit adds a new method to TarFile to support adding file object
whose size is unknown beforehand:
The Tar header being written before the data in Tar archive, this requires
to write a first header without knowing the size and then seek back to
overwrite it when all data has been written and the size is known.
This method therefore cannot be used on compressed archives or unseekable
media.
A change has been made in TarFile.gettarinfo to not replace the
nameargument by
fileobj.nameas in memory buffer do not have one. This mayneed to be removed and replaced by something like:
if backward compatibility for this behavior is wanted.
https://bugs.python.org/issue35227