GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354
The seekable method on gzip.GzipFile always returns True, even if the underlying buffer is not seekable. However, if seek is called on the GzipFile, the seek will fail unless the underlying buffer is seekable. This can cause consumers of the GzipFile object to mistakenly believe calling seek on the object is safe, when in fact it will lead to an exception. For example, this led to a bug when I was trying to use requests & boto3 to stream & decompress an S3 upload like so:

    resp = requests.get(uri, stream=True)
    decompressed = gzip.GzipFile(fileobj=resp.raw)
    boto3.client('s3').upload_fileobj(decompressed, Bucket=bucket, Key=key)

boto3 checks the seekable method on the GzipFile and chooses a code path based on the file being seekable, but later raises an exception when the seek call fails because the underlying HTTP stream is not seekable.
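One possible workaround on the caller's side is to wrap the stream in a subclass whose seekable() defers to the underlying file object. The sketch below is only an illustration, not a proposed fix for the standard library: StrictSeekableGzipFile and NonSeekableStream are hypothetical names, and the sketch assumes the object passed as fileobj implements seekable() itself.

```python
import gzip
import io


class StrictSeekableGzipFile(gzip.GzipFile):
    """Hypothetical subclass: seekable() mirrors the wrapped stream."""

    def seekable(self):
        # self.fileobj is the object passed as fileobj=...; it is set to
        # None once the GzipFile has been closed.
        fileobj = self.fileobj
        return fileobj is not None and fileobj.seekable()


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a streaming HTTP body: readable only."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        return self._inner.readinto(buffer)


payload = gzip.compress(b"hello world")
stream = StrictSeekableGzipFile(fileobj=NonSeekableStream(payload))
print(stream.seekable())  # reflects the wrapped stream, not a constant
print(stream.read())
```

A consumer that branches on seekable() would then take its non-seekable code path for such a stream. The trade-off is that forward-only seeks, which do not need a seekable underlying stream, are also reported as unavailable.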
Allowing for non-seekable files was added in bpo-1675951. And under that issue in msg117131, the author of the change wrote:

bpo-23529 made significant changes to the code and seekable() is again mentioned in msg239245 and subsequent comments. Nosying the devs who worked on those issues.
If a change is made, it would be nice to bring the “gzip”, “bzip” and LZMA modules closer together. The current “bzip” and LZMA modules rely on the underlying “seekable” method without a fallback implementation, but also have a check for read mode. I think the seeking functionality in these modules is a misfeature. But since it is already here, it is probably best to leave it alone, and just document it. My comment about making “seekable” stricter is at <https://bugs.python.org/review/23529/diff/14296/Lib/gzip.py#oldcode550>.

Even if the underlying stream is not seekable, GzipFile can still fast-forward. Here is a demonstration:

>>> from gzip import GzipFile
>>> from io import BytesIO, UnsupportedOperation
>>> z = BytesIO(bytes.fromhex(
... "1F8B08000000000002FFF348CD29D051F05448CC55282E294DCE56C8CC53485448AFCA"
... "2C5048CBCC490500F44BF0A01F000000"
... ))
>>> def seek(*args): raise UnsupportedOperation()
...
>>> z.seek = seek # Make the underlying stream not seekable
>>> f = GzipFile(fileobj=z)
>>> f.read(10)
b'Help, I am'
>>> f.seek(20) # Fast forward
20
>>> f.read()
b'a gzip file'
>>> f.seek(0) # Rewind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/gzip.py", line 368, in seek
    return self._buffer.seek(offset, whence)
  File "/home/proj/python/cpython/Lib/_compression.py", line 137, in seek
    self._rewind()
  File "/home/proj/python/cpython/Lib/gzip.py", line 515, in _rewind
    super()._rewind()
  File "/home/proj/python/cpython/Lib/_compression.py", line 115, in _rewind
    self._fp.seek(0)
  File "/home/proj/python/cpython/Lib/gzip.py", line 105, in seek
    return self.file.seek(off)
  File "<stdin>", line 1, in seek
io.UnsupportedOperation
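The fast-forward in the session above works because moving forward in a decompressed stream only requires reading and discarding data, which never touches seek() on the underlying file. A minimal sketch of that idea follows; skip_forward is a hypothetical helper written for illustration, not the code used in Lib/_compression.py.

```python
import io


def skip_forward(stream, count, chunk_size=io.DEFAULT_BUFFER_SIZE):
    """Advance a readable stream by up to `count` bytes without seek().

    Returns the number of bytes actually skipped, which is smaller than
    `count` if the stream ends first.
    """
    remaining = count
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # end of stream reached
        remaining -= len(chunk)
    return count - remaining
```

Rewinding has no such equivalent: going backwards means starting decompression again from the beginning, which is why the seek(0) call in the session needs the underlying stream to be seekable.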
I share Martin's opinion that this is a misfeature. User code can check seekable() and use seek() if it returns True, or cache the necessary data in memory if it returns False, because seek() is expected to be more efficient. But in the case of GzipFile it is not efficient: it can lead to decompressing the whole content of the file and to much worse performance.
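The check-then-choose pattern described here might look like the following in user code. This is a sketch under assumptions: ensure_rewindable is a hypothetical helper, and it buffers everything in a BytesIO, which is only reasonable when the data fits in memory.

```python
import io


def ensure_rewindable(stream):
    """Return a stream that is expected to support seek(0).

    Seekable streams are returned unchanged; anything else is read
    fully into memory so that it can be rewound later.
    """
    if stream.seekable():
        return stream
    return io.BytesIO(stream.read())
```

A helper like this trusts seekable() completely. Because GzipFile reports True unconditionally, it would be passed through unchanged, and a later rewind would either re-decompress the data from the start or fail if the underlying stream cannot seek.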
And I share Martin's concern about fast-forward with an unseekable underlying file. If this works in the current code, we can't simply break it. This may mean that we can't change the implementation of GzipFile.seekable() at all, even if it lies in some cases.