GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354

waltaskew · 2018-03-28T16:28:48Z

BPO	33173
Nosy	@pitrou, @vadmium, @serhiy-storchaka, @csabella, @waltaskew, @tirkarthi
PRs	bpo-33173: Return Underlying fileobj's Seekability in GzipFile.seekable #6303

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-03-28.16:28:47.984>
labels = ['3.8', 'library']
title = "GzipFile's .seekable() returns True even if underlying buffer is not seekable"
updated_at = <Date 2018-11-09.08:55:32.150>
user = 'https://github.com/waltaskew'

bugs.python.org fields:

activity = <Date 2018-11-09.08:55:32.150>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2018-03-28.16:28:47.984>
creator = 'Walt Askew'
dependencies = []
files = []
hgrepos = []
issue_num = 33173
keywords = ['patch']
message_count = 5.0
messages = ['314613', '327206', '327226', '329505', '329506']
nosy_count = 6.0
nosy_names = ['pitrou', 'martin.panter', 'serhiy.storchaka', 'cheryl.sabella', 'Walt Askew', 'xtreak']
pr_nums = ['6303']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue33173'
versions = ['Python 3.8']

waltaskew · 2018-03-28T16:28:48Z

The seekable method on gzip.GzipFile always returns True, even if the underlying buffer is not seekable. However, if seek is called on the GzipFile, the seek will fail unless the underlying buffer is seekable. This can cause consumers of the GzipFile object to mistakenly believe calling seek on the object is safe, when in fact it will lead to an exception.

For example, this led to a bug when I was trying to use requests & boto3 to stream & decompress an S3 upload like so:

resp = requests.get(uri, stream=True)
decompressed = gzip.GzipFile(fileobj=resp.raw)
boto3.client('s3').upload_fileobj(decompressed, Bucket=bucket, Key=key)

boto3 checks the seekable method on the the GzipFile, chooses a code path based on the file being seekable but later raises an exception when the seek call fails because the underlying HTTP stream is not seekable.

csabella · 2018-10-06T01:04:28Z

Allowing for non seekable files was added in bpo-1675951. And under that issue in msg117131, the author of the change wrote:
"The patch creates another problem with is not yet fixed: The implementation of .seekable() is becoming wrong. As one can now use non seekable files the implementation should check if the file object used for reading is really seekable."

bpo-23529 made significant changes to the code and seekable() is again mentioned in msg239245 and subsequent comments.

Nosying the devs who worked on those issues.

vadmium · 2018-10-06T06:03:22Z

If a change is made, it would be nice to bring the “gzip”, “bzip” and LZMA modules closer together. The current “bzip” and LZMA modules rely on the underlying “seekable” method without a fallback implementation, but also have a check for read mode.

I think the seeking functionality in these modules is a misfeature. But since it is already here, it is probably best to leave it alone, and just document it.

My comment about making “seekable” stricter is at <https://bugs.python.org/review/23529/diff/14296/Lib/gzip.py#oldcode550\>. Even if the underlying stream is not seekable, GzipFile can still fast-forward. Here is a demonstration:

>>> z = BytesIO(bytes.fromhex(
...     "1F8B08000000000002FFF348CD29D051F05448CC55282E294DCE56C8CC53485448AFCA"
...     "2C5048CBCC490500F44BF0A01F000000"
... ))
>>> def seek(*args): raise UnsupportedOperation()
... 
>>> z.seek = seek  # Make the underlying stream not seekable
>>> f = GzipFile(fileobj=z)
>>> f.read(10)
b'Help, I am'
>>> f.seek(20)  # Fast forward
20
>>> f.read()
b'a gzip file'
>>> f.seek(0)  # Rewind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/gzip.py", line 368, in seek
    return self._buffer.seek(offset, whence)
  File "/home/proj/python/cpython/Lib/_compression.py", line 137, in seek
    self._rewind()
  File "/home/proj/python/cpython/Lib/gzip.py", line 515, in _rewind
    super()._rewind()
  File "/home/proj/python/cpython/Lib/_compression.py", line 115, in _rewind
    self._fp.seek(0)
  File "/home/proj/python/cpython/Lib/gzip.py", line 105, in seek
    return self.file.seek(off)
  File "<stdin>", line 1, in seek
io.UnsupportedOperation

serhiy-storchaka · 2018-11-09T08:45:29Z

I share Martin's opinion that this is a misfeature. User code can check seekable() and use seek() if it returns True or cache necessary data in memory if it returns False, because it is expected that seek() is more efficient. But in case of GzipFile it is not efficient, and can lead to decompression the whole content of the file and to much worse performance.

serhiy-storchaka · 2018-11-09T08:55:32Z

And I share Martin's concern about fast-forward with an unseekable underlying file. If this works in current code, we can't simply return break it. This may mean that we can't change the implementation of GzipFile.seekable() at all, even if it lies in some cases.

waltaskew mannequin added the stdlib Python modules in the Lib dir label Mar 28, 2018

csabella added the 3.8 only security fixes label Oct 6, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354

GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354

waltaskew mannequin commented Mar 28, 2018

waltaskew mannequin commented Mar 28, 2018

csabella commented Oct 6, 2018

vadmium commented Oct 6, 2018

serhiy-storchaka commented Nov 9, 2018

serhiy-storchaka commented Nov 9, 2018

GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354

GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354

Comments

waltaskew mannequin commented Mar 28, 2018

waltaskew mannequin commented Mar 28, 2018

csabella commented Oct 6, 2018

vadmium commented Oct 6, 2018

serhiy-storchaka commented Nov 9, 2018

serhiy-storchaka commented Nov 9, 2018