Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354

Open
waltaskew mannequin opened this issue Mar 28, 2018 · 5 comments
Open

GzipFile's .seekable() returns True even if underlying buffer is not seekable #77354

waltaskew mannequin opened this issue Mar 28, 2018 · 5 comments
Labels
3.8 only security fixes stdlib Python modules in the Lib dir

Comments

@waltaskew
Copy link
Mannequin

waltaskew mannequin commented Mar 28, 2018

BPO 33173
Nosy @pitrou, @vadmium, @serhiy-storchaka, @csabella, @waltaskew, @tirkarthi
PRs
  • bpo-33173: Return Underlying fileobj's Seekability in GzipFile.seekable #6303
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-03-28.16:28:47.984>
    labels = ['3.8', 'library']
    title = "GzipFile's .seekable() returns True even if underlying buffer is not seekable"
    updated_at = <Date 2018-11-09.08:55:32.150>
    user = 'https://github.com/waltaskew'

    bugs.python.org fields:

    activity = <Date 2018-11-09.08:55:32.150>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2018-03-28.16:28:47.984>
    creator = 'Walt Askew'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33173
    keywords = ['patch']
    message_count = 5.0
    messages = ['314613', '327206', '327226', '329505', '329506']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'martin.panter', 'serhiy.storchaka', 'cheryl.sabella', 'Walt Askew', 'xtreak']
    pr_nums = ['6303']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue33173'
    versions = ['Python 3.8']

    @waltaskew
    Copy link
    Mannequin Author

    waltaskew mannequin commented Mar 28, 2018

    The seekable method on gzip.GzipFile always returns True, even if the underlying buffer is not seekable. However, if seek is called on the GzipFile, the seek will fail unless the underlying buffer is seekable. This can cause consumers of the GzipFile object to mistakenly believe calling seek on the object is safe, when in fact it will lead to an exception.

    For example, this led to a bug when I was trying to use requests & boto3 to stream & decompress an S3 upload like so:

    resp = requests.get(uri, stream=True)
    decompressed = gzip.GzipFile(fileobj=resp.raw)
    boto3.client('s3').upload_fileobj(decompressed, Bucket=bucket, Key=key)

    boto3 checks the seekable method on the the GzipFile, chooses a code path based on the file being seekable but later raises an exception when the seek call fails because the underlying HTTP stream is not seekable.

    @waltaskew waltaskew mannequin added the stdlib Python modules in the Lib dir label Mar 28, 2018
    @csabella
    Copy link
    Contributor

    csabella commented Oct 6, 2018

    Allowing for non seekable files was added in bpo-1675951. And under that issue in msg117131, the author of the change wrote:
    "The patch creates another problem with is not yet fixed: The implementation of .seekable() is becoming wrong. As one can now use non seekable files the implementation should check if the file object used for reading is really seekable."

    bpo-23529 made significant changes to the code and seekable() is again mentioned in msg239245 and subsequent comments.

    Nosying the devs who worked on those issues.

    @csabella csabella added the 3.8 only security fixes label Oct 6, 2018
    @vadmium
    Copy link
    Member

    vadmium commented Oct 6, 2018

    If a change is made, it would be nice to bring the “gzip”, “bzip” and LZMA modules closer together. The current “bzip” and LZMA modules rely on the underlying “seekable” method without a fallback implementation, but also have a check for read mode.

    I think the seeking functionality in these modules is a misfeature. But since it is already here, it is probably best to leave it alone, and just document it.

    My comment about making “seekable” stricter is at <https://bugs.python.org/review/23529/diff/14296/Lib/gzip.py#oldcode550\>. Even if the underlying stream is not seekable, GzipFile can still fast-forward. Here is a demonstration:

    >>> z = BytesIO(bytes.fromhex(
    ...     "1F8B08000000000002FFF348CD29D051F05448CC55282E294DCE56C8CC53485448AFCA"
    ...     "2C5048CBCC490500F44BF0A01F000000"
    ... ))
    >>> def seek(*args): raise UnsupportedOperation()
    ... 
    >>> z.seek = seek  # Make the underlying stream not seekable
    >>> f = GzipFile(fileobj=z)
    >>> f.read(10)
    b'Help, I am'
    >>> f.seek(20)  # Fast forward
    20
    >>> f.read()
    b'a gzip file'
    >>> f.seek(0)  # Rewind
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/proj/python/cpython/Lib/gzip.py", line 368, in seek
        return self._buffer.seek(offset, whence)
      File "/home/proj/python/cpython/Lib/_compression.py", line 137, in seek
        self._rewind()
      File "/home/proj/python/cpython/Lib/gzip.py", line 515, in _rewind
        super()._rewind()
      File "/home/proj/python/cpython/Lib/_compression.py", line 115, in _rewind
        self._fp.seek(0)
      File "/home/proj/python/cpython/Lib/gzip.py", line 105, in seek
        return self.file.seek(off)
      File "<stdin>", line 1, in seek
    io.UnsupportedOperation

    @serhiy-storchaka
    Copy link
    Member

    I share Martin's opinion that this is a misfeature. User code can check seekable() and use seek() if it returns True or cache necessary data in memory if it returns False, because it is expected that seek() is more efficient. But in case of GzipFile it is not efficient, and can lead to decompression the whole content of the file and to much worse performance.

    @serhiy-storchaka
    Copy link
    Member

    And I share Martin's concern about fast-forward with an unseekable underlying file. If this works in current code, we can't simply return break it. This may mean that we can't change the implementation of GzipFile.seekable() at all, even if it lies in some cases.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes stdlib Python modules in the Lib dir
    Projects
    Status: No status
    Development

    No branches or pull requests

    3 participants