Iteration breaks with bz2.open(filename,'rt') #59751

Closed
dabeaz mannequin opened this issue Aug 3, 2012 · 16 comments
Labels
stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

@dabeaz
Mannequin

dabeaz mannequin commented Aug 3, 2012

BPO 15546
Nosy @pitrou, @serhiy-storchaka
Files
  • access-log-0108.bz2

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-08-05.00:28:17.872>
    created_at = <Date 2012-08-03.09:04:33.768>
    labels = ['type-bug', 'library']
    title = "Iteration breaks with bz2.open(filename,'rt')"
    updated_at = <Date 2013-01-22.13:59:29.889>
    user = 'https://bugs.python.org/dabeaz'

    bugs.python.org fields:

    activity = <Date 2013-01-22.13:59:29.889>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-08-05.00:28:17.872>
    closer = 'nadeem.vawda'
    components = ['Library (Lib)']
    creation = <Date 2012-08-03.09:04:33.768>
    creator = 'dabeaz'
    dependencies = []
    files = ['26673']
    hgrepos = []
    issue_num = 15546
    keywords = []
    message_count = 16.0
    messages = ['167299', '167305', '167308', '167369', '167370', '167397', '167407', '167408', '167461', '167462', '167470', '167493', '167497', '167501', '167505', '180388']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'nadeem.vawda', 'dabeaz', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue15546'
    versions = ['Python 3.3']

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Aug 3, 2012

    The bz2 module in Python 3.3b1 doesn't properly support iteration in text mode. Example:

    >>> import bz2
    >>> f = bz2.open('access-log-0108.bz2')
    >>> next(f)       # Works
    b'140.180.132.213 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ply.html HTTP/1.1" 200 97238\n'
    
    >>> g = bz2.open('access-log-0108.bz2','rt')
    >>> next(g)       # Fails
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration
    >>>
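
For readers without the attached log file, the following is a minimal sketch that exercises the same text-mode iteration path on in-memory data. `TrickleReader` is a hypothetical helper, not part of the report: it assumes that handing the decompressor only a few compressed bytes per read is enough to reach the "no output yet" situation. On an interpreter that includes the fix, iteration is expected to work.

```python
import bz2
import io


class TrickleReader(io.BytesIO):
    """Hypothetical file object that returns at most 7 bytes per read()."""

    def read(self, size=-1):
        # Cap every request so the decompressor is fed tiny pieces.
        if size is None or size < 0 or size > 7:
            size = 7
        return super().read(size)


lines = ["line %d\n" % i for i in range(1000)]
compressed = bz2.compress("".join(lines).encode("ascii"))

# bz2.open() also accepts an existing file object in place of a filename.
with bz2.open(TrickleReader(compressed), "rt", encoding="ascii") as g:
    first = next(g)   # raised StopIteration in the report above
    rest = list(g)
```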

    @dabeaz (mannequin) added the stdlib and type-bug labels Aug 3, 2012
    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Aug 3, 2012

    I can't seem to reproduce this with an up-to-date checkout from Mercurial:

        >>> import bz2
        >>> g = bz2.open('access-log-0108.bz2','rt')
        >>> next(g)
        '140.180.132.213 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ply.html HTTP/1.1" 200 97238\n'

    (where 'access-log-0108.bz2' is a file I created with the output above as
    its first line, and a couple of other lines of random junk following that)

    Would it be possible for you to upload the file you used to trigger this
    bug?

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Aug 3, 2012

    File attached. The file can be read in its entirety in binary mode.

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Aug 3, 2012

    The cause of this problem is that BZ2File.read1() sometimes returns b"", even though
    the file is not at EOF. This happens when the underlying BZ2Decompressor cannot produce
    any decompressed data from just the block passed to it in _fill_buffer(); in this case, it needs to read more of the compressed stream to make progress.
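
The decompressor behavior described here can be observed directly. This is only an illustrative sketch on made-up data: it feeds `BZ2Decompressor` the first 10 bytes of a stream (the header and block marker) and shows that it produces no output yet, even though the stream is not finished.

```python
import bz2

payload = "".join("line %d\n" % i for i in range(1000)).encode("ascii")
compressed = bz2.compress(payload)

decomp = bz2.BZ2Decompressor()

# Too little input for the decompressor to emit anything yet.
first = decomp.decompress(compressed[:10])
eof_after_first = decomp.eof

# Supplying the remainder of the stream lets it make progress.
rest = decomp.decompress(compressed[10:])
```

An empty result from a single `decompress()` call therefore cannot be treated as end-of-file; only `decomp.eof` says that.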

    It would seem that BZ2File cannot satisfy the contract of the read1() method - we
    can't guarantee that a single call to the read() method of the underlying file will
    allow us to return a non-empty result, whereas returning b"" is reserved for the
    case where we have reached EOF.

    Removing the read1() method would simply trade this problem for a bigger one
    (resurrecting bpo-10791), so I propose amending BZ2File.read1() to make as many reads
    from the underlying file as necessary to return a non-empty result.
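
The proposal can be sketched roughly as follows. `SketchReader` is a hypothetical, heavily simplified stand-in and not the actual `BZ2File` code (it ignores the size argument, buffering, and multi-stream files); it only illustrates looping over reads until some decompressed data is available.

```python
import bz2
import io


class SketchReader:
    """Hypothetical simplified reader; not the real BZ2File."""

    def __init__(self, fileobj, chunk_size=8):
        self._fp = fileobj
        self._chunk_size = chunk_size
        self._decomp = bz2.BZ2Decompressor()

    def read1(self):
        # Read as many compressed chunks as needed to get some output;
        # b"" is returned only once the stream has really ended.
        while not self._decomp.eof:
            raw = self._fp.read(self._chunk_size)
            if not raw:
                raise EOFError("compressed data ended before end of stream")
            data = self._decomp.decompress(raw)
            if data:
                return data
        return b""


payload = "".join("line %d\n" % i for i in range(1000)).encode("ascii")
reader = SketchReader(io.BytesIO(bz2.compress(payload)))

chunks = []
while True:
    chunk = reader.read1()
    if not chunk:
        break
    chunks.append(chunk)
```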

    Antoine, what do you think of this?

    @pitrou
    Member

    pitrou commented Aug 3, 2012

    > I propose amending BZ2File.read1() to make as many reads
    > from the underlying file as necessary to return a non-empty result.

    Agreed. IMO, read1()'s contract should be read as a best-effort thing, not an absolute guarantee. Returning an empty string when there is still data available is wrong.

    @serhiy-storchaka
    Member

    I encountered this when implementing bzip2 support in zipfile (bpo-14371). I also solved it by rewriting read() and read1() to make as many reads from the underlying file as necessary to return a non-empty result.

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 4, 2012

    New changeset cdf27a213bd2 by Nadeem Vawda in branch 'default':
    bpo-15546: Fix BZ2File.read1()'s handling of pathological input data.
    http://hg.python.org/cpython/rev/cdf27a213bd2

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Aug 4, 2012

    OK, BZ2File should now be fixed. It looks like LZMAFile and GzipFile may
    be susceptible to the same problem; I'll push fixes for them shortly.

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 5, 2012

    New changeset 5284e65e865b by Nadeem Vawda in branch 'default':
    bpo-15546: Fix {GzipFile,LZMAFile}.read1()'s handling of pathological input data.
    http://hg.python.org/cpython/rev/5284e65e865b

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Aug 5, 2012

    Done.

    Thanks for the bug report, David.

    @nadeemvawda nadeemvawda mannequin closed this as completed Aug 5, 2012
    @serhiy-storchaka
    Member

    What about peek()?

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Aug 5, 2012

    Before these fixes, it looks like all three classes' peek() methods were susceptible
    to the same problem as read1().

    The fixes for BZ2File.read1() and LZMAFile.read1() should have fixed peek() as well;
    both methods are implemented in terms of _fill_buffer().
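
To illustrate why fixing the shared helper covers both methods, here is a hypothetical sketch (the class name and its internals are assumptions, not the real `BZ2File` or `LZMAFile` code) in which `peek()` and `read1()` both rely on one `_fill_buffer()` that loops until data is available.

```python
import bz2
import io


class PeekableSketch:
    """Hypothetical reader whose peek() and read1() share _fill_buffer()."""

    def __init__(self, fileobj, chunk_size=8):
        self._fp = fileobj
        self._chunk_size = chunk_size
        self._decomp = bz2.BZ2Decompressor()
        self._buffer = b""

    def _fill_buffer(self):
        # Return True once the buffer holds data, False only at real EOF.
        while not self._buffer:
            if self._decomp.eof:
                return False
            raw = self._fp.read(self._chunk_size)
            if not raw:
                raise EOFError("compressed data ended before end of stream")
            self._buffer = self._decomp.decompress(raw)
        return True

    def peek(self):
        # Show buffered data without consuming it.
        return self._buffer if self._fill_buffer() else b""

    def read1(self):
        # Hand out the buffered data and clear the buffer.
        if not self._fill_buffer():
            return b""
        data, self._buffer = self._buffer, b""
        return data


payload = "".join("line %d\n" % i for i in range(1000)).encode("ascii")
reader = PeekableSketch(io.BytesIO(bz2.compress(payload)))

peeked = reader.peek()
first_read = reader.read1()
remaining = b""
while True:
    chunk = reader.read1()
    if not chunk:
        break
    remaining += chunk
```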

    For GzipFile, peek() is still potentially broken - I'll push a fix shortly.

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 5, 2012

    New changeset 8c07ff7f882f by Nadeem Vawda in branch 'default':
    bpo-15546: Also fix GzipFile.peek().
    http://hg.python.org/cpython/rev/8c07ff7f882f

    @serhiy-storchaka
    Member

    I have a doubt. Won't this loop forever if the compressed data ends exactly at the end of a read block? Maybe instead of "while self.extrasize <= 0:" it would be worth writing "while self.extrasize <= 0 and self.fileobj is not None:"?

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Aug 5, 2012

    No, if _read() is called once the file is already at EOF, it raises an
    EOFError (http://hg.python.org/cpython/file/8c07ff7f882f/Lib/gzip.py#l433),
    which will then break out of the loop.
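
The termination argument can be illustrated with a toy model. `ToyReader` is entirely hypothetical and is not the code in gzip.py; it assumes only what is stated above, namely that `_read()` raises `EOFError` when called at EOF and that the exception is what ends the `while self.extrasize <= 0:` loop.

```python
class ToyReader:
    """Hypothetical stand-in for the loop under discussion; not gzip.py."""

    def __init__(self, blocks):
        self._blocks = list(blocks)
        self.extrasize = 0
        self.reads = 0

    def _read(self):
        self.reads += 1
        if not self._blocks:
            # Calling _read() at EOF raises instead of returning nothing.
            raise EOFError("Reached EOF")
        self.extrasize += len(self._blocks.pop(0))

    def fill(self):
        try:
            while self.extrasize <= 0:
                self._read()
        except EOFError:
            # The exception, not the loop condition, ends the loop at EOF.
            pass
        return self.extrasize


with_data = ToyReader([b"", b"", b"abc"])
got_data = with_data.fill()

only_empty = ToyReader([b"", b""])
got_nothing = only_empty.fill()
```

In the second case every block yields no data, yet the loop still stops on the third `_read()` call because of the `EOFError`.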

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 22, 2013

    New changeset 0f25119ceee8 by Serhiy Storchaka in branch '3.2':
    bpo-15546: Fix GzipFile.peek()'s handling of pathological input data.
    http://hg.python.org/cpython/rev/0f25119ceee8

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022