Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading with bz2.BZ2File() returns one garbage character #44233

Closed
cpn mannequin opened this issue Nov 15, 2006 · 8 comments
Closed

Reading with bz2.BZ2File() returns one garbage character #44233

cpn mannequin opened this issue Nov 15, 2006 · 8 comments
Labels
extension-modules C modules in the Modules dir

Comments

@cpn
Copy link
Mannequin

cpn mannequin commented Nov 15, 2006

BPO 1597011
Nosy @birkenfeld
Files
  • bzp.py: python script to reproduce the bug
  • python-trunk-bz2.patch
  • python-trunk-bz2-v2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2007-09-17.05:48:14.732>
    created_at = <Date 2006-11-15.14:19:09.000>
    labels = ['extension-modules']
    title = 'Reading with bz2.BZ2File() returns one garbage character'
    updated_at = <Date 2007-09-17.06:48:23.973>
    user = 'https://bugs.python.org/cpn'

    bugs.python.org fields:

    activity = <Date 2007-09-17.06:48:23.973>
    actor = 'jafo'
    assignee = 'jafo'
    closed = True
    closed_date = <Date 2007-09-17.05:48:14.732>
    closer = 'jafo'
    components = ['Extension Modules']
    creation = <Date 2006-11-15.14:19:09.000>
    creator = 'cpn'
    dependencies = []
    files = ['2214', '8340', '8359']
    hgrepos = []
    issue_num = 1597011
    keywords = []
    message_count = 8.0
    messages = ['30548', '30549', '30550', '30551', '30552', '55363', '55469', '55950']
    nosy_count = 3.0
    nosy_names = ['georg.brandl', 'jafo', 'cpn']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1597011'
    versions = ['Python 2.4']

    @cpn
    Copy link
    Mannequin Author

    cpn mannequin commented Nov 15, 2006

    When comparing two files which should be equal the last line is
    different:

    The first file is a bzip2 compressed file and is read with
    bz2.BZ2File()
    The second file is the same file uncompressed and read with open()

    The first file named file.txt.bz2 is uncompressed with:

    $ bunzip2 -k file.txt.bz2

    To compare I use this script:
    ###############################
    import bz2

    f1 = bz2.BZ2File(r'file.txt.bz2', 'r')
    f2 = open(r'file.txt', 'r')
    lines = 0
    while True:
       line1 = f1.readline()
       line2 = f2.readline()
       if line1 == '':
          break
       lines += 1
       if line1 != line2:
          print 'line number:', lines
          print repr(line1)
          print repr(line2)
    f1.close()
    f2.close()
    ##############################

    Output:

    $ python bzp.py
    line number: 588317
    '\x07'
    '' 

    The offending attached file is 5.5 MB. Sorry, i could not reproduce this problem
    with a smaller file.

    Tested in Fedora Core 5 and Python 2.4.3

    @cpn cpn mannequin added extension-modules C modules in the Modules dir labels Nov 15, 2006
    @cpn
    Copy link
    Mannequin Author

    cpn mannequin commented Nov 15, 2006

    I can't upload the bz2 sample file. So it is here:
    http://fahstats.com/img/file.txt.bz2

    @birkenfeld
    Copy link
    Member

    With your file, I can reproduce that on Linux, Python 2.5.

    Which compressor did you compress your file with?
    I unpacked it with bunzip2 without problems, then recompressed it with bzip2, which resulted
    in a slightly smaller (51 bytes) file, which then didn't trigger the bug.

    @cpn
    Copy link
    Mannequin Author

    cpn mannequin commented Nov 15, 2006

    I received this file already compressed. I don't know what was the used compressor.
    There is no error if i test the compressed file with:

    $ bzip2 -t file.txt.bz2

    @jafo
    Copy link
    Mannequin

    jafo mannequin commented Aug 28, 2007

    There are some bugs in the bz2 module. The problem boils down to the
    following code, notice how *c is assigned *BEFORE* the check to see if
    there was a read error:

    do {
    BZ2_bzRead(&bzerror, f->fp, &c, 1);
    f->pos++;
    *buf++ = c;
    } while (bzerror == BZ_OK && c != '\n' && buf != end);

    This could be fixed by putting a "if (bzerror == BZ_OK) break;" after
    the BZ2_bzRead() call.

    However, I also noticed that in the universal newline section of the
    code it is reading a character, incrementing f->pos, *THEN* checking if
    buf == end and if so is throwing away the character.

    I changed the code around so that the read loop is unified between
    universal newlines and regular newlines. I guess this is a small
    performance penalty, since it's checking the newline mode for each
    character read, however we're already doing a system call for every
    character so one additional comparison and jump to merge duplicate code
    for maintenance reasons is probably a good plan. Especially since the
    reason for this bug only existed in one of the two duplicated parts of
    the code.

    Please let me know if this looks good to commit.

    @jafo jafo mannequin assigned jafo Aug 28, 2007
    @jafo
    Copy link
    Mannequin

    jafo mannequin commented Aug 30, 2007

    Found some problems in the previous version, this one passes the tests
    and I've also spent time reviewing the code and I think this is correct.
    Part of the problem is that only bzerror was being checked, not the
    number of bytes read. When bzerror is not BZ_OK, the code expects that
    it returns a byte that was read, but in some cases it returns an error
    when no bytes were read.

    This code passes the test and also correctly handles the bz2 file that
    is the object of this bug.

    @jafo
    Copy link
    Mannequin

    jafo mannequin commented Sep 17, 2007

    I have committed this into trunk and the 2.5 maintenance branch. It
    passes all tests and the resulting build passes the submitter-provided test.

    @jafo jafo mannequin closed this as completed Sep 17, 2007
    @jafo jafo mannequin closed this as completed Sep 17, 2007
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant