Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reading UTF16-encoded text file crashes if \r on 64-char boundary #48824

Closed
sjmachin mannequin opened this issue Dec 7, 2008 · 11 comments
Closed

reading UTF16-encoded text file crashes if \r on 64-char boundary #48824

sjmachin mannequin opened this issue Dec 7, 2008 · 11 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@sjmachin
Copy link
Mannequin

sjmachin mannequin commented Dec 7, 2008

BPO 4574
Nosy @gpshead, @pitrou, @vstinner
Files
  • py30cr64bug.py: This stand-alone script illustrates the problems.
  • incremental_newline_decoder_bug.py
  • test_io.patch
  • incremental_newline_decoder-2.patch
  • utf16_newlines.patch
  • utf16_newlines2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2008-12-14.18:11:50.678>
    created_at = <Date 2008-12-07.11:00:34.827>
    labels = ['interpreter-core', 'type-crash']
    title = 'reading UTF16-encoded text file crashes if \\r on 64-char boundary'
    updated_at = <Date 2008-12-14.18:11:50.677>
    user = 'https://bugs.python.org/sjmachin'

    bugs.python.org fields:

    activity = <Date 2008-12-14.18:11:50.677>
    actor = 'pitrou'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2008-12-14.18:11:50.678>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2008-12-07.11:00:34.827>
    creator = 'sjmachin'
    dependencies = []
    files = ['12260', '12296', '12297', '12298', '12344', '12345']
    hgrepos = []
    issue_num = 4574
    keywords = ['patch']
    message_count = 11.0
    messages = ['77219', '77377', '77378', '77382', '77384', '77755', '77757', '77758', '77760', '77808', '77813']
    nosy_count = 4.0
    nosy_names = ['gregory.p.smith', 'sjmachin', 'pitrou', 'vstinner']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue4574'
    versions = ['Python 3.0']

    @sjmachin
    Copy link
    Mannequin Author

    sjmachin mannequin commented Dec 7, 2008

    Problem in the newline handling in io.py, class
    IncrementalNewlineDecoder, method decode. It reads text files in 128-
    byte chunks. Converting CR LF to \n requires special case handling
    when '\r' is detected at the end of the decoded chunk in case
    there's an LF at the start of the next chunk. It prepends b'\r' (only 1
    byte) to the next chunk's raw bytes and decodes that. But \r in UTF-16
    takes 2 bytes; we are now 1 byte out of kilter and various failures are
    possible (including silently producing garbage output from a truncated
    file with an odd number of bytes).

    The attached script illustrates the problems.

    @sjmachin sjmachin mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Dec 7, 2008
    @vstinner
    Copy link
    Member

    vstinner commented Dec 9, 2008

    The bug is in IncrementalNewlineDecoder, not in the codec nor
    TextIOWrapper.

    @vstinner
    Copy link
    Member

    vstinner commented Dec 9, 2008

    Smaller example to demonstrate the problem.

    @vstinner
    Copy link
    Member

    vstinner commented Dec 9, 2008

    Here is a patch for test_io.py: check the problem by adding new
    encodings to TextIOWrapperTest.testNewlines().

    @vstinner
    Copy link
    Member

    vstinner commented Dec 9, 2008

    Ugly patch to fix this issue:

    • add more regression tests for charsets UTF-16*, UTF-32*
    • add mandatory argument "encoding" to io.IncrementalNewlineDecoder
      constructor => BREAK THE API
    • use the encoding the encode "\r"
    • most ulgy hack: strip the BOM for codecs "UTF-16" and "UTF-32" (when
      encoding "\r" to bytes) => I don't know how to encode "\r" without the
      BOM

    @pitrou
    Copy link
    Member

    pitrou commented Dec 13, 2008

    A couple of suggestions:

    • if IncrementalNewlineDecoder gets an encoding argument, it can also
      instantiate the decoder itself; that way the API is a bit simpler
    - to encode '\r' without the BOM, you can e.g. use an incremental
    encoder and encode it twice:
    >>> enc = codecs.getincrementalencoder('utf16')('strict')
    >>> enc.encode('\r')
    b'\xff\xfe\r\x00'
    >>> enc.encode('\r')
    b'\r\x00'

    I think breaking the API can be ok since the original API is broken
    (witness this bug). There can even be a compatibility mode where we
    check whether encoding has an encode() method, and if it has, take it
    as the decoder object rather than the encoding name.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 13, 2008

    Here is a simpler patch with a different approach and a lot of tests.
    The advantage is that it doesn't break the API.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 13, 2008

    This new variant also removes the dangerous hack in getstate / setstate.

    @gpshead
    Copy link
    Member

    gpshead commented Dec 13, 2008

    utf16_newlines2.patch looks good to me.

    This is a data corruption issue. If it is deferred for 3.0.1 it must be
    fixed in 3.0.2.

    +1 on putting this in 3.0.1.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 14, 2008

    Committed to py3k and release30-maint in r67760 and r67759. Needs
    backporting to 2.x.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 14, 2008

    Backported to trunk and 2.6.2 in r67762 and r67764.

    @pitrou pitrou closed this as completed Dec 14, 2008
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants