reading UTF16-encoded text file crashes if \r on 64-char boundary #48824

sjmachin · 2008-12-07T11:00:35Z

BPO	4574
Nosy	@gpshead, @pitrou, @vstinner
Files	py30cr64bug.py: This stand-alone script illustrates the problems. incremental_newline_decoder_bug.py test_io.patch incremental_newline_decoder-2.patch utf16_newlines.patch utf16_newlines2.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/pitrou'
closed_at = <Date 2008-12-14.18:11:50.678>
created_at = <Date 2008-12-07.11:00:34.827>
labels = ['interpreter-core', 'type-crash']
title = 'reading UTF16-encoded text file crashes if \\r on 64-char boundary'
updated_at = <Date 2008-12-14.18:11:50.677>
user = 'https://bugs.python.org/sjmachin'

bugs.python.org fields:

activity = <Date 2008-12-14.18:11:50.677>
actor = 'pitrou'
assignee = 'pitrou'
closed = True
closed_date = <Date 2008-12-14.18:11:50.678>
closer = 'pitrou'
components = ['Interpreter Core']
creation = <Date 2008-12-07.11:00:34.827>
creator = 'sjmachin'
dependencies = []
files = ['12260', '12296', '12297', '12298', '12344', '12345']
hgrepos = []
issue_num = 4574
keywords = ['patch']
message_count = 11.0
messages = ['77219', '77377', '77378', '77382', '77384', '77755', '77757', '77758', '77760', '77808', '77813']
nosy_count = 4.0
nosy_names = ['gregory.p.smith', 'sjmachin', 'pitrou', 'vstinner']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue4574'
versions = ['Python 3.0']

sjmachin · 2008-12-07T11:00:33Z

Problem in the newline handling in io.py, class
IncrementalNewlineDecoder, method decode. It reads text files in 128-
byte chunks. Converting CR LF to \n requires special case handling
when '\r' is detected at the end of the decoded chunk in case
there's an LF at the start of the next chunk. It prepends b'\r' (only 1
byte) to the next chunk's raw bytes and decodes that. But \r in UTF-16
takes 2 bytes; we are now 1 byte out of kilter and various failures are
possible (including silently producing garbage output from a truncated
file with an odd number of bytes).

The attached script illustrates the problems.

vstinner · 2008-12-09T00:20:21Z

The bug is in IncrementalNewlineDecoder, not in the codec nor
TextIOWrapper.

vstinner · 2008-12-09T00:24:05Z

Smaller example to demonstrate the problem.

vstinner · 2008-12-09T01:53:26Z

Here is a patch for test_io.py: check the problem by adding new
encodings to TextIOWrapperTest.testNewlines().

vstinner · 2008-12-09T02:08:24Z

Ugly patch to fix this issue:

add more regression tests for charsets UTF-16*, UTF-32*
add mandatory argument "encoding" to io.IncrementalNewlineDecoder
constructor => BREAK THE API
use the encoding the encode "\r"
most ulgy hack: strip the BOM for codecs "UTF-16" and "UTF-32" (when
encoding "\r" to bytes) => I don't know how to encode "\r" without the
BOM

pitrou · 2008-12-13T17:03:06Z

A couple of suggestions:

if IncrementalNewlineDecoder gets an encoding argument, it can also
instantiate the decoder itself; that way the API is a bit simpler

- to encode '\r' without the BOM, you can e.g. use an incremental
encoder and encode it twice:
>>> enc = codecs.getincrementalencoder('utf16')('strict')
>>> enc.encode('\r')
b'\xff\xfe\r\x00'
>>> enc.encode('\r')
b'\r\x00'

I think breaking the API can be ok since the original API is broken
(witness this bug). There can even be a compatibility mode where we
check whether encoding has an encode() method, and if it has, take it
as the decoder object rather than the encoding name.

pitrou · 2008-12-13T18:14:05Z

Here is a simpler patch with a different approach and a lot of tests.
The advantage is that it doesn't break the API.

pitrou · 2008-12-13T18:30:32Z

This new variant also removes the dangerous hack in getstate / setstate.

gpshead · 2008-12-13T18:51:45Z

utf16_newlines2.patch looks good to me.

This is a data corruption issue. If it is deferred for 3.0.1 it must be
fixed in 3.0.2.

+1 on putting this in 3.0.1.

pitrou · 2008-12-14T17:10:51Z

Committed to py3k and release30-maint in r67760 and r67759. Needs
backporting to 2.x.

pitrou · 2008-12-14T18:11:51Z

Backported to trunk and 2.6.2 in r67762 and r67764.

sjmachin mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Dec 7, 2008

gpshead added the release-blocker label Dec 13, 2008

gpshead assigned pitrou Dec 13, 2008

pitrou removed the release-blocker label Dec 14, 2008

pitrou closed this as completed Dec 14, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reading UTF16-encoded text file crashes if \r on 64-char boundary #48824

reading UTF16-encoded text file crashes if \r on 64-char boundary #48824

sjmachin mannequin commented Dec 7, 2008

sjmachin mannequin commented Dec 7, 2008

vstinner commented Dec 9, 2008

vstinner commented Dec 9, 2008

vstinner commented Dec 9, 2008

vstinner commented Dec 9, 2008

pitrou commented Dec 13, 2008

pitrou commented Dec 13, 2008

pitrou commented Dec 13, 2008

gpshead commented Dec 13, 2008

pitrou commented Dec 14, 2008

pitrou commented Dec 14, 2008

reading UTF16-encoded text file crashes if \r on 64-char boundary #48824

reading UTF16-encoded text file crashes if \r on 64-char boundary #48824

Comments

sjmachin mannequin commented Dec 7, 2008

sjmachin mannequin commented Dec 7, 2008

vstinner commented Dec 9, 2008

vstinner commented Dec 9, 2008

vstinner commented Dec 9, 2008

vstinner commented Dec 9, 2008

pitrou commented Dec 13, 2008

pitrou commented Dec 13, 2008

pitrou commented Dec 13, 2008

gpshead commented Dec 13, 2008

pitrou commented Dec 14, 2008

pitrou commented Dec 14, 2008