codecs.open().readlines(sizehint) bug #39564

jepler · 2003-11-18T17:22:40Z

BPO	844561
Nosy	@malemburg
Files	codecs_readlines_bug.py: Counts lines wrong with codecs.open()

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/malemburg'
closed_at = <Date 2004-02-26.15:26:15.000>
created_at = <Date 2003-11-18.17:22:40.000>
labels = ['expert-unicode']
title = 'codecs.open().readlines(sizehint) bug'
updated_at = <Date 2004-02-26.15:26:15.000>
user = 'https://bugs.python.org/jepler'

bugs.python.org fields:

activity = <Date 2004-02-26.15:26:15.000>
actor = 'lemburg'
assignee = 'lemburg'
closed = True
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2003-11-18.17:22:40.000>
creator = 'jepler'
dependencies = []
files = ['1100']
hgrepos = []
issue_num = 844561
keywords = []
message_count = 8.0
messages = ['19029', '19030', '19031', '19032', '19033', '19034', '19035', '19036']
nosy_count = 2.0
nosy_names = ['lemburg', 'jepler']
pr_nums = []
priority = 'low'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue844561'
versions = ['Python 2.2']

jepler · 2003-11-18T17:22:40Z

codecs.open().readlines(sizehint) can return truncated
lines. The attached script, which uses
readlines(sizehint) to count the number of lines in a
file, demonstrates the problem. Correct output would
be 1000 in both cases, but different values are
returned depending on sizehint because of the truncated
lines.
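
The attached script is not reproduced in this thread. A minimal sketch of the kind of check it describes might look like the following, written for current Python 3; the names `count_lines` and `make_sample` and the sample file contents are assumptions made for illustration, not the original code.

```python
import codecs
import os
import tempfile


def make_sample(n_lines=1000):
    """Write a temporary UTF-8 file with n_lines short lines; return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
        for i in range(n_lines):
            f.write("line %d\n" % i)
    return path


def count_lines(path, sizehint, encoding="utf-8"):
    """Count lines by calling readlines(sizehint) until it returns nothing."""
    total = 0
    with codecs.open(path, "r", encoding) as f:
        while True:
            lines = f.readlines(sizehint)
            if not lines:
                break
            # Every item should be a whole line; a truncated line fails here.
            assert all(line.endswith("\n") for line in lines)
            total += len(lines)
    return total
```

If readlines(sizehint) only ever returns whole lines, `count_lines` gives the same total for every sizehint value.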

jepler · 2003-11-18T17:28:41Z

Logged In: YES
user_id=2772

The script triggers the assertion error with at least
Python 2.3.2 (locally compiled) and Python 2.2.2 (Red Hat 9 RPM).

malemburg · 2004-02-25T23:04:37Z

Logged In: YES
user_id=38388

It's hard to say whether this is a bug or not. The sizehint
argument is not well documented and the way you use it
does not look like a proper way to use it.

From the docs:
"""
If the optional sizehint argument is present, instead of
reading up to EOF, whole lines totalling approximately
sizehint bytes (possibly after rounding up to an internal
buffer size) are read.
"""

In your example the underlying open() implementation
seems to round up the sizehint value to include the whole
line, while the codecs.open() version will only read sizehint
bytes without any rounding (see the codecs.py
implementation).
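
The rounding behaviour of the built-in file objects can be illustrated with a small sketch. It uses the Python 3 `io` module rather than the Python 2 file object discussed in this thread, and the helper name `read_in_chunks` is an assumption made for the example.

```python
import io


def read_in_chunks(stream, sizehint):
    """Collect the lists returned by successive readlines(sizehint) calls."""
    chunks = []
    while True:
        lines = stream.readlines(sizehint)
        if not lines:
            break
        chunks.append(lines)
    return chunks


sample = io.StringIO("alpha\nbeta\ngamma\n")
# The hint (3) is smaller than any line, yet each call still yields a whole line.
chunks = read_in_chunks(sample, 3)
```

Here the hint only limits how many lines each call returns; it never cuts a line short.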

jepler · 2004-02-26T01:14:01Z

Logged In: YES
user_id=2772

To me, the phrase "*whole lines* totalling approximately
sizehint" means that no item from readlines(sizehint) will
be an incomplete line. I don't understand why this
requirement isn't clearly indicated to you by the text you
included in your comments.

malemburg · 2004-02-26T09:51:08Z

Logged In: YES
user_id=38388

Good catch. I must have overlooked the "whole lines" bit :-)

In that case, it's probably best to have .readlines() ignore
the sizehint argument altogether. An efficient implementation
is hard to do since the line breaking is not done at C level,
but after the data has been read.
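
As a rough sketch of what ignoring the argument means, the function below accepts sizehint for compatibility but always reads to EOF and splits afterwards. The name `readlines_ignoring_hint` and the byte-stream interface are assumptions for illustration; this is not the code in codecs.py.

```python
import io


def readlines_ignoring_hint(stream, sizehint=None, encoding="utf-8", keepends=True):
    """Read everything, decode once, then split into lines.

    sizehint is accepted so callers do not break, but it is deliberately unused.
    """
    text = stream.read().decode(encoding)
    return text.splitlines(keepends)


stream = io.BytesIO("a\nbb\nccc\n".encode("utf-8"))
first = readlines_ignoring_hint(stream, 1)   # the hint of 1 has no effect
second = readlines_ignoring_hint(stream, 1)  # nothing left to read
```

The cost is memory: the whole remaining file is held at once, whatever hint the caller passed.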

jepler · 2004-02-26T14:50:46Z

Logged In: YES
user_id=2772

Ignoring sizehint and reading the whole file is probably
better than truncating lines. This change would also fix
another bug I realized exists in codecs readlines(sizehint)
currently: if it reads only part of a multi-byte character,
you get a decoding error...
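
The multi-byte problem can be reproduced without any file at all. In this sketch, the sample text and the helper `decode_prefix` are assumptions chosen for illustration; cutting a UTF-8 byte string in the middle of a character and decoding the prefix is enough to raise the error.

```python
def decode_prefix(raw, size, encoding="utf-8"):
    """Decode only the first `size` bytes, as a fixed-size read would."""
    return raw[:size].decode(encoding)


raw = "naïve\ncafé\n".encode("utf-8")  # "ï" occupies bytes 2 and 3

try:
    decode_prefix(raw, 3)  # stops after the first byte of "ï"
    split_character_failed = False
except UnicodeDecodeError:
    split_character_failed = True

whole_character = decode_prefix(raw, 4)  # includes both bytes of "ï"
```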

A slightly more complicated approach would be to read
sizehint bytes and then while the result doesn't end in a
newline, read one more byte and decode again. When sizehint
is large enough, doing byte-at-a-time reading of the last
half-line shouldn't be that bad for performance. No, I
don't have a patch.
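
A sketch of that idea, under stated assumptions: the function name `readlines_whole` is hypothetical, the stream is a binary file-like object, and the encoding is ASCII-compatible (such as UTF-8), so a `b"\n"` byte never occurs inside a multi-byte character. With that assumption, it is enough to extend the chunk until it ends in a newline byte and decode once, instead of decoding after every extra byte.

```python
import io


def readlines_whole(stream, sizehint, encoding="utf-8"):
    """Read about sizehint bytes, then extend byte by byte to the next newline."""
    data = stream.read(sizehint)
    while data and not data.endswith(b"\n"):
        extra = stream.read(1)
        if not extra:
            break  # EOF: the final line has no trailing newline
        data += extra
    # The chunk now ends on a line boundary, so no character is split.
    return data.decode(encoding).splitlines(keepends=True)


stream = io.BytesIO("é1\nü22\nz\n".encode("utf-8"))
results = [readlines_whole(stream, 2) for _ in range(4)]
```

For encodings such as UTF-16, where a newline spans several bytes, the byte-level check would not be valid and a decoder-aware approach would be needed.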

Is there a way to differentiate between "the byte string
ends with an incomplete multi-byte character" and "the byte
string contains an invalid sequence of bytes"?

malemburg · 2004-02-26T15:20:24Z

Logged In: YES
user_id=38388

OK, I'll fix codecs.py to ignore the sizehint argument then
(this should not break any code; at worst it might cause memory
problems, since the whole file is then read at once).

To answer your question: whether a byte string is incomplete
or in error depends on the encoding and only the codec can
decide what to do. While the codecs do differentiate and the
error callback logic could be used to work out a correct
solution, this would require a lot of work.
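
For readers on current Python versions, the incremental decoder API in the `codecs` module offers one way to make this distinction; it was not available when this thread was written. The sketch below is a hedged illustration, the function name `classify` is hypothetical, and the use of `getstate()` to inspect buffered input assumes a buffering decoder such as the UTF-8 one.

```python
import codecs


def classify(data, encoding="utf-8"):
    """Return 'complete', 'incomplete', or 'invalid' for a byte string."""
    decoder = codecs.getincrementaldecoder(encoding)("strict")
    try:
        # final=False lets the decoder hold back a trailing partial character.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "invalid"
    pending, _flags = decoder.getstate()
    return "incomplete" if pending else "complete"


euro = "\u20ac".encode("utf-8")  # three bytes: e2 82 ac
```

With this, `classify(euro[:2])` reports a truncated character, while a byte such as `b"\xff"`, which can never start a UTF-8 sequence, is reported as invalid.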

malemburg · 2004-02-26T15:26:15Z

Logged In: YES
user_id=38388

Fixed in CVS.

jepler mannequin closed this as completed Nov 18, 2003

jepler mannequin assigned malemburg Nov 18, 2003

jepler mannequin added the topic-unicode label Nov 18, 2003

ezio-melotti transferred this issue from another repository Apr 9, 2022
