codecs.open().readlines(sizehint) bug #39564

Closed
jepler mannequin opened this issue Nov 18, 2003 · 8 comments
@jepler
Mannequin

jepler mannequin commented Nov 18, 2003

BPO 844561
Nosy @malemburg
Files
  • codecs_readlines_bug.py: Counts lines wrong with codecs.open()
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/malemburg'
    closed_at = <Date 2004-02-26.15:26:15.000>
    created_at = <Date 2003-11-18.17:22:40.000>
    labels = ['expert-unicode']
    title = 'codecs.open().readlines(sizehint) bug'
    updated_at = <Date 2004-02-26.15:26:15.000>
    user = 'https://bugs.python.org/jepler'

    bugs.python.org fields:

    activity = <Date 2004-02-26.15:26:15.000>
    actor = 'lemburg'
    assignee = 'lemburg'
    closed = True
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2003-11-18.17:22:40.000>
    creator = 'jepler'
    dependencies = []
    files = ['1100']
    hgrepos = []
    issue_num = 844561
    keywords = []
    message_count = 8.0
    messages = ['19029', '19030', '19031', '19032', '19033', '19034', '19035', '19036']
    nosy_count = 2.0
    nosy_names = ['lemburg', 'jepler']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue844561'
    versions = ['Python 2.2']

    @jepler
    Mannequin Author

    jepler mannequin commented Nov 18, 2003

    codecs.open().readlines(sizehint) can return truncated
    lines. The attached script, which uses
    readlines(sizehint) to count the number of lines in a
    file, demonstrates the problem. Correct output would
    be 1000 in both cases, but different values are
    returned depending on sizehint because of the truncated
    lines.
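The attached script is not reproduced on this page. The sketch below is a hypothetical reconstruction of the idea, not the original file: it counts lines by calling readlines(sizehint) repeatedly until nothing is returned. The file contents, the helper name `count_lines`, and the sizehint values are assumptions made for illustration, and `codecs.getreader()` is used to build the same kind of stream reader that codecs.open() wraps.

```python
import codecs
import os
import tempfile


def count_lines(stream, sizehint):
    """Count lines by calling readlines(sizehint) until it returns nothing."""
    total = 0
    while True:
        lines = stream.readlines(sizehint)
        if not lines:
            return total
        total += len(lines)


# Build a 1000-line test file (contents are made up for illustration).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w", encoding="utf-8") as f:
    for i in range(1000):
        f.write("line %d\n" % i)

# Count the same file with two different sizehint values.
results = {}
for sizehint in (7, 512):
    with open(path, "rb") as raw:
        reader = codecs.getreader("utf-8")(raw)
        results[sizehint] = count_lines(reader, sizehint)

os.remove(path)
print(results)
```

If every item returned by readlines() is a whole line, both counts should be 1000. On the affected versions the report says the counts varied with sizehint.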

    @jepler jepler mannequin closed this as completed Nov 18, 2003
    @jepler jepler mannequin assigned malemburg Nov 18, 2003
    @jepler jepler mannequin added the topic-unicode label Nov 18, 2003
    @jepler
    Mannequin Author

    jepler mannequin commented Nov 18, 2003

    Logged In: YES
    user_id=2772

    The script triggers the assertion error using at least
    python 2.3.2 (locally compiled) and python 2.2.2 (redhat 9 RPM)

    @malemburg
    Member

    Logged In: YES
    user_id=38388

    It's hard to say whether this is a bug or not. The sizehint
    argument is not well documented and the way you use it
    does not look like a proper way to use it.

    From the docs:
    """
    If the optional sizehint argument is present, instead of
    reading up to EOF, whole lines totalling approximately
    sizehint bytes (possibly after rounding up to an internal
    buffer size) are read.
    """

    In your example the underlying open() implementation
    seems to round up the sizehint value to include the whole
    line, while the codecs.open() version will only read sizehint
    bytes without any rounding (see the codecs.py
    implementation).
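To make the difference concrete, here is a minimal model of the behaviour described above: read exactly sizehint bytes, decode them, and split on line boundaries. `naive_readlines` and `count_items` are hypothetical helpers written for this illustration, not the code from codecs.py, and the payload is invented.

```python
import io


def naive_readlines(stream, sizehint, encoding="utf-8"):
    """Model of the reported behaviour: read sizehint bytes, decode, split."""
    data = stream.read(sizehint)
    return data.decode(encoding).splitlines(True)


def count_items(payload, sizehint):
    """Count the items returned by repeated naive_readlines() calls."""
    stream = io.BytesIO(payload)
    total = 0
    while True:
        lines = naive_readlines(stream, sizehint)
        if not lines:
            return total
        total += len(lines)


# 1000 ASCII lines of exactly 9 bytes each ("line 000\n" and so on).
payload = b"".join(b"line %03d\n" % i for i in range(1000))

aligned = count_items(payload, 9 * 50)  # chunks end on line boundaries
unaligned = count_items(payload, 100)   # chunks end in the middle of a line
print(aligned, unaligned)
```

When the chunk size happens to be a multiple of the line length the count is right. Otherwise each chunk that ends mid-line contributes a truncated item at its end and another at the start of the next chunk, so the total comes out higher than 1000.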

    @jepler
    Mannequin Author

    jepler mannequin commented Feb 26, 2004

    Logged In: YES
    user_id=2772

    To me, the phrase "*whole lines* totalling approximately
    sizehint" means that no item from readlines(sizehint) will
    be an incomplete line. I don't understand why this
    requirement isn't clearly indicated to you by the text you
    included in your comments.

    @malemburg
    Member

    Logged In: YES
    user_id=38388

    Good catch. I must have overlooked the "whole lines" bit :-)

    In that case, it's probably best to have .readlines() ignore
    the sizehint argument altogether. An efficient implementation
    is hard to do since the line breaking is not done at C level,
    but after the data has been read.

    @jepler
    Mannequin Author

    jepler mannequin commented Feb 26, 2004

    Logged In: YES
    user_id=2772

    Ignoring sizehint and reading the whole file is probably
    better than truncating lines. This change would also fix
    another bug I realized exists in codecs readlines(sizehint)
    currently: if it reads only part of a multi-byte character,
    you get a decoding error...
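The multi-byte problem can be shown without any file at all. In this small illustration (the sample string is an assumption), a UTF-8 byte string is cut between the two bytes that encode one character, which is what a fixed-size read can do:

```python
# "é" (U+00E9) is encoded as the two bytes C3 A9 in UTF-8.
encoded = "caf\u00e9\n".encode("utf-8")

# Keep only the first four bytes, cutting the character in half.
chunk = encoded[:4]

error = None
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(repr(chunk), error)
```

The truncated chunk cannot be decoded on its own, even though the complete byte string is valid.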

    A slightly more complicated approach would be to read
    sizehint bytes and then while the result doesn't end in a
    newline, read one more byte and decode again. When sizehint
    is large enough, doing byte-at-a-time reading of the last
    half-line shouldn't be that bad for performance. No, I
    don't have a patch.
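A rough sketch of that approach follows, under stated assumptions: `readlines_whole` is a hypothetical helper, not a patch to codecs.py, and it treats any decode failure before EOF as a possibly incomplete character, so genuinely invalid input is only reported once the stream is exhausted.

```python
import io


def readlines_whole(stream, sizehint, encoding="utf-8"):
    """Read about sizehint bytes, then extend byte by byte to a line end."""
    data = stream.read(sizehint)
    while data:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            text = None  # may be a multi-byte character cut in half
        if text is not None and text.endswith("\n"):
            return text.splitlines(True)
        extra = stream.read(1)
        if not extra:
            # EOF: decode what we have; invalid bytes raise here.
            return data.decode(encoding).splitlines(True)
        data += extra
    return []


# Invented payload with a two-byte character on every line.
payload = "".join("l\u00ednea %d\n" % i for i in range(1000)).encode("utf-8")

counts = {}
for sizehint in (2, 7, 512):
    stream = io.BytesIO(payload)
    total = 0
    while True:
        lines = readlines_whole(stream, sizehint)
        if not lines:
            break
        total += len(lines)
    counts[sizehint] = total
print(counts)
```

The tail of each chunk is decoded again after every extra byte, which is wasteful but only affects the last partial line, as noted above.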

    Is there a way to differentiate between "the byte string
    ends with an incomplete multi-byte character" and "the byte
    string contains an invalid sequence of bytes"?

    @malemburg
    Member

    Logged In: YES
    user_id=38388

    Ok, I'll fix codecs.py to ignore the sizehint argument then
    (should not break any code; at worst it might cause
    MemoryErrors on very large files).

    To answer your question: whether a byte string is incomplete
    or in error depends on the encoding and only the codec can
    decide what to do. While the codecs do differentiate and the
    error callback logic could be used to work out a correct
    solution, this would require a lot of work.
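For readers on later Python versions: the codecs module has since gained incremental decoders, which expose this distinction directly. That API postdates this discussion, so the sketch below illustrates the idea rather than the fix that was applied here. An incomplete trailing sequence is buffered until more bytes arrive, while an invalid sequence raises at once.

```python
import codecs

# Incomplete input: the dangling first byte of "é" is held back.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"caf\xc3")    # only the complete characters
second = decoder.decode(b"\xa9\n")    # the buffered byte is completed

# Invalid input: 0xFF can never start a UTF-8 sequence.
bad = codecs.getincrementaldecoder("utf-8")()
try:
    bad.decode(b"caf\xff")
    invalid_detected = False
except UnicodeDecodeError:
    invalid_detected = True

print(repr(first), repr(second), invalid_detected)
```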

    @malemburg
    Member

    Logged In: YES
    user_id=38388

    Fixed in CVS.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022