
codecs.readline sometimes removes newline chars #41794

Closed
irmen mannequin opened this issue Apr 2, 2005 · 14 comments
Labels
stdlib Python modules in the Lib dir

Comments

@irmen
Mannequin

irmen mannequin commented Apr 2, 2005

BPO 1175396
Nosy @doerwalter, @irmen
Files
  • bug.py: test program that shows the problem
  • testfile1.txt: inputfile with CR/LF line endings (needed by bug.py)
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2005-10-09.23:25:37.000>
    created_at = <Date 2005-04-02.15:14:44.000>
    labels = ['library']
    title = 'codecs.readline sometimes removes newline chars'
    updated_at = <Date 2005-10-09.23:25:37.000>
    user = 'https://github.com/irmen'

    bugs.python.org fields:

    activity = <Date 2005-10-09.23:25:37.000>
    actor = 'irmen'
    assignee = 'none'
    closed = True
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2005-04-02.15:14:44.000>
    creator = 'irmen'
    dependencies = []
    files = ['1658', '1659']
    hgrepos = []
    issue_num = 1175396
    keywords = []
    message_count = 14.0
    messages = ['24858', '24859', '24860', '24861', '24862', '24863', '24864', '24865', '24866', '24867', '24868', '24869', '24870', '24871']
    nosy_count = 5.0
    nosy_names = ['doerwalter', 'glchapman', 'vrmeyer', 'irmen', 'mmm']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1175396'
    versions = ['Python 2.4']

    @irmen
    Mannequin Author

    irmen mannequin commented Apr 2, 2005

    In Python 2.4.1 I observed a new bug in codecs.readline:
    with certain inputs it removes newline characters from the
    end of the line.

    It is probably related to bug bpo-1076985 (Incorrect
    behaviour of StreamReader.readline leads to crash)
    and bug bpo-1098990 (codec readline() splits lines apart),
    both with status closed, so I'm assigning this to Walter.

    See the attached files that demonstrate the problem.
    Reproduced with Python 2.4.1 on Windows XP and on
    Linux. The problem does not occur with Python 2.4.

    (By the way, it seems bug bpo-1076985 was fixed in Python
    2.4.1, but the other one, bpo-1098990, was not?)
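    The attached bug.py and testfile1.txt are not reproduced in the
    migrated issue. A minimal sketch in their spirit (hypothetical
    chunk and line sizes, runnable on a current Python where the bug
    is fixed): read CR/LF data through a codecs StreamReader with a
    size argument small enough that reads can stop between the '\r'
    and the '\n', then check that no newline characters are lost.

```python
import codecs
import io

# CR/LF test data; the line length and the readline size below are
# hypothetical, chosen only so that read boundaries fall inside '\r\n'.
raw = (b"some text on a line" + b"\r\n") * 5
reader = codecs.getreader("ascii")(io.BytesIO(raw))

pieces = []
while True:
    piece = reader.readline(size=7)  # small size forces many reads
    if not piece:
        break
    pieces.append(piece)

# On a fixed Python, joining the returned pieces restores the input
# exactly; under the 2.4.1 bug, some '\n' characters went missing.
assert "".join(pieces) == raw.decode("ascii")
```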

    @irmen irmen mannequin closed this as completed Apr 2, 2005
    @irmen irmen mannequin added the stdlib Python modules in the Lib dir label Apr 2, 2005
    @doerwalter
    Contributor


    Checked in a fix as:
    Lib/codecs.py 1.42/1.43
    Lib/test/test_codecs.py 1.22
    Lib/codecs.py 1.35.2.6
    Lib/test/test_codecs.py 1.15.2.4

    Are you really sure that the fix for bpo-1098990 is not in
    2.4.1? According to the tracker for bpo-1098990 the fix was in
    Lib/codecs.py revision 1.35.2.2, and revision 1.35.2.3 is the
    one that got the r241c1 tag, so the fix should be in 2.4.1.

    @glchapman
    Mannequin

    glchapman mannequin commented Apr 9, 2005


    Sorry to comment on a closed report, but perhaps this fix
    should not be limited only to cases where size is None.
    Today, I ran into a spurious syntax error when trying to
    import pythondoc (from
    http://effbot.org/downloads/pythondoc-2.1b3-20050325.zip).
    It turned out a \r was ending up in what looked to the
    parser like the middle of a line, presumably because a \n
    was dropped. Anyway, I applied the referenced patch to
    2.4.1, except I left out the "size is None" condition, since
    I knew the tokenizer passes in a size param. With that
    change pythondoc imported successfully. (Also, I just ran the
    test suite and nothing broke.)

    Since the size parameter is already documented as being
    passed to StreamReader.read (in codecs.py -- the HTML
    documentation needs to be updated), and since
    StreamReader.read says size is an approximate maximum,
    perhaps it's OK to read one extra byte.

    @doerwalter
    Contributor

    OK, I'm reopening the bug report. I didn't manage to install
    pythondoc. cElementTree complains about: No such file or
    directory: './pyconfig.h'. Can you provide a simple Python
    file that fails when imported?

    @mmm
    Mannequin

    mmm mannequin commented Apr 14, 2005


    foo2.py from bpo-1163244 fails to import. Not being an expert
    in Python internals, I hope it is due to this bug.

    @glchapman
    Mannequin

    glchapman mannequin commented Apr 14, 2005


    I think the foo2.py from bpo-1163244 is probably the same bug;
    at any rate, the reason for it is that a \r is at the
    beginning of the last line when read in by decoding_fgets.
    I have a simpler test file which shows the bug, which I'll
    email to Walter (you basically just have to get a \r as the
    last character in the block read by StreamReader, so that
    atcr will be true).

    The problem is caused by StreamReader.readline doing:

        if self.atcr and data.startswith(u"\n"):
            data = data[1:]

    since the tokenizer relies on '\n' as the line break
    character, but it will never see the '\n' removed by the above.
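    A toy replay of the two quoted lines (a hypothetical stand-alone
    function, not the real codecs.py, isolating just the atcr drop)
    shows the effect on a consumer that counts lines by '\n':

```python
# Replay the 2.4.1 atcr logic in isolation: when a chunk ends in
# '\r', the flag is set, and a leading '\n' in the next chunk is
# silently dropped from the data handed to the caller.
def replay(chunks):
    out, atcr = [], False
    for data in chunks:
        if atcr and data.startswith("\n"):
            data = data[1:]
        atcr = data.endswith("\r")
        out.append(data)
    return "".join(out)

# The '\n' of the first "\r\n" is gone, so an '\n'-counting consumer
# like the tokenizer sees a '\r' in the middle of a line.
text = replay(["line one\r", "\nline two\n"])
assert text == "line one\rline two\n"
```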

    FWIW (not much), I think the 2.4 StreamReader.readline
    actually made more sense than the current code, although a
    few changes would seem useful (see below). I don't think it
    is particularly useful to treat the size parameter as a
    fixed maximum number of bytes to read, since the number of
    bytes read has no fixed relationship to the number of
    decoded unicode characters (and also, in the case of the
    tokenizer, no fixed relationship to the number of bytes of
    encoded utf8). Also, given the current code, the size
    parameter is effectively ignored if there is a charbuffer:
    if you have 5 characters sitting in the charbuffer and use a
    size of 0x1FF, you only get back the 5 characters, even if
    they do not end in a linebreak. For the tokenizer, this
    means an unnecessary PyMem_RESIZE and an extra call to
    decoding_readline roughly every BUFSIZ bytes in the file
    (since the tokenizer assumes failure to fetch a complete
    line means its buffer is too small, whereas in fact it was
    caused by an incomplete line being stored in the
    StreamReader's charbuffer).

    As to changes from 2.4, if the unicode object were to add a
    findlinebreak method which returns the index of the first
    character for which Py_UNICODE_ISLINEBREAK is true, readline
    could use that instead of find("\n"). If it used such a
    method, readline would also need to explicitly handle a
    "\r\n" sequence, including a potential read(1) if a '\r'
    appears at the end of the data (in the case where size is
    not None). Of course, one problem with that idea is it
    requires a new method, which may not be allowed until 2.5,
    and the 2.4.1 behavior definitely needs to be fixed some
    way. (Interestingly, it looks to me like sre has everything
    necessary for searching for unicode linebreaks except syntax
    with which to express the idea in a pattern (maybe I'm
    missing something, but I can't find a way to get a compiled
    pattern to emit CATEGORY_UNI_LINEBREAK).)
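    On today's Python a pure-Python stand-in for such a findlinebreak
    helper can be built on str.splitlines, which recognizes the full
    set of Unicode line boundaries. This is a hypothetical sketch of
    the proposed method, not anything that was added to the unicode
    object:

```python
def find_linebreak(s):
    """Return the index of the first Unicode line-break character
    in s, or -1 if there is none (hypothetical helper; relies on
    str.splitlines recognizing the relevant boundaries)."""
    kept = s.splitlines(keepends=True)
    if not kept:
        return -1
    stripped = s.splitlines()
    if kept[0] == stripped[0]:
        return -1              # first segment carries no line break
    return len(stripped[0])    # index of the first break character
```

    Unlike find("\n"), this also catches '\r', '\u2028', '\u2029',
    and the other boundaries splitlines knows about.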

    @doerwalter
    Contributor


    The current readline() is implemented so that even if the
    complete line ending (i.e. '\r\n') can't be read within size
    bytes, at least something that might work as a line end
    (i.e. '\r') is available at the end of the string, so that
    the user knows that the line is complete. The atcr member
    makes sure that the '\n' that is read next isn't
    misinterpreted as another line. Unfortunately the import
    mechanism doesn't work that way: it demands a '\n' as line
    terminator and will continue reading if it only finds the
    '\r'. This means that the '\n' will be skipped and the
    import mechanism treats those two lines as one.

    IMHO the best solution would be to read the extra character
    even if size is None, as glchapman suggested, so the above
    situation would really only happen if the last character
    from the stream is '\r'. I think the tokenizer should be
    able to survive this. What it didn't survive in 2.4 was that
    readline() tried to get its hands on a complete line even if
    the length of the line was way bigger than the size passed
    in. If we remove the "size is None" restriction from the
    current code, then I think we should remove the atcr logic
    too, because it could only happen under exceptional
    circumstances and the old handling might be better for those
    users that don't recognize '\r' as a line end.

    But in any case the tokenizer should be fixed to be able to
    handle any line length returned from readline(). I'd really
    like to get a review by Martin v. Löwis of glchapman's patch
    bpo-1101726.

    Of course the simplest solution would be: "If you want a
    complete line, don't pass a size parameter to readline()". I
    don't know if it would make sense to change the PEP-263
    machinery to support this.

    The other problem is if readline() returns data from the
    charbuffer. I don't really see how this can be fixed. We
    might call read() with a size parameter even if there are
    characters in the charbuffer, but what happens if the user
    calls readline(size=100000) first and then
    readline(size=10)? The first readline() call might put a
    very long string into the charbuffer and then the second
    call will return a unicode string whose length is in no way
    correlated to the size parameter. (This happens even
    with the current code of course, so the message should be:
    don't trust the size parameter; it only tells the underlying
    stream how many bytes to read (approximately), and it doesn't
    tell you *anything* about the amount of data returned.)
    This was different in 2.3.x, because the readline() method
    of the underlying stream was used, which handled this
    properly, but only worked if an encoded linefeed was
    recognizable as a normal 8bit '\n' linefeed by the
    underlying readline() (i.e. the UTF-16 encodings didn't work).

    @doerwalter
    Contributor


    OK, I've checked in the following:

    Lib/codecs.py 1.44
    Lib/test/test_codecs.py 1.23
    Lib/codecs.py 1.35.2.7
    Lib/test/test_codecs.py 1.15.2.5

    with the following changes as suggested by glchapman:
    If a chunk read has a trailing "\r", read one more character
    even if the user has passed a size parameter. Remove the
    atcr logic. This should fix most of the problems. There are
    three open issues:
    1. How do we handle the problem of a truncated line, if the
      data comes from the charbuffer instead of being read from
      the stream?

    2. How do we handle error reporting? The buffering code
      might read more data than the current line. If this data has
      a decoding error the caller might report a wrong line
      number. (See bug bpo-1178484)

    3. Fixing the tokenizer.
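    The trailing-'\r' handling described above can be sketched as a
    stand-alone helper (a hypothetical illustration, not the actual
    Lib/codecs.py code): if a chunk read from the stream ends in
    '\r', read one more character so that a following '\n' stays
    glued to its '\r' instead of leaking into the next line.

```python
import io

def read_chunk(stream, size):
    data = stream.read(size)
    if data.endswith("\r"):
        data += stream.read(1)  # may fetch the matching '\n'
    return data

s = io.StringIO("abcd\r\nef")
chunk = read_chunk(s, 5)     # a plain read(5) stops right after '\r'
assert chunk == "abcd\r\n"   # the extra read completed the '\r\n'
```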

    @glchapman
    Mannequin

    glchapman mannequin commented Apr 23, 2005


    1. How do we handle the problem of a truncated line, if the
      data comes from the charbuffer instead of being read from
      the stream?

    My suggestion is to make the top of the loop look like:

        while True:
            havecharbuffer = bool(self.charbuffer)

    And then the break condition (when no line break found)
    should be:

        # we didn't get anything or this was our only try
        if not data or (size is not None and not havecharbuffer):

    (too many negatives in that). Anyway, the idea is that, if
    size specified, there will be at most one read of the
    underlying stream (using the size). So if you enter the
    loop with a charbuffer, and that charbuffer does not have a
    line break, then you redo the loop once (the second time it
    will break, because havecharbuffer will be False).
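    A toy rendering of that loop shape (a hypothetical stand-in, not
    the real StreamReader, with a simplified read() that drains the
    charbuffer before touching the "stream") shows the intended
    behavior: with a size given there is at most one read of the
    underlying stream, plus one extra pass to drain a pre-existing
    charbuffer.

```python
class ToyReader:
    """Hypothetical stand-in for codecs.StreamReader."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.charbuffer = ""

    def read(self, size):
        # Return buffered characters first; otherwise do one
        # "stream" read of the requested size.
        if self.charbuffer:
            data, self.charbuffer = self.charbuffer, ""
            return data
        n = 16 if size is None else size
        data = self.text[self.pos:self.pos + n]
        self.pos += n
        return data

    def readline(self, size=None):
        line = ""
        while True:
            havecharbuffer = bool(self.charbuffer)
            data = self.read(size)
            line += data
            if "\n" in line:
                end = line.index("\n") + 1
                self.charbuffer = line[end:] + self.charbuffer
                return line[:end]
            # we didn't get anything or this was our only try
            if not data or (size is not None and not havecharbuffer):
                return line
```

    Entering readline() with an unterminated charbuffer now costs
    exactly one additional stream read instead of returning a
    truncated line.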

    Also, not sure about this, but should the size parameter
    default to -1 (to keep it in sync with read)?

    As to issue 2, it looks like it should be possible to get
    the line number right, because the UnicodeDecodeError
    exception object has all the necessary information in it
    (including the original string). I think this should be
    done by fp_readl (in tokenizer.c).

    By the way, using a findlinebreak function (using sre) turns
    out to be slower than splitting/joining when no size is
    specified (which I assume will be the overwhelmingly most
    common case), so that turned out to be a bad idea on my part.

    @doerwalter
    Contributor


    > 1. How do we handle the problem of a truncated line, if the
    > data comes from the charbuffer instead of being read from
    > the stream?
    >
    > My suggestion is to make the top of the loop look like:
    >
    >     while True:
    >         havecharbuffer = bool(self.charbuffer)
    >
    > And then the break condition (when no line break found)
    > should be:
    >
    >     # we didn't get anything or this was our only try
    >     if not data or (size is not None and not havecharbuffer):
    >
    > (too many negatives in that). Anyway, the idea is that, if
    > size specified, there will be at most one read of the
    > underlying stream (using the size). So if you enter the
    > loop with a charbuffer, and that charbuffer does not have a
    > line break, then you redo the loop once (the second time it
    > will break, because havecharbuffer will be False).

    This makes sense. However, with the current state of the
    tokenizer this might be too dangerous, because the resulting
    line might be twice as long. So fixing the tokenizer should
    be the first step. BTW, your patch fixes the problems with
    the fix for bpo-1178484, so I think I'm going to apply the
    patch in the next few days, if there are no objections.

    > Also, not sure about this, but should the size parameter
    > default to -1 (to keep it in sync with read)?

    None seems to be a better default from an API viewpoint,
    but -1 is better for "C compatibility".

    > As to issue 2, it looks like it should be possible to get
    > the line number right, because the UnicodeDecodeError
    > exception object has all the necessary information in it
    > (including the original string). I think this should be
    > done by fp_readl (in tokenizer.c).

    The patch for bpo-1178484 fixes this. Combined with this patch
    I think we're in good shape.

    > By the way, using a findlinebreak function (using sre) turns
    > out to be slower than splitting/joining when no size is
    > specified (which I assume will be the overwhelmingly most
    > common case), so that turned out to be a bad idea on my
    part.

    Coding this on the C level and using Py_UNICODE_ISLINEBREAK()
    should be the fastest version, but I don't know if this is worth
    it.

    @vrmeyer
    Mannequin

    vrmeyer mannequin commented Jul 12, 2005


    Hello doerwalter,

    Our thanks to you and to glchapman for working on this bug.

    I think the project I am working on may have run into this
    bug while attempting to use the Python COM client wrappers
    for ADO on Win32. We saw the problem in Python 2.4.1, but
    not 2.4 or 2.3.5.

    Do you have any projection for when the fix will make it into
    the stable distribution?

    Our production is running on 2.3.5. If it looks like a long
    while before the fix is distributed, we may upgrade to 2.4.0.
    Otherwise we'll stick with 2.3.5 and wait.

    Thanks again.

    Alan
    

    @glchapman
    Mannequin

    glchapman mannequin commented Jul 12, 2005


    Build 204 of pywin32 has a workaround for bpo-1163244 which
    should also avoid this bug: it leaves out the encoding
    notation in generated COM wrapper files (so loading will not
    involve the codecs module). If your wrapper files don't
    need extended characters (I don't think ADO does), you might
    want to give that a shot.

    @doerwalter
    Contributor


    2.4.2 is out the door, so I'm closing this bug as fixed.

    @irmen
    Mannequin Author

    irmen mannequin commented Oct 9, 2005


    Confirmed fixed with Python 2.4.2 on mac os x. But the
    status is still 'open' although resolution is 'fixed'...

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022