Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

py3k: duplicated line endings when using read(1) #45736

Closed
amauryfa opened this issue Nov 6, 2007 · 41 comments
Closed

py3k: duplicated line endings when using read(1) #45736

amauryfa opened this issue Nov 6, 2007 · 41 comments
Labels
OS-windows stdlib Python modules in the Lib dir

Comments

@amauryfa
Copy link
Member

amauryfa commented Nov 6, 2007

BPO 1395
Nosy @gvanrossum, @amauryfa, @vstinner, @tiran, @avassalotti
Files
  • io.diff
  • io2.diff
  • io3.diff
  • py3k_windows.log.gz
  • io4.diff
  • io4.diff
  • linux_test.log.gz
  • io5.diff
  • io6.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2007-11-19.20:35:55.073>
    created_at = <Date 2007-11-06.00:21:36.726>
    labels = ['library', 'OS-windows']
    title = 'py3k: duplicated line endings when using read(1)'
    updated_at = <Date 2011-11-23.14:03:38.913>
    user = 'https://github.com/amauryfa'

    bugs.python.org fields:

    activity = <Date 2011-11-23.14:03:38.913>
    actor = 'sable'
    assignee = 'none'
    closed = True
    closed_date = <Date 2007-11-19.20:35:55.073>
    closer = 'amaury.forgeotdarc'
    components = ['Library (Lib)', 'Windows']
    creation = <Date 2007-11-06.00:21:36.726>
    creator = 'amaury.forgeotdarc'
    dependencies = []
    files = ['8703', '8706', '8708', '8709', '8711', '8712', '8713', '8714', '8719']
    hgrepos = []
    issue_num = 1395
    keywords = []
    message_count = 41.0
    messages = ['57147', '57152', '57155', '57156', '57163', '57164', '57166', '57169', '57171', '57176', '57204', '57205', '57207', '57208', '57209', '57210', '57214', '57217', '57218', '57219', '57220', '57228', '57229', '57230', '57231', '57234', '57235', '57236', '57238', '57252', '57264', '57272', '57273', '57283', '57291', '57311', '57368', '57664', '148182', '148183', '148185']
    nosy_count = 7.0
    nosy_names = ['gvanrossum', 'amaury.forgeotdarc', 'vstinner', 'draghuram', 'sable', 'christian.heimes', 'alexandre.vassalotti']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1395'
    versions = ['Python 3.0']

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 6, 2007

    When reading a Windows text file one byte at a time, \r\n get split into
    two function calls, and the user receives two \n.

    The following test fails (put it somewhere in test_io.py, inside
    TextIOWrapperTest for example)

        def testReadOneByOne(self):
            txt = io.TextIOWrapper(io.BytesIO(b"AA\r\nBB"))
            reads = ""
            while True:
                c = txt.read(1)
                if not c:
                    break
                reads += c
            self.assertEquals(reads, "AA\nBB")
            # AssertionError: 'AA\n\nBB' != 'AA\nBB'

    Note that replacing read(1) by read(2) gives the correct result.

    This problem is why test_netrc fails on Windows. It may also be the root
    cause for bpo-1142 (when \r\n position is just a multiple of the
    _CHUNK_SIZE).
    It also possible that the correction to this problem will have good
    effects on test_mailbox, which uses tell() and seek() intensively.

    @gvanrossum
    Copy link
    Member

    Wow, thanks!
    This is not just a bug on Windows -- it is a bug in the TextIOWrapper
    code that is just more likely on Windows. It is easily reproduced on
    Linux too:

    >>> f = open("@", "wb")>>> f.write(b"a\r\n")
    6
    >>> f.close()
    >>> f = open("@", "r")
    >>> f.read(1)
    'a'
    >>> f.read(1)
    '\n'
    >>>

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 6, 2007

    I think the solution is to do the translation on a bigger chunk than on
    the string being returned in each read call. For example, the following
    patch correctly returns "a" and "\n" (instead of "a" and two "\n"s).

    Index: Lib/io.py
    ===================================================================

    --- Lib/io.py   (revision 58874)
    +++ Lib/io.py   (working copy)
    @@ -1253,8 +1253,9 @@
                     res += pending
                     if not readahead:
                         break
    +            res = self._replacenl(res)
                 self._pending = res[n:]
    -            return self._replacenl(res[:n])
    +            return res[:n]
         def __next__(self):
             self._telling = False

    Of course, we need to take care of the case when the last character in
    "res" is "\r" to be followed by "\n" in the next read. I just wanted to
    see if I am on the right track and if so, I can spend more time on this.

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 6, 2007

    Some thoughts:

    • is it ok to call replacenl multiple times on the same string? \r\r\n
      and other combinations like this.
    • I wonder if it is possible to correctly handle \r\n at the CHUNK_SIZE
      limit without an incremental decoder. it seems that we need at least a
      flag saying "\r was previously read, ignore the next \n"

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 6, 2007

    I am attaching the patch io.diff that does the following:

    • If a chunk ends in "\r", read the next chunk to check if it starts
      with "\n". This is obviously a very simplified solution that can be
      optimized.

    • invoke _replacenl on the complete string read instead of what is being
      returned in each read call.

    I also incorporated the test case by Amaury and added one more.

    With this patch in place, the following tests failed (on SuSE 10.1):

    test_doctest test_mailbox test_nis test_old_mailbox
    test_pickletools test_pty test_sys

    The failures (other than known test_mailbox and test_old_mailbox) didn't
    look like they are caused by this fix.

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 6, 2007

    This patch goes in the right direction, but has several problems IMO:

    • it reads a complete chunk for just one more byte

    • the re-read should be disabled when lineends are not translated
      these two are minor annoyance and can be easily corrected, but:

    • there is no limit to the re-read; it can exhaust the memory if the
      source is a big file with many \r (like a Mac text file)

    • How does this behave if the underlying stream has not yet available
      data (a socket, a terminal, or an opened file): will the re-read block
      at the 129th byte until more data is available? If so, it is annoying
      for a simple call to read(1).

    I will try to write these test cases if you want.

    @gvanrossum
    Copy link
    Member

    IMO you shouldn't read another chunk when the last character you've seen
    is \r; instead, you should set a flag so that on the next read, you'll
    ignore an initial \n. The flag should be made of the tell/seek state as
    well.

    (The problem with reading another character is that in interactive input
    mode, this might force the user to type an entire second line.)

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 6, 2007

    On 11/6/07, Amaury Forgeot d'Arc <report@bugs.python.org> wrote:

    • it reads a complete chunk for just one more byte

    • the re-read should be disabled when lineends are not translated
      these two are minor annoyance and can be easily corrected, but:

    • there is no limit to the re-read; it can exhaust the memory if the
      source is a big file with many \r (like a Mac text file)

    Yes. reading another chunk is not an optimal solution but after seeing
    Guido's comment, I now agree that I shouldn't try to read at all. I
    will work on modifications.

    I will try to write these test cases if you want.

    That will definitely be useful.

    @tiran
    Copy link
    Member

    tiran commented Nov 6, 2007

    Guido van Rossum wrote:

    IMO you shouldn't read another chunk when the last character you've seen
    is \r; instead, you should set a flag so that on the next read, you'll
    ignore an initial \n. The flag should be made of the tell/seek state as
    well.

    In my opinion the check for \r should only happen when os.linesep or
    newline starts with \r. On Unix with standard newline the \r should be
    treated as every other char.

    Christian

    @gvanrossum
    Copy link
    Member

    In my opinion the check for \r should only happen when os.linesep or
    newline starts with \r. On Unix with standard newline the \r should be
    treated as every other char.

    No: it is not dependent on os.linesep but on the newline parameter
    passed to open().
    It's perfectly valid to want to read a file with CRLF endings on Unix
    (Amaury's patches end up in my /tmp directory like that.)

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 7, 2007

    I am attaching another patch (io2.diff). Please review. I am not sure
    whether _adjust_chunk() should also adjust "readahead".

    BTW, PEP-3116 says:

    "If universal newlines without translation are requested on input (i.e.
    newline=''), if a system read operation returns a buffer ending in '\r',
    another system read operation is done to determine whether it is
    followed by '\n' or not. In universal newlines mode with translation,
    the second system read operation may be postponed until the next read
    request, and if the following system read operation returns a buffer
    starting with '\n', that character is simply discarded."

    I suppose this issue is mainly talking about the latter (newline is
    None). I don't understand what is meant by "enabling universal new line
    mode without translation". Isn't the purpose of enabling universal new
    line mode is to translate line endings? I may be missing something
    basic, of course.

    @tiran
    Copy link
    Member

    tiran commented Nov 7, 2007

    The new patch fixes test_netrc for me but test_csv and test_mailbox are
    still broken.

    @tiran tiran added stdlib Python modules in the Lib dir OS-windows labels Nov 7, 2007
    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 7, 2007

    The new patch fixes test_netrc for me but test_csv and test_mailbox are
    still broken.

    Unfortunately, I am not able to build python on windows so I can not
    test there. Can you post the exact errors? You can send me private
    email as well if you think the test output would clutter this issue.

    @gvanrossum
    Copy link
    Member

    This looks promising, but my head hurts when I try to understand the
    code that's already there and think about whether your patch will always
    do the right thing... I'll look more later.

    Regarding "universal newlines without translation:" that means that \r\n
    and \r are recognized as line endings (as is \n) and that readline()
    will return whatever line end it sees. Compare this to setting
    newline="\n"; then \r is not treated as a line ending at all (and if the
    input is a\rb\n, the next readline call would return that entire string).

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 7, 2007

    The new patch fixes test_netrc for me but test_csv and test_mailbox are
    still broken.

    For test_mailbox at least, I think I have a clue: the _pending member
    now contains translated newlines.
    But in tell(), we use its length and compare it with the length of a
    previous "pending" value stored in self._snapshot...

    Can we try to somehow move the replacenl() call inside the _get_chunk
    function?
    Not sure it will work. This gets more and more obscure...

    @gvanrossum
    Copy link
    Member

    Somebody needs to reverse-engineer the invariants applying to the
    various instance variables and add comments explaining them, and
    showing how they are maintained by each call.

    @tiran
    Copy link
    Member

    tiran commented Nov 7, 2007

    By the way what happened to the SoC project related to Python's new IO
    subsystem? IIRC somebody was working on a C optimization of the io lib.

    @gvanrossum
    Copy link
    Member

    On 11/7/07, Christian Heimes <report@bugs.python.org> wrote:

    Christian Heimes added the comment:

    By the way what happened to the SoC project related to Python's new IO
    subsystem? IIRC somebody was working on a C optimization of the io lib.


    Tracker <report@bugs.python.org>
    <http://bugs.python.org/issue1395\>


    I think it was Alexandre Vassalotti. Is that right, Alexandre? Or am I
    mixing you up? (If you ca, please respond to the bug.)

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 7, 2007

    Hi Amaury and Christian,

    io3.diff does replacenl() in adjust_chunk() (trying Amaury's
    suggestion). Can you see if it fixes test_mailbox failures?

    @tiran
    Copy link
    Member

    tiran commented Nov 7, 2007

    Good work!

    The tests for mailbox, netrc and csv are passing with your test. I'm
    going to run the entire suite now.

    @tiran
    Copy link
    Member

    tiran commented Nov 7, 2007

    I take it back. I accidentally run the unit tests on the trunk instead
    of the py3k branch. mailbox and csv are still breaking with your test,
    netrc is doing fine.

    @avassalotti
    Copy link
    Member

    On 11/7/07, Guido van Rossum <guido@python.org> wrote:

    On 11/7/07, Christian Heimes <report@bugs.python.org> wrote:
    >
    > Christian Heimes added the comment:
    >
    > By the way what happened to the SoC project related to Python's new IO
    > subsystem? IIRC somebody was working on a C optimization of the io lib.
    >

    I think it was Alexandre Vassalotti. Is that right, Alexandre? Or am I
    mixing you up? (If you ca, please respond to the bug.)

    I think so. My GSoC project was to merge the interface of
    cPickle/pickle and cStringIO/StringIO. I am still working on it,
    albeit slowly (my school homework is killing all my free time, right
    now). My work on StringIO and BytesIO is basically done; I just need
    to add newline translation and encoding support before it can be
    merged into the trunk. The cPickle/pickle merge is mostly done (i.e.,
    all the current tests passes); I am right now in the bug-hunting
    phase.

    @gvanrossum
    Copy link
    Member

    Cool. How hard do you think it would be to extend your work on
    StringIO into a translation of TextIOWrapper into C?

    On Nov 7, 2007 3:55 PM, Alexandre Vassalotti <alexandre@peadrop.com> wrote:

    On 11/7/07, Guido van Rossum <guido@python.org> wrote:
    > On 11/7/07, Christian Heimes <report@bugs.python.org> wrote:
    > >
    > > Christian Heimes added the comment:
    > >
    > > By the way what happened to the SoC project related to Python's new IO
    > > subsystem? IIRC somebody was working on a C optimization of the io lib.
    > >
    >
    > I think it was Alexandre Vassalotti. Is that right, Alexandre? Or am I
    > mixing you up? (If you ca, please respond to the bug.)

    I think so. My GSoC project was to merge the interface of
    cPickle/pickle and cStringIO/StringIO. I am still working on it,
    albeit slowly (my school homework is killing all my free time, right
    now). My work on StringIO and BytesIO is basically done; I just need
    to add newline translation and encoding support before it can be
    merged into the trunk. The cPickle/pickle merge is mostly done (i.e.,
    all the current tests passes); I am right now in the bug-hunting
    phase.

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 8, 2007

    io3.diff does replacenl() in adjust_chunk() (trying Amaury's
    suggestion). Can you see if it fixes test_mailbox failures?

    Unfortunately, it does not. And some tests now fail in test_io
    (correcting testTelling seems a good start point, since we just changed
    the tell() behaviour)

    @avassalotti
    Copy link
    Member

    On 11/7/07, Guido van Rossum wrote:

    Cool. How hard do you think it would be to extend your work on
    StringIO into a translation of TextIOWrapper into C?

    Well, StringIO and TextIOWrapper are quite different. The only part
    that I could reuse, from StringIO, would be the newline translation
    facilities. I think that a perfect translation to C of the current
    Python implementation of TextIOWrapper will be burdensome (due to
    things like function annotations) but not too hard to do.

    Nevertheless, that would be neat project for me. I could start to work
    it by mid-December, along with the other performance enhancements I
    have in mind.

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 8, 2007

    OK, I have taken another approach which seems to work (see io4.diff):
    It uses an IncrementalNewlineDecoder, which wraps the initial (e.g.
    utf-8) decoder.
    All the tests in test_io pass on Windows, including those added by
    io.diff and io2.diff. This was not the case with the other proposed patches.

    While not completely finished, this approach seems much saner to me: it
    is simple, does not introduce obscure variables or functions, and is
    compatible with the (complex) code in tell() which reconstruct the file
    position.

    Next steps are:

    • move _seennl management inside this decoder, and remove _replacenl
    • make the decoder directly inherit from codecs.IncrementalDecoder; code
      may be easier to understand.
    • fix test_mailbox.

    About mailbox.py: it seems that the code cannot work: it uses statements
    like
    self._file.read(stop - self._file.tell())
    where 'stop' was previously initialized with some other call to
    self._file.tell(). But read() counts characters, whereas tell() returns
    byte counts. The question is now: how does it work with python2.5? I'll
    try to add debug prints to see where the differences are.

    @tiran
    Copy link
    Member

    tiran commented Nov 8, 2007

    The patch doesn't apply

    $ patch -p0 < io4.diff
    (Stripping trailing CRs from patch.)
    patching file Lib/io.py
    patch: **** malformed patch at line 41: @@ -1133,7 +1160,10 @@

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 8, 2007

    Sorry, I think I corrupted the file by hand. Here is another version

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 8, 2007

    About mailbox.py: it seems that the code cannot work

    Of course: the file mode was recently changed from rb+ to r+ (revision
    57809). This means that every occurrence of os.linesep has to disappear.
    Oh my.

    @tiran
    Copy link
    Member

    tiran commented Nov 8, 2007

    The new io4.diff breaks test_io and test_univnewlines on Linux

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 8, 2007

    The new io4.diff breaks test_io and test_univnewlines on Linux
    oops, I missed this one.

    Here is a new version: io5.diff, which should handle the "seen newlines"
    better.

    Two more bug fixes found by test_univnewlines:

    • write() should return the number of characters written, but before
      they are newline-translated.
    • a problem in tell(), my decoder can return two characters even when
      you feed only one (example: decode(b'\r') holds the \r and returns
      nothing. then a decode(b'x') returns both chars "\rx")

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 8, 2007

    Considering that test_csv is failing on windows even without any changes
    related to this issue, I looked at it and came up with this patch:

    -----------------
    Index: Lib/test/test_csv.py
    ===================================================================

    --- Lib/test/test_csv.py        (revision 58914)
    +++ Lib/test/test_csv.py        (working copy)
    @@ -375,7 +375,7 @@
     
     class TestCsvBase(unittest.TestCase):
         def readerAssertEqual(self, input, expected_result):
    -        with TemporaryFile("w+") as fileobj:
    +        with TemporaryFile("w+", newline='') as fileobj:
                 fileobj.write(input)
                 fileobj.seek(0)
                 reader = csv.reader(fileobj, dialect = self.dialect)

    Does this look ok? The tests pass on windows and Linux.

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 8, 2007

    On 11/8/07, Amaury Forgeot d'Arc <report@bugs.python.org> wrote:

    OK, I have taken another approach which seems to work (see io4.diff):
    It uses an IncrementalNewlineDecoder, which wraps the initial (e.g.
    utf-8) decoder.

    I like this approach even though I haven't looked at the patch in detail.

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Nov 8, 2007

    Updated patch (io6.diff):

    • simplifications in readline
    • seennl is now completely handled by the NewlineDecoder

    @gvanrossum
    Copy link
    Member

    Considering that test_csv is failing on windows even without any changes
    related to this issue, I looked at it and came up with this patch:

    -----------------
    Index: Lib/test/test_csv.py
    ===================================================================
    --- Lib/test/test_csv.py (revision 58914)
    +++ Lib/test/test_csv.py (working copy)
    @@ -375,7 +375,7 @@

    class TestCsvBase(unittest.TestCase):
    def readerAssertEqual(self, input, expected_result):

    •    with TemporaryFile("w+") as fileobj:
      
    •    with TemporaryFile("w+", newline='') as fileobj:
             fileobj.write(input)
             fileobj.seek(0)
             reader = csv.reader(fileobj, dialect = self.dialect)
      

    -----------------

    Does this look ok? The tests pass on windows and Linux.

    Yes, especially since writerAssertEqual() already uses that. :-)

    I do think there is something iffy here -- the 2.x version of this
    test opens the files in binary mode. I wonder what end users are
    supposed to do.

    @draghuram
    Copy link
    Mannequin

    draghuram mannequin commented Nov 9, 2007

    On 11/8/07, Guido van Rossum <report@bugs.python.org> wrote:

    I do think there is something iffy here -- the 2.x version of this
    test opens the files in binary mode. I wonder what end users are
    supposed to do.

    I think that requirement (need to open in binary mode) is no more
    present in 3.0. As per revision 56777 (bpo-1767398), csv module seems to
    support unicode strings and as such I would expect that files *should
    not* be opened in binary mode (This requires doc change both in csv
    doc and "what's new in 3.0). The tests in test_csv are explicitly
    writing "\r\n" which seems to be a hangover from 2.x. We seem to have
    side-stepped the problem by passing "newline=''" at some places but
    the real fix may be changing "\r\n"s to "\n" now that the temp files
    are being opened in text mode.

    @tiran
    Copy link
    Member

    tiran commented Nov 11, 2007

    By the way I've found the daily builds you were asking for, Raghuram.
    http://www.python.org/dev/daily-msi/ :)

    @amauryfa
    Copy link
    Member Author

    Committed the patch in r59060.

    @sable
    Copy link
    Mannequin

    sable mannequin commented Nov 23, 2011

    I am trying to get Python working when compiled with Visual Studio 2010 (cf bpo-13210).

    When running the tests with the python 2.7 branch compiled with VS2010, the "test_issue_1395_5" in test_io.py will cause Python to eat the whole memory within a few seconds and make the server completely unresponsive.

    @amauryfa
    Copy link
    Member Author

    You should open a new issue for this new problem.

    @sable
    Copy link
    Mannequin

    sable mannequin commented Nov 23, 2011

    OK, sorry. Done in bpo-13461.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    OS-windows stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants