py3k: duplicated line endings when using read(1) #45736

amauryfa · 2007-11-06T00:21:37Z

BPO	1395
Nosy	@gvanrossum, @amauryfa, @vstinner, @tiran, @avassalotti
Files	io.diff io2.diff io3.diff py3k_windows.log.gz io4.diff io4.diff linux_test.log.gz io5.diff io6.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2007-11-19.20:35:55.073>
created_at = <Date 2007-11-06.00:21:36.726>
labels = ['library', 'OS-windows']
title = 'py3k: duplicated line endings when using read(1)'
updated_at = <Date 2011-11-23.14:03:38.913>
user = 'https://github.com/amauryfa'

bugs.python.org fields:

activity = <Date 2011-11-23.14:03:38.913>
actor = 'sable'
assignee = 'none'
closed = True
closed_date = <Date 2007-11-19.20:35:55.073>
closer = 'amaury.forgeotdarc'
components = ['Library (Lib)', 'Windows']
creation = <Date 2007-11-06.00:21:36.726>
creator = 'amaury.forgeotdarc'
dependencies = []
files = ['8703', '8706', '8708', '8709', '8711', '8712', '8713', '8714', '8719']
hgrepos = []
issue_num = 1395
keywords = []
message_count = 41.0
messages = ['57147', '57152', '57155', '57156', '57163', '57164', '57166', '57169', '57171', '57176', '57204', '57205', '57207', '57208', '57209', '57210', '57214', '57217', '57218', '57219', '57220', '57228', '57229', '57230', '57231', '57234', '57235', '57236', '57238', '57252', '57264', '57272', '57273', '57283', '57291', '57311', '57368', '57664', '148182', '148183', '148185']
nosy_count = 7.0
nosy_names = ['gvanrossum', 'amaury.forgeotdarc', 'vstinner', 'draghuram', 'sable', 'christian.heimes', 'alexandre.vassalotti']
pr_nums = []
priority = 'high'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue1395'
versions = ['Python 3.0']

amauryfa · 2007-11-06T00:21:36Z

When reading a Windows text file one byte at a time, \r\n get split into
two function calls, and the user receives two \n.

The following test fails (put it somewhere in test_io.py, inside
TextIOWrapperTest for example)

    def testReadOneByOne(self):
        txt = io.TextIOWrapper(io.BytesIO(b"AA\r\nBB"))
        reads = ""
        while True:
            c = txt.read(1)
            if not c:
                break
            reads += c
        self.assertEquals(reads, "AA\nBB")
        # AssertionError: 'AA\n\nBB' != 'AA\nBB'

Note that replacing read(1) by read(2) gives the correct result.

This problem is why test_netrc fails on Windows. It may also be the root
cause for bpo-1142 (when \r\n position is just a multiple of the
_CHUNK_SIZE).
It also possible that the correction to this problem will have good
effects on test_mailbox, which uses tell() and seek() intensively.

gvanrossum · 2007-11-06T00:46:20Z

Wow, thanks!
This is not just a bug on Windows -- it is a bug in the TextIOWrapper
code that is just more likely on Windows. It is easily reproduced on
Linux too:

>>> f = open("@", "wb")>>> f.write(b"a\r\n")
6
>>> f.close()
>>> f = open("@", "r")
>>> f.read(1)
'a'
>>> f.read(1)
'\n'
>>>

draghuram · 2007-11-06T04:36:32Z

I think the solution is to do the translation on a bigger chunk than on
the string being returned in each read call. For example, the following
patch correctly returns "a" and "\n" (instead of "a" and two "\n"s).

Index: Lib/io.py
===================================================================

--- Lib/io.py   (revision 58874)
+++ Lib/io.py   (working copy)
@@ -1253,8 +1253,9 @@
                 res += pending
                 if not readahead:
                     break
+            res = self._replacenl(res)
             self._pending = res[n:]
-            return self._replacenl(res[:n])
+            return res[:n]

     def __next__(self):
         self._telling = False

Of course, we need to take care of the case when the last character in
"res" is "\r" to be followed by "\n" in the next read. I just wanted to
see if I am on the right track and if so, I can spend more time on this.

amauryfa · 2007-11-06T08:11:57Z

Some thoughts:

is it ok to call replacenl multiple times on the same string? \r\r\n
and other combinations like this.
I wonder if it is possible to correctly handle \r\n at the CHUNK_SIZE
limit without an incremental decoder. it seems that we need at least a
flag saying "\r was previously read, ignore the next \n"

draghuram · 2007-11-06T16:12:18Z

I am attaching the patch io.diff that does the following:

If a chunk ends in "\r", read the next chunk to check if it starts
with "\n". This is obviously a very simplified solution that can be
optimized.
invoke _replacenl on the complete string read instead of what is being
returned in each read call.

I also incorporated the test case by Amaury and added one more.

With this patch in place, the following tests failed (on SuSE 10.1):

test_doctest test_mailbox test_nis test_old_mailbox
test_pickletools test_pty test_sys

The failures (other than known test_mailbox and test_old_mailbox) didn't
look like they are caused by this fix.

amauryfa · 2007-11-06T16:51:50Z

This patch goes in the right direction, but has several problems IMO:

it reads a complete chunk for just one more byte
the re-read should be disabled when lineends are not translated
these two are minor annoyance and can be easily corrected, but:
there is no limit to the re-read; it can exhaust the memory if the
source is a big file with many \r (like a Mac text file)
How does this behave if the underlying stream has not yet available
data (a socket, a terminal, or an opened file): will the re-read block
at the 129th byte until more data is available? If so, it is annoying
for a simple call to read(1).

I will try to write these test cases if you want.

gvanrossum · 2007-11-06T17:43:35Z

IMO you shouldn't read another chunk when the last character you've seen
is \r; instead, you should set a flag so that on the next read, you'll
ignore an initial \n. The flag should be made of the tell/seek state as
well.

(The problem with reading another character is that in interactive input
mode, this might force the user to type an entire second line.)

draghuram · 2007-11-06T17:52:24Z

On 11/6/07, Amaury Forgeot d'Arc <report@bugs.python.org> wrote:

it reads a complete chunk for just one more byte

the re-read should be disabled when lineends are not translated
these two are minor annoyance and can be easily corrected, but:

there is no limit to the re-read; it can exhaust the memory if the
source is a big file with many \r (like a Mac text file)

Yes. reading another chunk is not an optimal solution but after seeing
Guido's comment, I now agree that I shouldn't try to read at all. I
will work on modifications.

I will try to write these test cases if you want.

That will definitely be useful.

tiran · 2007-11-06T18:21:09Z

Guido van Rossum wrote:

IMO you shouldn't read another chunk when the last character you've seen
is \r; instead, you should set a flag so that on the next read, you'll
ignore an initial \n. The flag should be made of the tell/seek state as
well.

In my opinion the check for \r should only happen when os.linesep or
newline starts with \r. On Unix with standard newline the \r should be
treated as every other char.

Christian

gvanrossum · 2007-11-06T18:59:14Z

In my opinion the check for \r should only happen when os.linesep or
newline starts with \r. On Unix with standard newline the \r should be
treated as every other char.

No: it is not dependent on os.linesep but on the newline parameter
passed to open().
It's perfectly valid to want to read a file with CRLF endings on Unix
(Amaury's patches end up in my /tmp directory like that.)

draghuram · 2007-11-07T17:08:56Z

I am attaching another patch (io2.diff). Please review. I am not sure
whether _adjust_chunk() should also adjust "readahead".

BTW, PEP-3116 says:

"If universal newlines without translation are requested on input (i.e.
newline=''), if a system read operation returns a buffer ending in '\r',
another system read operation is done to determine whether it is
followed by '\n' or not. In universal newlines mode with translation,
the second system read operation may be postponed until the next read
request, and if the following system read operation returns a buffer
starting with '\n', that character is simply discarded."

I suppose this issue is mainly talking about the latter (newline is
None). I don't understand what is meant by "enabling universal new line
mode without translation". Isn't the purpose of enabling universal new
line mode is to translate line endings? I may be missing something
basic, of course.

tiran · 2007-11-07T17:20:30Z

The new patch fixes test_netrc for me but test_csv and test_mailbox are
still broken.

draghuram · 2007-11-07T17:28:34Z

The new patch fixes test_netrc for me but test_csv and test_mailbox are
still broken.

Unfortunately, I am not able to build python on windows so I can not
test there. Can you post the exact errors? You can send me private
email as well if you think the test output would clutter this issue.

gvanrossum · 2007-11-07T17:41:31Z

This looks promising, but my head hurts when I try to understand the
code that's already there and think about whether your patch will always
do the right thing... I'll look more later.

Regarding "universal newlines without translation:" that means that \r\n
and \r are recognized as line endings (as is \n) and that readline()
will return whatever line end it sees. Compare this to setting
newline="\n"; then \r is not treated as a line ending at all (and if the
input is a\rb\n, the next readline call would return that entire string).

amauryfa · 2007-11-07T17:47:22Z

The new patch fixes test_netrc for me but test_csv and test_mailbox are
still broken.

For test_mailbox at least, I think I have a clue: the _pending member
now contains translated newlines.
But in tell(), we use its length and compare it with the length of a
previous "pending" value stored in self._snapshot...

Can we try to somehow move the replacenl() call inside the _get_chunk
function?
Not sure it will work. This gets more and more obscure...

gvanrossum · 2007-11-07T17:49:33Z

Somebody needs to reverse-engineer the invariants applying to the
various instance variables and add comments explaining them, and
showing how they are maintained by each call.

tiran · 2007-11-07T18:35:00Z

By the way what happened to the SoC project related to Python's new IO
subsystem? IIRC somebody was working on a C optimization of the io lib.

gvanrossum · 2007-11-07T19:03:08Z

On 11/7/07, Christian Heimes <report@bugs.python.org> wrote:

Christian Heimes added the comment:

By the way what happened to the SoC project related to Python's new IO
subsystem? IIRC somebody was working on a C optimization of the io lib.

Tracker <report@bugs.python.org>
<http://bugs.python.org/issue1395\>

I think it was Alexandre Vassalotti. Is that right, Alexandre? Or am I
mixing you up? (If you ca, please respond to the bug.)

draghuram · 2007-11-07T19:06:05Z

Hi Amaury and Christian,

io3.diff does replacenl() in adjust_chunk() (trying Amaury's
suggestion). Can you see if it fixes test_mailbox failures?

tiran · 2007-11-07T19:10:59Z

Good work!

The tests for mailbox, netrc and csv are passing with your test. I'm
going to run the entire suite now.

tiran · 2007-11-07T19:14:56Z

I take it back. I accidentally run the unit tests on the trunk instead
of the py3k branch. mailbox and csv are still breaking with your test,
netrc is doing fine.

avassalotti · 2007-11-07T23:55:40Z

On 11/7/07, Guido van Rossum <guido@python.org> wrote:

On 11/7/07, Christian Heimes <report@bugs.python.org> wrote:
>
> Christian Heimes added the comment:
>
> By the way what happened to the SoC project related to Python's new IO
> subsystem? IIRC somebody was working on a C optimization of the io lib.
>

I think it was Alexandre Vassalotti. Is that right, Alexandre? Or am I
mixing you up? (If you ca, please respond to the bug.)

I think so. My GSoC project was to merge the interface of
cPickle/pickle and cStringIO/StringIO. I am still working on it,
albeit slowly (my school homework is killing all my free time, right
now). My work on StringIO and BytesIO is basically done; I just need
to add newline translation and encoding support before it can be
merged into the trunk. The cPickle/pickle merge is mostly done (i.e.,
all the current tests passes); I am right now in the bug-hunting
phase.

gvanrossum · 2007-11-08T00:11:54Z

Cool. How hard do you think it would be to extend your work on
StringIO into a translation of TextIOWrapper into C?

On Nov 7, 2007 3:55 PM, Alexandre Vassalotti <alexandre@peadrop.com> wrote:

On 11/7/07, Guido van Rossum <guido@python.org> wrote:
> On 11/7/07, Christian Heimes <report@bugs.python.org> wrote:
> >
> > Christian Heimes added the comment:
> >
> > By the way what happened to the SoC project related to Python's new IO
> > subsystem? IIRC somebody was working on a C optimization of the io lib.
> >
>
> I think it was Alexandre Vassalotti. Is that right, Alexandre? Or am I
> mixing you up? (If you ca, please respond to the bug.)

I think so. My GSoC project was to merge the interface of
cPickle/pickle and cStringIO/StringIO. I am still working on it,
albeit slowly (my school homework is killing all my free time, right
now). My work on StringIO and BytesIO is basically done; I just need
to add newline translation and encoding support before it can be
merged into the trunk. The cPickle/pickle merge is mostly done (i.e.,
all the current tests passes); I am right now in the bug-hunting
phase.

amauryfa · 2007-11-08T00:30:48Z

io3.diff does replacenl() in adjust_chunk() (trying Amaury's
suggestion). Can you see if it fixes test_mailbox failures?

Unfortunately, it does not. And some tests now fail in test_io
(correcting testTelling seems a good start point, since we just changed
the tell() behaviour)

avassalotti · 2007-11-08T01:51:59Z

On 11/7/07, Guido van Rossum wrote:

Cool. How hard do you think it would be to extend your work on
StringIO into a translation of TextIOWrapper into C?

Well, StringIO and TextIOWrapper are quite different. The only part
that I could reuse, from StringIO, would be the newline translation
facilities. I think that a perfect translation to C of the current
Python implementation of TextIOWrapper will be burdensome (due to
things like function annotations) but not too hard to do.

Nevertheless, that would be neat project for me. I could start to work
it by mid-December, along with the other performance enhancements I
have in mind.

amauryfa · 2007-11-08T12:59:08Z

OK, I have taken another approach which seems to work (see io4.diff):
It uses an IncrementalNewlineDecoder, which wraps the initial (e.g.
utf-8) decoder.
All the tests in test_io pass on Windows, including those added by
io.diff and io2.diff. This was not the case with the other proposed patches.

While not completely finished, this approach seems much saner to me: it
is simple, does not introduce obscure variables or functions, and is
compatible with the (complex) code in tell() which reconstruct the file
position.

Next steps are:

move _seennl management inside this decoder, and remove _replacenl
make the decoder directly inherit from codecs.IncrementalDecoder; code
may be easier to understand.
fix test_mailbox.

About mailbox.py: it seems that the code cannot work: it uses statements
like
self._file.read(stop - self._file.tell())
where 'stop' was previously initialized with some other call to
self._file.tell(). But read() counts characters, whereas tell() returns
byte counts. The question is now: how does it work with python2.5? I'll
try to add debug prints to see where the differences are.

tiran · 2007-11-08T13:17:30Z

The patch doesn't apply

$ patch -p0 < io4.diff
(Stripping trailing CRs from patch.)
patching file Lib/io.py
patch: **** malformed patch at line 41: @@ -1133,7 +1160,10 @@

amauryfa · 2007-11-08T13:23:36Z

Sorry, I think I corrupted the file by hand. Here is another version

amauryfa · 2007-11-08T13:30:49Z

About mailbox.py: it seems that the code cannot work

Of course: the file mode was recently changed from rb+ to r+ (revision
57809). This means that every occurrence of os.linesep has to disappear.
Oh my.

tiran · 2007-11-08T14:40:03Z

The new io4.diff breaks test_io and test_univnewlines on Linux

amauryfa · 2007-11-08T17:37:27Z

The new io4.diff breaks test_io and test_univnewlines on Linux
oops, I missed this one.

Here is a new version: io5.diff, which should handle the "seen newlines"
better.

Two more bug fixes found by test_univnewlines:

write() should return the number of characters written, but before
they are newline-translated.
a problem in tell(), my decoder can return two characters even when
you feed only one (example: decode(b'\r') holds the \r and returns
nothing. then a decode(b'x') returns both chars "\rx")

draghuram · 2007-11-08T18:54:25Z

Considering that test_csv is failing on windows even without any changes
related to this issue, I looked at it and came up with this patch:

-----------------
Index: Lib/test/test_csv.py
===================================================================

--- Lib/test/test_csv.py        (revision 58914)
+++ Lib/test/test_csv.py        (working copy)
@@ -375,7 +375,7 @@
 
 class TestCsvBase(unittest.TestCase):
     def readerAssertEqual(self, input, expected_result):
-        with TemporaryFile("w+") as fileobj:
+        with TemporaryFile("w+", newline='') as fileobj:
             fileobj.write(input)
             fileobj.seek(0)
             reader = csv.reader(fileobj, dialect = self.dialect)

Does this look ok? The tests pass on windows and Linux.

draghuram · 2007-11-08T19:00:26Z

On 11/8/07, Amaury Forgeot d'Arc <report@bugs.python.org> wrote:

OK, I have taken another approach which seems to work (see io4.diff):
It uses an IncrementalNewlineDecoder, which wraps the initial (e.g.
utf-8) decoder.

I like this approach even though I haven't looked at the patch in detail.

amauryfa · 2007-11-08T23:30:01Z

Updated patch (io6.diff):

simplifications in readline
seennl is now completely handled by the NewlineDecoder

gvanrossum · 2007-11-09T00:42:13Z

Considering that test_csv is failing on windows even without any changes
related to this issue, I looked at it and came up with this patch:

-----------------
Index: Lib/test/test_csv.py
===================================================================
--- Lib/test/test_csv.py (revision 58914)
+++ Lib/test/test_csv.py (working copy)
@@ -375,7 +375,7 @@

class TestCsvBase(unittest.TestCase):
def readerAssertEqual(self, input, expected_result):
   with TemporaryFile("w+") as fileobj:
   with TemporaryFile("w+", newline='') as fileobj:
       fileobj.write(input)
       fileobj.seek(0)
       reader = csv.reader(fileobj, dialect = self.dialect)
-----------------

Does this look ok? The tests pass on windows and Linux.

Yes, especially since writerAssertEqual() already uses that. :-)

I do think there is something iffy here -- the 2.x version of this
test opens the files in binary mode. I wonder what end users are
supposed to do.

draghuram · 2007-11-09T15:50:32Z

On 11/8/07, Guido van Rossum <report@bugs.python.org> wrote:

I do think there is something iffy here -- the 2.x version of this
test opens the files in binary mode. I wonder what end users are
supposed to do.

I think that requirement (need to open in binary mode) is no more
present in 3.0. As per revision 56777 (bpo-1767398), csv module seems to
support unicode strings and as such I would expect that files *should
not* be opened in binary mode (This requires doc change both in csv
doc and "what's new in 3.0). The tests in test_csv are explicitly
writing "\r\n" which seems to be a hangover from 2.x. We seem to have
side-stepped the problem by passing "newline=''" at some places but
the real fix may be changing "\r\n"s to "\n" now that the temp files
are being opened in text mode.

tiran · 2007-11-11T15:09:51Z

By the way I've found the daily builds you were asking for, Raghuram.
http://www.python.org/dev/daily-msi/ :)

amauryfa · 2007-11-19T20:35:55Z

Committed the patch in r59060.

sable · 2011-11-23T13:47:43Z

I am trying to get Python working when compiled with Visual Studio 2010 (cf bpo-13210).

When running the tests with the python 2.7 branch compiled with VS2010, the "test_issue_1395_5" in test_io.py will cause Python to eat the whole memory within a few seconds and make the server completely unresponsive.

amauryfa · 2011-11-23T13:51:11Z

You should open a new issue for this new problem.

sable · 2011-11-23T14:03:39Z

OK, sorry. Done in bpo-13461.

tiran added stdlib Python modules in the Lib dir OS-windows labels Nov 7, 2007

amauryfa closed this as completed Nov 19, 2007

ezio-melotti transferred this issue from another repository Apr 10, 2022

py3k: duplicated line endings when using read(1) #45736

py3k: duplicated line endings when using read(1) #45736

Comments

amauryfa commented Nov 6, 2007

amauryfa commented Nov 6, 2007

gvanrossum commented Nov 6, 2007

draghuram mannequin commented Nov 6, 2007

amauryfa commented Nov 6, 2007

draghuram mannequin commented Nov 6, 2007

amauryfa commented Nov 6, 2007

gvanrossum commented Nov 6, 2007

draghuram mannequin commented Nov 6, 2007

tiran commented Nov 6, 2007

gvanrossum commented Nov 6, 2007

draghuram mannequin commented Nov 7, 2007

tiran commented Nov 7, 2007

draghuram mannequin commented Nov 7, 2007

gvanrossum commented Nov 7, 2007

amauryfa commented Nov 7, 2007

gvanrossum commented Nov 7, 2007

tiran commented Nov 7, 2007

gvanrossum commented Nov 7, 2007

draghuram mannequin commented Nov 7, 2007

tiran commented Nov 7, 2007

tiran commented Nov 7, 2007

avassalotti commented Nov 7, 2007

gvanrossum commented Nov 8, 2007

amauryfa commented Nov 8, 2007

avassalotti commented Nov 8, 2007

amauryfa commented Nov 8, 2007

tiran commented Nov 8, 2007

amauryfa commented Nov 8, 2007

amauryfa commented Nov 8, 2007

tiran commented Nov 8, 2007

amauryfa commented Nov 8, 2007

draghuram mannequin commented Nov 8, 2007

draghuram mannequin commented Nov 8, 2007

amauryfa commented Nov 8, 2007

gvanrossum commented Nov 9, 2007

draghuram mannequin commented Nov 9, 2007

tiran commented Nov 11, 2007

amauryfa commented Nov 19, 2007

sable mannequin commented Nov 23, 2011

amauryfa commented Nov 23, 2011

sable mannequin commented Nov 23, 2011