RFC: line-endings #1212
RalfJung
referenced this pull request
Jul 15, 2015
Closed
Add a method `lines_any` to `BufRead` #26743
RalfJung
changed the title
Add RFC line_endings
RFC: line-endings
Jul 15, 2015
alexcrichton
added
the
T-libs
label
Jul 15, 2015
ben0x539
commented
Jul 15, 2015
|
What about just |
|
That is not a new line and does not behave like one on any system I am aware of. |
yigal100
commented
Jul 15, 2015
|
@nagisa Just anecdote:
It is ridiculous that this was carried to computers where there is no carriage that needs to be returned... |
Yeah, just imagine we had fully digital screens that still scaled and messed up the (perfectly pixel-aligned) image they get via HDMI input, just to be compatible with some analog screens from the 60s. Oh wait... Back on topic though, I did not include |
|
I don't really have a strong opinion about this. I do think that this will make |
ruuda
commented
Jul 21, 2015
|
As a developer who works both on Windows and GNU/Linux I am in favour of this. I have never encountered a carriage return followed by a line feed that was not intended as a newline, and even if people worked around the current behaviour by manually stripping the carriage return, this will not break anything (assuming only carriage returns were stripped, not blindly the first character before a line feed). |
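A minimal sketch of the compatibility argument above, assuming the semantics this RFC proposes (which is how `str::lines` behaves in current Rust). The helper names `strip_cr` and `collect_lines` are hypothetical and only stand in for the kind of manual workaround described; stripping a trailing carriage return becomes a no-op once `lines()` handles `\r\n` itself, so such code keeps working.

```rust
/// Hypothetical workaround: drop one trailing carriage return, if present.
fn strip_cr(line: &str) -> &str {
    line.strip_suffix('\r').unwrap_or(line)
}

/// Split into lines, then apply the manual workaround to each line.
fn collect_lines(text: &str) -> Vec<&str> {
    text.lines().map(strip_cr).collect()
}

fn main() {
    let unix = "one\ntwo\n";
    let windows = "one\r\ntwo\r\n";
    // Both inputs produce the same lines, with or without the workaround.
    assert_eq!(collect_lines(unix), collect_lines(windows));
    println!("{:?}", collect_lines(windows));
}
```

Only a carriage return directly before the line feed is affected; a `\r` in the middle of a line is left alone.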
jminer
commented
Jul 22, 2015
|
I strongly support this change. On Windows, |
|
Why not extend this RFC to also recognize uncommon but well-defined line separators, specifically |
|
I don't know of any language framework that will treat these special characters as a newline per default, in their basic notion of what a "line" is. Adding |
|
Most languages are not as Unicode-aware or Unicode-focused as Rust is. The Unicode standard specifies that these characters should be accounted for. |
|
Thinking about it again, I really agree with @withoutboats; these unicode line boundaries should be supported. |
|
I'm a bit out of the loop, why Anyway, big -1 to unicode, if you need exotic separators for some exotic reason, make a crate, use it, put it to crates.io and don't make Edit: actually I'm for separate method for |
aturon
self-assigned this
Jul 29, 2015
|
Super strong -1 on supporting The support for |
|
@mitsuhiko So, what do you propose instead? In my view, this is exactly what I am proposing: We have APIs that work on file data byte-per-byte, we have APIs that work char-per-char (and involve unicode), and then we have APIs that work line-by-line - and these are exactly the right place to handle different conventions of treating file endings. Note that I am defining pretty clearly where line conversion should take place: When you call the |
In my mind APIs should assume that lines are terminated with That's the same rule that is also applied for unicode handling. |
I'm a bit confused here -- isn't the proposed API part of the I/O system? Update: ah, I forgot this affected |
RFC proposes changing both |
|
@nagisa Yes, I'd forgotten that this RFC also touched So, @mitsuhiko, your argument is that I/O should actually convert, and not just parse, newlines? I disagree that this is how things work with Unicode: there's no conversion step there (unless you count the To be honest, having I/O do a conversion of the data seems more intrusive to me; we try pretty hard in Rust to expose system APIs without imposing a lot of extra semantics. And I agree with the basic thrust of the RFC, that the expectation when seeing a lines parsing function in I/O is that it will handle common forms of newline markers. The situation for
Can you elaborate on the problem? |
That's only the case because what you read just happens to match the encoding you deal internally with. If you had a hypothetical API that reads iso-8859-1 encoded files then in Rust that would be a re-encode into UTF-8 on the way in and not a string class that deals with iso-8859-1 encoded charpoints. You might have a stream based API that re-encodes charpoints step by step if that is more likely, but at least the individual chars would always be unicode charpoints. This is how things are "supposed" to work on Windows. If you open a file in text mode on Windows the newlines are
The problem is that it's nearly impossible to understand if code will deal with non There is plenty of code that works on All of this is emphasized by the fact that there are no good APIs to work with newline data other than |
|
Hi,
There is no conversion going on with unicode. Instead, you call APIs The API for line-based access corresponding to this would be a type that However, it's too late to go for such an API, we already have lines() It may be that I still did not understand your proposal. Clearly, you |
I completely disagree. This only works in a world where no conversions should be performed. Note that I am not saying that any IO operation should do this. However the place to perform encoding/decoding/newline conversions should be one transformation step at the boundary to not drag that problem into every single part of the system. While you can argue that rust currently does not have that problem because it only supports UTF-8, the reasonable place to perform unicode handling has traditionally always been the IO boundary and not "every part of the system".
For instance I would imagine that there could be a wrapper for a buffered reader or iterator that converts newlines. |
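One way such a wrapper could look, as a rough sketch only. `NormalizedLines` and `normalize` are hypothetical names, not an existing or proposed std API; the adapter reads through any `BufRead` and rewrites a `\r\n` terminator to `\n`, so code behind it only ever sees one convention.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical adapter: yields each line with its terminator normalized to "\n".
struct NormalizedLines<R> {
    reader: R,
}

impl<R: BufRead> Iterator for NormalizedLines<R> {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = String::new();
        match self.reader.read_line(&mut buf) {
            Ok(0) => None, // end of input
            Ok(_) => {
                // read_line keeps the terminator, so it can be rewritten here.
                if buf.ends_with("\r\n") {
                    buf.truncate(buf.len() - 2);
                    buf.push('\n');
                }
                Some(Ok(buf))
            }
            Err(e) => Some(Err(e)),
        }
    }
}

/// Convenience helper for the example: normalize an in-memory buffer.
fn normalize(input: &[u8]) -> io::Result<String> {
    let lines = NormalizedLines { reader: Cursor::new(input) };
    let mut out = String::new();
    for line in lines {
        out.push_str(&line?);
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let text = normalize(b"first\r\nsecond\nthird")?;
    assert_eq!(text, "first\nsecond\nthird");
    Ok(())
}
```

The conversion happens in exactly one place, which is the property argued for above; whether it belongs in std or in a crate is the open question in this thread.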
|
Also remember that Unicode also defines more terminators. If there is a hypothetical system in the future that has another one you don't want to have to update each and every library ever written that deals with newlines, but just the place where the conversion takes place. |
|
Also another note: there are still systems which use |
That sounds to me like what I wrote in the paragraph "The API for |
If that is one API isolated then I'm all for it. If that is like in the RFC then I am not a fan of it, because it will just add to a list of ever growing APIs that need to be aware of different newlines. |
|
@brson I don’t know. I suppose this was written in a world where most programs on your IBM S/360 would not support anything other than EBCDIC (an encoding competing with ASCII) with NEL for newlines. I’m not that familiar with the Windows ecosystem. I’d say let’s wait and see if someone asks for |
|
What I think is important is that Rust be able to read any file ending equally well. So if I ask for the next line, it'll treat What I don't think works too well is having Rust write different newlines based on the platform it is running on. When printing to the console, |
That would be very confusing, having some functions called Regarding the Unicode spec, I wonder if there is any precedent of a language using unicode line separators for their built-in notion of "lines"? Maybe the unicode crate (that some functionality of If we decide to go with unicode support, suddenly
The
I'd rather prefer an API similar to |
|
That's meaningless to read and output a file without changing its contents.
|
|
You just called
|
|
An implementation of Now, there's a discussion we could have about why preserving the separator is tied to not using an iterator. That's a good question. For uniformity, both |
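To make the separator question concrete, here is a small sketch contrasting the two existing `BufRead` entry points. It assumes an in-memory `Cursor` as the input, and the helper names `with_read_line` and `with_lines` are hypothetical. `read_line` appends each line together with its terminator, so a program can copy input to output unchanged, while the `lines()` iterator strips the terminator.

```rust
use std::io::{BufRead, Cursor};

/// Collect lines via read_line, which keeps "\n" or "\r\n" on each line.
fn with_read_line(input: &str) -> Vec<String> {
    let mut reader = Cursor::new(input.as_bytes());
    let mut out = Vec::new();
    loop {
        let mut line = String::new();
        // A return value of 0 bytes read signals end of input.
        if reader.read_line(&mut line).expect("in-memory read") == 0 {
            break;
        }
        out.push(line);
    }
    out
}

/// Collect lines via the lines() iterator, which drops the terminator.
fn with_lines(input: &str) -> Vec<String> {
    Cursor::new(input.as_bytes())
        .lines()
        .map(|line| line.expect("in-memory read"))
        .collect()
}

fn main() {
    let input = "a\r\nb\n";
    // Concatenating the read_line results reproduces the input byte for byte.
    assert_eq!(with_read_line(input).concat(), input);
    println!("{:?}", with_lines(input));
}
```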
|
Just for the record, on Windows 10, |
jminer
commented
Aug 21, 2015
|
I agree with @retep998 that it is fine to write a newline as |
|
You just called `cat` a meaningless program.
Writing to the console does not produce a file, and nothing is written back, so saying
"producing a different file" here is meaningless.
|
|
'Newlines' are platform-dependent; they vary across operating systems and are
not just \n or \r\n. Rust std always tries to do things in a way that works
across mainstream platforms, which should cover newlines IMO. Supporting both
\n and \r\n is enough for now.
|
Emerentius
commented
Aug 25, 2015
|
If they are deemed necessary, they could also be included later in a separate RFC. |
jansegre
commented
Aug 25, 2015
|
I just found out that Unicode has a line breaking algorithm best practices. There's a simplistic view on wikipedia, which is already somewhat complicated. I think these recommendations are akin to I expect that TL;DR |
|
I'm wary of getting into the weeds of supporting unicode separators. Let's follow the principle of least surprise and try to do what people most generally expect from other languages, and I don't know of any language that has set the precedent of being aware of unicode linebreaks. |
|
These arguments for not supporting unicode separators are convincing to me (as the person who first raised the idea of supporting unicode separators). It wouldn't be surprising if, for example, some protocol defined |
birkenfeld
commented
Aug 25, 2015
This is already the case (although less prominent), since after In general, since |
Parsing such a protocol with |
|
@RalfJung Right (except that split('\n') can't be used on a BufRead interface), but someone could easily do it as a quick hack and they shouldn't encounter really unusual surprises like "this obscure character that I've never heard of is a recognized line separator." |
Why not? |
|
@RalfJung - oh right, split(b'\n') can, but not if the line-ending is |
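For reference, a short sketch of what `BufRead::split(b'\n')` yields, assuming an in-memory `Cursor` as input (the helper name `split_on_lf` is hypothetical). The method splits on a single byte, so with `\r\n` input the carriage return stays at the end of each chunk.

```rust
use std::io::{BufRead, Cursor};

/// Split a byte buffer on b'\n' using BufRead::split.
fn split_on_lf(input: &[u8]) -> Vec<Vec<u8>> {
    Cursor::new(input)
        .split(b'\n')
        .map(|chunk| chunk.expect("in-memory read"))
        .collect()
}

fn main() {
    let chunks = split_on_lf(b"a\r\nb\n");
    // The delimiter byte is removed, but the preceding '\r' is not.
    assert_eq!(chunks[0], b"a\r".to_vec());
    assert_eq!(chunks[1], b"b".to_vec());
}
```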
alexcrichton
referenced this pull request
Aug 27, 2015
Closed
Implement new semantics for lines(), deprecate lines_any() #28032
|
This was discussed this week at the libs team triage meeting and the conclusion was to merge. The APIs here seem to be a good improvement over what we have today without any loss of functionality, and pretty strong cases have been made to avoid unicode line endings for now. Thanks again for the RFC @RalfJung! |
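A brief sketch of the resulting semantics as they appear in `str::lines` in current Rust: `\n` and `\r\n` end a line, while a lone `\r` and the Unicode separators discussed in this thread (U+2028 LINE SEPARATOR, U+0085 NEL) do not. The helper name `count_lines` is hypothetical.

```rust
/// Count lines using the standard library's notion of a line.
fn count_lines(text: &str) -> usize {
    text.lines().count()
}

fn main() {
    // Mixed "\n" and "\r\n" terminators are both recognized.
    assert_eq!(count_lines("a\nb\r\nc"), 3);
    // A lone carriage return does not end a line.
    assert_eq!(count_lines("a\rb"), 1);
    // Unicode line separators are treated as ordinary characters.
    assert_eq!(count_lines("a\u{2028}b\u{0085}c"), 1);
}
```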
alexcrichton
merged commit 8e0f4f2
into
rust-lang:master
Aug 27, 2015
|
Follow-up: we would like to land an implementation as soon as possible, and need to widely announce this potentially-breaking change when it lands. Unfortunately, this kind of semantic change is very hard to check for breakage using crater. |
|
Yay, my first RFC got accepted :) I can give the implementation a try. Shouldn't be hard. I'll be traveling a lot tomorrow, should have time to do the coding easily - but I can't really compile anything while on battery, considering Rust's size and compile time. |
|
@RalfJung looks like @alexcrichton beat you to it: rust-lang/rust#28034 -- but you could give the PR an initial review! |
|
Oh, okay, that was fast^^ - seems like I need to wait for my first non-trivial commit to the actual compiler ;-) . I'll check out his PR. |
|
@alexcrichton is a lord |
RalfJung commented Jul 15, 2015
•
edited by mbrubeck
This RFC proposes to define a "line" as terminated by either `\n` or `\r\n`. Also see rust-lang/rust#26743.
Rendered output