Carriage return doesn't end // comments #7946

sp3d · 2013-07-21T20:07:23Z

The manual and compiler agree that \n (newline) is the only character which can end // comments. However, a large amount of software such as editors and document viewers consider any of "\r", "\r", and "\r\n" to introduce the start of a new line. In a file that has a \r without a following \n, users can easily be tricked into thinking a line is not commented out when it actually is.

For example, the rust program "//\rfn main() {}" is missing a main function.

sp3d · 2013-07-21T20:36:36Z

Discussion on IRC seemed to think that using Unicode character classes for newline determination in general is a reasonable approach, and I agree: https://en.wikipedia.org/wiki/Newline#Unicode

bstrie · 2013-07-22T17:58:33Z

The best thing to do here would be to look at what other Unicode-aware languages do, and then do that. This is purely for CYA, because I can contrive security risks regardless of which behavior we pick (thanks to idiosyncracies in editors, which we can never control).

(I'm not using the actual Unicode code points in these code examples, for purposes of visual illustration.)

Here's the risk in our current behavior. Imagine an editor that starts newlines on \r.

// imagine that this comment ends in a \r
if user.is_not_the_president { return; }
// the above line is commented out!
launch_all_nuclear_missiles();

Now imagine that we adopt the plan where line comments end on any Unicode newline-equivalent. Per sp3d's Wikipedia link:

Recognizing and using the newline codes greater than 0x7F is not often done. They are multiple bytes in UTF-8 and the code for NEL has been used as the ellipsis ('…') character in Windows-1252.

So we merely need to imagine an editor on Windows that doesn't recognize one of the less common newline characters, which to a user might look something like:

// whatever you do, DO NOT do anything like… send_user_data_to_hackers();
// the … in the above line ends the comment, leaving working code!

Though realistically the "…" would probably be rendered as a "▯" (or other "unknown character" symbol), but that doesn't much change the risk here, which is caused by potential disregard for Unicode and incorrect syntax highlighting.

brson · 2013-07-22T20:40:02Z

nominating

sp3d · 2013-07-22T21:01:45Z

Here's a survey of a few modern languages:

C#:
"new-line::
Carriage return character (U+000D)
Line feed character (U+000A)
Carriage return character (U+000D) followed by line feed character (U+000A)
Next line character (U+2085)
Line separator character (U+2028)
Paragraph separator character (U+2029)" - http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-334.pdf (§ 9.3.1, page 89)

Haskell:
"newline → return linefeed | return | linefeed | formfeed
return → a carriage return
linefeed → a line feed
vertab → a vertical tab
formfeed → a form feed" - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-160002.2
These are defined in terms of unicode characters. However, GHC 7.6.3 on Linux does not appear to consider "vertab" (0x0b), "formfeed" (0x0c), nor "return" (0x0d) to constitute a newline, but merely whitespace; it considers "return formfeed" (0x0d 0x0a) to be whitespace followed by a newline.

Java:
"Except for comments (§3.7), identifiers, and the contents of character and string literals (§3.10.4, §3.10.5), all input elements (§3.5) in a program are formed only from ASCII characters (or Unicode escapes (§3.3) which result in ASCII characters)."
...
"Lines are terminated by the ASCII characters CR, or LF, or CR LF. The two characters CR immediately followed by LF are counted as one line terminator, not two. A line terminator specifies the termination of the // form of a comment (§3.7)." - http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.4

Javascript [ecmascript]:
"LineTerminatorSequence ::

[lookahead ∉ ]

"
"Only the characters in Table 3 [note. these are the component characters of the sequences listed above - sp3d] are treated as line terminators. Other new line or line breaking characters are treated as white space but not as line terminators. The character sequence is commonly used as a line terminator. It should be considered a single character for the purpose of reporting line numbers." - http://www.ecma-international.org/ecma-262/5.1/#sec-7.3

Python 3:
"A physical line is a sequence of characters terminated by an end-of-line sequence. In source files, any of the standard platform line termination sequences can be used - the Unix form using ASCII LF (linefeed), the Windows form using the ASCII sequence CR LF (return followed by linefeed), or the old Macintosh form using the ASCII CR (return) character. All of these forms can be used equally, regardless of platform." - http://docs.python.org/3/reference/lexical_analysis.html#physical-lines

All these languages (excluding the ghc implementation of Haskell) accept at least \r, \n, and \r\n.
Some additionally choose to support LS (U+2028) and PS (U+2029), the former two and NL (U+2085), or VT (0x0b) and FF (0x0c).

Are there other languages worth looking at?

bstrie · 2013-07-23T15:26:57Z

I'd be interested to see the rules for C++, Ruby, and Go (the former so that we know what our potential audience is familiar with, and the latter because they also support Unicode identifiers (and Go, like Rust, requires all source files to be UTF-8)).

sp3d · 2013-07-23T22:24:00Z

C++:
"Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “white space”), as described below, are ignored except as they serve to separate tokens." - http://isocpp.org/files/papers/N3690.pdf (§ 2.7, page 35)
The actual character set isn't described, and the first phase of program translation is described as follows:
"Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined." - http://isocpp.org/files/papers/N3690.pdf (§ 2.2, page 31)
This, to me, seems to indicate that C++ implementations can do whatever they consider sane for processing newline sequences.

Go:
"newline = /* the Unicode code point U+000A */ ."
...
"White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token." - http://golang.org/ref/spec#Lexical_elements
It looks like Go doesn't consider any high-unicode characters, vertical tabs, form-feeds, etc. to be whitespace or newlines--only 0x0a counts as a newline.

Ruby:
"line-terminator ::
0x0d? 0x0a" - http://www.ipa.go.jp/files/000011432.pdf (§ 8.3, page 40)
"whitespace ::
0x09 | 0x0b | 0x0c | 0x0d | 0x20 | line-terminator-escape-sequence" - http://www.ipa.go.jp/files/000011432.pdf (§ 8.3, page 41)
"source-character ::
[ any character in ISO/IEC 646:1991 IRV ]" (§ 8.2, page 39)
So Ruby is specified to be a 7-bit encoding (essentially ASCII), but states that "The support for any other character sets and encodings is unspecified.". As far as I could tell the standard implementation does nothing beyond the spec w/r/t newline handling, but the code was fairly hard to read.

Between the latter two, the convention seems to be to consider 0x0d characters as whitespace and 0x0a newlines, such that \n and \r\n newlines work, somewhat similarly to the current Rust behavior--Ruby will consider "0x0d 0x0a" a single token without external whitespace while Go will not, as I understand it.

catamorphism · 2013-09-12T17:23:42Z

Accepted for well-defined

pnkfelix · 2013-09-12T17:23:58Z

cc me

brson · 2013-09-17T17:55:46Z

Thanks for all the research on this topic!

We discussed this and decided that the use case for anything other than \n is not clear, and generally believe that all editors, even on windows, end lines with \n. We will leave the current behavior for now.

fiiiu · 2019-08-29T13:43:25Z

We came across this issue recently and were planning on reporting it when we found that you have been long aware.

We originally uncovered a similar problem in the Solidity language during our Solidity compiler audit. While this is arguably more severe in a blockchain setting where auditability substitutes trust, we believe that improper handling of \r characters is a very dangerous trait of any language. In the case of Rust, the fact that \r is outright banned in doc comments but silently ignored in line comments, makes matters even more confusing.

We'd love it if you would consider reopening this issue for further discussion :)

This issue has been fixed by [commit](rust-lang/rust-clippy@8c1c763) This PR is used for close rust-lang#7946(Fixes rust-lang#7946). changelog: Add test for pattern_type_mismatch.

Add test for pattern_type_mismatch. This issue has been fixed by [commit](rust-lang/rust-clippy@8c1c763) This PR is used for close rust-lang#7946(Fixes rust-lang#7946). changelog: Add test for pattern_type_mismatch.

brson closed this as completed Sep 17, 2013

SiegeLord mentioned this issue Jan 20, 2014

Multiline strings with carriage return - unknown string escape #11669

Closed

andersforsgren mentioned this issue Dec 2, 2014

Windows linebreaks not supported racer-rust/racer#82

Closed

declanvk mentioned this issue Oct 2, 2018

Flexible line breaking brendanzab/codespan#36

Merged

erak mentioned this issue Aug 9, 2019

Fixes doxygen style multiline comment parsing for files with CRLF ethereum/solidity#7174

Merged

6 tasks

sogaiu mentioned this issue Oct 5, 2023

Long-string parsing for CR LF vs LF eol sequences differ janet-lang/janet#1302

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Carriage return doesn't end // comments #7946

Carriage return doesn't end // comments #7946

sp3d commented Jul 21, 2013

sp3d commented Jul 21, 2013

bstrie commented Jul 22, 2013

brson commented Jul 22, 2013

sp3d commented Jul 22, 2013

bstrie commented Jul 23, 2013

sp3d commented Jul 23, 2013

catamorphism commented Sep 12, 2013

pnkfelix commented Sep 12, 2013

brson commented Sep 17, 2013

fiiiu commented Aug 29, 2019

Carriage return doesn't end // comments #7946

Carriage return doesn't end // comments #7946

Comments

sp3d commented Jul 21, 2013

sp3d commented Jul 21, 2013

bstrie commented Jul 22, 2013

brson commented Jul 22, 2013

sp3d commented Jul 22, 2013

bstrie commented Jul 23, 2013

sp3d commented Jul 23, 2013

catamorphism commented Sep 12, 2013

pnkfelix commented Sep 12, 2013

brson commented Sep 17, 2013

fiiiu commented Aug 29, 2019