Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Carriage return doesn't end // comments #7946

Closed
sp3d opened this issue Jul 21, 2013 · 10 comments
Closed

Carriage return doesn't end // comments #7946

sp3d opened this issue Jul 21, 2013 · 10 comments
Labels
A-grammar Area: The grammar of Rust

Comments

@sp3d
Copy link
Contributor

sp3d commented Jul 21, 2013

The manual and compiler agree that \n (newline) is the only character which can end // comments. However, a large amount of software such as editors and document viewers consider any of "\r", "\r", and "\r\n" to introduce the start of a new line. In a file that has a \r without a following \n, users can easily be tricked into thinking a line is not commented out when it actually is.

For example, the rust program "//\rfn main() {}" is missing a main function.

@sp3d
Copy link
Contributor Author

sp3d commented Jul 21, 2013

Discussion on IRC seemed to think that using Unicode character classes for newline determination in general is a reasonable approach, and I agree: https://en.wikipedia.org/wiki/Newline#Unicode

@bstrie
Copy link
Contributor

bstrie commented Jul 22, 2013

The best thing to do here would be to look at what other Unicode-aware languages do, and then do that. This is purely for CYA, because I can contrive security risks regardless of which behavior we pick (thanks to idiosyncracies in editors, which we can never control).

(I'm not using the actual Unicode code points in these code examples, for purposes of visual illustration.)

Here's the risk in our current behavior. Imagine an editor that starts newlines on \r.

// imagine that this comment ends in a \r
if user.is_not_the_president { return; }
// the above line is commented out!
launch_all_nuclear_missiles();

Now imagine that we adopt the plan where line comments end on any Unicode newline-equivalent. Per sp3d's Wikipedia link:

Recognizing and using the newline codes greater than 0x7F is not often done. They are multiple bytes in UTF-8 and the code for NEL has been used as the ellipsis ('…') character in Windows-1252.

So we merely need to imagine an editor on Windows that doesn't recognize one of the less common newline characters, which to a user might look something like:

// whatever you do, DO NOT do anything like… send_user_data_to_hackers();
// the … in the above line ends the comment, leaving working code!

Though realistically the "…" would probably be rendered as a "▯" (or other "unknown character" symbol), but that doesn't much change the risk here, which is caused by potential disregard for Unicode and incorrect syntax highlighting.

@brson
Copy link
Contributor

brson commented Jul 22, 2013

nominating

@sp3d
Copy link
Contributor Author

sp3d commented Jul 22, 2013

Here's a survey of a few modern languages:

C#:
"new-line::
Carriage return character (U+000D)
Line feed character (U+000A)
Carriage return character (U+000D) followed by line feed character (U+000A)
Next line character (U+2085)
Line separator character (U+2028)
Paragraph separator character (U+2029)" - http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-334.pdf (§ 9.3.1, page 89)

Haskell:
"newline → return linefeed | return | linefeed | formfeed
return → a carriage return
linefeed → a line feed
vertab → a vertical tab
formfeed → a form feed" - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-160002.2
These are defined in terms of unicode characters. However, GHC 7.6.3 on Linux does not appear to consider "vertab" (0x0b), "formfeed" (0x0c), nor "return" (0x0d) to constitute a newline, but merely whitespace; it considers "return formfeed" (0x0d 0x0a) to be whitespace followed by a newline.

Java:
"Except for comments (§3.7), identifiers, and the contents of character and string literals (§3.10.4, §3.10.5), all input elements (§3.5) in a program are formed only from ASCII characters (or Unicode escapes (§3.3) which result in ASCII characters)."
...
"Lines are terminated by the ASCII characters CR, or LF, or CR LF. The two characters CR immediately followed by LF are counted as one line terminator, not two. A line terminator specifies the termination of the // form of a comment (§3.7)." - http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.4

Javascript [ecmascript]:
"LineTerminatorSequence ::

[lookahead ∉ ]


"
"Only the characters in Table 3 [note. these are the component characters of the sequences listed above - sp3d] are treated as line terminators. Other new line or line breaking characters are treated as white space but not as line terminators. The character sequence is commonly used as a line terminator. It should be considered a single character for the purpose of reporting line numbers." - http://www.ecma-international.org/ecma-262/5.1/#sec-7.3

Python 3:
"A physical line is a sequence of characters terminated by an end-of-line sequence. In source files, any of the standard platform line termination sequences can be used - the Unix form using ASCII LF (linefeed), the Windows form using the ASCII sequence CR LF (return followed by linefeed), or the old Macintosh form using the ASCII CR (return) character. All of these forms can be used equally, regardless of platform." - http://docs.python.org/3/reference/lexical_analysis.html#physical-lines

All these languages (excluding the ghc implementation of Haskell) accept at least \r, \n, and \r\n.
Some additionally choose to support LS (U+2028) and PS (U+2029), the former two and NL (U+2085), or VT (0x0b) and FF (0x0c).

Are there other languages worth looking at?

@bstrie
Copy link
Contributor

bstrie commented Jul 23, 2013

I'd be interested to see the rules for C++, Ruby, and Go (the former so that we know what our potential audience is familiar with, and the latter because they also support Unicode identifiers (and Go, like Rust, requires all source files to be UTF-8)).

@sp3d
Copy link
Contributor Author

sp3d commented Jul 23, 2013

C++:
"Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “white space”), as described below, are ignored except as they serve to separate tokens." - http://isocpp.org/files/papers/N3690.pdf (§ 2.7, page 35)
The actual character set isn't described, and the first phase of program translation is described as follows:
"Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined." - http://isocpp.org/files/papers/N3690.pdf (§ 2.2, page 31)
This, to me, seems to indicate that C++ implementations can do whatever they consider sane for processing newline sequences.

Go:
"newline = /* the Unicode code point U+000A */ ."
...
"White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token." - http://golang.org/ref/spec#Lexical_elements
It looks like Go doesn't consider any high-unicode characters, vertical tabs, form-feeds, etc. to be whitespace or newlines--only 0x0a counts as a newline.

Ruby:
"line-terminator ::
0x0d? 0x0a" - http://www.ipa.go.jp/files/000011432.pdf (§ 8.3, page 40)
"whitespace ::
0x09 | 0x0b | 0x0c | 0x0d | 0x20 | line-terminator-escape-sequence" - http://www.ipa.go.jp/files/000011432.pdf (§ 8.3, page 41)
"source-character ::
[ any character in ISO/IEC 646:1991 IRV ]" (§ 8.2, page 39)
So Ruby is specified to be a 7-bit encoding (essentially ASCII), but states that "The support for any other character sets and encodings is unspecified.". As far as I could tell the standard implementation does nothing beyond the spec w/r/t newline handling, but the code was fairly hard to read.

Between the latter two, the convention seems to be to consider 0x0d characters as whitespace and 0x0a newlines, such that \n and \r\n newlines work, somewhat similarly to the current Rust behavior--Ruby will consider "0x0d 0x0a" a single token without external whitespace while Go will not, as I understand it.

@catamorphism
Copy link
Contributor

Accepted for well-defined

@pnkfelix
Copy link
Member

cc me

@brson
Copy link
Contributor

brson commented Sep 17, 2013

Thanks for all the research on this topic!

We discussed this and decided that the use case for anything other than \n is not clear, and generally believe that all editors, even on windows, end lines with \n. We will leave the current behavior for now.

@fiiiu
Copy link

fiiiu commented Aug 29, 2019

We came across this issue recently and were planning on reporting it when we found that you have been long aware.

We originally uncovered a similar problem in the Solidity language during our Solidity compiler audit. While this is arguably more severe in a blockchain setting where auditability substitutes trust, we believe that improper handling of \r characters is a very dangerous trait of any language. In the case of Rust, the fact that \r is outright banned in doc comments but silently ignored in line comments, makes matters even more confusing.

We'd love it if you would consider reopening this issue for further discussion :)

flip1995 pushed a commit to flip1995/rust that referenced this issue Dec 2, 2021
This issue has been fixed by [commit](rust-lang/rust-clippy@8c1c763)
This PR is used for close rust-lang#7946(Fixes rust-lang#7946).

changelog: Add test for pattern_type_mismatch.
flip1995 pushed a commit to flip1995/rust that referenced this issue Dec 2, 2021
Add test for pattern_type_mismatch.

This issue has been fixed by [commit](rust-lang/rust-clippy@8c1c763)
This PR is used for close rust-lang#7946(Fixes rust-lang#7946).

changelog: Add test for pattern_type_mismatch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-grammar Area: The grammar of Rust
Projects
None yet
Development

No branches or pull requests

6 participants