Invalid char literals are accepted #7945

sp3d · 2013-07-21T19:25:02Z

The byte sequences "0x27 0x0a 0x27", "0x27 0x0d 0x27", and "0x27 0x27 0x27" (newline, carriage return, and single-quote, respectively, sandwiched between single quotes) are accepted as character literals. The former two are, as far as I can tell, allowed per the manual's description of the language, but would not feature in any sane language; I assume this is merely an oversight. The latter is rejected by the grammar described in the manual but accepted by the compiler. Presumably the manual is the authority and the compiler is wrong to accept '''.
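For concreteness, here is a minimal sketch that rebuilds the three reported byte sequences from the escaped spellings of the same characters. The helper name `literal_bytes` is made up for illustration and is not part of the compiler or the standard library.

```rust
/// Hypothetical helper: the raw source bytes of an unescaped char literal,
/// i.e. a single quote, the character's UTF-8 encoding, and a single quote.
fn literal_bytes(c: char) -> Vec<u8> {
    let mut out = vec![b'\''];
    let mut buf = [0u8; 4];
    // `encode_utf8` writes the character into `buf` and returns the used part.
    out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    out.push(b'\'');
    out
}
```

Calling it with the escaped literals `'\n'`, `'\r'`, and `'\''` yields the sequences 0x27 0x0a 0x27, 0x27 0x0d 0x27, and 0x27 0x27 0x27 respectively.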

Kimundi · 2013-07-22T08:44:56Z

"Presumably the manual is the authority and the compiler is wrong to accept" - HA! (Hint: Don't trust either)

In all seriousness... I don't think any of those examples can be seen as illegal. Confusing, yes, but not wrong: you write a ', followed by one Unicode codepoint, either directly embedded as UTF-8 or as an escape sequence, and close with another '.

Because it has to be exactly one codepoint, the parser has no problem with it being the same character used to delimit it, and because ASCII is a subset of utf8, both \n and \r are valid character values.
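The "exactly one codepoint" rule can be sketched as follows. This is only an illustration of the directly embedded case (it does not handle escape sequences), and `is_single_code_point` is a hypothetical name, not how the parser is actually written.

```rust
/// Hypothetical check: does the text between the two quotes consist of
/// exactly one Unicode code point? Escape sequences are not handled here.
fn is_single_code_point(body: &str) -> bool {
    body.chars().count() == 1
}
```

Under this rule a newline, a carriage return, and a single quote each pass, since each is one code point (and one byte in UTF-8), while a two-character body such as `ab` does not.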

sp3d · 2013-07-22T18:02:30Z

The ''' is a less clear case; I'm fine with whichever behavior as long as we make docs and compiler agree.

On the other hand, anyone using a literal newline like that deserves to be shot. Various transports used in the real world for code don't preserve line-endings: ftp, web services like pastebins, IRC, and anything using a "text" rather than "binary" mode in its I/O libs will corrupt literals of the 0x27 0x0a 0x27 or 0x27 0x0d 0x27 format, in some cases resulting in code that won't compile anymore (if it turns the sequence into 0x27 0x0d 0x0a 0x27), but in other cases silently changing semantics by converting the 0x0d into 0x0a. Literals of that form also result in indentation violations, so naïve auto-indent will also break said literals. Therefore, even though they are technically legal at present, it seems insane to leave it that way. Languages like C, Java, etc. similarly disallow unescaped 0x0a and 0x0d char literals.
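The two failure modes can be modelled with a rough sketch. Both functions below are hypothetical stand-ins for what a "text mode" transport might do; real tools differ in the details.

```rust
/// Hypothetical transport that rewrites every bare LF as CRLF.
fn lf_to_crlf(src: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    let mut prev = 0u8;
    for &b in src {
        // Only insert a CR if the LF is not already preceded by one.
        if b == b'\n' && prev != b'\r' {
            out.push(b'\r');
        }
        out.push(b);
        prev = b;
    }
    out
}

/// Hypothetical transport that rewrites CR (and CRLF) as a single LF.
fn cr_to_lf(src: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        if src[i] == b'\r' {
            out.push(b'\n');
            // Collapse CRLF into one LF by skipping the following LF.
            if i + 1 < src.len() && src[i + 1] == b'\n' {
                i += 1;
            }
        } else {
            out.push(src[i]);
        }
        i += 1;
    }
    out
}
```

Applied to 0x27 0x0a 0x27, `lf_to_crlf` produces 0x27 0x0d 0x0a 0x27, which holds two code points between the quotes and so is no longer a valid char literal. Applied to 0x27 0x0d 0x27, `cr_to_lf` produces 0x27 0x0a 0x27, which is still a literal but now denotes a different character.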

bstrie · 2013-07-22T18:18:57Z

Nominating for Well-Defined.

Kimundi · 2013-07-23T09:47:55Z

One thing first: All the things you talked about are also true for string literals, so we need to think about them too.

So, you're right that no one should actually do this, but I don't see that as a reason to forbid only those two. Rust source is UTF-8, so you will have those problems with other byte sequences too.

If you are in a situation where \n and \r cause trouble with external tools: well, don't use them in your source, or change the tools.

But even if it's better to forbid them, it seems arbitrary to only exclude those two codepoints in a literal. What about the other ascii ctrl characters? All the other utf8 sequences that might trip up external tools? A rule like "All non-printable codepoints in the ascii range need to be annotated in escaped form" would at least be better in that case.
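A rule of that shape could be sketched like this. The function name `must_be_escaped` is hypothetical, and the sketch takes "non-printable codepoints in the ASCII range" to mean the ASCII control characters, which is an assumption about the intended meaning.

```rust
/// Hypothetical predicate for the proposed rule: ASCII control characters
/// (U+0000 through U+001F, plus U+007F) may only appear in escaped form.
fn must_be_escaped(c: char) -> bool {
    c.is_ascii_control()
}
```

This covers \n and \r along with tab, NUL, and the rest of the control range, but leaves printable ASCII and all non-ASCII characters alone. Note that the single quote is printable, so the `'''` case would need a separate rule.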

bstrie · 2013-07-23T16:14:17Z

If there's precedent in Java and C for disallowing certain character literals, then that's a reasonable argument for us to disallow them as well. But I say this only because we can cite precedent; on its own, the restriction does seem somewhat arbitrary.

sp3d · 2013-07-24T00:36:41Z

@Kimundi: I agree that we should probably give string literals some related scrutiny. I believe the primary reason unescaped \r and \n are forbidden in character literals in other languages is that character literals are only allowed to span a single line, and these characters are the ones that terminate lines.

@bstrie: None of C, C++, or Java allows unescaped \r or \n in character literals (in C and C++ the interpretation of what constitutes newlines is up to compilers to an extent but gcc and clang behave as described):

From §2.14.3 of the latest C++ draft:
"character-literal:
' c-char-sequence '
u' c-char-sequence '
U' c-char-sequence '
L' c-char-sequence '
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name"

And in the Java SE 7 language spec, §3.10.4:
"CharacterLiteral:
' SingleCharacter '
' EscapeSequence '
SingleCharacter:
InputCharacter but not ' or \"
where (§3.4)
"InputCharacter:
UnicodeInputCharacter but not CR or LF"
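Read together, the two Java productions say that an unescaped character in a character literal may be anything except a single quote, a backslash, CR, or LF. A minimal sketch of that rule, written in Rust with the hypothetical name `is_java_single_character`:

```rust
/// Hypothetical predicate mirroring the quoted Java grammar: a
/// SingleCharacter is any input character except ', \, CR, or LF.
fn is_java_single_character(c: char) -> bool {
    !matches!(c, '\'' | '\\' | '\r' | '\n')
}
```

The quoted C++ `c-char` production has the same shape: it excludes the single quote, the backslash, and the new-line character.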

catamorphism · 2013-09-12T17:23:17Z

Accepted for well-defined

pnkfelix · 2013-09-12T17:23:25Z

cc me

As documented in issue #7945, these literal identifiers are all accepted by Rust today, but they should probably be disallowed (especially `'''`). This changes all escapable sequences to being *required* to be escaped. Closes #7945. I wanted to write the tests with more exact spans, but I think #9308 will be fixing that?
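As an illustration of what the escaped spellings look like, here is a small sketch built on the standard library's `char::escape_default`. The wrapper `escaped_literal` is a hypothetical helper and is not taken from the change itself.

```rust
/// Hypothetical helper: spell a character as a char literal, using the
/// escaped form wherever `char::escape_default` provides one.
fn escaped_literal(c: char) -> String {
    let body: String = c.escape_default().collect();
    format!("'{}'", body)
}
```

For newline, carriage return, and single quote this gives `'\n'`, `'\r'`, and `'\''`, while a plain `a` stays `'a'`. `escape_default` also escapes some characters that a char literal does not strictly need escaped, such as the double quote, so this is a conservative spelling rather than a statement of the exact set the compiler rejects.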

alexcrichton mentioned this issue Sep 19, 2013

Disallow char literals which should be escaped #9335

Merged

alexcrichton closed this as completed in 2661b63 Sep 25, 2013

flip1995 pushed a commit to flip1995/rust that referenced this issue Nov 23, 2021

Auto merge of rust-lang#7945 - Serial-ATA:issue-7934, r=flip1995

6fcdf81

Fix ICE in undocumented_unsafe_blocks changelog: Fix ICE in [`undocumented_unsafe_blocks`] closes: rust-lang#7934
