Invalid char literals are accepted #7945

Closed
sp3d opened this issue Jul 21, 2013 · 8 comments · Fixed by #9335
Comments

@sp3d
Contributor

sp3d commented Jul 21, 2013

The byte sequences "0x27 0x0a 0x27", "0x27 0x0d 0x27", and "0x27 0x27 0x27" (newline, carriage return, and single-quote, respectively, sandwiched between single quotes) are accepted as character literals. The former two are, as far as I can tell, allowed per the manual's description of the language, but would not feature in any sane language; I assume this is merely an oversight. The latter is rejected by the grammar described in the manual but accepted by the compiler. Presumably the manual is the authority and the compiler is wrong to accept '''.
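
For reference, a minimal sketch that pairs the escaped form of each of the three characters with the byte value quoted above. The helper names are hypothetical and exist only for this illustration.

```rust
/// The three characters from the report, written in escaped form,
/// paired with the single byte value each one is expected to denote.
pub fn reported_chars() -> [(char, u8); 3] {
    [('\n', 0x0a), ('\r', 0x0d), ('\'', 0x27)]
}

/// Returns true when every escaped literal denotes the expected byte value.
pub fn escapes_match_bytes() -> bool {
    reported_chars()
        .iter()
        .all(|&(c, b)| c as u32 == u32::from(b))
}
```

Writing the escaped forms keeps the source free of raw line terminators and of a bare quote between quotes.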

@Kimundi
Member

Kimundi commented Jul 22, 2013

"Presumably the manual is the authority and the compiler is wrong to accept" - HA! (Hint: Don't trust either)

In all seriousness... I don't think any of those examples can be seen as illegal. Confusing yes, but not wrong: You write a ', followed by one unicode codepoint, either directly embedded as utf8 or as escaped string, and close with another '.

Because it has to be exactly one codepoint, the parser has no problem with it being the same character used to delimit it, and because ASCII is a subset of utf8, both \n and \r are valid character values.
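
The rule described above can be sketched as a tiny lexer function. This is an illustration only, not the actual rustc lexer: the function name is hypothetical and only a few simple escapes are modeled.

```rust
/// Hypothetical sketch of the permissive rule: an opening `'`, exactly one
/// code point (raw, or written as a simple escape), then a closing `'`.
pub fn lex_char_permissive(src: &str) -> Option<char> {
    let mut it = src.chars();
    if it.next()? != '\'' {
        return None;
    }
    let value = match it.next()? {
        // A backslash introduces an escape; only a handful are modeled here.
        '\\' => match it.next()? {
            'n' => '\n',
            'r' => '\r',
            't' => '\t',
            '\\' => '\\',
            '\'' => '\'',
            _ => return None,
        },
        // Any other code point is taken verbatim, including `'`, LF and CR.
        c => c,
    };
    // A closing quote must follow, and nothing may come after it.
    if it.next()? == '\'' && it.next().is_none() {
        Some(value)
    } else {
        None
    }
}
```

Under this rule the input `'''` is read as the single-quote character, because the middle quote is consumed as the one code point before the closing quote is checked.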

@sp3d
Contributor Author

sp3d commented Jul 22, 2013

The ''' is a less clear case; I'm fine with whichever behavior as long as we make docs and compiler agree.

On the other hand, anyone using a literal newline like that deserves to be shot. Various transports used in the real world for code don't preserve line-endings: ftp, web services like pastebins, IRC, and anything using a "text" rather than "binary" mode in its I/O libs will corrupt literals of the 0x27 0x0a 0x27 or 0x27 0x0d 0x27 format, in some cases resulting in code that won't compile anymore (if it turns the sequence into 0x27 0x0d 0x0a 0x27), but in other cases silently changing semantics by converting the 0x0d into 0x0a. Literals of that form also result in indentation violations, so naïve auto-indent will also break said literals. Therefore, even though they are technically legal at present, it seems insane to leave it that way. Languages like C, Java, etc. similarly disallow unescaped 0x0a and 0x0d char literals.
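
A minimal sketch of the two corruption cases described above, assuming a transport that rewrites LF as CRLF and another that rewrites CR as LF. Both functions are hypothetical models, not any real tool's behavior.

```rust
/// Models a text-mode transport that rewrites every LF as CRLF.
pub fn lf_to_crlf(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    for &b in bytes {
        if b == 0x0a {
            // Insert a CR in front of each LF.
            out.push(0x0d);
        }
        out.push(b);
    }
    out
}

/// Models a transport that rewrites every CR as LF.
pub fn cr_to_lf(bytes: &[u8]) -> Vec<u8> {
    bytes
        .iter()
        .map(|&b| if b == 0x0d { 0x0a } else { b })
        .collect()
}
```

The first conversion turns the three-byte literal into four bytes, which is no longer a single-character literal. The second keeps it three bytes long but changes which character it denotes.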

@bstrie
Contributor

bstrie commented Jul 22, 2013

Nominating for Well-Defined.

@Kimundi
Member

Kimundi commented Jul 23, 2013

One thing first: All the things you talked about are also true for string literals, so we need to think about them too.

So, you're right that no one should actually do this, but I don't see that as a reason to only forbid those two. Rust source is utf8, so you will have those problems with other byte sequences too.

If you are in a situation where \n and \r cause trouble with external tools: Well, don't use them in your source or change the tools.

But even if it's better to forbid them, it seems arbitrary to only exclude those two codepoints in a literal. What about the other ascii ctrl characters? All the other utf8 sequences that might trip up external tools? A rule like "All non-printable codepoints in the ascii range need to be annotated in escaped form" would at least be better in that case.
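
The broader rule suggested above could be sketched as a single predicate. This is an assumption about how "non-printable codepoints in the ascii range" might be interpreted, using the standard library's `char::is_ascii_control`; the function name is hypothetical.

```rust
/// Sketch of the suggested rule: every ASCII control character
/// (U+0000 to U+001F, plus U+007F) must be written in escaped form.
pub fn must_be_escaped(c: char) -> bool {
    c.is_ascii_control()
}
```

This covers LF and CR along with the other control characters, but it says nothing about `'` itself, which would still need a separate decision.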

@bstrie
Contributor

bstrie commented Jul 23, 2013

If there's precedent in Java and C disallowing certain character literals, then that's a reasonable argument for us to disallow them as well. But the only reason I say this is because we can cite precedent, because it does seem somewhat arbitrary.

@sp3d
Contributor Author

sp3d commented Jul 24, 2013

@Kimundi: I agree that we should probably give string literals some related scrutiny. I believe the primary reason they are forbidden in other languages is that character literals are only allowed to span a single line, and these characters are those which terminate lines.

@bstrie: None of C, C++, or Java allows unescaped \r or \n in character literals (in C and C++ the interpretation of what constitutes newlines is up to compilers to an extent but gcc and clang behave as described):

From §2.14.3 of the latest C++ draft:
"character-literal:
’ c-char-sequence ’
u’ c-char-sequence ’
U’ c-char-sequence ’
L’ c-char-sequence ’
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except the single-quote ’, backslash \, or new-line character
escape-sequence
universal-character-name"

And in the Java SE 7 language spec, §3.10.4:
"CharacterLiteral:
' SingleCharacter '
' EscapeSequence '
SingleCharacter:
InputCharacter but not ' or \"
where (§ 3.4)
"InputCharacter:
UnicodeInputCharacter but not CR or LF"
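
The two quoted grammars can be paraphrased as predicates over a single character. This is a sketch under stated assumptions: the function names are hypothetical, the C++ "new-line character" is modeled as LF only, and the Java rule excludes `'` and `\` as in the JLS.

```rust
/// C++ `c-char` per the quoted grammar: any character except `'`, `\`,
/// or new-line. Assumption: new-line is modeled as LF only here.
pub fn cpp_allows_unescaped(c: char) -> bool {
    !matches!(c, '\'' | '\\' | '\n')
}

/// Java `SingleCharacter`: an `InputCharacter` (so not CR or LF)
/// that is also not `'` or `\`.
pub fn java_allows_unescaped(c: char) -> bool {
    !matches!(c, '\'' | '\\' | '\r' | '\n')
}
```

Both predicates reject a raw single quote and a raw LF, which are two of the three literals this issue is about.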

@catamorphism
Contributor

Accepted for well-defined

@pnkfelix
Member

cc me

bors added a commit that referenced this issue Sep 25, 2013
As documented in issue #7945, these literal identifiers are all accepted by rust
today, but they should probably be disallowed (especially `'''`). This changes
all escapable sequences to being *required* to be escaped.

Closes #7945

I wanted to write the tests with more exact spans, but I think #9308 will be fixing that?
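
A rough sketch of what "required to be escaped" could look like in a lexer. This is not the code from the referenced commit: the function name, the error strings, and the exact set of characters (tab, LF, CR, `'`) are assumptions made for illustration.

```rust
/// Hypothetical strict variant: characters that have an escape must use it.
/// The set checked here (tab, LF, CR, `'`) is an assumption.
pub fn lex_char_strict(src: &str) -> Result<char, String> {
    let mut it = src.chars();
    if it.next() != Some('\'') {
        return Err("expected opening quote".to_string());
    }
    let value = match it.next() {
        // A backslash introduces an escape; only simple escapes are modeled.
        Some('\\') => match it.next() {
            Some('n') => '\n',
            Some('r') => '\r',
            Some('t') => '\t',
            Some('\\') => '\\',
            Some('\'') => '\'',
            _ => return Err("unknown escape".to_string()),
        },
        // Raw forms of escapable characters are rejected.
        Some(c) if matches!(c, '\t' | '\n' | '\r' | '\'') => {
            return Err(format!("character U+{:04X} must be escaped", c as u32));
        }
        Some(c) => c,
        None => return Err("unterminated literal".to_string()),
    };
    // A closing quote must follow, and nothing may come after it.
    if it.next() == Some('\'') && it.next().is_none() {
        Ok(value)
    } else {
        Err("expected closing quote".to_string())
    }
}
```

With this rule `'''` and a raw newline between quotes are both errors, while `'\''` and `'\n'` remain valid.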
flip1995 pushed a commit to flip1995/rust that referenced this issue Nov 23, 2021
Fix ICE in undocumented_unsafe_blocks

changelog: Fix ICE in [`undocumented_unsafe_blocks`]

closes: rust-lang#7934