RFC: Syntax for raw string literals #9411

Closed
kballard opened this Issue Sep 22, 2013 · 43 comments

Contributor

kballard commented Sep 22, 2013

A raw string literal is a string literal that does not interpret any embedded sequence, meaning no backslash-escapes. A lot of languages (certainly most that I've used) support some syntax for raw string literals. They're useful for embedding any string that wants to have a bunch of backslashes in it (typically because the function the string is passed to wants to interpret them itself), such as regular expressions. Unfortunately, Rust does not have a raw string literal syntax.

There's been a discussion on the mailing list for the past few days about this. I will try to put a quick summary here.

There are two questions at stake. The first is, should Rust have a raw string literal syntax? The second is, if so, what particular syntax should be used? I think the answer to the first is definitely Yes. It's useful enough, and has enough overwhelming precedent in other languages, that we should add it. The question of concrete syntax is the harder one.

The syntaxes that have been proposed so far, along with their Pros and Cons:

  1. C++11 syntax, e.g. R"delim(raw text)delim".

    Pros:

    • Reasonably straightforward
    • Can embed any character sequence

    Cons:

    • Syntax is slightly complicated (editorial note: I think any syntax that's flexible enough to contain any character is going to be considered slightly complicated).
  2. Python syntax, e.g. r"foo"

    Pros:

    • Simple syntax

    Cons:

    • Can't embed any character sequence.
    • Python's implementation has really wacky handling of backslash escapes in conjunction with the quote character. Even reproducing that behavior does not allow for embedding any sequence, as r"foo\"" evaluates to the string foo\" (with the literal backslash).
  3. D syntax, e.g. r"raw text", `raw text`, or q"(raw text)"/q"delim\nraw text\ndelim"

    Pros:

    • Can embed any character sequence (with the third variant)

    Cons:

    • The first two forms aren't flexible enough, and the third form is a bit confusing. The delimiter behaves differently depending on whether it's a "nesting" delimiter (one of ([<{), another token, or an identifier.
  4. C#/SQL/something else, using a simple raw string syntax such as r"text" where doubling up the quote inserts a single quote, as in r"foo""bar"

    Pros:

    • Simple syntax

    Cons:

    • Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML).
  5. Perl quote-like operators, e.g. q{text}. Unfortunately, most viable delimiters will result in an ambiguous parse.

  6. Ruby quote-like operators, e.g. %q{text}. Unfortunately, this also is ambiguous (with the % token).

  7. Lua syntax, e.g. [=[text]=]

    Pros:

    • Simple syntax
    • Can embed any character sequence

    Cons:

    • Syntax looks decidedly non-string-like
    • Custom delimiters are limited to sequences of =
    • Alex Crichton opined that seeing println!([[Hello, {}!]], "world") in an introduction to Rust would be awfully confusing (see previous point about being non-string-like).
  8. Go syntax, e.g. `raw text`. This is one of the variants of D strings as well

    Pros:

    • Simple syntax

    Cons:

    • Cannot embed any character sequence (notably, cannot embed backtick)
    • It's difficult or impossible to embed backticks in a markdown code sequence, which will make it awkward to use raw strings in markdown editors. May also be confusing with the usage of `foo` in doc comments.
  9. A new syntax using ASCII Control characters STX and ETX

    Pros:

    • I don't think there are any

    Cons:

    • Can't type the characters on any keyboard
    • Text editors probably won't render the characters correctly either
    • Can't technically embed any character sequence, because ETX cannot be embedded, but in fairness it can embed any printable sequence.
  10. A syntax proposed over IRC is delim"raw text"delim.

    Pros:

    • Can embed any character

    Cons:

    • Unusual syntax with no precedent in other languages. Functionally identical to C++11 syntax.
    • Hard to type in Markdown editors

Some form of Heredoc syntax was also suggested, but heredocs are really primarily concerned with embedding multiline input, not raw input. They also have issues around dealing with indentation and the first/last newline.

During this discussion, only two Rust team members (that I'm aware of) chimed in. Alex Crichton raised issues with the Lua syntax, and threw out the suggestion of Go's syntax, though only as something to consider rather than a recommendation. Felix Klock expressed a preference for C++11 syntax, and more generally stated that he wants a syntax with user-delimited sequences. There was also at least one community member in favor of C++11 syntax.

My own preference at this point is for C++11 syntax as well. At the very least, I'd want something similar to C++11 syntax that shares all of its properties, but there seems to be no value in inventing a new syntax when there's precedent in C++11.

Contributor

thestinger commented Sep 22, 2013

I think functionality equivalent to the C++11 syntax is best, but ideally not as noisy. We also need to consider how text editor syntax files will handle it, but I don't think it will be too much of a problem.

Contributor

kballard commented Sep 22, 2013

@thestinger Do you have any suggestions? Functionally equivalent to C++11 requires something of the form <start token><user-supplied delimiter><delimiter end token><raw text><delimiter start token><user supplied delimiter><end token>. In C++11 the <start token> is R", the <delimiter end token> is (, the <delimiter start token> is ), and <end token> is ". I don't think you can remove any of these components without breaking functionality, and I don't think you can adjust the values to produce something less noisy.

The only adjustment I can think of would be to remove <delimiter start/end token> and instead require that the <user supplied delimiter> includes an appropriate punctuation to end it (and to start the close sequence), but if anything that makes it more confusing, not less.
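To make the component breakdown above concrete, here is a minimal sketch of a scanner for the C++11-style form, written in Rust. The function name `parse_cpp11_raw` is hypothetical and this is not code from any compiler; it only assumes the shape described above (`R"`, delimiter, `(`, raw text, `)`, delimiter, `"`).

```rust
/// Splits a C++11-style raw string literal such as `R"delim(raw text)delim"`
/// into its user-supplied delimiter and its raw body.
/// Returns `None` if the input is not a complete literal of that shape.
/// (Hypothetical helper, for illustration only.)
fn parse_cpp11_raw(literal: &str) -> Option<(&str, &str)> {
    // <start token> is `R"`.
    let rest = literal.strip_prefix("R\"")?;
    // <delimiter end token> is the first `(`; everything before it is the delimiter.
    let open = rest.find('(')?;
    let (delim, after_delim) = rest.split_at(open);
    // Skip the `(` itself (one ASCII byte).
    let body_and_close = &after_delim[1..];
    // The closing sequence is `)` + delimiter + `"`.
    let closing = format!("){}\"", delim);
    let end = body_and_close.find(&closing)?;
    // Require the literal to end exactly at the first closing sequence.
    if end + closing.len() != body_and_close.len() {
        return None;
    }
    Some((delim, &body_and_close[..end]))
}

fn main() {
    // Empty delimiter: the body may contain backslashes freely.
    println!("{:?}", parse_cpp11_raw(r#"R"(C:\some\path\)""#));
    // With delimiter `x`, the body may even contain `)"`.
    println!("{:?}", parse_cpp11_raw(r#"R"x(a)"b)x""#));
    // The closing sequence must repeat the delimiter, so this is rejected.
    println!("{:?}", parse_cpp11_raw(r#"R"x(abc)""#));
}
```

The sketch shows why none of the four tokens can be dropped: the `(` is what tells the scanner where the delimiter stops, and the `)` plus delimiter plus `"` is the only sequence that can end the body.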

@ghost

ghost commented Sep 22, 2013

How about R<space><user-supplied delimiter>"<raw text>"<user-supplied delimiter> and R"<raw text>" for the case where a simple " delimiter is sufficient.

R"c:\some\path\"
R eos"raw text"eos

Contributor

kballard commented Sep 22, 2013

@stevenashley The lexer will see the R<space> and tokenize that as an identifier R. Trying to look ahead past spaces is, at best, highly confusing.

@ghost

ghost commented Sep 22, 2013

Ah, of course. I can't think of a substitute for <space> that would simultaneously look nice and parse well. Consider my proposal retracted.

Member

Kimundi commented Sep 23, 2013

How about this: r"" syntax, with the option to pad the string on both ends with #:

foo  ==  r"foo"
fo"o ==  r#"fo"o"#
##   ==  r###"##"###

As far as I know we don't allow # in an expression context, it's only valid as part of the attribute syntax, so this should work.
Heck, it would even be ok in attributes themselves, I think:

#[foo = r##"test"##];

Alternatively, we could also throw away the r token itself and say that any number of # followed by " starts a raw string literal:

let regex = ##"[/s]+"##;

Or we make both forms valid: r"" for short raw strings, ##""## as alternative to cover every possible string.
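As a sketch of the equivalences listed above, assuming a compiler that implements the `r"..."` form with optional `#` padding (current Rust does), each raw literal can be compared against an ordinary escaped literal. The function name `equivalent_pairs` is made up for this example.

```rust
/// Pairs of (raw literal, equivalent ordinary literal with escapes).
/// Hypothetical helper, for illustration only.
fn equivalent_pairs() -> [(&'static str, &'static str); 4] {
    [
        // No padding needed: the text contains no quote.
        (r"foo", "foo"),
        // One `#` on each side lets the text contain a bare quote.
        (r#"fo"o"#, "fo\"o"),
        // Extra padding is allowed even when it is not strictly required.
        (r###"##"###, "##"),
        // Backslashes are kept verbatim in the raw form.
        (r"[\s]+", "[\\s]+"),
    ]
}

fn main() {
    for (raw, escaped) in equivalent_pairs() {
        // Both spellings denote the same string value.
        assert_eq!(raw, escaped);
        println!("{}", raw);
    }
}
```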

Contributor

kballard commented Sep 23, 2013

@Kimundi That looks like a reinvention of Lua's [==[foo]==] syntax. It's certainly workable, but it shares the same problems that Lua syntax does (as pointed out by @alexcrichton).

Owner

alexcrichton commented Sep 23, 2013

It's probably also worth re-mentioning the various use-cases which have a desire for some sort of syntax that is not what we have today:

  1. Regular expressions. These contain lots of backslashes and normally escapes aren't even really that necessary. If we used the normal string syntax, everything would have to be double-escaped which is a pain. The main stickler about this desired syntax is that this would want to be very usable (in the sense that it shouldn't be a pain to write/read the strings of regular expressions, at least no more than it already is).
  2. Literal windows paths. Perhaps these should be done in a different manner to be portable, but regardless having to escape the \ character is a real pain. As with regular expressions, in theory the string syntax isn't difficult to read.
  3. Giant blobs of raw text, such as formatting an HTML document (like what rustdoc does right now). This is different from regular expressions in that they don't need to be so easily readable (because the body of the text is normally very large), so the custom delimiters surrounding the text I believe would be fine in this case.
  4. format! string directives. Right now it's a pain to print a \ character because you have to type println!("\\\\"). As with regular expressions though, this should be easy to read and easy to use (because it may be fairly common). Perhaps this should use a different escape though, which would make this irrelevant.

Those are the use cases that I could think of, others may have more

@ghost

ghost commented Sep 23, 2013

Would case 4 become println!(R"(\\)") ?

Owner

alexcrichton commented Sep 23, 2013

I suppose under the C++ syntax that's what it would be, which is arguably just as confusing as four backslashes.

Member

Kimundi commented Sep 23, 2013

@kballard I would say it's better than Lua's syntax here.

  • It has the same advantage of being able to delimit any text.
  • Only being limited to # is not a problem, you can still find a delimiter sequence for any input.
  • The default case r"" has very low typing overhead, and looks very similar to a regular string literal, no confusion about meaning.

Looking at @alexcrichton's use cases:

  1. Regular expressions: r"([^:]*):(\d+):(\d+): (\d+):(\d+) (.*)$".match_groups()
  2. Windows paths: r"C:\Program Files\rust\bin\rust.exe".to_path()
  3. format! strings: println!(r"\\");
  4. Blobs of text:
static MARKDOWN: &'static str = r###"
## Scope

This is an introductory tutorial for the Rust programming language. It
covers the fundamentals of the language, including the syntax, the
type system and memory model, generics, and modules. [Additional
tutorials](#what-next) cover specific language features in greater
depth.

This tutorial assumes that the reader is already familiar with one or
more languages in the C family. Understanding of pointers and general
memory management techniques will help.
"###;

Owner

alexcrichton commented Sep 23, 2013 edited by sanxiyn

Those all look totally reasonable to me.

@ghost

ghost commented Sep 23, 2013

Me also. I prefer Kimundi's proposed syntax over C++11 syntax. Nicely done.

Owner

huonw commented Sep 23, 2013

(All of these mean the token language is no longer regular, right?)

Contributor

kballard commented Sep 23, 2013

@Kimundi: Regular expression:

r##"(\w+)   # match word chars
    "[^"]*" # followed by a quoted string
    (\d+)   # followed by digits"##.flag("x").match_groups();

# just looks like a very odd character here.

That said, I am not as averse to this syntax as I was initially. While I think it looks weird, and it would feel weird every time I type it, I would be ok with using it.

@huonw I believe you are correct. Is that particularly important?

Member

pnkfelix commented Sep 23, 2013

The restriction to sequences of # has some bad corner cases, where I can imagine one sitting and manually counting, since the eye does not immediately distinguish and/or match { #####, ###### and ##### } the same way it can with { #five# #six# and #five# }.

I would personally prefer C++11 (or any variant that does not restrict the user-selected token sequence to such an impoverished alphabet), and instead leave restrictions (e.g. to ##*) up to an end-developer policy (with checks for particular restrictions available as a lint).

The theoretician in me wants to say "here's a compromise: the end user sequence is strings drawn from a two element alphabet, for example the regexp #(#|_)*, or perhaps even (#|_)*" (Not 100% sure whether the latter is too broad.) Then I still get to write e.g. { #_#, ##_, #_# } which is easier on my eyes than the above encodings of five and six.


But it is not a big deal to me; it's certainly not as important as just having some choose-your-own delimiter option, even if it did end up being solely drawn from strings of #.


(one last note: I realized after I wrote this that I misrepresented kimundi's proposal slightly, since kimundi's proposal is not a mere restriction of the C++11 proposal, so it's not as if we could start with C++11 and just add a lint. But I think the rest of my note holds up. Especially the last part, where I said it's not a big deal to me. :) )

Member

Kimundi commented Sep 23, 2013

@pnkfelix: All fair points, however I think in practice you'd never need to have more than one or two #: It's only necessary to add more if your raw string literally contains "#, "##, "### etc.

@kballard: Likewise, in that example there would be no need for more than one #:

r#"(\w+)   # match word chars
   "[^"]*" # followed by a quoted string
   (\d+)   # followed by digits"#.flag("x").match_groups();

Personally, I'm wary of the "any string as delimiter" approach: It can more easily lead to inconsistencies and style issues because every literal might use a different one.

Restricting it to one character at least restricts the possible variations to one dimension, the length, which people will tend to keep as short as possible. ;)
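The rule for how many # are needed can be written down as a small function. This is a sketch under the assumption that the closing delimiter is a quote followed by exactly as many # as the opener; the names `min_hashes` and `to_raw_literal` are hypothetical, and any other lexer rules are ignored.

```rust
/// Returns the smallest number of `#` characters needed to delimit `text`
/// as a raw string literal of the form r#"..."#.
/// (Hypothetical helper, for illustration only.)
fn min_hashes(text: &str) -> usize {
    let mut needed = 0;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '"' {
            // Count the run of `#` that directly follows this quote.
            let mut run = 0;
            while chars.peek() == Some(&'#') {
                chars.next();
                run += 1;
            }
            // A quote followed by `run` hashes would close a literal padded
            // with `run` or fewer hashes, so one more is required.
            needed = needed.max(run + 1);
        }
    }
    needed
}

/// Wraps `text` in the shortest raw string literal that can hold it.
fn to_raw_literal(text: &str) -> String {
    let pad = "#".repeat(min_hashes(text));
    format!("r{}\"{}\"{}", pad, text, pad)
}

fn main() {
    for text in ["foo", "fo\"o", "##", "say \"#hi\"##"] {
        println!("{} hashes: {}", min_hashes(text), to_raw_literal(text));
    }
}
```

Text without any quote needs no padding at all, and the count only grows when the text itself contains a quote followed by a run of #.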

Member

pnkfelix commented Sep 23, 2013

@huonw's point (that a choose-your-own-delimiter implies non-regular token language) might be important, depending on what our lowest common denominator is for tool support.

E.g. if some IDE only supports regular tokens for its syntax highlighting. (Or a better example: If we don't want to put in the effort necessary to figure out how to handle non-regular languages on all the major IDEs that we hope to support.)

I'll try to bring this up at the weekly meeting on Tuesday, solely to determine whether a regular token language is a hard constraint or not. (That is, I hope to avoid a bikeshed during the meeting...)

@ghost

ghost commented Sep 23, 2013

@huonw Yep. Raw strings are not embeddable within a regular language as it means that the string terminator must also be regular. A document containing a terminator would be unable to be embedded.

I don't think it is a big problem as they are parsable by any regex engine supporting back references and non-greedy matching. For example: [rR](#*)"(.*?)"\1.

A regex that parses #five# etc is a little more complex but still workable. [rR](#*)([^"]*)\1"(.*?)"\1\2\1.
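Where backreferences are not available, the same matching can be done with a short hand-written scanner. The following is a minimal sketch in Rust for the hash-padded form only; the name `scan_raw_string` is hypothetical and this is not the compiler's lexer. The one piece of state it carries, the number of opening hashes, is what a regular token language cannot express.

```rust
/// Scans a raw string literal of the form r"..." or r#"..."# at the start of
/// `input`. Returns the literal's contents and the number of bytes consumed,
/// or `None` if no complete raw string literal starts here.
/// (Hypothetical helper, for illustration only.)
fn scan_raw_string(input: &str) -> Option<(&str, usize)> {
    let rest = input.strip_prefix('r')?;
    // Count the opening hashes; the closer must repeat exactly this many.
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let rest = rest[hashes..].strip_prefix('"')?;
    // The terminator is a quote followed by the same number of hashes.
    let terminator = format!("\"{}", "#".repeat(hashes));
    // The first occurrence of the terminator ends the literal.
    let end = rest.find(&terminator)?;
    // `r` + hashes + opening quote + contents + terminator.
    let consumed = 1 + hashes + 1 + end + terminator.len();
    Some((&rest[..end], consumed))
}

fn main() {
    println!("{:?}", scan_raw_string(r##"r#"fo"o"# + tail"##));
    println!("{:?}", scan_raw_string("r\"foo\""));
    // Unterminated: the closing `"#` never appears.
    println!("{:?}", scan_raw_string("r#\"abc\""));
}
```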

Member

Kimundi commented Sep 23, 2013

@pnkfelix @huonw: You could also just hack around that:
If we pick a syntax that only differs in length, like my proposal, then external tools could hardcode, say, up to five variations. I don't think there are many cases in the wild that embed the string "#####.

Of course, that only "really" works if the failure case is something inconsequential like syntax highlighting failure.

Contributor

kballard commented Sep 23, 2013

@Kimundi Given the number of non-regular languages out there (lots of languages have some equivalent of either raw strings or heredocs), I would be surprised if any tools would need hacks like that at all.

Member

Kimundi commented Sep 23, 2013

@kballard Right, just wanted to throw that out there as fallback workaround. :)

Member

Kimundi commented Sep 23, 2013

Because @pnkfelix alluded to it, and I also got a comment along those lines on IRC:

Even though I'd personally be not in favor of allowing it at all, if we'd want to allow arbitrary delimiter strings anyway, then that'd still be compatible with my proposal: Just allow any string not containing " or ending with whitespace between r# and " (The initial # being needed to make the lexer recognize it as a raw string literal).

Would certainly give good opportunities for self documenting literals:

static RUSTCODE: &'static str = 
r## CODE ##"
fn main() {
    // Example: This uses a raw string literal to embed a Windows-style file path directly.
    println(r"C:\Program Files\rust\bin\rust.exe");
}
"## CODE ##;
Contributor

kballard commented Sep 23, 2013

@Kimundi I think allowing spaces (or any whitespace) is a mistake. Makes it harder to tell what's intentional and what's a typo in the source.

Contributor

steveklabnik commented Sep 23, 2013

Ruby also uses ' to not interpret, and " to interpret.

a = 5
puts "Value of a: #{a}"
# => "Value of a: 5"
puts 'Value of a: #{a}'
# => "Value of a: #{a}"
Contributor

kballard commented Sep 23, 2013

@steveklabnik That syntax is incompatible with parsing lifetimes. If it weren't, I'd have already submitted a PR for supporting 4-char codes using 'FOUR' syntax.

Contributor

steveklabnik commented Sep 23, 2013

@kballard awesome, just wanted to make sure that all of the other implementations were covered in what we're looking at.

Contributor

kballard commented Sep 26, 2013

According to the weekly meeting 2013-09-24, the regular language issue is a non-issue (because of a desire to allow comment nesting, which already makes it non-regular).

Contributor

sp3d commented Sep 27, 2013

I see this as a twofold issue, as 'raw' string literals are really separated into two groups from what I can tell. The use cases described so far are: regexes, which have lots of backslashes; Windows paths, which have lots of backslashes; giant blobs of raw text, which may contain literally anything as often such blobs are generated by other programs or are programs in an unknown-to-rust other language; and format! string directives, which have lots of backslashes.

So for 3/4 cases the only important attribute is a readable way to hold backslashes (which means that regular \-style escaping will not suffice). There are a few good proposals which solve this problem; my favorite is r"foo""bar" syntax where only the " char is handled specially (with doubling as the escape). The listed drawback to this approach is that it "Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML)."

No scheme will pass through verbatim every character in the source sequence for all sequences. The workaround of ensuring that any single character always passes through verbatim except if its context is composed of other characters which comprise the end delimiter is more complicated than an unconditional (character-based rather than sequence-based) escaping scheme and harder to quickly check.

Using r"foo""bar" syntax will also allow, if a user does insert text containing single double quotes, a compile-time failure so that they can fix the string. It's not a 100% solution since someone wanting two adjacent double-quotes (who would only get one back) would not be warned, but it's a very simple syntax which shouldn't take long for users to learn especially if their likely first mistake of using one quote instead of two would cause a compile-time error.

I don't see a strong case for embedding large blobs of text in source, as that practice is poor form in general: editors rarely provide much support for working with arbitrary languages embedded in strings and the approach is increasingly awkward as the blobs grow in size. I would advise against encouraging this antipattern with language workarounds, especially considering that they do not fully solve the problem of escaping (either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step). Using an include! macro to reference separate files to insert data blobs seems a better approach, as it does fully avoid the problems in delimiters/escaping, and the data blobs can then be constructed separately and statically checked for correctness in their native language as part of the build process without having to extract them from Rust code for that purpose. As a language which likes to demonstrate what can be achieved with types I think it would be a shame to see big text blobs being considered idiomatic rust; we should rather discourage stringly-typed data.

Contributor

ben0x539 commented Sep 27, 2013

I'd really prefer to have format string syntax and regex syntax that simply use another escape character (like printf's and lua's %) over overloading \ to serve that role in way different contexts and then requiring people to select the right string literal syntax for every context. I realise that can't address the use case of hardcoded Windows paths. For embedding output by other programs into rust source code, I think it's reasonable to just pipe them through an adaptor first that properly escapes them if an include!() macro isn't appropriate there.

I'd prefer to keep the amount of different options of string literals (and I suppose the complexity of a correct lexer) as low as possible in this case. :(

Contributor

kballard commented Sep 27, 2013

@sp3d

No scheme will pass through verbatim every character in the source sequence for all sequences.

Schemes that use user-controlled delimiters can pass verbatim every possible sequence, merely by modifying the delimiter appropriately.

@ben0x539 We already have fewer string literals than most languages (that is to say, we have one string literal).

Contributor

sp3d commented Sep 27, 2013

@kballard: right, as I discussed--by allowing a wide variety (such as is the case with user-defined ones) of schemes we can get around the fact somewhat, but that will not obviate the need to either update the scheme or escape the contents when making changes: "either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step".

In the interest of keeping the language simple and elegant I would think a solution involving a finite, small number of valid formats for contained text would be ideal, unless there is an as-yet-unmentioned good reason to be placing and maintaining large blobs of a different language inside Rust code.

Contributor

kballard commented Sep 27, 2013

@sp3d "constantly worry about updating delimiters"? You make it sound like the contents of these raw strings change frequently, and with zero predictability. If I'm embedding a section of raw HTML, I'm pretty sure I can come up with a delimiter that's highly unlikely to show up.

The problem with the r"foo""bar" solution is that even for short strings, it really sucks when the string contains many quotes. For example if I need a raw string that contains a snippet of code ["this", "is", "a", "vector", "of", "&str"] then it's pretty bad: r"[""this"", ""is"", ""a"", ""vector"", ""of"", ""&str""]".
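The doubling cost can be checked mechanically. Below is a minimal Rust sketch of the doubled-quote scheme as it is described in this thread; it is a hypothetical encoding that Rust did not adopt, and the names `encode_doubled` and `decode_doubled_body` are invented for the example.

```rust
/// Encodes `text` in the hypothetical doubled-quote scheme:
/// r"..." where every `"` in the text is written as `""`.
fn encode_doubled(text: &str) -> String {
    format!("r\"{}\"", text.replace('"', "\"\""))
}

/// Decodes the body found between the outer quotes of such a literal.
/// A lone `"` is reported as an error carrying its byte offset.
fn decode_doubled_body(body: &str) -> Result<String, usize> {
    let mut out = String::new();
    let mut chars = body.char_indices();
    while let Some((i, c)) = chars.next() {
        if c == '"' {
            // A quote must be immediately followed by a second quote.
            match chars.next() {
                Some((_, '"')) => out.push('"'),
                _ => return Err(i),
            }
        } else {
            out.push(c);
        }
    }
    Ok(out)
}

fn main() {
    let snippet = r#"["this", "is", "a", "vector", "of", "&str"]"#;
    println!("{}", encode_doubled(snippet));
    println!("{:?}", decode_doubled_body("foo\"\"bar"));
    println!("{:?}", decode_doubled_body("foo\"bar"));
}
```

Every quote in the snippet costs one extra character, which is where the noise in the example above comes from. A lone quote in the body is the case that would surface as an error, while two intended adjacent quotes would silently decode to one.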

We could always do something like r(delim)textdelim, for example, r(;)Some text that is terminated by a semicolon;. Not sure that's very readable, but it definitely seems easy to parse, and doesn't require the use of " characters.

Member

pnkfelix commented Oct 1, 2013

@kballard @Kimundi okay, the team gave the go-ahead to implement Kimundi's proposed r" with hash-tally delimited raw-strings. So now the fun begins; I'd be happy to help shepherd a PR through.

Contributor

kballard commented Oct 1, 2013

@pnkfelix Huzzah! I'm quite happy to do the implementation myself

Member

Kimundi commented Oct 1, 2013

I'm currently also looking at the lexer code. If I'm right, this can be done with only local changes to one function. Working at the change atm.

Contributor

kballard commented Oct 1, 2013

@Kimundi Yes I'm pretty sure it can be done in next_token_inner().

Contributor

ben0x539 commented Oct 1, 2013

@Kimundi I have most of a patch already, can we talk/compare notes on irc or so?

@bors bors added a commit that referenced this issue Oct 8, 2013

@bors bors auto merge of #9674 : ben0x539/rust/raw-str, r=alexcrichton
This branch parses raw string literals as in #9411.
c919629
Owner

alexcrichton commented Oct 8, 2013

Closed by #9674, nice work everyone!

@ilammy ilammy added a commit to ilammy/sash that referenced this issue Oct 14, 2015

@ilammy ilammy Scanning raw strings
These strings can contain arbitrary characters and do not process *any*
escape sequences. The only special characters are line endings which are
normalized to \n as in regular strings. Everything else is represented
verbatim.

After careful consideration and studying this thread [1], I have decided
to inherit Rust's syntax for raw strings. Seriously, it's very good:

  - Double quotes as a 'this is a string' marker.

  - Low level of syntactic noise in simple cases.

  - Arbitrary sequences of characters can be embedded by using
    a sufficient number of # characters for padding.

  - Only one dimension of variance: padding length. This gives us
    consistent syntax and makes it easier for humans to recognize
    the raw strings in text.

Thank you, Kimundi, for your brilliance.

Though, the usage of # for padding may be reconsidered in Sash as I intend
to use # in so-called 'multipart identifiers' to adopt mixfix call syntax.
It may be better to choose some other character to not overload the #.

Also, raw strings do report bare CR characters as regular strings do.

[1] rust-lang/rust#9411
da2a580

boosh commented Jan 3, 2017

r#""# really was a poor choice of delimiter. Didn't anyone think people might want to quote HTML which potentially contains loads of '"#' substrings? 👎

Contributor

jonas-schievink commented Jan 3, 2017

@boosh You can use an arbitrary number of # on both sides
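As a small illustration of that point, here is a sketch with a made-up HTML snippet (the function name `html_snippet` is hypothetical). The text contains `"#`, so a single # on each side would end the literal early, while two are enough.

```rust
/// Returns HTML that contains the sequence `"#` several times.
/// (Made-up content, for illustration only.)
fn html_snippet() -> &'static str {
    // `"#` appears inside the text, so the literal is closed by `"##` instead.
    r##"<a href="#top">Back to top</a> <a href="#">Home</a>"##
}

fn main() {
    println!("{}", html_snippet());
}
```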

boosh commented Jan 3, 2017

@jonas-schievink Ah great, thank you! I thought it was strange it hadn't been considered 👍
