RFC: Syntax for raw string literals #9411

Closed
kballard opened this Issue Sep 22, 2013 · 43 comments

Contributor

kballard commented Sep 22, 2013

A raw string literal is a string literal that does not interpret any embedded sequence, meaning no backslash-escapes. A lot of languages (certainly most that I've used) support some syntax for raw string literals. They're useful for embedding any string that wants to have a bunch of backslashes in it (typically because the function the string is passed to wants to interpret them itself), such as regular expressions. Unfortunately, Rust does not have a raw string literal syntax.

There's been a discussion on the mailing list for the past few days about this. I will try to put a quick summary here.

There are two questions at stake. The first is, should Rust have a raw string literal syntax? The second is, if so, what particular syntax should be used? I think the answer to the first is definitely Yes. It's useful enough, and has enough overwhelming precedent in other languages, that we should add it. The question of concrete syntax is the harder one.

The syntaxes that have been proposed so far, along with their Pros and Cons:

  1. C++11 syntax, e.g. R"delim(raw text)delim".

    Pros:

    • Reasonably straightforward
    • Can embed any character sequence

    Cons:

    • Syntax is slightly complicated (editorial note: I think any syntax that's flexible enough to contain any character is going to be considered slightly complicated).
  2. Python syntax, e.g. r"foo"

    Pros:

    • Simple syntax

    Cons:

    • Can't embed any character sequence.
    • Python's implementation has really wacky handling of backslash escapes in conjunction with the quote character. Even reproducing that behavior does not allow for embedding any sequence, as r"foo\"" evaluates to the string foo\" (with the literal backslash).
  3. D syntax, e.g. r"raw text", `raw text`, or q"(raw text)"/q"delim\nraw text\ndelim"

    Pros:

    • Can embed any character sequence (with the third variant)

    Cons:

    • The first two forms aren't flexible enough, and the third form is a bit confusing. The delimiter behaves differently depending on whether it's a "nesting" delimiter (one of ([<{), another token, or an identifier.
  4. C#/SQL/something else, using a simple raw string syntax such as r"text" where doubling up the quote inserts a single quote, as in r"foo""bar"

    Pros:

    • Simple syntax

    Cons:

    • Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML).
  5. Perl quote-like operators, e.g. q{text}. Unfortunately, most viable delimiters will result in an ambiguous parse.

  6. Ruby quote-like operators, e.g. %q{text}. Unfortunately, this also is ambiguous (with the % token).

  7. Lua syntax, e.g. [=[text]=]

    Pros:

    • Simple syntax
    • Can embed any character sequence

    Cons:

    • Syntax looks decidedly non-string-like
    • Custom delimiters are limited to sequences of =
    • Alex Crichton opined that seeing println!([[Hello, {}!]], "world") in an introduction to Rust would be awfully confusing (see previous point about being non-string-like).
  8. Go syntax, e.g. `raw text`. This is one of the variants of D strings as well

    Pros:

    • Simple syntax

    Cons:

    • Cannot embed any character sequence (notably, cannot embed backtick)
    • It's difficult or impossible to embed backticks in a markdown code sequence, which will make it awkward to use raw strings in markdown editors. May also be confusing with the usage of `foo` in doc comments.
  9. A new syntax using ASCII Control characters STX and ETX

    Pros:

    • I don't think there are any

    Cons:

    • Can't type the keys on any keyboard
    • Text editors probably won't render the characters correctly either
    • Can't technically embed any character sequence, because ETX cannot be embedded, but in fairness it can embed any printable sequence.
  10. A syntax proposed over IRC is delim"raw text"delim.

    Pros:

    • Can embed any character

    Cons:

    • Unusual syntax with no precedent in other languages. Functionally identical to C++11 syntax.
    • Hard to type in Markdown editors

Some form of Heredoc syntax was also suggested, but heredocs are really primarily concerned with embedding multiline input, not raw input. They also have issues around dealing with indentation and the first/last newline.

During this discussion, only two Rust team members (that I'm aware of) chimed in. Alex Crichton raised issues with the Lua syntax, and threw out the suggestion of Go's syntax, though only as something to consider rather than a recommendation. Felix Klock expressed a preference for C++11 syntax, and more generally stated that he wants a syntax with user-delimited sequences. There was also at least one community member in favor of C++11 syntax.

My own preference at this point is for C++11 syntax as well. At the very least, something similar to C++11 syntax, that shares all of its properties, but there seems to be no value in inventing a new syntax when there's precedent in C++11.

thestinger Sep 22, 2013

Contributor

I think functionality equivalent to the C++11 syntax is best, but ideally not as noisy. We also need to consider how text editor syntax files will handle it, but I don't think it will be too much of a problem.

kballard Sep 22, 2013

Contributor

@thestinger Do you have any suggestions? Functionally equivalent to C++11 requires something of the form <start token><user-supplied delimiter><delimiter end token><raw text><delimiter start token><user supplied delimiter><end token>. In C++11 the <start token> is R", the <delimiter end token> is (, the <delimiter start token> is ), and <end token> is ". I don't think you can remove any of these components without breaking functionality, and I don't think you can adjust the values to produce something less noisy.

The only adjustment I can think of would be to remove <delimiter start/end token> and instead require that the <user supplied delimiter> includes an appropriate punctuation to end it (and to start the close sequence), but if anything that makes it more confusing, not less.
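
To make the five components above concrete, here is a minimal sketch of a scanner for the C++11 form, written in Rust. The function name `scan_cpp11_raw` is hypothetical, and the sketch ignores the limits real C++11 places on the delimiter's length and character set; it only illustrates how the start token, delimiter, and parentheses fit together.

```rust
/// Scans a C++11-style raw string `R"delim(raw text)delim"` at the start of
/// `src` and returns `(delimiter, raw text)`. Hypothetical helper for
/// illustration; it does not validate the delimiter itself.
fn scan_cpp11_raw(src: &str) -> Option<(&str, &str)> {
    // <start token> is R"
    let rest = src.strip_prefix("R\"")?;
    // <delimiter end token> is the first '('
    let open = rest.find('(')?;
    // <user-supplied delimiter> is everything before it (may be empty)
    let delim = &rest[..open];
    let tail = &rest[open + 1..];
    // <delimiter start token> ')' + delimiter + <end token> '"'
    let closer = format!("){}\"", delim);
    let end = tail.find(&closer)?;
    Some((delim, &tail[..end]))
}

fn main() {
    // The input literals are themselves written as Rust raw strings.
    let (delim, body) = scan_cpp11_raw(r#"R"delim(raw text)delim""#).unwrap();
    println!("delimiter = {:?}, body = {:?}", delim, body);

    // The body may contain )" as long as the delimiter does not follow it.
    let (_, body) = scan_cpp11_raw(r#"R"x(a)"b)x""#).unwrap();
    println!("body = {:?}", body);
}
```

Dropping any one of the five pieces leaves the scanner unable to tell where the delimiter stops or where the raw text ends.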

ghost Sep 22, 2013

How about R<space><user-supplied delimiter>"<raw text>"<user-supplied delimiter> and R"<raw text>" for the case where a simple " delimiter is sufficient.

R"c:\some\path\"
R eos"raw text"eos

kballard Sep 22, 2013

Contributor

@stevenashley The lexer will see the R<space> and tokenize that as an identifier R. Trying to look ahead past spaces is, at best, highly confusing.

ghost Sep 22, 2013

Ah, of course. I can't think of a substitute for <space> that would simultaneously look nice and parse well. Consider my proposal retracted.

Kimundi Sep 23, 2013

Member

How about this: r"" syntax, with the option to pad the string on both ends with #:

foo  ==  r"foo"
fo"o ==  r#"fo"o"#
##   ==  r###"##"###

As far as I know we don't allow # in an expression context, it's only valid as part of the attribute syntax, so this should work.
Heck, it would even be ok in attributes themselves, I think:

#[foo = r##"test"##];

Alternatively, we could also throw away the r token itself and say that any number of # followed by " starts a raw string literal:

let regex = ##"[/s]+"##;

Or we make both forms valid: r"" for short raw strings, ##""## as alternative to cover every possible string.
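
As a quick check of the r-prefixed forms in the table above, the following sketch compares each one with its escaped spelling. It assumes the `r"..."` / `r#"..."#` form as accepted by current Rust compilers; the helper name `padded_examples` is hypothetical, and the bare `##"..."##` variant is left out because it is only a suggestion here.

```rust
/// Pairs of (raw literal, equivalent escaped literal) for the padded forms.
/// Hypothetical helper, used only to keep the comparisons in one place.
fn padded_examples() -> Vec<(&'static str, &'static str)> {
    vec![
        // No padding needed: the text contains no quote.
        (r"foo", "foo"),
        // One '#' lets the text contain a bare quote.
        (r#"fo"o"#, "fo\"o"),
        // Extra padding is allowed even when it is not strictly required.
        (r###"##"###, "##"),
        // A trailing backslash does not escape the closing quote.
        (r"c:\some\path\", "c:\\some\\path\\"),
    ]
}

fn main() {
    for (raw, escaped) in padded_examples() {
        assert_eq!(raw, escaped);
        println!("{}", raw);
    }
}
```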

kballard Sep 23, 2013

Contributor

@Kimundi That looks like a reinvention of Lua's [==[foo]==] syntax. It's certainly workable, but it shares the same problems that Lua syntax does (as pointed out by @alexcrichton).

alexcrichton Sep 23, 2013

Member

It's probably also worth re-mentioning the various use-cases which have a desire for some sort of syntax that is not what we have today:

  1. Regular expressions. These contain lots of backslashes and normally escapes aren't even really that necessary. If we used the normal string syntax, everything would have to be double-escaped which is a pain. The main stickler about this desired syntax is that this would want to be very usable (in the sense that it shouldn't be a pain to write/read the strings of regular expressions, at least no more than it already is).
  2. Literal Windows paths. Perhaps these should be done in a different manner to be portable, but regardless having to escape the \ character is a real pain. As with regular expressions, in theory the string syntax isn't difficult to read.
  3. Giant blobs of raw text, such as formatting an HTML document (like what rustdoc does right now). This is different from regular expressions in that they don't need to be so easily readable (because the body of the text is normally very large), so the custom delimiters surrounding the text would, I believe, be fine in this case.
  4. format! string directives. Right now it's a pain to print a \ character because you have to type println!("\\\\"). As with regular expressions though, this should be easy to read and easy to use (because it may be fairly common). Perhaps this should use a different escape though, which would make this irrelevant.

Those are the use cases that I could think of; others may have more.

ghost Sep 23, 2013

Would case 4 become println!(R"(\\)") ?

alexcrichton Sep 23, 2013

Member

I suppose under the C++ syntax that's what it would be, which is arguably just as confusing as four backslashes.

Kimundi Sep 23, 2013

Member

@kballard I would say it's better than Lua's syntax here.

  • It has the same advantage of being able to delimit any text.
  • Only being limited to # is not a problem, you can still find a delimiter sequence for any input.
  • The default case r"" has very low typing overhead, and looks very similar to a regular string literal, no confusion about meaning.

Looking at @alexcrichton's use cases:

  1. Regular expressions: r"([^:]*):(\d+):(\d+): (\d+):(\d+) (.*)$".match_groups()
  2. Windows paths: r"C:\Program Files\rust\bin\rust.exe".to_path()
  3. format! strings: println!(r"\\");
  4. Blobs of text:
static MARKDOWN: &'static str = r###"
## Scope

This is an introductory tutorial for the Rust programming language. It
covers the fundamentals of the language, including the syntax, the
type system and memory model, generics, and modules. [Additional
tutorials](#what-next) cover specific language features in greater
depth.

This tutorial assumes that the reader is already familiar with one or
more languages in the C family. Understanding of pointers and general
memory management techniques will help.
"###;

alexcrichton Sep 23, 2013

Member

Those all look totally reasonable to me.
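
For reference, the use-case examples above can be checked against the escaped spellings they replace. This is a minimal sketch assuming the `r"..."` / `r#"..."#` form as current Rust compilers accept it; `use_cases` is a hypothetical helper, and the HTML snippet is an invented example of text that needs two `#`.

```rust
/// Pairs of (raw literal, escaped literal) for the use cases discussed above.
/// Hypothetical helper for illustration only.
fn use_cases() -> Vec<(&'static str, &'static str)> {
    vec![
        // Regular expression: every backslash would otherwise be doubled.
        (r"(\d+):(\d+) (.*)$", "(\\d+):(\\d+) (.*)$"),
        // Windows path.
        (
            r"C:\Program Files\rust\bin\rust.exe",
            "C:\\Program Files\\rust\\bin\\rust.exe",
        ),
        // Text containing a quote followed by '#': one '#' of padding would
        // end the literal early at "#top, so two are used.
        (r##"<a href="#top">up</a>"##, "<a href=\"#top\">up</a>"),
    ]
}

fn main() {
    for (raw, escaped) in use_cases() {
        assert_eq!(raw, escaped);
        println!("{}", raw);
    }
}
```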

ghost Sep 23, 2013

Me also. I prefer Kimundi's proposed syntax over C++11 syntax. Nicely done.

huonw Sep 23, 2013

Member

(All of these mean the token language is no longer regular, right?)

kballard Sep 23, 2013

Contributor

@Kimundi: Regular expression:

r##"(\w+)   # match word chars
    "[^"]*" # followed by a quoted string
    (\d+)   # followed by digits"##.flag("x").match_groups();

# just looks like a very odd character here.

That said, I am not as averse to this syntax as I was initially. While I think it looks weird, and it would feel weird every time I type it, I would be ok with using it.

@huonw I believe you are correct. Is that particularly important?

pnkfelix Sep 23, 2013

Member

The restriction to sequences of # has some bad corner cases, where I can imagine one sitting and manually counting, since the eye does not immediately distinguish and/or match { #####, ###### and ##### } the same way it can with { #five# #six# and #five# }.

I would personally prefer C++11 (or any variant that does not restrict the user-selected token sequence to such an impoverished alphabet), and instead leave restrictions (e.g. to ##*) up to an end-developer policy (with checks for particular restrictions available as a lint).

The theoretician in me wants to say "here's a compromise: the end user sequence is strings drawn from a two element alphabet, for example the regexp #(#|_)*, or perhaps even (#|_)*" (Not 100% sure whether the latter is too broad.) Then I still get to write e.g. { #_#, ##_, #_# } which is easier on my eyes than the above encodings of five and six.


But it is not a big deal to me; it's certainly not as important as just having some choose-your-own delimiter option, even if it did end up being solely drawn from strings of #.


(one last note: I realized after I wrote this that I misrepresented kimundi's proposal slightly, since kimundi's proposal is not a mere restriction of the C++11 proposal, so it's not as if we could start with C++11 and just add a lint. But I think the rest of my note holds up. Especially the last part, where I said it's not a big deal to me. :) )

Kimundi Sep 23, 2013

Member

@pnkfelix: All fair points, however I think in practice you'd never need to have more than one or two #: It's only necessary to add more if your raw string literally contains "#, "##, "### etc.

@kballard: Likewise, in that example there would be no need for more than one #:

r#"(\w+)   # match word chars
   "[^"]*" # followed by a quoted string
   (\d+)   # followed by digits"#.flag("x").match_groups();

Personally, I'm wary of the "any string as delimiter" approach: It can more easily lead to inconsistencies and style issues because every literal might use a different one.

Restricting it to one character at least restricts the possible variations to one dimension, the length, and that is something people will tend to keep as short as possible. ;)
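
The rule stated above (more # are only needed when the text contains "#, "##, and so on) can be written down directly. The sketch below is one way to compute the smallest padding for a given text; `min_hashes` and `to_raw_literal` are hypothetical names, and the code assumes the `r#"..."#` form exactly as proposed in this thread.

```rust
/// Returns the smallest number of '#' needed so that `text` can be written
/// as a raw string literal without ending early. Hypothetical helper.
fn min_hashes(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut needed = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == b'"' {
            // Count the '#' characters immediately following this quote.
            let run = bytes[i + 1..].iter().take_while(|&&c| c == b'#').count();
            // A quote followed by `run` hashes would close any literal padded
            // with `run` or fewer hashes, so at least run + 1 are required.
            needed = needed.max(run + 1);
        }
    }
    needed
}

/// Wraps `text` in the shortest raw string literal that can hold it.
fn to_raw_literal(text: &str) -> String {
    let hashes = "#".repeat(min_hashes(text));
    format!("r{0}\"{1}\"{0}", hashes, text)
}

fn main() {
    // The commented regex from earlier in the thread contains quotes, but
    // none of them is followed by '#', so a single '#' is enough.
    let regex = "(\\w+)   # match word chars\n    \"[^\"]*\" # followed by a quoted string";
    assert_eq!(min_hashes(regex), 1);

    println!("{}", to_raw_literal("foo"));
    println!("{}", to_raw_literal("fo\"o"));
    println!("{}", to_raw_literal("a\"#b"));
}
```

Note that a text consisting only of `##` needs no padding at all under this rule, since it contains no quote.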

pnkfelix Sep 23, 2013

Member

@huonw's point (that a choose-your-own-delimiter implies non-regular token language) might be important, depending on what our lowest common denominator is for tool support.

E.g. if some IDE only supports regular tokens for its syntax highlighting. (Or a better example: If we don't want to put in the effort necessary to figure out how to handle non-regular languages on all the major IDEs that we hope to support.)

I'll try to bring this up at the weekly meeting on Tuesday, solely to determine whether a regular token language is a hard constraint or not. (That is, I hope to avoid a bikeshed during the meeting...)

ghost Sep 23, 2013

@huonw Yep. Raw strings are not embeddable within a regular language as it means that the string terminator must also be regular. A document containing a terminator would be unable to be embedded.

I don't think it is a big problem as they are parsable by any regex engine supporting back references and non-greedy matching. For example: [rR](#*)"(.*?)"\1.

A regex that parses #five# etc is a little more complex but still workable. [rR](#*)([^"]*)\1"(.*?)"\1\2\1.
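
The same matching can also be done by hand in a lexer without backreferences: count the opening #, then search for a quote followed by the same count. A minimal sketch, assuming the #-only delimiter form and a lowercase r prefix; `scan_raw_string` is a hypothetical name, not part of any existing lexer.

```rust
/// Scans a raw string literal of the form r"..." or r#"..."# (any number of
/// '#') at the start of `src`. Returns the literal's contents and the number
/// of bytes consumed, or None if no complete raw string literal starts there.
fn scan_raw_string(src: &str) -> Option<(&str, usize)> {
    let rest = src.strip_prefix('r')?;
    // Count the opening '#' characters; remembering this count is exactly
    // what a regular token language cannot do.
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let rest = rest[hashes..].strip_prefix('"')?;
    // The terminator is a quote followed by the same number of '#'.
    let terminator = format!("\"{}", "#".repeat(hashes));
    let end = rest.find(&terminator)?;
    // r + hashes + opening quote + body + terminator
    let consumed = 1 + hashes + 1 + end + terminator.len();
    Some((&rest[..end], consumed))
}

fn main() {
    let src = "r#\"fo\"o\"#.len()";
    if let Some((body, consumed)) = scan_raw_string(src) {
        println!("body = {:?}, remaining source = {:?}", body, &src[consumed..]);
    }
}
```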

Kimundi Sep 23, 2013

Member

@pnkfelix @huonw: You could also just hack around that:
If we pick a syntax that only differs in length, like my proposal, then external tools could hardcode, say, up to five variations. I don't think there are many cases in the wild that embed the string "#####.

Of course, that only "really" works if the failure case is something inconsequential like syntax highlighting failure.

kballard Sep 23, 2013

Contributor

@Kimundi Given the number of non-regular languages out there (lots of languages have some equivalent of either raw strings or heredocs), I would be surprised if any tools would need hacks like that at all.

Kimundi Sep 23, 2013

Member

@kballard Right, just wanted to throw that out there as fallback workaround. :)

Kimundi Sep 23, 2013

Member

Because @pnkfelix alluded to it, and I also got a comment along those lines on IRC:

Even though I'd personally not be in favor of allowing it at all, if we wanted to allow arbitrary delimiter strings anyway, that would still be compatible with my proposal: Just allow any string not containing " or ending with whitespace between r# and " (The initial # being needed to make the lexer recognize it as a raw string literal).

Would certainly give good opportunities for self documenting literals:

static RUSTCODE: &'static str = 
r## CODE ##"
fn main() {
    // Example: This uses a string raw literal to embed an windows-style file path directly.
    println(r"C:\Program Files\rust\bin\rust.exe");
}
"## CODE ##;

kballard Sep 23, 2013

Contributor

@Kimundi I think allowing spaces (or any whitespace) is a mistake. Makes it harder to tell what's intentional and what's a typo in the source.

steveklabnik Sep 23, 2013

Member

Ruby also uses ' to not interpret, and " to interpret.

a = 5
puts "Value of a: #{a}"
# => "Value of a: 5"
puts 'Value of a: #{a}'
# => "Value of a: #{a}"

kballard Sep 23, 2013

Contributor

@steveklabnik That syntax is incompatible with parsing lifetimes. If it weren't, I'd have already submitted a PR for supporting 4-char codes using 'FOUR' syntax.

steveklabnik Sep 23, 2013

Member

@kballard awesome, just wanted to make sure that all of the other implementations were covered in what we're looking at.

kballard Sep 26, 2013

Contributor

According to the weekly meeting 2013-09-24, the regular language issue is a non-issue (because of a desire to allow comment nesting, which already makes it non-regular).

Contributor

kballard commented Sep 26, 2013

According to the weekly meeting 2013-09-24, the regular language issue is a non-issue (because of a desire to allow comment nesting, which already makes it non-regular).

Contributor

sp3d commented Sep 27, 2013

I see this as a twofold issue, as 'raw' string literals are really separated into two groups from what I can tell. The use cases described so far are: regexes, which have lots of backslashes; Windows paths, which have lots of backslashes; giant blobs of raw text, which may contain literally anything as often such blobs are generated by other programs or are programs in an unknown-to-rust other language; and format! string directives, which have lots of backslashes.

So for 3/4 cases the only important attribute is a readable way to hold backslashes (which means that regular \-style escaping will not suffice). There are a few good proposals which solve this problem; my favorite is r"foo""bar" syntax where only the " char is handled specially (with doubling as the escape). The listed drawback to this approach is that it "Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML)."

No scheme will pass through verbatim every character in the source sequence for all sequences. The workaround of ensuring that any single character always passes through verbatim except if its context is composed of other characters which comprise the end delimiter is more complicated than an unconditional (character-based rather than sequence-based) escaping scheme and harder to quickly check.

Using r"foo""bar" syntax will also allow, if a user does insert text containing single double quotes, a compile-time failure so that they can fix the string. It's not a 100% solution since someone wanting two adjacent double-quotes (who would only get one back) would not be warned, but it's a very simple syntax which shouldn't take long for users to learn especially if their likely first mistake of using one quote instead of two would cause a compile-time error.

I don't see a strong case for embedding large blobs of text in source, as that practice is poor form in general: editors rarely provide much support for working with arbitrary languages embedded in strings and the approach is increasingly awkward as the blobs grow in size. I would advise against encouraging this antipattern with language workarounds, especially considering that they do not fully solve the problem of escaping (either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step). Using an include! macro to reference separate files to insert data blobs seems a better approach, as it does fully avoid the problems in delimiters/escaping, and the data blobs can then be constructed separately and statically checked for correctness in their native language as part of the build process without having to extract them from Rust code for that purpose. As a language which likes to demonstrate what can be achieved with types I think it would be a shame to see big text blobs being considered idiomatic rust; we should rather discourage stringly-typed data.
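To make the doubling scheme discussed above concrete, here is a minimal sketch of how a lexer might scan such a literal. This is not rustc code and the syntax was never adopted; the function name `scan_doubled_quote_literal` and its return shape are assumptions for illustration. It expects the source text immediately after the opening `r"`.

```rust
/// Scans the body of a hypothetical `r"..."` literal in which `""` stands
/// for one `"` and a lone `"` terminates the literal.
/// `src` is the source text immediately after the opening `r"`.
/// Returns the decoded string and the number of bytes consumed
/// (including the closing quote), or `None` if the literal is unterminated.
fn scan_doubled_quote_literal(src: &str) -> Option<(String, usize)> {
    let mut out = String::new();
    let mut chars = src.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c != '"' {
            // Backslashes and everything else pass through untouched.
            out.push(c);
            continue;
        }
        match chars.peek() {
            // `""` is the escape for a single quote character.
            Some(&(_, '"')) => {
                chars.next();
                out.push('"');
            }
            // A lone `"` closes the literal.
            _ => return Some((out, i + 1)),
        }
    }
    None
}
```

Under this sketch, text pasted in with single double quotes ends the literal at the first lone quote, so the leftover characters would most likely fail to parse as ordinary code. That is the compile-time failure described above. Text that already contains two adjacent quotes is silently decoded to one, which is the case that would go unnoticed.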

Contributor

ben0x539 commented Sep 27, 2013

I'd really prefer to have format string syntax and regex syntax that simply use another escape character (like printf's and lua's %) over overloading \ to serve that role in way different contexts and then requiring people to select the right string literal syntax for every context. I realise that can't address the use case of hardcoded Windows paths. For embedding output by other programs into rust source code, I think it's reasonable to just pipe them through an adaptor first that properly escapes them if an include!() macro isn't appropriate there.

I'd prefer to keep the amount of different options of string literals (and I suppose the complexity of a correct lexer) as low as possible in this case. :(

Contributor

kballard commented Sep 27, 2013

@sp3d

No scheme will pass through verbatim every character in the source sequence for all sequences.

Schemes that use user-controlled delimiters can pass verbatim every possible sequence, merely by modifying the delimiter appropriately.

@ben0x539 We already have fewer string literals than most languages (that is to say, we have one string literal).

Contributor

sp3d commented Sep 27, 2013

@kballard: right, as I discussed--by allowing a wide variety (such as is the case with user-defined ones) of schemes we can get around the fact somewhat, but that will not obviate the need to either update the scheme or escape the contents when making changes: "either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step".

In the interest of keeping the language simple and elegant I would think a solution involving a finite, small number of valid formats for contained text would be ideal, unless there is an as-yet-unmentioned good reason to be placing and maintaining large blobs of a different language inside Rust code.

Contributor

kballard commented Sep 27, 2013

@sp3d "constantly worry about updating delimiters"? You make it sound like the contents of these raw strings change frequently, and with zero predictability. If I'm embedding a section of raw HTML, I'm pretty sure I can come up with a delimiter that's highly unlikely to show up.

The problem with the r"foo""bar" solution is that even for short strings, it really sucks when the string contains many quotes. For example if I need a raw string that contains a snippet of code ["this", "is", "a", "vector", "of", "&str"] then it's pretty bad: r"[""this"", ""is"", ""a"", ""vector"", ""of"", ""&str""]".

campadrenalin commented Sep 30, 2013

We could always do something like r(delim)textdelim, for example, r(;)Some text that is terminated by a semicolon;. Not sure that's very readable, but it definitely seems easy to parse, and doesn't require the use of " characters.

Member

pnkfelix commented Oct 1, 2013

@kballard @Kimundi okay, the team gave the go-ahead to implement Kimundi's proposed r" with hash-tally delimited raw-strings. So now the fun begins; I'd be happy to help shepherd a PR through.
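For reference, this is roughly what the approved hash-delimited form looks like in present-day Rust. The function names are illustrative only; the literals reuse examples from earlier in the thread (a regex-style pattern, a Windows path, and the vector snippet).

```rust
/// A regex-style pattern: backslashes are kept verbatim in a raw string.
fn version_pattern() -> &'static str {
    r"\d+\.\d+"
}

/// A Windows-style path (the path itself is a made-up example).
fn example_path() -> &'static str {
    r"C:\Users\example\notes.txt"
}

/// With one `#` on each side, the body may contain bare double quotes.
fn vector_snippet() -> &'static str {
    r#"["this", "is", "a", "vector", "of", "&str"]"#
}
```

Each of these is equal to the ordinary literal with every backslash and quote escaped, e.g. `version_pattern()` is the same string as `"\\d+\\.\\d+"`.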

Contributor

kballard commented Oct 1, 2013

@pnkfelix Huzzah! I'm quite happy to do the implementation myself

Member

Kimundi commented Oct 1, 2013

I'm currently also looking at the lexer code. If I'm right, this can be done with only local changes to one function. Working on the change atm.

Contributor

kballard commented Oct 1, 2013

@Kimundi Yes I'm pretty sure it can be done in next_token_inner().
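As a rough idea of why the change can stay local to one scanning function, here is a standalone sketch of recognising a hash-delimited raw string. It is not the actual rustc lexer code; `scan_raw_string` and its return type are assumptions made for illustration, and error reporting is reduced to returning `None`.

```rust
/// Scans one raw string literal at the start of `src`.
/// Returns the literal's body and the total number of bytes consumed,
/// or `None` if `src` does not start with a well-formed raw string.
fn scan_raw_string(src: &str) -> Option<(&str, usize)> {
    let rest = src.strip_prefix('r')?;
    // Count the opening `#` characters; zero is allowed (plain `r"..."`).
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let rest = rest[hashes..].strip_prefix('"')?;
    // The terminator is a quote followed by the same number of `#`.
    let terminator = format!("\"{}", "#".repeat(hashes));
    let end = rest.find(terminator.as_str())?;
    // `r` + opening hashes + opening quote + body + terminator.
    let consumed = 1 + hashes + 1 + end + terminator.len();
    Some((&rest[..end], consumed))
}
```

No escape processing happens anywhere in the body; the only thing the scanner looks for is the first quote followed by the right number of `#` characters.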

Contributor

ben0x539 commented Oct 1, 2013

@Kimundi I have most of a patch already, can we talk/compare notes on irc or so?

bors added a commit that referenced this issue Oct 8, 2013

auto merge of #9674 : ben0x539/rust/raw-str, r=alexcrichton
This branch parses raw string literals as in #9411.

Member

alexcrichton commented Oct 8, 2013

Closed by #9674, nice work everyone!

ilammy added a commit to ilammy/sash that referenced this issue Oct 14, 2015

Scanning raw strings
These strings can contain arbitrary characters and do not process *any*
escape sequences. The only special characters are line endings which are
normalized to \n as in regular strings. Everything else is represented
verbatim.

After careful consideration and studying this thread [1], I have decided
to inherit Rust's syntax for raw strings. Seriously, it's very good:

  - Double quotes as a 'this is a string' marker.

  - Low level of syntactic noise in simple cases.

  - Arbitrary sequences of characters can be embedded by using
    a sufficient number of # characters for padding.

  - Only one dimension of variance: padding length. This gives us
    consistent syntax and makes it easier for humans to recognize
    the raw strings in text.

Thank you, Kimundi, for your brilliance.

Though, the usage of # for padding may be reconsidered in Sash as I intend
to use # in so-called 'multipart identifiers' to adopt mixfix call syntax.
It may be better to choose some other character to not overload the #.

Also, raw strings do report bare CR characters as regular strings do.

[1] rust-lang/rust#9411

boosh commented Jan 3, 2017

r#""# really was a poor choice of delimiter. Didn't anyone think people might want to quote HTML which potentially contains loads of '"#' substrings? 👎

Contributor

jonas-schievink commented Jan 3, 2017

@boosh You can use an arbitrary number of # on both sides
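A small sketch of how one might pick the number of `#` for a given body, such as HTML containing `"#`. The helper names `hashes_needed` and `to_raw_literal` are hypothetical, not part of any standard tooling; the rule they implement is simply that the delimiter must be longer than any run of `#` that directly follows a `"` in the body.

```rust
/// Returns the smallest number of `#` characters needed on each side so that
/// `body` can be written as a Rust raw string literal without ending early.
fn hashes_needed(body: &str) -> usize {
    let mut needed: usize = 0;
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '"' {
            // Count the run of `#` that directly follows this quote.
            let mut run: usize = 0;
            while chars.peek() == Some(&'#') {
                chars.next();
                run += 1;
            }
            // The delimiter must be longer than any such run.
            needed = needed.max(run + 1);
        }
    }
    needed
}

/// Wraps `body` in the shortest raw string literal that can hold it.
fn to_raw_literal(body: &str) -> String {
    let hashes = "#".repeat(hashes_needed(body));
    format!("r{0}\"{1}\"{0}", hashes, body)
}
```

For the body `<a href="#top">`, this yields `r##"<a href="#top">"##`: one `#` would end the literal at `"#top`, two is enough.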

boosh commented Jan 3, 2017

@jonas-schievink Ah great, thank you! I thought it was strange it hadn't been considered 👍
