Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement raw strings (r#"foo"#) #9674

Merged
merged 7 commits into from Oct 8, 2013
Merged

implement raw strings (r#"foo"#) #9674

merged 7 commits into from Oct 8, 2013

Conversation

ben0x539
Copy link
Contributor

@ben0x539 ben0x539 commented Oct 2, 2013

This branch parses raw string literals as in #9411.

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 2, 2013

I'm submitting this PR because I'd like some review and comments on whether I'm going in the right direction here.

What bothers me is described in the commit message for 06eee38 (edit: removed that commit): I want the pretty-printer to reproduce the right style of string literal, but I hesitate to thread that extra info through all the mangling and data structures between the parser itself and what ends up in the AST. Would that be the right way to go?

@pnkfelix
Copy link
Member

pnkfelix commented Oct 2, 2013

@ben0x539

My initial thought is that maybe the pretty printer should be in charge of deciding for itself, based on the content of the string, what kind of string literal is best.

E.g. if the author didn't "need" to use a raw string, then it seems to me like the pretty-printer could emit a normal string literal. And likewise, if the author painfully put in a string with many escapes, the pretty-printer might scan it for any embedded doublequotes, and if so, then the longest occurrence of "##... (which will hopefully be none at all), and use that to determine what kind of raw string to produce.

(This would raise some policy questions like "how many escapes is too many" (as in, what is the threshold where I should switch to using a raw string). Off the top of my head, I would say more than two escapes implies print raw-string. Of course this would hopefully be easy to change after the fact.)


Also, @kballard and @Kimundi will probably want to weigh in. :)

@pnkfelix
Copy link
Member

pnkfelix commented Oct 2, 2013

addendum to my prior suggestion: My suggested policy was a bit naive; e.g. I am not sure I would want "\n\n\n" or "\t\t\t" to be pretty-printed as a raw-string. And unicode escapes may also be exceptions of note (though I don't know if the pretty-printer actually prints those via escape sequences currently anyway). So maybe it would be better to count up the number of escapes for things like \\ and \", and if that is more than two, then use a raw-string. Anyway, I trust you to work out these sorts of details.

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 2, 2013

I suppose pretty-printing outputting the precise input sequence is already not viable since it's not recorded anywhere whether something like a newline a string resulted from a literal newline or an escape sequence (also \x, \u, \U, other special characters). Inspecting spans would also fall short in the ast bits where there's no lit_str, just another kind of node with strs in it (though maybe they should be changed to carry a lit_ around and only extract the string later on?)

There's so many ways to encode the same string, I'm not sure how far a heuristic would go towards finding the "ideal" string literal every time. :(

@pnkfelix
Copy link
Member

pnkfelix commented Oct 2, 2013

@ben0x539 IMO the pretty-printer already breaks down pretty badly in a number of cases (e.g. the way that comments get moved around), so I do not think you should invest too much effort in trying to come up with the perfect answer here.

My main piece of advice: get the pretty printer to the point where it can emit either variant of string literal, and then factor how the choice is made into a cleanly separated policy. Then the team can refine the policy later as we gain more experience.

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 2, 2013

I have a change that would make the pretty printer emit whichever is the shorter kind of string literal. I'm not sure about that because that doesn't affect the --pretty normal mode which doesn't re-render literals but just gets them out of another run of the tokenizer via syntax::parse::comments from what I can tell.

With that change (added above), only the lexer and the parser are aware of the two kinds of string literals and everybody else just gets ast::lit_str nodes, which the pretty-printer selectively renders as either kind of literal in the --pretty expanded etc modes.

// option. This file may not be copied, modified, or distributed
// except according to those terms.

//error-pattern:unterminated raw string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to use the //~ ERROR syntax here to make sure it's put in the right spot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that it's not possible because the //~ ERROR syntax must follow the syntax error, while the string necessarily can't be followed by a comment or it wouldn't be unterminated. Maybe there is a non-obvious way to get it right, but I haven't found it. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does using the //~^^... ERROR variant, where you use as many ^ as needed to count backwards to the offending line, help here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine that any comment after the opening " will be included in the string if it isn't terminated (like here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Unless the compile-test runner just checks for //~^...^ ERROR as a string match, rather than anything lexical?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. That actually works. I take everything back.

@alexcrichton
Copy link
Member

This looks good to me, and I would be willing to r+ this. I'd like to get at least one other opinion though. This is a pretty major change and I want to make sure we've covered all our bases in making sure it's well integrated.

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 3, 2013

history-edited the unterminated string literal test to use //~^^ ERROR. thanks, @pnkfelix / @huonw!

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 3, 2013

rebased to replace loop with continue

escaped in order to denote *itself*.

Raw string literals do not process any escapes. They start with the character
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth putting an example or two here? e.g. r"foo\bar" r####"string "### still in the string"#### or something.

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 3, 2013

@huonw I added a few examples in rust.md and changed the error message to "only # is allowed in raw string delimitation; found illegal character: <whatever>"


"foo #\"# bar", r##"foo #"# bar"## // foo #"# bar

"\x2603", "☃", r"☃" // ☃
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be \u2603.

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 3, 2013

@huonw fixed, thanks

@@ -353,6 +353,10 @@ whose literals are written between single quotes, as in `'x'`.
Just like C, Rust understands a number of character escapes, using the backslash
character, such as `\n`, `\r`, and `\t`. String literals,
written between double quotes, allow the same escape sequences.

Raw string literals on the other hand do not process any escape sequences.
They are written as `r##"blah"##`, with any (matching) number of `#` on either
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"optionally with a matching number of # on each side", or something else to make it clear that zero #s is allowed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or perhaps: "with any non-negative number of # characters on each side" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the phrasing "zero or more" to "non-negative". The idea of a negative number of characters is just confusing.

@brson
Copy link
Contributor

brson commented Oct 4, 2013

I haven't read the details here but @pnkfelix I think I disagree that the pretty-printer should decide whether to use raw strings or not. Rather I think it would be more consistent if we store exactly what was written, as we do with constants, to write out during pretty-printing.

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 4, 2013

I agree it's more consistent.

Any chance we could change the extern "Rust" fn() syntax to something like extern(Rust) fn() though? ;)

@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 5, 2013

As per @brson above, try to store the kind of string literal in the ast. Seems to work in expression context, extern r"Rust" ABI decls, extern mod blah = r"blah" position, with extra fields in a bunch of ast nodes storing an enum StrStyle { CookedStr, RawStr(uint) } that are mostly ignored or blindly reproduced outside of the pretty printer. I hope I didn't miss string literals impliclty carried in the ast elsewhere.

The asm!() syntax extension half-works; the pretty printer will reproduce the asm code as a raw string if it was given as such. I wasn't sure whether adding another StrStyle field for all the other positions for string literals in the asm!() parameters so I'm only allowing non-raw strings there for now. Tempted to suggest the syntax extension could parse things like =r(0) instead of "r"(0). :}

I apologize for holding up this feature for so long. This probably still isn't ready to merge, I don't feel very good about how the pretty-printing of raw string literals impacts the AST and its consumers and I'm still trying to think of a better way to handle it.

Raw string literals are lexed into regular string literals. This is okay
for them to "work" and be usable/testable, but the pretty-printer does
not know about them yet and will just emit regular string literals.
Treat it as a synonym for LIT_STR for now.
For the benefit of the pretty printer we want to keep track of how
string literals in the ast were originally represented in the source
code.

This commit changes parser functions so they don't extract strings from
the token stream without at least also returning what style of string
literal it was. This is stored in the resulting ast node for string
literals, obviously, for the package id in `extern mod = r"package id"`
view items, for the inline asm in `asm!()` invocations.

For `asm!()`'s other arguments or for `extern "Rust" fn()` items, I just
the style of string, because it seemed disproportionally cumbersome to
thread that information through the string processing that happens with
those string literals, given the limited advantage raw string literals
would provide in these positions.

The other syntax extensions don't seem to store passed string literals
in the ast, so they also discard the style of strings they parse.
@ben0x539
Copy link
Contributor Author

ben0x539 commented Oct 8, 2013

@alexcrichton thinks we can get away without pretty-printing ABI names in extern "C" etc as the right kind of string, so I'm removing that (fairly invasive) commit and just letting the pretty printer print those strings, as well as the latter asm! parameters, as regular non-raw strings. Some examples for the behavior as of this push: http://ix.io/8nr becomes http://ix.io/8ns with --pretty normal. With --pretty expanded, we get http://ix.io/8nt. That the "Rust" in the extern "Rust" fn disappears is the same behavior as on master.

bors added a commit that referenced this pull request Oct 8, 2013
This branch parses raw string literals as in #9411.
bors added a commit that referenced this pull request Oct 8, 2013
This branch parses raw string literals as in #9411.
@bors bors closed this Oct 8, 2013
@bors bors merged commit d7dfe0a into rust-lang:master Oct 8, 2013
flip1995 pushed a commit to flip1995/rust that referenced this pull request Nov 21, 2022
Fix `needless_borrow` false positive

The PR fixes the false positive exposed by `@BusyJay's` example in: rust-lang/rust-clippy#9111 (comment)

The current approach is described in rust-lang/rust-clippy#9674 (comment) and rust-lang/rust-clippy#9674 (comment).

The original approach appears below.

---

The proposed fix is to flag only "simple" trait implementations involving references, a concept
that I introduce next.

Intuitively, a trait implementation is "simple" if all it does is dereference and apply the trait
implementation of a type named by a type parameter. `AsRef` provides a good example of a simple
implementation: https://doc.rust-lang.org/std/convert/trait.AsRef.html#impl-AsRef%3CU%3E-for-%26T

We can make this idea more precise as follows. Given a trait implementation, first determine
whether the implementation is "used defined." If so, then examine its nested obligations.
Consider the implementation simple if-and-only-if:
- there is at least one nested obligation for the same trait
- for each type `X` in the nested obligation's substitution, either `X` is the same as that of
  the original obligation's substitution, or the original type is `&X`

For example, the following implementation from `@BusyJay's` example is "complex" (i.e., not simple)
because it produces no nested obligations:

```rust
impl<'a> Extend<&'a u8> for A { ... }
```

On the other hand, the following slightly modified implementation is simple, because it produces
a nested obligation for `Extend<X>`:

```rust
impl<'a, X> Extend<&'a X> for A where A: Extend<X> { ... }
```

How does flagging only simple implementations help? One way of interpreting the false positive in
`@BusyJay's` example is that it separates a reference from a concrete type. Doing so turns a
successful type inference into a failing one. By flagging only simple implementations, we
separate references from type variables only, thereby eliminating this class of false positives.

Note that `Deref` is a special case, as the obligations generated for it already involve the
underlying type.

r? `@Jarcho` (Sorry to keep pinging you with `needless_borrow` stuff. But my impression is no one knows this code better than you.)

changelog: fix `needless_borrow` false positive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants