Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"-NUM.method()" is parsed in surprising ways #117155

Open
RalfJung opened this issue Oct 25, 2023 · 30 comments · May be fixed by #117161 or #121364
Open

"-NUM.method()" is parsed in surprising ways #117155

RalfJung opened this issue Oct 25, 2023 · 30 comments · May be fixed by #117161 or #121364
Labels
A-grammar Area: The grammar of Rust A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. C-feature-request Category: A feature request, i.e: not implemented / a PR.

Comments

@RalfJung
Copy link
Member

This came up here but I am pretty sure I ran into this before: Rust parses -5.5f32.method() as -(5.5f32.method()). I don't know if that is the parsing result that anyone expects, I certainly would never expect it -5.5f32 is a float literals, and that is what I want to apply the method to, I never even considered that the compiler might interpret this as a unary minus applied to the method result.

Assuming that we cannot change the parser, I think we should spend some effort defending against the possible confusion here. For instance:

  • lint against -5.5f32.method() (suggesting to write -(5.5f32.method()) if that's really what one meant)
  • have rustfmt add the parentheses that make precedence explicit
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 25, 2023
@jhorstmann
Copy link
Contributor

This also made me question my sanity some time ago, a lint would have been very helpful. The code in question was

assert_eq!(1_i32, -1_i32.rem_euclid(12)) // should be (-i_i32)

@fmease fmease added the A-grammar Area: The grammar of Rust label Oct 25, 2023
@Urgau Urgau linked a pull request Oct 25, 2023 that will close this issue
@saethlin saethlin added A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. C-feature-request Category: A feature request, i.e: not implemented / a PR. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Oct 25, 2023
@clarfonthey
Copy link
Contributor

clarfonthey commented Oct 26, 2023

I would personally be very surprised if -x.method() and -1.method() parsed differently, and it feels proper to me that -(x.method()) would be the way it's parsed.

This matches similar precedence rules in maths, like how -2x is -(2x).

Another good test is if adding 0 to the beginning would change the result: 0-x.method() shouldn't parse as (0-x).method(), so why should -x.method() be different?

@RalfJung
Copy link
Member Author

In math I would insist on parentheses since -2^x looks ambiguous.

Another good test is if adding 0 to the beginning would change the result:

That's not at all how think about negative integer literals. -2 is just the mathematical integer negative-two, and having -2.method() split up the literal is totally unexpected.

It's great that your intuition aligns with the current parser, but that doesn't mean that everybody's intuition aligns with the parser. I literally thought the gamma_ln function had a bug (see the comment linked in the OP); it didn't even occur to me that the Rust parser could dis-associate the - from my integer literal -- that possibility was so outlandish it didn't enter my mind. So clearly this is ambiguous, and the right thing to do is add parentheses.

@chfogelman
Copy link
Contributor

I too would be very surprised if it was not parsed the way it currently is. If -5.method() were parsed as (-5).method(), then how would we expect - 5.method() to be parsed? You either end up having very unintuitive behavior, where - 5.method() is parsed as (-5).method(), or you end up in a situation where whitespace affects the way an expression is parsed.

To my knowledge, every language with both unary minus and a . operator does exactly what Rust currently does here.

I also don't think it's fair to say that the compiler is disassociating the - from the literal. The lexer has always treated - as a separate token from whatever follows it.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 26, 2023 via email

@chfogelman
Copy link
Contributor

Even ignoring what the lexer does, my point is that for many other humans, especially those who work in mathematics or computing, -5 is a sequence of two atoms that obey standard operator precedence and just happen not to have a space between them. For those of us in that group, anything other than the current behavior would be confusing. I don't necessarily disagree that it can create confusion, though I do think that changing the parser would have been a step too far.

As a side note, rustfmt currently reformats let _ = - 5.method() to let _ = -5.method(). Should this be another angle by which we can reduce confusion?

@Ghabriel
Copy link

The way it's currently being parsed feels very natural and correct to me as well. It not only matches similar rules in math as pointed out earlier but is also how it works in every language that allows an equivalent syntax. Adding extra parentheses in situations like this would just feel like visual noise to me, personally.

@clarfonthey
Copy link
Contributor

Note that there are cases where a negative literal is treated as a fused literal, in particular in cases with the literal macro matcher. In these cases, it was weird that 1.0 would be accepted but not -1.0, which makes sense from a syntax perspective but not from a user perspective.

I don't really think that any of the proposed suggestions for negation really work well, though. Adding an extra space looks like a typo, and adding extra parentheses also makes rewriting code more obnoxious since it's a non-local change (adding a negative sign requires adding a closing parenthesis at the end of the expression).

Even though the precedence rules are well-motivated, parentheses exist regardless and Rust even has lints that encourage using them even in unambiguous situations, like converting x && y || z to (x && y) || z.

So, I do think that at minimum there could be an allow-by-default lint suggesting one or more of these methods, and/or style rules for rustfmt.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 26, 2023

Even ignoring what the lexer does, my point is that for many other humans, especially those who work in mathematics or computing, -5 is a sequence of two atoms that obey standard operator precedence and just happen not to have a space between them.

As a data point, I work in computing. Programming languages are literally my job.

I guess we'd need a study to see how many "many" is. 🤷

though I do think that changing the parser would have been a step too far.

It's not realistic to change the parser so we agree on the outcome here, if not on the reasons.

It not only matches similar rules in math

I'm not aware of math having well-defined operator precedence for cases like this. Do you have a source for this claim?

Even though the precedence rules are well-motivated, parentheses exist regardless and Rust even has lints that encourage using them even in unambiguous situations, like converting x && y || z to (x && y) || z.

Yeah, that's why I was thinking a lint would make sense here. FWIW those cases are ambiguous if you can't remember whether "and" or "or" binds stronger. And in languages like bash, they actually bind equally strong (and are left-associative), so this is inconsistent across languages.

I don't think it is realistic for everyone to remember all the precedence rules, so I agree with the compiler that the parentheses should be added here to avoid ambiguity.

@jdahlstrom
Copy link

jdahlstrom commented Oct 26, 2023

for many other humans, especially those who work in mathematics or computing, -5 is a sequence of two atoms

I doubt that many mathematicians would consider "-5" anything but a single unit, the negative integer that is the additive inverse of 5, just like they wouldn't consider "½" a division but the rational number that's the multiplicative inverse of 2.

@chfogelman
Copy link
Contributor

Note that there are cases where a negative literal is treated as a fused literal, in particular in cases with the literal macro matcher.

Interestingly, $lit:literal will even match unfused literals like - 1. The parser follows standard operator precedence, and only then raises negated literal expressions to be literals themselves.

As someone who can't imagine this expression being parsed any other way, I'm curious how everyone else views whitespace. My assumption has always been that I can add arbitrary whitespace around any operator and get the same results. So if you find the way this expression is parsed confusing, how would you mentally parse the following expressions?

-1.method()
- 1.method()
-x.method()
- x.method()

@clarfonthey
Copy link
Contributor

clarfonthey commented Oct 27, 2023

It not only matches similar rules in math

I'm not aware of math having well-defined operator precedence for cases like this. Do you have a source for this claim?

Even though this wasn't directed at me, I figured I'd say that my primary argument is based upon the exponent rules (which always bind exponents before unary minus in mathematics), and the fact that function calls are generally put at the highest precedence above even that. Note that these rules are variable, for example, some programs like Excel will bind the unary minus before exponents, but I consider this unnatural since it goes against standard mathematics. In primary school, we were explicitly taught that expressions like -2x were a potentially confusing case under these rules, although if you consider unary minus the same as any other arithmetic operator, it makes sense that it would have lower precedence.

Function calls having the highest precedence has also been the standard since C, and I would imagine that methods would have similarly high precedence. Since unary minus doesn't have to apply to just numeric literals, I would always imagine it to be parsed separately, since as shown, the precedence of -1.method() even mattering would depend on - being lexed as a separate operator.

@not-an-aardvark
Copy link
Contributor

From the peanut gallery: there's an additional complication for minimum signed integer values:

  • The literal -128i8 is correctly interpreted as -128.
  • This is not a result of applying the unary minus operator to the literal 128i8, which would be ill-formed because 128i8 is out of range. Intuitively, this suggests that -128i8 is interpreted as a single literal.
  • However, a method call like -128i8.pow(1) is parsed as -((128i8).pow(1)), resulting in an out-of-range error.
  • It seems bizarre for a literal to become out of range as a result of adding a method call. It's not just a question about whether the - operator has confusing precedence -- until the method call is added at the end, the - evidently isn't acting as a unary operator at all.

@chfogelman
Copy link
Contributor

This is not a result of applying the unary minus operator to the literal 128i8, which would be ill-formed because 128i8 is out of range. Intuitively, this suggests that -128i8 is interpreted as a single literal.

The parser parses everything according to standard precedence rules, treating - as unary minus. So this gets parsed as something like Unary(Neg, Literal("128i8")). The compiler then treats this entire unit as a literal for purposes of overflow checks and such. It's not that -128_i8 is interpreted as a single literal; it's that Unary(Neg, Literal(_)) is treated as a sort of "honorary literal" after parsing. This is why let x = - 128i8; doesn't result in an error either.

However, a method call like -128i8.pow(1) is parsed as -((128i8).pow(1)), resulting in an out-of-range error.

How does your brain parse - 128i8.pow(1)?

It seems bizarre for a literal to become out of range as a result of adding a method call.

You've added another operator (.) with a higher associativity, which changes what the - is being applied to. As an analogy, 2 + 2 is 4, but 2 + 2 * 3 is not 12. Adding the * 3 changes what 2 + is being applied to.

@clarfonthey
Copy link
Contributor

clarfonthey commented Oct 27, 2023

However, a method call like -128i8.pow(1) is parsed as -((128i8).pow(1)), resulting in an out-of-range error.

While this makes sense, it's certainly surprising and I never thought of this case. I definitely agree that situations like this should have a warn-by-default lint at minimum, and maybe there can be a specific list of methods where this particular parsing rule has "weird" results, like this one.

@chfogelman
Copy link
Contributor

maybe there can be a specific list of methods where this particular parsing rule has "weird" results, like this one.

To clarify, the method being called doesn't matter here. It's a compile-time error because the compiler can't create an 128_i8 to call the method on. -128_i8.to_string() results in the same error.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 27, 2023

My assumption has always been that I can add arbitrary whitespace around any operator and get the same results. So if you find the way this expression is parsed confusing, how would you mentally parse the following expressions?

My intuition parses -5.method() different from - 5.method(). This makes sense since -5 is the atom "integer: negative five", while - 5 is unary minus applied to 5. Those two terms evaluate to the same result but they are not the same term. (Yes I do realize that's not how the Rust parser works.)

I'd say that my primary argument is based upon the exponent rules (which always bind exponents before unary minus in mathematics),

Yeah you mentioned that above. I wasn't aware of this rule, but Wikipedia confirms. Wikipedia also lists some exceptions where the rule does not apply, though.

Anyway, while we can go on discussing which version is "more intuitive", that's not quite the point of this issue. The point of this issue is that the current parsing is ambiguous and surprising to some people. There's two outcomes here:

  1. this is ambiguous for so few people that it's not worth doing anything about
  2. this is ambiguous for sufficiently many people that we should do something about it (a lint and/or rustfmt rule)

I prefer outcome 2, but I don't have the data to prove that it's ambiguous for "sufficiently many" people. In this issue we seem to have a fairly even split, but of course this is far from representative.

Function calls having the highest precedence has also been the standard since C,

C doesn't have method calls so the issue can't even arise there? But I take it this means Java and C++ parse this like Rust. That's a good argument to keep the parser as-is, but it's not an argument against a lint IMO.

@clarfonthey
Copy link
Contributor

C actually does have method calls since you can put function pointers in structs, although since function application has a higher precedence than field access, you need to add parentheses anyway. Although -(x.y)() would still be parsed as -((x.y)()).

I should clarify my position here: I'm fully in support of adding rules to both rustfmt and rust, it's mostly whether they'd be enabled by default that is my particular concern.

@RalfJung
Copy link
Member Author

I should clarify my position here: I'm fully in support of adding rules to both rustfmt and rust, it's mostly whether they'd be enabled by default that is my particular concern.

Thanks for clarifying. I do think the lint should be warn-by-default, since otherwise it won't actually help people that run into this without realizing the possibility of running into this (like my past self).

@traviscross
Copy link
Contributor

traviscross commented Oct 27, 2023

How languages handle exponentiation is a close analogy. If a language were to treat -2 as a single atom, then -2**2 would equal 4.

However, in Fortran (1957-), Perl, Python, and Ruby, -2**2 == -4.

In Haskell, -2^2 == -4 (and -2**2 == -4.0).

And in Rust, assert_eq!(-2i8.pow(2), -4).

Linting may seem somewhat opinionated given the precedent and history.

We should also think about symmetry with the other unary operator, !. Should these lint?

#[test]
fn test() {
    assert_eq!(127, !2u8.pow(7));
    assert_eq!(false, !2u8.pow(7).is_power_of_two());
}

(I would not expect linting here. But I would expect consistency with - in any case.)

@traviscross
Copy link
Contributor

Mathematically, adding a negative sign is taught as being exactly equivalent to a multiplication by negative one. That's at least one justification for why $-2^2 = -4$ in common notation.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 27, 2023

Linting may seem somewhat opinionated given the precedent and history.

And how may people programming in these languages know this rule? In a Python code review I would definitely insist on adding parentheses to -2**2.

Should these lint?

No. These are clearly not literals.

But I would expect consistency with - in any case.

! and - are very different though. - exists as part of the syntax for literals like -5, unlike !. Even the rustc parser recognizes this difference in its treatment for -128u8.

@traviscross
Copy link
Contributor

traviscross commented Oct 27, 2023

As @chfogelman described, the rustc parser treats -128i8 in an ordinary manner, but later stages of the compiler then treat it specially for overflow checks. Note that -(128i8) also works.

@chfogelman
Copy link
Contributor

I definitely support warn-by-default. Parsing is invisible to the user, so if you don't know that operator precedence is the issue, it'll take a very long time to figure out.

I don't think ! should be linted on. Unlike -, prefix ! doesn't have baggage from mathematics or everyday life.

Note that -(128i8) also works.

Yep, and so does (((((-(((((128i8)))))))))). The AST doesn't carry any information about parentheses.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 27, 2023

As @chfogelman described, the rustc parser treats -128i8 in an ordinary manner, but later stages of the compiler then treat it specially for overflow checks. Note that -(128i8) also works.

It's not just overflow checks. It's part of turning the tokens into an AST. -128 becomes a constant with value -128, !128 becomes the unary operator ! applied to the constant 128. So there's a fundamental difference between - and ! baked pretty deep into the rust parser (or a later stage during AST/HIR/MIR construction).

@traviscross
Copy link
Contributor

traviscross commented Oct 27, 2023

As @chfogelman described, the rustc parser treats -128i8 in an ordinary manner, but later stages of the compiler then treat it specially for overflow checks. Note that -(128i8) also works.

It's not just overflow checks. It's part of turning the tokens into an AST. -128 becomes a constant with value -128, !128 becomes the unary operator ! applied to the constant 128. So there's a fundamental difference between - and ! baked pretty deep into the rust parser (or a later stage during AST/HIR/MIR construction).

At the level of AST generation, the literal is always 128i8, regardless of whether the input is -128i8 or !128i8. The AST for these is exactly the same except that Neg is replaced by Not.

Proof

We can see this with rustc +nightly -Z unpretty=ast-tree.

fn main() {
    _ = -128i8;
}

Becomes:

Expr {
    kind: Unary(
        Neg,
        Expr {
            kind: Lit(
                Lit {
                    kind: Integer,
                    symbol: "128",
                    suffix: Some(
                        "i8",
                    ),
                },
            ),
            attrs: [],
            tokens: None,
        },
    ),
    attrs: [],
    tokens: None,
},

And this:

fn main() {
    _ = !128i8;
}

Becomes:

Expr {
    kind: Unary(
        Not,
        Expr {
            kind: Lit(
                Lit {
                    kind: Integer,
                    symbol: "128",
                    suffix: Some(
                        "i8",
                    ),
                },
            ),
            attrs: [],
            tokens: None,
        },
    ),
        },
    ),
    attrs: [],
    attrs: [],
    tokens: None,
},

Here's one place where there is special handling in the compiler: compiler/rustc_lint/src/types.rs:414

fn lint_int_literal<'tcx>(
...
) {
...
    if (negative && v > max + 1) || (!negative && v > max) {
    ...
    }
}

Looking at:

fn main() {
    dbg!(!127i8);
    dbg!(-128i8);
}

In the MIR, these both end up as const i8::MIN in release mode. In the HIR, these are both still just !127i8 or -128i8.

@chfogelman
Copy link
Contributor

chfogelman commented Oct 28, 2023

@traviscross -128i8 starts being treated as a fused literal during MIR lowering:

hir::ExprKind::Unary(hir::UnOp::Neg, ref arg) => {
if self.typeck_results().is_method_call(expr) {
let arg = self.mirror_expr(arg);
self.overloaded_operator(expr, Box::new([arg]))
} else if let hir::ExprKind::Lit(ref lit) = arg.kind {
ExprKind::Literal { lit, neg: true }
} else {
ExprKind::Unary { op: UnOp::Neg, arg: self.mirror_expr(arg) }
}
}

Basically, nothing in the entire compiler treats !127 as a literal, because it doesn't have to. - needs special treatment because it's used to form i*::MIN literals, but ! never forms part of a literal, not in the AST, not in the HIR, not in everyday life.

I can get behind linting -1.method() because of this special treatment of -, but since ! doesn't get any special treatment, I don't see the point in linting on !1.method().

@traviscross
Copy link
Contributor

@chfogelman: Interesting. Thanks for pointing to where that happens during MIR lowering.

Agreed ! doesn't get any special treatment because mathematically it doesn't need it. The question is whether that special treatment, motivated by the desire to make one number per type work in a convenient way, is enough to justify different treatment for the two otherwise similar unary operators. It's an interesting question to consider.

@clarfonthey
Copy link
Contributor

I think that no matter what here, the solution shouldn't be motivated by what the parser actually does here, and instead by what people expect the parser should do. The reason why I personally would rather not format -1.method() differently is because I expect it to work a certain way and what it does matches my expectations.

However, my expectations are not everyone else's expectations, and that's reason to look into this. Explaining what the parser does here ultimately doesn't make a difference because we're not talking about the current implementation, but how we expect the parser to act regardless of implementation.

@ankane
Copy link
Contributor

ankane commented Nov 16, 2023

fwiw, Ruby parses it the way that's expected in the original issue (different from Rust):

-2.abs
#=> 4

x = 2
-x.abs
#=> -4

@Urgau Urgau linked a pull request Feb 20, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-grammar Area: The grammar of Rust A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. C-feature-request Category: A feature request, i.e: not implemented / a PR.
Projects
None yet