-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ambiguity in Rascal grammar in an expression with both ^
and $
does not arise from a semantic ambiguity
#1200
Comments
- Ensure that ambiguity errors that arise from the Rascal grammar itself
appears as syntax errors in the IDE. Given the code I've seen, I'd guess
that the lack of reporting might be generic to all ambiguity errors. For
example, there are the same symptoms with this production:
lexical AnotherLine = ^ Contents ? ;
I think there may be a logging level for it but I haven't seen this issue
come up in a while and I don't remember if or where you might find it in
the documentation. You may try to look at the logging method as well..
…On Tue, Jul 17, 2018, 10:09 PM Eric ***@***.***> wrote:
The following grammar fails to compile. WholeLine fails to compile
because it triggers an ambiguity in the Rascal grammar, which then has no
case statement to handle the ambiguity node in sym2symbol, whose default
throws an exception that terminates compilation.
module ssce
lexical WholeLine = ^ Contents $ ;
lexical WholeLineWorkaround = (^ Contents ()) $ ;
lexical Contents = ![\r\n]+ ;
The grammar rules that provided the opportunity for ambiguity are these,
from Rascal.rsc:
Syntax Sym
[...]
| endOfLine: Sym symbol "$"
| startOfLine: "^" Sym symbol
There is no associativity declared for these productions. Because they're
non-associative, you can't put both of them on the same symbol without
causing an ambiguity. The semantics of WholeLine, however, are
unambiguous. The Rascal grammar here is simply counter-intuitive.
I was working inside Eclipse (0.9 stable, although I first had the problem
with the unstable branch), there was a compilation error that was silent.
The problem was showing up only in the Eclipse window Error Log, which is
not part of the default Rascal perspective. I only saw the compilation
error after I made a fresh Eclipse installation and loaded the project
before the Rascal perspective took over.
Furthermore, this error was not showing up as a syntax error in the
editing window. Apparently there's some mismatch in behavior between the
IDE grammar and the compiler grammar for this case.
The ordinary workaround non-associative operators is to use parentheses to
make the association explicit. This doesn't work in the current grammar,
though, because of the way that parentheses-terms work.
Syntax Sym
[...]
| sequence: "(" Sym first Sym+ sequence ")"
Parentheses only apply to lists of two or more symbols and do not apply to
single symbol. Thus the workaround has extra parentheses as an empty symbol
just to get a second symbol for the disambiguating parentheses. It's not
clear to me why there's no single symbol production for parentheses, but
there's not, and it turns the obvious workaround into something more
complicated.
This one issue report contains more than one problem, admittedly. Here's a
list of the things to do:
- Semantically speaking, ^ and $ are just special versions of precede
and follow rules, which *do* have associativity defined. Perhaps the
easiest fix is to move the productions for ^ and $ in with those of
precede and follow. If that's not right for some reason, they could just
get their own associativity group.
- Ensure that there's a production for "single-symbol sequences".
These are simple disambiguating parentheses. Either add another production
for it or allow existing sequences to contain only one element.
- Ensure that ambiguity errors that arise from the Rascal grammar
itself appears as syntax errors in the IDE. Given the code I've seen, I'd
guess that the lack of reporting might be generic to all ambiguity errors.
For example, there are the same symptoms with this production:
lexical AnotherLine = ^ Contents ? ;
- Add the Eclipse Error Log to the Rascal perspective. It will expose
errors to the user that would otherwise be masked.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#1200>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAfD_49_0D_ih-Wo00W9QNakuNQ2rBflks5uHpi-gaJpZM4VT07P>
.
|
Thanks for your report, we should indeed look at removing this ambiguity from the rascal grammar. (there are a few more sadly). Error reportingI completly agree, in our current setup we do not report all errors of imports. I would have to look into why again, but there is a technical constraint at the moment. And we try to have as a rule that the error log should only be for internal errors that should never be shown to the user. This is obviously not one of those, so it should be reported in a user view. Importing the module with the grammar in the REPL does tend to show the error message. But then you would have had to know that that was the faulty module to begin with. Our eclipse editor used color ambiguous parts red, it seems we have now chosen for just pick one of the alternatives and use that for highlighting. I'll look into this code to see if we can do both, keep the highlighting, but also report an error. (although I fear this is an Eclipse thing, where inside the token colloring code it's hard to report an error, but I will look into it, so that we get earlier feedback). I myself was also bitten by this just yesterday (a different ambiguity). WorkaroundYou are right, for most cases you could use precede and follow restrictions, however they don't correctly work together with begining of file/stream and end of file/stream. So your workaround with the empy symbol is indeed the option I would go for as well. I remember there was a specific reason for enforcing that the brackets can only be used to group more than one symbol, I think @jurgenvinju can give that context. module GrammarTest
import ParseTree;
lexical A1 = ^ (() "a" $);
lexical A2 = [\n\r] << "a" >> [\r\n];
test bool a1() = appl(_,_) := parse(#A1, "a");
test bool a2() = appl(_,_) := parse(#A2, "a"); running:
Implementation detailA small warning, the But the parser you defined will fail on this input: For the |
Thanks for mentioning that I recommend showing the Error Log not as a point of principle, but purely one of practicality. At the current state of maturity, it's better to err on the side of letting the user see too much than too little. At some point in the system's maturity, the Error Log would just become UI cruft that would never be used. But that's not the state of things today. Line endings are indeed difficult to get right. Having |
Because the brackets introduce another level in the tree, i.e. The same brackets, meta-syntactically, would mean something different. I thought that would be confusing. Since non-terminal juxtapositioning is assocative, the brackets would never have any semantics I believe; their only use would be to disambiguate our own implementation as a workaround for the current bug, and perhaps to clarify a complex rule to the reader of a grammar. However, if such a rule exists that it needs brackets, perhaps to best way would be to introduce another named rule instead? |
indeed "must-follow" rules will fail at EOF and at BOF; that's still a point to ponder. We thought about introducing a fictional character notation for EOF and BOF, such that people could write I like the |
About the OS dependent meaning of However, so far no extremely bright solution has descended from above :-) Preprocessing every file towards a universal "rascal" normal-form (let's say the Uninx way) is not good, since it introduces "noise" and distances parse trees from the characters which an editor sees. We'd be shifting the problem from the parser to the bridges to the editors we use. Not a real solution. Another solution direction might be to give
Again, your thoughts, both, on this would help perhaps to clarify this the ins and outs. This is a minefield. |
what if then the only thing that will be messed up is inconsistent new-line character usage. I think that |
in unix an accidental |
it would be great if we could map CR+LF during character reading into a single Unicode-24 character, without loss of information, and also have a special escape character for that particular character. That would solve a lot of the above issues. I just don't see how yet. Could we use an index higher than the 24-bit range? We have 32-bits after all. When writing that character again, it should be decoded back of course into the proper UTF-8 or whatever encoding was chosen. There's probably downsides to this as well. PS: I'm going offline for three weeks now. See you after that! |
On parentheses. Unfortunately for formal language designers, mathematicians long ago overloaded the usage of parentheses:
Computer language authors have also broadly used parentheses as part of the syntax of control flow statements. I'm somewhat of a pragmatic traditionalist about parentheses in expressions. Traditional, because I think parentheses should never be required unless you need them because of operator precedence and never be prohibited because they're extraneous. Pragmatic, because this is the way all mathematics uses parentheses and that's a huge ship I'd never try to steer against. Rascal is non-traditional in a few ways. I'd likely also remove mandatory parentheses from control flow statements where the parentheses surround single expressions, but maybe that's just me. |
ok thanks! the brackets around The reason for the non-ignorable brackets around sequence The main underlying principle is that every non-terminal in Rascal's grammars are types in its type system, so non-terminal constructors like |, ? and sequence, lists, etc. are also type constructors. Rascal does not have type coercion or sub-typing (for other reasons), so every one of these type constructors must leads to a new level in a parse tree. The second principle is that the concrete syntax of symbols is very close to their abstract syntax (which is also their internal type representation), so a rule for Sym in the Rascal grammar corresponds directly to a rule for the abstract data type Symbol in the reified type representation. If we would make the brackets less consequential, as they'd normally be in other expression languages, we could still generate the same parsers (I think), but we wouldn't have a proper types for parts of parse trees generated by regular expressions over non-terminals. For example, the optional sequence The current design for regular expressions over non-terminals is completely low brow and simple: every one of the combinators generates both a type and a set of productions to derive sentences of that type. The parentheses seems to be the only real problematic issue, and its mostly a meta-syntax design choice. Perhaps a different syntax for nested sequence, not overloading the parentheses, would be a good idea? I'm open to suggestions. Or, perhaps indeed interpreting introducing |
ok, Jurgen out, now I'm really gone fishing for three weeks :-) Hope to continue this after, unless you guys have moved on by then! Cheers. |
The goal, then is preserve this and also get type-transparency with respect to parentheses. That means, almost of necessity, that you don't want to operate on the syntax tree of the original definition because, by assumption, it has parentheses. The solution to me seems very Rascally: transform the tree by collapsing all the parentheses and define the types on a canonical tree. You've already got an Canonical type trees could, with more effort and new language constructs, do more than just parentheses. But that's getting far afield from just dealing with semantically-neutral variation in the syntax of definition. |
Yes, the current implementation is already quite close to that. I'll have a look later this summer if/how we can follow this through. Thanks for the discussion and suggestions. It helps. |
See https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/rascal/grammar/definition/Symbols.rsc which implodes parse trees of Sym to the Symbol datatype. That's the code we are talking about. It avoids concrete syntax in pattern matching for bootstrapping reasons. |
Regarding the newlines, this is why I think my earlier proposal is a good tradeoff.
lexical LineTerminator
= lineFeed: [\r] !<< [\n]
| verticalTab: [\a0B]
| formFeed: [\a0C]
| carriageReturn: [\r] !>> [\n]
| windowsNewline: [\r][\n]
| nextLine: [\u0085]
| lineSeparator: [\u2028]
| paragraphSeparator: [\u2029]
; |
Ok. It's the best trade-off for now. The lexical Library is good for documentation. Later the alias feature should allow us to substitute names for character classes and also compute unions and intersections and such.. |
Unicode has a definition for line boundaries: Unicode Technical Standard #18, Unicode Regular Expressions: 1.6 Line Boundaries. It's in the context of regular expressions, a bit different than Rascal grammars, but they do address all the issues raised here, including proper definitions for There is a definition of a "newline sequence" that is basically identical to
Rascal isn't a regex language, so I can't advocate for It's worth noting that by calling the section "Line Boundary", they're remaining agnostic about the semantics of line counting, whether the boundary represents a line terminator or a line separator. There's also an interesting note about the semantics of regex
An analogous facility would be useful in Rascal, particularly to be able to subtract character classes from |
There is one usage of |
The following grammar fails to compile.
WholeLine
fails to compile because it triggers an ambiguity in the Rascal grammar, which then has nocase
statement to handle the ambiguity node insym2symbol
, whose default throws an exception that terminates compilation.The grammar rules that provided the opportunity for ambiguity are these, from
Rascal.rsc
:There is no associativity declared for these productions. Because they're non-associative, you can't put both of them on the same symbol without causing an ambiguity. The semantics of
WholeLine
, however, are unambiguous. The Rascal grammar here is simply counter-intuitive.I was working inside Eclipse (0.9 stable, although I first had the problem with the unstable branch), there was a compilation error that was silent. The problem was showing up only in the Eclipse window Error Log, which is not part of the default Rascal perspective. I only saw the compilation error after I made a fresh Eclipse installation and loaded the project before the Rascal perspective took over.
Furthermore, this error was not showing up as a syntax error in the editing window. Apparently there's some mismatch in behavior between the IDE grammar and the compiler grammar for this case.
The ordinary workaround non-associative operators is to use parentheses to make the association explicit. This doesn't work in the current grammar, though, because of the way that parentheses-terms work.
Parentheses only apply to lists of two or more symbols and do not apply to single symbol. Thus the workaround has extra parentheses as an empty symbol just to get a second symbol for the disambiguating parentheses. It's not clear to me why there's no single symbol production for parentheses, but there's not, and it turns the obvious workaround into something more complicated.
This one issue report contains more than one problem, admittedly. Here's a list of the things to do:
^
and$
are just special versions of precede and follow rules, which do have associativity defined. Perhaps the easiest fix is to move the productions for^
and$
in with those of precede and follow. If that's not right for some reason, they could just get their own associativity group.lexical AnotherLine = ^ Contents ? ;
The text was updated successfully, but these errors were encountered: