Skip to content

Parser Analysis

Pieter edited this page May 14, 2024 · 2 revisions

Over time it's been noticed that there are two large issues with the parsing stage across multiple input languages:

  • Parsing is slow for certain languages
  • Memory usage can be extreme due to _sharedContextCache being a shared static field in each parser

We've long suspected that both could be fixed by making the grammar more straightforward to parse, but so far we haven't conclusively found what we should change. There are now a couple flags that analyze what is happening during parsing to help with that. I've tried to add (in an overly confident tone) what ANTLR is now reporting to us. I'm pretty sure there are holes in my understanding of ANTLR still, so apologies in advance.

Flags to analyse ANTLR

You can pass --dev-parsing-ambiguities and --dev-parsing-sensitivities. ANTLR only cares about ambiguities in the sense that it has to tell you for each rule which alternative it chose. A decision is considered context-sensitive when we cannot decide with one token of lookahead what alternative to choose. While matching a production like (expr expr)* expr expr expr we do need more lookahead than one, because it is not clear while reading expressions whether to continue in the * or read the last three expressions. Nevertheless this is not context-sensitive, because ANTLR dumps all terminals/non-terminals flatly into the context that represents the current production, so even in the face of a context-sensitive decision like this, the resulting data structure is the same anyway.

Format of the output

When enabling either flag you will receive outputs like this:

======================================
At /tmp/vercors-interpreted-13716491317094529415.ipp
--------------------------------------
   20    void parallel_for(sycl::range<1> numWorkItems, VERCORS::LAMBDA lambda_method);
   21    void parallel_for(sycl::range<2> numWorkItems, VERCORS::LAMBDA lambda_method);
                                                       [---------
   22    void parallel_for(sycl::range<3> numWorkItems, VERCORS::LAMBDA lambda_method);
                                                        ---------]
   23  
   24    void parallel_for(sycl::nd_range<1> numWorkItems, VERCORS::LAMBDA lambda_method);
--------------------------------------
Prediction of rule theTypeName required full context and took 0.00s:
[x] simpleTemplateId
 >  className
 >  enumName
 >  typedefName
======================================
  • The start of the indicated region is the position at which a context-sensitivity was detected
  • The end of the region indicates when we have returned to normalcy, and we know which alternative to choose
  • The message indicates the name of the rule where an ambiguity occurred, and how much time it took to run the analysis (this does not seem to be indicative of anything yet)
  • The bottom lists the productions for the given rule, potentially prefixed by an indicator
  • An indicator of > indicates a definitively viable alternative
  • An indicator of [x] indicates an alternative that was discarded only after running a full-context analysis
  • No indicator means it was not considered a viable alternative, even before considering context. (There are no lines without an indicator in this example)

Note 1: if there are multiple > lines, this means that there is an ambiguity in the parser: both are valid. If there is only one >, there must be a [x]: it means the grammar is sensitive to context. I think that context-sensitivities can be normal/fine if the indicated region is quite small.

Note 2: The name of the rule refers to the rule in the input grammar. In the case of left-recursive grammars (e.g. most of our expressions), the grammar is instead rewritten, so you may see alternatives that do not make much sense, like this:

======================================
At /home/pieter/vercors/res/universal/res/jdk/java/lang/String.java
--------------------------------------
   61      ensures \result.data() == other.data() + data();
   62      String right+(String other) {
                                             [-
   63          return new String(other.data() + data());
                                              -]
   64      }
   65      @*/
--------------------------------------
Prediction of rule expr required full context and took 0.00s:
 >  ({9 >= _p}? ('=='|'!=') expr | {28 >= _p}? '.' 'new' nonWildcardTypeArguments? innerCreator | {3 >= _p}? impOp expr | {27 >= _p}? '.' 'super' superSuffix | {19 >= _p}? ('++'|'--') | {29 >= _p}? '.' 'this' | {24 >= _p}? '->' javaIdentifier arguments | {11 >= _p}? relOp expr | {7 >= _p}? '^' expr | {15 >= _p}? prependOp expr | {6 >= _p}? '|' expr | {4 >= _p}? '||' expr | {22 >= _p}? postfixOp | {5 >= _p}? andOp expr | {13 >= _p}? ('+'|'-') expr | {1 >= _p}? assignOp expr | {26 >= _p}? '.' explicitGenericInvocation | {12 >= _p}? shiftOp expr | {2 >= _p}? '?' expr ':' expr | {8 >= _p}? '&' expr | {23 >= _p}? '.' javaIdentifier predicateEntryType? arguments valEmbedGiven? valEmbedYields? | {25 >= _p}? '[' expr ']' | {10 >= _p}? 'instanceof' type | {14 >= _p}? mulOp expr | {30 >= _p}? '.' javaIdentifier)+
[x] ()
======================================

Note that the first alternative ends in +. This language represents the tail part of the left-recursive alternatives of expr. The operator precedence is here already encoded by antlr automatically through semantic predicates {_ >= _p}?, I don't know the exact mechanics.

Note 3: Note that the alternatives are recovered from the augmented transition network in the parser, and then pretty-printed as a production rule. This means that they may not exactly correspond to the input, even when not left-recursive. Especially note that a rule such as (a | b?) is pretty-printed as (a | b)?, which is equivalent (!)

Now what

I'm not certain what you should actually change when something is reported. Nevertheless I offer some thoughts:

  • I think ambiguities are universally bad: an input should really have at most one derivation. It can be indicative of duplicate rules.
  • Maybe we should try doing stuff like exp2: exp1 (('*'|'/') exp1)* instead of exp2: exp1 | exp2 '*' exp1 | exp2 '/' exp1
Clone this wiki locally