Deviations From Flex Bison

Ger Hobbelt edited this page Jul 7, 2016 · 17 revisions

Lex Patterns

Literal tokens

WARNING: vanilla zaach/jison has 'easy keyword' support turned on all the time

The section currently describes the GerHobbelt fork which has the

%options easy_keyword_rules

feature, while vanilla jison does not (at least not until a pull request for this is posted by me (@GerHobbelt) and accepted).

Hence vanilla jison will work as if you implicitly specified %options easy_keyword_rules in every lexer of yours.

When the lexer 'easy keyword' option has been turned on in your lexer file / section using

%options easy_keyword_rules

you will notice that token "foo" will match whole word only, while ("foo") will match foo anywhere unless.

See issue #63 and GHO commit 64759c43.

Under The Hood

Technically what happens is that %options easy_keyword_rules turns on lexer rule inspection: where it recognizes that a rule ends with a literal character, the regex word boundary check \b is appended to the lexer regex for that rule.
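
The effect can be sketched in plain JavaScript. This is not the actual jison code: the helper name addWordBoundary is hypothetical, and "ends with a literal character" is simplified here to "ends with a word character".

```javascript
// Hypothetical helper: mimics the effect of `%options easy_keyword_rules`.
// Assumption: "ends with a literal character" is approximated as
// "the rule source ends with a word character [A-Za-z0-9_]".
function addWordBoundary(ruleSource) {
  return /\w$/.test(ruleSource) ? ruleSource + "\\b" : ruleSource;
}

// Lexer rules are matched at the start of the remaining input,
// so both patterns are anchored with ^.
const plain = new RegExp("^" + "foo");
const keyword = new RegExp("^" + addWordBoundary("foo"));

console.log(plain.test("foobar"));    // true: `foo` matches inside `foobar`
console.log(keyword.test("foobar"));  // false: \b rejects the partial word
console.log(keyword.test("foo bar")); // true: `foo` is a whole word here
```

Rules that end in an operator such as + or ) are left untouched by this sketch, which roughly mirrors why ("foo") is not treated as a keyword.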

Longest rule matching

The lexer will use the first rule that matches the input string unless you use %options flex, in which case it will use the rule with the longest match.
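
The difference between the two strategies can be illustrated with a toy matcher. The rule table and function names below are hypothetical and are not part of any jison API; keeping the earlier rule on a tie is an assumption borrowed from flex.

```javascript
// Toy rule table (hypothetical, not a real jison lexer): order matters.
const rules = [
  { token: "IF", re: /^if/ },
  { token: "ID", re: /^[a-z]+/ },
];

// Default behaviour: the first rule that matches wins.
function firstMatch(input) {
  for (const rule of rules) {
    const m = rule.re.exec(input);
    if (m) return { token: rule.token, text: m[0] };
  }
  return null;
}

// `%options flex` behaviour: the longest match wins.
// Assumption: on a tie the earlier rule is kept.
function longestMatch(input) {
  let best = null;
  for (const rule of rules) {
    const m = rule.re.exec(input);
    if (m && (best === null || m[0].length > best.text.length)) {
      best = { token: rule.token, text: m[0] };
    }
  }
  return best;
}

console.log(firstMatch("iffy"));   // { token: 'IF', text: 'if' }
console.log(longestMatch("iffy")); // { token: 'ID', text: 'iffy' }
```

In the default mode, rule order therefore decides which token you get, so more specific rules should be listed before more general ones.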

Additions

Because Jison uses JavaScript’s regular expression engine, it is possible to use some metacharacters that are not present in Flex patterns.

For a full list of available regex metacharacters, see the MDN documentation: Using Special Characters

Negative Lookahead

Flex patterns support lookahead using /. Jison adds negative lookahead using /!.

Under The Hood

Technically what happens is that /<atom> and /!<atom> are replaced 1:1 by the regex expressions (?=<atom>) and (?!<atom>) respectively.
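
As a small illustration in plain JavaScript, assuming the lexer patterns foo/bar and foo/!bar (chosen here only as examples), the rewritten regexes behave like this:

```javascript
// `foo/bar` is rewritten to foo(?=bar): match `foo` only when `bar` follows.
const positive = /^foo(?=bar)/;
// `foo/!bar` is rewritten to foo(?!bar): match `foo` only when `bar` does not follow.
const negative = /^foo(?!bar)/;

console.log(positive.exec("foobar")[0]); // "foo": `bar` is checked but not consumed
console.log(positive.test("foobaz"));    // false
console.log(negative.test("foobar"));    // false
console.log(negative.test("foobaz"));    // true
```

In both cases the lookahead text stays in the input, so the next lexer rule still gets to see it.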

Advanced Grouping Options

Jison supports the following advanced grouping options:

  • non-grouping brackets (?:PATTERN),
  • positive lookahead (?=PATTERN) and
  • negative lookahead (?!PATTERN).

yymore, yyless, etc...

The flex macros yymore() and yyless() must be rewritten to use the Jison lexer's JavaScript API calls: this.more() and this.less(n), respectively.

Braces in actions

Within lexer actions use %{ ... %} delimiters if you want to use block-style statements, e.g.:

.*  %{
  if (true) {
    console.log('test');
  }
  // ...
%}

Within parser actions you may alternatively use {{ .. }} delimiters for the same purpose:

test
  : STRING EOF  {{
    if (true) {
      console.log('test');
    }
    // ...
    return $1;
  }}
  ;

though Jison also supports %{ ... %} multi-line action blocks in the grammar rules:

test
  : STRING EOF  %{
    if (true) {
      console.log('test');
    }
    // ...
    return $1;
  %}
  ;

See issue #85

Semantic Actions

Actions should contain JavaScript instead of C, naturally.

Braces

As of Jison v0.2.8, you no longer need to use double braces {{...}} around grammar rule action code blocks.

From now on, single braces {...} suffice.

Short-hand syntax

There is a short-hand arrow syntax:

 exp:    ...
         | '(' exp ')' -> $2
         | exp '+' exp -> $1 + $3

Accessing values and location information

Normally, you’d have to use the position of the corresponding nonterminal or terminal in the production, prefixed by a dollar sign $, e.g.:

 exp:    ...
         | '(' exp ')'
             { $$ = $2; }

Now, you can also access the value by using the name of the nonterminal instead of its position, e.g.:

 exp:    ...
         | '(' exp ')'
             { $$ = $exp; }

If the rule is ambiguous (the nonterminal appears more than once), append a number to the end of the nonterminal name to disambiguate the desired value:

 exp:    ...
         | exp '+' exp
             { $$ = $exp1 + $exp2; }

Association by name leads to a looser coupling (and is easier to grok).

This also works for accessing location information (compare with the Bison manual on Named references and their Actions and Locations section):

 exp:    ...
         | '(' exp ')'
             { @$ = @exp; /* instead of @$ = @2 */ }

Another way to resolve ambiguity would be to use aliases in square brackets, for example:

 exp:    ...
         | exp[left] '+' exp[right]
             { $$ = $left + $right; }

Auto-numbered named accessors

'Auto-numbering' means that the first occurrence of label (token name or alias) nnn will also be available as nnn1, and so on.

In the section above you may have seen one example where the nonterminal names have been auto-numbered to provide unambiguous access to each:

 exp:    ...
         | exp '+' exp
             { $$ = $exp1 + $exp2; }

Note that in every Jison rule production, all the nonterminal names and all the aliases are always also available in 'auto-numbered' form, that is: when the same nonterminal name or alias occurs multiple times in the same rule, the action block can uniquely address a particular nonterminal or alias by using the auto-numbered form.

An example:

test
: subrule[alt] subrule[wicked_middle] subrule[alt] '?'[alt]
%{
    // These are all unambiguous and legal to address $1, $2, $3 and $4:
    //
    // $1 === $subrule1 === $alt1
    // $1 === $alt  <-- first occurrence also registers the name itself!
    // $2 === $subrule2 === $wicked_middle
    // $3 === $subrule3 === $alt2
    // $4 === $alt3
    //
    // @1 === @subrule1 === @alt1
    // @1 === @alt  <-- first occurrence also registers the name itself!
    // @2 === @subrule2 === @wicked_middle
    // @3 === @subrule3 === @alt2
    // @4 === @alt3
%}
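
The numbering shown in the comments above can be reproduced with a short JavaScript sketch. This is a hypothetical model, not jison's implementation: the function name namedAccessors and the { name, alias } term shape are made up for illustration, and it assumes that a bare nonterminal name stays registered even when an alias is present.

```javascript
// Hypothetical sketch: build the name -> position table for one production.
// Each term is { name, alias }; `name` is null for quoted literals such as
// '?', which have no identifier to reference.
function namedAccessors(terms) {
  const table = {};  // e.g. { alt: 1, alt1: 1, subrule2: 2, ... }
  const counts = {}; // how often each label has been seen so far

  const register = (label, position) => {
    counts[label] = (counts[label] || 0) + 1;
    table[label + counts[label]] = position;          // auto-numbered form
    if (counts[label] === 1) table[label] = position; // bare name: first occurrence only
  };

  terms.forEach((term, i) => {
    // Assumption: the nonterminal name is registered even when aliased.
    if (term.name) register(term.name, i + 1);
    if (term.alias) register(term.alias, i + 1);
  });
  return table;
}

const table = namedAccessors([
  { name: "subrule", alias: "alt" },
  { name: "subrule", alias: "wicked_middle" },
  { name: "subrule", alias: "alt" },
  { name: null, alias: "alt" },
]);
console.log(table.alt, table.alt1, table.alt2, table.alt3); // 1 1 3 4
console.log(table.subrule2, table.wicked_middle);           // 2 2
```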

Caveat Emptor

It doesn't say what'll happen if you go and game the system by using aliases with the same name as the nonterminals, e.g.

exp:    ...
         | exp[exp] '+' exp[exp]
             { $$ = $exp1 + $exp3 /* 3? Are we sure about this? */; }

If you wonder, RTFC: vanilla vs. RTFC: GerHobbelt


WARNING: vanilla zaach/jison doesn't behave the same when it comes to mixing aliases and nonterminal names.

The section currently describes the GerHobbelt fork. With vanilla zaach/jison the safe rule of thumb here is that when you specify an alias for a nonterminal, then you SHOULD NOT USE the nonterminal name itself any more in your action code.

RTFC to compare and check each's behaviour here: vanilla vs. GerHobbelt

Extended BNF

Jison now supports EBNF syntax, showcased here.

Extended BNF: how it works and what to keep in mind when using this

EBNF is accepted by the jison grammar engine and transposed to a BNF grammar using equivalence transforms for each of the EBNF *, +, ? and (...) operators.
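
A rough JavaScript sketch of such equivalence transforms is given here. The function name expandEbnf is hypothetical, and the exact shape of the rule generated for * is an assumption rather than jison's verified output.

```javascript
// Hypothetical sketch: turn one EBNF-wildcarded group into an equivalent
// BNF rule, returned as text. `name` is the generated nonterminal name.
function expandEbnf(name, operator, terms) {
  const body = terms.join(" ");
  switch (operator) {
    case "?": // zero or one
      return `${name}: /* nil */ | ${body};`;
    case "+": // one or more, left-recursive
      return `${name}: ${name} ${body} | ${body};`;
    case "*": // zero or more, left-recursive (assumed shape)
      return `${name}: /* nil */ | ${name} ${body};`;
    default:
      throw new Error(`unsupported EBNF operator: ${operator}`);
  }
}

console.log(expandEbnf("subrule_option0", "?", ["B", "C", "D", "E"]));
// subrule_option0: /* nil */ | B C D E;
console.log(expandEbnf("subrule_series1", "+", ["C"]));
// subrule_series1: subrule_series1 C | C;
```

Each wildcarded group becomes one new nonterminal, which is why only that single nonterminal is visible from the enclosing rule's action.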

For these EBNF wildcards & groups the following treatment must be kept in mind:

  • Only the outermost wild-carded group's label or index is addressable in your action. That group is translated to a single nonterminal, e.g.

    rule: A (B C D E)?
    

    becomes

    rule: A subrule_option0
    
    subrule_option0: /* nil */ | B C D E;
    

    hence your action block for rule rule will only have $1 and $2 (the subrule_option0 nonterminal) to play with.

    As jison allows labeling the wildcarded group, such an alias might keep things more readable:

    rule: A (B C D E)?[choice]
    

    becomes

    rule: A subrule_option0[choice]
    
    subrule_option0: /* nil */ | B C D E;
    

    WARNING: it's illegal to attempt to access $B, $C et al from your rule's action code block and very bad things will happen to you.

    • vanilla zaach/jison will not translate those references and your code will be TOAST.

    • GerHobbelt/jison analyzes your action code chunk and attempts to locate all your $whatever and @whatever references in there and barfs a hairball (i.e. fails at jison compile time) with a big fat error message if you do.

      Do note that we are a little dumb scanner, so we will recognize those references even when they sit in a nice cozy comment in there! [Edit: not since GerHobbelt/jison@80b6de1d311778d2bdfd71a2c39db570049d092a -> GerHobbelt/jison@0.4.17-132 -- since then the action code macro expansion is smart enough to skip (almost all) strings and comments.]

  • (...)*, (...)+ and (...)? are the wildcarded ones and will be rewritten to equivalent BNF rules.

    You MAY nest these constructs.

  • The (...) group is also recognized (no wildcard operator there): it will be unrolled, unless there's a label attached to it, in which case it's rewritten.

    Hence

    rule: A (B C D E)
    

    becomes

    rule: A B C D E;
    

    while

    rule: A (B C D E)[groupies]
    

    becomes

    rule: A subrule[groupies]
    
    subrule: B C D E;
    

    so be aware that a little change like that can play havoc on your (action) code: the former, unrolled, grouping gives you access to all its terms (nonterminals and terminals alike), while the labeled a.k.a. aliased version hides those inner terms from you.

  • In order to have something decent to work with in your action code, every wildcarded or non-wildcarded group which is not unrolled will collect all its terms' values (yytext) as produced by the lexer and store them in an array, thus constructing a Poor Man's AST:

    rule: A (B C+ (D E)[hoobahop])?[choice]
    

    becomes

    rule: A subrule_option0[choice]
    
    subrule_option0: /* nil */ | subrule_option1;
    
    subrule_option1: B C+ (D E)[hoobahop];
    

    which becomes

    rule: A subrule_option0[choice]
    
    subrule_option0: /* nil */ | subrule_option1;
    
    subrule_option1: B subrule_series1 hoobahop_group0;
    
    subrule_series1: subrule_series1 C | C;
    
    hoobahop_group0: D E;
    

    which will deliver in your $choice reference an array shaped like this (comments show the origin of each bit):

    // subrule_option0
    [
      // **Note**:
      // as this is choice, you get only the value 
      //
      //     undefined
      // 
      // when you've hit the **nil** epsilon choice instead!
    
      // subrule_option1: 
      [
        B,
        // subrule_series1
        [
          // the EBNF rewriter is smart enough to see that there's
          // only 1(one) term in this one: `C` so no extra arrays-in-array
          // for you here:
          C,
          C,
          ...
        ], 
        // hoobahop_group0
        [
          D,
          E
        ]
      ]
    ]
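
In an action block, a nested value like this usually needs a guard for the nil case. A minimal JavaScript sketch, using a hypothetical helper name flattenGroup and assuming every token's yytext is simply its own name:

```javascript
// Hypothetical helper for an action block: flatten the nested value that an
// EBNF group delivers. `choice` is undefined when the optional group matched
// nothing (the nil alternative).
function flattenGroup(choice) {
  if (choice === undefined) return [];
  return Array.isArray(choice) ? choice.flatMap(flattenGroup) : [choice];
}

// Value shaped like the sketch above for the input `A B C C D E`,
// assuming each token's yytext equals its name.
const choice = [["B", ["C", "C"], ["D", "E"]]];

console.log(flattenGroup(choice));    // [ 'B', 'C', 'C', 'D', 'E' ]
console.log(flattenGroup(undefined)); // []
```

Flattening throws away the grouping information, so it only suits actions that need the raw token values in order.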
    

BIG FAT WARNING:

The above is written for the GerHobbelt fork as currently EBNF support in vanilla zaach/jison is ever so slightly b0rked.

But that's not what this warning is about!

As I (GerHobbelt) write this, I wonder if this really really is the truth. It may be that the current bleeding edge (master branch) still has a few, ahh..., sub-optimalities in reality compared to the above.

To Be Checked


The next blurb is copy-pasta-d from the gist listed further above. It has some details which need to be updated in the docs too...

Some improvements have been made for parser and lexer grammars in Jison 0.3 (demonstrated in the FlooP/BlooP example below).

For lexers:

  • Patterns may use unquoted characters instead of strings
  • Two new options, %options flex case-insensitive
    • flex: the rule with the longest match is used, and no word boundary patterns are added
    • case-insensitive: all patterns are case insensitive
  • User code section is included in the generated module

For parsers:

  • Arrow syntax for semantic actions
  • EBNF syntax (enabled using the %ebnf declaration)
    • Operators include repetition (*), non-empty repetition (+), grouping (()), alternation within groups (|), and option (?)
  • User code section and code blocks are included in the generated module

Also, Robert Plummer has created a PHP port of Jison's parser.

See the grammar below for more examples.