The Rule Tree

Mathias edited this page Jan 26, 2011 · 6 revisions
Clone this wiki locally

Like most programs related to parsing parboiled relies heavily on graph and tree structures. The first such structure created during a parboiled parsing process is the rule “tree” (you’ll see why it is a “tree” and not a tree in moment). This rule “tree” is still independent of the actual parsing input. The latter is consumed by the parser during a parsing run and applied to the rule “tree”, which optionally yields The Parse Tree.

Consider this example grammar (ignore the pretty nonsensical language it describes, just look at the rule references):

a ← b ‘a’ c
b ← ‘b’ d
c ← ‘c’ d
d ← ‘d’ c?

When you turn this grammar into a graph just showing rules as nodes and rule references as directed edges this grammar can be represented by the following structure:

As you can see this graph has a cycle and node d has two parents, which is why this graph is not a tree but just a plain directed graph. Most “real-world” grammars however are in large parts very tree-like, with a pretty clear hierarchy of rules referencing sub rules. Although loop-backs (recursions) are not exactly rare their number is small compared with “regular”, hierarchical references.
Because of this (and the fact that to many people trees are a more common mental picture) you might choose to think of the rule graph as a rule “tree” with two potential exceptions: multi-parent nodes and loop-backs.

Incidentally this directed graph nature of PEG grammars almost matches the way method calls work in the JVM, with a method calling other methods, potentially including call stack ancestors. This is why parboiled maps rule declaration (or rather “construction”) more or less directly onto Java or Scala methods. Each of your parser “rule” methods constructs a rule object, potentially calling other rule constructing methods in the process.

There are however two complications:

  1. Java or Scala methods recursing into themselves or an ancestor need a way of terminating the recursion. Usually this happens through some logic, which “exits” the recursion based on some condition. A rule declaration however does no such thing, only the parsed input text (which is always finite) will terminate any rule recursion. Therefore parboiled needs to pull some tricks to prevent the rule constructing Java / Scala methods from recursing infinitely.
  2. When a rule constructing method is called several times (because it is referenced from several other rules) it would normally create a new rule instance of the same identical structure with every call. Even though this is not a problem per se it could become very inefficient for larger grammars, with a huge rule tree potentially containing many duplicate rules instances.

parboiled for Java takes care of these two problems by rewriting the rule methods (public, instance, Rule returning) of your parser class and injecting code that is invisible to the developer. This code makes sure that each method is called only once, i.e. each rule is only created once, and all further calls (“references”) receive the same Rule instance. Should the rule creation recurse back into itself proxy rules are inserted that prevent infinite recursions. All these things happen transparently behind the scenes, you do not need to worry about them.

parboiled for Scala achieves the same thing without rewriting the actual method byte code by encapsulating your actual rule creation code in a function (block) argument to the main rule building method (method “rule” in the Parser trait).

In the end, when you call your top level rule method to pass it to the ParseRunner of your choice you get back the root of a rule “tree” just like show in Fig. 1, without duplicated nodes and proper links, even for recursions.

Matchers

As you might have already seen in the Javadoc API documentation the Rule interface is merely a “facade” interface with only a few methods specifying special Rule attributes. The classes actually implementing this interface are defined in package org.parboiled.matchers. There is one Matcher implementation for every rule primitive (a SequenceMatcher, a OneOrMoreMatcher and so on), which implements the logic for this particular rule type. So the Rule “tree” is in fact a Matcher “tree”. However, most of the time you will not need to know about these parboiled internals, but rather concentrate on working with the parser rules and the value stack.