Indentation Based Grammars

Ragmaanir edited this page Apr 20, 2011 · 5 revisions
Clone this wiki locally

Sometimes a DSL provides for block scoping by indentation levels (something also know as the off-side rule). Languages exhibiting this behavior are, for example, Python or YAML. These languages are classic examples of the “non-context free” kind and generally require some custom logic input preprocessing in order to simplify parser grammar design. Traditionally this job is left to the lexer but since PEGs (and therefore parboiled) are lexer-less another solution is needed.
It would of course be possible to provide for an input preprocessor that is completely separate from parboileds parsing engine. However, this level of indirection would greatly complicate error reporting, since users still expect to receive error messages for their original input text and not some intermediate representation.

Starting with release 0.9.9 parboiled provides for a dedicated InputBuffer implementation that takes care of the required indentation preprocessing while maintaining full original input transparency for error messages and recovery. You can use it by passing a newly created IndentDedentInputBuffer instance to the ParseRunner of your choice (Java) or call the “transformIndents” method on the (probably implicitly) created Input instance (Scala).

Java Example:

new ReportingParseRunner(rule).run(new IndentDedentInputBuffer("... input ...", 2))

Scala Example:

ReportingParseRunner(rule).run("... input ...".transformIndents(2))

Upon creation the IndentDedentInputBuffer collapses all line indentations, i.e. space and tab characters at the beginning of a line, replacing them with special INDENT or DEDENT characters where required. All other characters in the input are left untouched, including the “newline” sequences at the end of the lines. Specifically the InputBuffer collapses indentations into

  • nothing, if the line has the same indentation level as the previous line
  • one INDENT, if the line has a greater indentation level than previous line
  • one or more DEDENTs, if the line has a lower indentation level than the previous line

Here is an example of how the IndentDedentBuffer transforms indented input text:

level 1
    level 2
    still level 2
      level 3
      also 3
    2 again
        another 3
and back to 1
 another level 2 again

will be turned into the following (the brackets indicating the special, unprintable INDENT and DEDENT characters)

level 1
>level 2
still level 2
>level 3
also 3
<2 again
>another 3
<<and back to 1
>another level 2 again
<

With the line indentation transformed in the described fashion grammar specification becomes much simpler since INDENTs and DEDENTs can now be matched just like brace pairs in other languages.