parboiled2 – A Macro-Based PEG Parser Generator for Scala 2.12+
Contents of this Document
parboiled2 is a Scala 2.12+ library enabling lightweight and easy-to-use, yet powerful, fast and elegant parsing of arbitrary input text. It implements a macro-based parser generator for Parsing Expression Grammars (PEGs), which runs at compile time and translates a grammar rule definition (written in an internal Scala DSL) into corresponding JVM bytecode.
PEGs are an alternative to Context-Free Grammars (CFGs) for formally specifying syntax, they make a good replacement for regular expressions and have some advantages over the "traditional" way of building parsers via CFGs (like not needing a separate lexer/scanner phase).
parboiled2 is the successor of parboiled 1.x , which provides a similar capability (for Scala as well as Java) but does not actually generate a parser. Rather parboiled 1.x interprets a rule tree structure (which is also created via an internal DSL) against the input, which results in a much lower parsing performance. For more info on how parboiled 1.x and parboiled2 compare see parboiled2 vs. parboiled 1.x. You might also be interested in reading about parboiled2 vs. Scala Parser Combinators and parboiled2 vs. Regular Expressions.
- Concise, flexible and type-safe DSL for expressing parsing logic
- Full expressive power of Parsing Expression Grammars, for effectively dealing with most real-world parsing needs
- Excellent reporting of parse errors
- Parsing performance comparable to hand-written parsers
- Easy to learn and use (just one parsing phase (no lexer code required), rather small API)
- Light-weight enough to serve as a replacement for regular expressions (also strictly more powerful than regexes)
The artifacts for parboiled2 live on Maven Central and can be tied into your SBT-based Scala project like this:
libraryDependencies += "org.parboiled" %% "parboiled" % "2.5.0"
The latest released version is 2.5.1. It is available for Scala 2.12, 2.13 and 3, Scala.js and Scala Native.
parboiled2 has no external dependencies. (It used to depend on shapeless, but the few bits it was using have been internalized at some point).
Once on your classpath you can use this single import to bring everything you need into scope:
import org.parboiled2._
This is what a simple parboiled2 parser looks like:
import org.parboiled2._
class Calculator(val input: ParserInput) extends Parser {
def InputLine = rule { Expression ~ EOI }
def Expression: Rule1[Int] = rule {
Term ~ zeroOrMore(
'+' ~ Term ~> ((_: Int) + _)
| '-' ~ Term ~> ((_: Int) - _))
}
def Term = rule {
Factor ~ zeroOrMore(
'*' ~ Factor ~> ((_: Int) * _)
| '/' ~ Factor ~> ((_: Int) / _))
}
def Factor = rule { Number | Parens }
def Parens = rule { '(' ~ Expression ~ ')' }
def Number = rule { capture(Digits) ~> (_.toInt) }
def Digits = rule { oneOrMore(CharPredicate.Digit) }
}
new Calculator("1+1").InputLine.run() // evaluates to `scala.util.Success(2)`
This implements a parser for simple integer expressions like 1+(2-3*4)/5
and runs the actual calculation in-phase
with the parser. If you'd like to see it run and try it out yourself check out Running the Examples.
A parboiled2 parser is a class deriving from org.parboiled2.Parser
, which defines one abstract member:
def input: ParserInput
holding the input for the parsing run. Usually it is best implemented as a val
parameter in the constructor
(as shown in the Example above). As you can see from this design you need to (re-)create a new parser instance for
every parsing run (parser instances are very lightweight).
The "productions" (or "rules") of your grammar are then defined as simple methods, which in most cases consist of a
single call to the rule
macro whose argument is a DSL expression defining what input the rule is to match and
what actions to perform.
In order to run your parser against a given input you create a new instance and call run()
on the top-level rule,
e.g:
val parser = new MyParser(input)
parser.topLevelRule.run() // by default returns a ``scala.util.Try``
For more info on what options you have with regard to accessing the results of a parsing run check out the section on Access to Parser Results.
PEG parsers are quite easy to understand as they work just like most people without a lot of background in parsing theory would build a parser "by hand": recursive-descent with backtracking. They have only one parsing phase (not two, like most parsers produced by traditional parser generators like ANTLR), do not require any look-ahead and perform quite well in most real-world scenarios (although they can exhibit exponential runtime for certain pathological languages and inputs).
A PEG parser consists of a number of rules that logically form a "tree", with one "root" rule at the top calling zero or more lower-level rules, which can each call other rules and so on. Since rules can also call themselves or any of their parents the rule "tree" is not really a tree but rather a potentially cyclic directed graph, but in most cases the tree structure dominates, which is why its useful to think of it as a tree with potential cycles.
When a rule is executed against the current position in an input buffer it applies its specific matching logic to the input, which can either succeed or fail. In the success case the parser advances the input position (the cursor) and potentially executes the next rule. Otherwise, when the rule fails, the cursor is reset and the parser backtracks in search of another parsing alternative that might succeed.
For example consider this simple parboiled2 rule:
def foo = rule { 'a' ~ ('b' ~ 'c' | 'b' ~ 'd') }
When this rule is confronted with the input abd
the parser matches the input in these steps:
- Rule
foo
starts executing, which calls its first sub-rule'a'
. The cursor is at position 0. - Rule
'a'
is executed against input position 0, matches (succeeds) and the cursor is advanced to position 1. - Rule
'b' ~ 'c' | 'b' ~ 'd'
starts executing, which calls its first sub-rule'b' ~ 'c'
. - Rule
'b' ~ 'c'
starts executing, which calls its first sub-rule'b'
. - Rule
'b'
is executed against input position 1, matches (succeeds) and the cursor is advanced to position 2. - Rule
'c'
is executed against input position 2 and mismatches (fails). - Rule
'b' ~ 'c' | 'b' ~ 'd'
notices that its first sub-rule has failed, resets the cursor to position 1 and calls its 2nd sub-rule'b' ~ 'd'
. - Rule
'b' ~ 'd'
starts executing, which calls its first sub-rule'b'
. - Rule
'b'
is executed against input position 1, matches and the cursor is advanced to position 2. - Rule
'd'
is executed against input position 2, matches and the cursor is advanced to position 3. - Rule
'b' ~ 'd'
completes successfully, as its last sub-rule has succeeded. - Rule
'b' ~ 'c' | 'b' ~ 'd'
completes successfully, as one of its sub-rules has succeeded. - Rule
foo
completes execution successfully, as its last sub-rule has succeeded. The whole input "abd" was matched and the cursor is left at position 3 (after the last-matched character).
In order to work with parboiled2 effectively you should understand the core concepts behind its rule DSL, mainly the "Value Stack" and how parboiled2 encodes value stack operations in the Scala type system.
Apart from the input buffer and the cursor the parser manages another important structure: the "Value Stack". The value stack is a simple stack construct that serves as temporary storage for your Parser Actions. In many cases it is used for constructing an AST during the parsing run but it can also be used for "in-phase" computations (like in the Example above) or for any other purpose.
When a rule of a parboiled2 parser executes it performs any combination of the following three things:
- match input, i.e. advance the input cursor
- operate on the value stack, i.e. pop values off and/or push values to the value stack
- perform side-effects
Matching input is done by calling Basic Character Matching rules, which do nothing but match input and advance the cursor. Value stack operations (and other potential side-effects) are performed by Parser Actions.
It is important to understand that rules in parboiled2 (i.e. the rule methods in your parser class) do not directly return some custom value as a method result. Instead, all their consuming and producing values happens as side-effects to the value stack. Thereby the way that a rule interacts with value stack is encoded in the rule's type.
This is the general definition of a parboiled2 rule:
class Rule[-I <: HList, +O <: HList]
This can look scary at first but is really quite simple. An HList
is defined by shapeless and is essentially a type
of list whose element number and element types are statically known at compile time. The I
type parameter on
Rule
encodes what values (the number and types) the rule pops off the value stack and the O
type parameter
encodes what values (the number and types) the rule then pushes onto the value stack.
Luckily, in most cases, you won't have to work with these types directly as they can either be inferred or you can use one of these predefined aliases:
type Rule0 = RuleN[HNil]
type Rule1[+T] = RuleN[T :: HNil]
type Rule2[+A, +B] = RuleN[A :: B :: HNil]
type RuleN[+L <: HList] = Rule[HNil, L]
type PopRule[-L <: HList] = Rule[L, HNil]
Here is what these type aliases denote:
- Rule0
- A rule that neither pops off nor pushes to the value stack, i.e. has no effect on the value stack whatsoever. All Basic Character Matching rules are of this type.
- Rule1[+T]
- Pushes exactly one value of type
T
onto the value stack. AfterRule0
this is the second-most frequently used rule type. - Rule2[+A, +B]
- Pushes exactly two values of types
A
andB
onto the value stack. - RuleN[+L <: HList]
- Pushes a number of values onto the value stack, which correspond to the given
L <: HList
type parameter. - PopRule[-L <: HList]
- Pops a number of values off the value stack (corresponding to the given
L <: HList
type parameter) and does not produce any new value itself.
The rule DSL makes sure that the rule types are properly assembled and carried through your rule structure as you combine Basic Character Matching with Rule Combinators and Modifiers and Parser Actions, so as long as you don't write any logic that circumvents the value stack your parser will be completely type-safe and the compiler will be able to catch you if you make mistakes by combining rules in an unsound way.
The following basic character matching rules are the only way to cause the parser to match actual input and "make progress". They are the "atomic" elements of the rule DSL which are then used by the Rule Combinators and Modifiers to form higher-level rules.
- implicit def ch(c: Char): Rule0
Char
values can be directly used in the rule DSL and match themselves. There is one notable case where you will have to use the explicitch
wrapper: You cannot use the|
operator directly on chars as it denotes the built-in Scala binary "or" operator defined on numeric types (Char
is an unsigned 16-bit integer). So rather than saying'a' | 'b'
you will have to saych('a') | 'b'
.
- implicit def str(s: String): Rule0
String
values can be directly used in the rule DSL and match themselves.
- implicit def predicate(p: CharPredicate): Rule0
- You can use
org.parboiled2.CharPredicate
values directly in the rule DSL.CharPredicate
is an efficient implementation of character sets and already comes with a number pre-defined character classes likeCharPredicate.Digit
orCharPredicate.LowerHexLetter
.
- implicit def valueMap[T](m: Map[String, T]): R
Values of type
Map[String, T]
can be directly used in the rule DSL and match any of the given map's keys and push the respective value upon a successful match. The resulting rule type depends onT
:T
R
Unit
Rule0
L <: HList
RuleN[L]
(pushes all values ofL
)T
(otherwise)Rule1[T]
(pushes only one value)
- def anyOf(chars: String): Rule0
- This constructs a
Rule0
which matches any of the given strings characters.
- def noneOf(chars: String): Rule0
- This constructs a
Rule0
which matches any single character except the ones in the given string and except EOI.
- def ignoreCase(c: Char): Rule0
- Matches the given single character case insensitively. Note: The given character must be specified in lower-case! This requirement is currently NOT enforced!
- def ignoreCase(s: String): Rule0
- Matches the given string of characters case insensitively. Note: The given string must be specified in all lower-case! This requirement is currently NOT enforced!
- def ANY: Rule0
- Matches any character except EOI (end-of-input).
- def EOI: Char
- The EOI (end-of-input) character, which is a virtual character that the parser "appends" after the last character of the actual input.
- def MATCH: Rule0
- Matches no character (i.e. doesn't cause the parser to make any progress) but succeeds always. It's the "empty" rule that is mostly used as a neutral element in rule composition.
- def MISMATCH[I <: HList, O <: HList]: Rule[I, O]
- A rule that always fails. Fits any rule signature.
- def MISMATCH0: Rule0
- Same as
MISMATCH
but with a clearly defined type. Use it (rather thenMISMATCH
) if the call site doesn't clearly "dictate" a certain rule type and usingMISMATCH
therefore gives you a compiler error.
Rules can be freely combined/modified with these operations:
- a ~ b
Two rules
a
andb
can be combined with the~
operator resulting in a rule that only matches if firsta
matches and thenb
matches. The computation of the resulting rule type is somewhat involved. Here is an illustration (using an abbreviated HList notation):a b a ~ b Rule[, A]
Rule[, B]
Rule[, A:B]
Rule[A:B:C, D:E:F]
Rule[F, G:H]
Rule[A:B:C, D:E:G:H]
Rule[A, B:C]
Rule[D:B:C, E:F]
Rule[D:A, E:F]
Rule[A, B:C]
Rule[D:C, E:F]
Illegal if D
!=B
- a | b
- Two rules
a
andb
can be combined with the|
operator to form an "ordered choice" in PEG speak. The resulting rule tries to matcha
and succeeds if this succeeds. Otherwise the parser is reset andb
is tried. This operator can only be used on compatible rules.
- &(a)
Creates a "positive syntactic predicate", i.e. a rule that tests if the underlying rule matches but doesn't cause the parser to make any progress (i.e. match any input) itself. Also, all effects that the underlying rule might have had on the value stack are cleared out, the resulting rule type is therefore always
Rule0
, independently of the type of the underlying rule.Note that
&
not itself consuming any input can have surprising implications in repeating constructs, see Non-Termination when using Syntactic Predicates for more details.
- !a
Creates a "negative syntactic predicate", i.e. a rule that matches only if the underlying one mismatches and vice versa. A syntactic predicate doesn't cause the parser to make any progress (i.e. match any input) and also clears out all effects that the underlying rule might have had on the value stack. The resulting rule type is therefore always
Rule0
, independently of the type of the underlying rule.Note that
!
not itself consuming any input can have surprising implications in repeating constructs, see Non-Termination when using Syntactic Predicates for more details.
- optional(a)
Runs its inner rule and succeeds even if the inner rule doesn't. The resulting rule type depends on the type of the inner rule:
Type of a
Type of optional(a)
Rule0
Rule0
Rule1[T]
Rule1[Option[T]]
Rule[I, O <: I]
Rule[I, O]
The last case is a so-called "reduction rule", which leaves the value stack unchanged on a type level. This is an example of a reduction rule wrapped with
optional
:capture(CharPredicate.Digit) ~ optional(ch('h') ~> ((s: String) => s + "hex"))
The inner rule of
optional
here has typeRule[String :: HNil, String :: HNil]
, i.e. it pops oneString
off the stack and pushes another one onto it, which means that the number of elements on the value stack as well as their types remain the same, even though the actual values might have changed.As a shortcut you can also use
a.?
instead ofoptional(a)
.
- zeroOrMore(a)
Runs its inner rule until it fails, always succeeds. The resulting rule type depends on the type of the inner rule:
Type of a
Type of zeroOrMore(a)
Rule0
Rule0
Rule1[T]
Rule1[Seq[T]]
Rule[I, O <: I]
Rule[I, O]
The last case is a so-called "reduction rule", which leaves the value stack unchanged on a type level. This is an example of a reduction rule wrapped with
zeroOrMore
:(factor :Rule1[Int]) ~ zeroOrMore('*' ~ factor ~> ((a: Int, b) => a * b))
The inner rule of
zeroOrMore
here has typeRule[Int :: HNil, Int :: HNil]
, i.e. it pops oneInt
off the stack and pushes another one onto it, which means that the number of elements on the value stack as well as their types remain the same, even though the actual values might have changed.As a shortcut you can also use
a.*
instead ofzeroOrMore(a)
.
- oneOrMore(a)
Runs its inner rule until it fails, succeeds if its inner rule succeeded at least once. The resulting rule type depends on the type of the inner rule:
Type of a
Type of oneOrMore(a)
Rule0
Rule0
Rule1[T]
Rule1[Seq[T]]
Rule[I, O <: I]
Rule[I, I]
The last case is a so-called "reduction rule", which leaves the value stack unchanged on a type level. This is an example of a reduction rule wrapped with
oneOrMore
:(factor :Rule1[Int]) ~ oneOrMore('*' ~ factor ~> ((a: Int, b) => a * b))
The inner rule of
oneOrMore
here has typeRule[Int :: HNil, Int :: HNil]
, i.e. it pops oneInt
off the stack and pushes another one onto it, which means that the number of elements on the value stack as well as their types remain the same, even though the actual values might have changed.As a shortcut you can also use
a.+
instead ofoneOrMore(a)
.
- xxx.times(a)
Repeats a rule a given number of times.
xxx
can be either a positiveInt
value or a range(<x> to <y>)
whereby both<x>
and<y>
are positiveInt
values. The resulting rule type depends on the type of the inner rule:Type of a
Type of xxx.times(a)
Rule0
Rule0
Rule1[T]
Rule1[Seq[T]]
Rule[I, O <: I]
Rule[I, O]
The last case is a so-called "reduction rule", which leaves the value stack unchanged on a type level. This is an example of a reduction rule wrapped with
oneOrMore
:(factor :Rule1[Int]) ~ (1 to 5).times('*' ~ factor ~> ((a: Int, b) => a * b))
The inner rule here has type
Rule[Int :: HNil, Int :: HNil]
, i.e. it pops oneInt
off the stack and pushes another one onto it, which means that the number of elements on the value stack as well as their types remain the same, even though the actual values might have changed.
- a.separatedBy(separator: Rule0)
You can use
a.separatedBy(b)
to create a rule with efficient and automatic support for element separators ifa
is a rule produced by thezeroOrMore
,oneOrMore
orxxx.times
modifier andb
is aRule0
. The resulting rule has the same type asa
but expects the individual repetition elements to be separated by a successful match of theseparator
rule.As a shortcut you can also use
a.*(b)
or(a * b)
instead ofzeroOrMore(a).separatedBy(b)
. The same shortcut also works for+
(oneOrMore
).
- a ~!~ b
- Same as ~ but with "cut" semantics, meaning that the parser will never backtrack across this boundary. If the rule being concatenated doesn't match a parse error will be triggered immediately. Usually you don't need to use this "cut" operator but in certain cases it can help in simplifying grammar construction.
The Basic Character Matching rules and the Rule Combinators and Modifiers allow you to build recognizers for potentially complex languages, but usually your parser is supposed to do more than simply determine whether a given input conforms to the defined grammar. In order to run custom logic during parser execution, e.g. for creating custom objects (like an AST), you will have to add some "actions" to your rules.
- push(value)
push(value)
creates a rule that matches no input (but always succeeds, as a rule) and pushes the given value onto the value stack. Its rule type depends on the given value:Type of value
Type of push(value)
Unit
Rule0
(identical torun
in this case)L <: HList
RuleN[L]
(pushes all values ofL
)T
(otherwise)Rule1[T]
(pushes only one value)Also note that, due to the macro expansion the parboiled2 rule DSL is based on, the given value expression behaves like a call-by-name parameter even though it is not marked as one! This means that the argument expression to
push
is (re-)evaluated for every rule execution.
- capture(a)
Wrapping a rule
a
withcapture
turns that rule into one that pushes an additionalString
instance onto the value stack (in addition to all values thata
already pushes itself): the input text matched bya
.For example
capture(oneOrMore(CharPredicate.Digit))
has typeRule1[String]
and pushes one value onto the value stack: the string of digit characters matched byoneOrMore(CharPredicate.Digit)
.Another example:
capture("foo" ~ push(42))
has typeRule2[Int, String]
and will match input "foo". After successful execution the value stack will have the String"foo"
as its top element and42
underneath.
- test(condition: Boolean): Rule0
test
implements "semantic predicates". It creates a rule that matches no input and succeeds only if the given condition expression evaluates to true. Note that, due to the macro expansion the parboiled2 rule DSL is based on, the given argument behaves like a call-by-name parameter even though it is not marked as one! This means that the argument expression totest
is (re-)evaluated for every rule execution, just as iftest
would have been defined asdef test(condition: => Boolean): Rule0
.
- a ~> (...)
The
~>
operator is the "action operator" and as such the most frequently used way to add custom logic to a rule. It can be applied to any rule and appends action logic to it. The argument to~>
is always a function, what functions are allowed and what the resulting rule type is depends on the type ofa
.The basic idea is that the input of the function is popped off the value stack and the result of the function is pushed back onto it. In its basic form the
~>
operator therefore transforms the top elements of the value stack into some other object(s).Let's look at some examples:
(foo: Rule1[Int]) ~> (i => i * 2)
This results in a
Rule1[Int]
which multiplies the "output" of rulefoo
by 2.(foo: Rule2[Int, String]) ~> ((i, s) => s + i.toString)
This results in a
Rule1[String]
which combines the two "outputs" of rulefoo
(anInt
and aString
) into one singleString
.(foo: Rule2[Int, String]) ~> (_.toDouble)
This results in a
Rule2[Int, Double]
. As you can see the function argument to~>
doesn't always have to "take" the complete output of the rule its applied to. It can also take fewer or even more elements. Its parameters are simply matched left to right against the top of the value stack (the right-most parameter matching the top-level element).(foo: Rule1[String]) ~> ((i :Int, s) => s + i.toString)
This results in a
Rule[Int :: HNil, String :: HNil]
, i.e. a rule that pops oneInt
value off the stack and replaces it with aString
. Note that, while the parameter types to the action function can be inferred if they can be matched against an "output" of the underlying rule, this is not the case for parameters that don't directly correspond to an underlying output. In these cases you need to add an explicit type annotation to the respective action function parameter(s).If an action function returns
Unit
it doesn't push anything on the stack. So this rule(foo: Rule1[String]) ~> (println(_))
has type
Rule0
.Also, an action function can also be a
Function0
, i.e. a function without any parameters:(foo: Rule1[String]) ~> (() => 42)
This rule has type
Rule2[String, Int]
and is equivalent to this:(foo: Rule1[String]) ~ push(42)
An action function can also produce more than one output by returning an
HList
instance:(foo: Rule1[String]) ~> (s => s.toInt :: 3.14 :: HNil)
This has type
Rule2[Int, Double]
.One more very useful feature is special support for case class instance creation:
case class Person(name: String, age: Int) (foo: Rule2[String, Int]) ~> Person
This has type
Rule1[Person]
. The top elements of the value stack are popped off and replaced by an instance of the case class if they match in number, order and types to the case class members. This is great for building AST-like structures! Check out the Calculator2 example to see this form in action.Note that there is one quirk: For some reason this notation stops working if you explicitly define a companion object for your case class. You'll have to write
~> (Person(_, _))
instead.And finally, there is one more very powerful action type: the action function can itself return a rule! If an action returns a rule this rule is immediately executed after the action application just as if it had been concatenated to the underlying rule with the
~
operator. You can therefore do things like(foo: Rule1[Int]) ~> (i => test(i % 2 == 0) ~ push(i))
which is a
Rule1[Int]
that only produces even integers and fails for all others. Or, somewhat unusual but still perfectly legal:capture("x") ~> (str(_))
which is a
Rule0
that is identical to'x' ~ 'x'
.
- run(expression)
run
is the most versatile parser action. It can have several shapes, depending on the type of its argument expression. If the argument expression evaluates to- a rule (i.e. has type
R <: Rule[_, _]
) the result type ofrun
is this rule's type (i.e.R
) and the produced rule is immediately executed. - a function with 1 to 5 parameters these parameters are mapped against the top of the value stack, popped
and the function executed. Thereby the function behaves just like an action function for the
~>
operator, i.e. if it produces aUnit
value this result is simply dropped.HList
results are pushed onto the value stack (all their elements individually), rule results are immediately executed and other result values are pushed onto the value stack as a single element. The difference between usingrun
and attaching an action function with the~>
operator is that in the latter case the compiler can usually infer the types of the function parameters (if they map to "output" values of the base rule) while withrun
you always have to explicitly attach type annotation to the function parameters. - a function with one
HList
parameter the behavior is similar to the previous case with the difference that the elements of this parameterHList
are mapped against the value stack top. This allows for consumption of an arbitrary number of value stack elements (Note: This feature ofrun
is not yet currently implemented.) - any other value the result type of
run
is an always succeedingRule0
. Since in this case it doesn't interact with the value stack and doesn't match any input all it can do is perform "unchecked" side effects. Note that by usingrun
in this way you are leaving the "safety-net" that the value stack and the rule type system gives you! Make sure you understand what you are doing before using these kinds ofrun
actions!
Also note that, due to the macro expansion the parboiled2 rule DSL is based on, the given block behaves like a call-by-name parameter even though it is not marked as one! This means that the argument expression to
run
is (re-)evaluated for every rule execution.- a rule (i.e. has type
- runSubParser(f: ParserInput => Rule[I, O]): Rule[I, O]
- This action allows creation of a sub parser and running of one of its rules as part of the current parsing process.
The subparser will start parsing at the current input position and the outer parser (the one calling
runSubParser
) will continue where the sub-parser stopped.
There are a few more members of the Parser
class that are useful for writing efficient action logic:
- def cursor: Int
- The index of the next (yet unmatched) input character.
Note: Might be equal to
input.length
if the cursor is currently behind the last input character! - def cursorChar: Char
- The next (yet unmatched) input character, i.e. the one at the
cursor
index. Identical toif (cursor < input.length) input.charAt(cursor) else EOI
but more efficient. - def lastChar: Char
- Returns the last character that was matched, i.e. the one at index
cursor - 1
and as such is equivalent tocharAt(-1)
. Note that for performance optimization this method does not do a range check, i.e. depending on theParserInput
implementation you might get an exception when calling this method before any character was matched by the parser. - def charAt(offset: Int): Char
- Returns the character at the input index with the given delta to the cursor and as such is equivalent to
input.charAt(cursor + offset)
. Note that for performance optimization this method does not do a range check, i.e. depending on theParserInput
implementation you might get an exception if the computed index is out of bounds. - def charAtRC(offset: Int): Char
- Same as
charAt
but range-checked. Returns the input character at the index with the given offset from the cursor. If this index is out of range the method returnsEOI
.
You can use these to write efficient character-level logic like this:
def hexDigit: Rule1[Int] = rule {
CharPredicate.HexAlpha ~ push(CharUtils.hexValue(lastChar))
}
- Base64Parsing
- For parsing RFC2045 (Base64) encoded strings parboiled provides the
Base64Parsing
trait which you can mix into yourParser
class. See its source for more info on what exactly it provides. parboiled also comes with theorg.parboiled2.util.Base64
class which provides an efficient Base64 encoder/decoder for the standard as well as custom alphabets.
- DynamicRuleDispatch
- Sometimes an application cannot fully specify at compile-time which of a given set of rules is to be called at
runtime. For example, a parser for parsing HTTP header values might need to select the right parser rule for a
header name that is only known once the HTTP request has actually been read from the network.
To prevent you from having to write a large (and not really efficient)
match
against the header name for separating out all the possible cases parboiled provides theDynamicRuleDispatch
facility. Check out its test for more info on how to use it.
- StringBuilding
- For certain high-performance use-cases it is sometimes better to construct Strings that the parser is to
produce/extract from the input in a char-by-char fashion. To support you in doing this parboiled provides
the
StringBuilding
trait which you can mix into yourParser
class. It provides convenient access to a single and mutableStringBuilder
instance. As such it operates outside of the value stack and therefore without the full "safety net" that parboiled's DSL otherwise gives you. If you don't understand what this means you probably shouldn't be using theStringBuilding
trait but resort tocapture
and ordinary parser actions instead.
In many applications, especially with grammars that are not too complex, parboiled provides good error reports right out of the box, without any additional requirements on your part. However, there are cases where you want to have more control over how parse errors are created and/or formatted. This section gives an overview over how parse error reporting works in parboiled and how you can influence it.
As described in the section about How the Parser matches Input above the parser consumes input by applying grammar rules and backtracking in the case of mismatches. As such rule mismatches are an integral part of the parsers operation and do not generally mean that there is something wrong with the input. Only when the root rule itself mismatches and the parser has no backtracking options remaining does it become clear that a parse error is present. At that point however, when the root rule mismatches, the information about where exactly the problematic input was and which of the many rule mismatches that the parser experienced during the run were the "bad" ones is already lost.
parboiled overcomes this problem by simply re-running the failed parser, potentially many times, and "watching" it as
it tries to consume the erroneous input. With every re-run parboiled learns a bit more about the position and nature
of the error and when this analysis is complete a ParseError
instance is constructed and handed to the application
as the result of the parsing run, which can then use the error information on its level (e.g. for formatting it and
displaying it to the user).
Note that re-running the parser in the presence of parse errors does result in unsuccessful parsing runs being
potentially much slower than successful ones. However, since in the vast majority of use cases failed runs constitute
only a small minority of all parsing runs and the normal flow of application logic is disrupted anyway, this slow-down
is normally quite acceptable, especially if it results in better error messages. See the section on
Limiting Error Re-Runs if this is not true for your application.
In principle the error reporting process looks like this:
- The grammar's root rule is run at maximum speed against the parser input. If this succeeds then all is well and the parsing result is immediately dispatched to the user.
- If the root rule did not match we know that there we have a parsing error.
The parser is then run again to establish the "principal error location". The principal error location is the
first character in the input that could not be matched by any rule during the parsing run. In order words, it is
the maximum value that the parser's
cursor
member had during the parsing run. - Once the error location is known the parser is run again. This time all rule mismatches against the input character at error location are recorded. These rule mismatches are used to determine what input the grammar "expects" at the error location but failed to see. For every such "error rule mismatch" the parser collects the "rule trace", i.e. the stack of rules that led to it. Currently this is done by throwing a special exception that bubbles up through the JVM call stack and records rule stack information on its way up. A consequence of this design is that the parser needs to be re-run once per "error rule mismatch".
- When all error rule traces have been collected all the relevant information about the parse error has been extracted
and a
ParseError
instance can be constructed and dispatched to the user.
Note: The real process contains a few more steps to properly deal with the atomic
and quiet
markers described
below. However, knowledge of these additional steps is not important for understanding the basic approach for how
ParseError
instances are constructed.
If a parsing runs fails and you receive a ParseError
instance you can call the formatError
method on your
parser instance to get the error rendered into an error message string:
val errorMsg = parser.formatError(error)
The formatError
message can also take an explicit ErrorFormatter
as a second argument, which allows you to
influence how exactly the error is to be rendered. For example, in order to also render the rule traces you can do:
val errorMsg = parser.formatError(error, new ErrorFormatter(showTraces = true))
Look at the signature of the ErrorFormatter
constructor for more information on what rendering options exist.
If you want even more control over the error rendering process you can extend the ErrorFormatter
and override
its methods where you see fit.
While the error collection process described above yields all information required for a basic "this character was not matched and these characters were expected instead" information you sometimes want to have more control over what exactly is reported as "found" and as "expected".
Since PEG parsers are scanner-less (i.e. without an intermediate "TOKEN-stream") they operate directly on the input buffer's character level. As such, by default, parboiled reports all errors on this character level.
For example, if you run the rule "foo" | "fob" | "bar"
against input "foxes" you'll get this error message:
Invalid input 'x', expected 'o' or 'b' (line 1, column 3): foxes ^
While this error message is certainly correct, it might not be what you want to show your users, e.g. because foo
,
fob
and bar
are regarded as "atomic" keywords of your language, that should either be matched completely or not
at all. In this case you can use the atomic
marker to signal this to the parser.
For example, running the rule atomic("foo") | atomic("fob") | atomic("bar")
against input "foxes" yields this error
message:
Invalid input "fox", expected "foo", "fob" or "bar" (line 1, column 1): foxes ^
Of course you can use the atomic
marker on any type of rule, not just string rules. It essentially moves the
reported error position forward from the principal error position and lifts the level at which errors are reported
from the character level to a rule level of your choice.
Another problem that more frequently occurs with parboiled's default error reporting is that the list of "expected" things becomes too long. Often the reason for this are rules that deal match input which can appear pretty much anywhere, like whitespace or comments.
Consider this simple language:
def Expr = rule { oneOrMore(Id ~ Keyword ~ Id).separatedBy(',' ~ WS) ~ EOI }
def Id = rule { oneOrMore(CharPredicate.Alpha) ~ WS }
def Keyword = rule { atomic(("has" | "is") ~ WS) }
def WS = rule { zeroOrMore(anyOf(" \t \n")) }
When we run the Expr
rule against input "Tim has money, Tom Is poor" we get this error:
Invalid input 'I', expected [ \t \n] or Keyword (line 1, column 20): Tim has money, Tom Is poor ^
Again the list of "expected" things is technically correct but we don't want to bother the user with the information
that whitespace is also allowed at the error location. The quiet
marker let's us suppress a certain rule from the
expected list if there are also non-quiet alternatives:
def WS = rule { quiet(zeroOrMore(anyOf(" \t \n"))) }
With that change the error message becomes:
Invalid input 'I', expected Keyword (line 1, column 20): Tim has money, Tom Is poor ^
which is what we want.
parboiled uses a somewhat involved logic to determine what exactly to report as "mismatched" and "expected" for a given parse error. Essentially the process looks like this:
- Compare all rule trace for the error and drop a potentially existing common prefix. This is done because, if all traces share a common prefix, this prefix can be regarded as the "context" of the error which is probably apparent to the user and as such doesn't need to be reported.
- For each trace (suffix), find the first frame that tried to start its match at the reported error position. The string representation of this frame (which might be an assigned name) is selected for "expected" reporting.
- Duplicate "expected" strings are removed.
So, apart from placing atomic
and quiet
markers you can also influence what gets reported as "expected" by
explicitly naming rules. One way to do this is to pick good names for the rule methods as they automatically attach
their name to their rules. The names of val
or def
members that you use to reference CharPredicate
instances also automatically name the respective rule.
If you don't want to split out rules into their own methods you can also use the named
modifier.
With it you can attach an explicit name to any parser rule. For example, if you run the rule foo
from this snippet:
def foo = rule { "aa" | atomic("aaa").named("threeAs") | 'b' | 'B'.named("bigB") }
against input x
you'll get this error message:
Invalid input 'x', expected 'a', threeAs, 'b' or bigB (line 1, column 1): x ^
If you want to completely bypass parboiled's built-in error reporting logic you can do so by exclusively relying on
the fail
helper, which causes the parser to immediately and fatally terminate the parsing run with a single
one-frame rule trace with a given "expected" message.
For example, the rule "foo" | fail("a true FOO")
will produce this error when run against x
:
Invalid input 'x', expected a true FOO (line 1, column 1): x ^
Really large grammars, especially ones with bugs as they commonly appear during development, can exhibit a very large
number of rule traces (potentially thousands) and thus cause the parser to take longer than convenient to terminate an
error parsing run.
In order to mitigate this parboiled has a configurable limit on the maximum number of rule traces the parser will
collect during a single error run. The default limit is 24, you can change it by overriding the
errorTraceCollectionLimit
method of the Parser
class.
Currently parboiled only ever parses up to the very first parse error in the input. While this is all that's required for a large number of use cases there are applications that do require the ability to somehow recover from parse errors and continue parsing. Syntax highlighting in an interactive IDE-like environment is one such example.
Future versions of parboiled might support parse error recovery. If your application would benefit from this feature please let us know in this github ticket.
Sometimes you might find yourself in a situation where you'd like to DRY up your grammar definition by factoring out common constructs from several rule definitions in a "meta-rule" that modifies/decorates other rules. Essentially you'd like to write something like this (illegal code!):
def expression = rule { bracketed(ab) ~ bracketed(cd) }
def ab = rule { "ab" }
def cd = rule { "cd" }
def bracketed(inner: Rule0) = rule { '[' ~ inner ~ ']' }
In this hypothetical example bracketed
is a meta-rule which takes another rule as parameter and calls it from within
its own rule definition.
Unfortunately enabling a syntax such as the one shown above it not directly possible with parboiled.
When looking at how the parser generation in parboiled actually works the reason becomes clear.
parboiled "expands" the rule definition that is passed as argument to the rule
macro into actual Scala code.
The rule methods themselves however remain what they are: instance methods on the parser class.
And since you cannot simply pass a method name as argument to another method the calls bracketed(ab)
and
bracketed(cd)
from above don't compile.
However, there is a work-around which might be good enough for your meta-rule needs:
def expression = rule { bracketed(ab) ~ bracketed(cd) }
val ab = () => rule { "ab" }
val cd = () => rule { "cd" }
def bracketed(inner: () => Rule0) = rule { '[' ~ inner() ~ ']' }
If you model the rules that you want to pass as arguments to other rules as Function0
instances you can pass
them around. Assigning those function instances to val
members avoids re-allocation during every execution of
the expression
rule which would come with a potentially significant performance cost.
There is one mistake that new users frequently make when starting out with writing PEG grammars: disregarding the
"ordered choice" logic of the |
operator. This operator always tries all alternatives in the order that they were
defined and picks the first match.
As a consequence earlier alternatives that are a prefix of later alternatives will always "shadow" the later ones, the later ones will never be able to match!
For example in this simple rule
def foo = rule { "foo" | "foobar" }
"foobar" will never match. Reordering the alternatives to either "factor out" all common prefixes or putting the more specific alternatives first are the canonical solutions.
If your parser is not behaving the way you expect it to watch out for this "wrong ordering" problem, which might be not that easy to spot in more complicated rule structures.
The syntactic predicate operators, &
and !
, don't themselves consume any input, so directly wrapping them with a
repeating combinator (like zeroOrMore
or oneOrMore
) will lead to an infinite loop as the parser continuously
runs the syntactic predicate against the very same input position without making any progress.
If you use syntactic predicates in a loop make sure to actually consume input as well. For example:
def foo = rule { capture(zeroOrMore( !',' )) }
will never terminate, while
def foo = rule { capture(zeroOrMore( !',' ~ ANY )) }
will capture all input until it reaches a comma.
parboiled2 parsers work with mutable state as a design choice for achieving good parsing performance. Matching input and operating on the value stack happen as side-effects to rule execution and mutate the parser state. However, as long as you confine yourself to the value stack and do not add parser actions that mutate custom parser members the rule DSL will protect you from making mistakes.
It is important to understand that, in case of rule mismatch, the parser state (cursor and value stack) is reset to what it was before the rule execution was started. However, if you write rules that have side-effects beyond matching input and operating on the value stack than these side-effects cannot be automatically rolled-back! This means that you will have to make sure that you action logic "cleans up after itself" in the case of rule mismatches or is only used in locations where you know that rule execution can never fail. These techniques are considered advanced and are not recommended for beginners.
The rule DSL is powerful enough to support even very complex parsing logic without the need to resort to custom mutable state, we consider the addition of mutable members as an optimization that should be well justified.
One disadvantage of PEGs over lexer-based parser can be the handling of white space. In a "traditional" parser with a separate lexer (scanner) phase this lexer can simply skip all white space and only generate tokens for the actual parser to operate on. This can free the higher-level parser grammar from all white space treatment.
Since PEGs do not have a lexer but directly operate on the raw input they have to deal with white space in the grammar itself. Language designers with little experience in PEGs can sometime be unsure of how to best handle white space in their grammar.
The common and highly recommended pattern is to match white space always immediately after a terminal (a single character or string) but not in any other place. This helps with keeping your grammar rules properly structured and white space "taken care of" without it getting in the way.
In order to reduce boilerplate in your grammar definition parboiled allows for cleanly factoring out whitespace matching
logic into a dedicated rule. By defining a custom implicit conversion from String
to Rule0
you can implicitly
match whitespace after a string terminal:
class FooParser(val input: ParserInput) extends Parser {
implicit def wspStr(s: String): Rule0 = rule {
str(s) ~ zeroOrMore(' ')
}
def foo = rule { "foobar" | "foo" } // implicitly matches trailing blanks
def fooNoWSP = rule { str("foobar") | str("foo") } // doesn't match trailing blanks
}
In this example all usages of a plain string literals in the parser rules will implicitly match trailing space characters.
In order to not apply the implicit whitespace matching in this case simply say str("foo")
instead of just "foo"
.
If you don't explicitly match EOI
(the special end-of-input pseudo-character) in your grammar's root rule
the parser will not produce an error if, at the end of a parsing run, there is still unmatched input left.
This means that if the root rule matches only a prefix of the whole input the parser will report a successful parsing
run, which might not be what you want.
As an example, consider this very basic parser:
class MyParser(val input: ParserInput) extends Parser {
def InputLine = rule { "foo" | "bar" }
}
new MyParser("foo").InputLine.run() // Success
new MyParser("foot").InputLine.run() // also Success!!
In the second run of the parser, instead of failing with a ParseError
as you might expect, it successfully parses
the matching input foo
and ignores the rest of the input.
If this is not what you want you need to explicitly match EOI
, for example as follows:
def InputLine = rule { ("foo" | "bar") ~ EOI }
TODO
(e.g., use parse.formatError(error, showTraces = true)
)
In order to run the top-level parser rule against a given input you create a new instance of your parser class and
call run()
on it, e.g:
val parser = new MyParser(input)
val result = parser.rootRule.run()
By default the type of result
in this snippet will be a Try[T]
whereby T
depends on the type
of rootRule
:
Type of rootRule |
Type of rootRule.run() |
---|---|
Rule0 |
Try[Unit] |
Rule1[T] |
Try[T] |
RuleN[L <: HList] (otherwise) |
Try[L] |
The contents of the value stack at the end of the rootRule
execution constitute the result of the parsing run.
Note that run()
is not available on rules that are not of type RuleN[L <: HList]
.
If the parser is not able to match the input successfully it creates an instance of class ParseError
, which is
defined like this
case class ParseError(position: Position, charCount: Int, traces: Seq[RuleTrace]) extends RuntimeException
In such cases the Try
is completed with a scala.util.Failure
holding the ParseError
.
If other exceptions occur during the parsing run (e.g. because some parser action failed) these will also end up as
a Try
failure.
parboiled2 has quite powerful error reporting facilities, which should help you (and your users) to easily understand
why a particular input does not conform to the defined grammar and how this can be fixed.
The formatError
method available on the Parser
class is of great utility here, as it can "pretty print"
a parse error instance, to display something like this (excerpt from the ErrorReportingSpec_):
Invalid input 'x', expected 'f', Digit, hex or UpperAlpha (line 1, column 4): abcx ^ 4 rules mismatched at error location: targetRule / | / "fgh" / 'f' targetRule / | / Digit targetRule / | / hex targetRule / | / UpperAlpha
Apart from delivering your parser results as a Try[T]
parboiled2 allows you to select another one of the
pre-defined Parser.DeliveryScheme
alternatives, or even define your own. They differ in how they wrap the three
possible outcomes of a parsing run:
- parsing completed successfully, deliver a result of type
T
- parsing failed with a
ParseError
- parsing failed due to another exception
This table compares the built-in Parser.DeliveryScheme
alternatives (the first one being the default):
Import | Type of rootRule.run() |
Success | ParseError | Other Exceptions |
---|---|---|---|---|
import Parser.DeliveryScheme.Try | Try[T] | Success | Failure | Failure |
import Parser.DeliveryScheme.Either | Either[ParseError, T] | Right | Left | thrown |
import Parser.DeliveryScheme.Throw | T | T | thrown | thrown |
Follow these steps to run the example parsers defined here on your own machine:
Clone the parboiled2 repository:
git clone git@github.com:sirthias/parboiled2.git
Change into the base directory:
cd parboiled2
Run SBT:
sbt "project examples" run
TODO
(about one order of magnitude faster, more powerful DSL, improved error reporting, fewer dependencies (more lightweight), but Scala 2.10.3+ only, no error recovery (yet) and no Java version (ever))
TODO
(several hundred times (!) faster, better error reporting, more concise and elegant DSL, similarly powerful in terms of language class capabilities, but Scala 2.10.3+ only, 2 added dependencies (parboiled2 + shapeless))
TODO
(much easier to read and maintain, more powerful (e.g. regexes do not support recursive structures), faster, but Scala 2.10.3+ only, 2 added dependencies (parboiled2 + shapeless))
TODO
TODO
In most cases the parboiled2 mailing list is probably the best place for your needs with regard to support, feedback and general discussion.
Note: Your first post after signup is going to be moderated (for spam protection), but we'll immediately give you full posting privileges if your message doesn't unmask you as a spammer.
You can also use the gitter.im chat channel for parboiled2:
TODO
Much of parboiled2 was developed by Alexander Myltsev during GSoc 2013, a big thank you for his great work!
Also, without the Macro Paradise made available by Eugene Burmako parboiled2 would probably still not be ready and its codebase would look a lot more messy.
parboiled2 is released under the Apache License 2.0