Skip to content


Subversion checkout URL

You can clone with
Download ZIP
Tree: 51511334d5
Fetching contributors…

Cannot retrieve contributors at this time

375 lines (282 sloc) 12.8 KB
=begin pod
=CHAPTER Grammars
Grammars organize regexes, just like classes organize methods. The following
example demonstrates how to parse JSON, a data exchange format already
introduced (see L<multis>).
=begin code
# file lib/JSON/Tiny/
grammar JSON::Tiny::Grammar {
rule TOP { ^[ <object> | <array> ]$ }
rule object { '{' ~ '}' <pairlist> }
rule pairlist { [ <pair> ** [ \, ] ]? }
rule pair { <string> ':' <value> }
rule array { '[' ~ ']' [ <value> ** [ \, ] ]? }
proto token value { <...> };
token value:sym<number> {
[ 0 | <[1..9]> <[0..9]>* ]
[ \. <[0..9]>+ ]?
[ <[eE]> [\+|\-]? <[0..9]>+ ]?
token value:sym<true> { <sym> };
token value:sym<false> { <sym> };
token value:sym<null> { <sym> };
token value:sym<object> { <object> };
token value:sym<array> { <array> };
token value:sym<string> { <string> }
token string {
\" ~ \" [ <str> | \\ <str_escape> ]*
token str {
<!before \t>
<!before \n>
<!before \\>
<!before \">
# <-["\\\t\n]>+
token str_escape {
<["\\/bfnrt]> | u <xdigit>**4
# test it:
my $tester = '{
"country": "Austria",
"cities": [ "Wien", "Salzburg", "Innsbruck" ],
"population": 8353243
if JSON::Tiny::Grammar.parse($tester) {
say "It's valid JSON";
} else {
# TODO: error reporting
say "Not quite...";
=end code
A grammar contains various named regex. Regex names may be constructed
the same as subroutine names or method names. While regex names are
completely up to the grammar writer, a rule named C<TOP>
will, by default, be invoked when the C<.parse()> method is
executed on a grammar. The above call to C<JSON::Tiny::Grammar.parse($tester)>
starts by attempting to match the regex named C<TOP> to the string C<$tester>.
In this example, the C<TOP> rule anchors the match to the start and end
of the string, so that the whole string has to be in valid JSON format
for the match to succeed. After matching the anchor at the start of the
string, the regex attempts to match either an C<< <array> >> or an C<<
<object> >>. Enclosing a regex name in angle brackets causes the regex
engine to attempt to match a regex by that name within the same grammar.
Subsequent matches are straightforward and reflect the structure in
which JSON components can appear.
Regexes can be recursive. An C<array> contains C<value>. In turn a
C<value> can be an C<array>. This will not cause an infinite loop as
long as at least one regex per recursive call consumes at least one
character. If a set of regexes were to call each other recursively
without progressing in the string, the recursion could go on infinitely
and never proceed to other parts of the grammar.
X<|goal matching>
X<|~, regex meta character>
The example grammar given above introduces the I<goal matching syntax>
which can be presented abstractly as: C<A ~ B C>. In
C<JSON::Tiny::Grammar>, C<A> is C<'{'>, C<B> is C<'}'> and C<C> is C<<
<pairlist> >>. The atom on the left of the tilde (C<A>) is matched
normally, but the atom to the right of the tilde (C<B>) is set as the
goal, and then the final atom (C<C>) is matched. Once the final atom
matches, the regex engine attempts to match the goal (C<B>). This has
the effect of switching the match order of the final two atoms (C<B> and
C<C>), but since Perl knows that the regex engine should be looking for
the goal, a better error message can be given when the goal does not
match. This is very helpful for bracketing constructs as it puts the
brackets near one another.
X<|proto token>
Another novelty is the declaration of a I<proto token>:
=begin code
proto token value { <...> };
token value:sym<number> {
[ 0 | <[1..9]> <[0..9]>* ]
[ \. <[0..9]>+ ]?
[ <[eE]> [\+|\-]? <[0..9]>+ ]?
token value:sym<true> { <sym> };
token value:sym<false> { <sym> };
=end code
The C<proto token> syntax indicates that C<value> will be a set of
alternatives instead of a single regex. Each alternative has a name of the
form C<< token value:sym<thing> >>, which can be read as I<< alternative of
C<value> with parameter C<sym> set to C<thing> >>. The body of such an
alternative is a normal regex, where the call C<< <sym> >> matches the value
of the parameter, in this example C<thing>.
When calling the rule C<< <value> >>, the grammar engine attempts to
match the alternatives in parallel and the longest match wins. This is
exactly like normal alternation, but as we'll see in the next section,
has the advantage of being extensible.
=head1 Grammar Inheritance
The similarity of grammars to classes goes deeper than storing regexes in a
namespace as a class might store methods. You can inherit from and extend
grammars, mix roles into them, and take advantage of polymorphism. In fact, a
grammar is a class which by default inherits from C<Grammar> instead of
Suppose you want to enhance the JSON grammar to allow single-line C++ or
JavaScript comments, which begin with C<//> and continue until the end of the
line. The simplest enhancement is to allow such a comment in any place where
whitespace is valid.
However, C<JSON::Tiny::Grammar> only implicitly matches whitespace
through the use of I<rules>, which are like tokens but with the
C<:sigspace> modifier enabled. Implicit whitespace is matched with the
inherited regex C<< <ws> >>, so the simplest approach to enable single-
line comments is to override that named regex:
=begin code
grammar JSON::Tiny::Grammar::WithComments
is JSON::Tiny::Grammar {
token ws {
\s* [ '//' \N* \n ]?
my $tester = '{
"country": "Austria",
"cities": [ "Wien", "Salzburg", "Innsbruck" ],
"population": 8353243 // data from 2009-01
if JSON::Tiny::Grammar::WithComments.parse($tester) {
say "It's valid (modified) JSON";
=end code
The first two lines introduce a grammar that inherits from
C<JSON::Tiny::Grammar>. As subclasses inherit methods from superclasses,
so any grammar rule not present in the derived grammar will come from
its base grammar.
In this minimal JSON grammar, whitespace is never mandatory, so C<ws>
can match nothing at all. After optional spaces, two slashes C<'//'>
introduce a comment, after which must follow an arbitrary number of non-
newline characters, and then a newline. In prose, the comment starts
with C<'//'> and extends to the rest of the line.
Inherited grammars may also add variants to proto tokens:
=begin code
grammar JSON::ExtendedNumeric is JSON::Tiny::Grammar {
token value:sym<nan> { <sym> }
token value:sym<inf> { <[+-]>? <sym> }
=end code
In this grammar, a call to C<< <value> >> matches either one of the newly
added alternatives, or any of the old alternatives from the parent grammar
C<JSON::Tiny::Grammar>. Such extensibility is difficult to achieve with
ordinary, C<|> delimited alternatives.
=head1 Extracting data
X<|reduction methods>
X<|action methods>
The C<parse> method of a grammar returns a C<Match> object through which
you can access all the relevant information of the match. Named regex
that match within the grammar may be accessed via the C<Match> object
similar to a hash where the keys are the regex names and the values are
the C<Match> object that represents that part of the overall regex
match. Similarly, portions of the match that are captured with
parentheses are available as positional elements of the C<Match> object
(as if it were an array).
Once you have the Match object, what can you I<do> with it? You could
recursively traverse this object and create data structures based on what
you find or execute code. An alternative solution exists: I<action methods>.
=begin code
class JSON::Tiny::Actions {
method TOP($/) { make $/.values.[0].ast }
method object($/) { make $<pairlist>.ast.hash }
method pairlist($/) { make $<pair>».ast }
method pair($/) { make $<string>.ast => $<value>.ast }
method array($/) { make [$<value>».ast] }
method string($/) { make join '', $/.caps>>.value>>.ast }
# TODO: make that
# make +$/
# once prefix:<+> is sufficiently polymorphic
method value:sym<number>($/) { make eval $/ }
method value:sym<string>($/) { make $<string>.ast }
method value:sym<true> ($/) { make Bool::True }
method value:sym<false> ($/) { make Bool::False }
method value:sym<null> ($/) { make Any }
method value:sym<object>($/) { make $<object>.ast }
method value:sym<array> ($/) { make $<array>.ast }
method str($/) { make ~$/ }
method str_escape($/) {
if $<xdigit> {
make chr(:16($<xdigit>.join));
} else {
my %h = '\\' => "\\",
'n' => "\n",
't' => "\t",
'f' => "\f",
'r' => "\r";
make %h{$/};
my $actions =;
JSON::Tiny::Grammar.parse($str, :$actions);
=end code
This example passes an actions object to the grammar's C<parse> method.
Whenever the grammar engine finishes parsing a regex, it calls a method on
the actions object with the same name as the current regex. If no such method
exists, the grammar engine moves along. If a method does exist, the grammar
engine passes the current match object as a positional argument.
X<|abstract syntax tree>
Each match object has a slot called C<ast> (short for I<abstract syntax tree>)
for a payload object. This slot can hold a custom data structure that you
create from the action methods. Calling C<make $thing> in an action method
sets the C<ast> attribute of the current match object to C<$thing>.
X<|abstract syntax tree>
=begin comment :sidebar
An abstract syntax tree, or AST, is a data structure which represents the
parsed version of the text. Your grammar describes the structure of the AST:
its root element is the C<TOP> node, which contain children of the allowed
types and so on.
=end comment
In the case of the JSON parser, the payload is the data structure that the
JSON string represents. For each matching rule, the grammar engine calls an
action method to populate the C<ast> slot of the match object. This process
transforms the match tree into a different tree--in this case, the actual JSON
Although the rules and action methods live in different namespaces (and in a
real-world project probably even in separate files), here they are adjacent to
demonstrate their correspondence:
=begin code
rule TOP { ^ [ <object> | <array> ]$ }
method TOP($/) { make $/.values.[0].ast }
=end code
The C<TOP> rule has an alternation with two branches, C<object> and C<array>.
Both have a named capture. C<$/.values> returns a list of all captures, here
either the C<object> or the C<array> capture.
The action method takes the AST attached to the match object of that sub
capture, and promotes it as its own AST by calling C<make>.
=begin code
rule object { '{' ~ '}' <pairlist> }
method object($/) { make $<pairlist>.ast.hash }
=end code
The reduction method for C<object> extracts the AST of the C<pairlist>
submatch and turns it into a hash by calling its C<hash> method.
=begin code
rule pairlist { [ <pair> ** [ \, ] ]? }
method pairlist($/) { make $<pair>».ast; }
=end code
The C<pairlist> rule matches multiple comma-separated pairs. The reduction
method calls the C<.ast> method on each matched pair and installs the result
list in its own AST.
=begin code
rule pair { <string> ':' <value> }
method pair($/) { make $<string>.ast => $<value>.ast }
=end code
A pair consists of a string key and a value, so the action method constructs a
Perl 6 pair with the C<< => >> operator.
The other action methods work the same way. They transform the information
they extract from the match object into native Perl 6 data structures, and
call C<make> to set those native structures as their own ASTs.
The action methods that belong to a proto token are parametric in the same way
as the alternative:
=begin code
token value:sym<null> { <sym> };
method value:sym<null>($/) { make Any }
token value:sym<object> { <object> };
method value:sym<object>($/) { make $<object>.ast }
=end code
When a C<< <value> >> call matches, the action method with the same
parametrization as the matching alternative executes.
=end pod
Jump to Line
Something went wrong with that request. Please try again.