Parse versus RE
Author: Ladislav Mecir
There have been numerous discussions on the Rebol mailing list covering this subject. I decided to summarize some of the information to have at hand for future use.
Regular expressions are designed to describe regular grammars, i.e. a regular grammar is a grammar that can be described using a regular expression.
Example:
a*bc*
There is another definition: A regular grammar is a grammar that can be described by a generative grammar having every rule of the kind:
X → wY
or
X → w
, where X and Y are nonterminal symbols, while w is a word containing only terminal symbols. That is why regular grammars are called "right linear grammars" too.
The above example grammar can be translated to:
S → aS S → bC C → cC C → ε
Every regular expression can be easily transliterated to obtain the corresponding right-linear grammar as well as the corresponding parse expression.
Warning: Don’t be confused by notions like "Perl’s regular expressions" or "Python’s regular expressions" etc. These "regular expressions" actually are enhanced regular expressions. They use generalizations of regular expressions, i.e. they aren’t regular expressions in the strict meaning of the notion. As an illustration see the example of a context dependent grammar below using Perl’s regular expression. On the other hand, it is straightforward to describe any regular grammar using Perl’s regular expression.
For the definition of PEGs see the PEG Wikipedia page.
To notice that the parse expressions and PEGs are related it suffices to find out that the constructs they use are equivalent:
Name | PEG expression | Equivalent parse expression |
---|---|---|
Trivial success | ε | none |
Sequence | e1 e2 | e1 e2 |
Ordered choice | e1 / e2 | e1 | e2 |
Zero or more | e* | any e |
One or more | e+ | some e |
Optional | e? | opt e |
And-predicate | &e |
[e (cont: none) | (cont: [end skip])] [end skip] | cont and e |
Not-predicate | !e |
[e (cont: [end skip]) | (cont: none)] cont not e |
- a much simpler recursion-unsafe equivalent of the PEG and predicate is:
here: e :here
- In case of the and and not predicates, the first variant works in R2 as well as in R3, the second variant is available only in R3.
Let’s see a simple example of a context free grammar. This is a generative grammar that describes a (possibly empty) correctly parenthesized expression:
Terminal symbols:
( )
Non terminal symbols and their meaning:
S - start
Rules:
S → "("S")"S S → ε
A PEG description of such a grammar is:
S ← "("S")"S/ε
, or, alternatively:
S ← ("("S")")*
expression: ["(" expression ")" expression | none]
Note: This is equivalent to the first PEG rule above.
Usage:
parse/all "()" expression ; == true
expression: [any ["(" expression ")"]]
Note: This is equivalent to the second PEG rule above.
This grammar is not a regular grammar, therefore it cannot be described using a regular expression.
Some programming languages (e.g. Perl) have a built-in parser that can be called an enhanced regular expression parser, because it can parse more complicated grammars than regular.
Here is an implementation using PERL5 enhanced regular expression:
while ($test =~ /\([^\(*.)]+\)/) { $test=~ s/\([^\(*.)]+\)//g; }
This generative grammar describes a (possibly empty) correctly parenthesized expression with two kinds of parentheses.
Terminal symbols:
( ) [ ]
Non terminal symbols and their meaning:
S - start
Rules:
S → "("S")"S S → "["S"]"S S → ε
expression2: [ "(" expression2 ")" expression2 | "[" expression2 "]" expression2 | none ]
Usage:
parse "" expression2 ; == true parse "()" expression2 ; == true parse "[]" expression2 ; == true parse "[(])" expression2 ; == false
Note: Using the equivalence table it is easy to transform the above PARSE rule to a PEG.
Perl’s enhanced regular expression parser can parse even complicated languages. Let’s take an expression:
/(.+)\1+$/
, which describes a text containing a repeated group of characters at its end (one or more characters, repeated two or more times consecutively). Examples:
"my doggyoggy" "has as as as as " "fleasssssss"
It looks pretty simple. How can we do it in Rebol?
PARSE rules may look as follows:
repeated-group?: [ ; is the input non-empty? here: skip :here (group-length: (length? :here) / 2) copy group group-length skip group | ; did not succeed, skip two characters and try again 2 skip repeated-group? ] rules: [ ; skip an odd character, if any opt [skip here: any [2 skip] end :here] repeated-group? ]
Note: these rules are not easy to translate to PEG, since they use Rebol run-time evaluation (notice the parens) and copy the input.
Terminal symbols (using just two to simplify the rules):
a b
Non-terminal symbols:
S - start E - end marker of the first group F - end marker of the second group T - transport marker
Rules:
; the string can start with any number of terminal symbols: S → aS S → bS ; we need two equal non-empty groups at the end S → aEaF S → bEbF ; we can add a terminal symbol to the end of the first group, ; which has to be added to the end of the second group: E → aEaT E → bEbT ; T "transports" terminals towards the end of the second group: aTa → aaT aTb → baT bTa → abT bTb → bbT ; the transport stops at the end of the second group: TF → F ; we can erase group end markers when done: E → ε F → ε
Although somebody may say that the generative grammars are purely descriptive means for language definition, it isn’t always so. In the previous section we have seen that the generative grammar had some “procedural” properties.
What about Rebol Parse dialect? Is it descriptive, procedural or “hybrid”?
At the first sight it looks descriptive, but there is evidence supporting a point of view that it is purely procedural and the descriptive look is just an illusion.
The procedurality of the dialect may be demonstrated by the fact that the choice rule is order-dependent (the [a] construct isn’t generally equivalent to the [b] construct), it isn’t distributive (the b] c] construct isn’t generally equivalent to the [a] construct), it has simple backtracking, rules are greedy, etc.- The Parse Project for more information about Parse and proposals for the future.
- The Parse article in the Rebol programming wikibook.
- The Parse Project for more information about Parse and proposals for the future.
- The Parse article in the Rebol programming wikibook.