Skip to content

Parse versus RE

Christopher Ross-Gill edited this page Aug 1, 2016 · 1 revision

Author: Ladislav Mecir

There have been numerous discussions on the Rebol mailing list covering this subject. I decided to summarize some of the information to have at hand for future use.

Table of Contents

Regular expressions

Regular expressions are designed to describe regular grammars, i.e. a regular grammar is a grammar that can be described using a regular expression.

Example:

a*bc*

There is another definition: A regular grammar is a grammar that can be described by a generative grammar having every rule of the kind:

X → wY

or

X → w

, where X and Y are nonterminal symbols, while w is a word containing only terminal symbols. That is why regular grammars are called "right linear grammars" too.

The above example grammar can be translated to:

S → aS
S → bC
C → cC
C → ε

Every regular expression can be easily transliterated to obtain the corresponding right-linear grammar as well as the corresponding parse expression.

Warning: Don’t be confused by notions like "Perl’s regular expressions" or "Python’s regular expressions" etc. These "regular expressions" actually are enhanced regular expressions. They use generalizations of regular expressions, i.e. they aren’t regular expressions in the strict meaning of the notion. As an illustration see the example of a context dependent grammar below using Perl’s regular expression. On the other hand, it is straightforward to describe any regular grammar using Perl’s regular expression.

Parsing expression grammars and parse expressions

For the definition of PEGs see the PEG Wikipedia page.

To notice that the parse expressions and PEGs are related it suffices to find out that the constructs they use are equivalent:

Name PEG expression Equivalent parse expression
Trivial success ε none
Sequence e1 e2 e1 e2
Ordered choice e1 / e2 e1 | e2
Zero or more e* any e
One or more e+ some e
Optional e? opt e
And-predicate &e [e (cont: none) | (cont: [end skip])] [end skip] | cont
and e
Not-predicate !e [e (cont: [end skip]) | (cont: none)] cont
not e
Notes:
  • a much simpler recursion-unsafe equivalent of the PEG and predicate is:
here: e :here
  • In case of the and and not predicates, the first variant works in R2 as well as in R3, the second variant is available only in R3.

Context free grammar

Let’s see a simple example of a context free grammar. This is a generative grammar that describes a (possibly empty) correctly parenthesized expression:

Terminal symbols:

(
)

Non terminal symbols and their meaning:

S - start

Rules:

S → "("S")"S
S → ε

A PEG description of such a grammar is:

S ← "("S")"S/ε

, or, alternatively:

S ← ("("S")")*

Purely recursive Parse dialect description

expression: ["(" expression ")" expression | none]

Note: This is equivalent to the first PEG rule above.

Usage:

parse/all "()" expression ; == true

A Parse dialect description using the ANY keyword

expression: [any ["(" expression ")"]]

Note: This is equivalent to the second PEG rule above.

A description using regular expression

This grammar is not a regular grammar, therefore it cannot be described using a regular expression.

Some programming languages (e.g. Perl) have a built-in parser that can be called an enhanced regular expression parser, because it can parse more complicated grammars than regular.

Here is an implementation using PERL5 enhanced regular expression:

while ($test =~ /\([^\(*.)]+\)/) {
    $test=~ s/\([^\(*.)]+\)//g;
}

More complicated context free grammar

This generative grammar describes a (possibly empty) correctly parenthesized expression with two kinds of parentheses.

Terminal symbols:

(
)
[
]

Non terminal symbols and their meaning:

S - start

Rules:

S → "("S")"S
S → "["S"]"S
S → ε

PARSE dialect

expression2: [
    "(" expression2 ")" expression2
    | "[" expression2 "]" expression2
    | none
]

Usage:

parse "" expression2 ; == true
parse "()" expression2 ; == true
parse "[]" expression2 ; == true
parse "[(])" expression2 ; == false

Note: Using the equivalence table it is easy to transform the above PARSE rule to a PEG.

An unrestricted grammar

Perl’s enhanced regular expression parser can parse even complicated languages. Let’s take an expression:

/(.+)\1+$/

, which describes a text containing a repeated group of characters at its end (one or more characters, repeated two or more times consecutively). Examples:

"my doggyoggy"

"has as as as as "

"fleasssssss"

It looks pretty simple. How can we do it in Rebol?

PARSE rules

PARSE rules may look as follows:

repeated-group?: [
    ; is the input non-empty?
    here: skip :here
    (group-length: (length? :here) / 2)
    copy group group-length skip
    group
    | ; did not succeed, skip two characters and try again
    2 skip repeated-group?
]

rules: [
    ; skip an odd character, if any
    opt [skip here: any [2 skip] end :here]
    repeated-group?
]

Note: these rules are not easy to translate to PEG, since they use Rebol run-time evaluation (notice the parens) and copy the input.

Generative grammar

Terminal symbols (using just two to simplify the rules):

a
b

Non-terminal symbols:

S - start
E - end marker of the first group
F - end marker of the second group
T - transport marker

Rules:

; the string can start with any number of terminal symbols:
S → aS
S → bS
; we need two equal non-empty groups at the end
S → aEaF
S → bEbF
; we can add a terminal symbol to the end of the first group,
; which has to be added to the end of the second group:
E → aEaT
E → bEbT
; T "transports" terminals towards the end of the second group:
aTa → aaT
aTb → baT
bTa → abT
bTb → bbT
; the transport stops at the end of the second group:
TF → F
; we can erase group end markers when done:
E → ε
F → ε

Summary

Although somebody may say that the generative grammars are purely descriptive means for language definition, it isn’t always so. In the previous section we have seen that the generative grammar had some “procedural” properties.

What about Rebol Parse dialect? Is it descriptive, procedural or “hybrid”?

At the first sight it looks descriptive, but there is evidence supporting a point of view that it is purely procedural and the descriptive look is just an illusion.

The procedurality of the dialect may be demonstrated by the fact that the choice rule is order-dependent (the [a] construct isn’t generally equivalent to the [b] construct), it isn’t distributive (the b] c] construct isn’t generally equivalent to the [a] construct), it has simple backtracking, rules are greedy, etc.

See also

garbageb] c] construct isn’t generally equivalent to the [a] construct), it has simple backtracking, rules are greedy, etc.

See also

Clone this wiki locally