Skip to content

Latest commit

 

History

History
194 lines (133 loc) · 7.17 KB

README.md

File metadata and controls

194 lines (133 loc) · 7.17 KB

PHP PEG - A PEG compiler for parsing text in PHP

This is a Paring Expression Grammar compiler for PHP. PEG parsers are an alternative to other CFG grammars that includes both tokenization and lexing in a single top down grammar. For a basic overview of the subject, see http://en.wikipedia.org/wiki/Parsing_expression_grammar

Quick start

  • Write a parser. A parser is a PHP class with a grammar contained within it in a special syntax. The filetype is .peg.inc. See the examples directory.
  • Compile the parser. php ./cli.php ExampleParser.peg.inc > ExampleParser.php
  • Use the parser (you can also include code to do this in the input parser - again see the examples directory):

	$x = new ExampleParser( 'string to parse' ) ;
	$res = $x->match_Expr() ;

Parser Format

Parsers are contained within a PHP file, in a special comment block that starts with /*Parser:NameOfParser and continues until the comment is closed. During compilation this block will be replaced with a set of matching functions.

Lexically, the parser is a name token, a matching rule and a set of functions. The name token must not start with whitespace, contain no whitespace and end with a : character. The rule and function set are on the same line or on the indented lines below.

Rules

PEG matching rules try to follow standard PEG format, summarised thusly:

	token* - Token is optionally repeated
	token+ - Token is repeated at least one
	token? - Token is optionally present

	tokena tokenb - Token tokenb follows tokena, both of which are present
	tokena | tokenb - One of tokena or tokenb are present, prefering tokena

	&token - Token is present next (but not consumed by parse)
	!token - Token is not present next (but not consumed by parse)

 	( expression ) - Grouping for priority

But with these extensions:

	< or > - Optionally match whitespace
	[ or ] - Require some whitespace

Tokens

Tokens may be

  • bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar,
  • literals, surrounded by " or ' quote pairs. No escaping support is provided in literals.
  • regexs, surrounded by / pairs.
  • expressions - single words (match \w+) starting with $ or more complex surrounded by ${ } which call a user defined function to perform the match
Regular expression tokens

Automatically anchored to the current string start - do not include a string start anchor (^) anywhere. Can specify flags on stand-alone regexs. Currently doesn't handle flags on regexs with rules.

Expressions

Expressions allow run-time calculated matching. You can embed an expression within a literal or regex token to match against a calculated value, or simply specify the expression as a token to (optionally) internally handle matching and generate a result.

Expressions will try a variety of scopes to find a value. It will look for variables already set in the current result, rule-attached functions and a variety of other functions and constants.

Tried in this order

  • against current result
  • against containing expression stack in order (for sub-expressions only)
    • against parser instance as variable
    • against parser instance as rule-attached method INCLUDING $ ( i.e. function $foo() )
    • against parser instance as method INCLUDING $
    • as global method
  • as constant
Tricks and traps

Be careful against matching against results


	quoted_good: q:/['"]/ string "$q"
	quoted_bad:  q:/['"]/ string $q

"$q" matches against the value of q again. $q simply returns the value of q, without doing any matching

Named matching rules

Tokens and groups can be given names by prepending name and :, e.g.,


	rulea: "'" name:( tokena tokenb )* "'"

There must be no space betweeen the name and the :


	badrule: "'" name : ( tokena tokenb )* "'"

Recursive matchers can be given a name the same as their rule name by prepending with just a :. These next two rules are equivilent


	rulea: tokena tokenb:tokenb
	rulea: tokena :tokenb

Rule-attached functions

Each rule can have a set of functions attached to it. These functions can be defined

  • in-grammar by indenting the function body after the rule
  • in-class after close of grammar comment by defining a regular method who's name is {$rulename}_{$functionname}, or {$rulename}{$functionname} if function name starts with _
  • in a sub class

All functions that are not in-grammar must have PHP compatible names (see PHP name mapping). In-grammar functions will have their names converted if needed.

All these definitions define the same rule-attached function


	class A extends Parser {
	/**Parser
	foo: bar baz
	  function bar() {}
	* /

	  function foo_bar() {}
	}

	class B extends A {
	  function foo_bar() {}
	}

PHP name mapping

Rules in the grammar map to php functions named match_{$rulename}. However rule names can contain characters that php functions can't. These characters are remapped:


	'-' => '_'
	'$' => 'DLR'
	'*' => 'STR'

Other dis-allowed characters are removed.

Results

Results are a tree of nested arrays.

Without any specific control, each rules result will just be the text it matched against in a ['text'] member. This member must always exist.

Marking a subexpression, literal, regex or recursive match with a name (see Named matching rules) will insert a member into the result array named that name. If there is only one match it will be a single result array. If there is more than one match it will be an array of arrays.

You can override result storing by specifying a rule-attached function with the given name. It will be called with a reference to the current result array and the sub-match - in this case the default storage action will not occur.

If you specify a rule-attached function for a recursive match, you do not need to name that token at all - it will be call automatically. E.g.


	rulea: tokena tokenb
	  function tokenb ( &$res, $sub ) { print 'Will be called, even though tokenb is not named or marked with a :' ; }

You can also specify a rule-attached function called *, which will be called with every recursive match made


	rulea: tokena tokenb
	  function * ( &$res, $sub ) { print 'Will be called for both tokena and tokenb' ; }

Silent matches

By default all matches are added to the 'text' property of a result. By prepending a member with . that match will not be added to the ['text'] member. This doesn't affect the other result properties that named rules' add.

TODO

  • Allow configuration of whitespace - specify what matches, and wether it should be injected into results as-is, collapsed, or not at all
  • Allow inline-ing of rules into other rules for speed
  • More optimisation
  • Make Parser-parser be self-generated, instead of a bad hand rolled parser like it is now.
  • Slighly more powerfull expressions: ${parent.q}, ${foo()->bar}, etc.
  • Need to properly escape all literals. Expressions currently need to be in '', not ""
  • PHP token parser, and other token streams, instead of strings only like now