Using operator precedence to parse expressions #132

gvanrossum · 2020-06-06T01:16:34Z

There's an old parsing technique called "operator precedence" which we might be able to use to speed up parsing expressions.

When we parse an expression we go through a lot of levels of the grammar even if no operators of that level are present. E.g. a simple expression like a + b successfully goes through all of the following rules:

disjunction
conjunction
inversion
comparison
bitwise_or
bitwise_xor
bitwise_and
shift_expr
sum (the only one that has a non-trivial expansion in this example!)
term
factor (this is where unary operators get parsed)
power
await_primary
primary
atom

Many of these (esp. bitwise_or through power) use similar actions that call _Py_BinOp(left, op, right, EXTRA).

For just binary operators there's a pretty straightforward algorithm using a stack and a table of operators and precedences that produces the right AST without so many productions.

Maybe we should explore this and see if it makes a substantial difference?

A variant of this would have a single rule

bitwise_or:
    | factor operator bitwise_or { magic action }
    | factor

where the magic action would transform a tree of the form (a, *, (b, +, c)) into ((a, *, b), +, c). Sure, we've wasted some space for a complicated expression, but for a simple one we've parsed it with much less effort. Thoughts?

The text was updated successfully, but these errors were encountered:

gvanrossum · 2020-06-06T04:42:24Z

where the magic action would transform a tree of the form (a, *, (b, +, c)) into ((a, *, b), +, c). Sure, we've wasted some space for a complicated expression, but for a simple one we've parsed it with much less effort. Thoughts?

Oh, that can't work because the AST doesn't record whether there were parentheses forcing the grouping. So it'll have to be more like

bitwise_or:
    | factor (operator factor)* { magic action }

where the magic action must act on an array of (operator factor) pairs. We already have a similar data structure for the comparison operators. (Heh, this would benefit from generics. :-)

We would collapse 6 levels into 1 this way. (To do more would require more intelligence in the magic action, but it would be possible.)

viridia · 2020-06-06T04:52:09Z

Yeah, I did consider the parens issue :)

To avoid allocating too many intermediate arrays, you may be able to use that initial array of operators as the operator stack, using a separate read and write index: the read index scans the array from left to right, and the write index is used to push the combined nodes. In other words, you are using the same piece of memory as both a queue and a stack.

gvanrossum · 2020-06-06T05:12:25Z

In the context of this particular implementation, I don't mind allocating extra memory if I can free it before turning from the magic action. But writing in the array is tricky due to the way the memoization (packrat parsing) is implemented and how we allocate memory for parsing results (using an arena that is freed only when we delete the AST).

We still need to design the metagrammar syntax for indicating operator precedences.

gvanrossum · 2020-06-06T15:32:57Z

I wonder if the generator could recognize the pattern and just do the smart thing automatically, without any markup? The downside is that a small tweak to the grammar might accidentally break the optimization and testing wouldn't find this out.

And I would be happy with implementing an ad-hoc solution, given that we're really only planning to use (this version of) pegen for the Python parser.

pablogsal · 2020-06-06T17:25:24Z

Given that the operator precedence will be a global property of the grammar (I don't see if it would be useful to have local precedence rules) we could declare the precedence itself with some metas at the top of the grammar. Then, we could use some flag in the rule like the one we use for memoization (the "arguments" between parenthesis) to activate the "magic". I would recommend that if we go this way we make it explicit to decrease the cognitive burden of reading the grammar and reason about it.

viridia · 2020-06-06T17:38:56Z

Assuming that the relative precedence of the binary operators will be the same in all parsing contexts, you could certainly define them globally - that's what bison does:

http://web.mit.edu/gnu/doc/html/bison_8.html#SEC73

There are two ways I can think of to declare the precedence: either numerically, or by an ordered list. If it's an ordered list, you need a way of grouping operators of equal precedence. Bison's metagrammar looks like this:

%left '<' '>' '=' NE LE GE
%left '+' '-'
%left '*' '/'

The order in which operators are declared determines the relative precedence, with the last ones declared having the highest. %left and %right determines whether the operator is left-associative or right-associative (this affects whether you use '>' or '>=' when doing stack reductions.)

gvanrossum · 2020-06-06T22:31:48Z

Okay, we could adapt that to pegen's metagrammar like this:

@operators """
left |
left ^
left &
left << >>
left + -
left * / // % @
"""

It's a triple-quoted string, each line being a word left or right followed by one or more operator symbols separated by spaces (we could add quotes around the operators but I think we don't need that, since operators can't contain spaces). I'm not sure yet what to do with unary operators -- Bison's solution feels like a hack. Until I've figured that out I have to stop at the multiplication/division level, because the next level is unary +, - and ~. After that comes Python's only right-associative operator, **.

So now the parser generator has the priorities for those rules. How do we proceed then? Bison's example suggests that we could just write

expr:
    | a=expr '|' b=expr { _Py_BinOp(a, BitOr, b, EXTRA) }
    | a=expr ^ b=expr { _Py_BinOp(a, BitXor, b, EXTRA) }
    ...
    | factor

and let the parser generator handle the choice of which reduction to use when, but that's kind of against the PEG philosophy of always trying alternatives in the order in which they are written. (In Bison IIRC the order is immaterial since it does all this nice CFG stuff.)

So the other way is what I suggested in an earlier comment, where we first collect the whole expression, like this:

expr:
    | a=factor b=(operator factor)+ { _PyPegen_MagicReduce(a, b, EXTRA) }
    | factor

The rest is just a matter of programming.

pablogsal · 2020-06-07T00:53:27Z

I have a working first draft of this in this branch: python/cpython@master...we-like-parsers:operator_precedence

Is this more or less what you have in mind for _PyPegen_MagicReduce?

pablogsal · 2020-06-07T01:03:42Z

Some benchmarks:

MASTER

❯ ./python -m pyperf timeit 'import _peg_parser' '_peg_parser.parse_string("2*3 + 4*5*6\n12 + (2 * 3 * 4 * 5 + 6 + 7 * 8)\n1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + ((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20))))))", ast=False)'
.....................
Mean +- std dev: 27.0 us +- 0.1 us

❯ time ./python Tools/peg_generator/data/xxl.py
./python Tools/peg_generator/data/xxl.py  0.95s user 0.20s system 99% cpu 1.152 total

DRAFT

❯ ./python -m pyperf timeit 'import _peg_parser' '_peg_parser.parse_string("2*3 + 4*5*6\n12 + (2 * 3 * 4 * 5 + 6 + 7 * 8)\n1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + ((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20))))))", ast=False)'
.....................
Mean +- std dev: 25.0 us +- 0.2 us

❯ time ./python Tools/peg_generator/data/xxl.py
./python Tools/peg_generator/data/xxl.py  1.06s user 0.16s system 99% cpu 1.217 total

gvanrossum · 2020-06-07T04:04:17Z

To really see how this compares, try something with several trivial expressions. E.g. I tried [1, 2, 3, 4, 5, 6, 7, 8, 9, 0] and got 13 usec on master vs. 10 usec with your patch. That's impressive! (As is how quickly which you came up with this patch!)

I'll look over the code later (tomorrow?).

gvanrossum · 2020-06-08T05:09:29Z

Now a CPython PR (python#20696) and bpo issue (https://bugs.python.org/issue40902).

gvanrossum · 2020-06-08T05:30:27Z

I think I have an extension to unary operators (surely this isn't new, IIRC I discovered the same several decades ago).

First, the grammar rule becomes essentially

bitwise_or: primary (binop unop* primary)*

but the parsing algorithm treats each operator as a separate reduce opportunity, with a NULL primary inserted in front of each unary operator.

We use the same stack format as before, alternating expressions and (operator, priority) pairs:

expr (op, pri) expr (op, pri) expr (op, pri) expr

When a unary operator is encountered, it is given a separate priority, and we push a NULL expression to its left. For an expression like a+b*-~c*d we would pretend the input was as follows:

a + b * NULL - NULL ~ c * d

Unary operators have separate priorities to the left and to the right -- an incoming unary operator gets a certain priority, and once on the stack the same operator is given a different priority. Both are distinct from the priority of the same operator in a binary position.

The priorities ensure that we never try to reduce e.g. b * NULL -- instead, when we encounter the second *, we reduce NULL ~c to (~c) and then NULL - (~c) to (-(~c)). Then finally we reduce b * ((-(~c)) to (b*(-(~c))) and then we push * and d.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using operator precedence to parse expressions #132

Using operator precedence to parse expressions #132

gvanrossum commented Jun 6, 2020

gvanrossum commented Jun 6, 2020

viridia commented Jun 6, 2020

gvanrossum commented Jun 6, 2020

gvanrossum commented Jun 6, 2020

pablogsal commented Jun 6, 2020

viridia commented Jun 6, 2020 •

edited by gvanrossum

gvanrossum commented Jun 6, 2020

pablogsal commented Jun 7, 2020 •

edited

pablogsal commented Jun 7, 2020 •

edited

gvanrossum commented Jun 7, 2020

gvanrossum commented Jun 8, 2020

gvanrossum commented Jun 8, 2020

viridia commented Jun 8, 2020

viridia commented Jun 8, 2020

gvanrossum commented Jun 8, 2020

gvanrossum commented Jul 25, 2020

Using operator precedence to parse expressions #132

Using operator precedence to parse expressions #132

Comments

gvanrossum commented Jun 6, 2020

gvanrossum commented Jun 6, 2020

viridia commented Jun 6, 2020

gvanrossum commented Jun 6, 2020

gvanrossum commented Jun 6, 2020

pablogsal commented Jun 6, 2020

viridia commented Jun 6, 2020 • edited by gvanrossum

gvanrossum commented Jun 6, 2020

pablogsal commented Jun 7, 2020 • edited

pablogsal commented Jun 7, 2020 • edited

MASTER

DRAFT

gvanrossum commented Jun 7, 2020

gvanrossum commented Jun 8, 2020

gvanrossum commented Jun 8, 2020

viridia commented Jun 8, 2020

viridia commented Jun 8, 2020

gvanrossum commented Jun 8, 2020

gvanrossum commented Jul 25, 2020

viridia commented Jun 6, 2020 •

edited by gvanrossum

pablogsal commented Jun 7, 2020 •

edited

pablogsal commented Jun 7, 2020 •

edited