Table‐driven AST‐to‐Python Translation

Introduction

A number of semantic-action rules are specified via pattern-matching rules in dictionaries. This greatly reduces the need to write methods for semantic actions that are very similar. Furthermore, if you know the codes, you can get a more compact description of the actions to be done.

See Writing Semantic-action Rules for a general overview of semantic actions. Here we focus on how we have customized the framework for deparsing. Specifically, we describe the kinds of tables used and how the table entries direct what to output and how to traverse the AST tree.

The instance method gen_source() is the top-level routine for walking the AST produced by parsing. But the instance method default() is the routine that is called for each node when walking the AST. It introspects or searches the class it is in for nonterminal instance methods and some tables to determine what to do. Instance method definitions that start with n_ and end in the nonterminal name are considered first. When such a method isn't found, the tables are then consulted.
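
A minimal sketch of that dispatch, following the description above (lookup_table() is a hypothetical stand-in for however the class chooses between its tables; this is not the literal source):

    # Sketch only: follows the description above, not the literal source.
    # lookup_table() is a hypothetical stand-in for the table selection.
    def default(self, node):
        key = node.kind                        # terminal/nonterminal name
        method = getattr(self, "n_" + key, None)
        if method is not None:                 # a custom n_<name> method wins
            method(node)
            return
        table = self.lookup_table(key)         # e.g. TABLE_DIRECT or TABLE_R
        if key in table:
            self.template_engine(table[key], node)
            self.prune()                       # this subtree is done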

When a table entry is found using the nonterminal name as the key, that entry and the node are passed as parameters to the template_engine() method. Entries in tables are tuples: the first entry is a string, often with %-escaped format specifiers; after that come the arguments to be used by the format specifiers. The number and types of the arguments after the first one are determined by the format string. This is similar to how C and other languages handle formatted output.

Some examples of valid format rules are given below. The only things you need to know here are that

  • %c requires one integer index argument or a tuple of an integer index and a string nonterminal name,
  • %% takes no argument, and
  • %p requires a tuple of two integers, or a tuple of an integer index, a string nonterminal name, and another integer.
    'INPLACE_MODULO':	( '%%=',),
    'INPLACE_OR':	( '|=' ,),
    'binary_expr':  ( '%c %c %c', 0,
                    (-1, 'binary_op'),
                    ( 1, 'expr' ) ),
    'unary_not':	( 'not %c', (0, 'expr') ),
    'slice1':		( '%c[%p:]', (0, 'expr'), (1, 100) ),

Here are some invalid rules:

    'INPLACE_MODULO': ( '%=',),     # there is no %= specifier
    'INPLACE_OR':	  ( '|=' , 0),  # no argument has been called for in the string
    'binary_expr':	  ( '%c %c %c', 0, -1 ), # not enough arguments
    'unary_not':	  ( 'not %c', 'a' ),     # expecting an integer argument
    'slice1':		  ( '%c[%p:]', 0, 1 ),   # %p expects a tuple

In some cases, an instance method may do some setup and then call another nonterminal method or call template_engine() directly.

Aside from default() and template_engine(), another routine that is called a lot is prune(). This is called when processing for a node is finished. It raises a "prune" exception which may percolate up several calls.

Here is an example of that kind of situation. There is a method for the async_call_function nonterminal called n_async_call_function(). It simply needs to prepend the word "async" at the front of a call. After outputting "async" it changes the type of its node to call_function and then calls template_engine() with this modified node. Since template_engine() has already made a pass over the node, after template_engine() finishes, self.prune() is called to ensure another pass over the node doesn't happen.
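
As a rough sketch of that pattern (the write() helper and the exact nonterminal names here are assumptions, not the literal method from the source):

    # Rough sketch of the pattern described above; write() and the exact
    # node kinds are assumptions rather than the literal source code.
    def n_async_call_function(self, node):
        self.write("async ")                 # prepend the keyword
        node.kind = "call_function"          # reuse the ordinary call's rule
        self.template_engine(TABLE_DIRECT["call_function"], node)
        self.prune()                         # prevent a second pass over node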

Mapping Parse Tree Nodes to a Format Table

When we don't have a custom method for a nonterminal, we look up the action to perform on the nonterminal via one of two tables.

The format of the rules in each table entry is the same in both tables. What differs is where the AST node is located with respect to the nonterminal name we look up.

Below, "K" is the nonterminal key name we lookup up, and "N" is the node where that rule should be applied to.

The tables and how they apply to the nonterminal key are:


          N&K               N       
         / | ... \        / | ... \ 
        O  O      O      O  O      K
                                    
                                    
      TABLE_DIRECT      TABLE_R

The dictionary values in the tables are format specifiers, which are described in a later section. But before that we describe the table types in more detail and give examples of each.

TABLE_DIRECT

By far, the most used table is TABLE_DIRECT.

Here both the nonterminal key name to look up and the node to apply that rule to are in the same place: at the parent node.

The degenerate case is where a grammar item has no children. In other words, it is an instruction that has text associated with it that needs to be output.

Examples include instructions like BINARY_ADD (+), INPLACE_ADD (+=), or UNARY_POSITIVE (+) .

What about instructions that sometimes don't have text associated with them, like POP_TOP? They simply don't appear in a table, nor are they referenced inside a template for any rule they appear in.

A common situation you may encounter when adding a new instruction is that text that should be associated with it doesn't show up in the output. Typically that is because the instruction needs to be added to TABLE_DIRECT.

An example of a true nonterminal that has a child is unary_not.

A fragment of an AST for unary_not would look like this:

    unary_not (2)
       0. expr
          L.   1       0  LOAD_NAME                'a'

and the TABLE_DIRECT entry for this would look like:

    'unary_not':	( 'not %c', 0 ),

TABLE_R

The nonterminal key appears as the last child of its siblings, while the node to apply the rule is in its parent.

The reason this kind of thing happens, and why we are interested in the last child, is that instructions are in reverse Polish order: the operands come first, followed by the operator. And here we are interested in the operator.

An example of something that uses this is del_stmt. The AST fragment for del sys.a is:

    del_stmt (2)
        0. expr
          L.   1       0  LOAD_NAME                'sys'
        1.             2  DELETE_ATTR              'a'

And the TABLE_R entry for this is

 'DELETE_ATTR': ( '%|del %c.%[-1]{pattr}\n', 0 )

However, we need to record in del_stmt that we are using the TABLE_R pattern. The way that is done is by adding an entry to MAP like this:

   MAP = {
      'del_stmt': MAP_R, # ...
   }
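
In the source, MAP_R itself is just a small constant telling the engine which table to consult and which child supplies the lookup key. A hedged sketch of those constants (the exact values may differ between versions):

    # Hedged sketch: each MAP value names the table to use and, optionally,
    # the child index whose name serves as the lookup key.
    MAP_DIRECT = ('TABLE_DIRECT', )     # key and node are the parent itself
    MAP_R      = ('TABLE_R', -1)        # key is taken from the last child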

Something similar goes on for storing attributes, as found in the x.a of x.a = 5; those are grouped under designator.

The current Python grammar doesn't use this very much.

Format specifiers

In this section we describe the meaning of specific format specifiers like %c or %|.

The % may optionally be followed by a number in square brackets, which makes the engine walk down to that node before evaluating the escape code. A few TABLE_R entries make use of this. For example:


    'STORE_ATTR':	( '%c.%[1]{pattr}', 0),
    'DELETE_ATTR':	( '%|del %c.%[-1]{pattr}\n', 0 ),

%c

This is probably the most common specifier. Its argument is either a single integer representing a node index, or a tuple whose first entry is an integer node index and whose second entry is a string: the node's 'kind' field (the terminal/nonterminal name) that the node is asserted to be. That is, if a string is there, it is checked against the node.

The format specifier indicates that the node and its children should be descended into. For example, for the unary expression -val we might have the AST fragment:

      unary_expr (2)
        0. expr
           L.   2       4  LOAD_NAME                'val'
        1. unary_op
                        6  UNARY_NEGATIVE

The template rule below for unary_expr indicates that we should descend into the unary_op subtree (the second child) first, and then into expr:

    'unary_expr':   ( '%c%c', 1, 0 ),

There is just one grammar rule for unary_expr, and it is:

	unary_expr ::= expr unary_op

Therefore, a more stringent template format specifier can be used:

    'unary_expr': ( '%c%c',
                    (1, 'unary_op'),
                    (0, 'expr') ),

The above specifier documents that node 0 is an 'expr' node, while node 1 is a 'unary_op' node. Not only is this documented in the code, but the template engine will also check this at runtime when encountering this template. Sure, the assert check slows things down a little, but not noticeably, and I think there is great benefit here.
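
Conceptually the runtime check is just an assertion against the child's kind; something along these lines (a sketch, not the literal template_engine() code):

    # Sketch of the check implied by an (index, 'name') argument to %c;
    # not the literal template_engine() source.
    def check_child(node, index, expected):
        assert node[index].kind == expected, (
            "child %d of %s: expected %r, got %r"
            % (index, node.kind, expected, node[index].kind)
        )

    # check_child(unary_expr_node, 1, 'unary_op')   # from the entry above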

%p

Like %c, but it also sets a default operator precedence for all the nodes under the one specified. Its argument is then a tuple of the node index and an integer precedence value, or a triple of the node index, the node name, and a precedence.

Operator precedence can change output by requiring grouping symbols or changing where they appear.

For example, in arithmetic expressions it determines whether we need parenthesis or not.

Consider this tree:

    binary_expr (3)
      0. expr
         binary_expr (3)
           0. expr
             L.   2       4  LOAD_NAME                'x'
           1. expr
                          6  LOAD_NAME                'y'
           2. binary_op
                         8  BINARY_MULTIPLY
      1. expr
                10  LOAD_NAME                'z'
      2. binary_op
                12  BINARY_ADD

The tree is unambiguous that "x" and "y" should be multiplied before "z" is added. That also matches Python's default, or non-parenthesized, operator precedence.

When the precedence value of BINARY_MULTIPLY is less than that of BINARY_ADD, as it is in the code, we are indicating that BINARY_MULTIPLY, in the absence of parentheses, groups first and closest. So the output for this portion will be x * y + z and no parentheses are needed.

However, suppose we instead set the precedence of BINARY_MULTIPLY to a larger value than BINARY_ADD. In this case the code will think that the meaning of x * y + z is by default equivalent to x * (y + z), but the tree indicates that multiplication must be done first. So in the output the code will add explicit parentheses, giving (x * y) + z.

The above is all nice and good, but alas it doesn't explain how %p is used specifically in the code. The above is all handled by setting values for BINARY_ADD and BINARY_MULTIPLY in the PRECEDENCE dictionary.
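
The entries in question look roughly like this (the numbers are illustrative only; the real values live in the PRECEDENCE dictionary in the source):

    # Illustrative values only -- a lower number means the operator groups
    # first ("binds tighter") when no parentheses are present.
    PRECEDENCE = {
        'BINARY_MULTIPLY': 24,
        'BINARY_ADD':      26,
        # ...
    }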

So let me describe how %p is typically used in the code. Often it is used to set a node's precedence to a very high value, which ensures that the node groups last and grabs the most. For example, in slices like a[5:] we have the rule:

    'slice1': ( '%c[%p:]', 0, (1, 100) ),

I am not sure this is strictly needed. Perhaps someone can come up with a better example and description.

%C (capital C)

This is used for list-like nodes that can have an arbitrary number of similar children. It needs a tuple of 3 items: a starting node index, the maximum ending node index, and a string to be inserted between sibling children. Consider the Python statement:

    a, b, c = (1, 2, 3)

The AST fragment for this is:

    assign (2)
       0. expr
          L.   1       0  LOAD_CONST               (1, 2, 3)
       1. designator
          unpack (4)
              0.              2  UNPACK_SEQUENCE_3     3
              1. designator
                              4  STORE_NAME               'a'
              2. designator
                              6  STORE_NAME               'b'
              3. designator
                              8  STORE_NAME               'c'

Notice that here we want ", " added between "a" and "b", and between "b" and "c".

The TABLE_DIRECT entry for unpack then is:

  'unpack':		( '%C%,', (1, maxint, ', ') )

This indicates that we skip the first node, UNPACK_SEQUENCE_3. To indicate that we want all remaining items, we give a high number, maxint, which is defined in the program. And finally we indicate that terms are separated with ", ".
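
Mechanically, %C just walks that slice of siblings and joins the rendered pieces with the separator. A hedged sketch of what its argument drives (render() stands in for whatever the engine actually calls to produce text for a child node):

    from sys import maxsize as maxint   # the program defines its own maxint

    # Hedged sketch of what a %C argument drives; render() stands in for
    # the engine's actual child-rendering call.
    def expand_C(node, arg, render):
        start, end, sep = arg            # e.g. (1, maxint, ', ')
        return sep.join(render(child) for child in node[start:end])

    # expand_C(unpack_node, (1, maxint, ', '), render)  ->  "a, b, c"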

Notice there is also %, described next.

%,

Append a comma (,) if the last %C only printed one item. This is mostly for tuples on the left-hand side of an assignment statement since BUILD_TUPLE_n pretty-prints other tuples. This specifier takes no arguments.
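
For example, a single-name unpack needs that trailing comma to remain a tuple target:

    a, = (1,)    # %C prints just "a"; %, appends the comma so this round-trips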

%P

Similar to %C, but you also include an additional precedence value, just as you do when going from %c to %p. As with %C, I don't have a good explanation yet of why it is needed. But here is an example from the code:

    'build_tuple2':	( '%P', (0, -1, ', ', 100) )

Note that index -1 is used to indicate the last item in the list.

%D

Same as %C, but this is for left-recursive lists like kwargs whose recursion goes to epsilon at the beginning. If we were to use %C, an extra separator for the epsilon would appear at the beginning.

Example:

    'kwargs':    	( '%D', (0, maxint, ', ') )

%|

Insert spaces to the current indentation level. Takes no arguments.

%+

Increase current indentation level. Takes no arguments.

%-

Decrease current indentation level. Takes no arguments.

%{ expr }

Runs Python eval(expr) in the context of the current parent node. This specifier takes no arguments.

The most common and simplest situation is to pick out the printable attribute field of an instruction. For example, in Python 2.7 for from os import path as p, the syntax tree fragment generated is:

    alias (2)
        0.                   9  IMPORT_FROM           1  'path'
        1. store
                            12  STORE_NAME            2  'p'

Here we want to pull out the printable attribute of IMPORT_FROM and STORE_NAME. So the TABLE_DIRECT entries are:


    'IMPORT_FROM':	( '%{pattr}', ),
    'STORE_NAME':	( '%{pattr}', ),

The resulting strings for the above are path and p, while the alias pattern joins these together to get path as p.

A more involved case is in Python 3.6 format specifiers. The AST fragment for f'{abc!s}' is:

                  fstring_single (2)
                         0. expr
                            L.   1       0  LOAD_NAME                'abc'
                         1.              2  FORMAT_VALUE          1  '!s'

The TABLE_DIRECT entry for fstring_single is:

    'fstring_single': ( "f'{%c%{conversion}}'", 0),

There is also an n_fstring_method() which gets called first. That stores into the conversion attribute.

%[n]{ expr }

Runs Python eval(expr) in the context of the current node[n]. This specifier takes no arguments.

This is similar to the above, but instead of working on the parent node, we work on one of the child nodes.

Example:

Here is the table-driven fragment for the attribute grammar rule:

  'attribute': (
     '%c.%[1]{pattr}',
     (0, 'expr')
  ),

So when we see this:

   attribute (2): ('%c.%[1]{pattr}', (0, 'expr'))
        0. expr
           L.  40        70  LOAD_FAST                'self'
        1.               72  LOAD_ATTR                dry_run

node[1].pattr is evaluated giving dry_run. The entire rule produces self.dry_run.

%[n]{%x}

Evaluates/recurses on child n using specifier %x, where x can be any one of the above format specifiers, e.g. %c, %p, etc. The arguments are the same as for the specifier used.

Example:

For an async for pattern rule we have

   list_afor (2): (
     " async for %[1]{%c} in %c%[1]{%c}",
	 (1, "store"),
	 (0, "get_aiter"),
	 (3, "list_iter"))

When we see async for i in f this might get parsed in Python 3.7 as:

     list_afor (2): (' async for %[1]{%c} in %c%[1]{%c}', (1, 'store'), (0, 'get_aiter'), (3, 'list_iter'))
        0. get_aiter (2)
            0. expr
			   ...
            1.                26  GET_AITER
        1. list_afor2 (11)
            0. func_async_prefix (5)
			    ...
            1. store
                              36  STORE_FAST               'i'
            2. func_async_middle (9)
			   ...
            3. list_iter: ('%c', 0)

We want to pick out node[1][1] which should be a "store" rule, and node[1][3] which should be a list_iter rule. The specifier %c is applied to both of these children.

%%

Write a %. Takes no arguments. It is used, for example, when we want to output the in-place modulo operator %=.

Additional Fragment Format Specifiers

When we are handling fragment deparsing, all of the above format specifiers are used, but we have additional work to do in associating instruction offsets with nodes. To accomplish this, additional format specifiers are needed.

%x

This specifier takes a tuple: the first item is a node index (the source node), and the second is another tuple of node indices indicating all of the nodes that the information from the source node should be propagated to.

For example in:

    'importstmt': ( '%|import %c%x\n', 2, (2,(0,1)), ),

information from node 2 is propagated to nodes 0 and 1.

%r

%r recursively associates location information with the string that follows.

For example in:

   'break_stmt':	( '%|%rbreak\n', ),

The node will be associated with the text break, excluding the trailing newline.

Note that we normally associate the accumulated text with the node anyway; we just don't do it recursively, which is where offsets are probably located.

%b

%b associates the text from the previous start node up to the current point.

For example in:

  'importmultiple':   ( '%|import%b %c%c\n', 0, 2, 3 ),

The node at position 0 will be associated with the text "import".