Skip to content

Adding a tree transformation phase to uncompyle6

rocky edited this page Dec 28, 2021 · 20 revisions

Table of Contents

Adding a tree-transformation phase to uncompyle6

The problem

In "Poor if statement formatting", Issue #178 it was noted that:

if x:
    pass
elif y:
    pass
elif not z:
    pass

is decompiled from Python 2.4 bytecode to:

if x:
    pass
else:
    if y:
        pass
    else:
        if not z:
            pass

Thoughts around fixing and a note about AST's.

My thoughts on how to structure a decompiler are in Decompilation at Runtime and the Design of a Decompiler for Dynamic Interpreted Languages and in that you'll see in Figure 1 that there is an optional phase in between tree building and final text generation a box called "Macro expansion" where you can transform the tree according to your liking. Here,in that phase we would transform the else: if into elsif while adjusting the indentation. Currently uncompyle6 has no such phase.

Alexander Sieusahai has expressed interest in undertaking this. Thanks Alex! So here is a writeup of my current thoughts around this phase.

Alex mentions though "ast", and already there is a possible source of confusion here. First, the tree we produce is "abstract" in the sense that it imposes a structure on top of the instructions. And yes, it definitely is a "tree". But it is not abstract in the other sense that that it is used to mean in that it is compact to the point of having only the essential nodes and removes unnecessary syntax. In the case of the source code that would be semicolons or newlines. In our situation where we start out with an the opcode names of an instruction stream, we keep all the tokens intact.

Also, it is not the same thing as Python's AST. We do however try to keep the nonterminal or tree node names as similar as possible. So while uncompyle2 may have a tree name like designator, in uncompyle6 it is called store to match the corresponding Python AST name Store. Note that in contrast to AST's CamelCase use lowercase snake_case to remind people that this ain't Python's AST. (Are there any volunteers out there to write a translator to convert uncompyle6's tree into Python's AST?)

Design of a Tree-Transformation Phase

Ok, so with all of that stuff out of the way, let's get to work on this!

We basically want to transform one tree into another. This will mean adding another phase to uncompyle6, but to start out let's just add this to the semantics directory, and perhaps later we'll split that out.

Let's look at the tree with this cool recently-added option --tree+ where I include the templating instructions. For the Python code when decompiled from Python 2.4 bytecode you get:

        ifelsestmt (5)
             0. testexpr
                testfalse (2)
                     0. expr
                         L.   1         0  LOAD_NAME             0  'x'
                     1. jmp_false (2)
                         0.                 3  JUMP_IF_FALSE         4  'to 10'
                         1.                 6  POP_TOP
             1. pass
             2. jf_cf_pop (3)
                 0.  L.   2         7  JUMP_FORWARD         23  'to 33'
                 1.              10_0  COME_FROM             3  '3'
                 2.                10  POP_TOP
             3. else_suite
                suite_stmts
                    ifelsestmt (5)
                         0. testexpr
                            testfalse (2)
							...
                         1. pass
                         2. jf_cf_pop (3)
						    ...
                         3. else_suite
                            suite_stmts
                                ifstmt (2)
                                     0. testexpr_then
									 ...
...

Table-driven semantic actions describes how to interpret the cryptic stuff that starts out ('%|if %c

Boiling this down a bit we see the that basic templating rule used is: ifelsestmt: ('%|if %c:\n%+%c%-%|else:\n%+%c%-', 0, 1, 3)

The (5) previously shown after ifelsestmt just means that there are 5 children. But only use 3 of them, at index 0, 1, and 3 respectively, are used. An "Abstract" Syntax tree would have removed children at index 2 and 4.

If you look at uncompyle6/semantics/consts.py around line 264 you'll see

    'ifelifstmt':	( '%|if %c:\n%+%c%-%c', 0, 1, 3 ),

so this is what we want to get used instead. So what needs to get done is to write a rule. Such a rule would indicate that if we see a tree pattern like:

ifelsestmt [testexpr testfalse . else_suite [suite_stmts [stmt [ifelsestmt]]] . ]

transform that to:

ifelifstmt [testexpr testfalse . ifelsestmt]

And this is what was done. The tree transformation phas is in uncompyle6/semantics/transform.py and the method to consult is n_ifelsestmt.

A little about the jargon used above to direct tree transformation. Something like ifelsestmt represents a AST object defined in uncompyle6/parsers/astnode. (And yes, I see that name is poorly named. I will have to change that sometime.) The "kind" field of the object would change from ifelsestmt to ifelifstmt. The brackets indicate that what is enclosed in the brackets are children of the tree node that precede it. And the dot means that you can match any AST object.

I'm sure code to do this kind of thing has been written many times, especially in the LISP community. For example Emacs Lisp has a pcase pattern matching macro. The thing can be done stupidly by iterating over the entire tree for each kind of transform. More clever though is to iterate once and try the set of patterns on each node. In matching a single rule you would have to consider one or more descendants of a node, so each node might get visited more than once in this kind of code. The most clever approach is similar to how the spark parsin system works: traverse the tree, maintaining a list of tree-pattern matches and a position in each match rule. At each stage in the tree walk there is a set rules each augmented by a cursor or "dot". The dot rule advances if it matches the name after the dot. Otherwise the rule is discarded. When the dot is at the end of the rule, there is a match. This way, each node in the tree is scanned exactly once.

Other places this can be used

The next most glaring situation is where uncompyle6 produces

if not expr:
   RaiseAssertion ...

instead of:

assert expr

Finally I'll point out that there already is a bit of tree hackery inside uncompyle/semantics/pysource.py to remove an assignment to __doc__ from the tree, (see around line 2121) which has:

   if do_doc and self.hide_internal:

or extraneous return statements from the end of the main program, the end of the module, or the end of some functions; See function is_return_none() in pysource.py.

If we had the better tree transformation phase and the rules in a nice dictionary, all of that code would be much cleaner.