Skip to content
R. Bernstein edited this page Nov 27, 2019 · 15 revisions

Continuing in the series of how I fix parse bugs encountered.

The basic idea is to find something that works and compare to see where things go wrong. In this example, the problem is in Python 2.7 so I compare uncompyle6 with uncompyle2 which is similar. In Fixing Issue #283, I compare Python 2.6 bytecode which has bugs, with Python 2.7 which decompiles fine.

Issue 151 ultimately was missing a grammar rule, and another too permissive grammar rule was covering for that. This also happens somewhat in Fixing Issue #283.

Here are the steps I have took to address issue #151.

Step one is get a small test case and compile it. The test case provided in the issue was:

def func(a, b, c):
    while a:
        if b:
            continue
        return False
    return True

To compile it one can use the compile-file.py program in python-uncompyle6/test/stdlib

   $ cd test/stdlib
   $ pyenv 2.7.14
   $ ./compile-file.py /tmp/issue-151.py
compiling /tmp/issue-151.py to /tmp/issue-151-2.7.pyc

Let's get detailed information in decompilation: assembly code, grammar reductions and the AST tree (which will fail below):

   $ ../../bin/uncompyle6 -agT /tmp/issue-151-2.7.pyc > /tmp/issue-151.log 2>&1

It so happens that uncompyle2 decompiles correctly. So let's run that and compare the grammar. The option that corresponds to t in uncompyle2 is --showast:

    $ uncompyle2 --showast /tmp/issue-151-2.7.pyc > /tmp/issue-151-good.log

Looking at the tree for that we see:

      whilestmt

            0	SETUP_LOOP        '26'                              (1)
         testexpr
            testfalse
               expr
                  3	LOAD_FAST         'a'                       (2)
               jmp_false
                  6	POP_JUMP_IF_FALSE '25'                      (3)
         return_stmts
            _stmts
               stmt
                  ifstmt
                     testexpr
                        testfalse
                           expr
                              
                                 9	LOAD_FAST         'b'        (4)
                           jmp_false
                              12	POP_JUMP_IF_FALSE '21'       (5)
                     _ifstmts_jump
                        c_stmts_opt
                           c_stmts
                              _stmts
                                 stmt
                                    continue_stmt
                                       
                                          15	CONTINUE          '3' (6)
                        18	JUMP_FORWARD      '21'                (7)
                        21_0	COME_FROM         '18'                (8)
            return_stmt
               ret_expr
                  expr
                     
                        21	LOAD_GLOBAL       'False'             (9)
               24	RETURN_VALUE      None                       (10)
         25	POP_BLOCK         None
         26_0	COME_FROM         '0'

Here I compared the grammar reductions with the tree that uncompyle2 has. The number in parenthesis I added and are are the token numbers. Unfortunately uncompyle2 doesn't have an option to track grammar reductions, so you can compare this phase with uncompyle6.

In the /tmp/issue-151.log log, I saw:

         3     expr ::= LOAD_FAST (2)
         3     ret_expr ::= expr (2)
         3     assert_expr ::= expr (2)
         6     jmp_false ::= POP_JUMP_IF_FALSE (3)
         3-6   testfalse ::= expr jmp_false (3)
         3     testexpr ::= testfalse (3)

That first LOAD_FAST matches the load fast shown in the tree above. The token number, 2 is given in parenthesis (each token is actually an instruction and the first instruction is SETUP_LOOP here).

So we have expr ::= LOAD_FAST (2) in /tmp/issue-151.log and if you look in the tree you can find expr with a child LOAD_FAST. Note the 3 in both places refers to offset 3. Then a couple of lines down in /tmp/issue-151.log you will see

6 jmp_false ::= POP_JUMP_IF_FALSE (3)

And you will see in the tree jmp_false has a child POP_JUMP_IF_FALSE in the tree. Then after that in /tmp/issue-151.log there is:

3-6 testfalse ::= expr jmp_false (3)

The 3-6 means that this rule spans offset 3 to 6 in bytecode instructions. But token number 3 in parenthesis is the last token that the Earley parse has seen at this point, In particular it has currently scanned SETUP_LOOP, LOAD_FAST, and POP_JUMP_IF_FALSE.

The reduction rules ret_expr and assert_expr don't appear in the tree. That's okay. What's going on is that there are things the grammar considered as possibilities. They just didn't pan out though.

Tracing on like this I finally get in issue-151-bad.log this:

L.  5:  21     expr ::= LOAD_GLOBAL (9)
L.  5:  21     ret_expr ::= expr (9)
L.  5:  21     assert_expr ::= expr (9)
L.  5:  21-24  return ::= ret_expr RETURN_VALUE (10)

this is where things start to go awry...

L.  3:   9-24  returns ::= _stmts return (10)
L.  5:  21     stmt ::= return (10)
         3-24  returns ::= _stmts return (10)

After the returns ::= _stmts return (10) is where we expect to see a rule like whilestmt ::= SETUP_LOOP testexpr return_stmts. In uncompyle2, returns is called return_stmt.

Notice that the returns (or return_stmt as uncompyle2 calls it) that starts on line three spans offsets 3 to 24, ending at token 10, and that is wrong.

To address the grabbiness I changed:

 iflaststmtl ::= testexpr c_stmts_opt

to:

iflaststmtl ::= testexpr c_stmts

Okay, so I guess now I got to explain nonterminal names and what they mean.

First returns. This is from uncompyle6/parser.py:

        # The "returns" nonterminal is a sequence of statements that ends in a RETURN statement.
        # In later Python versions with jump optimization, this can cause JUMPs
        # that would normally appear to be omitted.

        returns ::= return
        returns ::= _stmts return

In uncompyle6 it is called return_stmts. To try to keep nonterminal names closer to the Python AST names change the name to returns. The AST symbol would be Return.

The nonterminal iflaststmtl is an "if" statement which is inside a loop (that's the "l" suffix. I think the "last" part means it is the last statement of the loop).

The nonterminal c_stmts are statements that can contain a continue statement in them. And when there is an _opt suffix it means it is optional.

So it seems unusual that iflastmtl could just contain some sort of test without a body. It's actually more complicated than that because that might not be true for for other things. So this change might cause a bug in the future.

Lastly, we add a new test case for this particular code. Run inside test directory:

$ cp /tmp/issue-151.py simple_source/bug27\+/05_while_if_continue.py

Add some additional some comments to the beginning of 05_while_if_continue.py to describe what's up.

$ git add simple_source/bug27+/05_while_if_continue.py
$ ./add-test.py simple_source/bug27+/05_while_if_continue.py
$ git add -f bytecode_2.7/05_while_if_continue.pyc

Ideally though we need a whole separate process for figuring out control flow