R. Bernstein edited this page Dec 25, 2015 · 2 revisions

If you've read What's up?, you'll have some sense of the (to my mind, sad and) sorry state of affairs among the forks. So why start this project rather than contribute to one of those other (probably defunct) projects?

Well, I'm sorry to say that my focus is a bit different still. In Deparsing technology and its use in exact location reporting I describe a little of my use case: using deparsing to give a more precise notion of where you are in a program when you are stopped at a given bytecode offset.

For the rest, let me describe how this changes the code a little bit.

First, there is the style of Python used to express the bytecode. For decompilation you probably want to leave off extraneous verbiage (except when it is explicitly requested), while for location reporting you probably want it the other way: include the verbiage unless suppressing it was requested.

Consider the bytecode for a module docstring:

0 LOAD_CONST  0 ("this is my docstring")
3 STORE_NAME  0 (__doc__)

A very straightforward translation of the way a docstring is set for a module would be:

__doc__ = "this is my docstring"

But instead what was probably written was:

"""this is my docstring"""

Why this matters is that if I ask what's at offset 0, the answer should, I think, be:

__doc__ = "this is my docstring"
          ^^^^^^^^^^^^^^^^^^^^^^

and for offset 3:

__doc__ = "this is my docstring"
^^^^^^^
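
One way to picture this is as a table from bytecode offset to a span of the verbose text. The following is only a minimal sketch, not code from uncompyle: the names `OFFSET_SPANS` and `show_offset` are hypothetical, and the spans are written by hand for the docstring example rather than computed by a deparser.

```python
# Hypothetical illustration: map bytecode offsets to (start, end) character
# spans in the verbose deparsed text. Spans are hand-written for this example.
TEXT = '__doc__ = "this is my docstring"'

OFFSET_SPANS = {
    0: (10, len(TEXT)),  # LOAD_CONST: the string literal
    3: (0, 7),           # STORE_NAME: the name __doc__
}


def show_offset(offset):
    """Return the text plus a caret line underlining the span for offset."""
    start, end = OFFSET_SPANS[offset]
    return TEXT + "\n" + " " * start + "^" * (end - start)


if __name__ == "__main__":
    print(show_offset(0))
    print(show_offset(3))
```

The point of keeping the verbose `__doc__ = ...` form is that both offsets then have something distinct to underline.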

Other examples include adding "return" statements and setting class variables like:

__module__ = __name__

inside classes.

Also, when you are only building up a text representation, at some point you can throw out the parser information and the information about how fragments of text are associated with tree nodes and terminal symbols. Note that some of the terminal symbols have the bytecode offsets that we want to be able to index on.

So I need to save more of the structure of the parsed tree, at least around the point of context. But also some of the deparsing rules need to be augmented.

Consider the CPython bytecode for:

for i in range(2):
    pass

The CPython bytecode for this is:

  1        0 SETUP_LOOP             20 (to 23)
           3 LOAD_NAME               0 (range)
           6 LOAD_CONST              0 (2)
           9 CALL_FUNCTION           1 (1 positional, 0 keyword pair)
          12 GET_ITER
      >>  13 FOR_ITER                6 (to 22)
          16 STORE_NAME              1 (i)

  2       19 JUMP_ABSOLUTE          13
      >>  22 POP_BLOCK
      >>  23 LOAD_CONST              1 (None)
          26 RETURN_VALUE

To create Python text from the bytecode instructions above, note that there is an instruction that handles initializing i; other instructions handle looping over i. Specifically, offset 0 is involved in handling the for-loop initialization. Offsets 13 and 16 handle the loop variable update.
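
If you want to look at the offsets on your own interpreter, the standard `dis` module will list them. This is just a sketch for exploration; the helper name `offsets_by_opname` is mine. Opcode names and offsets differ between CPython versions (newer versions no longer emit SETUP_LOOP, for instance), so expect output that differs from the listing above.

```python
import dis

SOURCE = "for i in range(2):\n    pass\n"

# Compile as a module, as in the listing above.
code = compile(SOURCE, "<example>", "exec")


def offsets_by_opname(code_obj):
    """Group instruction offsets by opcode name for one code object."""
    table = {}
    for inst in dis.get_instructions(code_obj):
        table.setdefault(inst.opname, []).append(inst.offset)
    return table


if __name__ == "__main__":
    dis.dis(code)                   # human-readable listing
    print(offsets_by_opname(code))  # opname -> list of offsets
```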

The uncompyle grammar rule covering this is:

    forstmt ::= SETUP_LOOP expr _for ...
    _for ::= GET_ITER FOR_ITER

Notice how the bytecode instructions in the grammar match the emitted instructions. The uncompyle-ed parse tree then looks like this:

   forstmt  .-- SETUP_LOOP ("i")
            |-- expr "range(2)"
            |-- _for ("i")
            |--

And the rule for extracting Python text from the parse above is:

    'forstmt':		( '%|for %c in %c:\n%+%c%-\n\n', 3, 1, 4 ),

Without going into the details of the format string, all you need to know is that each %c picks up the corresponding child of the node. In more straightforward Python this would be:

     print("for %s in %s: %s" % (arg[3], arg[1], arg[4]))
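
To make the "which children get used" question concrete, here is a simplified stand-in for the format rule. It is a sketch under assumptions, not uncompyle's actual template engine: it understands only %c, the function `render` is hypothetical, and the child texts (in particular children 3 and 4, which the tree listing above does not show) are filled in by hand for illustration.

```python
import re


def render(template, indices, children):
    """Replace each %c in template with the child picked by the next index.

    Returns the rendered text and the set of child indices never used.
    """
    remaining = iter(indices)
    text = re.sub(r"%c", lambda _m: children[next(remaining)], template)
    unused = set(range(len(children))) - set(indices)
    return text, unused


# Hand-written child texts for the for-loop example (illustrative only):
# 0: SETUP_LOOP, 1: expr, 2: _for, 3: the store of i, 4: the loop body.
children = ["", "range(2)", "i", "i", "pass"]

text, unused = render("for %c in %c: %c", (3, 1, 4), children)

if __name__ == "__main__":
    print(text)
    print(sorted(unused))
```

Child 0 (SETUP_LOOP) and child 2 (_for) come back as unused, which is harmless for decompilation but loses the offsets attached to them.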

The thing to notice in this long-winded setup is that you don't find child number 2 used anywhere! That's because in the Python code I mention i just once, whereas in the bytecode i is potentially set at two different locations. So the grammar rule is free to find the variable name "i" from either child 2 or child 3. It arbitrarily picked child 3 to use.

But in my situation, I need to be able to know that from any of the several offsets, I am at the position:

for i in range(2):
    ^

So I am not free to ignore some of the children of the for statement parse tree.

Right now this is hard-coded into my handling. However, to make it more table-driven, I need to add a new pattern rule. For example, something like:

    'forstmt':      ( '%|for %c in %c:\n%+%c%-\n\n%x', 3, 1, 4, (3,(2)) ),
                                                  ^^            ^^^^^^^

where %x with argument (3,(2)) means copy all of the location information from child 3 onto child 2 and its descendants if it has any.
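
What %x with argument (3,(2)) would do can be sketched roughly as follows. None of this is existing uncompyle code: the `Node` class, `copy_location`, and `spans_by_offset` are hypothetical names, and the tree is a hand-built fragment using the offsets from the listing above.

```python
class Node:
    """Hypothetical parse-tree node carrying a text span for its fragment."""

    def __init__(self, kind, children=(), offset=None):
        self.kind = kind
        self.children = list(children)
        self.offset = offset  # bytecode offset, set only on terminal symbols
        self.span = None      # (start, end) in the rendered text


def copy_location(src, dest):
    """Copy src's text span onto dest and all of dest's descendants."""
    dest.span = src.span
    for child in dest.children:
        copy_location(src, child)


def spans_by_offset(node, table=None):
    """Collect offset -> span for every node that has a bytecode offset."""
    if table is None:
        table = {}
    if node.offset is not None:
        table[node.offset] = node.span
    for child in node.children:
        spans_by_offset(child, table)
    return table


# Child 2 is _for (GET_ITER at 12, FOR_ITER at 13); child 3 stores i (offset 16).
for_iter = Node("_for", [Node("GET_ITER", offset=12), Node("FOR_ITER", offset=13)])
store = Node("STORE_NAME", offset=16)
store.span = (4, 5)  # the "i" in "for i in range(2):"

copy_location(store, for_iter)  # roughly what %x with (3,(2)) asks for
table = spans_by_offset(Node("forstmt", [for_iter, store]))

if __name__ == "__main__":
    print(table)
```

After the copy, asking for offset 13 or 16 leads to the same position in the text, without the rendered string having been changed.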

Similarly, in "range(2)", when deparsing I can see that there are instructions that cover the loading of the function name "range" and another set of instructions that call the "range(2)" function with the supplied parameter. In contrast to the previous case, those two texts are different. But for the purposes of constructing the text "range" from the instructions, I can again choose either set of instructions. So here I would need a pattern that indicates "note the range here, but don't modify the string that is being built up".

Let me come back and close with how this changes the code and the focus.

If your goal is to reconstruct a Python program from bytecode, it is generally important that you handle all of the bytecode. In my case, I only need a small portion. So 100% accuracy is less of a concern. And if I fail even in the places of interest, that's not critical: just annoying.

In my use case I am both faster and slower than a decompilation of the entire bytecode. I generally decompyle only the function or top-level module for the offset that I am given. That's small compared to the whole program. But on the other hand, I need to save more parse tree and string offset information. And as seen above, the grammar rules have to be more cumbersome so as to cover all of the offsets that one might possibly ask for. (Note the "possibly ask for" in the last sentence. If there is never any possibility that in debugging you will be stopped at an offset, which also means that there is never any possibility of crashing at that offset, then it is probably okay to omit storing information about that offset.)
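
As a rough picture of where "the function or top-level module for the offset that I am given" comes from: in CPython a frame object carries both the code object and the offset of the last instruction executed. The sketch below only shows that pairing; `caller_code_and_offset` is a hypothetical helper, and `sys._getframe` is a CPython-specific function.

```python
import sys


def caller_code_and_offset():
    """Return the caller's code object and its last-executed bytecode offset."""
    frame = sys._getframe(1)  # CPython-specific way to reach the caller's frame
    return frame.f_code, frame.f_lasti


def example():
    return caller_code_and_offset()


code, offset = example()

if __name__ == "__main__":
    # Only this one code object would need deparsing, not the whole program.
    print(code.co_name, offset)
```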

So should I manage this new feature inside the existing package(s) or outside?

If there were a solid package with the same interface, this could be done outside. In fact, I'd prefer that. However, given the state of affairs and the need for access to change the underlying parse tree patterns, for now I'd probably be better off adding the little additional code I have into uncompyle.

I may however change my mind.