
pycdc compared with uncompyle6

R. Bernstein edited this page May 11, 2019 · 37 revisions

pycdc vs uncompyle6

Disclaimer: yes, I work on uncompyle6 and don't work on pycdc, so a review is of course going to be slanted if not biased. But I think the differences are too interesting not to mention.

I have posted this on Hacker News so that people can offer their thoughts.


Integrated vs. Modular

First, let me say that when I first encountered pycdc, I thought it amazing that such a relatively small amount of code covers such a large range of Python versions: 20 or so of them. And it gets a bit of this compactness by isolating changes to a small number of tables. It assumes, for the most part, that Python is regular and can be expressed by some general principles.

And that's its main problem. This is Python: it is not regular. You have some versions where generated bytecode puts keyword parameters before positional parameters on an evaluation stack; then a subsequent version reverses that order for a couple of releases before changing the order back, and then redoes the whole way this works, twice in consecutive versions. Not only do opcodes get added, dropped, and renamed, they sometimes change semantic meaning; or they get reassigned to a different opcode number.

I've also been working on Emacs Lisp bytecode, which goes back over 30 years. So it is even older than Python. But it has changed a lot less: maybe only a third as much.

Because of the constant upheaval that goes on in Python bytecode, I've had to separate just the opcode abstraction into a separate package, xdis. And by separating it out, I can grow it to handle Python complexities like associating full Python releases with magic numbers, which pycdc can't handle now.

There have been situations where the Python magic number has changed in the middle of a major release (3.5.0-3.5.2 vs 3.5.3-3.5.5). And there are magic number variants like pypy and dropbox (which pycdc doesn't handle). I think it is even the case that a Python magic number has been reused in two releases. No, xdis can't currently handle that. But at least all of this is isolated craziness in a Python module, should that craziness become important some time.

So in comparison, uncompyle6 is much larger, and I'll say it: uglier. It has been a constant challenge to try to keep the complexity down by refactoring the code.

And to that end, in contrast to where the codebase was when I started, not only have I split out the opcode-handling part, I've also split out the Earley parser into a separate package.

And so that's another difference: you can use the opcode-handling package and the Earley Algorithm parser independently of decompilation. And I do.

I wrote a bytecode assembler which uses just the opcode abstraction part. And in my Python debugger, I've used the Earley algorithm parser to parse debugger concepts like locations and list ranges.

So although pycdc has some opcode understanding, and it is even encapsulated into some tables, you can't use that much outside of pycdc. Thankfully pycdc comes with a bytecode disassembler, pycdas, so internally it reuses that table goodness.

But suppose you want to use pycdc's decompilation from inside Python? That was in fact my motivation for getting interested in decompilation: I wanted extended decompilation information in a Python debugger written in Python. Since pycdc is written in C++, that can't be easily done. It can be, and is, easily done in uncompyle6 by importing the right modules.

A decompiler in Python raises an issue. The standard Python library only has the disassembly modules "dis" and "opcode", which work only for the version of Python the interpreter is running. So do you allow decompilation of Python bytecode from any Python interpreter? From one fixed version, e.g. Python 2.7? From the latest version, which changes periodically? Or do you force the version of the interpreter and the version of the bytecode it decompiles to be the same?
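To illustrate that constraint, here is a small sketch showing that the stdlib dis and opcode modules describe only the bytecode of the interpreter that is running (the function name `incr` is just an example):

```python
import dis
import io
import opcode
import sys

def incr(x):
    return x + 1

# opcode.opmap lists only the opcodes of the running interpreter;
# it knows nothing about any other Python version's bytecode.
print(sys.version_info[:2], "knows", len(opcode.opmap), "opcodes")

# Likewise dis.dis can only disassemble bytecode for this version:
buf = io.StringIO()
dis.dis(incr, file=buf)
print(buf.getvalue())
```

A cross-version decompiler can't lean on these modules; it has to carry its own opcode tables for every version it supports.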

The choice uncompyle6 makes is that it allows any Python bytecode to be decompiled, same as pycdc. That means it can't use the standard Python libraries, and it is more complicated as a result.

Furthermore, uncompyle6 allows the interpreter doing the decompilation to be any version of Python from 2.4 on. That adds complexity because not only do the Python language and bytecode change, but the data types for the bytecode objects change too. What is a "code" type in Python 2.7 is not the same as a "code" type in 3.0. Even the fields inside the "code" type change their types: in Python 2.x the instruction stream of the code type is a string, while in Python 3.x it is a bytes object.
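A quick look at the Python 3 side of this, run under the interpreter itself (the function `mul` is just an example):

```python
def mul(a, b):
    return a * b

code = mul.__code__
# In Python 3.x the raw instruction stream is a bytes object;
# in Python 2.x the equivalent co_code attribute on a 2.x code
# object was a str.
print(type(code.co_code))      # <class 'bytes'>
print(code.co_argcount)        # 2
print(code.co_varnames[:2])    # ('a', 'b')
```

A decompiler that must run on both 2.x and 3.x can't assume either representation; it needs its own portable model of a code object.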

When you write the decompiler in a different language like C++, while this kind of problem still exists, things are simpler: you still need definitions for Python 2 versus Python 3 code types, but the way you describe them doesn't change depending on what C++ compiler you use.

So in sum, think of uncompyle6 as this Chevy Van or SUV where you can customize the vehicle, get additional options, and add extensions; pycdc is more like a VW air-cooled beetle. It is small, compact, fast at what it does, but is limited in power. And if you try to use it to haul a trailer, you'll surely overheat it and blow out the engine. (Actually, pycdc is faster than uncompyle6, so that analogy is a little off here.)

A little more about the uncompyle6 extensions. One thing I think is cool is that I can show you the deparsed tree of the program, so you can more easily understand how the instructions work. I have been moving toward using Python AST names, but keep in mind the two are only similar at the upper levels. And because of the compiler technology used, uncompyle6 can parse fragments of the text, which I use in another program for showing more precise locations given a program offset.

Another extension uncompyle6 has is that it can give information on how to associate line numbers from the original source code with the reconstructed source code. One final cool feature, alluded to above, is that you can use uncompyle6 inside Python to get information at runtime and introspect over running code. Again, I make use of this in my Python debugger.

But all of this is not without cost and code complexity. To make this run from Python 2.4 onwards, the code is in fact kept in two separate branches in git, for each of the 3 projects.

Also, separating the project into 3 parts means that sometimes when working on the decompiler I find that I need an extension to either the opcode part or the parser part. And when that happens I need to release 2 or 3 packages at the same time with version-specific dependencies. Life would be simpler if there were just one package for these kinds of things.

Error Reporting, Handling, and Testing

The last set of differences I'd like to mention has to do with error handling and testing.

It is not uncommon in uncompyle6 to get parse errors. In other words, you want to decompile something and uncompyle6 tells you that it doesn't understand something, hopefully isolated to particular function(s). This is either due to the ever-changing Python and its code generation, or just an existing failing of uncompyle6. Either way, uncompyle6 is up front about it: the default is to report an error for that portion and fail.

In contrast, although in some cases you may get a pycdc crash with a SEGV, more likely pycdc will produce something resembling Python. It may report Warning: Stack history is not empty!, which suggests something may be amiss. On the other hand, this message can appear even when things are okay, so you don't really know. And there are times when you won't get even this kind of complaint, but the source code will still be wrong.

pycdc has some tests it performs, but those tests come nowhere near covering all of the cases that can arise. uncompyle6 probably has 5 times as many tests. And if this isn't enough, it comes with programs to scan the Python standard library for bytecode to try decompilation on. And generating something resembling Python is not good enough. There is a "verify" option in uncompyle6 which will compile the decompiled source, just to make sure that the output of decompilation is valid Python and not just something that "resembles Python".
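The idea behind that "verify" step can be sketched with nothing but the builtin compile(); this helper is illustrative only, a simplification of what uncompyle6 actually does:

```python
def compiles_cleanly(source, filename="<decompiled>"):
    # A decompiler's output should at least compile back to bytecode.
    # This is not uncompyle6's actual verify code, just the core idea.
    try:
        compile(source, filename, "exec")
        return True
    except SyntaxError:
        return False

print(compiles_cleanly("x = 1\n"))          # True
print(compiles_cleanly("def broken(:\n"))   # False
```

Passing this check only proves the output is syntactically valid Python, which is exactly why the next, stronger test is needed.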

Now of course, that too isn't a full test, since the valid Python just might not capture the same semantics. So we go one further: we take the test programs that Python uses to test itself, decompile that bytecode and run that Python. If that fails running, while running the bytecode from which the source was generated succeeds, then the semantics aren't the same.
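That round-trip idea in miniature: assume `decompiled_src` is what some decompiler handed back (here it is just a stand-in string, since no real decompiler is invoked):

```python
original_src = "def add(a, b):\n    return a + b\n"
# Pretend this came back from a decompiler round trip:
decompiled_src = "def add(a, b):\n    return a + b\n"

orig_ns, decomp_ns = {}, {}
exec(compile(original_src, "<orig>", "exec"), orig_ns)
exec(compile(decompiled_src, "<decompiled>", "exec"), decomp_ns)

# If running both gives different answers, the decompilation changed
# the program's semantics even though its output was valid Python.
assert orig_ns["add"](2, 3) == decomp_ns["add"](2, 3)
```

The real test harness does this at scale, running Python's own test suite against decompiled versions of itself.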

I am sorry to report that uncompyle6 right now is pretty weak here. However weak, it is much better than any other Python decompiler out there.

Popularity

In the decompiler business, sometimes you don't have to be perfect to be helpful. For example, if I am using a decompiler to find out where I was when a program crashed, well, I just need to understand a narrow region of code. And it doesn't have to be totally exact as long as I get the idea. After all, the current technology for error reporting is simply "you have an error somewhere on or around line n". That is, if there is in fact a line n in a source file that can be found. Likewise if you are trying to reverse engineer some code that you don't have the source to, perhaps to understand a game board data structure you want to change, then again, the translation doesn't have to be exact for you to accomplish your goal.

And perhaps that's partly the reason why, as judged by github ratings, both pycdc and the various unmaintained uncompyle/decompyle forks currently have high ratings. uncompyle6 is also the newcomer to the scene, so it has taken a while for it to get to its current level of popularity. For the first 3 or 4 years it was ranked behind the others.

Executive Summary

In sum, pycdc is a relatively small, relatively compact (for the job it needs to do) single package. That's its virtue. But it is also its disadvantage when up against something as large, ugly, and ever-changing as Python. Shortly after it was created, it probably worked about as well as it ever will. But over time I don't expect the pycdc situation to get better, given the current flux of Python. It is already showing signs of decay starting in Python 3.5, and with the upheavals in code generation of 3.6 and 3.7. There is no reason it couldn't be fixed up for those versions, but doing so would mean adding a lot of complexity.

Perhaps it would be better just to split off a new version of pycdc for 3.5 and beyond. But as Python code generation improves, even doing just this gets harder.

uncompyle6 attempts to handle these upheavals by ever painful refactoring.

pycdc has a number of bugs. If you look at its issues list, there are 60 or so distinct classes of bugs. I doubt these will be fixed anytime soon. But does it matter? The popularity statistics suggest not.

uncompyle6 has a number of bugs too, but it is serious about fixing them. That has meant a lot more effort and testing machinery (3 different CI services and 3 or 4 different classes of tests). And a lot more complexity.