Decompilation
This page describes a potential series of steps to complete in a decompilation effort of FUEL. A FUEL decompilation effort should be considered the first step toward a larger Zouna decompilation project. These steps do not necessarily have to be completed in order and the overall process will be very fluid, moving back and forth between the different areas of interest. As it stands, we are in step 6 "Automate as Much of the Decompilation Process as Possible" territory. I have started the FUELDecompilation repository to contain the decompilation tooling and code.
Before you can do anything you need a copy of the main executable and any dynamic libraries it was shipped with. You can find the loaded modules and their paths on the file system by running the game under a debugger like x32dbg/x64dbg. This can also be done statically using Dependencies. Having the dynamic libraries handy can be a useful reference, but it is not a strict requirement in most cases.
As stated in the Binary Information page, VC80 from Microsoft Visual Studio 2005 Professional Edition was used to compile and link the game. I have created a portable version of this toolchain available at widberg/msvc8.0. Detect It Easy and richprint are your friends here.
If you are trying to produce a matching decompilation it is not enough to just identify the toolchain; you will also need to determine the configuration, usually in the form of compiler command-line arguments. The optimization level and other flags can be determined by compiling sample programs and comparing assembly. Usually at lower optimization levels the function prologues and epilogues will be longer. FUEL uses `/O2` to optimize for speed. FUEL disables buffer security checks using `/GS-`; the lack of stack cookies is evidence of this. The `/fp:fast` flag is used to enable fast floating-point behavior.
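The compile-and-compare loop for narrowing down flags can be automated. Below is a minimal sketch, assuming a VC80 `cl.exe` is on PATH (e.g. from an msvc8.0 command prompt); the sample program, candidate flag sets, and the line-based similarity scoring are all illustrative choices, not a definitive methodology.

```python
# Sketch: compile a small sample with candidate flag sets and score each
# assembly listing against the target's disassembly. The highest-scoring
# flag set is a candidate for the original configuration.
import difflib
import os
import subprocess
import tempfile

SAMPLE = "int add(int a, int b) { return a + b; }\n"

def assembly_for(flags):
    """Compile SAMPLE with the given flags and return the /FA listing."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "sample.c"), "w") as f:
            f.write(SAMPLE)
        # /FA writes an assembly listing next to the source (sample.asm).
        subprocess.run(["cl", "/nologo", "/c", "/FA", "sample.c", *flags],
                       cwd=tmp, check=True)
        with open(os.path.join(tmp, "sample.asm")) as f:
            return f.read()

def similarity(asm_a, asm_b):
    """Line-based similarity ratio between two assembly listings."""
    return difflib.SequenceMatcher(
        None, asm_a.splitlines(), asm_b.splitlines()).ratio()

# Illustrative candidate flag sets to rank against the target assembly.
CANDIDATES = [["/O1"], ["/O2"], ["/O2", "/GS-"], ["/O2", "/GS-", "/fp:fast"]]
```

Comparing whole listings is crude; in practice you would strip comments and labels from the `/FA` output before scoring.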
As will be later expanded upon in the 7.3 "Calling Convention Shenanigans" section, FUEL was linked with the `/LTCG` link-time code generation flag enabled.
Along with the C runtime library from VS2005, several static libraries are linked into the game, including DirectX 9 and Bink. Once the library functions have been identified, they can be excluded from the decompilation effort.
An import library for a DLL can be generated following these steps if the original `.lib` cannot be located, for example with `xlive.dll`. First, run `dumpbin /EXPORTS xlive.dll > xlive.exports`. From there, you can use the function names and ordinals in the exports file to create a `.def` file:
```
LIBRARY XLIVE
EXPORTS
    XWSAStartup @1 NONAME
    XWSACleanup @2 NONAME
    XSocketCreate @3 NONAME
    XSocketClose @4 NONAME
```
The `LIBRARY` directive declares the base name of the `.dll` to import. The `NONAME` annotation means that the load library will import by ordinal rather than by symbol name; you will still use the symbol name in your code to link. You can also exclude the annotation and import by name. Since the original `xlive.lib` that FUEL was linked against used the `NONAME` annotation and `xlive.dll` does not export by symbol, only the ordinals would be available if not for xlivelessness/xlive_exports.def or another source of leaked symbol names. You can always give the imports names yourself; just make sure to be consistent. The Module-Definition (.Def) Files documentation has more information about what is possible here. In general, I try to match the way the original executable imports library functions.
This `.def` file can then be turned into an import library using the command `lib "/MACHINE:X86" "/DEF:xlive.def" "/OUT:xlive.lib"`. Linking this `.lib` into an executable will cause the imports to be added to that executable's Import Directory Table, ILT, and IAT. A more advanced and unnecessary understanding of PE imports can be obtained from 0xRick's "A dive into the PE file format - PE file structure - Part 5: PE Imports (Import Directory Table, ILT, IAT)" blog post.
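Generating the `.def` text itself is mechanical once you have the symbol/ordinal pairs (ordinals from `dumpbin /EXPORTS`, names from a leaked exports file or your own naming). A minimal sketch:

```python
# Sketch: emit a module-definition (.def) file from (symbol, ordinal)
# pairs, mirroring the NONAME style used by the original xlive.lib.
def make_def(library, exports, noname=True):
    """Build .def file text for the given library and export list."""
    lines = [f"LIBRARY {library}", "EXPORTS"]
    suffix = " NONAME" if noname else ""
    for name, ordinal in exports:
        lines.append(f"    {name} @{ordinal}{suffix}")
    return "\n".join(lines) + "\n"

print(make_def("XLIVE", [("XWSAStartup", 1), ("XWSACleanup", 2)]))
```

Pass `noname=False` if you want the import library to import by name instead of by ordinal.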
Oftentimes binaries will have source-available code that was compiled and linked in. This is different from linking against a provided `.lib` since the compiler arguments and link-time optimization will have a greater effect on the generated code, since it is being compiled by the binary developer rather than the library developer. Signature databases will therefore be less likely to match the functions, although it is not impossible that they do. Usually searching for strings, constants, or arrays of constants found in a function you believe was adopted will lead you to a public-facing Git repository, academic paper, or FTP server with source files. This is most common for compression, encryption, and hashing algorithms in my experience. The On-Line Encyclopedia of Integer Sequences is a great resource for searching for arrays of constants and finding the names of algorithms they could be associated with; knowing the name of the algorithm makes it easier to search for. A list of code of this nature found in FUEL is available in the 7.1. Prior Work section of this page.
Using the techniques described on the Binary Similarity page, one can apply some "better than nothing" symbols to the executable, which makes it easier to work with. In the case of FUEL, the game was compiled with RTTI enabled, so some class names are available. This can be used to build inheritance trees and namespace functions. The `__FILE__` macro strings found in the binary, mentioned in the Binary Information page, can also be used to derive names. ooanalyzer can be a bit finicky but it's worth a try to see if it propagates class names to more functions than Ghidra does with simple RTTI parsing.
The Ghidra extension boricj/ghidra-delinker-extension can be used to export relocatable COFF object files from an executable, allowing one to rebuild the game without having to have previously decompiled every function upfront. There are also other projects that serve a similar purpose. Some heuristics, like padding and optimization analysis, can be used to roughly determine the original object file boundaries. In conjunction with the `__FILE__` macro strings, this can help in naming the object files and the source files they are derived from.
I documented some delinking considerations in boricj/ghidra-delinker-extension#6. There is a bug in Ghidra's "x86 Constant Reference Analyzer", so you should disable that analyzer. Ghidra doesn't always place labels at the start of strings if only substrings are referenced, but the delinker will try to find the symbol for the start of the string, so check the log output and add these symbols before rerunning the delinker. Ghidra will add lots of XRefs to things that look like addresses but are not, in particular `0x007fffff`, so delete these references before rerunning the delinker. Make sure there are no duplicate symbols in the project's symbol table; not doing this will cause hard-to-diagnose problems that will waste your time. Also, make sure all instructions are in a function; the delinker does not act on orphan instructions.
A quick smoke test for how well the delinker worked is to relink with a different base address and run the executable. This catches some obvious crashes, but hard-to-debug infinite loops and other control-flow deviations may remain. Also, make sure DLLs aren't loading in the original executable's address space. If there is code there, you may experience weird subtle glitches instead of obvious crashes.
As described in CRT initialization, one can relink the global initializers with Microsoft `link.exe` specific tricks. The following C file, when compiled, produces an object file with a list of global initializers in the `.CRT$XCU` section. Maintaining the order of the entries is important. When linked, MSVCRT will incorporate these into the list of functions called in `__cinit`.
```c
#pragma section(".CRT$XCU", read)
#define X(x) \
    extern void x(void); \
    __declspec(allocate(".CRT$XCU")) void (*__xc_u_0_##x)(void) = x;

X(FUN_008e4690)
/* etc... */
```
Some other CRT structures that may need to be marked appropriately in Ghidra are TLS Callbacks, Structured Exception Handling, and C++ Exceptions. From what I can tell FUEL uses SEH exactly once, in the `MemoryManager_Z` class. Aside from making sure the structures have their pointer members marked as addresses in Ghidra and are included in the delinker selection, no extra processing is needed.
Resource Hacker can be used to export a `.rc` resource script from the executable. This can be compiled into a `.RES` with `rc.exe` and relinked. If resources are missing then the game may not function properly. You must add `#include <Windows.h>` to the top of the file created by Resource Hacker since it uses some macros from there when generating its output. Also, you will likely need to remove or replace the `RT_MANIFEST` type resource to make Windows side-by-side happy and avoid "The application has failed to start because the side by side configuration is incorrect." messages.
A function from the delinked game can be replaced by deleting or overriding the original symbol. The easiest option is to exclude the original function body from the delink and write your new function body to be linked in; however, this does require the game to be delinked again with the new selection, which can be time-consuming. You could just remove the symbol from the COFF symbol table without doing another full delink. Another option is to use the MSVC linker's symbol resolution behavior. As described in "Understanding the classical model for linking: You can override an LIB with another LIB, and a LIB with an OBJ, but you can’t override an OBJ", if the original game OBJ is in a LIB and a symbol exists both in that LIB and an OBJ with your override function body, the override will take precedence. A LIB can be made from an OBJ with the command `lib /out:game.lib game.obj`. This can be used to emulate weak symbol semantics from ELF. COFF does have a symbol class for weak symbols, 108, but it lacks support. Note that using a non-MSVC linker like lld may cause additional warnings and errors when relying on this behavior that are not present in a genuine MSVC linker. In general the version of `link.exe` you are stuck with will also affect how well this works. I couldn't get it working with the one from VC80 and decided that removing functions was better. It reduces the resultant binary size and makes errors more obvious.
One thing to be aware of when syncing symbols between the Ghidra project and recompiled source is name mangling, sometimes called name decoration. Name mangling in C++ is needed to uniquely identify overloaded functions and support features like function overloading, namespaces, and template instantiation by encoding additional information (such as function signatures) into symbol names. While C does not support these features, MSVC will still decorate C function names with information about the calling convention. The MSVC name mangling behavior is semi-documented on the decorated names page of Microsoft's Learn site, Clang's implementation of the mangler is available at llvm-project/clang/lib/AST/MicrosoftMangle.cpp, and Wikiversity has a Visual C++ name mangling page. `dumpbin` can be used to list the mangled names of symbols present in a COFF or PE file. `undname` converts a mangled name to its demangled form. Ghidra can also demangle names in some cases if it can detect the mangling scheme used by the compiler, or using the scripting API's Demangler class.
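The `dumpbin`/`undname` pair can be driven from a script. The sketch below assumes both tools are on PATH (a VS command prompt) and that external symbols in `dumpbin /SYMBOLS` output appear on lines containing `External` with the name after the final `|`; the exact columns vary between toolchain versions, so treat the parsing as illustrative.

```python
# Sketch: pull mangled external symbol names out of `dumpbin /SYMBOLS`
# output and demangle them with undname.
import re
import subprocess

SYMBOL_RE = re.compile(r"\|\s+(\S+)\s*$")  # symbol name after the final '|'

def mangled_names(dumpbin_output):
    """Return the external symbol names from dumpbin /SYMBOLS text."""
    names = []
    for line in dumpbin_output.splitlines():
        if "External" in line:
            m = SYMBOL_RE.search(line)
            if m:
                names.append(m.group(1))
    return names

def demangle(name):
    """Shell out to undname; returns its raw stdout."""
    out = subprocess.run(["undname", name], capture_output=True, text=True)
    return out.stdout
```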
With some MSVC trickery, we can have the compiler output the undecorated and decorated name of a function along with some user data, i.e. an address from the original executable for attribution in a Ghidra project. Invoking the `#define MARKFUNCTION(source, address, ...) __pragma(comment(user, __FUNCSIG__ ";" __FUNCDNAME__ ";" #source ";" #address ";" #__VA_ARGS__))` macro like `MARKFUNCTION(FUEL, 0x00000000, int);` from within a function causes a line similar to `int __cdecl X::a<int>(void);??$a@H@X@@SAHXZ;FUEL;0x00000000;int` to be embedded into the object file after the symbol table. These can be parsed as part of a build system. This may or may not be easier than running `undname` on all symbols in an object file, associating them with lines in the source code, and parsing some comment format on a nearby line. To play nice with templates, a mark for each type expected in the template parameters can be made as user data, and the parser can ignore lines where the user data does not match the function names. Once all the symbols have been collected, the undecorated name can be used as the primary human-readable name in Ghidra and the decorated name can be a secondary symbol for the delinker.
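A parser for these records does not need to walk the COFF structures; since the records follow a fixed semicolon-separated format with a known source tag, scanning the raw object file bytes is enough. A minimal sketch, assuming the `FUEL` tag and the record layout shown above:

```python
# Sketch: scan a COFF object file's raw bytes for MARKFUNCTION user-comment
# records of the form "undecorated;decorated;FUEL;0xADDR;userdata".
import re

RECORD_RE = re.compile(
    rb"([^;\x00]+);([^;\x00]+);FUEL;(0x[0-9a-fA-F]+);([^;\x00]*)")

def find_marks(obj_bytes):
    """Return a list of dicts describing each embedded mark record."""
    marks = []
    for m in RECORD_RE.finditer(obj_bytes):
        undecorated, decorated, address, userdata = (
            g.decode("ascii", "replace") for g in m.groups())
        marks.append({"undecorated": undecorated,
                      "decorated": decorated,
                      "address": int(address, 16),
                      "userdata": userdata})
    return marks
```

A build system hook would call `find_marks(open(path, "rb").read())` on each object file and feed the results to a Ghidra script.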
A more complicated and worse option I considered was invoking the `#define MARKFUNCTION(source, address, ...) __pragma(message(__FUNCSIG__ ";" __FUNCDNAME__ ";" #source ";" #address ";" #__VA_ARGS__))` macro like `MARKFUNCTION(FUEL, 0x00000000, int);` from within a function to cause a line similar to `int __cdecl X::a<int>(void);??$a@H@X@@SAHXZ;FUEL;0x00000000;int` to be printed to stdout by the compiler. These can be collected and parsed by a compiler launcher as part of a build system.
Although it may not be immediately obvious, using the compiler to emit this information is superior to parsing the code for comments. First, you eliminate any source of bugs due to the difference in implementation between the compiler building the code and whatever parser you write to digest the comments. You also get C preprocessor integration for free; if the section of code that invokes the macro is in an inactive `#if` block, the comment will not be emitted. There is no questioning whether your parser is using the same set of definitions as the compiler. Second, the name mangling itself requires information about the visibility of member functions, the sizes of types, etc., and requires running an entire C/C++ frontend on the source for context. The compiler does this anyway, so we might as well use that result.
FUEL is too big to decompile in full, as you will soon see. But we can leverage the delinked object files to rebuild the game while only decompiling a small subset of the functions. How do we decide which functions to start with? How about the functions run during a particular part of the game, for example, startup? Using DynamoRIO's `drcov` tool one can trace the execution of part of the game to identify which functions are involved. I have a janky way of processing the output of `drcov` into a list of unique non-library functions in the order of first appearance in the trace.
First, if the trace is from version 3 you must convert it to version 2 using drcov-3-to-2.py. Then you must convert the trace into a list of basic block start addresses using drcov.py (change the hardcoded module name near the end of the script), piping the output into a file. Finally, you must change the hardcoded paths and run idalistfunctionsfrombb.py or write an equivalent script for your preferred SRE tool. At the end of this process, you will have funcs.txt, which contains 6895 unique non-library function start addresses. This is a mess, but it took minutes to set up rather than hours and I don't plan on doing it often.
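The core of the basic-block-to-function step is small enough to sketch directly. This is a rough equivalent of what idalistfunctionsfrombb.py does, not its actual code: each traced block is mapped to the nearest function start at or below it (the function list would come from your SRE tool), and functions are kept in order of first appearance.

```python
# Sketch: reduce an ordered list of traced basic-block start addresses to
# the ordered set of unique functions they fall inside.
import bisect

def unique_functions(bb_starts, function_starts):
    """Map each basic block to its containing function (nearest function
    start at or below the block) and return unique functions in order of
    first appearance in the trace."""
    starts = sorted(function_starts)
    seen, ordered = set(), []
    for addr in bb_starts:
        i = bisect.bisect_right(starts, addr) - 1
        if i < 0:
            continue  # block precedes all known functions; skip it
        func = starts[i]
        if func not in seen:
            seen.add(func)
            ordered.append(func)
    return ordered
```

Library functions can be excluded by simply leaving them out of `function_starts`.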
We can follow this process again, tracing from the start through participating in a race, and subtracting out the functions run from the start to the main menu to only include those functions that are called during the race segment. The command `grep -Fxvf funcs-A.txt funcs-B.txt > only-in-B.txt` will produce a file `only-in-B.txt` with the lines appearing in `funcs-B.txt` and not in `funcs-A.txt`, maintaining their original order. After doing exactly that, I am left with only-in-race.txt, which contains 1433 unique non-library function start addresses. Thankfully not many more unique functions are called after the game is loaded. `grep -Fxf funcs-A.txt funcs-B.txt` will output only those lines in both files. `awk '!seen[$0]++' funcs-A.txt funcs-B.txt` allegedly outputs all the lines in both files without duplicates.
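For anyone who would rather not trust half-remembered grep/awk incantations, the same three order-preserving set operations are trivial in Python:

```python
# Sketch: the grep/awk set operations above, as plain Python, keeping the
# original line order of the input files.
def only_in_b(a_lines, b_lines):
    """Lines in B but not A, in B's order (grep -Fxvf A B)."""
    a = set(a_lines)
    return [l for l in b_lines if l not in a]

def in_both(a_lines, b_lines):
    """Lines present in both files, in B's order (grep -Fxf A B)."""
    a = set(a_lines)
    return [l for l in b_lines if l in a]

def union_dedup(a_lines, b_lines):
    """All lines from both files, first occurrence wins (awk '!seen[$0]++')."""
    seen, out = set(), []
    for l in list(a_lines) + list(b_lines):
        if l not in seen:
            seen.add(l)
            out.append(l)
    return out
```

Each takes lists of lines, e.g. `only_in_b(open("funcs-A.txt").read().splitlines(), open("funcs-B.txt").read().splitlines())`.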
A capture that starts the game, completes a race, and exits contains 7997 unique non-library function start addresses in the main executable. The first and last couple of functions are global static constructors and destructors. For comparison, the popular Lego Island decompilation project contains 4422 functions across the main executable and game DLL without limiting the list to only functions that are run in a trace. If we expand our consideration in the same way, FUEL has approximately 35000 functions.
Something like simonlindholm/decomp-permuter can be used to permute pseudo-C code exported from IDA's Hex-Rays Decompiler to compile to machine code that better matches the original disassembly. An upgrade to decomp-permuter to support x86_32 and COFF or a new tool based on Clang's AST could work here. The Clang solution could be better suited for C++. I must mention CVise and CSmith here as they solve similar problems. cvise/clang_delta, Clang-Chimera, or GrammaTech/clang-mutate could be a good starting point for a Clang-based source code mutator. This is by far the most ambitious item on this list, but it must be solved to finish a decompilation of a modern game in a reasonable amount of time.
If you really want to earn your PhD you can develop a genetic algorithm using something like NEAT (most famously used in SethBling's MarI/O) that could "learn the compiler." I'm not sure how well this would work, but I am interested in seeing how an alternative decompiler architecture based on generating and mutating source code and compiling it to compare against the target disassembly would perform. This is in complete opposition to the current strategy that all decompilers use: lifting to an intermediate-level representation and generating pseudo-C code that is not guaranteed to compile, let alone to the same machine code as was lifted. The current strategy is of course much faster than running a machine-learning model, and perfect recompilation is unnecessary for pretty much everyone, so it makes sense that nobody does it the way I am suggesting. If you are less ambitious, a hybrid model could also work where you take the pseudo-C output of a standard decompiler architecture and mutate it as described in the previous paragraph, with less randomness, using machine learning. I'm not sure what I'm saying at this point but I feel like there is something interesting here.
The decomp.me platform allows one to compare the results of compiling the C code for a function to the original disassembly in an easy-to-use interface similar to Compiler Explorer. It has support for VC80 x86 and a preset for FUEL.
As part of other efforts, existing source-available code that has found its way into the game has been identified. Usually searching for debug symbols from the other games, strings, constants, or characteristics of algorithms will yield results. This can aid in establishing code style guidelines and having a "known good" base for matching disassembly. The following pages describe the nature of the discovered code.
- ZOUNA.rar (DAT-o-MATIC)
- microsoft/MixedRealityToolkit/SpatialUnderstanding/Src/Engine
- Networking
- Asobo LZ Compression
- Asobo Arithmetic Coding Compression
- Asobo CRC32
- 0x5abe/zouna-templates-docs/code
- StackWalk from this codeguru forum post (archive.org) is in FUEL. The strings can be searched and found.
- `SolveQuadric`, `SolveCubic`, `SolveQuartic`, and `inttor` from the Graphics Gems series of books are in ZOUNA.rar and WALL-E with the same debug symbols.
- The original TinyXML library is used in WALL-E to parse `.ini` files. This was identified using debug symbols as well. This was not used in FUEL.
- `dMULTIPLY[012]_33[13]` style functions from Bullet Physics are in WALL-E with the same debug symbols. The original Bullet Physics was open source at the time of WALL-E's release, but the functions may come from somewhere else since they are very standard.
In the absence of some complex formal methods technique, perhaps based on angr's constraint solving capabilities or LLVM's alive2, the only real way to statically guarantee that two sets of assembly with non-trivial differences are functionally equivalent is for them to be identical. This is why this page and many existing decompilation efforts focus so heavily on matching the original disassembly where applicable. There is also an element of "because we can" at play in some projects where matching the original is more of a point of pride than something practical. The idea is that it is not a true decompilation if it doesn't match and therefore doesn't count, "you might as well just do a green field reimplementation if it isn't going to match." That said, non-matching functions can still be built, relinked, and tested dynamically, i.e. running the game and checking the behavior. While not rigorous this should be good enough for most purposes.
That said, there are some practical limitations to matching every instruction exactly. With link-time code generation enabled, the prologue and epilogue of some functions will be impossible to match by simply compiling source to an object file without linking, for reasons discussed in the next subsection. Even with linking, many factors inform the optimizations and instruction selection that are simply too chaotic to control for. Even small changes in the prologue, for example due to a change in the calling convention, can cause instruction alignment to shift and increase or decrease padding before branch targets, leading to discrepancies far into the function body. Also, if arguments end up passed on the stack instead of in registers, extra loads and stores may be emitted. It is possible to have a successful decompilation project where the goal is not 100% matching disassembly, for example, the GTA 3/GTA VC decompilation project.
One problem with decompiling FUEL is maintaining interoperability between decompiled functions and the original delinked functions that have not yet been decompiled to source. A calling convention is a contract between the caller, the site where the target function is being called, and the callee, the target function, that specifies things such as where arguments and return values will be stored, which registers can be assumed to be unmodified after the call, and who cleans up the stack. MSVC allows specifying certain standardized calling conventions including `__cdecl`, `__stdcall`, and `__fastcall`. `__thiscall` can also be used, but only on member functions in MSVC; Clang allows `__thiscall` on non-member functions. Having control over which calling convention a function uses is great; however, since that control is limited to a small set of predefined calling conventions, not all cases are easily covered.
FUEL was linked with link-time code generation enabled, which allows for whole-program optimization such as analyzing register usage and optimizing function calling conventions to pass more arguments in unused registers before pushing them onto the stack. However, this leads to most functions having unique and non-standard calling conventions not found in the list above. This makes it hard for delinked code to call recompiled code and for recompiled code to call delinked code. The easiest solution to this problem is to generate trampoline stubs that act as adaptors from the non-standard to standard calling conventions and vice versa. Libraries exist to generate these trampolines, such as my usercall.hpp, and my llvm-project-widberg-extensions even adds support to the compiler directly. These are overkill for just generating trampolines; a simpler script could generate the assembly and assemble it as part of the build system for the recompiled code.
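As a minimal sketch of that "simpler script" approach, the following generates MASM text for a stub that adapts a `__cdecl` call site to a callee expecting some of its arguments in registers. The stub name, target symbol, and register assignment are all hypothetical; a real assignment comes from analyzing the target function in your SRE tool, and a real generator must also account for the callee's stack cleanup and preserved-register expectations.

```python
# Sketch: emit a MASM trampoline that copies __cdecl stack arguments into
# the registers a custom-convention callee expects, then tail-jumps to it.
def cdecl_to_regs_trampoline(name, target, arg_regs):
    """Emit a stub `name` that loads stack args into `arg_regs`
    (e.g. ["ecx", "esi"]) in order and jumps to `target`."""
    lines = [f"_{name} PROC"]
    for i, reg in enumerate(arg_regs):
        # __cdecl arg i sits at [esp + 4*(i+1)], past the return address.
        lines.append(f"    mov {reg}, [esp + {4 * (i + 1)}]")
    # Tail-jump keeps the caller's return address on top of the stack.
    lines.append(f"    jmp {target}")
    lines.append(f"_{name} ENDP")
    return "\n".join(lines)
```

The build system would write the generated text into an `.asm` file, assemble it with `ml.exe`, and link the resulting object alongside the recompiled code.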
As mentioned previously, this is only a concern when calls cross the delinked-recompiled boundary. If one recompiled function calls another recompiled function, no special care needs to be taken. And if no delinked code calls the second function, then the trampoline to it does not even need to be generated, even if the call in the original binary used a non-standard calling convention. This is to say that one can divide the function call graph of the binary up into minimally sized "islands" where all the edges that cross the island/not-island boundary are standard calling conventions. The simplest island would be a single function with a standard calling convention that calls no other functions. As long as each island exists wholly in the delinked code or recompiled code, no trampolines will be needed between functions. So one can reimplement delinked islands one at a time in the recompiled code without having to worry about calling convention shenanigans at all. In practice, however, the size of these islands will be quite large due to the highly interconnected and brutally optimized nature of most binaries, so trampolines are needed if you do not want to reimplement hundreds of functions at a time while maintaining interoperability. Additionally, finding these islands in Ghidra is a nightmare due to its lackluster support for discovering non-standard calling conventions on functions in its analysis. Usually, functions containing `in_` or `unaff_` prefixed local variable names are not using one of the standard calling conventions. IDA Pro and Binary Ninja are far superior in this aspect.