Skip to content

Type Analysis

Roger edited this page Dec 19, 2016 · 3 revisions

Reko's Type Inference

In order to recover the original type information that may have been present in the subject binary file, Reko uses a type inference method based on Raimar Falke's 2004 master's thesis "Entwicklung eines Typanalysesystem für einen Decompiler" ('Development of a type analysis system for a decompiler'), which is an interprocedural (whole-program) analysis. Reko provides some additions to accommodate real-world code that reko encounters. This thesis used to be downloadable from the net from many places, but I can't seem to find it anymore. I can provide it for interested parties; be aware that it is written in German.

One aspect that I like about Falke's approach is that, unlike other researchers, he doesn't approach the type analysis problem as a constraint satisfaction problem, which has always struck me as being a very heavy handed approach. His algorithm has the advantage of being relatively easy to implement (all mistakes in the reko codebase are of course my own, and not Falke's).

Fundamentals

Each Expression in reko must have a DataType -- currently the property Expression.DataType. At the very minimum, the DataType is Primitive.Byte or Primitive.WordN where N is the number of bits taken up by the expression. Every processor architecture must define the operand sizes of every operand of each machine instruction, so we're guaranteed at least this. Naturally, if the processor architecture can yield more hints about the types of its operands, reko will take advantage of it. However, the majority of the instructions rewritten / lifted will remain as PrimitiveType until more contextual analysis is brought to bear. In a highly abstract way, we have the set E of all expresions each of which has a DT property:

E
.DT

Type variables

This is where Falke's type analysis comes into play. Reko scans through the IR code and assigns to each expression a TypeVariable (currently Expression.TypeVariable but it may get moved elsewhere in the future). Type variables represent the as-yet undetermined data type of an expression, and the implementation is used to store information gathered from the binary for later stages of the analysis. After type variable assignment we have:

E
.DT
.TV  --->   TV

Type evidence

Once each expression has a TypeVariable, reko scans again, this time collecting "evidence" (what Falke calls "traits") based on how the expressions are used in the code. Obviously the initial data type is evidence but, as mentioned above, it is normally quite unspecific. However, the operators involved in the expressions will allow reko to refine the initial PrimitiveType types into more specific types. For example, if reko encounters an expression i_1 + i_1 (where '+' is integer addition), it is highly unlikely that i_1 is either a floating point value, character, or pointer. Also, any user supplied or platform supplied information is introduced here as well. All this work is carried out by the ExpressionTypeAscender and ExpressionTypeDescender classes.

Each TypeVariable has an OriginalDataType property, which is the datatype that reko can deduce for that type variable, without taking into consideration assignments (either directly in code or through procedure parameters / return values).

E
.DT
.TV  ---> TV
          .DT_orig

To be fair, my German is rusty, so it's possible I may have misinterpreted Falke's section 3.8.1:

Um Konflikte später aufzulösen zu können, wird zu jeder Typvariable t_i ein Ursprungsdatentyp t_i' berechnet. Wenn es sich bei dem Typen um einen Summentyp handelt, wird des Alternativetyp mit Hilfe des Ursprungstyp bestimmt.

which I translate freely to:

In order to resolve later conflicts, an original data type t_i' is computed for each type variable t_i. When dealing with a union type, the alternative type is determined with the help of the original type.

It's possible that Falke implies that reko should just copy E.DT into TV.DT_orig. I will try to confirm with him.

Equivalence class

Statements and expressions that assign values to one another introduce an equivalence relation over the type variables which allows reko to group them into equivalence classes (described in Falke section 3.8.5.4). The type variables are partitioned so that each one belongs to exactly one equivalence class, using a standard disjoint set union algorithm. Each type variable records the EquivalenceClass it belongs to, and the EquivalenceClass contains the union of all the original data types of the type variables that it consists of. Naturally, the union is simplified if all of its alternatives are compatible.

E
.DT
.TV  ---> TV        ---> EQ
          .DT_Orig       .DT

The resulting unified datatype is copied back to the type variable as well:

E
.DT
.TV  ---> TV        ---> EQ
          .DT            .DT
          .DT_Orig

Now, the datatypes in EQ.DT are "cooked" to EQ.DT' attempting to simplify them as much as possible. Falke's thesis 3.8.5.7 specifies a number of transformations which reko performs. We finally have:

E
.DT
.T  ---> TV        ---> EQ
         .DT            .DT'
         .DT_Orig

Expression rewriting

Now we're ready to take the data types obtained through type analysis and apply them to the expressions. This is done in reko's TypedExpressionRewriter class. The end result is that DT' is "pushed" back into the Es, but many of those Es will be rewritten as a consequence of their types being discovered. Once the TypedExpressionRewriter is done, type analysis is complete, and all type variables should be discarded. Our end result is:

E'
.DT'

where E' represents the fact that expressions like Mem0[r0_1 + 0x42:word32] have now changed to r0_1->dw0042 and DT' reflects that the original data type of the Mem0 expression, which was word32, is very likely now reduced to either a primitive type like int32 or float32, or a complex type like e.g. ptr(Eq_3) where Eq_3 is a TypeReference to a structure or union.

You can’t perform that action at this time.