Reko's Type Inference
In order to recover the original type information that may have been present in the subject binary file, Reko uses a type inference method based on Raimar Falke's 2004 master's thesis "Entwicklung eines Typanalysesystem für einen Decompiler" ('Development of a type analysis system for a decompiler'), which is an interprocedural (whole-program) analysis. Reko provides some additions to accommodate real-world code that reko encounters. This thesis used to be downloadable from the net from many places, but I can't seem to find it anymore. I can provide it for interested parties; be aware that it is written in German.
One aspect that I like about Falke's approach is that, unlike other researchers, he doesn't approach the type analysis problem as a constraint satisfaction problem, which has always struck me as being a very heavy handed approach. His algorithm has the advantage of being relatively easy to implement (all mistakes in the reko codebase are of course my own, and not Falke's).
Expression in reko must have a
DataType -- currently the property
Expression.DataType. At the very minimum, the
N is the number of bits taken up by the expression.
Every processor architecture must define the operand sizes of every operand
of each machine instruction, so we're guaranteed at least this. Naturally,
if the processor architecture can yield more hints about the types of its
operands, reko will take advantage of it. However, the majority of the instructions
rewritten / lifted will remain as
PrimitiveType until more contextual analysis
is brought to bear. In a highly abstract way, we have the set
E of all expresions
each of which has a
This is where Falke's type analysis comes into play. Reko scans through the
IR code and assigns to each expression a
Expression.TypeVariable but it may get moved elsewhere in the future). Type
variables represent the as-yet undetermined data type of an expression, and
the implementation is used to store information gathered from the binary for
later stages of the analysis. After type variable assignment we have:
E .DT .TV ---> TV
Once each expression has a
TypeVariable, reko scans again, this time collecting
"evidence" (what Falke calls "traits") based on how the expressions are used in
the code. Obviously the initial data type is evidence but, as mentioned above, it
is normally quite unspecific. However, the operators involved in the expressions
will allow reko to refine the initial
PrimitiveType types into more specific
types. For example, if reko encounters an expression
i_1 + i_1 (where '+' is integer
addition), it is highly unlikely that
i_1 is either a floating point value,
character, or pointer. Also, any user supplied or platform supplied information
is introduced here as well. All this work is carried out by the
TypeVariable has an
OriginalDataType property, which is the datatype
that reko can deduce for that type variable, without taking into consideration
assignments (either directly in code or through procedure parameters /
E .DT .TV ---> TV .DT_orig
To be fair, my German is rusty, so it's possible I may have misinterpreted Falke's section 3.8.1:
Um Konflikte später aufzulösen zu können, wird zu jeder Typvariable t_i ein Ursprungsdatentyp t_i' berechnet. Wenn es sich bei dem Typen um einen Summentyp handelt, wird des Alternativetyp mit Hilfe des Ursprungstyp bestimmt.
which I translate freely to:
In order to resolve later conflicts, an original data type t_i' is computed for each type variable t_i. When dealing with a union type, the alternative type is determined with the help of the original type.
It's possible that Falke implies that reko should just copy
I will try to confirm with him.
Statements and expressions that assign values to one another introduce an
equivalence relation over the type variables which allows reko to group
them into equivalence classes (described in Falke section 126.96.36.199). The
type variables are partitioned so that each one belongs to exactly one
equivalence class, using a standard disjoint set union algorithm. Each type variable
EquivalenceClass it belongs to, and the
EquivalenceClass contains the
union of all the original data types of the type variables that it consists of. Naturally, the union
is simplified if all of its alternatives are compatible.
E .DT .TV ---> TV ---> EQ .DT_Orig .DT
The resulting unified datatype is copied back to the type variable as well:
E .DT .TV ---> TV ---> EQ .DT .DT .DT_Orig
Now, the datatypes in
EQ.DT are "cooked" to
EQ.DT' attempting to
simplify them as much as possible. Falke's thesis 188.8.131.52 specifies a number
of transformations which reko performs. We finally have:
E .DT .T ---> TV ---> EQ .DT .DT' .DT_Orig
Now we're ready to take the data types obtained through type analysis and
apply them to the expressions. This is done in reko's
class. The end result is that
DT' is "pushed" back into the
Es, but many
Es will be rewritten as a consequence of their types being discovered.
TypedExpressionRewriter is done, type analysis is complete, and all
type variables should be discarded. Our end result is:
E' represents the fact that expressions like
Mem0[r0_1 + 0x42:word32]
have now changed to
DT' reflects that the original data type
of the Mem0 expression, which was
word32, is very likely now reduced to either
a primitive type like
float32, or a complex type like e.g.
Eq_3 is a
TypeReference to a structure or union.