Skip to content

Graph Matching: Challenges

Sandeep Dasgupta edited this page Aug 25, 2019 · 1 revision

A working instance of graph matching

Literature

For the matching following are the prerequisite tasks:

Task 1: Data flow graph of the binary Issue1

The data flow graph for the binary program must include the values(register or memory) as nodes. This requires enumerating all the registers and memory in the program and run the data flow analysis with it. For example: for the #0: addq $0, %rax; #1: movq %rax, %rbx, the dfg should have the following relations: %rax (from #0) --> %rbx (from #1)

The current representation is like #0 --> #1 and only captures the df relation across instructions and hence will not expose the input-output relation within an instruction. On the other hand, the dfg of the decompiled LLVM program, being derived from the instruction semantics, captures the df relations within an instruction.

Task 2: Enhance Stoke APIs to provide memory access information Issue2

Consider the instruction pushq %rbp which pushes the value of rbp at sack location %rsp-8 and also updates the value of %rsp to %rsp-8. The read-set of the instruction is %rbx and %rsp and the write-set is %rsp and memory. Currently, the stoke APIs, to extract the read/write-set of instruction, is not proving the memory read or write information. Make changes in the API to accommodate that.

Task 3: Prune out spurious df edges of the dfg of the binary Issue3

Binary instruction are multi-input and multi-output. Consider the instruction pushq %rbp which pushes the value of rbp at sack location %rsp-8 and also updates the value of %rsp to %rsp-8. The read-set of the instruction is %rbx and %rsp and the write-set is %rsp and memory. A naive approach of generating dfg will create two edges from each read-set to each one of the write-set. However, the edge from %rbp to %rsp is spurious as %rbp is not contributing towards the update of the %rsp value. These spurious edges can be avoided if we take the semantics of the pushq instruction into account. In fact, the dfg of the decompiled LLVM program does not have this spurious information as it is constructed following the semantics of an instruction.

In order to remove the spurious edges, we need to know which members of the read-set are actually contributing to the write-set. Such, information can be extracted from the semantic definition of an instruction.

Task 4: Discover aliasing load/stores to normalise the dfg on the decompiled LLVM program Issue4

In the unoptimized decompiled LLVM program, there is no store forwarding, which makes it hard to see the df relation between the aliasing store and subsequent load. Discovering the aliasing relation will drive the LLVM dfg closer to the binary dfg.

Corner cases

Cosider the scenario:

X86
===========
addq $10, %rbx

addq $20, %rbx

jc .L2


L2:
subq %rbx, %rdx // use-set %rbx

LLVM
===========
/*
addq $10, %rbx
*/
%1 = load *RBX
%2 = add 10, %1

/*
addq $20, %rbx
*/
%3 = load *RBX
%4 = add 20, %3

jc .L2


L2:
/*
subq %rbx, %rdx // use-set %rbx
*/
%5 = sub %4, ... 

Case I: Unoptimized: Both llvm dfg is present; unordered Our intention: Before L2 we need to add an invariant rbx == %4

Case II: Optimized: Both llvm dfg is present; re-ordered Our intention: Before L2 we need to add an invariant rbx == %4

Case III: Optimized: Only one of the lvm dfgs are present; Our intention: Before L2 we need to add an invariant rbx == %4