First prototype. Do not use in production!
This project tries to provide additional Ghidra Version Tracking Correlators suitable for patch diffing.
In Ghidra: File
-> Install Extensions
hit the top right green +
icon; then select the ghidra_<VERSION>_PatchDiffCorrelator.zip
(that you either build from source with the GhidraDev plugin or downloaded pre-build from the releases section; please make sure VERSION
matches your Ghidra version!)
Then restart Ghidra.
A simple introduction video:
Simple workflow:
- Run the Automatic Version Tracking Command.
- Run a
Bulk * Match
Correlator withOnly match accepted matches
select. This will produce a scoring for your accepted matches for similarity of the functions.
Advanced workflow:
While the Automatic Version Tracking Command find very good matches by running the included Correlators in their defined ordered and automagically accepting good matching, it takes more time than only running the Correlators you need to get your matches. This can be done via (but may vary depending on binary):
- Run the
Exact Symbols Name Match
Correlator if there are symbols. - Run the
Exact Function * Match
Correlators. Accept
all matched functions.Accept
suitableImplied Matches
- Run some
Reference
Correlators. Accept
matches.- Repeat "conventional" matching until the function you are after has been accepted.
- Run a
Bulk * Match
Correlator withOnly match accepted matches
select. This will produce a scoring for your accepted matches for similarity of the functions.
- The symbol matcher in these correlators is not as good as the included
Exact Symbols Name Match
Correlator. - These matchers are slower than the included
Exact Function * Match
Correlators, so you should run the included ones first, then exclude them (via selection) from running through the patch diff correlators. - The
Bulk Basic Block Mnemonics Match
Correlator is good for finding basic block changes. - The
Bulk Mnemomics Match
Correlator is robust against instruction reordering performed by compilers.
This adds additional Program Correlators to Ghidra. These are - unlike the
Correlators that ship with Ghidra - able to produce Matches with a Similarity Score
below 1.00
. This means these correlators give an estimate how similar functions
are to one another instead of providing perfect matching as the included correlators.
This indicator on similarity is need to find patches in functions.
The Bulk Instructions Match
Correlator will make an unordered bulk list of Instructions
occurring in a function.
Let's say we have the function:
PUSH EBP
MOV EBP,ESP
SUB ESP,0x8
MOV ESP,EBP
POP EBP
RET
Then the Correlator would "bulk" this to the following list of features:
MOV EBP,ESP
MOV ESP,EBP
POP EBP
PUSH EBP
RET
SUB ESP,0x8
If we now have a function:
PUSH EBP
MOV EBP,ESP
SUB ESP,0x42
MOV ESP,EBP
POP EBP
RET
With features:
MOV EBP,ESP
MOV ESP,EBP
POP EBP
PUSH EBP
RET
SUB ESP,0x42
It would match 5 out of 6 features of the earlier function.
The matching is unordered - hence the notion of "bulk".
So a function of (warning: doesn't make sense):
SUB ESP,0x8
POP EBP
MOV EBP,ESP
PUSH EBP
MOV ESP,EBP
RET
Would still match 6 of 6 with the original function, because of the unordered bulk comparison logic.
The Bulk Mnemonics Match
Correlator only adds the instruction mnemonics to the feature bundle for matching.
If you have the function:
PUSH EBP
MOV EBP,ESP
SUB ESP,0x8
MOV ESP,EBP
POP EBP
RET
Then the Correlator would "bulk" this to the following list of features:
MOV
MOV
POP
PUSH
RET
SUB
If we now have a function:
PUSH EBP
MOV EBP,ESP
SUB ESP,0x42
MOV ESP,EBP
POP EBP
RET
With features:
MOV
MOV
POP
PUSH
RET
SUB
would match 6 of 6.
Same unordered remarks as in the [Bulk Instructions Match] Correlator apply.
The Bulk Basic Block Mnemonics Match
Correlator first converts the mnemonics
of each basic block into a list. That list is sorted and hashes (so the order of the mnemonics
within the basic block don't matter). Then these basic block hashes are compared
between functions in an unordered bulk comparison.
There are several options:
Minimum similarity threshold (score)
: Only return matches that have a score higher than this threshold.Minimum confidence threshold (score)
: Confidence is ranked as follows (but may change in the future):1.0
or0.000
inlog10
: When symbols don't match10.0
or1.000
inlog10
: When symbols match
Symbol names must match
: Only match functions when their symbol names match- Warning: If you disable this make sure to set
Minimum similarity threshold
to something reasonable, otherwise you get the cross-product of all the functions in both binaries, e.g. if the source program has 100 functions and the destination also 100 and no threshold is specified, you'd get100 * 100 = 100000
matches!
- Warning: If you disable this make sure to set
Ignore undefined symbols (FUN_)
: Settings this won't use the default labels assigned to undefined symbols for symbol name matching. So it won't matchFUN_00001234
toFUN_00001234
.Only match accepted matches
: Only calculate the similarity for functions that already have an accepted match entry in the Matches Table. This is the most useful option.
These correlators color address ranges in the Source and Destination Programs that are different.
State: This is a first work in progess prototype to see whether a correlator is
technically able to execute setBackgroundColor()
on the Source and Destination Programs.
This colors basic blocks that are either new or deleted or have a different Mnemonic "Bulk" (see [Bulk Mnemonics Match] for a concept of "Bulk").
Current issues:
- Basic blocks are matched without CFG context.
- Fails badly if a function contains a duplicate basic block and the other program only contains the basic block once.
- Version Tracking Correlators can't (and aren't supposed to) color programs
This is implemented as a script.
- Recent paper summarizing the state of the art on binary code similarity: https://arxiv.org/abs/1909.11424
You can grab just the scripts. They are in PatchDiffCorrelator/ghidra_scripts/.
- Open Source Program.
- In Source Program select the Function you want to compare against the Destination Program (and colorize in the Destination Program).
- Run
FunctionDiffColorizer.java
- Select Destination Program.
- Select Destination Function.
- The changes of the Destination Function in the Destination Program are now colored.
Issues:
- if a function contains duplicate basic blocks but the destination only one it just randomly picks one of them for matching
- complex changes aren't handled optimally
git clone https://github.com/threatrack/ghidra-patchdiff-correlator
# build `ghidra_<VERSION>_PatchDiffCorrelator.zip` from command line
cd ghidra-patchdiff-correlator/PatchDiffCorrelator/
gradle -PGHIDRA_INSTALL_DIR=$GHIDRA_HOME
I have no idea how Eclipse or Java or any of that works so this may be bullshit - but it works!
Just importing the Gradle project doesn't work (unless you got the exact same Ghidra version (and paths?) that were used when developing this.
The easiest way so far (that I found) to import the code in Eclipse for development is:
- You need to have Eclipse setup with the GhidraDev plugin.
git clone https://github.com/threatrack/ghidra-patchdiff-correlator; cd ghidra-patchdiff-correlator; mv PatchDiffCorrelator PatchDiffCorrelator.bak
- In Eclipse: GhidraDev -> New -> Ghidra Module Project...; Project name: PatchDiffCorrelator, Project root directory:
$GITHUB/threatrack/ghidra-patchdiff-correlator
, "NEXT"; Deselect all module templates, "FINISH" - Project -> Import; General -> Filesystem; From directory:
$GITHUB/threatrack/ghidra-patchdiff-correlator/PatchDiffCorrelator.bak
- In
BasicBlockMnemonicFunctionBulker.hashes()
use a proper hashing algorithm to hash the basic blocks. - Optimization:
- Use the
instruction prime products
concept of BinDiff (see https://www.zynamics.com/bindiff/manual/)For this we need a mnemonic to prime number mapping :/- @byte_swap had the idea to work on PCode, so only a mapping PCode -> prime is needed, and that will work on all arches.
- Use the
- Help of the Extension isn't available in Ghidra. Need to figure out how to fix that.
- Figure out this Ghidra bug(?): NationalSecurityAgency/ghidra#1135
- Add option to only return the highest scoring match(es) for each function instead of the cross product of all functions.
- Use
symbol.getSource() == SourceType.DEFAULT
to detect undefined symbols instead of.startswith("FUN_"
. - Either add masking to the
Instructions
bulker, e.g. via:
from ghidra.app.plugin.core.instructionsearch import InstructionSearchApi
from ghidra.app.plugin.core.instructionsearch.model import MaskSettings
InstructionSearchApi().getBinarySearchString(currentProgram,currentSelection.getFirstRange(),MaskSettings(False,False,True))
- ... or remove the
Instructions
bulker ... because it is kind of useless without masking.