WIP: Substring locations #327

Merged
merged 7 commits into from Nov 13, 2020

Conversation

rakuy0
Contributor

@rakuy0 rakuy0 commented Oct 13, 2020

@atlas0fd00m general thoughts? The unicode portion should be much the same, but I figured it's better to get feedback on the idea now.

@atlas0fd00m
Contributor

failing checks?

Contributor

@atlas0fd00m atlas0fd00m left a comment

some additional docstrs and a few questions? overall, the direction seems pretty good.

@@ -920,6 +920,12 @@ def detectString(self, va):
if loc[L_LTYPE] == LOC_STRING:
if loc[L_VA] == va:
return loc[L_SIZE]
if ord(bytez[offset+count]) != 0:
Contributor

this covers the "find the big string after the little one was already discovered" case.
does this catch the "find the little string in the middle of an existing big string" case?
if so, how?

this is getting complex enough that we probably want some explanation of what's going on, specifically the string-within-a-string use cases and how they're handled.
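
To make the two cases in the question concrete, here is a minimal sketch (not the PR's code) that classifies a candidate string against one already-recorded string. The `(va, size)` tuples and the name `classify_overlap` are hypothetical and exist only for illustration; sizes are assumed to include the NUL terminator.

```python
def classify_overlap(existing, candidate):
    """Classify how `candidate` relates to `existing`.

    Both arguments are hypothetical (va, size) tuples describing
    NUL-terminated strings, with the terminator counted in size.
    """
    e_va, e_size = existing
    c_va, c_size = candidate
    e_end = e_va + e_size
    c_end = c_va + c_size

    if c_end <= e_va or e_end <= c_va:
        return "disjoint"
    if (c_va, c_size) == (e_va, e_size):
        return "same"
    # Candidate starts inside the existing string and shares its end:
    # the "little string inside an existing big string" case.
    if e_va < c_va and c_end == e_end:
        return "suffix-of-existing"
    # Candidate starts earlier and runs to the existing string's end:
    # the "big string found after the little one" case.
    if c_va < e_va and c_end == e_end:
        return "extends-existing"
    return "partial-overlap"
```

A candidate that starts mid-string but does not share the terminator falls into "partial-overlap", which is the case a suffix-only scheme would not represent.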

Contributor

oh, nice point on this only capturing suffixes

Contributor

counterpoint: rather than building the string index into the location database, what about pulling it out into its own object?

class StringIndex:
    def find(self, needle: str) -> Tuple[str, Loc]: ...

i suspect this would be simpler to maintain (esp. when a string is removed; does this ever happen?) and conceivably more performant by using dedicated algorithms/data structures.
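
A slightly fuller sketch of what such an object might look like, under stated assumptions: `Loc` is a hypothetical `(va, size)` tuple rather than vivisect's location tuple, strings are treated as one byte per character, and `find` returns a list of hits instead of the single tuple in the signature above. The linear scan is a placeholder; a dedicated structure could replace it without changing the interface.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a location: (va, size).
Loc = Tuple[int, int]


class StringIndex:
    """Illustrative in-memory string index kept outside the location database."""

    def __init__(self) -> None:
        self._strings: Dict[int, str] = {}  # start va -> string contents

    def add(self, va: int, value: str) -> None:
        self._strings[va] = value

    def remove(self, va: int) -> None:
        # Substring hits are computed on demand, so dropping the parent
        # string is the only bookkeeping a removal needs.
        self._strings.pop(va, None)

    def find(self, needle: str) -> List[Tuple[str, Loc]]:
        """Return (parent string, location of the match) for every occurrence."""
        hits = []
        for va, value in sorted(self._strings.items()):
            start = value.find(needle)
            while start != -1:
                # Assumes one byte per character when mapping back to a va.
                hits.append((value, (va + start, len(needle))))
                start = value.find(needle, start + 1)
        return hits
```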

Contributor Author

I like the idea of a dedicated string database; my only holdup is actually persisting that information when saving the workspace. I don't want to lose the information, and making custom encoders/decoders breaks any kind of possible backwards compatibility.

But it is intentional that this only captures suffix substrings, since when we call makeString we only get the VA and then we make a string based on what we calculate as the length. That's 100% an artifact of the binaries I've been running into recently, where substrings are just suffixes of longer strings.
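
A small illustration of why a VA-only call naturally yields suffixes, assuming plain NUL-terminated byte strings. The helper `string_size_at` is hypothetical and is not vivisect's detection code: any start offset inside a string scans forward to the same terminator.

```python
def string_size_at(bytez, offset):
    """Size of the NUL-terminated string starting at offset, terminator included."""
    end = bytez.index(b"\x00", offset)  # raises ValueError if no terminator
    return end - offset + 1


data = b"Error: file not found\x00"

full_size = string_size_at(data, 0)    # whole string
inner_size = string_size_at(data, 7)   # starts at "file not found"

# Both strings end at the same byte, so the inner one is a suffix.
same_end = (0 + full_size) == (7 + inner_size)
```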

Contributor

yeah, there's some complexity in persisting the database.

what if we rebuilt the database upon each load? if we keep string start locations in the .viv, then it's an easy operation to iterate, extract the strings, and index in memory. upon close, throw out the index.

of course it takes some CPU and memory to keep an index in memory, but i'd argue that until you can prove otherwise, it's likely in line with the overhead of storing all the substrings in the .viv.
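
Roughly what the rebuild-on-load step could look like, as a sketch under assumptions: `string_locs` is a hypothetical iterable of `(va, size)` tuples standing in for the persisted string locations, and `read_memory` is a stand-in callable for whatever the workspace uses to read bytes. Neither name is taken from vivisect.

```python
def build_string_index(string_locs, read_memory):
    """Rebuild an in-memory {va: text} index from persisted string locations.

    string_locs: iterable of hypothetical (va, size) tuples, size including NUL.
    read_memory: callable (va, size) -> bytes.
    """
    index = {}
    for va, size in string_locs:
        raw = read_memory(va, size)
        # latin-1 maps every byte to a character, so decoding never fails.
        index[va] = raw.rstrip(b"\x00").decode("latin-1")
    return index


# Tiny fake memory map for demonstration.
BASE = 0x1000
MEM = b"abc\x00hello\x00"


def fake_read(va, size):
    return MEM[va - BASE: va - BASE + size]


rebuilt = build_string_index([(0x1000, 4), (0x1004, 6)], fake_read)
```

Nothing from the index is written back to disk in this sketch; only the string start locations need to persist.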

Contributor

also, could build the index lazily on first access which would help the common case where substrings aren't accessed.
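
A minimal sketch of the lazy variant, assuming a hypothetical `builder` callable that returns a `{va: text}` mapping. The class name and the `build_count` counter are illustrative only; the counter is there to show that the build happens once, on first use.

```python
class LazyStringIndex:
    """Defers building the index until the first lookup."""

    def __init__(self, builder):
        self._builder = builder   # callable returning {va: text}
        self._index = None
        self.build_count = 0      # demonstration aid, not needed in practice

    def _ensure_built(self):
        if self._index is None:
            self._index = self._builder()
            self.build_count += 1
        return self._index

    def find(self, needle):
        """Return the start addresses of all strings containing needle."""
        index = self._ensure_built()
        return sorted(va for va, text in index.items() if needle in text)

    def invalidate(self):
        # Call when strings are added or removed; the next find() rebuilds.
        self._index = None
```

Workspaces that never query substrings pay nothing beyond holding the builder reference.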

Contributor Author

I wouldn't be against rebuilding the database on every load. Wouldn't be the worst thing in the world.

If we do go that route, I can see two possible ways of doing things. First is pushing some logic into the add location handler in vivisect/base.py so that on a makeString/makeUnicode call it does something similar-ish to what I've got here, in that it handles what the base string is, indexes, and so forth. Mostly so when we load an existing workspace, we end up with the same info as when we directly analyze a binary, since workspace loading only ever sees the event stream. Second is similar to the first, except we plumb a new event type that handles all that for us.
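
To illustrate the first option, here is a toy model (not vivisect's actual event machinery) in which the index is updated inside the add-location handler, so that live analysis and replaying a saved event stream end in the same state. `MiniWorkspace`, `ADD_LOCATION`, `LOC_STRING`, and the four-field payload are all hypothetical names made up for this sketch.

```python
ADD_LOCATION = "addlocation"   # hypothetical event type
LOC_STRING = "string"          # hypothetical location type


class MiniWorkspace:
    """Toy workspace whose only state is an event log and a string index."""

    def __init__(self):
        self.events = []
        self.string_index = {}

    def _handle_add_location(self, payload):
        va, size, ltype, text = payload
        if ltype == LOC_STRING:
            self.string_index[va] = text

    def fire_event(self, etype, payload):
        # Every change goes through here, whether live or replayed.
        self.events.append((etype, payload))
        if etype == ADD_LOCATION:
            self._handle_add_location(payload)

    @classmethod
    def from_events(cls, events):
        """Simulate loading a saved workspace by replaying its event stream."""
        vw = cls()
        for etype, payload in events:
            vw.fire_event(etype, payload)
        return vw


live = MiniWorkspace()
live.fire_event(ADD_LOCATION, (0x1000, 4, LOC_STRING, "abc"))
loaded = MiniWorkspace.from_events(live.events)
```

Because the handler is the single place the index is touched, nothing extra has to be persisted beyond the events themselves.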

I like the idea of a dedicated string database, as I think it would give us (easier) finer-grained control over deletion, since right now it's "delete the main string and the substrings go with it". It's more a matter of doing it now or doing it later.

@@ -966,6 +972,12 @@ def detectUnicode(self, va):
loc = self.getLocation(va+count)
if loc:
if loc[L_LTYPE] == LOC_UNI:
if loc[L_VA] == va:
Contributor

ditto

atlas0fd00m previously approved these changes Nov 9, 2020
@rakuy0 rakuy0 merged commit ec75cf0 into master Nov 13, 2020
@rakuy0 rakuy0 deleted the rakuyo_substrings branch November 13, 2020 03:55
atlas0fd00m added a commit to atlas0fd00m/vivisect that referenced this pull request Jan 14, 2021
* first stab at substring

* substring tests

* words and tests

* do the thing for unicode too

* trailing whitespace

* well found the failure

Co-authored-by: atlas0fd00m <atlas@r4780y.com>
atlas0fd00m added a commit that referenced this pull request Aug 26, 2022
here marks the end of an era...
farewell oh long-toothed PR.  long live symbolik switchcase analysis!
much love and pain hath gone into thy creation and refinement.  go forth and make life better for all viv users.

* symboliks-based switch-case analysis for arbitrary platforms.  breaks on
some graph core changes, but will fix that before making a pull request.

* still truing up switch cases after merge.  added thunk_bx detection as
it is helpful in identifying variables used in switch case analysis

* more truing

* GitEye .gitignore

* DynamicBranches analysis turned into a default VaSet for every architecture, and cleanup

* unittest: Viv test_vivisect check graph edges as well as nodes.

* updating documentation and MAX case count.

* cleanup some of the functionality we pulled out of the switchcase commit
(supporting functionality has been committed in other branch/merges)

* more cleanup

* fixes/changes to use DynamicBranch VaSet for switch-case analysis

* print statements with ()'s for pyV3

* initial break up of makeSwitch.  more work to be done.

* make SwitchCases vaset for every workspace by default.

* rearranging switchcase analysis code into different functions.  UGLY but
broken up and working.  prettification to follow.

* minor bugfix: satvals not initialized if determineCaseIndex fails (not
that it matters, since the result is the same)

* bugfixes, notes, improvements

* more correct-ish analysis for ptr-sized index-deltas

* further refinements: comments

* transient - don't run from here.  working on details.

* it works again, without that nasty xref/_event_list-chop hack.

* cleanup and additions to SYMT_*

* abstracted out which registers are used to hold a image base value into
the architecture class.

* abstracted out which operands are valid for the "jmp" instruction

* moved various architecture-specific things into the Architecture module.

* update getSymbolikPathsTo to include a graph arg

* indentation bugfix

* switchcase - skip "calls"

* clean up artifacts of previous dynamic branch tracking.
BUGFIX: viv unit test

* updates for graphcore changes

* more cleanup including logging

* lots of little things.

* mark array entries as numbers (offsets) or pointers.

* test using codeblocks instead of individual instructions to determine starting point.

* lots of beautification

* reintroduce comparison of opcodes versus codeblocks.
if upper bound not set, use MAX_CASES and let the pointer-checking and xref-checking limit the upper bound.  need more testing and better options.

* more work...

* Merge branch 'master' into atlas_switchcase_analysis

# Conflicts:
#	vivisect/tools/graphutil.py

* lots of changes. working on symboliks switchcase analysis v2.0

* lots of changes to switchcase v2.  original version is broken.

* bugfix in v1 code (introduced while trying to modularize)

* updates to switchcase v2.0

* working.  improved boundary ident.  NOT USABLE!

* continues to not work...

* v2: not quite right, but making huge progress.  naming is off, and haven't verified all the wiring yet.  but we're getting through the entire process (for better or worse)

* SAVEGAME - don't use... doesn't work!  but we're making headway with v2.0

* improvements, still not good.  about to add "getNormalizedConstraints()" so we can reduce the madness.

* still not working, but getting closer.  too many moving parts, trying to capture the progress

* tweak.

* lots of headway with libc-2.13.so (32bit PIC).  hopefully didn't break everything else.

* using timed path generator (20secs), add comment at jmp, lots of debugging changes.

* breakthrough?  seem to have trued up the baseoff/lower/upper stuff.  at least 32bit libc-2.13.so seems to like it.

* working better.  filtering out non-switch cases

* bugfix: don't overwrite the thunk_bx table each time you load the workspace!

* lots of switchcase analysis cleanup.

* bugfix.  merge fail.

* fix for @rakuyo's Symboliks cleanups

* bugfix: "upper" index really needs to be included in the switchcases.

* cleanup, CASE_FAILURE, tracking completed dynbranches (reducing analysis time by reducing duplication)

* "done" list, skipping analyzing dups, improved logging

* bugfix.  skipped too early.

* update cli addition of array-based switchcases (without all the smarts and analysis)

* woops.  need this code too (link_up in particular)

* make room for symswitchcase and switchcase (the original MS VS handler) to live side-by-side.

* make non-Windows targets use Symbolik Switchcase Analysis

* disable DEBUG logging for symswitchcase.py

* minor bug fixes in ARM disassembly and emulation.  helps get rid of some of the unittest error messages (yes, that's why they're there ;) (#305)

* revsync support (#304)

* make each parser add a sha256 hash of the file loaded

* refining approach to get bytes if they're possible.

* bugfix: addFileMeta requires a filename!
also changed Vivisect Extensions such that not only will .py files in the directory path be checked for vivExtension() functions, but so will directory/__init__.py files in the extensions directory.  this is intended to allow plugin/extensions to be self-contained within a directory and be copied or symlinked into a path that's in $VIV_EXT_PATH.

* comment per snickety @rakuyo ;)

* make blog md5 calc off preloaded bytes instead of file

* py3 it! (file->open)

* bugfix since msgpack added strict_map_key and we break that (#307)

thanks @rakuy0 for verification and pushing back for best quality.

* control creation of .viv directory (#310)

* control creation of .viv directory

* docstrings and rename param

* remove unnecessary param

* Even More Syntax Cleanup (#293)

A lot of cleanup things in prep for a python 3 transition. Getting rid of the old exception syntax, converting prints over to logging, cutting random scraps of code to be proper unit tests,  cut away some older bits of code, etc.

This still works in python2. It's just a lot of tidying up. There are no major functionality changes.

* setSymKid Speedup (#309)

A 40-60 percent reduction in runtime for symbolik reduction. Makes it so if the parent cache of a symbolik object is empty when we call setSymKid, we don't traverse all the way up the tree, due to some assumptions about how the caches get populated. See the setSymKid docstring in vivisect/symboliks/common.py for more information.

* A wild changelog appeared! (#312)

Add initial changelog.

* a few mods to enhance the CLI helpers for wiring up switchcases.

* a couple bugfixes and tweaks

* normalizing analysis modules

* bugfix: old switchcase analysis was *not* a fmod.  it hooked directly into the DynamicBranchHandlers.

* minor bugfix for handling deleted codeblocks (#317)

* getOperAddr normalization (#316)

* normalizing the prototype for getOperAddr() in i386, and returning None for non-deref operand (default).

* might as well update the "abstract" base class

* more in line with other getOperAddr()

* minor change to allow access to the tuple and lists of ARM registers … (#315)

* minor change to allow access to the tuple and lists of ARM registers (used in external tools).  this brings the ARM regs.py more inline with the other architectures' regs.py

* made the changes to arm_regs and arm_regs_tup, but didn't update the references.  this actually makes accessing registers more efficient :)

* import emulator to handle dynamic branches (switchcases) using only xrefs (#314)

* modify import emulator to handle dynamic branches (switchcases) using only xrefs.

* bugfix: forgot that getBranches returns REF_CODE/BR_DEREF options which are *not* direct code branches (eg. PLT).
added __ctype_b_loc to impapi

* Fix: syntax error discovered by pytocs (#318)

* Fix: Non-terminated string constant

`visgraph/renderers/svgrend.py` is missing a terminating apostrophe on line 41.

* Incorrect extra closing parenthesis removed

* Possible missing comment character '#'

Co-authored-by: atlas0fd00m <atlas@r4780y.com>

* IMAGE_FILE defs and honor NXCOMPAT (#319)

* Add File Header Defs and honor NX compat in DLLs

* little more stricter (since exe imgs can have it set to it)

* add small unit test on the memory maps

* cleanup per @rakuyo

* first run: symbolik switchcase unittest.

* making the symswitchcase build test cases

* Msgpack storage module (that works in py2 and py3) (#321)

* add msgpack storage module

* encode mmaps

* cross version

* derp

* add unit tests for mpfile

* Bug hunting (#320)

* makePointer returns a tuple

* we bail on the first failure in carving

* unittests (and make carve mark the right dead data)

* rejigger some of the tests and fix a minor bug

* cleanup symswitchcases.py
some unittest work (not sure it's in working state yet)

* reorder/rearranging code/comments

* cleanup some more

* fixups per @rakuyo
bugfix:  thunk_bx only exists on i386, so it shouldn't be checked if it doesn't exist!
additional cleanup

* bad intel. no soup for you. (#326)

* symswitchcase and unittest updates.  not done yet.  savegame while merging in the latest master.

* one more mod, per @rakuyo (testing things out)

* Substring locations (#327)

* first stab at substring

* substring tests

* words and tests

* do the thing for unicode too

* trailing whitespace

* well found the failure

Co-authored-by: atlas0fd00m <atlas@r4780y.com>

* cobra: don't configure logging for everyone upon import (#330)

* November bughunt (#329)

* We decode ud* correctly, but never really added a INS_* definition on it. So added those.
* int1/icebp support since we apparently never had that
* Making sure we don't codeflow past those and bring in a fix from my py3 branch on not code flowing past hlt instructions.
* Even if we fail on codeblock addition, we at least we can add the metadata to the function dictionary.
* We parse but do not make accessible the fixed file info from a PE file (should it be 
* Enable ARM analysis on PE 
* Pathcount in the UI had succumbed to bitrot and needed 
* Some symbolik reduction code coverage
* Address #322

Co-authored-by: atlas0fd00m <atlas@r4780y.com>

* Speed up for setSymKid, and a few decoding fixes (#332)

* vivisect: don't configure logging within library code (#334)

this stomps on the configuration provided by applications

* cleanup

* make walker test happy (falsely!  it's all a lie!)

* added amd64/ls switchcase test
minor mods to symswitchcase
cli switch bugfix

* oops.  turn off DEBUG log-level setting

* quiet!

* set recursion limit to 5000, code and unittest cleanup

* special exception handling for being unable to determine the Complex SymIdx.
improved filtering for unittest (testelf) names, and updates for the data files, to reduce "ptr_*" names.

* mods to testelf.py

* merge fail

* bugfix in ihex vstruct defs, and improved testelf

* cleanup unittests

* symswitchcase polishing... (partway done)

* reordering a few of the functions, and adding "mid-level functions" so we now have "low," "mid," and "high" level functions.  attempting to sort out the complexity.

* cleanup symswitchcase in-module notes.

* symswitchcase.py cleanup and documenting.  almost done.

* vivisect.helpers.getTestWorkspace() now loads binaries as well as .viv files

* configurable switchcase analysis parameters

* fixed switchcase unittests.

* touch up unittests

* py3

* logging changes, and bugfix for tgtva being None...

* make switchcase work on py3 (next/__next__)

* logging changes and unit test updates (inc py3 bugfixes)

* switchcase tests!

* demoting pointer log messages from info to debug

* more unittest tweaks and improvements for symswitchcase

* Bugfix: remove duplicate loading of each module (#374)

* Default VIV_EXT_PATH of ~/.viv/plugins
* Fix module name (no longer "viv_ext") so "from ." works correctly
* Add vw to namespace (some indication of running as an extension)

* oops, removed this as superfluous too quickly.

* dynamic dialog box helper (#376)

* dynamic dialog box helper

* actually do something with defaults, and update docstr to help.

* warning and informational messages too.

* bugfixes in QtCore and other imports

* cleanup and additions to the README

* oops, removed this prematurely.

* cleanup example gui extension

* Update README.md

thanks @williballenthin!

Co-authored-by: Willi Ballenthin <willi.ballenthin@gmail.com>

Co-authored-by: Willi Ballenthin <willi.ballenthin@gmail.com>

* supporting more than just EBX in 32-bit Intel PIC support.  this seeks out all the major registers to be used as thunks.
now we need to figure out why we *lost* switchcases from ld-2.31.so

* yay!  got 16 switches in LD, up from 12 in the old code and 10 in the most recent commit.  we had gained 4 and lost 6... now we have them all :)

* bugfix if iterJumpTable() doesn't actually iterate anything.
DynamicBranches update in PE unittest

LD unittests

* more unittest updates

* minor bugfix: check upper is not None before comparison with lower.

* tweaks

* drag out the Pathing Timeout to the SwitchCase.__init__() and increase to 60 seconds.

* moving down the road to merge-readiness..

* couple bugfixes

* helper to get early (light-weight) access to the symboliks of the last codeblock for switch-case-vetting (ie. determining that something isn't a switchcase before expending a ton of resources on it)

* refinements and small bugfixes

* wow, this has been long in coming.  i'm guessing Visi fixed the "add*Prop" but never fixed the "del*Prop"

* continued work

* minor updates, including a path-converter for using getHierPaths*() for Symboliks

* improve getFuncCbRoutedPaths:
* reduceGraph - ripping out nodes that aren't part of the desired path before pathing begin
* weighted node-checks - only check for loops when the target node weight <= current node weight

* quiet down the log messages a little

* set the timeout much higher in hopes that we can maintain a low test-time while catching 4-8 more switches.

* modified to make default timeout=45, but with the ability to rerun with higher timeouts.

* garbage collect after each `analyzeFunction()` (hope to pacify CircleCi's memory management)

* updated tests

* update tests

* gc

* undo the gc damage

* seemingly dramatic improvement on SymSwitch loop-checking

* symswitchcase config (renamed from switchcase to avoid conflict with non-symboliks switchcase info)
added `timeout_secs` and defaulted to 45 secs

* set switchcase timeout for unittests to 30secs

* tone down SymSwitch logging a little.

* clean up unittests

* bugfix in unittest

* clearRouting() in effort to limit RAM usage (to avoid the OOMs we've been getting in CircleCI)
unittest change due to limiting switchcase analysis

* try different gc parameters

* bugfix in testswitches
update data for stabilitydata

* catching testelf bins with limited timeout
updating timeouts to 10secs

* update test timeouts to 30secs (10secs completes the tests! yay!)

* tweak unittests

* add Timeout value to the SwitchCases_TimedOut VaSet (track how long we've spent trying to analyze each one)

add "Reanalyze" ContextMenu item for each va in the SwitchCases_TimedOut VaSet
add "newthread" capabilities to ACT function wrapper to fire a thread for menu actions

* main menu entry to reanalyze timed-out switchcases

* SymSwitch hackathon with @rakuy0

* reduce cli manual switchcase option which was basically superfluous.

* touch-ups per @rakuy0

* fix tests
fix comment ;)

* fix tests

* damn, i updated MockVar, not MockVw.
fixing.

* improved log messages
walker test finalized and wrapped in
better unittest-generator helpers

* mods per @rakuy0

* cleanup per @rakuy0

* cleanup per @rakuy0

* cleanup

* cleanup and relocation per @rakuy0

* last cleanup of register groups (for this PR)

* cleanup

Co-authored-by: atlas <atlas@grimm-co.com>
Co-authored-by: James Gross <45212823+rakuy0@users.noreply.github.com>
Co-authored-by: John Källén <uxmal@users.noreply.github.com>
Co-authored-by: Willi Ballenthin <william.ballenthin@fireeye.com>
Co-authored-by: Willi Ballenthin <willi.ballenthin@gmail.com>
4 participants