
Running r2pipe Python in batch #123

Closed
zavalyshyn opened this issue Nov 30, 2020 · 2 comments

Comments

zavalyshyn commented Nov 30, 2020

Describe the issue

I'm using r2pipe to extract call-graph info from all the binaries in a given folder. For each binary I open it, run the "aaa" command, and then extract the call graph in r2 commands format with the "agC*" command. There is no specific bug per se: r2pipe works as intended, but it takes quite a lot of time to run through all the binaries.

I've checked the examples folder for how to use r2pipe in batch, but the code there is somewhat simplified.
What would you suggest to improve the runtime?
For instance, do I really need to quit r2 after each file?

How to reproduce?

Here is my code:

import os
import hashlib
from multiprocessing import Pool

import r2pipe

binaries_list = os.listdir(binaries_dir)
batchsize = 1000  # process files in batches of 1000
total_count = len(binaries_list)
hash_db = set()   # md5 digests of call graphs seen so far

def parseglobalcallgraph(filename):
    filepath = os.path.join(binaries_dir, filename)
    r2 = r2pipe.open(filepath, ["-e", "io.cache=true"])
    r2.cmd('aaa')
    gcg = r2.cmd("agC*")  # extract global call graph in r2 commands format
    r2.quit()
    hash_value = hashlib.md5(gcg.encode()).hexdigest()
    return {'hash': hash_value, 'filename': filename}

for i in range(0, total_count, batchsize):
    batch = binaries_list[i:i + batchsize]
    with Pool(processes=10) as pool:
        for res in pool.imap(parseglobalcallgraph, batch):
            if res['hash'] not in hash_db:
                hash_db.add(res['hash'])
                print(res['hash'])
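As an aside, the Pool-plus-dedup pattern in the snippet above can be exercised without radare2 at all. A minimal, self-contained sketch, where the r2pipe parse step is replaced by a stand-in that just hashes a string (so it runs anywhere Python does):

```python
import hashlib
from multiprocessing import Pool

def parse_stub(data):
    # Stand-in for the r2pipe call: hash the "call graph" text directly.
    return hashlib.md5(data.encode()).hexdigest()

if __name__ == "__main__":
    inputs = ["graph-a", "graph-b", "graph-a"]  # one duplicate on purpose
    hash_db = set()
    with Pool(processes=2) as pool:
        # imap yields results lazily, in input order, as workers finish
        for digest in pool.imap(parse_stub, inputs):
            if digest not in hash_db:
                hash_db.add(digest)
    print(len(hash_db))  # 2 unique call graphs
```

The set makes the `else: continue` branch in the original unnecessary; membership testing alone handles the deduplication.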

Expected behavior

I'd expect it to be much faster, but it seems like I'm missing something.


trufae (Contributor) commented Aug 24, 2021

r2pipe is slow, partly because of Python and partly because of the way it reads data from the pipe. You can use the native r2pipe by prefixing the filepath with ccall://, so it will use dlopen(r_core) and make direct C API calls. That will make the script at least 10 times faster.

You can help by improving the r2pipe module and profiling this issue. Other languages don't have this problem.
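A minimal sketch of that suggestion, assuming radare2 and the r2pipe Python module are installed (the helper name native_uri is just for illustration, not part of the r2pipe API):

```python
def native_uri(filepath):
    """Return an r2pipe URI using the ccall:// prefix.

    As described above, the prefix makes r2pipe dlopen() the r_core
    library and issue direct C API calls instead of spawning an r2
    process and talking to it over a pipe.
    """
    return "ccall://" + filepath

# In parseglobalcallgraph() above, the only change would be:
#   r2 = r2pipe.open(native_uri(filepath))
```

Whether command-line flags such as -e io.cache=true are honored the same way in native mode is worth verifying before relying on them.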

zavalyshyn (Author) replied:

Many thanks! I didn't know you could do that with prefixes.
