-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Parallelized compare function with multiprocessing #709
[MRG] Parallelized compare function with multiprocessing #709
Conversation
…tegies on different systems
Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>
…abot/parallelize-compare
… to reduce memory usage, and pooling
Codecov Report
@@ Coverage Diff @@
## master #709 +/- ##
==========================================
+ Coverage 89.26% 89.27% +0.01%
==========================================
Files 27 29 +2
Lines 4303 4374 +71
Branches 45 45
==========================================
+ Hits 3841 3905 +64
- Misses 460 467 +7
Partials 2 2
Continue to review full report at Codecov.
|
@luizirber the PR is ready for your feedback! Please review when you can |
sourmash/compare.py
Outdated
from .logging import notify | ||
|
||
|
||
def compare_serial(siglist, ignore_abundance, downsample): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the "downsample" argument isn't used in the function call. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!! So excited
Ping @luizirber @ctb @taylorreiter @standage -- Let us know if there is anything else we can address! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than a few minor comments, this looks great - thank you!
tests/test_sourmash.py
Outdated
@@ -784,6 +784,37 @@ def test_do_basic_compare(c): | |||
assert (cmp_out == cmp_calc).all() | |||
|
|||
|
|||
@utils.in_tempdir | |||
def test_do_basic_compare_parallel(c): | |||
# try doing a basic compare |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update comment to indicate this is parallel test (yeah, I'm being a bit pedantic, since the function name is quite clear :)
Nice work! |
Hi! I added the parallelizing using multiprocessing forked from @olgabot's parallelize-compare branch and PR - #666 . This PR speeded up the calculation of a large number of signatures similarity. If there are about 50k signatures, there could be 1.2 billion combinations to compare and calculate similarity. The pool.imap from multiprocessing python yields this generator and parallel process the data without copying all the variables into each process efficiently and completes the job much faster than it would have serially. It changed computation times from days to hours for a large number of files we ran in CZ BioHub
TODO's next we were discussing at biohub with @olgabot and phoenix were below (since we intend to use sourmash for large number of signature files):
make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?