[MRG] Parallelized compare function with multiprocessing #709

Merged
merged 59 commits into from
Sep 6, 2019

Conversation

pranathivemuri
Contributor

@pranathivemuri pranathivemuri commented Aug 7, 2019

Hi! I added parallelization using multiprocessing, building on @olgabot's parallelize-compare branch and PR #666. This PR speeds up similarity calculations across a large number of signatures. With about 50k signatures, there are roughly 1.2 billion pairwise combinations to compare. pool.imap from Python's multiprocessing module iterates over these comparisons lazily and processes them in parallel without copying all the variables into each process, so the job completes much faster than it would serially. For a large number of files we ran at CZ Biohub, it cut computation time from days to hours.
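
As a rough illustration of the approach described above (not the code in this PR), the sketch below spreads the rows of a pairwise similarity matrix across worker processes with multiprocessing.Pool.imap. Plain Python sets and a Jaccard index stand in for sourmash signatures, and the names pair_count, jaccard, similarities_for_row, and compare_all are hypothetical, chosen only for this example.

```python
from functools import partial
from multiprocessing import Pool


def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def jaccard(a, b):
    """Jaccard index of two hash sets (a stand-in for a signature comparison)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarities_for_row(index, sketches):
    """Compare sketch `index` with every later sketch (upper triangle only)."""
    current = sketches[index]
    return [jaccard(current, other) for other in sketches[index + 1:]]


def compare_all(sketches, n_jobs=1, chunksize=1):
    """Return a symmetric similarity matrix as a list of lists."""
    n = len(sketches)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Assumption of this sketch: the sketch list is bound into the task function.
    row_func = partial(similarities_for_row, sketches=sketches)

    def fill(rows):
        # Rows arrive in index order, so enumerate() recovers the row number.
        for i, row in enumerate(rows):
            for offset, value in enumerate(row, start=1):
                matrix[i][i + offset] = value
                matrix[i + offset][i] = value

    if n_jobs == 1:
        # Serial fallback: same row function, no pool overhead.
        fill(map(row_func, range(n)))
    else:
        with Pool(n_jobs) as pool:
            # imap yields results lazily and in order; consume them before
            # the pool is torn down at the end of the with-block.
            fill(pool.imap(row_func, range(n), chunksize))
    return matrix
```

A call such as compare_all(sketches, n_jobs=4) should be made under an if __name__ == "__main__": guard on platforms that spawn rather than fork worker processes. Because the sketch list is bound with functools.partial here, it is pickled along with each chunk of tasks, which a real implementation would likely want to avoid for large inputs. For scale, pair_count(50_000) is 1,249,975,000, consistent with the figure of roughly 1.2 billion combinations.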

Next TODOs that we discussed at Biohub with @olgabot and phoenix (since we intend to use sourmash on a large number of signature files):

  1. Parallelize inside the get_similarities_at_index function using Cython or Rust?
  2. See if we can bring the time to compute a signature down from 0.0006 seconds
  3. Output data formats: Parquet/HDF5, since for a large number of combinations the CSV file is as big as 50 GB
  4. GPU code for the multiprocessing stuff

  • Is it mergeable?
  • make test: Did it pass the tests?
  • make coverage: Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

olgabot and others added 30 commits April 8, 2019 12:58
Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>
@pranathivemuri pranathivemuri changed the title Parallelized compare function [WIP] Parallelized compare function Aug 7, 2019
@pranathivemuri pranathivemuri changed the title [WIP] Parallelized compare function [WIP] Parallelized compare function with multiprocessing Aug 7, 2019
@codecov

codecov bot commented Aug 7, 2019

Codecov Report

Merging #709 into master will increase coverage by 0.01%.
The diff coverage is 91.86%.

@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
+ Coverage   89.26%   89.27%   +0.01%     
==========================================
  Files          27       29       +2     
  Lines        4303     4374      +71     
  Branches       45       45              
==========================================
+ Hits         3841     3905      +64     
- Misses        460      467       +7     
  Partials        2        2
Impacted Files          Coverage Δ
sourmash/commands.py    88.72% <100%> (-0.08%) ⬇️
sourmash/np_utils.py    100% <100%> (ø)
sourmash/compare.py     89.55% <89.55%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cbcd06...b783856. Read the comment docs.

@pranathivemuri
Contributor Author

@luizirber the PR is ready for your feedback! Please review when you can

@pranathivemuri pranathivemuri changed the title [WIP] Parallelized compare function with multiprocessing Parallelized compare function with multiprocessing Aug 8, 2019
from .logging import notify


def compare_serial(siglist, ignore_abundance, downsample):
Collaborator

It looks like the "downsample" argument isn't used in the function call. Am I missing something?

Contributor Author

fixed it

Collaborator

@olgabot olgabot left a comment

Looks great!! So excited

@olgabot
Collaborator

olgabot commented Sep 4, 2019

Ping @luizirber @ctb @taylorreiter @standage -- Let us know if there is anything else we can address!

Contributor

@ctb ctb left a comment

Other than a few minor comments, this looks great - thank you!

sourmash/commands.py (resolved)
@@ -784,6 +784,37 @@ def test_do_basic_compare(c):
    assert (cmp_out == cmp_calc).all()


@utils.in_tempdir
def test_do_basic_compare_parallel(c):
    # try doing a basic compare
Contributor

Please update the comment to indicate this is the parallel test (yeah, I'm being a bit pedantic, since the function name is quite clear :)

sourmash/compare.py (resolved)
@pranathivemuri
Contributor Author

@ctb @olgabot I addressed your comments. Please review and merge when you are free

@ctb
Contributor

ctb commented Sep 6, 2019

Nice work!

@ctb ctb merged commit 605b747 into sourmash-bio:master Sep 6, 2019
@ctb ctb changed the title Parallelized compare function with multiprocessing [MRG] Parallelized compare function with multiprocessing Sep 6, 2019
@pranathivemuri pranathivemuri deleted the pranathi-parallelize-compare branch September 6, 2019 20:58
5 participants