Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Speed up sourmash gather by ignoring unidentifiable hashes #1613

Merged
merged 22 commits into from
Jun 20, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jun 20, 2021

This PR refactors the search.gather_databases(...) generator into an iterator class GatherDatabases(...) that should be much faster.

The three main optimizations are:

  • when a prefetch is done and we know what the unidentifiable query hashes are, remove them from consideration before doing the gather; this will speed up gather on mostly unidentifiable queries.
  • actually use the newly refactored MinHash.remove_many(...) method from [MRG] Improving MinHash.remove_many(...) performance #1571.
  • only downsample the original query and recalculate the numbers when scaled actually changes; this should speed up gather on large queries.

Should fix #1552 but @bluegenes will need to confirm.

@ctb
Copy link
Contributor Author

ctb commented Jun 20, 2021

@bluegenes I have one test to add, but this is ready to review.

In particular, if you can try this out on #1552, I'd be much obliged :).

@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2021

Codecov Report

Merging #1613 (86fe574) into latest (74de59a) will increase coverage by 8.19%.
The diff coverage is 96.03%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1613      +/-   ##
==========================================
+ Coverage   81.29%   89.49%   +8.19%     
==========================================
  Files         103       76      -27     
  Lines       10485     6843    -3642     
  Branches     1217     1228      +11     
==========================================
- Hits         8524     6124    -2400     
+ Misses       1753      510    -1243     
- Partials      208      209       +1     
Flag Coverage Δ
python 89.49% <96.03%> (+0.06%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/commands.py 86.38% <89.13%> (-0.08%) ⬇️
src/sourmash/search.py 97.66% <100.00%> (+0.41%) ⬆️
src/core/src/sketch/minhash.rs
src/core/src/errors.rs
src/core/src/ffi/signature.rs
src/core/src/ffi/nodegraph.rs
src/core/src/cmd.rs
src/core/src/index/sbt/mod.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/wasm.rs
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 74de59a...86fe574. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented Jun 20, 2021

huh. Using an old copy of @bluegenes hanging sig in #1552, it still takes a really long time to run remove_many at the end.

@ctb
Copy link
Contributor Author

ctb commented Jun 20, 2021

ok, well, fixed (for now) in 00caa41. I'll refigure #1552 and ask @mr-eyes to revisit it after this is merged ;)

@ctb ctb changed the title [WIP] Speed up sourmash gather by ignoring unidentifiable hashes [MRG] Speed up sourmash gather by ignoring unidentifiable hashes Jun 20, 2021
@ctb
Copy link
Contributor Author

ctb commented Jun 20, 2021

Ready for review and merge @bluegenes!

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@ctb
Copy link
Contributor Author

ctb commented Feb 16, 2022

keywords used by genome-grist: known and unknown hashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remove_many causes prefetch to hang on large-ish metagenome samples
3 participants