[MRG] Improving MinHash.remove_many(...) performance #1571
Codecov Report
@@ Coverage Diff @@
## latest #1571 +/- ##
==========================================
+ Coverage 80.96% 89.11% +8.14%
==========================================
Files 102 75 -27
Lines 10299 6621 -3678
Branches 1165 1170 +5
==========================================
- Hits 8339 5900 -2439
+ Misses 1751 515 -1236
+ Partials 209 206 -3
|
I am having a problem running
Benchmark results
Time Profiling
Memory profiling
Reproduce the profiling
profile_remove_many.py
from sourmash import MinHash

# `profile` is not imported: kernprof and memory_profiler inject it at run time.
l = list(range(100000000))

# Two identical sketches to empty, plus one sketch used as the removal argument.
mh1 = MinHash(500, 21, track_abundance=False)
mh1.add_many(l)
mh2 = MinHash(500, 21, track_abundance=False)
mh2.add_many(l)
rm_mh = MinHash(500, 21, track_abundance=False)
rm_mh.add_many(l)

@profile
def old():
    # Remove by passing the full iterable of hashes.
    mh1.remove_many(l)
    assert len(mh1) == 0

@profile
def new():
    # Remove by passing another MinHash.
    mh2.remove_many(rm_mh)
    assert len(mh2) == 0

old()
new()
reproduce
# time
kernprof -l profile_remove_many.py
python -m line_profiler profile_remove_many.py.lprof
# memory
python -m memory_profiler profile_remove_many.py
|
Another simple profiling run, focusing on time:
from sourmash import MinHash
from time import time

l = list(range(10000000))
mh1 = MinHash(500, 21, track_abundance=False)
mh1.add_many(l)
mh2 = MinHash(500, 21, track_abundance=False)
mh2.add_many(l)
rm_mh = MinHash(500, 21, track_abundance=False)
rm_mh.add_many(l)

def old():
    # Remove by passing the full iterable of hashes.
    mh1.remove_many(l)
    assert len(mh1) == 0

def new():
    # Remove by passing another MinHash.
    mh2.remove_many(rm_mh)
    assert len(mh2) == 0

t = time()
old()
print(f"remove_many(iterable) took: {1000 * (time() - t)} ms")
t = time()
new()
print(f"remove_many(MinHash) took: {1000 * (time() - t)} ms")
|
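Because remove_many empties the sketch it is called on, each function in the script above can only be timed once. A minimal sketch of a repeatable harness follows, assuming the state can be rebuilt cheaply before every run. The helper names `time_once` and `best_of` are hypothetical, and a plain `set` stands in for the sketch so the example runs without sourmash installed. It uses `time.perf_counter`, which is monotonic and intended for measuring short durations.

```python
from time import perf_counter


def time_once(setup, action):
    """Build fresh state with `setup`, then time one call of `action(state)` in ms."""
    state = setup()  # rebuilt outside the timed region
    start = perf_counter()
    action(state)
    return 1000 * (perf_counter() - start)


def best_of(setup, action, repeats=5):
    """Return the fastest of `repeats` independent runs, in ms."""
    return min(time_once(setup, action) for _ in range(repeats))


# Stand-in workload (an assumption, not the sourmash API): emptying a set.
values = list(range(100_000))
elapsed = best_of(lambda: set(values), lambda s: s.difference_update(values))
print(f"difference_update took: {elapsed:.3f} ms (best of 5)")
```

To time the real calls, `setup` would build a fresh MinHash and `action` would call its remove_many with either the list or another MinHash.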
Would you please review it? @luizirber @ctb |
This looks good to me but we should wait for @luizirber if we can :) |
Good? GOOD? This is awesome! It's five orders of magnitude faster! =] |
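A rough way to see where a gap of that size can come from, assuming add_many treats its input as hash values and a sketch built with num=500 retains only the 500 smallest: removing by iterable has to visit every input value, while removing by sketch visits at most 500. The `ToyBottomK` class below is a hypothetical pure-Python stand-in, not sourmash's implementation. It only counts how many candidates each call examines.

```python
import heapq


class ToyBottomK:
    """Hypothetical stand-in for a bounded MinHash: keeps the `num` smallest hashes."""

    def __init__(self, num):
        self.num = num
        self.hashes = set()

    def add_many(self, values):
        # Retain only the `num` smallest distinct values seen so far.
        self.hashes = set(heapq.nsmallest(self.num, self.hashes | set(values)))

    def remove_many(self, other):
        """Remove hashes given as an iterable or as another ToyBottomK.

        Returns the number of candidate values examined.
        """
        candidates = other.hashes if isinstance(other, ToyBottomK) else other
        examined = 0
        for h in candidates:
            examined += 1
            self.hashes.discard(h)
        return examined

    def __len__(self):
        return len(self.hashes)


values = list(range(200_000))
a, b, rm = ToyBottomK(500), ToyBottomK(500), ToyBottomK(500)
for sketch in (a, b, rm):
    sketch.add_many(values)

examined_iterable = a.remove_many(values)  # walks every input value
examined_sketch = b.remove_many(rm)        # walks only the retained hashes
print(examined_iterable, examined_sketch, len(a), len(b))  # 200000 500 0 0
```

With 100000000 inputs, as in the profiling script, the same count ratio would be 2 × 10^5, which is roughly five orders of magnitude.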
Code looks great too, thanks @mr-eyes!
For fixing the two checks:
|
sorry, updated from latest before I saw luiz's latest comments! will wait to merge. |
src/core/src/sketch/minhash.rs
@@ -415,6 +415,14 @@ impl KmerMinHash {
};
}
Extra empty line here, remove it:
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
remove_many performance #1553, and it's related to Ref: remove_many causes prefetch to hang on large-ish metagenome samples #1552.
First time contributing: https://orcid.org/0000-0002-3419-4785