Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Improving MinHash.remove_many(...) performance #1571

Merged
merged 8 commits into from
Jun 7, 2021
Merged

[MRG] Improving MinHash.remove_many(...) performance #1571

merged 8 commits into from
Jun 7, 2021

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Jun 6, 2021

⚠️ This PR will touch all the layers (Python, C header, Rust).


First time contributing: https://orcid.org/0000-0002-3419-4785

@codecov
Copy link

codecov bot commented Jun 6, 2021

Codecov Report

Merging #1571 (f6669f2) into latest (01de852) will increase coverage by 8.14%.
The diff coverage is 91.30%.

❗ Current head f6669f2 differs from pull request most recent head 9c2d6fa. Consider uploading reports for the commit 9c2d6fa to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1571      +/-   ##
==========================================
+ Coverage   80.96%   89.11%   +8.14%     
==========================================
  Files         102       75      -27     
  Lines       10299     6621    -3678     
  Branches     1165     1170       +5     
==========================================
- Hits         8339     5900    -2439     
+ Misses       1751      515    -1236     
+ Partials      209      206       -3     
Flag Coverage Δ
python 89.11% <91.30%> (+0.08%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/sbt_storage.py 85.09% <89.79%> (+2.19%) ⬆️
src/sourmash/sbt.py 81.04% <94.11%> (+0.25%) ⬆️
src/sourmash/minhash.py 90.57% <100.00%> (+0.04%) ⬆️
src/core/src/index/linear.rs
src/core/src/errors.rs
src/core/src/lib.rs
src/core/src/index/sbt/mod.rs
src/core/src/index/sbt/mhbt.rs
src/core/src/ffi/mod.rs
src/core/src/index/storage.rs
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb723a4...9c2d6fa. Read the comment docs.

@mr-eyes
Copy link
Member Author

mr-eyes commented Jun 7, 2021

I am having a problem running make benchmark or tox -e asv for the new written benchmark function but here are the results of profiling the new function.

Benchmark results

Time Profiling

Timer unit: 1e-06 s

Total time: 3.96304 s
File: profile_remove_many.py
Function: old at line 15

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           @profile
    16                                           def old():
    17         1    3962976.0 3962976.0    100.0      mh1.remove_many(l)
    18         1         67.0     67.0      0.0      assert len(mh1) == 0

Total time: 5.7e-05 s
File: profile_remove_many.py
Function: new at line 21

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    21                                           @profile
    22                                           def new():
    23         1         50.0     50.0     87.7      mh2.remove_many(rm_mh)
    24         1          7.0      7.0     12.3      assert len(mh2) == 0

Memory profiling

Filename: profile_remove_many.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    15 3909.449 MiB 3909.449 MiB           1   @profile
    16                                         def old():
    17 3909.656 MiB    0.207 MiB           1       mh1.remove_many(l)
    18 3909.656 MiB    0.000 MiB           1       assert len(mh1) == 0


Filename: profile_remove_many.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    21 3909.656 MiB 3909.656 MiB           1   @profile
    22                                         def new():
    23 3909.656 MiB    0.000 MiB           1       mh2.remove_many(rm_mh)
    24 3909.656 MiB    0.000 MiB           1       assert len(mh2) == 0

Reproduce the profiling

profile_remove_many.py

from sourmash import MinHash

l = list(range(100000000))

mh1 = MinHash(500, 21, track_abundance=False)
mh1.add_many(l)

mh2 = MinHash(500, 21, track_abundance=False)
mh2.add_many(l)

rm_mh = MinHash(500, 21, track_abundance=False)
rm_mh.add_many(l)


@profile
def old():
    mh1.remove_many(l)
    assert len(mh1) == 0


@profile
def new():
    mh2.remove_many(rm_mh)
    assert len(mh2) == 0

old()
new()

reproduce

# time
kernprof -l profile_remove_many.py
python -m line_profiler profile_remove_many.py.lprof

# memory
python -m memory_profiler profile_remove_many.py

@mr-eyes
Copy link
Member Author

mr-eyes commented Jun 7, 2021

Another simple profiling focusing on time

add_many(iterable) took: 362.20860481262207 ms
add_many(MinHash) took: 0.09489059448242188 ms

from sourmash import MinHash
from time import time

l = list(range(int(10000000)))

mh1 = MinHash(500, 21, track_abundance=False)
mh1.add_many(l)

mh2 = MinHash(500, 21, track_abundance=False)
mh2.add_many(l)

rm_mh = MinHash(500, 21, track_abundance=False)
rm_mh.add_many(l)

def old():
    mh1.remove_many(l)
    assert len(mh1) == 0


def new():
    mh2.remove_many(rm_mh)
    assert len(mh2) == 0

t = time()
old()
print(f"add_many(iterable) took: {1000 * (time() - t)} ms")

t = time()
new()
print(f"add_many(MinHash) took: {1000 * (time() - t)} ms")

@mr-eyes mr-eyes changed the title [WIP] Improving remove_many performance [MRG] Improving remove_many performance Jun 7, 2021
@mr-eyes
Copy link
Member Author

mr-eyes commented Jun 7, 2021

Would you please review it? @luizirber @ctb

@ctb
Copy link
Contributor

ctb commented Jun 7, 2021

This looks good to me but we should wait for @luizirber if we can :)

@luizirber
Copy link
Member

This looks good to me but we should wait for @luizirber if we can :)

Good? GOOD? This is awesome! It's five orders of magnitude faster! =]
(will actually look at the code now, but really liked the benchmarking, Mo!)

Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks great too, thanks @mr-eyes!

@luizirber
Copy link
Member

For fixing the two checks:

@ctb
Copy link
Contributor

ctb commented Jun 7, 2021

sorry, updated from latest before I saw luiz's latest comments! will wait to merge.

@ctb ctb changed the title [MRG] Improving remove_many performance [MRG] Improving MinHash.remove_many(...) performance Jun 7, 2021
include/sourmash.h Outdated Show resolved Hide resolved
@@ -415,6 +415,14 @@ impl KmerMinHash {
};
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra empty line here, remove it:

Suggested change

mr-eyes and others added 2 commits June 7, 2021 20:11
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improving remove_many performance
3 participants