Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] subtantial refactoring of CounterGather and related Index code. #2116

Merged
merged 45 commits into from
Jul 16, 2022

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jul 11, 2022

Note: PR into #2111

This PR does a bunch of cleanup and refactoring of gather related things:

  • renames search.make_gather_query to search.make_containment_query
  • adds support for best_only in Index.prefetch
  • renames Index.gather to Index.best_containment, and changes method to return None or an IndexSearchResult, rather than a list => much confusion reduction.
  • consolidates threshold calculation code in search.py into a single function 🎉
  • adds flatten_and_* utility functions in sourmash.minhash.
  • adjusts CounterGather classes to take SourmashSignature instead of MinHash object, in line with Better conceptual definition on minhash/signature/query #616
  • changes CounterGather.siglist and locations to use dictionaries instead of lists
  • adds CounterGather.signatures()
  • adds a CounterGather_LCA() test implementation

The net result is a lot of code reduction in the Python codebase plus (IMO) more readable code.

A note to CounterGather aficionados

The change to using dictionaries instead of lists to store signatures and locations in CounterGather means that only one signature per md5sum will be stored. For the default CounterGather test, this will be the last one added.

See test test_index.py:test_counter_gather_identical_md5sum for the specific implementation test for CounterGather, and test_index_protocol.py::test_counter_gather_multiple_identical_matches for the general protocol test.

TODO

  • make a CounterGather technique to provide list of locations, list of signatures??
  • should CounterGather classes take a list of index databases to prefetch on, instead of current creation API?
  • remove/rename Index.gather?

@codecov
Copy link

codecov bot commented Jul 11, 2022

Codecov Report

Merging #2116 (61624fc) into latest (0bc9dbd) will increase coverage by 7.38%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2116      +/-   ##
==========================================
+ Coverage   84.30%   91.69%   +7.38%     
==========================================
  Files         130       99      -31     
  Lines       15278    11002    -4276     
  Branches     2171     2166       -5     
==========================================
- Hits        12880    10088    -2792     
+ Misses       2095      611    -1484     
  Partials      303      303              
Flag Coverage Δ
python 91.69% <100.00%> (ø)
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/commands.py 88.88% <100.00%> (ø)
src/sourmash/index/__init__.py 96.66% <100.00%> (-0.12%) ⬇️
src/sourmash/minhash.py 94.02% <100.00%> (+0.22%) ⬆️
src/sourmash/search.py 97.92% <100.00%> (-0.02%) ⬇️
src/core/tests/storage.rs
src/core/src/sketch/nodegraph.rs
src/core/src/cmd.rs
src/core/src/sketch/minhash.rs
src/core/tests/test.rs
src/core/src/ffi/hyperloglog.rs
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bc9dbd...61624fc. Read the comment docs.

@ctb ctb changed the title [EXP] additional work on CounterGather [MRG] subtantial refactoring of CounterGather and related Index code. Jul 13, 2022
@ctb
Copy link
Contributor Author

ctb commented Jul 13, 2022

@mr-eyes ready for review!

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few comments/questions -- otherwise I think this looks good to me. cleaner and better descriptions, for sure

Base automatically changed from refactor/counter_gather_tests to latest July 16, 2022 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants