[MRG] Rework the `find` functionality for `Index` classes #1392

ctb · 2021-03-13T15:08:30Z

"They who control find control the universe."

This PR implements a common Index.find generator function for Jaccard similarity and containment on Index classes.

This new generator function takes a JaccardSearch object and a query signature as inputs, and yields all signatures that meet the criteria.

The JaccardSearch object in src/sourmash/search.py is the workhorse for find; it scores potential matches (and can truncate searches) based on query_size, intersection, subject_size, and union, which is sufficient to calculate similarity, containment, and max_containment.

This provides some nice simplifications -

Index.search and Index.gather can be implemented generically on top of find, and are now methods on the base Index class;
this supports fast prefetch functionality on top of all of our Index subclasses (SBT and LCA as well), ref [MRG] refactor gather functionality for speed & modularity; provide prefetch functionality. #1370 and [EXP] add a prefetch linear search function to Index #1371 and other makeshift strategies for large scale database search - the "greyhound" issue #1226;
all of the janky SBT-specific search code in sbtmh.py is no longer needed;

Moreover,

the logic around downsampling and database search is greatly simplified, and is now much more thoroughly tested;
the logic for abundance searches is split off into a separate function, which seems appropriate given that it's largely unused and largely untested...;

The overall result is a pretty good code consolidation and simplification.

This PR also:

splits sourmash search functionality into flat queries (uses new search_fn functionality, works on SBT and LCA too) and abund queries (only works on LinearIndex)
fixes sourmash categorize to work with more than SBTs by using load_file_as_index for the database (Upgrade sourmash categorize to take LCA databases as well as SBT. #829)
adds a location property to Index classes (Request that Index provide a location? #1377)
creates a namedtuple for results from Index.search and Index.gather that improves code readability;

Specifically, this PR:

Fixes #829 - sourmash categorize now takes all database types
Fixes #1377 - provides a location property on Index objects
Fixes #1389 - --best-only now works for similarity and containment
Fixes #1454 - sourmash index now flattens SBT leaves.

Larger thoughts:

Over in sourmash index does not flatten the signatures when building an SBT #1454 the question is asked, can we reliably do 'flat' Jaccard searches to discover good candidates for abundance matches (e.g. angular similarity)? Maybe worth creating a new issue.
This PR addresses @luizirber comment that Index.find is probably the key thing to implement for prefetch (or at least that's how I reinterpreted his comment :).

Questions and comments for reviewers

One goal for this is to better support a prefetch linear pass search, other makeshift strategies for large scale database search - the "greyhound" issue #1226. I'm not clear on whether I've gotten the abstractions right in such a way that they can be used/optimized in Rust code. Comments please!
the search and gather methods in the Index class, and the find functions in the various Index subclasses, are really the key changes in this PR.
this breaks backwards compatibility for what I think is a bug (at the very least it is previously undocumented behavior :) - we allowed storage of abund signatures in SBTs, and sourmash categorize took advantage of this to return angular similarity calculations when the query signature had abundances.
- This is literally the only place in the code base where this was used, so... I disabled it. See "Larger thoughts", above, and check out sourmash index does not flatten the signatures when building an SBT #1454.
- In terms of user notification, sourmash now produces an appropriate error message requiring the user to specify --ignore-abundance explicitly, which should steer people in the right direction and minimize the impact of this.
this breaks backwards compatibility for loading v1 and v2 SBT databases. I think this is correct. Briefly,
- the old test code did containment searches on SBTs with num MinHashes in them, which is incorrect.
- the new test code does a Jaccard search, and v1/v2 SBTs do not support min_n_below. This may be fixed by [WIP] Remove min_n_below from search code #1137, but I think it's out of scope for this PR :)
- if we want to keep this working for the next release, then we would need to merge [WIP] Remove min_n_below from search code #1137 before the next release (or do something else, like hold this PR until [WIP] Remove min_n_below from search code #1137).
- Also see [WIP] Fix old SBT version loading code tests #1397 - we can't easily create old version SBTs any more, it seems; in particular, we don't have any releases that create v1 and v2 SBTs!

TODO items:

codecov · 2021-03-13T15:15:38Z

Codecov Report

Merging #1392 (c8d8cd6) into latest (eb2b210) will increase coverage by 0.13%.
The diff coverage is 96.44%.

@@            Coverage Diff             @@
##           latest    #1392      +/-   ##
==========================================
+ Coverage   89.58%   89.71%   +0.13%     
==========================================
  Files         122      123       +1     
  Lines       18989    19464     +475     
  Branches     1455     1483      +28     
==========================================
+ Hits        17011    17463     +452     
- Misses       1750     1775      +25     
+ Partials      228      226       -2

Flag	Coverage Δ
python	`94.86% <97.71%> (+0.06%)`	⬆️
rust	`67.20% <0.00%> (-0.17%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/core/src/ffi/minhash.rs	`0.00% <0.00%> (ø)`
src/core/src/sketch/minhash.rs	`91.31% <ø> (ø)`
src/sourmash/sbt.py	`80.82% <86.53%> (-3.13%)`	⬇️
src/sourmash/minhash.py	`92.69% <89.47%> (+0.05%)`	⬆️
src/sourmash/commands.py	`83.84% <94.33%> (+0.50%)`	⬆️
src/sourmash/search.py	`93.23% <95.74%> (+2.00%)`	⬆️
tests/test_search.py	`98.48% <98.48%> (ø)`
src/sourmash/index.py	`94.62% <98.85%> (+2.51%)`	⬆️
src/sourmash/cli/categorize.py	`100.00% <100.00%> (ø)`
src/sourmash/lca/lca_db.py	`92.30% <100.00%> (+0.96%)`	⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb2b210...c8d8cd6. Read the comment docs.

ctb · 2021-04-19T13:19:53Z

OK, I think this is done, provisionally.

luizirber · 2021-04-19T20:35:16Z

Review ongoing, but while I don't finish it: the gather implementation for SBTs became way slower, and memory consumption also went up:

ctb · 2021-04-19T20:39:09Z

yeesh! that's not good!

I wonder if unload_data's default changed?

ctb · 2021-04-19T20:39:31Z

(there's no obvious algorithmic reason for the time or memory to go up)

luizirber · 2021-04-19T20:46:52Z

(there's no obvious algorithmic reason for the time or memory to go up)

All set calculations are done in Python now:
https://github.com/dib-lab/sourmash/pull/1392/files#diff-796cf35ae8d09c8df495f82265c2593075e00b9d91f5fed92f3e17e47d155a16R421

While not algorithmic different, it is a lot of memory copying to pull hashes out of Rust, create sets in Python, and then calculate...

ctb · 2021-04-19T20:48:55Z

(there's no obvious algorithmic reason for the time or memory to go up)

All set calculations are done in Python now:
https://github.com/dib-lab/sourmash/pull/1392/files#diff-796cf35ae8d09c8df495f82265c2593075e00b9d91f5fed92f3e17e47d155a16R421

While not algorithmic different, it is a lot of memory copying to pull hashes out of Rust, create sets in Python, and then calculate...

ahh! good point! and that's pretty easy to fix with MinHash code... I'll see if I can get it working using the Rust code.

ctb · 2021-04-19T20:52:11Z

(great job on finding that, I struggled to get that code (a) working and then (b) clean and (c) tested, so now it's time for (d) optimization 😂)

luizirber · 2021-04-19T20:58:30Z

(great job on finding that, I struggled to get that code (a) working and then (b) clean and (c) tested, so now it's time for (d) optimization joy)

(that's the code I need to change for #1137 #1138 #1201 #1221, so...)

ctb · 2021-04-19T21:02:26Z

(that's the code I need to change for #1137 #1138 #1201 #1221, so...)

ctb · 2021-04-20T13:20:52Z

(there's no obvious algorithmic reason for the time or memory to go up)

All set calculations are done in Python now:
https://github.com/dib-lab/sourmash/pull/1392/files#diff-796cf35ae8d09c8df495f82265c2593075e00b9d91f5fed92f3e17e47d155a16R421
While not algorithmic different, it is a lot of memory copying to pull hashes out of Rust, create sets in Python, and then calculate...

ahh! good point! and that's pretty easy to fix with MinHash code... I'll see if I can get it working using the Rust code.

#1474 provides a MinHash.intersection implemented in Rust. Would be interested in benchmarks!

…dex_find

ctb · 2021-04-21T21:04:09Z

note to self: @bluegenes and I want to add a match argument to JaccardSearch.collect(...) so that we can ignore matches to specific signatures, either by identity or by name or by ...

this would enable #985 and #849 more generically.

Edit: added in #1477

* add MinHash.intersection method * rearrange order of intersection * swizzle SBT search code over to using Rust-based intersection code, too * add intersection_and_union_size method to MinHash * make flatten a no-op if track_abundance=False * intersection_union_size in the FFI Co-authored-by: Luiz Irber <luiz.irber@gmail.com>

luizirber

This is already a large PR, and there are PRs merging more code into this, so I vote for merging now and rebasing the other ones.

Overall a nice cleanup of the codebase, no obvious performance regressions, and I think it mostly fits with future indices written in Rust.

ctb added 13 commits March 12, 2021 06:52

begin refactoring 'categorize'

4c09f5b

have the 'find' function for SBTs return signatures

af6fd84

fix majority of tests

8a92936

Merge branch 'latest' of github.com:dib-lab/sourmash into fix/sbt_find

c4adabf

comment & then fix test

cdb4159

torture the tests into working

a414624

split find and _find_nodes to take different kinds of functions

6f7d368

Merge branch 'fix/sbt_find' into refactor/categorize

7b2f624

redo 'find' on index

b5ab6d7

refactor lca_db to use new find

ed7d52b

refactor SBT to use new find

aec730e

comment/cleanup

590b3d6

refactor out common code

eb7d661

ctb added 16 commits March 13, 2021 07:18

fix up gather

0639c3e

use 'passes' properly

a65c79b

attempted cleanup

02794ee

minor fixes

f94e909

get a start on correct downsampling

c3a65ac

adjust tree downsampling for regular minhashes, too

9054cb8

remove now-unused search functions in sbtmh

db740ec

refactor categorize to use new find

03a5e60

cleanup and removal

b3718dd

remove redundant code in lca_db

e8e4702

remove redundant code in SBT

b40963c

add notes

055bd60

remove more unused code

2329009

refactor most of the test_sbt tests

e6d90f6

fix one minor issue

2baa8c3

fix jaccard calculation in sbt

0ec99ea

ctb changed the title ~~[WIP] Rework the find functionality for Index classes~~ [MRG] Rework the find functionality for Index classes Apr 19, 2021

ctb mentioned this pull request Apr 20, 2021

[MRG] Add MinHash.intersection(...) #1474

Merged

Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/in…

3a5901e

…dex_find

luizirber mentioned this pull request Apr 21, 2021

[MRG] Improving intersection and union calculations #1475

Merged

fix my dumb mistake with gather

57467cd

ctb mentioned this pull request Apr 21, 2021

[MRG] Adjust Index.find search protocol to support selective collection of matches #1477

Merged

luizirber approved these changes Apr 22, 2021

View reviewed changes

luizirber merged commit f02e250 into latest Apr 22, 2021

luizirber deleted the refactor/index_find branch April 22, 2021 02:17

ctb mentioned this pull request May 2, 2021

[MRG] refactor gather functionality for speed & modularity; provide prefetch functionality. #1370

Merged

16 tasks

This was referenced May 15, 2021

summary: selectors are good, let's maybe have more of them. #1524

Open

sourmash make_gather_query ValueError for threshold bp #1538

Closed

ctb mentioned this pull request Apr 3, 2022

explore truncation of search results based on overlap in the Index.find(...) protocol. #1925

Open

ctb mentioned this pull request Aug 21, 2022

prefetch-only Index classes and/or remote servers? #2229

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Rework the `find` functionality for `Index` classes #1392

[MRG] Rework the `find` functionality for `Index` classes #1392

ctb commented Mar 13, 2021 •

edited

codecov bot commented Mar 13, 2021 •

edited

ctb commented Apr 19, 2021

luizirber commented Apr 19, 2021

ctb commented Apr 19, 2021

ctb commented Apr 19, 2021

luizirber commented Apr 19, 2021

ctb commented Apr 19, 2021

ctb commented Apr 19, 2021

luizirber commented Apr 19, 2021

ctb commented Apr 19, 2021

ctb commented Apr 20, 2021

ctb commented Apr 21, 2021 •

edited

luizirber left a comment

[MRG] Rework the find functionality for Index classes #1392

[MRG] Rework the find functionality for Index classes #1392

Conversation

ctb commented Mar 13, 2021 • edited

Questions and comments for reviewers

TODO items:

codecov bot commented Mar 13, 2021 • edited

Codecov Report

ctb commented Apr 19, 2021

luizirber commented Apr 19, 2021

ctb commented Apr 19, 2021

ctb commented Apr 19, 2021

luizirber commented Apr 19, 2021

ctb commented Apr 19, 2021

ctb commented Apr 19, 2021

luizirber commented Apr 19, 2021

ctb commented Apr 19, 2021

ctb commented Apr 20, 2021

ctb commented Apr 21, 2021 • edited

luizirber left a comment

Choose a reason for hiding this comment

[MRG] Rework the `find` functionality for `Index` classes #1392

[MRG] Rework the `find` functionality for `Index` classes #1392

ctb commented Mar 13, 2021 •

edited

codecov bot commented Mar 13, 2021 •

edited

ctb commented Apr 21, 2021 •

edited