Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Create an Index abstract base class #556

Merged
merged 42 commits into from
Dec 16, 2019
Merged

[MRG] Create an Index abstract base class #556

merged 42 commits into from
Dec 16, 2019

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Oct 5, 2018

From #545 (comment):

After the LCA index was implemented we figured it has a pretty similar API to the SBT 
index, and with the interested in other indexes it seems it's a good time to make a Index 
interface/base class and support more options.

Another way to test this abc is to define a Linear index: save all the signatures in a list/dictionary, search thru it linearly. It's also good as a base case for testing if SBT/LCA are finding all matches correctly (there are some test cases that already do this)

Remaining TODO before review:

  • clean up multiple definitions of SearchResult
  • add signatures method?
  • think carefully about whether we need additional LCA db gather tests, especially with respect to different scaled values.
  • resolve question re LCA_db search / ignore_abundance
  • resolve question re index search / resolve md5 dups
  • LCAGatherResult could maybe be merged/removed? -> punt to Deprecate 'sourmash lca gather' in sourmash 3.0 #728

Rejected for this PR:

  • we are not yet abandoning python 2.7 with this merge. (see comment)
  • do NOT add signature compatibility information or methods to Index objects.
  • do NOT add iterator/generator-like functionality to gather API.

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@olgabot
Copy link
Collaborator

olgabot commented Oct 12, 2018

Exciting!!

@codecov
Copy link

codecov bot commented Oct 13, 2018

Codecov Report

Merging #556 into master will increase coverage by 0.09%.
The diff coverage is 95.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #556      +/-   ##
==========================================
+ Coverage   89.38%   89.47%   +0.09%     
==========================================
  Files          29       30       +1     
  Lines        4614     4705      +91     
  Branches       49       49              
==========================================
+ Hits         4124     4210      +86     
- Misses        486      491       +5     
  Partials        4        4
Impacted Files Coverage Δ
sourmash/sbtmh.py 84.89% <100%> (-0.11%) ⬇️
sourmash/sourmash_args.py 96.72% <90%> (ø) ⬆️
sourmash/commands.py 87.87% <91.66%> (+0.45%) ⬆️
sourmash/search.py 91.26% <94.11%> (-2.94%) ⬇️
sourmash/lca/lca_utils.py 95.45% <95.65%> (-0.57%) ⬇️
sourmash/sbt.py 86.24% <96.55%> (+0.71%) ⬆️
sourmash/index.py 98.38% <98.38%> (ø)
sourmash/signature_json.py 96.21% <0%> (-1.09%) ⬇️
sourmash/lca/command_index.py 90.26% <0%> (-0.45%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03e5269...48abd10. Read the comment docs.

@luizirber luizirber mentioned this pull request Dec 14, 2018
5 tasks
@ctb
Copy link
Contributor

ctb commented Dec 17, 2018

@luizirber once #533 is merged I plan to work on this next & clean up some API stuff. May I haz?

@ctb
Copy link
Contributor

ctb commented Dec 20, 2018

Thought: I think we should provide at least two specialized functions on Index objects, equivalent to 'search' and 'gather'. The former would return matches with best Jaccard similarity, the latter would find matches with best containment. Since this is currently 100% of needed functionality it would be (IMO) be better than our current bending-over-backwards approach to working off of a generic 'find' function.

@luizirber
Copy link
Member Author

@ctb I like that! find is still useful for research (it's easier to define a function to pass to it than adding a method, because the DFS/BFS comes for free), but it is quite annoying for 'production' (or, let's say, when you need to cross language boundaries thru FFI =P)

I don't think there is a way of marking a method 'optional' with abc, so maybe remove find and add search and gather?

@ctb
Copy link
Contributor

ctb commented Dec 20, 2018 via email

@ctb
Copy link
Contributor

ctb commented Jan 5, 2019

hey @luizirber your thoughts on LinearIndex.search on this branch would be most welcome. Is that the right way to write an implementation, wrt keyword args etc etc? Inquiring minds...

""" """

@abstractmethod
def search(self, signature, *args, **kwargs):
Copy link
Member Author

@luizirber luizirber Jan 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even defining search like this:

def search(self, signature, *, 
                 threshold=None,
                 do_containment=False, 
                 ignore_abundance=False,
                 **kwargs):

without *args, with default keyword arguments.

(the * in the middle is a Python 3 thing, but when this get merged I think we will have dropped Python 2 anyway?)

@ctb ctb changed the title [WIP] Create an Index abc [WIP] Create an Index abstract base class Jan 9, 2019
@luizirber
Copy link
Member Author

Tentative: add stub files with typing information too
https://mypy.readthedocs.io/en/stable/stubs.html

@olgabot
Copy link
Collaborator

olgabot commented Aug 10, 2019

What is the status of this with relation to the Rust codebase?

@luizirber
Copy link
Member Author

luizirber commented Aug 23, 2019

The Rust codebase has similar concepts implemented. This is the ~equivalent Trait (Rust traits are similar to Python abstract base classes) https://github.com/dib-lab/sourmash/blob/0d69b7973e1b4e690933b27aeb8d11becc876e19/src/index/mod.rs#L34-L68

And an implementation for a LinearIndex similar to the one @ctb implemented in this PR:
https://github.com/dib-lab/sourmash/blob/0d69b7973e1b4e690933b27aeb8d11becc876e19/src/index/linear.rs#L32-L83

(Note that since the Rust trait has a default implementation for search, the LinearIndex only need to implement find. But implementations are free to implement a search method if they want, like in the BIGSI example).

Python and Rust are not talking yet, mostly because it would involve juggling 3 different branches/PRs:

  • Replacing C++ with Rust #424, which replaces C++ with Rust. It wasn't merged yet because Titus thinks this should be 3.0 (since it changes the build system), even if there are no breaking changes in the API. Right now I think we should merge Add Dayhoff encoding of amino acids #689 first, and implement the Dayhoff encoding in Rust.
  • Moving loading and save sigs to Rust #532, moving signature loading into Rust. This is way faster for large signatures (in the multiple MBs range), but should land after Replacing C++ with Rust #424.
  • And this one. Original plan was to expose the Rust LinearIndex, and then move (or not) the SBT into Rust. (More likely to keep the Python SBT and expose the Rust SBT in another class).

A follow up PR to this one would then move commands (search and gather) to use Index instead of an specific implementation (like the SBT). Rust codebase already have some of this implemented.

@olgabot
Copy link
Collaborator

olgabot commented Aug 23, 2019

Ping @phoenixAja

@luizirber
Copy link
Member Author

luizirber commented Aug 24, 2019

#672 (comment) is also related to this: SBT and LCA are the main candidates to implement Index, but a list of signatures can also be put quickly into a LinearIndex (because it is pretty simple), and if the commands are defined in terms of Index operations we don't need to keep adding conditions and corners cases in each command (see the if/elif/else chain in search and gather).

@ctb
Copy link
Contributor

ctb commented Sep 6, 2019

Hey @luizirber, there are still a few failed tests, and I have a sneaking suspicion that there are ways to make things much more efficient with generators, but I've done a major bit of refactoring and would welcome your input!



class LinearIndex(Index):
def __init__(self, signatures=[], filename=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noooooooooo... Don't set a default argument like that =]
http://effbot.org/zone/default-values.htm

Suggested change
def __init__(self, signatures=[], filename=None):
def __init__(self, signatures=None, filename=None):
if signatures is None:
signatures = []

@luizirber
Copy link
Member Author

The LinearIndex is looking a lot like the one in the Rust side! And I think we can replace most cases where we process a list of signatures with a LinearIndex, (like in compare)...

Or even an Index. An idea I've been playing with is to add a datasets method in Index that iterates over datasets contained in an index (leaves for a SBT, for example). With this we can even define default implementations for Index methods (which are going to be the same as LinearIndex), and then any other implementation (like the SBT) can choose to change as they see fit (search and gather using the tree structure, instead of just iterating through the leaves, for example). It is not working yet on the Rust side, but sort of looks like this: https://github.com/dib-lab/sourmash/blob/36bc10d1ded37ab9c36cc8be77a8ebdb982dd399/src/index/mod.rs#L34-L83
(note the datasets and iter_datasets methods, and how they are using in the find method)

@ctb
Copy link
Contributor

ctb commented Sep 8, 2019

Yes, I too was thinking about a datasets-style iter! Dovetails with #672.

I'm not a huge fan of the datasets method name; you don't want to just go with signatures?

@ctb
Copy link
Contributor

ctb commented Sep 8, 2019

A few other thoughts -

first, LinearIndex isn't just a list of signatures, they need to be compatible signatures (i.e. that can be compared to each other). I think there's an opportunity to clear up and simplify some of the compatibility-checking code that's lying around the code base...

Second, looking at the gather code and thinking about possible optimizations, I would like to suggest have gather behave like an iterator in Python. By analogy, there would be an __iter__-like method that would return a gather object that has a next method on it. In the case of gather objects, the next method would take the next version of the signature to search, with the guarantee that this signature would be a proper subset of the previous version of the signature. This seems to fit the general case and would allow certain kinds of optimizations in the LCA db case, for example. I'll rough out some code and see what makes sense.

@ctb
Copy link
Contributor

ctb commented Sep 8, 2019

(ref datasets q, what about just adding an iterator to Index objects that returns the underlying contained objects?)

@ctb
Copy link
Contributor

ctb commented Sep 8, 2019

Note to self: should gather take threshold_bp instead of threshold as an argument, since we can estimate that for scaled signatures (& gather databases have to use scaled signatures)?

@ctb
Copy link
Contributor

ctb commented Sep 8, 2019

Note to self: the LCA db gather needs some more tests, because it's not breaking enough when I change the API :)

@ctb
Copy link
Contributor

ctb commented Dec 12, 2019

Note, we could now easily provide a generic sourmash.load_index function that sniffed the file format and then called the appropriate Index.load.

@ctb
Copy link
Contributor

ctb commented Dec 13, 2019

hi @luizirber this is ready for review!

@ctb ctb changed the title [WIP] Create an Index abstract base class Create an Index abstract base class Dec 13, 2019
@ctb ctb changed the title Create an Index abstract base class [MRG] Create an Index abstract base class Dec 13, 2019
ctb and others added 8 commits December 13, 2019 16:16
Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>
* add signatures() method to both LCA and SBT indices

* Update tests/test_sbt.py

Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>

* SBT.insert now matches Index.insert, while SBT.add_node does what insert used to

* clean up signature loading

* round out Index method tests, sort of :)
@ctb
Copy link
Contributor

ctb commented Dec 15, 2019

Updated with #796 and added some scaled tests. Review & merge at will @luizirber.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants