Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

settle on core MinHash object API #885

Open
ctb opened this issue Feb 2, 2020 · 6 comments
Open

settle on core MinHash object API #885

ctb opened this issue Feb 2, 2020 · 6 comments
Labels
5.0 issues to address for a 5.0 release
Milestone

Comments

@ctb
Copy link
Contributor

ctb commented Feb 2, 2020

PR #882 raised a question in my mind about the various similarity and comparison methods on MinHash objects. We have similarity, compare, and jaccard, as well as contained_by and containment_ignore_maxhash. Currently compare and jaccard are identical (and do Jaccard comparisons without hash abundances), while similarity calculates Jaccard similarity w/o abundances and calculates angular similarity with abundances.

The source code for this is a bit of a mess, obviously :).

SEPARATELY, I just realized that the sourmash gather code in sourmash/search.py does abundance calculations in Yet Another Way, with what I'll call a projection from an abundance-weighted signature on to a non-abundance-weighted signature (see block in search.py). This code is entirely absent from the MinHash code. sigh.

Anyway.

#882 makes the rust code a bit simpler and less repetitive, and also IMO clearer, by adding explicit angular_similarity and jaccard functions (that do those calculations without downsampling), and then doing the downsampling and so on in similarity and compare. This raises the following questions --

Should the Python API match the rust API? (probably yes - right now this means adding angular_similarity.)

should we get rid of redundant functions (esp ones that no one uses outside of the internal sourmash team :) and document the official API? (probably yes.)

should we put "buyer beware" in around things like angular similarity and projection similarity that we haven't proven work, with either simulations or analytics? (...probably yes.)

related to #866 #624

@ctb
Copy link
Contributor Author

ctb commented Feb 6, 2020

@luizirber what's the right way to use deprecated to indicate that the Python MinHash::compare function will be removed in the future? And when's the earliest we can remove it - v4, or v5?

@ctb ctb mentioned this issue Feb 6, 2020
5 tasks
@ctb
Copy link
Contributor Author

ctb commented Feb 8, 2020

I'm going to try to follow the pandas deprecation policy (also ref their versioning policy)

@ctb
Copy link
Contributor Author

ctb commented Feb 9, 2020

We should write Python MinHash object API tests/docs as part of this, esp now that #889 is merged and life is looking a bit better.

@luizirber
Copy link
Member

We should write Python MinHash object API tests/docs as part of this, esp now that #889 is merged and life is looking a bit better.

And probably use similar docs for the rust docs, they are pretty empty right now...

@ctb
Copy link
Contributor Author

ctb commented Jul 2, 2020

would also be good to revisit issues like #618 to see that we're at least consistent.

@ctb
Copy link
Contributor Author

ctb commented Apr 10, 2021

random side thought: I think we should redefine the MinHash constructor API like so:

def __init__(self, *, num=0, scaled=0, ksize=None, ...)

(ref #338 of course)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.0 issues to address for a 5.0 release
Projects
None yet
Development

No branches or pull requests

2 participants