Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should we eliminate num MinHashes? (no) #1354

Open
ctb opened this issue Feb 28, 2021 · 6 comments
Open

should we eliminate num MinHashes? (no) #1354

ctb opened this issue Feb 28, 2021 · 6 comments

Comments

@ctb
Copy link
Contributor

ctb commented Feb 28, 2021

while they are nice to have around and are not much of a maintenance burden, they do complicate things --

see #1345, #870 for examples.

they are also not usable for a lot of purposes.

Discuss!

@olgabot
Copy link
Collaborator

olgabot commented Mar 17, 2021

To me, they are useful to have in the same API for comparison as a validation that scaled is indeed better. I found it helpful when figuring out whether num_hashes or scaled was more useful for single cells, without having to switch to a different program. Can it be a "not recommended, use at your own risk" kind of situation?

@ctb
Copy link
Contributor Author

ctb commented Mar 17, 2021

To me, they are useful to have in the same API for comparison as a validation that scaled is indeed better. I found it helpful when figuring out whether num_hashes or scaled was more useful for single cells, without having to switch to a different program.

Thank you, this is valuable feedback!

Can it be a "not recommended, use at your own risk" kind of situation?

It already kind of is, but we should be more explicit about it :)

My two concerns are -

  • they complicate the implementation of many things, which can be good when it allows for forcing generalization, but may be bad for code complexity and speed;
  • I keep on finding edge cases where we should have checked for misuse of num MinHashes, and didn't. So I'm worried that people that rely on them may find them broken in certain situations.

Clearer communication FTW.

@luizirber
Copy link
Member

I don't think we need to remove them, but I think we should split them. Many of the misuse issues happen because we didn't check if it was a num or scaled minhash...

(and splitting them also allows many perf optimizations underneath, merging would be greatly simplified...)

That said, it is easier to split on the Rust side, but what would be the API on the Python side? Do we keep one MinHash object, or create NumMinHash and ScaledMinHash? (there are consequences here on the FFI between Rust and Python =])

@ctb
Copy link
Contributor Author

ctb commented Mar 17, 2021

I don't think we need to remove them, but I think we should split them. Many of the misuse issues happen because we didn't check if it was a num or scaled minhash...

yes.

well, and because we only slowly came to accept how much better scaled MinHashes were for most things :)

(and splitting them also allows many perf optimizations underneath, merging would be greatly simplified...)

agree.

That said, it is easier to split on the Rust side, but what would be the API on the Python side? Do we keep one MinHash object, or create NumMinHash and ScaledMinHash? (there are consequences here on the FFI between Rust and Python =])

I'm tentatively in favor of having two different subclasses, but would want to explore the consequences. I wonder how much of this could be explored in Python-land before making any Rust changes?

@luizirber
Copy link
Member

I'm tentatively in favor of having two different subclasses, but would want to explore the consequences. I wonder how much of this could be explored in Python-land before making any Rust changes?

I think it is OK to prototype by checking explicitly (everywhere, all the time) for num and scaled/max_hash on the Python subclasses. That would be easy to adapt to FFI/Rust later.

@ctb
Copy link
Contributor Author

ctb commented Mar 17, 2021

Seems reasonable. There are lots of small things to explore here - see esp these big related issues, among the myriad of smaller ones:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants