Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

differentiate between mutable and immutable MinHash objects? #1494

Closed
ctb opened this issue May 2, 2021 · 4 comments
Closed

differentiate between mutable and immutable MinHash objects? #1494

ctb opened this issue May 2, 2021 · 4 comments

Comments

@ctb
Copy link
Contributor

ctb commented May 2, 2021

As I write more/deeper code around search and prefetch and so on, I am starting worry about accidentally modifying MinHash objects.

In brief -

  • most of the time, MinHash objects loaded from signature files can and should not be changed in any way - my suspicion is that in most of the code, immutable MinHash objects are probably the right thing to be used!
  • when they are, they should (?) be copied-and-returned by application of methods like flatten and downsample, which could similarly return immutable MinHash objects.
  • there are several potential upsides to clearly distinguishing between mutable and immutable MinHash - in addition to preventing/discovering bugs, presumably there are potential performance improvements on the Python side, and I'm also guessing that defaulting to immutable MinHash objects over on the rust side could yield massive performance benefits.

related to the idea of diversifying MinHash objects to differentiate between num and scaled objects #1354

there's another issue out there about the two different rust implementations of hash storage that I can't find at the moment, that could factor into this.

@luizirber
Copy link
Member

there's another issue out there about the two different rust implementations of hash storage that I can't find at the moment, that could factor into this.

Probably #1045?

But overall +1000 on this. We can unlock a lot of cool optimizations if we don't need to mutate the MinHash sketches, from reducing memory needed to store the data, to faster intersection/union calculations, and as #1045 showed, we can also optimize for change-heavy code without undoing the optimization from read-heavy code.

@ctb
Copy link
Contributor Author

ctb commented May 2, 2021

actually this one #1055, linked from #1045.

This is another one (like #1354) that can benefit from refactoring done at the Python layer to then guide the way to improved Rust design.

@ctb
Copy link
Contributor Author

ctb commented Jun 12, 2021

this was done in #1508, which continues to be a mostly innocuous change so far! I'm going to leave this open for a bit tho, to see if there's more to be done.

@ctb
Copy link
Contributor Author

ctb commented Jun 26, 2021

closing - #1508 hasn't really caused any problems.

@ctb ctb closed this as completed Jun 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants