Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] MinHash class refactoring #1128

Merged
merged 59 commits into from
Aug 5, 2020
Merged

[MRG] MinHash class refactoring #1128

merged 59 commits into from
Aug 5, 2020

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jul 25, 2020

Many minor refactors and associated deprecations of the MinHash class.

Addresses some of #338
Fixes #611 by adding flatten
Fixes #82, fixes #284, and fixes #896 by providing a .hashes attribute that implements a read-only dictionary interface to hashes
Fixes #951 by deprecating is_molecule_type.
Fixes #618 by settling on downsample(...) as the downsampling API.

Relevant to general MinHash API refactoring #720, #885 #999 but doesn't address core issues there.

TODO before merge:

  • maybe document that only functions exposed at sourmash top level are public API under semantic versioning...
  • put together some doctests for the top level API / write docs and doctests against MinHash since it's darn useful
  • deprecate translate_codon or remove from MinHash class at any rate; @luizirber said "We can have a submodule for working with these translations, I think it is useful for prototyping new ideas."
  • summarize proposed __init__ refactoring in a new issue

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

original comment - Pedestrian MinHash class refactoring for 4.0?

This is a trial cleanup & regularization of the MinHash class, to see how it all ...looks and feels. Comments welcome!

related issues:

#720
#82
#338
#611
#896
#999

Constructor refactoring

see specific @luizirber proposal in #999.

I'm kind of in favor of

  • not doing a hash function switch over for 4.0
  • requiring ksize and moltype
  • requiring one of num or scaled;
  • (and changing n -> num)

changes to methods

(deprecate in 3.x, remove in 4.0.)

  • deprecate one or both of get_hashes (?) or get_mins , maybe in favor of a view object/property (see below) like .hashes?
  • remove subtract_mins
  • add->add_kmer`?
  • remove update
  • deprecate translate_codon or remove from MinHash class at any rate
  • rename downsample_n to downsample_num, or downsample(num=...)
  • eliminate max_hash stuff completely from public API, in favor of scaled
  • eliminate mins completely from public API, in favor of hash and hashes
  • eliminate is_molecule_type

Add:

  • flatten to remove abundances.
  • what about adding a view object for the hashes that provides a list/dict interface? e.g. #896

@codecov
Copy link

codecov bot commented Jul 25, 2020

Codecov Report

Merging #1128 into latest will increase coverage by 0.21%.
The diff coverage is 95.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1128      +/-   ##
==========================================
+ Coverage   92.33%   92.55%   +0.21%     
==========================================
  Files          74       73       -1     
  Lines        5809     5764      -45     
==========================================
- Hits         5364     5335      -29     
+ Misses        445      429      -16     
Impacted Files Coverage Δ
sourmash/__main__.py 91.66% <ø> (-0.65%) ⬇️
sourmash/command_compute.py 97.62% <ø> (-0.01%) ⬇️
sourmash/exceptions.py 87.75% <ø> (-0.49%) ⬇️
sourmash/lca/command_compare_csv.py 92.18% <ø> (-0.13%) ⬇️
sourmash/lca/command_index.py 93.12% <ø> (+0.41%) ⬆️
sourmash/lca/command_rankinfo.py 88.88% <ø> (-0.25%) ⬇️
sourmash/lca/lca_utils.py 91.12% <ø> (-0.08%) ⬇️
sourmash/logging.py 96.03% <ø> (-0.04%) ⬇️
sourmash/sbt.py 88.56% <ø> (-0.02%) ⬇️
sourmash/sbtmh.py 89.71% <ø> (-0.19%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cace054...91de874. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented Jul 25, 2020

I think this might become close to mergable in 3.x if we deprecate rather than remove various methods, such as get_mins and subtract_mins, and also add a few tests to make sure that get_mins() and so on still work. Thoughts, @luizirber ?

@luizirber
Copy link
Member

I think this might become close to mergable in 3.x if we deprecate rather than remove various methods, such as get_mins and subtract_mins, and also add a few tests to make sure that get_mins() and so on still work. Thoughts, @luizirber ?

That would be awesome! Deprecating in 3.5 and removing in 4 is kind of short notice, but way better than just erroring. Still think we need a transition doc, but it can be way shorter ("check for DeprecationWarning!")

@luizirber
Copy link
Member

luizirber commented Jul 27, 2020

inspired by @luizirber talk, I decided to try my own experimental PR!

Yay experimental PRs! Even if it is not all merged, we can pick most bits and pieces and add in other PRs =]

I'm kind of in favor of

* not doing a hash function switch over for 4.0
* requiring ksize and moltype
* requiring one of num or scaled;
* (and changing `n` -> `num`)

Yeah, I think the hash function switch on __init__ should come later (but we should start using them internally)

deprecate one or both of get_hashes (?) or get_mins , maybe in favor of a view object/property (see below) like .hashes?

I really like the view style!

remove update

I think the goal was to have something more dict/set like? But I don't remember ever using it.

deprecate translate_codon or remove from MinHash class at any rate

We can have a submodule for working with these translations, I think it is useful for prototyping new ideas.

rename downsample_n to downsample_num, or downsample(num=...)

I kinda like downsample(num=...)

eliminate max_hash stuff completely from public API, in favor of scaled

👍

eliminate mins completely from public API, in favor of hash and hashes

💯

what about adding a view object for the hashes that provides a list/dict interface? e.g. #896

I really like this style.

@ctb
Copy link
Contributor Author

ctb commented Jul 27, 2020

I think this might become close to mergable in 3.x if we deprecate rather than remove various methods, such as get_mins and subtract_mins, and also add a few tests to make sure that get_mins() and so on still work. Thoughts, @luizirber ?

That would be awesome! Deprecating in 3.5 and removing in 4 is kind of short notice, but way better than just erroring. Still think we need a transition doc, but it can be way shorter ("check for DeprecationWarning!")

I think we can also let many things persist (get_mins(), etc.) past 4.0 without much harm. It's mostly the complete incompatibilities that I want to remove!

tests/test__minhash.py Outdated Show resolved Hide resolved
@ctb
Copy link
Contributor Author

ctb commented Jul 27, 2020

Misc thoughts for putting this up for merge --

  • _minhash was always an internal API name, sourmash.MinHash is what should have been used. So I am keeping the module rename 👍
  • we'll deprecate (but keep) get_mins, while switching to using hashes.
  • get_max_hash_for_scaled and get_scaled_for_max_hash were not exposed at top level and were part of the _minhash module, changing their names should be fine.
  • speaking of which we should maybe document that only functions exposed at sourmash top level are public API under semantic versioning...
  • deprecate MinHash methods is_molecule_type, downsample_n and downsample_scaled
  • implement downsample(num/scaled=...)
  • deprecate add and update in favor of add_kmer and set_abundances(..., clear=False)
  • deprecate downsample_max_hash
  • deprecate max_hash property (?) or make it internal only (_max_hash)
  • add tests test_max_hash_and_scaled_error and test_mh_subtract back in, for now, with a note to maybe remove them later
  • misc thought: maybe hashes could (for some future MinHash class) be read-write? for now we should make it a read-only dict interface.

@ctb
Copy link
Contributor Author

ctb commented Jul 27, 2020

maaaybe think about which other parts of the top level API to deprecate. and/or put together some doctests for the top level API.

@ctb
Copy link
Contributor Author

ctb commented Jul 27, 2020

hmm also make issue to refactor Rust API accordingly - e.g. max_hash conversion.

(now #1134)

@ctb
Copy link
Contributor Author

ctb commented Jul 28, 2020

  • look at dna/protein/hp/dayhoff properties; are they used anywhere?
  • add tests for deprecated features like get_mins()

@ctb
Copy link
Contributor Author

ctb commented Aug 5, 2020

heh, the last commit was necessary because of a fun chain of events:

  • once we tagged this as 4.0, the DeprecationWarning for max_hash was triggered.
  • this DeprecationWarning then interfered with a test that was explicitly NOT expecting a warning.

@luizirber luizirber mentioned this pull request Aug 5, 2020
5 tasks
@luizirber
Copy link
Member

luizirber commented Aug 5, 2020

Is this PR also covering #1145 now?

In any case, we might need a PR derived from this one for the DeprecationWarnings and merge it in stable? (and then release it as 3.4.2 and continue updating latest? =])

@luizirber luizirber mentioned this pull request Aug 5, 2020
@ctb
Copy link
Contributor Author

ctb commented Aug 5, 2020 via email

sourmash/minhash.py Outdated Show resolved Hide resolved
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
@ctb ctb changed the base branch from latest to stable August 5, 2020 22:46
@ctb ctb merged commit 2093b99 into stable Aug 5, 2020
@ctb ctb deleted the minhash_class_munge branch August 5, 2020 22:49
@ctb
Copy link
Contributor Author

ctb commented Aug 6, 2020

Is this PR also covering #1145 now?

Finally understood this comment, just a bit too late. Sigh. Yes, I merged in #1145 because I got tired of resolving conflicts... I didn't update the comment to reflect that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment