-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add scaled MinHash/max_hash support code; many tests & minor bug fixes. #83
Conversation
Current coverage is 77.37% (diff: 94.77%)@@ master #83 diff @@
==========================================
Files 17 17
Lines 2276 2374 +98
Methods 48 48
Messages 0 0
Branches 85 102 +17
==========================================
+ Hits 1741 1837 +96
Misses 510 510
- Partials 25 27 +2
|
return; | ||
} | ||
|
||
if (!num || mins.size() < num) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick comment: since we are using C++11
, we can also use or
instead of ||
and and
instead of &&
=]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I like ||
!
Closes #92. |
Ready for review @luizirber @lgautier. I plan to add some tests for the remaining uncovered stuff in _minhash.cc; anything else? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because none of my comments is blocking a merge. They might be worth a look though.
@@ -14,6 +14,8 @@ | |||
except ImportError: | |||
pass | |||
|
|||
DEFAULT_SEED=MinHash(1,1).seed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working but I had envisioned that the default seed could be obtained with a module-level utility (function rather than object to stress that this is read-only at the moment).
from sourmash_lib import _minhash
DEFAULT_SEED = _minhash.hash_seed()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think that's cleaner, but this will at least work for the moment. If I don't get to it on this PR I'll add an issue.
@@ -29,7 +31,8 @@ class Estimators(object): | |||
""" | |||
|
|||
def __init__(self, n=None, ksize=None, protein=False, | |||
with_cardinality=False, track_abundance=False): | |||
with_cardinality=False, track_abundance=False, | |||
max_hash=0, seed=DEFAULT_SEED): | |||
"Create a new MinHash estimator with size n and k-mer size ksize." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring could document what each parameter is. For example I am not sure about protein
, in addition to which it is passed as a value for a named parameter is_protein
in a nested function call (could be called either protein
or is_protein
everywhere then).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done! and thanks - I'd missed that protein
was different from is_protein
in this code. ac43885
return (self.num, self.ksize, self.is_protein, | ||
self.mh.get_mins(with_abundance=with_abundance), | ||
self.hll, self.track_abundance, self.max_hash, | ||
self.seed) | ||
|
||
def __setstate__(self, tup): | ||
from . import _minhash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the import of _minhash
at the module level (suggested in a comment above) make this unnecessary ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
# one estimator for each ksize | ||
Elist = [] | ||
for k in ksizes: | ||
if args.protein: | ||
E = sourmash_lib.Estimators(ksize=k, n=args.num_hashes, | ||
protein=True, | ||
track_abundance=args.track_abundance) | ||
track_abundance=args.track_abundance, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation issue ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... trying to keep it under 80 chr...
Elist.append(E) | ||
if args.dna: | ||
E = sourmash_lib.Estimators(ksize=k, n=args.num_hashes, | ||
protein=False, | ||
with_cardinality=args.with_cardinality, | ||
track_abundance=args.track_abundance) | ||
track_abundance=args.track_abundance, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation issue ?
@@ -69,6 +69,8 @@ def _json_next_signature(iterable, | |||
ksize = d['ksize'] | |||
mins = d['mins'] | |||
n = d['num'] | |||
max_hash = d.get('max_hash', 0) | |||
seed = d.get('seed', sourmash_lib.DEFAULT_SEED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ! Elegant way to set the seed when not found in JSON file.
@@ -1,5 +1,6 @@ | |||
import sourmash_lib | |||
from sourmash_lib.signature import SourmashSignature, save_signatures, load_signatures | |||
import math |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
math
does not seem used anywhere in the module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
Thank you for the review! |
@ctb : I have a late question - I am realizing now that I do not understand what |
On Sat, Jan 07, 2017 at 08:16:06AM -0800, Laurent Gautier wrote:
@ctb : I have a late question - I am realizing now that I do not understand what `max_hash` is for. I read and think that I understand what it is from the docstring and the C code, but not why it was introduced.
Yes, it's a research feature :). I'm still exploring its use.
|
OK. Reading it as "classified information". ;-) I am seeing here a connection with #97 in the sense that making experimentation through composability / extensibility (and without necessarily putting in the main code base) would be a good feature (although a feature of the design rather than a switch). I actually landed on this repos because I wanted to experiment with MinHash and SBT (and I am hardly using the command line, and my code is using this package as a library). I have started looking at getting kmers/ngrams and hash together (see discussion in marbl/Mash#27), and I am using that as a opportunity / concrete use-case to suggest a bit of refactoring to make experimentations easier. I should have a pull request that, although incomplete, should help me have examples and candidate solutions to discuss relatively soon. |
Update on my earlier note:
While the focus was on design, but I found myself spending a lot of time thinking about how to make it fit the existing codebase. I ended rewriting from scratch and in Python. This was for the bad news. Now the good news:
|
On Sat, Jan 07, 2017 at 08:59:10AM -0800, Laurent Gautier wrote:
OK. Reading it as "classified information". ;-)
not so much classified as "of unknown value" / "hard to explain" :)
I am seeing here a connection with #97 in the sense that making experimentation through composability / extensibility (and without necessarily putting in the main code base) would be a good feature (although a feature of the design rather than a switch). I actually landed on this repos because I wanted to experiment with MinHash and SBT (and I am hardly using the command line, and my code is using this package as a library).
I have started looking at getting kmers/ngrams and hash together (see discussion in marbl/Mash#27), and I am using that as a opportunity / concrete use-case to suggest a bit of refactoring to make experimentations easier. I should have a pull request that, although incomplete, should help me have examples and candidate solutions to discuss relatively soon.
my (our) general policy on this kind of thing is that more code, and
exploratory code, is fine in master as long as it's tested and doesn't mess
with the existing use cases; this is where semantic versioning comes in, too,
as it reassures people that command line foo remains reasonably stable. long
term running branches are also very much ok, especially if they're regularly
updated from master.
I would prefer to avoid forks as much as possible (because then we have to
make sure common bugs are communicated across the forks).
for me, it's all about (a) supporting common use cases for non-/semi-developers
"out of the box", (b) providing a robust platform for experimentation,
and (c) progressively moving towards more used functionality and fewer
bugs.
none of my projects have gotten big enough that this attitude has become
a problem :)
|
On the same page here. While trying to see how the writing of the shareable JSON file could be done, I saw opportunities for work on design (the alternative I saw was making the code base a bit circumvoluted) I was breaking a lot of other things in either case. I thought about a fork, but quickly came to the same points as you (challenge to stability if merged too early, synchronization nightmare as time is passing... and if this was not enough there is an active refactoring of the C-layer to Cython). For that reason I started from scratch with a blank page, a pencil, and an empty
update: I have a number of design and implementation ideas now in mashing-pumpkins. Notfinal-final but tested and benchmarked enough to make me have a release. A proof-of-concept* utility with that code is showing that while there is added flexibility building sourmash sketches from FASTA or FASTQ would be 2 to 3 times faster (and reportedly use much less RAM). |
--scaled
tosourmash compute
max_hash
toEstimator
andSourmashSignature
;abundances
saving/loading inEstimators
;MinHash
objects;test_sourmash.py
forsourmash --scaled
.Standard checklist:
make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?