add scaled MinHash/max_hash support code; many tests & minor bug fixes. #83

Merged
ctb merged 21 commits into master from boundary
Jan 5, 2017

Conversation

ctb
Contributor

@ctb ctb commented Jan 3, 2017

  • Adds --scaled to sourmash compute;
  • Adds max_hash to Estimators and SourmashSignature;
  • Fixes a bug with saving/loading abundances in Estimators;
  • Fixes an issue with Jaccard similarity computation for empty MinHash objects.
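
As a rough illustration of the first two and the last of these bullets, the sketch below assumes that --scaled is converted into a hash cutoff of roughly 2^64 / scaled and that only hashes at or below that cutoff are kept. It also returns 0.0 for the Jaccard similarity of two empty sketches instead of dividing by zero, which is one reasonable convention and not necessarily the one chosen in this PR. The names max_hash_for_scaled, hash_kmer and ScaledSketch, and the SHA-1-based hash, are illustrative assumptions and not sourmash's implementation.

```python
import hashlib

MAX_HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


def max_hash_for_scaled(scaled):
    """Hypothetical helper: map a --scaled value to a max_hash cutoff (0 = no cutoff)."""
    if scaled == 0:
        return 0
    return MAX_HASH_SPACE // scaled


def hash_kmer(kmer):
    """Stand-in 64-bit hash (SHA-1 prefix), used only to keep the sketch stdlib-only."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


class ScaledSketch:
    """Keeps every hash <= max_hash; max_hash == 0 keeps everything."""

    def __init__(self, max_hash=0):
        self.max_hash = max_hash
        self.hashes = set()

    def add_hash(self, h):
        if self.max_hash == 0 or h <= self.max_hash:
            self.hashes.add(h)

    def jaccard(self, other):
        union = self.hashes | other.hashes
        if not union:
            # Two empty sketches: report 0.0 instead of dividing by zero.
            return 0.0
        return len(self.hashes & other.hashes) / len(union)


a = ScaledSketch(max_hash_for_scaled(2))  # keep roughly half of the hash space
b = ScaledSketch(max_hash_for_scaled(2))
print(a.jaccard(b))  # both sketches are still empty
for kmer in ["ATGGA", "TGGAC", "GGACC"]:
    a.add_hash(hash_kmer(kmer))
    b.add_hash(hash_kmer(kmer))
```

With a cutoff like this the sketch size grows with the input instead of being fixed at n hashes, which is why it pairs naturally with a size limit of 0 meaning "no limit".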

Standard checklist:

  • Is it mergeable?
  • make test: Did it pass the tests?
  • make coverage: Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@ctb ctb changed the title from "make minhash max size of 0 indicate no limit to minhash size" to "[WIP] make minhash max size of 0 indicate no limit to minhash size" Jan 3, 2017
@codecov-io

codecov-io commented Jan 3, 2017

Current coverage is 77.37% (diff: 94.77%)

Merging #83 into master will increase coverage by 0.88%

@@             master        #83   diff @@
==========================================
  Files            17         17          
  Lines          2276       2374    +98   
  Methods          48         48          
  Messages          0          0          
  Branches         85        102    +17   
==========================================
+ Hits           1741       1837    +96   
  Misses          510        510          
- Partials         25         27     +2   

Powered by Codecov. Last update 4950a81...7d2dfba

return;
}

if (!num || mins.size() < num) {
Member

Quick comment: since we are using C++11, we can also use "or" instead of "||" and "and" instead of "&&" =]

Contributor Author

but I like ||!
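
For readers following the thread, here is a minimal Python sketch of what the condition under review appears to express, going by the PR's original title: a size of 0 means the sketch has no limit, otherwise at most num of the smallest hashes are kept. BoundedMins and its method names are hypothetical, and this is not a transcription of _minhash.cc.

```python
import bisect


class BoundedMins:
    """Sorted list of the smallest hashes seen; num == 0 disables the size cap."""

    def __init__(self, num=0):
        self.num = num
        self.mins = []  # kept sorted, without duplicates

    def add_hash(self, h):
        pos = bisect.bisect_left(self.mins, h)
        if pos < len(self.mins) and self.mins[pos] == h:
            return  # already present
        if not self.num or len(self.mins) < self.num:
            # Unlimited sketch, or still room: just insert.
            self.mins.insert(pos, h)
        elif h < self.mins[-1]:
            # Full: the new hash displaces the current largest one.
            self.mins.insert(pos, h)
            self.mins.pop()


capped = BoundedMins(num=2)
unlimited = BoundedMins(num=0)
for h in [9, 3, 7, 1, 3]:
    capped.add_hash(h)
    unlimited.add_hash(h)
print(capped.mins)     # [1, 3]
print(unlimited.mins)  # [1, 3, 7, 9]
```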

@ctb ctb mentioned this pull request Jan 4, 2017
6 tasks
@ctb ctb changed the title from "[WIP] make minhash max size of 0 indicate no limit to minhash size" to "[WIP] add scaled MinHash/max_hash support code; many tests & minor bug fixes." Jan 4, 2017
@ctb
Contributor Author

ctb commented Jan 5, 2017

Closes #92.

@ctb
Contributor Author

ctb commented Jan 5, 2017

Ready for review @luizirber @lgautier. I plan to add some tests for the remaining uncovered stuff in _minhash.cc; anything else?

Contributor

@lgautier lgautier left a comment

Approving because none of my comments is blocking a merge. They might be worth a look though.

@@ -14,6 +14,8 @@
 except ImportError:
     pass
+
+DEFAULT_SEED=MinHash(1,1).seed
Contributor

This works, but I had envisioned that the default seed could be obtained through a module-level utility (a function rather than an object, to stress that it is read-only at the moment).

from sourmash_lib import _minhash
DEFAULT_SEED = _minhash.hash_seed()

Contributor Author

yes, I think that's cleaner, but this will at least work for the moment. If I don't get to it on this PR I'll add an issue.

@@ -29,7 +31,8 @@ class Estimators(object):
"""

     def __init__(self, n=None, ksize=None, protein=False,
-                 with_cardinality=False, track_abundance=False):
+                 with_cardinality=False, track_abundance=False,
+                 max_hash=0, seed=DEFAULT_SEED):
         "Create a new MinHash estimator with size n and k-mer size ksize."
Contributor

The docstring could document what each parameter is. For example, I am not sure about protein; in addition, it is passed as the value of a named parameter is_protein in a nested function call (it could then be called either protein or is_protein everywhere).

Contributor Author

done! and thanks - I'd missed that protein was different from is_protein in this code. ac43885

        return (self.num, self.ksize, self.is_protein,
                self.mh.get_mins(with_abundance=with_abundance),
                self.hll, self.track_abundance, self.max_hash,
                self.seed)

    def __setstate__(self, tup):
        from . import _minhash
Contributor

Would importing _minhash at the module level (as suggested in a comment above) make this unnecessary?

Contributor Author

Fixed!
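
To illustrate the pickling pattern in this hunk, the sketch below round-trips a small stand-in object whose state tuple carries max_hash and seed. TinyEstimator is a hypothetical class and the seed default of 42 is only a placeholder; the real Estimators also stores an HLL counter and rebuilds a C++ MinHash object, both of which are omitted here.

```python
import pickle


class TinyEstimator:
    """Hypothetical stand-in that pickles its state as a plain tuple."""

    def __init__(self, n, ksize, max_hash=0, seed=42):  # 42: placeholder seed
        self.num = n
        self.ksize = ksize
        self.max_hash = max_hash
        self.seed = seed
        self.mins = []

    def __getstate__(self):
        # Field order must match the unpacking in __setstate__.
        return (self.num, self.ksize, tuple(self.mins), self.max_hash, self.seed)

    def __setstate__(self, tup):
        (self.num, self.ksize, mins, self.max_hash, self.seed) = tup
        self.mins = list(mins)


original = TinyEstimator(n=0, ksize=31, max_hash=1000, seed=7)
original.mins = [5, 17, 999]
restored = pickle.loads(pickle.dumps(original))
print(restored.max_hash, restored.seed, restored.mins)
```

Because the state is positional, adding fields such as max_hash and seed changes the tuple length, so pickles written by older code would need separate handling.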

     # one estimator for each ksize
     Elist = []
     for k in ksizes:
         if args.protein:
             E = sourmash_lib.Estimators(ksize=k, n=args.num_hashes,
                                         protein=True,
-                                        track_abundance=args.track_abundance)
+                                        track_abundance=args.track_abundance,
Contributor

Indentation issue?

Contributor Author

yeah... trying to keep it under 80 chars...

             Elist.append(E)
         if args.dna:
             E = sourmash_lib.Estimators(ksize=k, n=args.num_hashes,
                                         protein=False,
                                         with_cardinality=args.with_cardinality,
-                                        track_abundance=args.track_abundance)
+                                        track_abundance=args.track_abundance,
Contributor

Indentation issue?

@@ -69,6 +69,8 @@ def _json_next_signature(iterable,
         ksize = d['ksize']
         mins = d['mins']
         n = d['num']
+        max_hash = d.get('max_hash', 0)
+        seed = d.get('seed', sourmash_lib.DEFAULT_SEED)
Contributor

Nice! Elegant way to set the seed when it is not found in the JSON file.
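
A small sketch of why the d.get(...) defaults matter: records written before this change carry no max_hash or seed keys, and the fallback values let them load unchanged. The record layout, the read_sketch_fields helper and the DEFAULT_SEED value of 42 below are simplified assumptions, not the actual signature format.

```python
import json

DEFAULT_SEED = 42  # placeholder; the PR derives the default from MinHash(1,1).seed


def read_sketch_fields(record):
    """Hypothetical helper: extract sketch parameters, defaulting fields older files lack."""
    return {
        "ksize": record["ksize"],
        "num": record["num"],
        "mins": record["mins"],
        "max_hash": record.get("max_hash", 0),  # assumed: 0 means no hash cutoff
        "seed": record.get("seed", DEFAULT_SEED),
    }


old_record = json.loads('{"ksize": 31, "num": 500, "mins": [1, 2, 3]}')
new_record = json.loads(
    '{"ksize": 31, "num": 0, "mins": [1, 2], "max_hash": 1000, "seed": 7}'
)
print(read_sketch_fields(old_record)["seed"])      # 42
print(read_sketch_fields(new_record)["max_hash"])  # 1000
```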

@@ -1,5 +1,6 @@
 import sourmash_lib
 from sourmash_lib.signature import SourmashSignature, save_signatures, load_signatures
+import math
Contributor

math does not seem to be used anywhere in the module.

Contributor Author

fixed!

@ctb
Contributor Author

ctb commented Jan 5, 2017

Thank you for the review!

@ctb ctb changed the title from "[WIP] add scaled MinHash/max_hash support code; many tests & minor bug fixes." to "add scaled MinHash/max_hash support code; many tests & minor bug fixes." Jan 5, 2017
@ctb ctb merged commit 2416946 into master Jan 5, 2017
@ctb ctb deleted the boundary branch January 5, 2017 14:28
This was referenced Jan 5, 2017
@lgautier
Contributor

lgautier commented Jan 7, 2017

@ctb : I have a late question - I am realizing now that I do not understand what max_hash is for. I have read the docstring and the C code and think that I understand what it is, but not why it was introduced.

@ctb
Contributor Author

ctb commented Jan 7, 2017 via email

@lgautier
Contributor

lgautier commented Jan 7, 2017

OK. Reading it as "classified information". ;-)

I see a connection with #97 here: making experimentation possible through composability and extensibility (without necessarily putting everything in the main code base) would be a good feature, although a feature of the design rather than a switch. I actually landed on this repo because I wanted to experiment with MinHash and SBT (I hardly use the command line; my code uses this package as a library).

I have started looking at getting kmers/ngrams and hashes together (see discussion in marbl/Mash#27), and I am using that as an opportunity and concrete use case to suggest a bit of refactoring that would make experimentation easier. I should soon have a pull request that, although incomplete, should give me examples and candidate solutions to discuss.

@lgautier
Contributor

Update on my earlier note:

I have started looking at getting kmers/ngrams and hashes together (see discussion in marbl/Mash#27), and I am using that as an opportunity and concrete use case to suggest a bit of refactoring that would make experimentation easier. I should soon have a pull request that, although incomplete, should give me examples and candidate solutions to discuss.

While the focus was on design, I found myself spending a lot of time thinking about how to make it fit the existing codebase. I ended up rewriting from scratch, in Python. That is the bad news.

Now the good news:

  • I think that I am coming up with an interesting design for a library (not a command-line tool or interface, which I currently don't plan to write).
  • I have cleaned up my implementation a bit to make it easier to communicate the ideas.
  • While it is more flexible (as in "it allows all kinds of experimentation and trying out research ideas"), the performance appears quite decent for a start (compared to the fully C-implemented module sourmash._minhash).
  • I have written a short Jupyter notebook to show some of the points above - https://github.com/lgautier/mashing-pumpkins/blob/master/doc/notebooks/MinHash%2C%20design%20and%20performance.ipynb

@ctb
Contributor Author

ctb commented Jan 12, 2017 via email

@lgautier
Contributor

lgautier commented Jan 13, 2017

On the same page here.

While trying to see how the writing of the shareable JSON file could be done, I saw opportunities for work on the design (the alternative I saw was making the code base a bit convoluted). I was breaking a lot of other things in either case.

I thought about a fork, but quickly came to the same points as you (challenge to stability if merged too early, synchronization nightmare as time passes... and, as if this was not enough, there is an active refactoring of the C layer to Cython).

For that reason I started from scratch with a blank page, a pencil, and an empty .py file, thinking about a very minimal minhash library (just focusing on building/using what is described here: https://en.wikipedia.org/wiki/MinHash - that's the mashing-pumpkins thing) for the time being as this has the following practical advantages:

  • no need to worry about breaking anything during the initial phase, which makes trying bold ideas (or Python 3-only features) easier
  • meaningful continuous integration status with respect to unit tests and coverage
  • in a second phase, should there be any interesting outcomes, the ideas can be integrated by just using it as a library (the only code needed at this point is an adapter). Whatever bits are interesting can also be looted - I put it under a compatible license.

Update: I now have a number of design and implementation ideas in mashing-pumpkins. Not final-final, but tested and benchmarked enough for me to make a release. A proof-of-concept* utility built on that code shows that, despite the added flexibility, building sourmash sketches from FASTA or FASTQ would be 2 to 3 times faster (and reportedly use much less RAM).
(*: That's proof-of-concept code, using adapter code to integrate with sourmash - that's a library of building blocks.)
