added constructors for MinHasher and subclasses that allow the number of... #144

Merged
merged 3 commits into twitter:develop from tresata/feature/hasher-kk on Jun 20, 2013

Conversation

Contributor

koertkuipers commented Mar 14, 2013

These constructors allow the number of hashes and number of bands to be input directly, instead of relying on these being computed from a threshold. Motivation: we had a need to guarantee the exact bytes in the hashes for backwards compatibility with previously generated and stored hashes. Relying on the same numbers being computed from a Double constructor argument seemed somewhat risky.
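A sketch of the constructor shape this implies (parameter names are illustrative, not necessarily the merged signatures):

// the primary constructor takes the exact numbers, so the bytes produced are
// fully determined by the caller; a secondary constructor can still derive
// them from a target threshold as before
abstract class MinHasher[H](val numHashes : Int, val numBands : Int)(implicit n : Numeric[H])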

@koertkuipers koertkuipers added constructors for MinHasher and subclasses that allow the number of hashes and number of bands to be input, instead of relying on these being computed from a threshold. motivation: we had a need to guarantee the exact bytes in the hashes for backwards compatibility with previously generated and stored hashes.
bf84abf
Collaborator

johnynek commented Mar 16, 2013

In principle, I like this.

But while we're making breaking changes, can I ask that you also introduce a type:

case class MinHashSignature(bytes: IndexedSeq[Byte])

and make the monoid on that type? Using Array[Byte] is too generic a type, and it makes it impossible to set a default Monoid at too high a scope.

We should target this for 0.2.0.
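A minimal sketch of what that might look like (helper details are illustrative, assuming one byte per hash for brevity; the real MinHasher compares full multi-byte hash values):

import com.twitter.algebird.Monoid

case class MinHashSignature(bytes : Array[Byte])

// sketch: the monoid takes the elementwise minimum of two signatures,
// with zero as the identity for min (all bytes at their maximum value)
class MinHashSignatureMonoid(numBytes : Int) extends Monoid[MinHashSignature] {
  val zero = MinHashSignature(Array.fill(numBytes)(Byte.MaxValue))
  def plus(left : MinHashSignature, right : MinHashSignature) =
    MinHashSignature(left.bytes.zip(right.bytes).map { case (l, r) => l min r })
}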

Contributor

koertkuipers commented Mar 17, 2013

ok making the Monoid more specific with respect to type makes sense.

should we replace all the Array[Byte]?
for example similarity becomes:
def similarity(left : MinHashSignature, right : MinHashSignature) : Double
and init becomes:
def init(value : Long) : MinHashSignature

or should we be more surgical in the change and only replace the methods that relate to the Monoid because of the too-generic implicit (so only plus and zero), while keeping Array[Byte] for the others?


Contributor

koertkuipers commented Mar 18, 2013

i played around with a type like case class MinHashSignature(bytes: IndexedSeq[Byte]).

however i ran into trouble with the IndexedSeq[Byte]. The implementation of the plus operation uses elegant and efficient ByteBuffer operations, which rely on the ByteBuffer wrapping the Array[Byte]. I cannot do the same thing with IndexedSeq[Byte] (at least i don't see a way), unless i start copying every IndexedSeq into an Array[Byte].

any suggestions?

case class MinHashSignature(bytes: Array[Byte]) would work trivially.
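For reference, the kind of ByteBuffer trick being referred to (a sketch, assuming 8-byte hashes laid out back to back; not the actual MinHasher code):

import java.nio.ByteBuffer

// ByteBuffer.wrap views an Array[Byte] in place, with no copy, which is
// what makes the elementwise min over fixed-width hashes cheap; an
// IndexedSeq[Byte] would first have to be copied into an array
def plusLongHashes(left : Array[Byte], right : Array[Byte]) : Array[Byte] = {
  val lb = ByteBuffer.wrap(left).asLongBuffer
  val rb = ByteBuffer.wrap(right).asLongBuffer
  val out = new Array[Byte](left.length)
  val ob = ByteBuffer.wrap(out).asLongBuffer
  while (lb.hasRemaining)
    ob.put(math.min(lb.get, rb.get))  // keep the smaller hash at each slot
  out
}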


Collaborator

johnynek commented Mar 18, 2013

I guess MinHashSignature(bytes: Array[Byte]) is fine, but we should be clear in the comments about the contract: are we assuming that these bytes are never modified? (I think so.)

Contributor

koertkuipers commented Mar 18, 2013

correct, they are never modified. i will add that comment.


Contributor

koertkuipers commented Mar 18, 2013

unrelated... but i keep getting segfaults on algebird's sbt test. is there some kind of trick to avoid that?


Collaborator

avibryant commented Mar 19, 2013

Thinking ahead, we may want MinHasher (or a version of MinHasher) to wrap AdaptiveVector[Int] instead of Array[Byte], so that we can produce sparse minhash signatures as in the DISCO paper (http://arxiv.org/abs/1206.2082).

We should probably measure how much benefit we're actually getting from the efficient Array[Byte] stuff to see how important it is to preserve.

Contributor

koertkuipers commented Mar 19, 2013

ok, i have to read that DISCO paper again. is AdaptiveVector a known type, or was that to make the point that it should be sparse? couldn't find it anywhere.


Collaborator

avibryant commented Mar 19, 2013

Sorry, AdaptiveVector was a reference to #145 .

The idea in the DISCO paper is that if you have a count, or an estimate, of the total size of the set, then when processing a given element, you can exclude hash values from the signature that are so high that you predict that they are unlikely to end up in the final minhash; so you're emitting very sparse signatures as you go but end up with a dense one in the end after combining them all.
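A rough sketch of that filter (the cutoff formula here just illustrates the idea, and is not taken from the paper):

// given an estimate of the set size, only emit (slot, hashValue) pairs whose
// value is small enough to plausibly survive as the minimum: with n elements,
// the expected minimum of n uniform nonnegative hashes is near maxHash / n,
// so keep values within some safety factor c of that cutoff
def sparseSignature(hashes : Array[Long], estimatedSetSize : Long, c : Double) : Seq[(Int, Long)] = {
  val cutoff = (c * (Long.MaxValue.toDouble / estimatedSetSize)).toLong
  hashes.zipWithIndex.collect { case (h, i) if h <= cutoff => (i, h) }
}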

Collaborator

avibryant commented Mar 19, 2013

So what we'd probably want to do is produce a CountMinSketch to estimate the sizes of each set, then distribute that out to each node and use it when constructing the sparse minhashes...

Contributor

koertkuipers commented Mar 19, 2013

found it. sounds like you're rebuilding the mahout.math vector classes?


Contributor

koertkuipers commented Mar 19, 2013

does that mean you have to first run an entire map-reduce phase for the CountMinSketch?

since the minhashing happens mostly map-side (both the creation and the monoid's plus operation), i think one can assume that the result coming out of a mapper would be dense already. i wonder what the gains would be of a smarter implementation. ok i will read the paper again and be back.


Collaborator

avibryant commented Mar 19, 2013

Re mahout.math - probably, and it's also probably easier to do that than to find a way to reuse them... :/ Though it would be nice if that turned out not to be true.

And yes, you first have to run a full MR phase to produce the CountMinSketch and then distribute it out. As for whether the outputs from each mapper will be dense already, I think that if the values for the keys are evenly distributed over N mappers, the vectors you produce should be only c/N full, where c is some small constant that you choose based on how worried you are about estimation error. For cases where you have a very large number of mappers, the savings should be significant.

Also, if you have a large number of keys, you should be able to keep more of them in memory on the map side if you're keeping smaller sparse vectors for each, so not only are the vectors you produce smaller, but you produce fewer of them (because you don't have to spill your cache as often).

Contributor

koertkuipers commented Mar 19, 2013

ok just read the paper. if you wanted to generate the minhash of a word from the minhashes of the documents it is part of (as described in the research paper), you indeed have to go through shuffle-and-sort, and this optimization is relevant and useful.

however i don't have this need right now. i am more interested in the minhash of a document given the minhashes of its words, something that can be done entirely map-side. so i'm not sure i can justify focusing on this right now.


Collaborator

avibryant commented Mar 19, 2013

Right, so in your case for any given key (document), the number of mappers it's distributed over is 1, so there's no benefit here.

Collaborator

johnynek commented Apr 8, 2013

Koert, sorry for the latency.

This looks good to me. I'll merge after releasing the last of the 0.1.x series (probably in a couple of weeks I'll close this out).

This breaks binary compatibility, so I don't want to put it in 0.1.x.

Thanks for the work.

Contributor

koertkuipers commented Apr 9, 2013

great. i am in no rush


Contributor

koertkuipers commented Apr 9, 2013

would it be ok if i make MinHasher extend Serializable? i would like to be able to use it in scalding functions

Collaborator

johnynek commented Apr 9, 2013

Monoid is serializable, but the object should have extends Serializable.

Please do add that!
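For reference, the shape of the requested change (just the extends clause; the object's contents are unchanged):

// the companion object must itself extend Serializable so that scalding
// closures capturing it can be serialized and shipped to the tasks
object MinHasher extends Serializable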

@johnynek johnynek commented on the diff Jun 19, 2013

...e/src/main/scala/com/twitter/algebird/MinHasher.scala
@@ -3,6 +3,32 @@ package com.twitter.algebird
import java.nio._
/**
+ * MinHasher as a Monoid operates on this class to avoid the too generic Array[Byte].
+ * The bytes are assumed to be never modified. The only reason we did not use IndexedSeq[Byte] instead of Array[Byte] is
+ * because a ByteBuffer is used internally in MinHasher and it can wrap Array[Byte].
+ */
+case class MinHashSignature(bytes: Array[Byte])
+
+object MinHasher {
@johnynek

johnynek Jun 19, 2013

Collaborator

If you add extends Serializable here I will merge (if the tests are green).

@koertkuipers

koertkuipers Jun 19, 2013

Contributor

isn't a case class always Serializable? or are you referring to the object?

@johnynek

johnynek Jun 19, 2013

Collaborator

the object. I think you are right about case class.

@johnynek johnynek commented on the diff Jun 19, 2013

...e/src/main/scala/com/twitter/algebird/MinHasher.scala
@@ -99,36 +125,34 @@ abstract class MinHasher[H](targetThreshold : Double, maxBytes : Int)(implicit n
/** useful for understanding the effects of numBands and numRows */
def probabilityOfInclusion(sim : Double) = 1.0 - math.pow(1.0 - math.pow(sim, numRows), numBands)
- /** numerically solve the inverse of estimatedThreshold, given numBands*numRows */
- def pickBands(threshold : Double, hashes : Int) = {
- val target = hashes * -1 * math.log(threshold)
- var bands = 1
- while(bands * math.log(bands) < target)
- bands += 1
- bands
- }
-
/** Maximum value the hash can take on (not 2*hashSize because of signed types) */
def maxHash : H
@johnynek

johnynek Jun 19, 2013

Collaborator

UpperBounded[H] would be good here

@koertkuipers

koertkuipers Jun 19, 2013

Contributor

can you explain? i don't follow. thx

@johnynek

johnynek Jun 20, 2013

Collaborator

I was just saying if we implemented #166 we could use that typeclass here.
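A minimal sketch of that typeclass (hypothetical; #166 was a proposal at the time, so the names are illustrative):

// an UpperBounded instance would supply the maximum value of H,
// letting it replace the abstract maxHash member here
trait UpperBounded[T] { def maxBound : T }

object UpperBounded {
  implicit val longUpperBound : UpperBounded[Long] =
    new UpperBounded[Long] { def maxBound = Long.MaxValue }
}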

@koertkuipers

koertkuipers Jun 20, 2013

Contributor

ah ok got it


@johnynek johnynek commented on an outdated diff Jun 19, 2013

...e/src/main/scala/com/twitter/algebird/MinHasher.scala
/** Estimate Jaccard similarity (size of intersection / size of union) */
- def similarity(left : Array[Byte], right : Array[Byte]) = {
- val matching = buildArray(left,right){(l,r) => if(l == r) n.one else n.zero}
- matching.map{_.toDouble}.sum / numHashes
- }
+ def similarity(left : Array[Byte], right : Array[Byte]) : Double =
@johnynek

johnynek Jun 19, 2013

Collaborator

shouldn't this take MinHashSignatures?

@johnynek johnynek commented on an outdated diff Jun 19, 2013

...e/src/main/scala/com/twitter/algebird/MinHasher.scala
/** Set union */
- def plus(left : Array[Byte], right : Array[Byte]) = {
- buildArray(left, right){(l,r) => n.min(l, r)}
+ def plus(left : Array[Byte], right : Array[Byte]) : Array[Byte] =
@johnynek

johnynek Jun 19, 2013

Collaborator

MinHashSignatures here?

@johnynek johnynek commented on an outdated diff Jun 19, 2013

...e/src/main/scala/com/twitter/algebird/MinHasher.scala
/** Bucket keys to use for quickly finding other similar items via locality sensitive hashing */
- def buckets(sig : Array[Byte]) = {
- sig.grouped(numRows*hashSize).toList.map{band =>
- val (long1, long2) = hashFunctions.head(band)
- long1
- }
- }
-
+ def buckets(sig : Array[Byte]) : List[Long] =
@johnynek

johnynek Jun 19, 2013

Collaborator

MinHashSignature?

@johnynek johnynek added a commit that referenced this pull request Jun 20, 2013

@johnynek johnynek Merge pull request #144 from tresata/feature/hasher-kk
added constructors for MinHasher and subclasses that allow the number of...
fe19b2f

@johnynek johnynek merged commit fe19b2f into twitter:develop Jun 20, 2013

1 check passed

default: The Travis CI build passed