Add Hash, fix some minor issues #38

johnynek · 2015-01-16T18:55:49Z

Adds Hash, and gives it an implicit to get a Hashing.
Adds an implicit Equiv from Eq.
Adds some @sp annotations.
Fixes some comments.

Todo: add some hash laws.

Question: should we add LongHash (and LongLongHash)? It seems like those are the main things we use in algebird (LongLong in HyperLogLog).

avibryant · 2015-01-16T19:00:47Z

I think LongHash and StringHash are the two critical instances.

non · 2015-01-16T19:17:20Z

From my point of view, the one necessary law to encode is that eqv(a, b) implies hash(a) == hash(b).
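A minimal sketch of how that law could be checked for a single pair of values. The `Eq` and `Hash` traits here are simplified stand-ins, not the real algebra definitions, and `caseInsensitive` and `hashLaw` are hypothetical names used only for illustration.

```scala
object HashLawSketch {
  // Simplified stand-ins for the type classes discussed in this PR.
  trait Eq[A] { def eqv(x: A, y: A): Boolean }
  trait Hash[A] extends Eq[A] { def hash(a: A): Int }

  // Example instance: strings compared and hashed through their lower-cased form.
  val caseInsensitive: Hash[String] = new Hash[String] {
    def eqv(x: String, y: String): Boolean = x.toLowerCase == y.toLowerCase
    def hash(a: String): Int = a.toLowerCase.hashCode
  }

  // The law: eqv(a, b) implies hash(a) == hash(b).
  // Written as "not eqv, or hashes agree" so unequal pairs pass vacuously.
  def hashLaw[A](h: Hash[A])(a: A, b: A): Boolean =
    !h.eqv(a, b) || h.hash(a) == h.hash(b)
}
```

A property-based test would call `hashLaw` on many generated pairs rather than on hand-picked ones.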

avibryant · 2015-01-16T19:47:49Z

This may be out of scope but do we want to encode the idea of an independent hash family, here? At least in algebird, just a single hash function is not that valuable.

avibryant · 2015-01-16T19:49:10Z

The laws there would be that for a family of hashes given by hash(a,i), eqv(a,b) implies hash(a,i) == hash(b,i) for all i, and that hash(a,i) is uncorrelated with hash(a,j) for all i,j

non · 2015-01-16T20:04:31Z

I like that idea (and in fact the idea of correlation is useful for testing a single hash function too) but defining it correctly is probably outside the scope of a single property, i.e. you'd probably need to build a histogram of a lot of calls and inspect it (and occasionally the test might fail).

Or is there a better strategy?

johnynek · 2015-01-16T20:13:58Z

how about something like:

trait HashFamily[A] extends Hash[A] {
  override def hash(a: A): Int = hash(a, 1)
  def hash(a: A, idx: Int): Int
}

I guess we want Prob(hash(a, x) == hash(a, y)) = 1.0/Int.MaxValue for x != y.

Testing that is possible (approximately) using Chebyshev inequalities.
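One way the Chebyshev idea could look in code, as a rough sketch rather than anything from this PR. It counts how often two members of a family agree on the low bits of their output and checks that count against a Chebyshev bound. `IntHashFamily`, `family`, `collisions`, and `withinChebyshev` are hypothetical names, the mixer is MurmurHash3's 32-bit finalizer, and the keys are treated as independent trials, which is an assumption.

```scala
object CollisionCheckSketch {
  // Hypothetical stand-in for the HashFamily above, fixed to Int keys for brevity.
  trait IntHashFamily { def hash(a: Int, idx: Int): Int }

  // MurmurHash3's 32-bit finalizer, used here only as a convenient mixer.
  def fmix32(h0: Int): Int = {
    var h = h0
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  // Toy family: perturb the key with an index-dependent constant, then mix.
  val family: IntHashFamily = new IntHashFamily {
    def hash(a: Int, idx: Int): Int = fmix32(a ^ (idx * 0x9e3779b9))
  }

  // Count the keys whose low `bits` bits agree under members x and y.
  def collisions(f: IntHashFamily, keys: Range, x: Int, y: Int, bits: Int): Int = {
    val mask = (1 << bits) - 1
    keys.count(a => (f.hash(a, x) & mask) == (f.hash(a, y) & mask))
  }

  // Chebyshev: P(|X - mean| >= k * sd) <= 1 / k^2.
  // Assuming n independent trials with collision probability p, X is binomial
  // with mean n * p and variance n * p * (1 - p).
  def withinChebyshev(observed: Int, n: Int, p: Double, k: Double): Boolean = {
    val mean = n * p
    val sd = math.sqrt(n * p * (1 - p))
    math.abs(observed - mean) < k * sd
  }
}
```

With `bits = 4` the ideal collision probability is 1/16, so a check such as `withinChebyshev(collisions(family, 0 until 10000, 1, 2, 4), 10000, 1.0 / 16, 5.0)` should reject an ideal family at most 1/25 of the time. Masking to a few bits keeps collisions frequent enough to count.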

non · 2015-01-16T20:16:09Z

I like that design.

I agree that standard strategies for testing RNGs should be able to test hashing functions (and families of functions) as well.

non · 2015-01-16T20:18:37Z

(We have some tests I could port over from Spire. Relatedly, here's a chapter on testing RNGs by John Cook.)

avibryant · 2015-01-16T20:29:11Z

Feels like we might want to add size to HashFamily? But otherwise that feels right.

johnynek · 2015-01-16T20:31:57Z

@avibryant yes. I guess we really want a finite set type. I guess since we don't have dependent types, we'll just have the size and require that users pass something in the range.

non · 2015-01-16T20:34:04Z

I would have guessed that for many of these things any Int value would work. Is that not true? A lot of RNG functions I'm familiar with treat all the Int values as unsigned and are totally fine with any value.

avibryant · 2015-01-16T23:03:47Z

@non I guess in that case it can return Int.MaxValue as size? :) Just feels more comfortable to have a way to check rather than running into trouble by passing an arg that is out of bounds.

non · 2015-01-16T23:04:53Z

Sure, that seems fine.
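A sketch of what a size-aware family might look like, assuming the index is checked at the call site. The split into `hash` and `unsafeHash` and the toy instance `small` are illustrative choices, not part of the PR.

```scala
object SizedFamilySketch {
  trait HashFamily[A] {
    // Number of distinct hash functions; valid indices are 0 until size.
    // A family that accepts any non-negative Int could return Int.MaxValue.
    def size: Int

    // Implementations supply this; callers go through `hash` for the bounds check.
    protected def unsafeHash(a: A, idx: Int): Int

    final def hash(a: A, idx: Int): Int = {
      require(idx >= 0 && idx < size, s"index $idx outside [0, $size)")
      unsafeHash(a, idx)
    }
  }

  // A toy family of four members over Int keys (not a good hash, just a placeholder).
  val small: HashFamily[Int] = new HashFamily[Int] {
    val size = 4
    protected def unsafeHash(a: Int, idx: Int): Int = (a + idx) * 0x9e3779b9
  }
}
```

Passing an out-of-range index then fails fast with an `IllegalArgumentException` instead of silently producing a value.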

This commit introduces a HashTest that can be used to ensure that a hashing function's output is uniformly distributed with 95% confidence. It also introduces an instance of Hash[Long], which uses Knuth's MMIX RNG algorithm, as a test case. One proviso is that as-written, Order[A] and Hash[A] conflict (they both provide `on` methods). Since Order and Hash both extend Eq, it's not possible to provide separate instances without conflicting implicits. For now, I have commented-out Hash#on to avoid these issues. We should discuss what the correct solution is before merging this PR.

non · 2015-01-17T17:57:59Z

So I presumptuously added to this PR in order to get a basic statistical test in place. What do you all think?

I'm not a statistician, so I based this test on code from here. I feel like the test is statistically-valid but I may be wrong.
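For readers who want a feel for this kind of check, here is a small sketch of a chi-squared uniformity test over 16 buckets. It is not the HashTest from this PR. The names `chiSquared`, `Critical15`, and `looksUniform` are made up for the example, and 24.996 is the standard chi-squared critical value for 15 degrees of freedom at the 0.05 level.

```scala
object UniformitySketch {
  // Pearson chi-squared statistic for how evenly `hash` spreads `inputs`
  // over 2^bits buckets (buckets are chosen by the low bits of the hash).
  def chiSquared[A](inputs: Seq[A], bits: Int)(hash: A => Int): Double = {
    val buckets = 1 << bits
    val counts = new Array[Long](buckets)
    inputs.foreach { a => counts(hash(a) & (buckets - 1)) += 1 }
    val expected = inputs.size.toDouble / buckets
    counts.map { c => val d = c - expected; d * d / expected }.sum
  }

  // Critical value for 15 degrees of freedom (16 buckets) at 95% confidence.
  val Critical15: Double = 24.996

  // True when the statistic does not exceed the critical value.
  def looksUniform[A](inputs: Seq[A])(hash: A => Int): Boolean =
    chiSquared(inputs, 4)(hash) <= Critical15
}
```

As noted earlier in the thread, a test like this can fail occasionally even for a good hash function, roughly 5% of the time at this confidence level when the inputs are random.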

Importantly, I noted in the commit message that since Hash[A] and Order[A] both extend Eq[A] (and both define an on method) we have created a problematic diamond inheritance situation. For now I commented out Hash#on but that's not necessarily a great solution. Here are some options that I see:

1. Have Hash[A] stop extending Eq[A]
2. Have Hash[A] extend Order[A]
3. Create a Hash[A] with Order[A] super trait that most things could use.
4. Make my commenting permanent and delete the code.

I think complex numbers mean that (2) is not great. Almost all uses of hash codes rely on Eq as well so I could imagine resistance to (1). I find (3) ugly but maybe it is OK? (4) seems unsatisfying.

What do you all think? This is a place where Algebird has done a lot more work than Spire so I'm happy to defer to @johnynek and @avibryant here.

johnynek · 2015-01-18T04:01:46Z

core/src/main/scala/algebra/Hash.scala

+  //  * using the given function `f`.
+  //  */
+  // def by[@sp A, @sp B](f: A => B)(implicit ev: Hash[B]): Hash[A] =
+  //   ev.on(f)


we could leave it here but comment it out above. Just a thought.

That's a good point. Maybe on doesn't belong on any of the type classes, and we can just use by in all cases.
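A sketch of the "use by in all cases" idea, with simplified traits that omit `@sp` and everything else not needed here. Because neither trait defines `on`, a single instance can mix in both `Order` and `Hash` without two inherited methods clashing. The instance and value names are hypothetical.

```scala
object ByInsteadOfOnSketch {
  trait Eq[A] { def eqv(x: A, y: A): Boolean }
  trait Order[A] extends Eq[A] {
    def compare(x: A, y: A): Int
    def eqv(x: A, y: A): Boolean = compare(x, y) == 0
  }
  trait Hash[A] extends Eq[A] { def hash(a: A): Int }

  // The derivation lives on the companion objects, not on the traits.
  object Order {
    def by[A, B](f: A => B)(implicit ev: Order[B]): Order[A] = new Order[A] {
      def compare(x: A, y: A): Int = ev.compare(f(x), f(y))
    }
  }
  object Hash {
    def by[A, B](f: A => B)(implicit ev: Hash[B]): Hash[A] = new Hash[A] {
      def eqv(x: A, y: A): Boolean = ev.eqv(f(x), f(y))
      def hash(a: A): Int = ev.hash(f(a))
    }
  }

  // One instance implementing both type classes; eqv comes from Order.
  implicit val intInstance: Order[Int] with Hash[Int] = new Order[Int] with Hash[Int] {
    def compare(x: Int, y: Int): Int = java.lang.Integer.compare(x, y)
    def hash(a: Int): Int = a
  }

  // Derived instances for String, keyed on length.
  val byLengthHash: Hash[String] = Hash.by((s: String) => s.length)
  val byLengthOrder: Order[String] = Order.by((s: String) => s.length)
}
```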

johnynek · 2015-01-18T04:17:52Z

Actually, we never found a great solution for hashing in algebird and tried many different ideas. Count-min-sketch uses the hash family approach Avi mentioned (though without an explicit size). Other algorithms use a fixed hash and require a serialization which is used to generate a hash on the item in question.

HyperLogLog, to count N uniques, needs something like log N + 1/sqrt(error) bits. So to count a few billion, something like 50 bits are needed. A long would probably be enough, but since you are going to use a byte later in the algorithm, it costs nothing to use a hash up to 256 bits in length, which can clearly count astronomical sets. We actually use murmur128 and so, again, Hash or LongHash would both be insufficient.

Hash is really something like an equivalence, so I wondered if we wanted:

/**
 * Law is Eq.eqv(a, b) implies Eq.eqv(hash(a), hash(b))
 */
trait Hash[K, R] extends Eq[K] {
  def resEq: Eq[R]
  def hash(k: K): R
}

But then, that felt like a false generality, because really hashing is more than that: it is related to a random projection of the item onto the interval [0, 1), and without that you've lost the ability to do much with it (like build hash tables, or all the nice sketch algorithms).

So that left us with either something like: K => Real or practically: K => Int, K => Long, K => (Long, Long). Since Scala specializes Tuple2 on Long, I figured the latter would be fast enough. You could build a Hash from a LongHash and a LongHash from a LongLongHash.
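A sketch of that narrowing chain, under the assumption that folding the wider result with XOR is acceptable. The trait shapes, the method names `longHash` and `longLongHash`, and the `toy` instance are illustrative guesses, not the algebird definitions.

```scala
object WidthSketch {
  trait Hash[A] { def hash(a: A): Int }
  trait LongHash[A] { def longHash(a: A): Long }
  trait LongLongHash[A] { def longLongHash(a: A): (Long, Long) }

  // 128 bits down to 64: fold the two halves together.
  def fromLongLong[A](h: LongLongHash[A]): LongHash[A] = new LongHash[A] {
    def longHash(a: A): Long = {
      val (hi, lo) = h.longLongHash(a)
      hi ^ lo
    }
  }

  // 64 bits down to 32: fold the high word into the low word.
  def fromLong[A](h: LongHash[A]): Hash[A] = new Hash[A] {
    def hash(a: A): Int = {
      val l = h.longHash(a)
      (l ^ (l >>> 32)).toInt
    }
  }

  // Toy 128-bit "hash" used only to exercise the conversions.
  val toy: LongLongHash[Long] = new LongLongHash[Long] {
    def longLongHash(a: Long): (Long, Long) = (a, 0L)
  }
}
```

Going the other way (widening an Int hash to 128 bits) would not add information, which is why the conversions only run in this direction.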

That said, I could never get much momentum behind it, and I find designs without buy in are often bad designs, so it is still an open PR in algebird, bitrotting.

For this minimal case, I tend to think the easiest way forward would be to move the on method into the companion objects and live with it. That or, as you suggest, make an OrderHash or something. Interesting to note, we are adding just such a typeclass to scalding now with an additional constraint: that we can serialize (and that is an equivalence) and we can compare on the serialized values and get the same result. Seems like a big typeclass, but it has several nice laws and it is exactly what Hadoop needs (or spark could use it) for external (on disk) partitioning of a stream by keys (for which you have an instance of this typeclass).

That's my brain dump on this.

avibryant · 2015-01-18T18:16:09Z

A further brain dump:

At least for algebird, hashing is fundamental to at least half of the interesting typeclasses, so it's pretty sad that we don't have any good abstractions for it. I don't know whether it belongs in algebra or not but I'm unwilling to accept that we shouldn't have a good design for this somewhere (and I trust the people working on this project to collectively be able to produce one). We can worry about actually migrating algebird to it as a second step.
A HashFamily is important not just to Count-Min-Sketch and its variants but also Min Hash Signature and BloomFilter - and there are others (like K-Min Values) that we don't have in algebird now but would want the same thing.
The property you actually want out of a hash family is that the collisions are independent, i.e., hash(a,i) == hash(b,i) is independent of hash(a,j) == hash(b,j). This is maybe a little stronger than hash(a,i) is independent of hash(a,j)?
The mapreduce-style "shuffle/sort" application is the only one I can think of where you care about the 3-way interaction of Hash/Eq/Order. It's an important one, but I do think it can be captured by a special-purpose typeclass like Oscar describes. I think that otherwise we might not care too much about the diamond inheritance, ie, we might have Hash+Eq typeclasses and Order+Eq typeclasses but we won't usually need or want one instance to implement all three?

non · 2015-01-18T18:52:12Z

It may be too far-afield but later tonight I'm going to post some demo code involving Spire's BitString[A] type class, a new Long128 type I just wrote, and a possible GenHash[A, O] and GenHashFamily[A, O] abstraction.

avibryant · 2015-01-18T19:53:35Z

I guess Hash seems plausibly more part of spire than algebra...

johnynek · 2015-01-18T22:37:27Z

It would be nice, in my opinion, if we could get this issue fixed here, not in spire and algebird.

That said, it is not very algebraic, and only touches the classes here in that it relates to Eq.

Maybe I should replace this PR with just the bug fixes and remove hash, and then we can add a separate Hash PR.

avibryant · 2015-01-18T22:46:34Z

💭 Maybe we should just model HashFamily[A] as Hash[(Int,A)]; that way we don't need a specific typeclass for it.
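A sketch of that encoding, assuming a simplified `Hash` trait without `Eq`. The instance `indexedString` and the helper `member` are hypothetical, and the mixer is MurmurHash3's 32-bit finalizer.

```scala
object TupleFamilySketch {
  trait Hash[A] { def hash(a: A): Int }

  // MurmurHash3's 32-bit finalizer, used here only as a convenient mixer.
  def fmix32(h0: Int): Int = {
    var h = h0
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  // The "family" is just an ordinary Hash instance on (index, value) pairs.
  implicit val indexedString: Hash[(Int, String)] = new Hash[(Int, String)] {
    def hash(p: (Int, String)): Int = fmix32(p._2.hashCode ^ (p._1 * 0x9e3779b9))
  }

  // Fixing the index recovers a single hash function on A.
  def member[A](idx: Int)(implicit family: Hash[(Int, A)]): Hash[A] = new Hash[A] {
    def hash(a: A): Int = family.hash((idx, a))
  }
}
```

One trade-off of this encoding is that a naive implementation allocates a tuple per call, which matters for the boxing concerns raised elsewhere in this thread.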

johnynek · 2015-01-19T03:15:40Z

@avibryant very interesting idea

non · 2015-01-19T03:25:59Z

I wasn't trying to suggest Hash[A] should go in Spire instead -- I was just mentioning that BitString[A] ends up being a useful abstraction for trying to model generic hashes (at least in the code I was playing with). The nice thing here is that you can abstract the hashing algorithm (or family) from the data type used, as long as you have a few constants available, with things like rotation, shifting, masking, and so on.

(Also, Spire has a notion of how to generate random values for a generic type using type classes using Dist[A] which would also be useful for the kind of hash families it looks like you use in count-min sketch.)

non · 2015-01-22T22:24:21Z

Hey, so I am still playing around with a possible design, but do you think this is something we need for an initial algebra milestone? Maybe we should wait a bit and add it later? Do you think I should merge the current design?

avibryant · 2015-01-22T23:34:20Z

I'd rather we get this right, and there's no rush to get it into the first milestone (algebird wouldn't use it right away anyway).

non · 2015-01-23T00:32:05Z

Yeah, that's my thinking as well.

johnynek · 2015-01-23T19:51:59Z

+1 to not rushing.

But we should get the specialization fixes and the implicit Equiv from Eq.

non · 2015-01-23T20:04:22Z

Oh, right! I'll make another PR with those now.

non · 2015-10-14T20:19:52Z

Picking this back up, I have a concrete proposal:

(1) Create a simple `Hash[A]` type class

The main consumers of this would be folks that want to implement hash tables, hash sets, and other structures which benefit from a single hashing function. It also enables us to provide a hash function with a good default distribution (unlike .hashCode).

For example, something like this (sans derived methods like .contramap, .hashMask, and so on):

trait Hash[@sp A] {
  def hash(a: A): Int
}

We could probably provide a default Hash[A] that re-hashes the result of .hashCode.
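A sketch of such a re-hashing default, assuming MurmurHash3's 32-bit finalizer as the mixing step (the proposal does not name one). `fromHashCode` is a hypothetical constructor name, and `@sp` is omitted to keep the example self-contained.

```scala
object DefaultHashSketch {
  trait Hash[A] { def hash(a: A): Int }

  // MurmurHash3's 32-bit finalizer: spreads the bits of a weak hashCode.
  def fmix32(h0: Int): Int = {
    var h = h0
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  object Hash {
    // Builds a Hash[A] by re-hashing the result of .hashCode.
    def fromHashCode[A]: Hash[A] = new Hash[A] {
      def hash(a: A): Int = fmix32(a.hashCode)
    }
  }
}
```

Values that are equal under `equals` still hash to the same result, since the mixer is a pure function of `.hashCode`.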

We might also choose to support re-hashing, since many structures (e.g. those in Debox) benefit from being able to re-hash on collisions. That might end up being too complex, so we can also leave it out.

Hash[A] would have laws written in terms of a related Eq[A] instance (although I don't think it should extend Eq) and would also have its own probabilistic laws in terms of the distribution of its outputs.

(2) Plan to support a `HashFamily[A]` type class

I'd like to use a separate type class to support hash families. I'm happy to support size/cardinality, or to assume that you can provide 0 to Int.MaxValue hash functions, whichever seems better. I also imagine that you could generate a Hash[A] instance from HashFamily[A] given an Int argument.

Again, we could provide a default HashFamily[A] implementation in terms of .hashCode with re-hashing.

Considerations

I'd like to avoid making Hash[A] return anything other than a primitive (to avoid boxing). However, if we decided we wanted more bits, we could use Long instead of Int. If we really needed more bits we could also support a "larger" hashing type class, such as Hash128 or Hash256, or something similar. My sense is that HashFamily[A] might cover this case but I'm open to something that returns Array[Int] or Array[Byte] if that seems better.

What do you all think? I feel like (1) might not be controversial, in which case we can move forward with it now.

rklaehn · 2015-12-12T10:24:05Z

One remark: I do not want this to automatically fall back on scala.util.hashing.Hashing. That causes a lot of problems in case the default hashing makes no sense (e.g. arrays, functions). If you have an automatic fallback, you get instances where you would rather not have them!
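To illustrate the concern, here is a sketch in which the `.hashCode`-based instance must be requested explicitly, while arrays get a content-based instance. The names `fromUniversalHashCode` and `intArrayHash` are hypothetical; `java.util.Arrays.hashCode` is the standard JDK method.

```scala
object NoFallbackSketch {
  trait Hash[A] { def hash(a: A): Int }

  object Hash {
    // Deliberately not implicit: callers opt in per type, so no instance
    // appears silently for types where .hashCode is not meaningful.
    def fromUniversalHashCode[A]: Hash[A] = new Hash[A] {
      def hash(a: A): Int = a.hashCode
    }
  }

  // Array.hashCode is identity-based on the JVM, so arrays get an instance
  // that hashes their contents instead.
  implicit val intArrayHash: Hash[Array[Int]] = new Hash[Array[Int]] {
    def hash(a: Array[Int]): Int = java.util.Arrays.hashCode(a)
  }
}
```

With this layout, asking for an implicit `Hash` of a function type simply fails to compile, which is the behavior the remark above argues for.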

denisrosset · 2017-11-03T13:52:47Z

Now that cats has a Hash class, this is mostly redundant; apart from the nice statistical test for the distribution of hashes!

What should we do?

johnynek · 2017-11-03T20:46:36Z

let's ditch it. We can upstream the statistical test if we want.

johnynek added 2 commits January 16, 2015 08:53

Add Hash

5f9be9a

Actually add the equiv

07a0fc8

non added 2 commits January 17, 2015 12:48

Add credit for rosettacode.

7b7d177

johnynek reviewed Jan 18, 2015

non added a commit that referenced this pull request Jan 23, 2015

Make algebra.Eq[A] compatible with scala.math.Equiv[A].

91f8b9e

These changes were ported here from the 'hash' PR (#38). This commit adds an implicit from Eq[A] => Equiv[A]. It also adds missing @sp annotations.

non added a commit that referenced this pull request Jan 23, 2015

Make algebra.Eq[A] compatible with scala.math.Equiv[A].

f43004a

These changes were ported here from the 'hash' PR (#38). This commit adds an implicit from Eq[A] => Equiv[A]. It also adds missing @sp annotations.

non mentioned this pull request Jan 23, 2015

Make algebra.Eq[A] compatible with scala.math.Equiv[A]. #42

Merged

johnynek mentioned this pull request Aug 27, 2015

immutable HashSet/HashMap using Eq/Hash/Order #47

Open

rklaehn mentioned this pull request Oct 29, 2015

Add Show typeclass #106

Closed

denisrosset mentioned this pull request Nov 17, 2015

Add Hash-typeclass based containers denisrosset/metal#7

Open

adelbertc mentioned this pull request May 19, 2017

Add Hash typeclasses to cats typelevel/cats#1690

Closed

johnynek closed this Nov 3, 2017

larsrh deleted the hash branch May 4, 2019 14:11
