This repository has been archived by the owner on Feb 8, 2022. It is now read-only.

Add Hash, fix some minor issues #38

Closed
wants to merge 4 commits into from

johnynek
Contributor

  • Adds Hash, gives it an implicit to get Hashing.
  • Adds an implicit Equiv from Eq
  • Adds some @sp annotations
  • Fixes some comments

Todo: add some hash laws.

Question: should we add LongHash (and LongLongHash)? It seems like those are the main things we use in algebird (LongLong in HyperLogLog).

@avibryant
Collaborator

I think LongHash and StringHash are the two critical instances.

@non
Contributor

non commented Jan 16, 2015

From my point of view, the one necessary law to encode is that eqv(a, b) implies hash(a) == hash(b).
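
The law above can be sketched as a plain predicate. This is only an illustration: the `Eq` and `Hash` traits below are minimal stand-ins with assumed shapes, and `HashLaw`, `mod10`, and `broken` are hypothetical names, not part of this PR.

```scala
// Minimal stand-ins for the type classes under discussion (assumed shapes).
trait Eq[A] { def eqv(x: A, y: A): Boolean }
trait Hash[A] extends Eq[A] { def hash(a: A): Int }

object HashLaw {
  // The law: eqv(a, b) implies hash(a) == hash(b).
  def holds[A](h: Hash[A], a: A, b: A): Boolean =
    !h.eqv(a, b) || h.hash(a) == h.hash(b)

  // Hypothetical instance: Ints are equal when they agree modulo 10.
  val mod10: Hash[Int] = new Hash[Int] {
    def eqv(x: Int, y: Int): Boolean =
      Math.floorMod(x, 10) == Math.floorMod(y, 10)
    def hash(a: Int): Int = Math.floorMod(a, 10)
  }

  // Same equivalence, but the hash ignores it, so the law fails.
  val broken: Hash[Int] = new Hash[Int] {
    def eqv(x: Int, y: Int): Boolean =
      Math.floorMod(x, 10) == Math.floorMod(y, 10)
    def hash(a: Int): Int = a
  }
}
```

A property-based test would call `HashLaw.holds` on generated pairs; for inputs that are not `eqv` the law holds vacuously.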

@avibryant
Collaborator

This may be out of scope but do we want to encode the idea of an independent hash family, here? At least in algebird, just a single hash function is not that valuable.

@avibryant
Collaborator

The laws there would be that for a family of hashes given by hash(a,i), eqv(a,b) implies hash(a,i) == hash(b,i) for all i, and that hash(a,i) is uncorrelated with hash(a,j) for all i != j.

@non
Contributor

non commented Jan 16, 2015

I like that idea (and in fact the idea of correlation is useful for testing a single hash function too) but defining it correctly is probably outside the scope of a single property, i.e. you'd probably need to build a histogram of a lot of calls and inspect it (and occasionally the test might fail).

Or is there a better strategy?

@johnynek
Contributor Author

how about something like:

trait HashFamily[A] extends Hash[A] {
  // explicit result type: Scala cannot infer it for an overloaded method
  override def hash(a: A): Int = hash(a, 1)
  def hash(a: A, idx: Int): Int
}

I guess we want Prob(hash(a, x) == hash(a, y)) = 1.0/Int.MaxValue.

Testing that is possible (approximately) using Chebyshev's inequality.
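
As a rough illustration of the design above, here is a sketch of a family plus an empirical collision count over index pairs. Everything beyond the `HashFamily` trait is an assumption: `FamilyCheck`, `strings`, and `collisionRate` are hypothetical names, and `fmix32` is the MurmurHash3 32-bit finalizer used as a stand-in mixer. It does not apply Chebyshev's inequality; it only produces the observed rate that such a bound would be compared against.

```scala
trait Hash[A] { def hash(a: A): Int }

trait HashFamily[A] extends Hash[A] {
  override def hash(a: A): Int = hash(a, 1)
  def hash(a: A, idx: Int): Int
}

object FamilyCheck {
  // MurmurHash3 32-bit finalizer; a bijection on Int.
  def fmix32(h0: Int): Int = {
    var h = h0
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  // Hypothetical family: mix the index, combine with hashCode, mix again.
  val strings: HashFamily[String] = new HashFamily[String] {
    def hash(a: String, idx: Int): Int = fmix32(a.hashCode ^ fmix32(idx))
  }

  // Fraction of index pairs x < y < n with hash(a, x) == hash(a, y).
  def collisionRate[A](f: HashFamily[A], a: A, n: Int): Double = {
    var collisions = 0
    var pairs = 0
    for (x <- 0 until n; y <- (x + 1) until n) {
      pairs += 1
      if (f.hash(a, x) == f.hash(a, y)) collisions += 1
    }
    collisions.toDouble / pairs
  }
}
```

Because this particular mixer is a bijection in the index, distinct indices never collide for a fixed key, so the observed rate is exactly zero here; a family built on a non-invertible combination would show a small nonzero rate instead.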

@non
Contributor

non commented Jan 16, 2015

I like that design.

I agree that standard strategies for testing RNGs should be able to test hashing functions (and families of functions) as well.

@non
Contributor

non commented Jan 16, 2015

(We have some tests I could port over from Spire. Relatedly, here's a chapter on testing RNGs by John Cook.)

@avibryant
Collaborator

Feels like we might want to add size to HashFamily? But otherwise that feels right.

@johnynek
Contributor Author

@avibryant yes. I guess we really want a finite set type. I guess since we don't have dependent types, we'll just have the size and require that users pass something in the range.

@non
Contributor

non commented Jan 16, 2015

I would have guessed that for many of these things any Int value would work. Is that not true? A lot of RNG functions I'm familiar with treat all the Int values as unsigned and are totally fine with any value.

@avibryant
Collaborator

@non I guess in that case it can return Int.MaxValue as size? :) Just feels more comfortable to have a way to check rather than running into trouble by passing an arg that is out of bounds.

@non
Contributor

non commented Jan 16, 2015

Sure, that seems fine.
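
One way the size idea could look, as a sketch only: the family reports how many functions it has and rejects indices outside that range. The `hashAt` hook, the `SizedFamily` object, and the toy hash formula are assumptions made for illustration.

```scala
// Hypothetical shape: a hash family that knows how many functions it has.
trait HashFamily[A] {
  def size: Int

  // Implementations supply the raw hash; callers go through `hash`.
  protected def hashAt(a: A, idx: Int): Int

  // Bounds-checked entry point, standing in for a finite index type.
  final def hash(a: A, idx: Int): Int = {
    require(idx >= 0 && idx < size, s"index $idx is outside [0, $size)")
    hashAt(a, idx)
  }
}

object SizedFamily {
  // Toy family of n functions over String (not a good hash, just a demo).
  def strings(n: Int): HashFamily[String] = new HashFamily[String] {
    def size: Int = n
    protected def hashAt(a: String, idx: Int): Int = 31 * a.hashCode + idx
  }
}
```

A family that accepts any Int could simply report `Int.MaxValue` as its size, as suggested above, and the check would then only reject negative indices.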

This commit introduces a HashTest that can be used to ensure
that a hashing function's output is uniformly distributed
with 95% confidence.

It also introduces an instance of Hash[Long], which uses
Knuth's MMIX RNG algorithm, as a test case.

One proviso is that as-written, Order[A] and Hash[A] conflict
(they both provide `on` methods). Since Order and Hash both
extend Eq, it's not possible to provide separate instances
without conflicting implicits.

For now, I have commented-out Hash#on to avoid these issues.
We should discuss what the correct solution is before merging
this PR.
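
The kind of uniformity check the commit message describes can be sketched with a chi-squared statistic over bucket counts. This is not the HashTest from the commit: `UniformityCheck` and its methods are hypothetical, the choice of 16 buckets taken from the top 4 bits is an assumption, and applying a single MMIX step as a mixer is a guess at how the Hash[Long] instance works.

```scala
object UniformityCheck {
  // One step of Knuth's MMIX linear congruential generator, used as a mixer.
  def mmix(x: Long): Long =
    x * 6364136223846793005L + 1442695040888963407L

  // Chi-squared statistic for hashes of 0 until samples, bucketed by the
  // top 4 bits of each hash (16 buckets).
  def chiSquared(hash: Long => Long, samples: Int): Double = {
    val buckets = new Array[Int](16)
    var i = 0
    while (i < samples) {
      buckets((hash(i.toLong) >>> 60).toInt) += 1
      i += 1
    }
    val expected = samples.toDouble / 16
    buckets.map { c => val d = c - expected; d * d / expected }.sum
  }

  // Chi-squared critical value for 15 degrees of freedom at the 5% level.
  val Critical95: Double = 24.996

  // True when the uniformity hypothesis is not rejected at the 5% level.
  def looksUniform(hash: Long => Long, samples: Int): Boolean =
    chiSquared(hash, samples) < Critical95
}
```

A call such as `UniformityCheck.looksUniform(UniformityCheck.mmix, 16000)` would exercise the MMIX mixer. When the inputs are random rather than sequential, even a genuinely uniform hash will exceed the critical value about 5% of the time, which is the occasional failure mentioned earlier in the thread.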
@non
Contributor

non commented Jan 17, 2015

So I presumptuously added to this PR in order to get a basic statistical test in place. What do you all think?

I'm not a statistician, so I based this test on code from here. I feel like the test is statistically-valid but I may be wrong.

Importantly, I noted in the commit message that since Hash[A] and Order[A] both extend Eq[A] (and both define an on method) we have created a problematic diamond inheritance situation. For now I commented out Hash#on but that's not necessarily a great solution. Here are some options that I see:

  1. Have Hash[A] stop extending Eq[A]
  2. Have Hash[A] extend Order[A]
  3. Create a Hash[A] with Order[A] super trait that most things could use.
  4. Make my commenting permanent and delete the code.

I think complex numbers mean that (2) is not great. Almost all uses of hash codes rely on Eq as well so I could imagine resistance to (1). I find (3) ugly but maybe it is OK? (4) seems unsatisfying.

What do you all think? This is a place where Algebird has done a lot more work than Spire so I'm happy to defer to @johnynek and @avibryant here.

// * using the given function `f`.
// */
// def by[@sp A, @sp B](f: A => B)(implicit ev: Hash[B]): Hash[A] =
// ev.on(f)
Contributor Author

we could leave it here but comment it out above. Just a thought.

Contributor

That's a good point. Maybe on doesn't belong on any of the type classes, and we can just use by in all cases.
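
A sketch of what "just use by" could look like: the traits carry no `on` method, and each companion object offers its own `by`, so a single instance can mix in both Order and Hash without a clash. The trait bodies here are simplified stand-ins, and `Instances.int` is a hypothetical instance.

```scala
trait Eq[A] { def eqv(x: A, y: A): Boolean }

trait Order[A] extends Eq[A] {
  def compare(x: A, y: A): Int
  def eqv(x: A, y: A): Boolean = compare(x, y) == 0
}

trait Hash[A] extends Eq[A] { def hash(a: A): Int }

object Order {
  // Derive an Order[A] by mapping into a type that already has one.
  def by[A, B](f: A => B)(implicit ev: Order[B]): Order[A] = new Order[A] {
    def compare(x: A, y: A): Int = ev.compare(f(x), f(y))
  }
}

object Hash {
  // Derive a Hash[A] the same way; no method on the trait is needed.
  def by[A, B](f: A => B)(implicit ev: Hash[B]): Hash[A] = new Hash[A] {
    def eqv(x: A, y: A): Boolean = ev.eqv(f(x), f(y))
    def hash(a: A): Int = ev.hash(f(a))
  }
}

object Instances {
  // One instance implementing both type classes, with nothing to disambiguate.
  val int: Order[Int] with Hash[Int] = new Order[Int] with Hash[Int] {
    def compare(x: Int, y: Int): Int = java.lang.Integer.compare(x, y)
    def hash(a: Int): Int = a
  }
}
```

For example, `Hash.by[String, Int](_.length)(Instances.int)` hashes strings by length, and `Order.by[String, Int](_.length)(Instances.int)` orders them the same way.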

@johnynek
Contributor Author

Actually, we never found a great solution for hashing in algebird and tried many different ideas. Count-min-sketch uses the hash family approach Avi mentioned (though without an explicit size). Other algorithms use a fixed hash and require a serialization which is used to generate a hash on the item in question.

HyperLogLog, to count N uniques, needs something like log N + 1/sqrt(error) bits. So to count a few billion, something like 50 bits are needed. A long would probably be enough, but since you are going to use a byte later in the algorithm, it costs nothing to use a hash up to 256 bits in length, which can clearly count astronomical sets. We actually use murmur128 and so, again, Hash or LongHash would both be insufficient.

Hash is really something like an equivalence, so I wondered if we wanted:

/**
 * Law is Eq.eqv(a, b) implies Eq.eqv(hash(a), hash(b))
 */
trait Hash[K, R] extends Eq[K] {
  def resEq: Eq[R]
  def hash(k: K): R
}

But then, that felt like a false generality, because really hashing is more than that: it is related to a random projection of the item onto the interval [0, 1), and without that you've lost the ability to do much with it (like build hash tables, or all the nice sketch algorithms).

So that left us with either something like: K => Real or practically: K => Int, K => Long, K => (Long, Long). Since scala specializes Tuple2 on Long, I figured the latter would be fast enough. You could build a Hash from a LongHash and a LongHash from a LongLongHash.
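
The narrowing chain mentioned above (LongLongHash to LongHash to Hash) can be sketched as follows. The trait shapes, the method names `hash128` and `hash64`, and the xor-folding used to drop bits are all assumptions; the thread does not say how the real conversions would combine the halves.

```scala
// Hypothetical shapes for the three hash widths discussed above.
trait LongLongHash[A] { def hash128(a: A): (Long, Long) }
trait LongHash[A] { def hash64(a: A): Long }
trait Hash[A] { def hash(a: A): Int }

object Narrow {
  // 128 bits -> 64 bits by xor-ing the two halves together.
  def toLongHash[A](h: LongLongHash[A]): LongHash[A] = new LongHash[A] {
    def hash64(a: A): Long = {
      val (hi, lo) = h.hash128(a)
      hi ^ lo
    }
  }

  // 64 bits -> 32 bits by folding the high word into the low word.
  def toHash[A](h: LongHash[A]): Hash[A] = new Hash[A] {
    def hash(a: A): Int = {
      val x = h.hash64(a)
      (x ^ (x >>> 32)).toInt
    }
  }
}
```

Folding with xor keeps every input bit involved in the result; simply truncating would also work when the wider hash is already well mixed.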

That said, I could never get much momentum behind it, and I find designs without buy-in are often bad designs, so it is still an open PR in algebird, bitrotting.

For this minimal case, I tend to think the easiest way forward would be to move the on method into the companion objects and live with it. That or, as you suggest, make an OrderHash or something. Interesting to note, we are adding just such a typeclass to scalding now with an additional constraint: that we can serialize (and that is an equivalence) and we can compare on the serialized values and get the same result. Seems like a big typeclass, but it has several nice laws and it is exactly what Hadoop needs (or spark could use it) for external (on disk) partitioning of a stream by keys (for which you have an instance of this typeclass).

That's my brain dump on this.

@avibryant
Collaborator

A further brain dump:

  • At least for algebird, hashing is fundamental to at least half of the interesting typeclasses, so it's pretty sad that we don't have any good abstractions for it. I don't know whether it belongs in algebra or not but I'm unwilling to accept that we shouldn't have a good design for this somewhere (and I trust the people working on this project to collectively be able to produce one). We can worry about actually migrating algebird to it as a second step.
  • A HashFamily is important not just to Count-Min-Sketch and its variants but also Min Hash Signature and BloomFilter - and there are others (like K-Min Values) that we don't have in algebird now but would want the same thing.
  • The property you actually want out of a hash family is that the collisions are independent, i.e., hash(a,i) == hash(b,i) is independent of hash(a,j) == hash(b,j). This is maybe a little stronger than hash(a,i) is independent of hash(a,j)?
  • The mapreduce-style "shuffle/sort" application is the only one I can think of where you care about the 3-way interaction of Hash/Eq/Order. It's an important one, but I do think it can be captured by a special-purpose typeclass like Oscar describes. I think that otherwise we might not care too much about the diamond inheritance, i.e., we might have Hash+Eq typeclasses and Order+Eq typeclasses but we won't usually need or want one instance to implement all three?

@non
Contributor

non commented Jan 18, 2015

It may be too far-afield but later tonight I'm going to post some demo code involving Spire's BitString[A] type class, a new Long128 type I just wrote, and a possible GenHash[A, O] and GenHashFamily[A, O] abstraction.

@avibryant
Collaborator

I guess Hash seems plausibly more part of spire than algebra...

@johnynek
Contributor Author

It would be nice, in my opinion, if we could get this issue fixed here, not in spire and algebird.

That said, it is not very algebraic, and only touches the classes here in that it relates to Eq.

Maybe I should replace this PR with just the bug fixes and remove hash, and then we can add a separate Hash PR.

@avibryant
Collaborator

💭 Maybe we should just model HashFamily[A] as Hash[(Int,A)]; that way we don't need a specific typeclass for it.
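
A sketch of that encoding, under assumptions: `Hash` is reduced to a single method, `Family.indexed` and `Family.member` are hypothetical helpers, and the MurmurHash3 finalizer stands in for whatever mixing a real family would use.

```scala
trait Hash[A] { def hash(a: A): Int }

object Family {
  // MurmurHash3 32-bit finalizer; a bijection on Int.
  def fmix32(h0: Int): Int = {
    var h = h0
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  // A family is just a Hash over (index, value) pairs.
  def indexed[A](base: Hash[A]): Hash[(Int, A)] = new Hash[(Int, A)] {
    def hash(p: (Int, A)): Int = fmix32(fmix32(p._1) ^ base.hash(p._2))
  }

  // Fixing the index recovers an ordinary Hash[A].
  def member[A](family: Hash[(Int, A)], idx: Int): Hash[A] = new Hash[A] {
    def hash(a: A): Int = family.hash((idx, a))
  }
}
```

One cost of this encoding is that each call allocates a tuple unless the compiler can elide it, which matters for hot loops in sketch data structures.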

@johnynek
Contributor Author

@avibryant very interesting idea

@non
Contributor

non commented Jan 19, 2015

I wasn't trying to suggest Hash[A] should go in Spire instead -- I was just mentioning that BitString[A] ends up being a useful abstraction for trying to model generic hashes (at least in the code I was playing with). The nice thing here is that you can abstract the hashing algorithm (or family) from the data type used, as long as you have a few constants available, with things like rotation, shifting, masking, and so on.

(Also, Spire has a notion of how to generate random values for a generic type via the Dist[A] type class, which would also be useful for the kind of hash families it looks like you use in count-min sketch.)

@non
Contributor

non commented Jan 22, 2015

Hey, so I am still playing around with a possible design, but do you think this is something we need for an initial algebra milestone? Maybe we should wait a bit and add it later? Do you think I should merge the current design?

@avibryant
Collaborator

I'd rather we get this right, and there's no rush to get it into the first milestone (algebird wouldn't use it right away anyway).

@non
Contributor

non commented Jan 23, 2015

Yeah, that's my thinking as well.

@johnynek
Contributor Author

+1 to not rushing.

But we should get the specialization fixes and the implicit Equiv from Eq.

@non
Contributor

non commented Jan 23, 2015

Oh, right! I'll make another PR with those now.

non added a commit that referenced this pull request Jan 23, 2015
These changes were ported here from the 'hash' PR (#38).
This commit adds an implicit from Eq[A] => Equiv[A].
It also adds missing @sp annotations.
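
For readers unfamiliar with the conversion, an implicit from Eq[A] to Equiv[A] could look roughly like the sketch below. The `Eq` trait is a minimal stand-in and `eqToEquiv` is a hypothetical name; the actual commit may place and name the implicit differently.

```scala
// Minimal stand-in for algebra's Eq (assumed shape).
trait Eq[A] { def eqv(x: A, y: A): Boolean }

object Eq {
  // Adapts any Eq[A] to the standard library's scala.math.Equiv[A].
  // Callers would need this in implicit scope, e.g. via `import Eq._`.
  implicit def eqToEquiv[A](implicit ev: Eq[A]): Equiv[A] = new Equiv[A] {
    def equiv(x: A, y: A): Boolean = ev.eqv(x, y)
  }
}
```

With this in scope, code written against the standard `Equiv` abstraction can accept an `Eq` instance without a manual wrapper.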
@non
Contributor

non commented Oct 14, 2015

Picking this back up, I have a concrete proposal:

Create a simple Hash[A] type class

The main consumers of this would be folks that want to implement hash tables, hash sets, and other structures which benefit from a single hashing function. It also enables us to provide a hash function with a good default distribution (unlike .hashCode).

For example, something like this (sans derived methods like .contramap, .hashMask, and so on):

trait Hash[@sp A] {
  def hash(a: A): Int
}

We could probably provide a default Hash[A] that re-hashes the result of .hashCode.
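
A sketch of such a default, under assumptions: `Rehash.fromHashCode` is a hypothetical name, it is deliberately an explicit (non-implicit) constructor, and the MurmurHash3 finalizer is just one reasonable choice of re-hashing step.

```scala
trait Hash[A] { def hash(a: A): Int }

object Rehash {
  // MurmurHash3 32-bit finalizer: spreads entropy across all output bits.
  def fmix32(h0: Int): Int = {
    var h = h0
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  // Builds a Hash[A] by re-hashing .hashCode; the caller opts in explicitly.
  def fromHashCode[A]: Hash[A] = new Hash[A] {
    def hash(a: A): Int = fmix32(a.hashCode)
  }
}
```

The re-hash improves the bit distribution of clustered hash codes (small Ints, for instance) but cannot repair a `.hashCode` that already collides or that is identity-based.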

We might also choose to support re-hashing, since many structures (e.g. those in Debox) benefit from being able to re-hash on collisions. That might end up being too complex, so we can also leave it out.

Hash[A] would have laws written in terms of a related Eq[A] instance (although I don't think it should extend Eq) and would also have its own probabilistic laws in terms of the distribution of its outputs.

Plan to support a HashFamily[A] type class

I'd like to use a separate type class to support hash families. I'm happy to support size/cardinality, or to assume that you can provide 0 to Int.MaxValue hash functions, whichever seems better. I also imagine that you could generate a Hash[A] instance from HashFamily[A] given an Int argument.

Again, we could provide a default HashFamily[A] implementation in terms of .hashCode with re-hashing.

Considerations

I'd like to avoid making Hash[A] return anything other than a primitive (to avoid boxing). However, if we decided we wanted more bits, we could use Long instead of Int. If we really needed more bits we could also support a "larger" hashing type class, such as Hash128 or Hash256, or something similar. My sense is that HashFamily[A] might cover this case but I'm open to something that returns Array[Int] or Array[Byte] if that seems better.

What do you all think? I feel like (1) might not be controversial, in which case we can move forward with it now.

@rklaehn
Collaborator

rklaehn commented Dec 12, 2015

One remark: I do not want this to automatically fall back on scala.util.hashing.Hashing. That causes a lot of problems in case the default hashing makes no sense (e.g. arrays, functions). If you have an automatic fallback, you get instances where you would rather not have them!

@denisrosset
Contributor

Now that cats has a Hash class, this is mostly redundant; apart from the nice statistical test for the distribution of hashes!

What should we do?

@johnynek johnynek closed this Nov 3, 2017
@johnynek
Contributor Author

johnynek commented Nov 3, 2017

let's ditch it. We can upstream the statistical test if we want.

@larsrh larsrh deleted the hash branch May 4, 2019 14:11