-
Notifications
You must be signed in to change notification settings - Fork 69
Add Hash, fix some minor issues #38
Conversation
I think |
From my point of view, the one necessary law to encode is that |
This may be out of scope but do we want to encode the idea of an independent hash family, here? At least in algebird, just a single hash function is not that valuable. |
The laws there would be that for a family of hashes given by |
I like that idea (an in fact the idea of corrolation is useful for testing a single hash function too) but defining it correctly is probably outside the scope of a single property, i.e. you'd probably need to build a histogram of a lot of calls and inspect it (and occasionally the test might fail). Or is there a better strategy? |
how about something like: trait HashFamily[A] extends Hash[A] {
override def hash(a: A) = hash(a, 1)
def hash(a: A, idx: Int): Int
} I guess want Testing that possible (approximately) using Chebyshev inequalities. |
I like that design. I agree that standard strategies for testing RNGs should be able to test hashing functions (and families of functions) as well. |
(We have some tests I could port over from Spire. Relatedly, here's a chapter on testing RNGs by John Cook.) |
Feels like we might want to add |
@avibryant yes. I guess we really want a finite set type. I guess since we don't have dependent types, we'll just have the size and |
I would have guessed that for many of these things any |
@non I guess in that case it can return |
Sure, that seems fine. |
This commit introduces a HashTest that can be used to ensure that a hashing function's output is uniformly distributed with 95% confidence. It also introduces an instance of Hash[Long], which uses Knuth's MMIX RNG algorithm, as a test case. One proviso is that as-written, Order[A] and Hash[A] conflict (they both provide `on` methods). Since Order and Hash both extend Eq, it's not possible to provide separate instances without conflicting implicits. For now, I have commented-out Hash#on to avoid these issues. We should discuss what the correct solution is before merging this PR.
So I presumptuously added to this PR in order to get a basic statistical test in place. What do you all think? I'm not a statistician, so I based this test on code from here. I feel like the test is statistically-valid but I may be wrong. Importantly, I noted in the commit message that since
I think complex numbers mean that (2) is not great. Almost all uses of hash codes rely on What do you all think? This is a place where Algebird has done a lot more work than Spire so I'm happy to defer to @johnynek and @avibryant here. |
// * using the given function `f`. | ||
// */ | ||
// def by[@sp A, @sp B](f: A => B)(implicit ev: Hash[B]): Hash[A] = | ||
// ev.on(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could leave it here but comment it out above. Just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Maybe on
doesn't belong on any of the type classes, and we can just use by
in all cases.
Actually, we never found a great solution for hashing in algebird and tried many different ideas. Count-min-sketch uses the hash family approach Avi mentioned (though without an explicit size). Other algorithm use a fixed hash and require a serialization which is used to generate a hash on the item in question. HyperLogLog, to count N uniques, needs something like Hash is really something like an equivalence, so I wondered if we wanted: /**
* Law is Eq.eqv(a, b) implies Eq.eqv(hash(a), hash(b))
*/
trait Hash[K, R] extends Eq[K] {
def resEq: Eq[R]
def hash(k: K): R
} But then, that felt like a false generality, because really hashing is more than that: it is related to a random projection of the item onto the interval So that left us with either something like: That said, I could never get much momentum behind it, and I find designs without buy in are often bad designs, so in is still an open PR in algebird, bitrotting. For this minimal case, I tend to think the easiest way forward would be to move the That's my brain dump on this. |
A further brain dump:
|
It may be too far-afield but later tonight I'm going to post some demo code involving Spire's |
I guess |
It would be nice, in my opinion, if we could get this issue fixed here, not in spire and algebird. That said, it is not very algebraic, and only touches the classes here in that it relates to Eq. Maybe I should replace this PR with just the bug fixes and remove hash, and then we can add a separate Hash PR. |
💭 Maybe we should just model |
@avibryant very interesting idea |
I wasn't trying to suggest (Also, Spire has a notion of how to generate random values for a generic type using type classes using |
Hey, so I am still playing around with a possible design, but do you think this is something we need for an initial algebra milestone? Maybe we should wait a bit and add it later? Do you think I should merge the current design? |
I'd rather we get this right, and there's no rush to get it into the first milestone (algebird wouldn't use it right away anyway). |
Yeah, that's my thinking as well. |
+1 to not rushing. But we should get the specialization fixes and the impicit Equiv from Eq. |
Oh, right! I'll make another PR with those now. |
Picking this back up, I have a concrete proposal: Create a simple
|
One remark: I do not want this to automatically fall back on scala.util.hashing.Hashing. That causes a lot of problems in case the default hashing makes no sense (e.g. arrays, functions). If you have an automatic fallback, you get instances where you would rather not have them! |
Now that cats has a What should we do? |
let's ditch it. We can upstream the statistical test if we want. |
@sp
Todo: add some hash laws.
Question: should we add LongHash (and LongLongHash)? It seems like those are the main things we use in algebird (LongLong in HyperLogLog).