Hashing for lists-of-lists and other things is semi-broken. #74

Closed
neongreen opened this Issue Nov 10, 2013 · 9 comments

Projects

None yet

5 participants

@neongreen

Currently structures containing lists are treated as if they were all concatenated: hash ["abc"] == hash ["a","bc"] == hash ("ab", "c") == hash "abc". That is, hashable completely ignores the structure. Is it deliberate?

I think that in some cituations it might be a serious flaw — for instance, storing all partitions of a list (they would all have one hash).

@tibbe
Owner
tibbe commented Nov 11, 2013

Good question. I would have to ponder this for a while. Do you know what other platform do e.g. Java or STL?

@gseitz
gseitz commented Nov 11, 2013

Java basically does this in java.util.AbstractList (with an additional null check per element of course):

foldl' (\acc x -> 31*acc + hash x)  1 xs
@neongreen

I admit that it’s the first time ever I code in Java, but for what it’s worth this piece of code

List a = Arrays.asList(1,2,3);
List b = Arrays.asList(Arrays.asList(1,2),Arrays.asList(3));
List c = Arrays.asList(Arrays.asList(1),Arrays.asList(2,3));
List d = Arrays.asList(Arrays.asList(),Arrays.asList(1,2,3));
List e = Arrays.asList(Arrays.asList(1),Arrays.asList(),Arrays.asList(2,3));
System.out.print(Arrays.asList(a.hashCode(),b.hashCode(),c.hashCode(),d.hashCode(),e.hashCode()));

produces [30817, 31809, 2979, 31809, 61600]. That is, hash [[1,2],[3]] == hash [[],[1,2,3]] /= hash [[1],[2,3]].

@tibbe
Owner
tibbe commented Nov 11, 2013

Here's the guiding principled I've been using when thinking about collisions: we try to avoid collisions between two values of the same types (e.g. ["abc"] and ["a","bc"]). We don't not try to avoid collisions between two values of different types (e.g. ("ab", "c") and "abc"). The latter is not very useful for a library designed to support data types which store a value of a single type (e.g. HashMap Text (String, String)). It's also terribly difficult to avoid in practice.

We should see if we can do something better for lists and tuples though, perhaps based on what Java is doing.

@neongreen

I only included the tuple to further prove my point that structure of whatever kind is ignored; I’m not proposing here to address this problem too.

Now that I’ve said that... Avoiding collisions between two values of different types might be as easy as using hash of TypeRep as salt (if it introduces overhead, you could always make it a separate function).

@tibbe
Owner
tibbe commented Nov 11, 2013

@ArtyomKazak as you said, it slows things down. hashable is not really meant for general purpose hashing, just to support hashing based data types.

@jberryman

If anyone's interested I've released a rewrite of hashable called hashabler which concerns itself with this issue.

@phadej phadej referenced this issue Sep 17, 2015
Merged

List with length #103

@phadej
Contributor
phadej commented Jan 16, 2016

With 1.2.4.0 on my machine:

 Data.Hashable > (hash ["abc"], hash ["a","bc"], hash ("ab", "c"), hash "abc")
(770288776435754308,-3254706873997758045,779577897349174717,-2835585523591699257)

This can be closed.

@tibbe
Owner
tibbe commented Jan 16, 2016

Thanks for verifying!

@tibbe tibbe closed this Jan 16, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment