

perl-hash-stats

Counting the collisions of Perl hash tables per hash function. (Perl resolves collisions by linear chaining in a linked list, which is subject to collision attacks.)

Average case (perl core testsuite)

| Hash function      | collisions | cycles/hash |
|--------------------|------------|-------------|
| CRC32              | 1.066      | 29.78       |
| DJB2.1             | 1.070      | 44.73       |
| CRC32.1            | 1.078      | 29.78       |
| SUPERFAST          | 1.081      | 34.72       |
| SDBM.1             | 1.082      | 30.57       |
| ONE_AT_A_TIME_HARD | 1.092      | 83.75       |
| SIPHASH            | 1.091      | 154.68      |
| ONE_AT_A_TIME      | 1.098      | 43.62       |
| ONE_AT_A_TIME_OLD  | 1.100      | 43.62       |
| MURMUR3            | 1.105      | 34.03       |
| DJB2               | 1.131      | 44.73       |
| SDBM               | 1.146      | 30.57       |
| CITY               | ?          | 30.13       |

Fewer collisions are better; fewer cycles/hash is faster. A hash table lookup consists of one hash function call of constant cost (depending only on the length of the key) and then resolving 0-x collisions (0-8 in our average case).

The perl5 testsuite has a median key size of 20 and an average of 133.2. The most commonly used key sizes are 4, 101 and 2; the most common hash table sizes are 7, 63 and 127.

A hash table size of 7 uses only the last 3 bits of the 32-bit hash function result, 63 uses only 6 bits, and 127 uses 7 bits.

  • collisions are the number of linked-list iterations per hash table usage.
  • cycles/hash is measured with smhasher for 10-byte keys (see "Small key speed test").
  • SDBM and DJB2 did not produce a workable miniperl; they needed to be patched.
  • The .1 variants add the key length to the seed to fight \0 attacks, relevant when users can easily input binary keys (as in perl since 5.16).

Hash table sizes

| size  | count    |
|-------|----------|
| 0     | 2403     |
| 1     | 383      |
| 3     | 434      |
| 7     | 30816359 |
| 15    | 19761019 |
| 31    | 20566188 |
| 63    | 30131283 |
| 127   | 28054277 |
| 255   | 15104276 |
| 511   | 7146648  |
| 1023  | 3701004  |
| 2047  | 1015462  |
| 4095  | 217107   |
| 8191  | 284997   |
| 16383 | 237284   |
| 32767 | 169823   |

Note that perl only supports int32 (32-bit) sized tables, not 64-bit arrays. Larger keysets need to be tied to bigger key-value stores, such as LMDB_File or at least AnyDBM_File; otherwise you will get a huge number of collisions.

It should be studied whether leaving out one or two sizes, and thereby avoiding the costly rehashing, is worthwhile. Good candidates to skip for this dataset seem to be 15 and 63.

For double hashing, perl5 would need to use prime-number-sized hash tables to make the second hash function work. For 32-bit, the primes can be stored in a constant table, as in glibc.

Number of collisions with CRC32

CRC32 is a good and fast hash function; on Intel processors with SSE4.2, or on ARMv7/ARMv8 with CRC extensions, it costs just a few cycles.

| collisions | count     |
|------------|-----------|
| 0          | 26176163  |
| 1          | 100979326 |
| 2          | 25745874  |
| 3          | 4526405   |
| 4          | 512177    |
| 5          | 46749     |
| 6          | 4015      |
| 7          | 187       |
| 8          | 8         |

Note that 0 collisions can occur because of an early return in the hash table lookup function, such as with empty hash tables. The number of collisions is independent of the hash table size and key length; it depends on the fill factor, the quality of the hash function, and the keys.

This is the average case. Worst cases can be produced by guessing the random hash seed from leakage of the sorting order (unsorted keys in JSON, YAML, RSS or similar interfaces), or even by tampering with the ENV or process memory, and then creating colliding keys. This degenerates lookups to linear time, giving quadratic-time DoS attacks at only linear attack cost. See RT #22371 and "Denial of Service via Algorithmic Complexity Attacks", S. Crosby, D. Wallach, Rice, 2003. Long-running perl processes that publicly expose their sorting order and accept hash keys as input should really be avoided without proper countermeasures; PHP, for example, limits MAX_POST_SIZE. How to obtain the private random seed is described e.g. in "Remote Algorithmic Complexity Attacks Against Randomized Hash Tables", N. Bar-Yosef, A. Wool, 2009, Springer.

Perl and similar dynamic languages really need to improve their collision resolution and choose a combination of a fast and good-enough hash function. Little of this is implemented in standard software besides Kyoto Cabinet, though Knuth long ago proposed sorted buckets ("Ordered hash tables", O. Amble, D. Knuth, 1973). Most technical papers accept the degeneration into linear search on bucket collisions as is, notably even the Linux kernel (F. Weimer, "Algorithmic complexity attacks and the linux networking code", May 2003). glibc, gcc and libiberty, among others, recently switched to open addressing with double hashing, where excessive collisions just trigger hash table resizes and a good choice of the second hash function reduces collisions dramatically. DJB's DNS server has an explicit check for "hash flooding" attempts. Some rare hash table implementations use rb-trees.

For CITY there currently exists a simple universal C function to easily create collisions per seed; crc32 is exploitable even more easily. Note that such a function exists for every hash function: just encode the hash in SAT-solver-friendly form and read collisions off the generated model. It is even simpler if you compute only the needed last bits, depending on the hash table size (8-15 bits). So singling out CITY on such security grounds does not hold; even the most secure hash function can be attacked this way. Any practical attacker has enough time in advance to create enough colliding keys, dependent only on the random seed, and can easily verify a guessed seed by timing measurements. The code is just not public yet, and the cost for some slower (cryptographically secure) hash functions might be too high. But people have already encoded SHA-2 into SMTLIB code to attack bitcoin, and high-level frameworks such as frama-c, klee or z3 are becoming increasingly popular.

crc is recommended by the xcore Tips & Tricks: Hash Tables article and was also analysed by Bob Jenkins.

See also

  • See blogs.perl.org: statistics-for-perl-hash-tables for a more detailed earlier description, and

  • Emmanuel Goossaert's blog compares some hash table implementations and esp. collision handling for efficiency, not security.

  • See smhasher for performance and quality tests of most known hash functions.

  • See Perfect::Hash for benchmarks and implementations of perfect hashes, i.e. fast lookup in readonly stringmaps.

  • hash.stats for the distribution of the collisions

  • hash.result.* for the table sizes
