LogLog space version of MinHash by combining ideas from HyperLogLog and b-bit MinHash
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
experiments
experiments_precomputed
.gitignore
LICENSE
README.md
hyperminhash.py
hyperminhash_tests.py

README.md

hyperminhash

LogLog scaling version of MinHash by combining ideas from HyperLogLog and b-bit MinHash

Please cite: Yun William Yu & Griffin Weber. HyperMinHash: MinHash in LogLog space. (2017)
Preprint: https://arxiv.org/abs/1710.08436

Code consists of a class hyperminhash.HyperMinHash that allows:

  • __init__: specify the size and parameters of the sketch
  • update: add hashable items to sketch
  • count: estimate the count-distinct cardinality of the sketch
  • filled_buckets: return the number of buckets that have an item (mostly used for internal algorithms)
  • __add__: given two sketches, A and B, A+B merges the sketches such that every item in A or in B is now in the sum.
  • jaccard: given two sketches, A and B, returns the Jaccard index
  • serialize: returns a ByteString that can be deserialized into the original object
  • deserialize: initializes a HyperMinHash sketch based on a serialized ByteString
  • intersection: returns the intersection cardinality, Jaccard index, number of bucket matches, and union cardinality when combining two sketches.

Full details are in the Python Docstrings. Of note, the Python implementation is not fully space-optimal, as that would require bit packing. Instead, we use size uint8, uint16, uint32, uint64 types from numpy in the implementation. Note also that we depend on Python3.

To test the class, we also provide an experiments/ directory.

  • tests_full.py will regenerate data allowing recreation of Figure 6 in the paper, though it may take weeks on standard workstations.
  • tests_quick.py will run a fast version that should complete within hours on most standard workstations, but at the cost of range and accuracy.
  • error_plot_full.py assumes that the current directory has the output of tests_full.py, and will generate a nice matplotlib graph.
  • error_plot_quick.py assumes that the current directory has the output of tests_quick.py, and will generate a nice matplotlib graph.

We also provide precomputed data of the type generated by tests_*.py. To use these, go to experiments_precomputed/ and run bash regen.sh.

Unit tests are in hyperminhash_tests.py, using the unittest framework.


Thanks to Seif Lotfy for providing a Golang implementation of HyperMinHash: https://github.com/axiomhq/hyperminhash