An efficient simhash implementation for python
Clone or download
Latest commit 1f11fae Apr 23, 2018
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
simhash Add python 3 support Apr 13, 2018
LICENSE Add BSD 3-Clause license file Oct 29, 2015
README.rst initial commit Aug 5, 2014
setup.py initial commit Aug 5, 2014

README.rst

simhash

This is an efficient implementation of some functions that are useful for implementing near duplicate detection based on Charikar's simhash. It is a python module, written in C with GCC extentions, and includes the following functions:

fingerprint
Generates a fingerprint from a sequence of hashes
weighted_fingerprint
Generate a fingerprint from a sequence of (long long, weight) tuples
fnvhash
Generates a (FNV-1a) hash from a string
hamming_distance
Calculate the number of bits that differ between 2 long long integers
simpair_indices
Find the indices of hashes in a sequence that differ by less than a certain number of bits. It includes arguments for rotating and grouping hashes. It can be used to help efficiently implement online or batch near duplicate detection, for example as described in Detecting Near-Duplicates for Web Crawling by Gurmeet Manku, Arvind Jain, and Anish Sarma.

Example usage

Generate hashes:

>>> from simhash import fingerprint
>>> hash1 = fingerprint(map(hash, "some text we want to hash"))
>>> hash2 = fingerprint(map(hash, "some more text we want to hash"))

Measure distance between hashes:

>>> from simhash import hamming_distance
>>> hamming_distance(hash1, hash2)
2L

This code was used from mapreduce jobs against a large dataset of webpages as part of a prototype at Scrapinghub.