
Added usage to README, and a benchmark section and script.

1 parent 13a5f57 commit 05c8fb3ca1568d6439611defea6d022a7fc16bf0 Dan Lecocq committed Jun 25, 2012
Showing with 140 additions and 0 deletions.
  1. +61 −0 README.md
  2. +79 −0 bench.py
@@ -9,6 +9,67 @@ facilities to distribute the lookup tables. This implementation follows that
described in the Google paper on the subject of near-duplicate detection with
simhash.
+Building
+========
+This library links against [`libJudy`](http://judy.sourceforge.net/), which
+must be installed before building. It also depends on Cython. With those
+pieces in place, it's business as usual:
+
+    python setup.py install
+
+Usage
+=====
+A `Corpus` is a collection of all the tables necessary to perform the query
+efficiently. It takes two parameters: `num_blocks`, the number of blocks into
+which the 64-bit hashes should be divided (see more about this below), and
+`diff_bits`, the maximum number of bits by which two hashes may differ while
+still being considered near-duplicates. The number of tables needed is a
+function of these two parameters.
+
+    import simhash
+
+    # 6 blocks, 3 bits may differ
+    corpus = simhash.Corpus(6, 3)
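+
+If the tables follow the block-permutation scheme from the Google paper (an
+assumption here, not checked against the internals), a match requires at least
+`num_blocks - diff_bits` blocks to agree exactly, so one table is kept per
+choice of which blocks may differ. A quick sketch of that count:
+
+    # Hypothetical sketch: expected table count, assuming one table per
+    # choice of diff_bits differing blocks, i.e. C(num_blocks, diff_bits)
+    from math import factorial
+
+    def num_tables(num_blocks, diff_bits):
+        return factorial(num_blocks) // (
+            factorial(diff_bits) * factorial(num_blocks - diff_bits))
+
+    num_tables(6, 3)  # => 20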
+
+With a corpus, you can then insert into, remove from and query the data
+structure. If you're interested in just _any_ near-duplicate fingerprint, use
+`find_first` or `find_first_bulk`; if you're interested in finding _all_
+matches, use `find_all` or `find_all_bulk`:
+
+    # Generate 1M random hashes and random queries
+    import random
+    # randint is inclusive, so cap at 2**64 - 1 to stay within 64 bits
+    hashes = [random.randint(0, (1 << 64) - 1) for i in range(1000000)]
+    queries = [random.randint(0, (1 << 64) - 1) for i in range(1000000)]
+
+    # Insert the hashes
+    corpus.insert_bulk(hashes)
+
+    # Find matches; returns a list of results, where each element is the
+    # match for the corresponding query
+    matches = corpus.find_first_bulk(queries)
+
+    # Find all matches; returns a list of lists, each of which corresponds
+    # to the query of the same index
+    matches = corpus.find_all_bulk(queries)
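+
+The non-bulk calls work on one fingerprint at a time. A minimal sketch,
+assuming `insert` and `remove` mirror their `_bulk` counterparts (those two
+single-hash forms are an assumption, not confirmed above):
+
+    h = hashes[0]
+    corpus.insert(h)         # assumed single-hash counterpart of insert_bulk
+    match = corpus.find_first(h)
+    matches = corpus.find_all(h)
+    corpus.remove(h)         # assumed single-hash counterpart of remove_bulk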
+
+Benchmark
+=========
+This is a rough benchmark, but it should give you an idea of the order of
+magnitude of the performance available. Running on a single core of a 2011-ish
+MacBook Pro:
+
+    # ./bench.py --random 1000000 --blocks 5 --bits 3
+    Generating 1000000 hashes
+    Generating 1000000 queries
+    Starting Bulk Insertion
+    Ran Bulk Insertion in 2.534197s
+    Starting Bulk Find First
+    Ran Bulk Find First in 4.795310s
+    Starting Bulk Find All
+    Ran Bulk Find All in 7.415205s
+    Starting Bulk Removal
+    Ran Bulk Removal in 3.346022s
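+
+The script can also read pre-generated inputs via `--hashes` and `--queries`,
+each a file of newline-separated integers (the file names below are just
+examples):
+
+    # ./bench.py --hashes hashes.txt --queries queries.txt --blocks 6 --bits 3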
+
Architecture
============
Each document gets associated with a 64-bit hash calculated using a rolling
@@ -0,0 +1,79 @@
+#! /usr/bin/env python
+
+import time
+import random
+import simhash
+import argparse
+
+# Parse command-line options for the benchmark
+parser = argparse.ArgumentParser(description='Run a quick bench')
+parser.add_argument('--random', dest='random', type=int, default=None,
+    help='Generate N random hashes and N random queries')
+parser.add_argument('--blocks', dest='blocks', type=int, default=6,
+    help='Number of blocks to divide 64-bit hashes into')
+parser.add_argument('--bits', dest='bits', type=int, default=3,
+    help='How many bits may differ')
+parser.add_argument('--hashes', dest='hashes', type=str, default=None,
+    help='Path to file with hashes to insert')
+parser.add_argument('--queries', dest='queries', type=str, default=None,
+    help='Path to file with queries to run')
+
+args = parser.parse_args()
+
+corpus = simhash.Corpus(args.blocks, args.bits)
+
+# Hashes to insert and queries to run
+hashes = []
+queries = []
+
+if args.hashes:
+    with open(args.hashes) as f:
+        # File objects have no split(); iterate lines, skipping blanks
+        hashes = [int(l) for l in f if l.strip()]
+
+if args.queries:
+    with open(args.queries) as f:
+        queries = [int(l) for l in f if l.strip()]
+
+if args.random:
+    if args.hashes and args.queries:
+        print '--random supplied with both --hashes and --queries'
+        exit(1)
+
+    if not hashes:
+        print 'Generating %i hashes' % args.random
+        # randint is inclusive, so cap at 2**64 - 1 to stay within 64 bits
+        hashes = [random.randint(0, (1 << 64) - 1) for i in range(args.random)]
+
+    if not queries:
+        print 'Generating %i queries' % args.random
+        queries = [random.randint(0, (1 << 64) - 1) for i in range(args.random)]
+elif not (args.hashes and args.queries):
+    print 'No hashes or queries supplied'
+    exit(2)
+
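+# Times the enclosed block, printing when it starts and how long it took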
+class Timer(object):
+    def __init__(self, name):
+        self.name = name
+
+    def __enter__(self):
+        # Store the negated start time; adding the end time in __exit__
+        # yields the elapsed seconds
+        self.start = -time.time()
+        print 'Starting %s' % self.name
+        return self
+
+    def __exit__(self, t, v, tb):
+        self.start += time.time()
+        if t:
+            print ' Failed %s in %fs' % (self.name, self.start)
+        else:
+            print ' Ran %s in %fs' % (self.name, self.start)
+
+with Timer('Bulk Insertion'):
+    corpus.insert_bulk(hashes)
+
+with Timer('Bulk Find First'):
+    corpus.find_first_bulk(queries)
+
+with Timer('Bulk Find All'):
+    corpus.find_all_bulk(queries)
+
+with Timer('Bulk Removal'):
+    corpus.remove_bulk(hashes)
