d361799 Aug 12, 2010
by Joseph Turian
Make a biased sample of a large text corpus, based upon text in a smaller text corpus.
Essentially, index the large text corpus with Lucene, and for each document
in the smaller corpus, retrieve the top ten Lucene results.
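The idea above can be sketched without Lucene. The following is only an illustration: it swaps in a tiny in-memory inverted index with a plain TF-IDF score in place of Lucene's index and ranking, and every name in it (`build_index`, `top_k`, `biased_sample`) is hypothetical rather than taken from this project.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; Lucene's analyzers are far more elaborate.
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def top_k(index, n_docs, query, k=10):
    """Return the ids of the k best-scoring documents for a query text."""
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue
        # Simple TF-IDF weighting (an assumption, not Lucene's scoring).
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score, breaking ties by ascending doc id.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def biased_sample(large_docs, small_docs, k=10):
    """Union of the top-k large-corpus hits for each small-corpus document."""
    index = build_index(large_docs)
    selected = set()
    for query in small_docs:
        selected.update(top_k(index, len(large_docs), query, k))
    return sorted(selected)
```

Each document in the smaller corpus acts as a query, so the resulting sample is skewed toward text in the large corpus that resembles the smaller one.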
Pipe a large stream of text into the indexer:
/u/turian/data/web_corpus/WaCky2/ | ./
* numpy
Used for Bloom filter.
* murmurhash
Used for Bloom filter.
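
For readers unfamiliar with Bloom filters, here is a minimal sketch of the data structure. It is not this project's implementation: it uses a `bytearray` and `hashlib.blake2b` from the standard library in place of numpy and murmurhash, and the class name and default sizes are hypothetical. What the filter is used for in this project is not stated here; remembering which documents have already been seen is one plausible use, but that is an assumption.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: no false negatives, tunable false positives."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        # Sizes are illustrative defaults, not values from this project.
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash function.
        data = item.encode("utf-8")
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A filter like this answers "have I seen this string before?" in fixed memory, at the cost of occasionally answering yes for a string that was never added.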