d361799 Aug 12, 2010
biased-text-sample
==================
by Joseph Turian
Make a biased sample of a large text corpus, based upon text in a smaller text corpus.
Essentially, index the large text corpus with Lucene, and for each document
in the smaller corpus, retrieve the top ten Lucene results.
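The retrieval loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: it swaps Lucene for a tiny in-memory inverted index with a simple TF-IDF score, and the names `tokenize`, `build_index`, `top_k`, and `biased_sample` are hypothetical, not taken from this repository.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase alphanumeric tokens; a stand-in for a real Lucene analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(large_corpus):
    # Inverted index: term -> {doc_id: term frequency in that document}.
    index = defaultdict(dict)
    for doc_id, text in enumerate(large_corpus):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def top_k(index, n_docs, query, k=10):
    # Score each candidate by summed tf * idf over the query's distinct terms.
    # This is a simplified score, not Lucene's actual ranking function.
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term)
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def biased_sample(large_corpus, small_corpus, k=10):
    # Union of the top-k hits for every document in the smaller corpus,
    # returned in the order they appear in the large corpus.
    index = build_index(large_corpus)
    chosen = set()
    for query in small_corpus:
        chosen.update(top_k(index, len(large_corpus), query, k))
    return [large_corpus[i] for i in sorted(chosen)]
```

Taking the union of hits means a sentence retrieved for several small-corpus documents appears only once in the sample.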
Pipe a large stream of text into the indexer:

    /u/turian/data/web_corpus/WaCky2/sentencesplit.py | ./index-sentences.py

REQUIREMENTS:
* numpy
  Used for the Bloom filter.
* murmurhash
  Used for the Bloom filter.
* http://github.com/turian/common
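
For readers unfamiliar with the Bloom filter named in the requirements, here is a minimal sketch of the data structure. It is not this project's implementation: it assumes a standard-library `bytearray` in place of a numpy bit array and `hashlib.sha256` in place of murmurhash, and the class name and default sizes are hypothetical. The README does not say what the filter tracks.

```python
import hashlib


class BloomFilter:
    """Approximate set: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1 << 20, n_hashes=5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        # One bit per slot, packed eight to a byte (stand-in for a numpy array).
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from two 64-bit values (double hashing).
        # sha256 is an assumption standing in for murmurhash.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero step
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )
```

A membership test such as `sentence in bloom` can return a false positive but never a false negative, so a filter like this costs a fixed amount of memory regardless of how many items are added.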