📚 Word shingling for near duplicate document detection

Wshiml

An OCaml implementation of word shingling for near-duplicate document detection, as described in http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
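
For readers who have not seen the technique, here is a minimal sketch of word shingling and the Jaccard coefficient used to compare two documents. It is only an illustration and not Wshiml's actual API: the names `words`, `shingles` and `jaccard` are hypothetical, splitting on single spaces is a simplifying assumption, and `k` is assumed to be at least 1.

```ocaml
(* Illustrative sketch only; these names are hypothetical, not Wshiml's API. *)
module StringSet = Set.Make (String)

(* Split text into lowercase words on spaces, dropping empty tokens. *)
let words text =
  String.split_on_char ' ' (String.lowercase_ascii text)
  |> List.filter (fun w -> w <> "")

(* The set of all k-word shingles of a text, each joined with a space.
   Assumes k >= 1. *)
let shingles k text =
  let ws = Array.of_list (words text) in
  let n = Array.length ws in
  let rec go i acc =
    if i + k > n then acc
    else
      let sh = String.concat " " (Array.to_list (Array.sub ws i k)) in
      go (i + 1) (StringSet.add sh acc)
  in
  go 0 StringSet.empty

(* Jaccard coefficient: size of intersection over size of union.
   Defined here as 0 when both sets are empty. *)
let jaccard a b =
  let union = StringSet.cardinal (StringSet.union a b) in
  if union = 0 then 0.0
  else
    float_of_int (StringSet.cardinal (StringSet.inter a b))
    /. float_of_int union
```

With these definitions, `shingles 4 "a rose is a rose is a rose"` contains three distinct shingles, and two documents would be flagged as near duplicates when `jaccard` of their shingle sets exceeds a chosen threshold.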

Building requires oasis. Run:

./configure    # optionally with --prefix
make
make install

To build the example command-line program, run:

./configure --enable-cli
make
make install
find-similar-docs --help

The command-line program requires cmdliner. The rest of the software has no dependencies apart from Oasis for building from git.

On Debian/Ubuntu, you can install all build dependencies with

sudo apt install oasis libcmdliner-ocaml-dev

So far the code is fairly unoptimised apart from the techniques described in http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html. It takes 7 s (4 s with super-shingling) to cluster 1,100 documents totalling 766,937 words on an old 2.8 GHz AMD machine.
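
The chapter linked above avoids comparing full shingle sets by reducing each document to a short sketch of minimum hash values. The following is a rough illustration of that idea, not the code in `src`: the names `sketch` and `estimate` are hypothetical, and OCaml's `Hashtbl.seeded_hash` stands in for the random permutations the chapter describes. Both sketches are assumed to have the same length `n > 0`.

```ocaml
(* Illustrative sketch only; names are hypothetical, not Wshiml's API. *)

(* For each of n seeds, keep the smallest hash over all shingles.
   Hashtbl.seeded_hash is an assumed stand-in for a random permutation. *)
let sketch n shingle_list =
  Array.init n (fun seed ->
      List.fold_left
        (fun m s -> min m (Hashtbl.seeded_hash seed s))
        max_int shingle_list)

(* Estimate the Jaccard coefficient as the fraction of sketch positions
   on which the two documents agree. Assumes equal, non-zero lengths. *)
let estimate a b =
  let agree = ref 0 in
  Array.iteri (fun i x -> if x = b.(i) then incr agree) a;
  float_of_int !agree /. float_of_int (Array.length a)
```

Comparing two sketches costs time proportional to `n` rather than to the number of shingles, which is where the saving comes from. Super-shingling goes one step further by hashing groups of sketch values together.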

API documentation

is here.