docsim -- Document similarity programs
Sentence and file based analysis
This code may be run in one of two modes: either ignoring linebreaks in input text files and treating the whole input file as one sentence, or respecting linebreaks in the input files as sentence boundaries (as for [SOROKINA+07]). The default is to treat input files as a whole, ignoring sentence boundaries.
To respect sentences, use the -S option for docsim-analyze, docsim-compare, and overlapd. Note that you will probably get odd results if the analysis and comparison steps do not use the same setting.
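The difference between the two modes can be pictured with a minimal C++ sketch. The function name `split_units` and the `respect_sentences` flag are hypothetical and are not taken from the docsim sources. The sketch only illustrates the idea that in whole-file mode linebreaks act as ordinary whitespace, while in sentence mode each non-empty line is treated as a separate unit.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of docsim: split input text into the units
// that would be analyzed.
//   respect_sentences == false: the whole text is one unit and linebreaks
//                               are replaced by single spaces.
//   respect_sentences == true:  each non-empty line is its own unit.
std::vector<std::string> split_units(const std::string& text,
                                     bool respect_sentences) {
    std::vector<std::string> units;
    std::istringstream in(text);
    std::string line;
    std::string whole;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // blank lines are skipped in both modes
        if (respect_sentences) {
            units.push_back(line);
        } else {
            if (!whole.empty()) whole += ' ';
            whole += line;
        }
    }
    if (!respect_sentences && !whole.empty()) units.push_back(whole);
    return units;
}
```

Running both modes over the same text gives different units, which is one way to see why mixing the settings between analysis and comparison is likely to give odd results.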
Treating files as a whole means that lib/files.cpp needs a buffer large enough to hold a complete input file. The buffer size is controlled in lib/definitions.h:
#define FILE_BUFFER_SIZE 5000000
As of 2011-02-16 the maximum psv file size for arXiv is ~4.1MB so a buffer size larger than this is required to treat all files whole.
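As a rough illustration of the size constraint, the sketch below checks whether a file would fit in a buffer of FILE_BUFFER_SIZE bytes. The helpers `fits_in_buffer` and `file_size_bytes` are hypothetical and do not appear in lib/files.cpp. Reserving one byte for a terminating NUL is also an assumption made here, not something confirmed by the docsim sources.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Value quoted from lib/definitions.h.
#define FILE_BUFFER_SIZE 5000000

// Hypothetical helper, not part of docsim: true if content_size bytes plus
// one terminating NUL (an assumption of this sketch) fit in buffer_size bytes.
bool fits_in_buffer(std::size_t content_size,
                    std::size_t buffer_size = FILE_BUFFER_SIZE) {
    return content_size < buffer_size;
}

// Hypothetical helper: size of a file in bytes, or -1 if it cannot be opened.
long long file_size_bytes(const std::string& path) {
    // Open positioned at the end so tellg() reports the file length.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    return static_cast<long long>(in.tellg());
}
```

With the default of 5000000, a file of about 4.1MB fits, while a file of exactly 5000000 bytes would not under the NUL-terminator assumption.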
The C++ code in directory cpp and under was developed with gcc 4.1.2. It compiles with gcc 4.4 but emits some deprecated-header warnings. (2011-03-03)
> cd cpp
> make
This will build the command line docsim tools but will not build
The gSOAP library is needed in order to compile to
daemon that provides SOAP-based communication with a memory resident
As of 2014-07 the standard RHEL6 packages for
are version 2.7.16. These may be installed with:
sudo yum install gsoap gsoap-devel
This installs in the system paths with prefix
/usr (i.e. shared
/usr/lib, headers in
/usr/include, binaries in
cpp/soap-server is set up for these
locations. From the root directory, make
Debian and derivatives
Warning: these notes are from 2007 and may be out of date.
On Debian and derivative systems (e.g. Ubuntu) this should be available for etch and later distributions via apt. Install with, e.g.:
sudo apt-get install gsoap
In March 2007 the current version on etch was 2.7.6 which was fine and
apt installed it under
is set up for this location.
For other systems, try sourceforge: http://sourceforge.net/projects/gsoap2/
Gzstream - this library was downloaded from http://www.cs.unc.edu/Research/compgeom/gzstream/ on 2011-08-25 and is included in the cpp/include directory. It allows the code to read plain or gzip format data files, which saves disk space. The docsim makefiles compile this code. Gzstream is LGPL licensed.
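To make the distinction between plain and gzip format data files concrete, the sketch below tests whether data begins with the two gzip magic bytes 0x1f 0x8b defined by the gzip file format (RFC 1952). The helper `looks_like_gzip` is hypothetical. It is not part of docsim or Gzstream and is not a description of how Gzstream itself handles the two formats.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of docsim or Gzstream: true if the data
// begins with the gzip magic bytes 0x1f 0x8b (RFC 1952).
bool looks_like_gzip(const std::string& data) {
    return data.size() >= 2 &&
           static_cast<unsigned char>(data[0]) == 0x1f &&
           static_cast<unsigned char>(data[1]) == 0x8b;
}
```

A check like this would be applied to the first bytes of a file such as the .txt.gz files in testdata, while plain text files would normally fail it.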
For a simple test, run with the data supplied. First analyze a set of test data to build a keys file:
cpp/docsim-analyze -d testdata/arxiv-publicdomain -f testdata/arxiv-publicdomain/files.txt -b 20
And then compare one of the documents against the corpus:
cpp/docsim-compare -f testdata/arxiv-publicdomain/0012/math0012129.txt.gz -b 20 -T /tmp/allkeys
These programs extend work reported in "Plagiarism Detection in arXiv", Daria Sorokina, Johannes Gehrke, Simeon Warner, Paul Ginsparg [ICDM'06, http://arxiv.org/abs/cs/0702012]. Most of these programs were written by Simeon Warner between 2005 and 2009. They include portions based on code written by and copyright Daria Sorokina, 2005, as noted in the source files.
[SOROKINA+07] Daria Sorokina, Johannes Gehrke, Simeon Warner, Paul Ginsparg. "Plagiarism Detection in arXiv" doi:10.1109/ICDM.2006.126 http://arxiv.org/abs/cs/0702012