A simple implementation of the simhash algorithm in Java.
- Compute the simhash of a string.
- Compute the similarity between all strings by building a smart index, so that large data sets can be handled.
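As a rough illustration of the two features above, here is a minimal sketch of a 64-bit simhash and the Hamming distance between two fingerprints. It is not the code in this repository: the class name `SimHashSketch`, the whitespace tokenizer, and the FNV-1a token hash are all assumptions made for the example, and the project itself uses a pluggable analyzer.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names and hashing choices are assumptions,
// not the implementation shipped in this repository.
final class SimHashSketch {
    private static final long FNV_OFFSET = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // 64-bit FNV-1a hash over the token's UTF-8 bytes (assumed token hash).
    static long fnv1a64(String token) {
        long h = FNV_OFFSET;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= FNV_PRIME;
        }
        return h;
    }

    // Each token votes +1 or -1 on every bit position; the sign of the
    // total decides the corresponding bit of the fingerprint.
    static long simhash(String doc) {
        int[] votes = new int[64];
        // Assumption: tokens are separated by whitespace.
        for (String token : doc.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) {
                fingerprint |= 1L << i;
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which the two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Two documents would then be compared with `SimHashSketch.hammingDistance(SimHashSketch.simhash(docA), SimHashSketch.simhash(docB))`; a smaller distance suggests more similar text.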
-
Run Main with inputfile and outputfile as arguments.
-
The format of inputfile (see src/test_in): one doc per line, encoded in UTF-8.
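For illustration, an input file in this format could be loaded as follows. This is a sketch under assumptions: the class and method names are hypothetical and do not come from this project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative sketch only; InputReaderSketch and readDocs are hypothetical names.
final class InputReaderSketch {
    // Returns one list element per line, i.e. one element per doc,
    // decoding the file as UTF-8.
    static List<String> readDocs(Path inputFile) throws IOException {
        return Files.readAllLines(inputFile, StandardCharsets.UTF_8);
    }
}
```

`Files.readAllLines` loads the whole file into memory, which is fine for a small sample such as src/test_in; for very large inputs a streaming reader would likely be a better fit.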
-
The format of outputfile (see src/test_out):
-
start //start flag
-
first line // doc
-
second line // doc1\tdist, where dist is the Hamming distance between doc and doc1
-
end //end flag
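The block layout described above could be produced by something like the sketch below. It rests on two assumptions that the description does not confirm: that each similar document gets its own `doc1\tdist` line, and that the flags are written as the bare words `start` and `end`. The class and method names are hypothetical; src/test_out remains the authoritative example.

```java
import java.util.Map;

// Illustrative sketch only; OutputBlockSketch and formatBlock are hypothetical names.
final class OutputBlockSketch {
    // Builds one result block: start flag, the doc itself, then one
    // "similarDoc<TAB>distance" line per entry (assumed), then the end flag.
    static String formatBlock(String doc, Map<String, Integer> similarDocs) {
        StringBuilder sb = new StringBuilder();
        sb.append("start\n");
        sb.append(doc).append('\n');
        for (Map.Entry<String, Integer> e : similarDocs.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        sb.append("end\n");
        return sb.toString();
    }
}
```

Passing a `LinkedHashMap` keeps the similar documents in insertion order, for example sorted by distance before formatting.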
- Build the project into a runnable jar.
- Improve performance on big data.
- Before running Main.java, you should choose a better analyzer than BinaryWordSeg!