SimilarityDetection

In this project, we replicated a novel similarity detection algorithm for identifying near-duplicate sentences in Wikipedia articles, based on the work of Weissman et al. [1]. This is accomplished with MapReduce/Spark implementations of MinHash and Random Projection, two locality-sensitive hashing (LSH) techniques, which identify sentences with high Jaccard similarity and low Hamming distance, respectively. Our experimental results appear to support the clustering results of Weissman et al. [1] and part of their manual inspection.
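As a concrete illustration of the MinHash side, the following sketch (not the repository's code; the shingle size, number of hash functions, and seed scheme are assumptions) computes a per-sentence signature and estimates Jaccard similarity from the fraction of matching signature positions:

```scala
import scala.util.hashing.MurmurHash3

// Minimal MinHash sketch (illustrative only).
// Each sentence is shingled into word bigrams; the signature is the minimum
// hash of the shingle set under k independent hash functions (here, the same
// MurmurHash3 with k different seeds).
object MinHashDemo {
  val numHashes = 100
  val seeds: Seq[Int] = 1 to numHashes // stand-ins for k independent hash functions

  def shingles(sentence: String, n: Int = 2): Set[String] =
    sentence.toLowerCase.split("\\s+").sliding(n).map(_.mkString(" ")).toSet

  def signature(sentence: String): Array[Int] = {
    val sh = shingles(sentence)
    seeds.map(seed => sh.map(s => MurmurHash3.stringHash(s, seed)).min).toArray
  }

  // The fraction of agreeing signature positions is an unbiased estimate of
  // the Jaccard similarity of the underlying shingle sets.
  def estimatedJaccard(a: Array[Int], b: Array[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length

  def main(args: Array[String]): Unit = {
    val s1 = signature("the quick brown fox jumps over the lazy dog")
    val s2 = signature("the quick brown fox leaps over the lazy dog")
    println(f"Estimated Jaccard similarity: ${estimatedJaccard(s1, s2)}%.2f")
  }
}
```

Sentence pairs whose estimated similarity exceeds a threshold can then be grouped as near-duplicate candidates.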

Building on their exploration, we followed their MapReduce implementation of the MinHash method for computing signatures and reimplemented it on the Spark framework. We also explored the Random Projection hashing method as an alternative way of computing signatures to detect similar sentences on the same Wikipedia source, and implemented it on both the MapReduce and Spark frameworks.
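For comparison, here is a minimal Random Projection (SimHash-style) sketch, assuming sentences are first mapped to term-frequency vectors via feature hashing; the vocabulary dimension, signature length, and RNG seed are assumptions, not the repository's settings. Each random hyperplane contributes one signature bit, and the Hamming distance between signatures approximates the angular distance between the original vectors:

```scala
import scala.util.Random

// Minimal Random Projection sketch (illustrative only).
object RandomProjectionDemo {
  val dim = 1000   // hashed vocabulary size (assumption)
  val numBits = 64 // signature length (assumption)
  val rng = new Random(42)

  // numBits random hyperplanes with Gaussian-distributed components.
  val planes: Array[Array[Double]] =
    Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

  // Term-frequency vector via feature hashing.
  def vectorize(sentence: String): Array[Double] = {
    val v = new Array[Double](dim)
    sentence.toLowerCase.split("\\s+").foreach { w =>
      val idx = ((w.hashCode % dim) + dim) % dim // non-negative bucket index
      v(idx) += 1.0
    }
    v
  }

  // One bit per hyperplane: the sign of the dot product with the vector.
  def signature(sentence: String): Array[Boolean] = {
    val v = vectorize(sentence)
    planes.map(p => p.zip(v).map { case (a, b) => a * b }.sum >= 0.0)
  }

  // Low Hamming distance between signatures indicates similar sentences.
  def hammingDistance(a: Array[Boolean], b: Array[Boolean]): Int =
    a.zip(b).count { case (x, y) => x != y }

  def main(args: Array[String]): Unit = {
    val s1 = signature("the quick brown fox jumps over the lazy dog")
    val s2 = signature("the quick brown fox leaps over the lazy dog")
    println(s"Hamming distance: ${hammingDistance(s1, s2)} / $numBits")
  }
}
```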

[1] S. Weissman, S. Ayhan, J. Bradley, and J. Lin. Identifying duplicate and contradictory information in Wikipedia. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2015), 2015.
