Deduplication using simhash

This small project normalizes a text corpus and demonstrates feature extraction and near-duplicate (similarity) detection using simhash-based similarity scores.

A seminal reference that ties these ideas together is Detecting Near-Duplicates for Web Crawling by Manku et al.

This code includes another project that implements the simhash algorithm described in the Manku paper.
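
For reference, the core idea of simhash is sketched below: each feature is hashed to a fixed-width value, each bit position accumulates +1 or -1 depending on whether that bit is set, and the signs of the sums form the fingerprint. Two documents are near duplicates when the Hamming distance between their fingerprints is small. This is an illustrative re-implementation (with an md5-based feature hash as a stand-in), not the code of the included simhash project.

```python
import hashlib

def simhash(features, bits=64):
    """Compute a simhash fingerprint from an iterable of string features."""
    v = [0] * bits
    for feature in features:
        # Stand-in feature hash: the lower `bits` bits of an md5 digest.
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # Vote +1 if bit i is set in this feature's hash, -1 otherwise.
            v[i] += 1 if (h >> i) & 1 else -1
    # Keep the sign of each bit position's vote tally.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```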

Requirements

To run the code you will need:

  • NLTK (Natural Language Toolkit)
  • The simhash implementation from the project mentioned above.
  • Python; for best results, run it on Linux.

To use

  • Place the text files for similarity detection in the corpus directory. All of the files are read and every pair of files is compared to produce a similarity score (a sketch of this step follows the list).
  • The file to run is hashtest.py.
  • The results table is written to results.csv.
  • To interpret the results: values closer to 0 indicate higher similarity, while larger positive values indicate greater differences. A value of 0 indicates an exact match.
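
The comparison step can be pictured roughly as follows. This is only a sketch of what hashtest.py is described as doing: the corpus directory and results.csv come from the list above, but the whitespace tokenization and the simhash/hamming_distance helpers (from the earlier sketch) are assumptions, not the repository's actual code.

```python
import csv
import itertools
import os

def compare_corpus(corpus_dir="corpus", output_path="results.csv"):
    """Fingerprint every file in the corpus and write pairwise Hamming distances."""
    fingerprints = {}
    for name in sorted(os.listdir(corpus_dir)):
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
            # Naive whitespace tokenization; the project normalizes more aggressively.
            fingerprints[name] = simhash(f.read().split())

    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file_a", "file_b", "hamming_distance"])
        for a, b in itertools.combinations(sorted(fingerprints), 2):
            # 0 means identical fingerprints; larger values mean more dissimilar texts.
            writer.writerow([a, b, hamming_distance(fingerprints[a], fingerprints[b])])
```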

Inferences (so far)

  • Removing stopwords and stemming the text gives a clearer separation between documents that genuinely differ (see the normalization sketch after this list).
  • Hashing n-grams (shingles) instead of individual tokens makes the comparison stricter: even small changes produce large differences.
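
For illustration, the normalization and shingling described above could be done with NLTK along these lines. The function names, the Porter stemmer, and the 3-word shingle size are assumptions for the sketch, not details taken from the repository.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads for the tokenizer models and the stopword list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

_STOPWORDS = set(stopwords.words("english"))
_STEMMER = PorterStemmer()

def normalize(text):
    """Lowercase, tokenize, drop stopwords and punctuation, and stem."""
    tokens = word_tokenize(text.lower())
    return [_STEMMER.stem(t) for t in tokens if t.isalnum() and t not in _STOPWORDS]

def shingles(tokens, n=3):
    """Overlapping n-grams ("shingles") to use as simhash features."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Feeding shingles(normalize(text)) rather than individual tokens into the simhash step is what makes the comparison stricter, since a single edited word changes up to n shingles at once.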
