This is the git repo for my thesis. All of the code and datasets are presented here and open. I have also attached my thesis to this repo, in case anyone wants to read it :).
Abstract
There are many cases in which a file is created with only minimal modifications from the previous version. This may occur in versioned document sets such as Wikipedia, where a newer version is created by inserting or deleting a paragraph, fixing spelling issues, or even simply correcting a grammar error. Instead of storing the newer version of the file in its entirety, it would be less expensive to store only the pieces of the file that differ from the previous version. In this paper, we examine and explain the current algorithms used to detect document similarity between files: Karp-Rabin, Winnowing, TDDD, and 2Min. We run the algorithms on various datasets, such as the Internet Archive, gcc and emacs source files, and randomly generated files, to determine which algorithm finds the most document similarity. We run timing experiments to determine the speed of each algorithm. We also present two new algorithms, created by modifying the 2Min algorithm, which outperform the original in finding more document similarity.
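To give a flavor of the fingerprinting approach the abstract refers to, here is a minimal sketch of Karp-Rabin rolling hashes combined with winnowing. All names and parameters (`k`, `w`, the base and modulus) are illustrative choices, not the exact settings used in the thesis:

```python
def karp_rabin_hashes(text, k, base=256, mod=(1 << 31) - 1):
    """Rolling hashes of all k-grams of `text`, Karp-Rabin style."""
    if len(text) < k:
        return []
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    # Precompute base^(k-1) to drop the leftmost character each step.
    high = pow(base, k - 1, mod)
    for i in range(k, len(text)):
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes

def winnow(hashes, w):
    """Keep one fingerprint per window of w hashes: the minimum
    (rightmost on ties), so near-identical files select the same hashes."""
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        pos = i + max(j for j, h in enumerate(window) if h == m)
        fingerprints.add((m, pos))
    return fingerprints

# Two versions of a "document" differing by one word: their fingerprint
# sets share hash values over the unchanged regions.
a = winnow(karp_rabin_hashes("the quick brown fox jumps", 5), 4)
b = winnow(karp_rabin_hashes("the quick brown dog jumps", 5), 4)
shared = {h for h, _ in a} & {h for h, _ in b}
print(len(shared) > 0)  # shared fingerprints indicate overlapping content
```

Comparing the selected fingerprints of two file versions, rather than the files themselves, is what makes these similarity checks cheap enough to run over large versioned collections.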