This is the git repo for my thesis. All of the code and datasets are presented here and open. I have also attached my thesis to this repo, in case anyone wants to read it :).
Abstract
There are many cases in which a file is created with only minimal modifications from the previous version. This may occur in versioned document sets such as Wikipedia, where a newer version is created by inserting or deleting a paragraph, fixing spelling issues, or even simply correcting a grammar error. Instead of storing the newer version of the file in its entirety, it would be less expensive to store only the pieces of the file that differ from the previous version. In this paper, we examine and explain the current algorithms used to detect document similarity between files: Karp-Rabin, Winnowing, TDDD, and 2Min. We run the algorithms on various datasets, such as the Internet Archive, gcc and emacs source files, and randomly generated files, to determine which algorithm finds the most document similarity. We run timing experiments to determine the speed of each algorithm. We also present two new algorithms, created by modifying the 2Min algorithm, which outperform the original in finding more document similarity.
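To give a flavor of the fingerprinting approach the abstract refers to, here is a minimal sketch of Karp-Rabin rolling hashes combined with winnowing. All names and parameters (`k`, `w`, the base and modulus) are illustrative choices, not the exact settings used in the thesis:

```python
def karp_rabin_hashes(text, k, base=256, mod=(1 << 31) - 1):
    """Rolling hashes of all k-grams of `text`, Karp-Rabin style."""
    if len(text) < k:
        return []
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    # Precompute base^(k-1) to drop the leftmost character each step.
    high = pow(base, k - 1, mod)
    for i in range(k, len(text)):
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes

def winnow(hashes, w):
    """Keep one fingerprint per window of w hashes: the minimum
    (rightmost on ties), so near-identical files select the same hashes."""
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        pos = i + max(j for j, h in enumerate(window) if h == m)
        fingerprints.add((m, pos))
    return fingerprints

# Two versions of a "document" differing by one word: their fingerprint
# sets share hash values over the unchanged regions.
a = winnow(karp_rabin_hashes("the quick brown fox jumps", 5), 4)
b = winnow(karp_rabin_hashes("the quick brown dog jumps", 5), 4)
shared = {h for h, _ in a} & {h for h, _ in b}
print(len(shared) > 0)  # shared fingerprints indicate overlapping content
```

Comparing the selected fingerprints of two file versions, rather than the files themselves, is what makes these similarity checks cheap enough to run over large versioned collections.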