Find duplicate files without reading their full content #49
I tried to read and understand what is going on with lazy hashing, but I failed — for now it seems I can only read my own code. But if I understand correctly, it keeps the files open. This looks like it would be a very fast solution, but it isn't suitable for the current Czkawka version:
It doesn't have to keep the files open. It's fine to close and reopen them when needed. The key insight is that if you split a file into multiple hashes (an array of hashes) and put these multi-hashes in a tree (a B-tree or binary tree), then you don't need to know all of the hashes at once. You only need to compare them as bigger/smaller, which means you can stop reading files as soon as you find a difference. And when all of the files are in the same tree, you compare each file against only the minimum number of other files as you go down the tree, and you compare only the minimum number of hashes.
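The idea above can be sketched as follows. This is a minimal illustration, not Czkawka's or dupe-krill's actual code: it treats a file as a sequence of block hashes and compares two "files" lexicographically, hashing only up to the first differing block. The FNV-1a hash, the block size, and the use of in-memory byte slices instead of real file I/O are all simplifications for the sake of a self-contained example.

```rust
use std::cmp::Ordering;

/// Stand-in for reading and hashing the n-th block of a file.
/// Returns None once we are past the end of the data (end of file).
fn block_hash(data: &[u8], block: usize, block_size: usize) -> Option<u64> {
    let start = block * block_size;
    if start >= data.len() {
        return None;
    }
    let end = (start + block_size).min(data.len());
    // FNV-1a: a simple, well-known hash; any hash works for this sketch.
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in &data[start..end] {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    Some(h)
}

/// Compare two "files" block by block, hashing only as far as the first
/// differing block. The resulting Ordering can serve as the key comparison
/// for a B-tree, so inserting all files into one tree naturally groups
/// duplicates while reading each file only as much as necessary.
fn lazy_cmp(a: &[u8], b: &[u8], block_size: usize) -> Ordering {
    let mut i = 0;
    loop {
        match (block_hash(a, i, block_size), block_hash(b, i, block_size)) {
            (None, None) => return Ordering::Equal,   // same length, all blocks equal
            (None, Some(_)) => return Ordering::Less,  // a is a prefix of b
            (Some(_), None) => return Ordering::Greater,
            (Some(ha), Some(hb)) => match ha.cmp(&hb) {
                Ordering::Equal => i += 1, // blocks match: read one more block
                other => return other,     // first difference: stop reading
            },
        }
    }
}

fn main() {
    let big_a = vec![0u8; 1 << 20];
    let mut big_b = big_a.clone();
    big_b[0] = 1; // differs in the first block: decided after one block each
    assert_eq!(lazy_cmp(&big_a, &big_a.clone(), 4096), Ordering::Equal);
    assert_ne!(lazy_cmp(&big_a, &big_b, 4096), Ordering::Equal);
    println!("ok");
}
```

Note that only truly identical files ever force a full read; any pair that differs is separated as soon as one block hash disagrees, which is exactly why putting these lazy comparisons behind a tree keeps the total I/O low.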
I'd like to link this thread with this idea: #640
You could use the approach of https://github.com/kornelski/dupe-krill to hash only as little as necessary, instead of hashing whole files.