Find duplicate files without reading their full content #49

Open
kornelski opened this issue Oct 7, 2020 · 3 comments
Labels: enhancement (New feature or request)

Comments

@kornelski

You could use the approach of https://github.com/kornelski/dupe-krill to hash only as little as necessary, instead of hashing whole files.

qarmin added the enhancement label Oct 7, 2020

@qarmin (Owner) commented Jan 14, 2021

I tried to read and understand what is going on with the lazy hashing, but I failed; it seems that for now I can only read my own code.

But if I understand correctly, it opens the n files that are in a group with the same size, reads a part of each file, hashes it, and compares it with the other partial hashes. Unique hashes are then thrown out and the whole process repeats until the data ends (a rough sketch of this loop follows the list below).

This looks like it should be a very fast solution, but it isn't suitable for the current Czkawka version:

  • First, Windows appears to have a limit of 512 files opened at one time. The current implementation opens at most one file per available virtual processor, but lazy hashing could probably exceed that limit when checking e.g. 1000 identical files.
  • But most importantly, it would really complicate caching. The recently added cache feature is based on saving/loading the full file hash.
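
A rough sketch of that loop in Rust (illustrative only; the function names, the 64 KiB block size, and the use of DefaultHasher are assumptions made for this sketch, not Czkawka's or dupe-krill's actual code):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};

const BLOCK_SIZE: usize = 64 * 1024;

/// Takes one group of files that all share the same size and repeatedly splits
/// it by the hash of the next block; any group that shrinks to a single file is
/// dropped, because that file can no longer have a duplicate.
fn find_duplicate_groups(same_size_files: Vec<PathBuf>, file_size: u64) -> Vec<Vec<PathBuf>> {
    let mut groups = vec![same_size_files];
    let mut offset = 0u64;
    while offset < file_size && !groups.is_empty() {
        let mut next: Vec<Vec<PathBuf>> = Vec::new();
        for group in groups {
            // Split this group by the hash of the block starting at `offset`.
            let mut by_hash: HashMap<u64, Vec<PathBuf>> = HashMap::new();
            for path in group {
                if let Ok(hash) = hash_block(&path, offset) {
                    by_hash.entry(hash).or_default().push(path);
                }
            }
            // Keep only the subgroups that still contain more than one file.
            next.extend(by_hash.into_values().filter(|g| g.len() > 1));
        }
        groups = next;
        offset += BLOCK_SIZE as u64;
    }
    groups // every surviving group holds files whose blocks all hashed the same
}

/// Open the file, seek to `offset`, read one block and hash it. The file is
/// reopened for every block, so no handle stays open between rounds.
fn hash_block(path: &Path, offset: u64) -> std::io::Result<u64> {
    use std::hash::{Hash, Hasher};
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; BLOCK_SIZE];
    let n = file.read(&mut buf)?; // a real implementation would loop until EOF or a full block
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    buf[..n].hash(&mut hasher);
    Ok(hasher.finish())
}
```

A real implementation would use a stronger hash (equal DefaultHasher outputs do not prove equal content) and smarter I/O, but the shape of the loop is the same: read a little, regroup, drop singletons, repeat.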

@kornelski (Author) commented Jan 14, 2021

It doesn't have to keep the files open. It's fine to close and reopen them when needed.

The key insight is that if you split a file into multiple hashes (an array of hashes) and put these multi-hashes in a tree (a B-tree or binary tree), then you don't need to know all of the hashes at once. You only need to compare them as bigger/smaller. This means you can stop reading files as soon as you find a difference. And when all of the files are in the same tree, you only compare each file against the minimum number of other files as you walk down the tree, and you only need to compare the minimum number of hashes.
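
A minimal sketch of that idea, assuming a fixed block size; the type name LazyHashes, the RefCell-based lazily extended hash vector, and DefaultHasher are illustrative choices for this sketch, not dupe-krill's real API:

```rust
use std::cell::RefCell;
use std::cmp::Ordering;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};

const BLOCK_SIZE: usize = 64 * 1024;

/// Hashes of a file's consecutive blocks, computed only when a comparison asks for them.
struct LazyHashes {
    path: PathBuf,
    size: u64,
    hashes: RefCell<Vec<u64>>,
}

impl LazyHashes {
    fn new(path: PathBuf) -> std::io::Result<Self> {
        let size = std::fs::metadata(&path)?.len();
        Ok(Self { path, size, hashes: RefCell::new(Vec::new()) })
    }

    // Hash of block `index`, read and hashed on first use; `None` means the file
    // has no block at that index (end of file, or a read error in this sketch).
    fn hash_at(&self, index: usize) -> Option<u64> {
        let mut hashes = self.hashes.borrow_mut();
        while hashes.len() <= index {
            let offset = hashes.len() as u64 * BLOCK_SIZE as u64;
            if offset >= self.size {
                return None;
            }
            hashes.push(hash_block(&self.path, offset).ok()?);
        }
        Some(hashes[index])
    }
}

impl Ord for LazyHashes {
    // Compare block by block; stop comparing (and stop reading) at the first difference.
    fn cmp(&self, other: &Self) -> Ordering {
        let mut i = 0;
        loop {
            match (self.hash_at(i), other.hash_at(i)) {
                (Some(a), Some(b)) if a == b => i += 1,
                (Some(a), Some(b)) => return a.cmp(&b),
                (None, None) => return Ordering::Equal,
                (None, Some(_)) => return Ordering::Less,
                (Some(_), None) => return Ordering::Greater,
            }
        }
    }
}
impl PartialOrd for LazyHashes {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl PartialEq for LazyHashes {
    fn eq(&self, other: &Self) -> bool { self.cmp(other) == Ordering::Equal }
}
impl Eq for LazyHashes {}

/// Open, seek, read one block, hash it (same helper shape as in the sketch above).
fn hash_block(path: &Path, offset: u64) -> std::io::Result<u64> {
    use std::hash::{Hash, Hasher};
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; BLOCK_SIZE];
    let n = file.read(&mut buf)?;
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    buf[..n].hash(&mut hasher);
    Ok(hasher.finish())
}
```

With an ordering like this, inserting each candidate file into a BTreeMap keyed by LazyHashes (or an equivalent tree structure) groups duplicates under the same key, and the comparisons made during insertion never read more blocks than are needed to find the first difference.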

@x2es commented Feb 27, 2022

I'd like to link this thread with this idea: #640
