
ADD deduplicate Dropbox files based on checksum ignoring name #1674

Closed
allanlaal opened this issue Sep 12, 2017 · 23 comments

Comments

@allanlaal

use case:

I use Dropbox (because I'm masochistic). Dropbox likes to append (1), (2), etc. to filenames when merging two folders with identical files. I can't just delete everything with (1) etc. in the name, because such a file might no longer be a duplicate.

I would like rclone to dedupe those folders using the checksum of each file, ignoring names but keeping the shortest name (so it would prefer file.mp4 over file (1).mp4).

@breunigs breunigs assigned breunigs and unassigned breunigs Sep 14, 2017
@anchepiece

While there may be some use cases where it makes sense for rclone to have that option, why not pull this idea out as its own feature? I feel it could be done as its own tool, or by modifying an existing tool to monitor your Dropbox folder.

@allanlaal
Author

There are already such tools out there, but none of them can access Dropbox on their own.

@ncw
Member

ncw commented Dec 12, 2017

This would be nice to have, implemented either as part of the rclone dedupe command or as a new command.

@ascendbruce

Some random thoughts:

  1. I assume this dedupe behavior should only look for duplicates within the same folder?
  2. If not, it might be better to provide more flexibility in deciding which file to keep
  3. @allanlaal, you may want to use https://github.com/adrianlopezroche/fdupes or http://www.commandlinefu.com/commands/view/10039/find-duplicate-files-based-on-md5-hash-for-mac-os-x to build a simple dedupe script in the meantime

@ncw
Member

ncw commented Mar 10, 2018

If you are thinking about scripting it, then I'd use "rclone md5sum remote:path | sort | uniq -c | sort -n" as a starting point, which will do nearly all the work for you.
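
A rough sketch building on that one-liner, assuming GNU coreutils and treating remote:path as a placeholder: extracting the checksum column first groups files that share a hash even when their paths differ.

rclone md5sum remote:path | sort > sums.txt
awk '{print $1}' sums.txt | sort | uniq -d > dupe-hashes.txt    # checksums that occur more than once
grep -F -f dupe-hashes.txt sums.txt                             # the files carrying those checksums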

@dzg

dzg commented Jun 2, 2018

Wouldn't rclone md5sum remote:path | sort | uniq -c | sort -n possibly match 2 files with different sizes but the same md5? Or are the odds of that too small to be significant?

Most dupe finders first compare sizes, and only if those are identical do they compare checksums.

@breunigs
Collaborator

breunigs commented Jun 3, 2018

It's unlikely. Comparing file size first is mostly a performance gain, since that information is usually available in a filesystem's index, i.e. quick to access. Computing the MD5 requires reading the whole file on most filesystems, which is rather slow.
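
A size-first pass can be sketched with rclone lsf as well (a hedged example; the default ';' separator and a placeholder remote:path are assumed). It prints the sizes that occur more than once, so only files with those sizes would need to be checksummed:

rclone lsf -R --files-only --format "sp" remote:path | awk -F';' '{print $1}' | sort -n | uniq -d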

@TioBorracho
Contributor

Will look into this, as I am also interested in adding some fslint-like functions.

@allanlaal allanlaal changed the title ADD dedup based on checksum ignoring name ADD deduplicate files based on checksum ignoring name Apr 15, 2019
@allanlaal allanlaal changed the title ADD deduplicate files based on checksum ignoring name ADD deduplicate Dropbox files based on checksum ignoring name Apr 15, 2019
@allanlaal
Author

does anyone have a solution?

@ncw
Member

ncw commented Apr 16, 2019

does anyone have a solution?

You could use something like this as a start

rclone lsf -R --files-only --hash DropboxHash --format hp dropbox: | sort | uniq -D -w 64

which will print all files with the same content hash
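
Going one step further toward the original request (keep the shortest name), here is a hedged sketch rather than an official rclone workflow: it feeds the duplicate listing above into awk, keeps the shortest path per hash, and prints rclone deletefile commands for the rest. The default ';' separator of lsf is assumed (paths containing ';' would need a different --separator); review the printed commands before running any of them.

rclone lsf -R --files-only --hash DropboxHash --format hp dropbox: | sort | uniq -D -w 64 | awk -F';' '
  {
    hash = $1; path = $2
    if (keep[hash] == "") {
      keep[hash] = path                                      # first file seen for this hash
    } else if (length(path) < length(keep[hash])) {
      print "rclone deletefile \"dropbox:" keep[hash] "\""   # shorter name wins, drop the previous keeper
      keep[hash] = path
    } else {
      print "rclone deletefile \"dropbox:" path "\""
    }
  }'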

@Nottt

Nottt commented Apr 13, 2020

Is there an equivalent for Google Drive?

@ncw
Member

ncw commented Apr 15, 2020

This will do it for Google Drive:

rclone lsf -R --drive-skip-gdocs --files-only drive:test --format hp | sort | uniq -D -w 32

@Nottt

Nottt commented Apr 19, 2020

Wow, I didn't expect to have such a long list of duplicated files... I thought rclone dedupe wouldn't let this happen...

So now how do I get rid of the duplicates?

@ncw
Member

ncw commented Apr 23, 2020

Wow, I didn't expect to have such a long list of duplicated files... I thought rclone dedupe wouldn't let this happen...

rclone dedupe only deduplicates files with identical names and paths... I suspect yours don't have that.

So now how do I get rid of the duplicates?

It is not an easy problem... But I'd look through the results and work out whether there are whole duplicated directories that I could get rid of.
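
One hedged way to confirm that two suspect directories really are identical before removing one (the directory names here are just placeholders): rclone check compares the files in two paths by size and hash, without downloading them when both remotes supply hashes.

rclone check "drive:photos/2019" "drive:photos/2019 (1)"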

@Nottt

Nottt commented Apr 25, 2020

I think the documentation should be clearer about that then, because most people assume that deduplicating means deleting duplicated files regardless of whether the filenames are the same - i.e. by checking checksums.

@NoLooseEnds
Contributor

Bumping this. Would be really nice to have.

ncw added a commit that referenced this issue Oct 13, 2020
@ncw
Member

ncw commented Oct 13, 2020

I had an idea about this... It turned out to be very easy to add another flag, --by-hash, to dedupe to enable this functionality.

Anyone like to give rclone dedupe --by-hash a go? Recommended testing with -i.

Any thoughts on the user interface - is --by-hash a good enough flag?

v1.54.0-beta.4819.52f25b1b0.fix-1674-dedupe-by-hash on branch fix-1674-dedupe-by-hash (uploaded in 15-30 mins)
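
For anyone testing, the suggested invocation would look something like this (the remote name is a placeholder; -i asks for confirmation before every change):

rclone dedupe --by-hash -i dropbox: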

@NoLooseEnds
Contributor

How would this work against a gdrive remote (i.e. would gdrive report the hash, or would I need to download anything to get it)? My current use case is backing up from Dropbox to gdrive.

@ncw
Member

ncw commented Oct 14, 2020

How would this work against a gdrive remote (i.e. would gdrive report the hash, or would I need to download anything to get it)? My current use case is backing up from Dropbox to gdrive.

It will work against any remote which supplies hashes, which includes both Google Drive and Dropbox

@imjuzcy

imjuzcy commented Oct 22, 2020

@ncw I'm trying out the --by-hash feature. It seems to be working, but it just lacks automation: it needs user input for every duplicate. I know that in most cases it's better to require user input, but some kind of automation would be nice (for example, keeping the longest or shortest name).

@ncw
Member

ncw commented Oct 26, 2020

@ncw I'm trying out the --by-hash feature. It seems to be working, but it just lacks automation: it needs user input for every duplicate. I know that in most cases it's better to require user input, but some kind of automation would be nice (for example, keeping the longest or shortest name).

You should be able to use any of the dedupe modes:

  • --dedupe-mode interactive - interactive as above.
  • --dedupe-mode skip - removes identical files then skips anything left.
  • --dedupe-mode first - removes identical files then keeps the first one.
  • --dedupe-mode newest - removes identical files then keeps the newest one.
  • --dedupe-mode oldest - removes identical files then keeps the oldest one.
  • --dedupe-mode largest - removes identical files then keeps the largest one.
  • --dedupe-mode smallest - removes identical files then keeps the smallest one.
  • --dedupe-mode rename - removes identical files then renames the rest to be different.

Longest or shortest name is missing from that list though!
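
Combining the flags discussed above into a non-interactive run might look like this (a sketch; check rclone dedupe --help on your version, and consider adding --dry-run first):

rclone dedupe --by-hash --dedupe-mode newest dropbox: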

@ncw
Member

ncw commented Dec 2, 2020

I've merged the --by-hash flag to master now, which means it will be in the latest beta in 15-30 minutes and released in v1.54.

@GrahamCobb

Great feature, thanks. It would be even better if there were a similar --by-size flag.

By itself, --by-size would work similarly to --by-hash: consider files to be duplicates if they have the same size. But much more useful would be using the two together: --by-hash --by-size would mean considering files duplicates only if they have the same size and the same hash.
