mdomsch commented Jun 18, 2012

The top changes in this request add duplicate file / hardlink recognition and avoid transfers when a file that can serve as a copy source already exists at the destination. This branch also includes my previous pull requests adding --delay-updates and --delete-after, and applying excludes/includes at local os.walk() time.

This is in production in Fedora Infrastructure, pushing content to several S3 mirrors (one per region).

@mdomsch mdomsch Handle hardlinks and duplicate files
Minimize uploads in sync local->remote by looking for existing same
files elsewhere in remote destination and do an S3 COPY command
instead of uploading the file again.
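
A minimal sketch of the upload-versus-copy decision described above, not the actual s3cmd code. The function name `plan_transfers` and both dictionary shapes are hypothetical, and the sketch assumes `local_files` holds only files that still need to reach the remote.

```python
def plan_transfers(local_files, remote_md5s):
    """Split pending files into real uploads and remote-side copies.

    local_files: dict of relative path -> md5 hex digest (hypothetical shape)
    remote_md5s: dict of md5 hex digest -> remote key already holding that content
    """
    by_md5 = dict(remote_md5s)  # md5 -> a remote key known to hold that content
    uploads, copies = [], []
    for path in sorted(local_files):
        md5 = local_files[path]
        if md5 in by_md5:
            # Same content is already remote: copy it there instead of uploading.
            copies.append((by_md5[md5], path))
        else:
            # First instance of this content: upload it, then reuse it as a copy source.
            uploads.append(path)
            by_md5[md5] = path
    return uploads, copies
```

Each `(source, destination)` pair in `copies` would map to one remote COPY request; only the paths in `uploads` cost upload bandwidth.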

We now store the (locally generated) md5 of the file in the
x-amz-meta-s3cmd-attrs metadata, because we can't count on the ETag
being correct due to multipart uploads.  Use this value if it's

This also reduces the number of local stat() calls made by
recording more useful information during the initial
os.walk().  This cuts the number of stat()s in half.
If remote doesn't have any copies of the file, we transfer one
instance first, then copy thereafter.  But we were dereferencing the
destination list improperly in this case, causing a crash.  This patch
fixes the crash cleanly.
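
The stat() saving mentioned above can be sketched roughly as follows: call stat() once per file during the walk and keep everything later steps need. This is an illustration under assumptions, not the patch itself; `walk_with_stat` and the recorded field names are hypothetical.

```python
import os

def walk_with_stat(root):
    """Walk root once, recording one stat() result per file."""
    info = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)  # the only stat() call made for this file
            info[os.path.relpath(full, root)] = {
                "size": st.st_size,
                "mtime": st.st_mtime,
                # Equal (dev, inode) pairs identify hardlinks to the same data.
                "dev": st.st_dev,
                "inode": st.st_ino,
            }
    return info
```

Later comparison steps can then read size, mtime, and inode from `info` instead of touching the filesystem again.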
@mdomsch mdomsch handle remote->local transfers with local hardlink/copy if possible
Reworked some of the hardlink / same file detection code to be a
little more general purpose.  Now it can be used to detect duplicate
files on either remote or local side.

When transferring remote->local, if we already have a copy (same
md5sum) of a file locally that we would otherwise transfer, don't
transfer, but hardlink it.  Should hardlink not be available (e.g. on
Windows), use shutil.copy2() instead.  This lets us avoid the second
download completely.
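
A small sketch of the hardlink-with-copy-fallback idea, assuming a helper named `link_or_copy` (hypothetical, not taken from the patch):

```python
import os
import shutil

def link_or_copy(existing, dest):
    """Make dest hold the same content as existing without downloading it."""
    try:
        os.link(existing, dest)  # cheap: no data is duplicated
        return "hardlink"
    except (AttributeError, OSError):
        # os.link is missing or failed (unsupported filesystem, cross-device
        # link, ...): fall back to a full copy that preserves metadata.
        shutil.copy2(existing, dest)
        return "copy"
```

Either way the second download is skipped; the fallback only costs local disk I/O and space.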

_get_filelist_local() grew an initial list argument.  This lets us
avoid copying / merging / updating a bunch of different lists back
into one - it starts as one list and grows.  Much cleaner (and the
fact these were separate cost me several hours of debugging to track
down why something would get set, like the by_md5 hash, only to have
it be empty shortly thereafter).
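
The "initial list" pattern can be illustrated like this. It is a sketch only: `collect_local` is a hypothetical stand-in and does not reproduce the real signature of _get_filelist_local().

```python
import os

def collect_local(root, file_list=None):
    """Add the files under root to file_list and return that same object."""
    if file_list is None:
        file_list = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            file_list[full] = {"size": os.path.getsize(full)}
    return file_list
```

Calling it as `files = collect_local(dir_a)` and then `collect_local(dir_b, files)` keeps a single structure growing across calls, so side tables attached to it are never lost in a later merge step.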
s3tools member
mdomsch commented Feb 19, 2013

This was merged into 1.5.0-alpha1.

s3tools member
mdomsch commented Mar 3, 2013

For historical reference, I uploaded the Fedora EPEL tree to the Amazon S3 ap-northeast-1 (Tokyo) region overnight last night. Nearly 22GB uploaded, with another 8.7GB not uploaded due to hardlink detection. :-)

Done. Uploaded 21866691003 bytes in 50302.0 seconds, 424.52 kB/s. Copied 10775 files saving 8657911447 bytes transfer.
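
As a quick sanity check of the figures in that summary line, the arithmetic can be redone as below. The variable names are made up for illustration, and the divisor of 1024 bytes per kB is an inference from the reported numbers, not something stated in the output.

```python
uploaded_bytes = 21866691003   # "Uploaded ... bytes"
elapsed_seconds = 50302.0      # "in ... seconds"
saved_bytes = 8657911447       # "saving ... bytes transfer"

# Average upload rate, assuming 1 kB = 1024 bytes.
rate_kb_s = uploaded_bytes / elapsed_seconds / 1024

# Share of the total tree that never had to be uploaded.
saved_fraction = saved_bytes / (uploaded_bytes + saved_bytes)

print(f"{rate_kb_s:.2f} kB/s, {saved_fraction:.1%} of bytes avoided")
```

The computed rate rounds to the reported 424.52 kB/s, and a bit more than a quarter of the tree's bytes were handled by remote copies instead of uploads.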

