Minimize uploads in sync local->remote by looking for identical
existing files elsewhere in the remote destination and issuing an S3
COPY command instead of uploading the file again.
We now store the (locally generated) md5 of the file in the
x-amz-meta-s3cmd-attrs metadata, because we can't count on the ETag
being correct due to multipart uploads. Use this value if it's
present when comparing files.
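A minimal sketch of computing the md5 locally and packing it into the metadata header; the exact attrs encoding shown here is hypothetical, not s3cmd's actual format:

```python
import hashlib

def local_md5(path, blocksize=1 << 20):
    # The ETag of a multipart upload is NOT the file's md5, so we
    # compute and record the digest ourselves.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

def attrs_header(md5_hex):
    # Hypothetical packing; s3cmd stores several attrs in this header.
    return {'x-amz-meta-s3cmd-attrs': 'md5:%s' % md5_hex}
```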
This also reduces the number of local stat() calls made by
recording more useful information during the initial
os.walk(). This cuts the number of stat()s in half.
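The single-pass walk can be sketched like this (a simplified version, not s3cmd's actual code): record each file's stat() result as the tree is walked, so later size/mtime/inode lookups need no second stat().

```python
import os

def walk_with_stats(top):
    """One pass over the tree that caches each file's stat() result,
    so later comparisons don't have to stat() the file again."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)          # the only stat() for this file
            entries[path] = {
                'size': st.st_size,
                'mtime': st.st_mtime,
                'inode': st.st_ino,
                'dev': st.st_dev,
            }
    return entries
```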
If remote doesn't have any copies of the file, we transfer one
instance first, then copy thereafter. But we were dereferencing the
destination list improperly in this case, causing a crash. This patch
fixes the crash cleanly.
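The transfer-one-instance-then-copy flow above can be sketched as a small planner (a hypothetical helper, not s3cmd's actual code; both file lists are assumed to already map keys to md5 digests):

```python
def plan_upload(local_files, remote_files):
    """Return a list of ('copy', src_key, dst_key) or ('put', dst_key)
    actions for syncing local -> remote."""
    # Index remote objects by md5 so any identical object can serve
    # as a server-side COPY source.
    remote_by_md5 = {}
    for key, md5 in remote_files.items():
        remote_by_md5.setdefault(md5, key)

    actions = []
    for key, md5 in sorted(local_files.items()):
        if remote_files.get(key) == md5:
            continue                                 # already in sync
        src = remote_by_md5.get(md5)
        if src is not None:
            actions.append(('copy', src, key))       # S3-side COPY, no upload
        else:
            actions.append(('put', key))             # first instance: real upload
            remote_by_md5[md5] = key                 # later duplicates COPY this
    return actions
```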
Reworked some of the hardlink / same file detection code to be a
little more general purpose. Now it can be used to detect duplicate
files on either remote or local side.
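The general-purpose detection might look like this (a sketch under assumed file-list shapes, not s3cmd's actual code): a single grouping helper where the caller decides what "same file" means.

```python
def group_by(filelist, keyfunc):
    """Generic same-file grouping, usable on either the local or the
    remote file list. keyfunc picks what 'same' means; only groups
    with more than one member are returned."""
    groups = {}
    for path, info in filelist.items():
        groups.setdefault(keyfunc(info), []).append(path)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hardlinks on the local side share (dev, inode); duplicate content on
# either side shares an md5.
hardlinks = lambda info: (info['dev'], info['inode'])
duplicates = lambda info: info['md5']
```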
When transferring remote->local, if we already have a copy (same
md5sum) of a file locally that we would otherwise transfer, don't
transfer, but hardlink it. Should hardlinks not be available (e.g. on
Windows), use shutil.copy2() instead. This lets us avoid a second
download of content we already have locally.
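The fallback logic is simple; a minimal sketch:

```python
import os
import shutil

def link_or_copy(src, dst):
    """Prefer a hardlink (instant, no extra disk space); fall back to a
    full copy where hardlinks aren't supported (e.g. Windows, or across
    filesystems)."""
    try:
        os.link(src, dst)
    except (OSError, NotImplementedError):
        shutil.copy2(src, dst)   # copy2() also preserves mtime/permissions
```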
_get_filelist_local() grew an initial list argument. This lets us
avoid copying / merging / updating a bunch of different lists back
into one - it starts as one list and grows. Much cleaner. (The fact
that these were separate cost me several hours of debugging to track
down why something would get set, like the by_md5 hash, only to have
it be empty shortly thereafter.)
Only meaningful at present in the sync local->remote(s) case, this
adds the --add-destination <foo> command line option. For the last
arg (the traditional destination), and each destination specified via
--add-destination, fork and upload after the initial walk of the local
file system has completed (and done all the disk I/O to calculate md5
values for each file).
This keeps us from pounding the file system doing (the same) disk I/O
for each possible destination, and allows full use of our bandwidth to
upload in parallel.
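The shape of the walk-once, upload-many flow can be sketched as follows; this uses multiprocessing rather than a raw fork() for clarity, and upload_tree is a hypothetical placeholder for the real per-destination upload loop:

```python
import multiprocessing

def upload_tree(destination, filelist):
    # Placeholder: the real worker would PUT/COPY each entry to
    # `destination` here.
    for path in filelist:
        pass

def sync_to_all(destinations, filelist):
    """One local walk produced `filelist` (paths plus md5s); each
    destination then gets its own worker process, so uploads run in
    parallel without repeating the disk I/O per destination."""
    workers = [multiprocessing.Process(target=upload_tree,
                                       args=(dest, filelist))
               for dest in destinations]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [w.exitcode for w in workers]
```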
This creates and maintains a cache (aka HashCache) of each inode in
the local tree. This is used to avoid doing local disk I/O to
calculate an MD5 value for a file if its inode, mtime, and size
haven't changed. If any of these values have changed, it does the disk
I/O and recalculates the MD5 value.
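The lookup logic might look like this (a sketch; s3cmd's real HashCache differs in detail):

```python
import hashlib
import os

class HashCache:
    """Map (dev, inode, mtime, size) -> md5 so unchanged files need no
    disk reads on later syncs."""

    def __init__(self):
        self.entries = {}

    def md5(self, path):
        st = os.stat(path)
        key = (st.st_dev, st.st_ino, int(st.st_mtime), st.st_size)
        if key not in self.entries:
            # Changed (or never seen): do the disk I/O and recompute.
            h = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            self.entries[key] = h.hexdigest()
        return self.entries[key]
```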
This introduces command line option --cache-file <foo>. The file is
created if it does not exist, is read upon start and written upon
close. The contents are only useful for a given directory tree, so
caches should not be reused for different directory tree syncs.
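The read-on-start / write-on-close handling could be sketched like this (hypothetical JSON-based persistence, not s3cmd's actual on-disk format):

```python
import json
import os

def load_cache(cache_file):
    """Read the cache at startup; start empty if the file doesn't exist."""
    if not os.path.exists(cache_file):
        return {}
    with open(cache_file) as f:
        # JSON can't use tuples as keys, so keys are stored as strings.
        return {tuple(json.loads(k)): v for k, v in json.load(f).items()}

def save_cache(cache_file, entries):
    """Write the cache on close. Entries are only meaningful for the
    directory tree they were built from."""
    with open(cache_file, 'w') as f:
        json.dump({json.dumps(list(k)): v for k, v in entries.items()}, f)
```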