Parallel destinations #59

Closed
wants to merge 17 commits

Conversation

Member

mdomsch commented Jun 18, 2012

(This builds on top of my previous pull requests, specifically the hardlink handling request. This also uses os.fork() and os.wait(), which don't exist on Windows. If Windows compatibility is still necessary, that will need to be checked for and handled accordingly.)

sync: add --add-destination, parallelize uploads to multiple destinations

Only meaningful at present in the sync local->remote(s) case, this
adds the --add-destination <foo> command line option.  For the last
arg (the traditional destination), and each destination specified via
--add-destination, fork and upload after the initial walk of the local
file system has completed (and done all the disk I/O to calculate md5
values for each file).

This keeps us from pounding the file system doing (the same) disk I/O
for each possible destination, and allows full use of our bandwidth to
upload in parallel.
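The fork-per-destination pattern described above can be sketched as follows. This is an illustrative sketch, not s3cmd's actual internals; `upload_all`, `local_list`, and `destinations` are hypothetical names standing in for the real sync machinery:

```python
import os

def sync_to_destinations(destinations, local_list, upload_all):
    """Fork one child per destination. The expensive local walk and
    md5 calculation happen once, before this is called; each child
    only does network I/O against its own destination."""
    pids = []
    for dest in destinations:
        pid = os.fork()
        if pid == 0:
            # child: upload the shared file list to one destination
            upload_all(local_list, dest)
            os._exit(0)
        pids.append(pid)
    # parent: reap every child before reporting completion
    for _ in pids:
        os.wait()
```

Because the children are forked after the walk, they all inherit the same in-memory file list and md5 values, so the disk is read only once no matter how many destinations there are.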

mdomsch added some commits Feb 24, 2012

@mdomsch mdomsch Apply excludes/includes at local os.walk() time 2e4769e
@mdomsch mdomsch add --delete-after option for sync 3b3727d
@mdomsch mdomsch add more --delete-after to sync variations 5ca02bd
@mdomsch mdomsch Merge remote-tracking branch 'origin/master' into merge b40aa2a
@mdomsch mdomsch Merge branch 'delete-after' into merge 598402b
@mdomsch mdomsch add Config.delete_after b62ce58
@mdomsch mdomsch Merge branch 'delete-after' into merge e1fe732
@mdomsch mdomsch fix os.walk() exclusions for new upstream code 1eaad64
@mdomsch mdomsch Merge branch 'master' into merge ad1f8cc
@mdomsch mdomsch add --delay-updates option c42c3f2
@mdomsch mdomsch finish merge 2dfe4a6
@mdomsch @mdomsch mdomsch + mdomsch Handle hardlinks and duplicate files
Minimize uploads in sync local->remote by looking for existing same
files elsewhere in remote destination and do an S3 COPY command
instead of uploading the file again.

We now store the (locally generated) md5 of the file in the
x-amz-meta-s3cmd-attrs metadata, because we can't count on the ETag
being correct due to multipart uploads.  Use this value if it's
available.

This also reduces the number of local stat() calls by recording
more useful information during the initial os.walk().  This cuts
the number of stat()s in half.
264ef82
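The checksum-selection logic described in this commit can be sketched as below. The parsing of `x-amz-meta-s3cmd-attrs` is simplified relative to s3cmd's real attribute format, and the helper name is illustrative; the key point is preferring the stored md5 over the ETag, which is not a plain md5 for multipart uploads:

```python
def remote_md5(headers):
    """Return a trustworthy md5 for a remote object, or None.

    Prefer the locally generated md5 stored in x-amz-meta-s3cmd-attrs;
    fall back to the ETag only when it is not a multipart ETag
    (multipart ETags carry a '-N' part-count suffix)."""
    attrs = headers.get("x-amz-meta-s3cmd-attrs", "")
    for field in attrs.split("/"):
        if field.startswith("md5:"):
            return field[len("md5:"):]
    etag = headers.get("etag", "").strip('"')
    if etag and "-" not in etag:
        return etag
    return None
```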
@mdomsch @mdomsch mdomsch + mdomsch hardlink/copy fix
If remote doesn't have any copies of the file, we transfer one
instance first, then copy thereafter.  But we were dereferencing the
destination list improperly in this case, causing a crash.  This patch
fixes the crash cleanly.
a6e43c4
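The upload-once-then-copy flow this commit fixes looks roughly like the following sketch (function and parameter names are illustrative, not s3cmd's): group files by md5, transfer one instance of each group, then issue server-side copies for the duplicates.

```python
def upload_or_copy(files_by_md5, upload, remote_copy):
    """For each group of identical files (same md5), upload the first
    instance, then remote-copy it to the remaining destinations
    instead of re-uploading the same bytes."""
    for md5, paths in files_by_md5.items():
        first, rest = paths[0], paths[1:]
        upload(first)
        for path in rest:
            remote_copy(src=first, dst=path)
```

The crash described above came from dereferencing the destination list for the copies; keeping the "first upload" and the "copy the rest" steps explicitly separate makes that indexing hard to get wrong.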
@mdomsch @mdomsch mdomsch + mdomsch remote_copy() doesn't need to know dst_list anymore cdf25f9
@mdomsch @mdomsch mdomsch + mdomsch handle remote->local transfers with local hardlink/copy if possible
Reworked some of the hardlink / same file detection code to be a
little more general purpose.  Now it can be used to detect duplicate
files on either remote or local side.

When transferring remote->local, if we already have a local copy
(same md5sum) of a file we would otherwise transfer, don't
transfer it; hardlink it instead.  Should hardlinks not be
available (e.g. on Windows), use shutil.copy2() instead.  This
lets us avoid the second download completely.

_get_filelist_local() grew an initial list argument.  This lets us
avoid copying / merging / updating a bunch of different lists back
into one - it starts as one list and grows.  Much cleaner (the fact
that these were separate lists cost me several hours of debugging to
track down why something like the by_md5 hash would get set, only to
be empty shortly thereafter).
f881b16
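The hardlink-with-fallback behavior can be sketched in a few lines; `link_or_copy` is an illustrative name, not s3cmd's actual helper:

```python
import os
import shutil

def link_or_copy(existing, target):
    """Avoid a second download: hardlink an identical local file into
    place.  Fall back to a plain copy (preserving metadata) where
    hardlinks are unsupported, e.g. on Windows or across filesystems."""
    try:
        os.link(existing, target)
    except (AttributeError, OSError):
        shutil.copy2(existing, target)
```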
@mdomsch @mdomsch mdomsch + mdomsch sync: add --add-destination, parallelize uploads to multiple destinations

7de0789
@mdomsch @mdomsch mdomsch + mdomsch sync: refactor parent/child and single process code
os.fork() and os.wait() don't exist on Windows, and the
multiprocessing module doesn't exist until Python 2.6.  So instead, we
conditionalize calling os.fork() on its existence, and on there
being > 1 destination.

Also simply rearranges the code so that subfunctions within
local2remote are defined at the top of their respective functions, for
better readability through the main execution of the function.
0277256
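The parent/child refactoring described above amounts to gating the fork path on both conditions, falling back to a plain sequential loop otherwise. A minimal sketch, with illustrative names (`run_sync`, `do_one`) rather than s3cmd's actual functions:

```python
import os

def run_sync(destinations, do_one):
    """Fork one uploader per destination only when os.fork() exists
    (it doesn't on Windows) and there is more than one destination;
    otherwise run the uploads sequentially in this process."""
    if hasattr(os, "fork") and len(destinations) > 1:
        for dest in destinations:
            if os.fork() == 0:
                do_one(dest)   # child handles exactly one destination
                os._exit(0)
        for _ in destinations:
            os.wait()          # parent reaps every child
    else:
        for dest in destinations:
            do_one(dest)       # single-process fallback
```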
mdomsch commented Feb 19, 2013

This was merged in 1.5.0-alpha1.

@mdomsch mdomsch closed this Feb 19, 2013
