Parallel workers #2

Open
cyno opened this Issue Aug 2, 2011 · 35 comments

@cyno
cyno commented Aug 2, 2011

Starting point could be this nice fork by pcorliss who added parallel workers:
https://github.com/pcorliss/s3cmd-modification
In my case, this speeds up my uploads by 50x.
My use case is uploading 20,000,000 small files across several S3 locations (Pearltrees.com assets).

I also forked pcorliss's modifications to add parallel workers to the cp and mv commands:
https://github.com/cyno/s3cmd-modification

As you are currently working on the 1.1 release, it would be very nice to resolve the conflicts and merge this great fork, even though the code has diverged since then.

@cyno cyno closed this Aug 2, 2011
@cyno cyno reopened this Aug 4, 2011
@jbraeuer
Contributor
jbraeuer commented Feb 8, 2012

I'd like to see this in s3cmd master. Hosting a static website with lots of small files (.css, .js) on S3 is a common use case. So +1.

@sylvinus
sylvinus commented Mar 7, 2012

+1, would love to have this when uploading lots of files to s3

@solidsnack

This would be really helpful in my work as a system administrator.

@sylvinus

Is this feature on the roadmap? Our uploads are very slow because of the large number of files, and I can't get Pearltrees' fork to work reliably. What about a bounty to finish this?

@muness
muness commented Jul 28, 2012

Add my vote for some version of this as well. It'd be great to have parallelization. In the meantime, we're going to have to use Pearltrees's fork.

@clstokes

+1

@lqez
lqez commented Mar 14, 2013

+1

@skyhorse

+1

@mdomsch
Member
mdomsch commented Mar 18, 2013

I took a quick look at this. We would need to extend the (new) connection reuse (ConnMan) that's in 1.5.0-alpha3 to be able to have multiple connections per endpoint, one per worker thread. I don't see that in ConnMan right now. So this isn't a trivial port of the existing work. For sync, the _upload() code path should be easily shardable across multiple parallel threads.
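
For illustration only (not a patch): sharding the sync upload list across worker threads could look roughly like the Python sketch below. upload_file() is just a stand-in for whatever per-file upload routine the sync code path uses, not a real s3cmd function.

import threading
from queue import Queue, Empty

def parallel_upload(pending_files, upload_file, workers=4):
    # Shard the pending upload list across N worker threads.
    q = Queue()
    for f in pending_files:
        q.put(f)

    def worker():
        while True:
            try:
                f = q.get_nowait()
            except Empty:
                return  # queue drained, this worker is done
            # Each thread needs its own S3 connection (or one checked
            # out of a shared pool); a single connection is not thread-safe.
            upload_file(f)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()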

@mdomsch mdomsch closed this Mar 18, 2013
@mdomsch mdomsch reopened this Mar 18, 2013
@mludvig
Contributor
mludvig commented Mar 18, 2013

ConnMan was designed to support that exact scenario - multiple threads sharing a pool of connections to S3. So no problem there. But the rewrite of the core to support threads is a big undertaking to make it right (indeed a quick hack to parallelise this or that code path may be easier but not quite what we want). Will see after 1.5.0 release if I can revive some old work done in this space.
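
To make the idea concrete, a thread-shared pool along these lines is sketched below. This is only an illustration: ConnMan's real interface may differ, and new_connection() is a placeholder for whatever actually opens an S3 connection.

from queue import Queue, Empty, Full

class ConnectionPool(object):
    # Minimal thread-safe pool: worker threads check connections out and back in.
    def __init__(self, new_connection, max_idle=8):
        self._new_connection = new_connection
        self._idle = Queue(maxsize=max_idle)

    def get(self):
        # Reuse an idle connection if one is available, otherwise open a new one.
        try:
            return self._idle.get_nowait()
        except Empty:
            return self._new_connection()

    def put(self, conn):
        # Return the connection for reuse; drop it if the pool is already full
        # (assumes the connection object has a close() method).
        try:
            self._idle.put_nowait(conn)
        except Full:
            conn.close()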

@strathausen

+1

@jmusits
jmusits commented Apr 15, 2013

It would be great to see this implemented in a future release. It would really speed up file transfers to S3 in my scenario.

@briandailey

+1

@edeustace

+1

@messick
messick commented Feb 14, 2014

👍

@jstanden

+1, I've been using pcorliss/s3cmd-modification for many years, but it hasn't been synced with master here since about 2010. We sync incrementals to S3 in nightly batches, and being able to use --parallel --workers=n makes that possible without huge waits. I'd love to have that functionality on top of all the recent improvements.

@ceefour
ceefour commented Apr 27, 2014

+1 for this :)

@alex88
alex88 commented Jun 13, 2014

👍

@PanMan
Contributor
PanMan commented Aug 12, 2014

👍

@verybigbadboy

👍

@dannyman

What I have been doing is something like:

cd /target_dir
for d in *; do
 # If there are errors, make some noise
 egrep -1 'ERROR' /tmp/sync-$d.log
 # Test lock before truncating log file
 /usr/bin/flock -n /tmp/sync-$d true && /usr/bin/flock -n /tmp/sync-$d s3cmd -v ..args..  sync /target_dir/$d/ s3://target_bucket/$d/ > /tmp/sync-$d.log 2>&1 &
 sleep 60
done

Something like that can run in a fairly tight cron loop, and it will at least parallelize across the top-level directories. The flock calls ensure a given directory is only synced by one process at a time, and the egrep surfaces any errors encountered on a previous run. Crude, but hopefully effective.

Of course, if I could do without a shell loop .. :)

@dieend
dieend commented Jan 28, 2015

Do we still not have this feature? Downloading 20,000 files takes a very long time because there is no parallel capability.

@overthink

+1

(s4cmd https://github.com/bloomreach/s4cmd is often MUCH faster due to concurrent workers... but it lacks other features that s3cmd has...)

@castro1688

Please.. ??

@fubarhouse

+1

@albarki
albarki commented Jun 15, 2016

+1
Parallel workers are urgently needed and would save us a lot of time.

@jlevy
jlevy commented Aug 26, 2016

For those looking here: Although other tools don't exactly replicate everything from s3cmd, the best solution for anyone handling lots of files at once is probably to use s4cmd or aws-cli (discussed here), both of which have parallel workers.

@dannyman

Thanks, @jlevy ... do you know if aws-cli now supports large file transfers? The reason I needed to switch to s3cmd was to chunk large files.

@jlevy
jlevy commented Aug 29, 2016

@dannyman yes, see the article linked in my last comment; it handles multipart.

Slight digression, but would be interested to see a benchmark of s4cmd vs aws-cli for common scenarios like in that blog — anyone want to run those or similar tests with both? 😃 There is some uncertainty around perf for all these tools and a good benchmark would help. Cc @chouhanyang and bloomreach/s4cmd#72

@dannyman
dannyman commented Aug 29, 2016 edited

From what I can tell, vis-à-vis sync:

  • aws-cli compares mtime
  • s4cmd compares md5

In my non-scientific testing thus far, it would appear that aws-cli is faster for the task of keeping things in sync. The upload seemed dang zippy as well. In experimenting with s4cmd I had to hot-patch a bug whereby it treats zero-length files as unreadable and throws an error. Some have noted that for a sufficiently large file hierarchy sync, s4cmd can run out of RAM.

My main source of woe is having written my backup scripts two years ago, when efficiently syncing vast directories (80TB+) which may contain large files was especially painful. With any luck I can now boil it down to a single aws-cli command, or at least a single shell loop that can be run serially. :) Thanks x10^6, @jlevy!!

Update: running aws-cli over the entire 80TB directory ... it is still indexing files but it has taken action along the way. I have yet to see another tool actually get syncs running over a partial index. At least, that is what I infer from this message:
Completed 64573 part(s) with ... file(s) remaining
Color me extremely impressed with aws-cli

@isegal
isegal commented Oct 6, 2016 edited

👍 Parallel uploads are extremely useful. In our case, using a top-tier C4 instance, we were able to push thousands of files per second up to S3 using https://github.com/mishudark/s3-parallel-put, so it would be nice to see this supported in s3cmd.

@tgmedia-nz

same here - is there any progress?? :D

@mdomsch
Member
mdomsch commented Nov 30, 2016