Parallel workers #2

Open
cyno opened this Issue · 24 comments
@cyno

Starting point could be this nice fork by pcorliss who added parallel workers:
https://github.com/pcorliss/s3cmd-modification
In my case, this speeds up my uploads by about 50x.
My use case is uploading 20,000,000 small files across several S3 locations (Pearltrees.com assets).

I also forked pcorliss's modifications to add parallel workers to the cp and mv commands:
https://github.com/cyno/s3cmd-modification

As you are currently working on the 1.1 release, the code has probably diverged from the fork, but it would be very nice to resolve the conflicts and merge this great work.

@cyno cyno closed this
@cyno cyno reopened this
@jbraeuer

I'd like to see this in s3cmd master. Hosting a static website with lots of small files (.css, .js) on S3 is a common use case. So +1.

@sylvinus

+1, would love to have this when uploading lots of files to s3

@solidsnack

This would be really helpful in my work as a system administrator.

@sylvinus

Is this feature on the roadmap? Our uploads are very slow due to the large number of files, and I can't get Pearltrees' fork to work reliably. What about a bounty to finish this?

@muness

Also adding my vote for some version of this. It'd be great to have parallelization. In the meantime, we're going to have to use Pearltrees's fork.

@lqez

+1

@mdomsch
Owner

I took a quick look at this. We would need to extend the (new) connection-reuse code (ConnMan) that's in 1.5.0-alpha3 to support multiple connections per endpoint, one per worker thread. I don't see that in ConnMan right now, so this isn't a trivial port of the existing work. For sync, the _upload() code path should be easy to shard across multiple parallel threads.
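
To make the sharding idea concrete, here is a minimal sketch of fanning a file list out across worker threads, each holding its own connection to the endpoint. The helpers get_connection() and upload_one() are hypothetical stand-ins, not s3cmd code:

# Minimal sketch: shard uploads across worker threads, one connection per thread.
# get_connection() and upload_one() are hypothetical stubs, not s3cmd internals.
import queue
import threading

NUM_WORKERS = 4

def get_connection(endpoint):
    # Stub: a real implementation would open a dedicated HTTPS connection.
    return endpoint

def upload_one(conn, local_path, remote_uri):
    # Stub: a real implementation would PUT the file over `conn`.
    print("uploading %s -> %s" % (local_path, remote_uri))

def worker(endpoint, work_queue, errors):
    conn = get_connection(endpoint)      # each thread gets its own connection
    while True:
        item = work_queue.get()
        if item is None:                 # sentinel: no more work for this thread
            break
        local_path, remote_uri = item
        try:
            upload_one(conn, local_path, remote_uri)
        except Exception as exc:
            errors.append((local_path, exc))

def parallel_upload(endpoint, file_list):
    work_queue = queue.Queue()
    errors = []
    threads = [threading.Thread(target=worker, args=(endpoint, work_queue, errors))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for pair in file_list:               # (local_path, remote_uri) tuples
        work_queue.put(pair)
    for _ in threads:
        work_queue.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return errors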

@mdomsch mdomsch closed this
@mdomsch mdomsch reopened this
@mludvig
Owner

ConnMan was designed to support exactly that scenario - multiple threads sharing a pool of connections to S3 - so no problem there. But rewriting the core to support threads is a big undertaking to get right (a quick hack to parallelise this or that code path may be easier, but it's not quite what we want). We'll see after the 1.5.0 release whether I can revive some old work done in this space.
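
For illustration only, a thread-safe pool along the lines described here could look like the minimal sketch below; the ConnectionPool class and its methods are hypothetical and do not reflect ConnMan's actual API:

# Minimal sketch of a connection pool shared by several threads.
# ConnectionPool and its methods are hypothetical, not ConnMan's real API.
import queue
import threading

class ConnectionPool:
    def __init__(self, endpoint, size):
        self.endpoint = endpoint
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(self._open_connection())

    def _open_connection(self):
        # Stub: a real pool would open an HTTP(S) connection to self.endpoint.
        return object()

    def acquire(self):
        # Blocks until a connection is free, capping concurrent use at `size`.
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)

def do_requests(pool, n):
    for _ in range(n):
        conn = pool.acquire()
        try:
            pass                         # issue a request over `conn` here
        finally:
            pool.release(conn)

pool = ConnectionPool("s3.amazonaws.com", size=4)
threads = [threading.Thread(target=do_requests, args=(pool, 10)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()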

@jmusits

It would be great to see this implemented in a future release. It would really speed up file transfers to S3 in my scenario.

@jstanden

+1, I've been using pcorliss/s3cmd-modification for many years, but it has diverged from master here since about 2010. We sync incrementals to S3 in nightly batches, and being able to use --parallel --workers=n makes that possible without huge waits. I'd love to have that functionality on top of all the recent improvements.

@ceefour

+1 for this :)

@dannyman

What I have been doing is something like:

cd /target_dir
for d in *; do
  # If a previous run logged errors, make some noise (cron will mail the output)
  egrep -1 'ERROR' "/tmp/sync-$d.log"
  # Test the lock first so we don't truncate the log of a sync that is still running
  /usr/bin/flock -n "/tmp/sync-$d" true && \
    /usr/bin/flock -n "/tmp/sync-$d" s3cmd -v ..args.. sync "/target_dir/$d/" "s3://target_bucket/$d/" > "/tmp/sync-$d.log" 2>&1 &
  sleep 60
done

Something like that can run in a fairly tight cron loop, and it will at least parallelize across the top-level directories. The flock calls ensure that only one sync runs per directory at a time, and the grep flags any errors encountered on a previous run. Crude, but hopefully effective.

Of course, if I could do without a shell loop .. :)
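
For comparison, here is a minimal Python sketch of the same per-directory idea without the shell loop; the paths, bucket name, and worker count are placeholders, and any extra s3cmd options would be added to the command list:

# Minimal sketch: run one `s3cmd sync` subprocess per top-level directory,
# a few at a time. TARGET_DIR, BUCKET and WORKERS are placeholders.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

TARGET_DIR = "/target_dir"
BUCKET = "s3://target_bucket"
WORKERS = 4

def sync_dir(d):
    cmd = ["s3cmd", "-v", "sync",
           os.path.join(TARGET_DIR, d) + "/",
           "%s/%s/" % (BUCKET, d)]
    log = "/tmp/sync-%s.log" % d
    with open(log, "w") as fh:           # one log per directory, like the shell version
        return d, subprocess.call(cmd, stdout=fh, stderr=subprocess.STDOUT)

dirs = [d for d in os.listdir(TARGET_DIR)
        if os.path.isdir(os.path.join(TARGET_DIR, d))]

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for name, rc in pool.map(sync_dir, dirs):
        if rc != 0:
            print("sync of %s exited with status %d" % (name, rc))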

@dieend

Do we still not have this feature? Downloading 20,000 files takes a very long time without parallel support.
