A starting point could be this nice fork by pcorliss, who added parallel workers:
In my case, this speeds up my uploads by about 50x.
My use case is uploading 20,000,000 small files across several S3 locations (Pearltrees.com assets).
I also forked pcorliss's modifications to add parallel workers to the cp and mv commands:
As you are currently working on the 1.1 release, even though the code has diverged since the fork, it would be very nice to resolve the conflicts and merge this great work.
Just moved my fork to :
I'd like to see this in s3cmd master. Hosting a static website with lots of small files (.css, .js) on S3 is a common use case. So +1.
+1, would love to have this when uploading lots of files to s3
This would be really helpful in my work as a system administrator.
Is this feature on the roadmap? Our uploads are very slow due to a large number of files. I can't get Pearltrees' fork to work reliably. What about a bounty to finish this?
Also add my vote for some version of this. It'd be great to have parallelization. In the meantime, we're going to have to use Pearltrees's fork.
I took a quick look at this. We would need to extend the (new) connection reuse (ConnMan) that's in 1.5.0-alpha3 to be able to have multiple connections per endpoint, one per worker thread. I don't see that in ConnMan right now. So this isn't a trivial port of the existing work. For sync, the _upload() code path should be easily shardable across multiple parallel threads.
ConnMan was designed to support that exact scenario - multiple threads sharing a pool of connections to S3. So no problem there. But the rewrite of the core to support threads is a big undertaking to make it right (indeed a quick hack to parallelise this or that code path may be easier but not quite what we want). Will see after 1.5.0 release if I can revive some old work done in this space.
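For illustration, here is a minimal sketch of what sharding the upload path across worker threads might look like, with each thread holding its own connection. Every name here (`get_connection`, `upload_one`, the placeholder connection object) is a hypothetical stand-in, not s3cmd's actual ConnMan or `_upload()` API:

```python
# Hedged sketch: shard uploads across a thread pool, one "connection" per
# worker thread. In a real implementation the per-thread object would be a
# reusable HTTP connection to the S3 endpoint handed out by the pool.
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def get_connection():
    # Lazily create one connection per worker thread, so threads never
    # contend for a shared connection.
    if not hasattr(_local, "conn"):
        _local.conn = object()  # placeholder for a real S3 connection
    return _local.conn

def upload_one(key):
    conn = get_connection()
    return (key, id(conn))  # pretend upload; record which conn served it

keys = ["file%03d" % i for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(upload_one, keys))

print(len(results))  # all 100 "uploads" completed
```

The per-thread connection avoids locking around a single shared connection, which is the main thing a thread-aware ConnMan would need to provide.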
It would be great to see this implemented in a future release. It would really speed up file transfers to S3 in my scenario.
+1, I've been using pcorliss/s3cmd-modification for many years, but it hasn't merged with master here since about 2010. We sync incrementals to S3 in nightly batches, and being able to use --parallel --workers=n makes that possible without huge waits. I'd love to have that functionality on top of all the recent improvements.
+1 for this :)
What I have been doing is something like:
for d in *; do
  # If a previous run logged errors, make some noise (print the log name)
  egrep -l 'ERROR' /tmp/sync-$d.log
  # Test the lock before truncating the log file
  /usr/bin/flock -n /tmp/sync-$d true && /usr/bin/flock -n /tmp/sync-$d s3cmd -v ..args.. sync /target_dir/$d/ s3://target_bucket/$d/ > /tmp/sync-$d.log 2>&1 &
done
Something like that can run in a fairly tight cron loop and it will at least parallelize the directories in the top. The flock stuff is to keep you only syncing a given directory one at a time, and the grep is to notify you that errors have been encountered on a previous run. Crude, but hopefully effective.
Of course, if I could do without a shell loop .. :)
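One way to drop the hand-rolled loop is to let xargs -P do the fan-out. A sketch (echo is prepended so it dry-runs anywhere; remove it and substitute your own directories and bucket to sync for real):

```shell
# Fan per-directory syncs out to 4 parallel workers with xargs.
# 'echo' makes this a dry run; drop it to execute the real s3cmd command.
printf '%s\n' dir1 dir2 dir3 \
  | xargs -P4 -I{} echo s3cmd sync /target_dir/{}/ s3://target_bucket/{}/
```

This gives the same per-top-level-directory parallelism as the cron loop, without the flock bookkeeping (though it also drops the error-log check).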
Do we still not have this feature? Downloading 20,000 files takes a very long time because there is no parallel capability.
(s4cmd https://github.com/bloomreach/s4cmd is often MUCH faster due to concurrent workers... but it lacks other features that s3cmd has...)
Parallel workers are urgent for us; they would save a lot of time.
For those looking here: Although other tools don't exactly replicate everything from s3cmd, the best solution for anyone handling lots of files at once is probably to use s4cmd or aws-cli (discussed here), both of which have parallel workers.
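For reference, aws-cli's parallelism can be tuned through its S3 transfer settings in the shared config file. The values below are illustrative starting points to experiment with, not recommendations:

```ini
# ~/.aws/config — aws-cli S3 transfer tuning (example values)
[default]
s3 =
    max_concurrent_requests = 20
    max_queue_size = 1000
    multipart_threshold = 64MB
    multipart_chunksize = 16MB
```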
Thanks, @jlevy ... do you know if aws-cli now supports large file transfers? The reason I needed to switch to s3cmd was to chunk large files.
@dannyman yes, see the article linked in my last comment; it handles multipart.
Slight digression, but would be interested to see a benchmark of s4cmd vs aws-cli for common scenarios like in that blog — anyone want to run those or similar tests with both? 😃 There is some uncertainty around perf for all these tools and a good benchmark would help. Cc @chouhanyang and bloomreach/s4cmd#72
From what I can tell, vis-à-vis sync:
In my non-scientific testing thus far, it would appear that aws-cli is faster for the task of keeping things in sync. The upload seemed dang zippy as well. In experimenting with s4cmd I had to hot-patch a bug whereby it throws an error on zero-length files as unreadable. Some have noted that for a sufficiently large file hierarchy sync, s4cmd can run out of RAM.
My main source of woe is having written my backup scripts two years ago, when efficiently syncing vast directories (80TB+) which may contain large files was especially painful. With any luck I can now boil it down to a single aws-cli command, or at least a single shell loop that can be run serially. :) Thanksx10^6, @jlevy !!
Update: running aws-cli over the entire 80TB directory ... it is still indexing files but it has taken action along the way. I have yet to see another tool actually get syncs running over a partial index. At least, that is what I infer from this message:
Completed 64573 part(s) with ... file(s) remaining
Color me extremely impressed with aws-cli
👍 parallel uploads are extremely useful. In our case, using a top-tier C4 instance, we were able to push thousands of files per second up to S3 using https://github.com/mishudark/s3-parallel-put, so it would be nice to see this supported in s3cmd.
same here - is there any progress?? :D