Parallel multipart upload and download support for s3cmd #11

Closed
t3rm1n4l wants to merge 16 commits into s3tools/s3cmd from t3rm1n4l/s3cmd master

@t3rm1n4l
t3rm1n4l commented Nov 6, 2011

I have developed parallel multipart upload and download support for s3cmd.

We can enable parallel upload and download using the following configuration options (an example configuration file snippet follows below):
parallel_multipart_download = True
parallel_multipart_upload = True
parallel_multipart_download_threads = 5
parallel_multipart_upload_threads = 5
parallel_multipart_download_count = 5
parallel_multipart_upload_count = 5
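For reference, this is roughly how the options would sit in the s3cmd configuration file (usually ~/.s3cfg); a minimal sketch, assuming they live under the standard [default] section next to the existing settings, with the example values from above:

    [default]
    access_key = <your access key>
    secret_key = <your secret key>
    # options added by this patch
    parallel_multipart_upload = True
    parallel_multipart_upload_threads = 5
    parallel_multipart_upload_count = 5
    parallel_multipart_download = True
    parallel_multipart_download_threads = 5
    parallel_multipart_download_count = 5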

Please merge my changes into s3cmd.

I will look into your review comments.

Thanks

Sarath Lakshman
http://www.sarathlakshman.com

t3rm1n4l added some commits Oct 19, 2011
@t3rm1n4l t3rm1n4l Adding changes to `recv_file` to support partial file download by specifying start-position and end-position in bytes
a8f80ea
@t3rm1n4l t3rm1n4l Added `concat_files()` function
A function that takes a destination file handle and a list of source file handles, concatenates the source files' data, and writes it into the destination file
@t3rm1n4l t3rm1n4l Added `object_multipart_get()` function
* object_multipart_get() - Downloads a file from S3 by fetching multiple split parts in parallel using a worker thread pool, merging the split files, and performing md5 checksum verification
* Added new Config parameters to specify the number of worker threads and the file split count, and to toggle parallel split download on and off
8772c9a
@t3rm1n4l t3rm1n4l Added changes to s3cmd for switching parallel split download on/off based on the configuration file
7bacce0
@t3rm1n4l t3rm1n4l Cleanup handlers for temporary files and a disk usage improvement to concat_files() so that split files are removed as soon as their data is read.
e515259
@t3rm1n4l t3rm1n4l Changed threading.active_count() to threading.activeCount() for backward compatibility with python 2.4
16c3db4
@t3rm1n4l t3rm1n4l Added parameter signing support to the auth signature calculation method sign()
abf8a6b
@t3rm1n4l t3rm1n4l Added multipart upload support. Multipart upload can be enabled by adding parallel_multipart_upload = True to the config file
14d3015
@t3rm1n4l t3rm1n4l Replaced email.Util.formatdate() with the time module to ensure compatibility with python 2.4
9f3003c
@t3rm1n4l t3rm1n4l FIX: Correct split numbering and uneven file split issue 077bf4c
@t3rm1n4l t3rm1n4l s3cmd info - fix to show the correct md5sum for a multipart-uploaded file, based on the custom md5 meta header
bd70d2b
@t3rm1n4l t3rm1n4l Fix for python 2.4 daemon threads: use setDaemon(True) in place of daemon=True
942b19f
@t3rm1n4l t3rm1n4l Disable progress bar for multipart upload b5f6590
@t3rm1n4l t3rm1n4l Added meta-based md5 verification to the non-parallel downloader for files uploaded with multipart upload
2fb607a
@t3rm1n4l t3rm1n4l Added exit_status for the s3cmd program
s3cmd does not return meaningful exit status codes, so callers cannot tell whether the program succeeded or failed (or why it failed).
This commit adds exit statuses for the s3cmd sync upload, sync download, get and put operations.
Exit codes are as follows: SIZE_MISMATCH=1, MD5_MISMATCH=2, RETRIES_EXCEEDED=3, UPLOAD_ABORT=4, MD5_META_NOTFOUND=5, KEYBOARD_INTERRUPT=6
b30fd20
@t3rm1n4l t3rm1n4l Added separate config parameters for thread limit and split count with respect to download and upload

Added download config parameters: parallel_multipart_download_threads (thread count) and parallel_multipart_download_count (split count)
Added upload config parameters: parallel_multipart_upload_threads (thread count) and parallel_multipart_upload_count (split count)
be7845f
@mludvig
Contributor
mludvig commented Nov 6, 2011

Hi Sarath

Thanks for that. I'm now on holiday, will have a look at your changes when I get back in a couple of weeks.

Cheers

Michal

On 7/11/2011, at 5:26, Sarath Lakshman reply@reply.github.com wrote:

You can merge this Pull Request by running:

git pull https://github.com/t3rm1n4l/s3cmd master

Or you can view, comment on it, or merge it online at:

#11

-- File Changes --

M S3/Config.py (6)
M S3/S3.py (390)
M S3/Utils.py (24)
M s3cmd (65)

-- Patch Links --

https://github.com/s3tools/s3cmd/pull/11.patch
https://github.com/s3tools/s3cmd/pull/11.diff


@t3rm1n4l
t3rm1n4l commented Nov 7, 2011

Hi Michal,

> Thanks for that. I'm now on holiday, will have a look at your changes when I get back in a couple of weeks.

Thanks for the reply.
Looking forward to the code merge.

Happy Hacking,
Sarath Lakshman
http://www.sarathlakshman.com


@tommeier

Looking forward to this :)

It may be worth looking into using something like http://www.gnu.org/s/parallel/ as well to do the legwork.

@t3rm1n4l
t3rm1n4l commented Dec 4, 2011

@Michal May I know the status of the merge?

@colinhowe

I'm just trying this out. I noticed that this is only implemented in the `sync` command. Any chance of getting it added to `put`?

@colinhowe colinhowe commented on the diff Dec 14, 2011
@@ -309,6 +336,156 @@ def website_delete(self, uri, bucket_location = None):
return response
+ def object_multipart_upload(self, filename, uri, cfg, extra_headers = None, extra_label = ""):
+ if uri.type != "s3":
+ raise ValueError("Expected URI type 's3', got '%s'" % uri.type)
+
+ if not os.path.isfile(filename):
+ raise InvalidFileError(u"%s is not a regular file" % unicodise(filename))
+ try:
+ file = open(filename, "rb")
+ file_size = os.stat(filename)[ST_SIZE]
+ except (IOError, OSError), e:
+ raise InvalidFileError(u"%s: %s" % (unicodise(filename), e.strerror))
+
+ parts_size = file_size / cfg.parallel_multipart_upload_count
@colinhowe
colinhowe Dec 14, 2011

I think the part size should be configurable instead of the number of parts. Ideally I'd like to be able to upload a 20 GB file in 100 MB chunks using 5 threads...
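A minimal sketch of what chunk-size-driven splitting could look like; the chunk_size_mb and max_threads names here are hypothetical options, not part of this patch. (S3 itself caps a multipart upload at 10,000 parts and requires at least 5 MB per part except the last, so the chunk size also bounds the maximum file size.)

    import math

    def plan_parts(file_size, chunk_size_mb=100, max_threads=5):
        """Derive the part layout from a fixed chunk size instead of a fixed part count."""
        chunk_size = chunk_size_mb * 1024 * 1024
        num_parts = max(1, int(math.ceil(file_size / float(chunk_size))))
        # (offset, length) for each part; the last part may be shorter than chunk_size
        parts = [(i * chunk_size, min(chunk_size, file_size - i * chunk_size))
                 for i in range(num_parts)]
        return parts, min(max_threads, num_parts)

    # e.g. a 20 GB file in 100 MB chunks uploaded by 5 worker threads
    parts, threads = plan_parts(20 * 1024 ** 3)
    print(len(parts), threads)   # 205 parts, 5 threads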

@sylvinus

+1 would love to see this in

@revolunet

+1 too :)

@mludvig
Contributor
mludvig commented Dec 30, 2011

Hi Sarath

finally got time to have a look at your code. I've got a couple of comments...

  • The configuration should be as simple as max_threads=5 (or =0 or =1 to force a non-parallel upload); IMO there is no need for separate download/upload thread counts and a separate yes/no toggle for multipart.
  • As @colinhowe pointed out, the config should specify an upload chunk size, say 10MB by default, instead of the number of parts.
  • Both the number of threads and the chunk size config options should have command line parameters.
  • All the changes to the upload logic should have been made in S3.object_put() - that is where the decision should be made between multipart upload (file_size > 1.5 * chunk_size) and the classic all-at-once upload (file_size <= 1.5 * chunk_size). That will make it work for both sync and put at the same time.
  • The multipart logic should be more structured - don't be afraid to create one function for initiating the upload, one for uploading a single part and one for finalising the upload, preferably without copy&pasting the existing object_put(), and obviously one controller to manage the workers (a rough sketch of this structure follows after this list).
  • Is there any way to get away without md5'ing the whole file (potentially multi-GB in size) prior to the upload? That may delay the upload significantly.
  • A new command for listing all multipart uploads in progress and removing aborted ones may be handy. Resuming an interrupted upload would be a killer feature too (not sure if it's possible as I haven't studied the multipart S3 docs very closely yet).
  • Concentrate on multipart upload first; once that is done let's work on multipart download. Please :) I'm not pulling in such a big chunk of code at once.
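A rough sketch of that structure, assuming the decision lives in S3.object_put() (use multipart when file_size > 1.5 * chunk_size and more than one thread is allowed, otherwise the existing single-request upload). None of these names are real s3cmd APIs; the actual S3 calls - initiate, upload a part, complete - are left as callables supplied by the caller:

    import threading
    try:
        import Queue as queue    # Python 2
    except ImportError:
        import queue             # Python 3

    def multipart_upload(file_size, chunk_size, max_threads,
                         initiate, upload_part, complete):
        """Controller: initiate() -> upload_id; upload_part(upload_id, part_no, offset, length)
        -> etag; complete(upload_id, [(part_no, etag), ...]) finalises the upload on S3."""
        upload_id = initiate()

        # Build a queue of (part_number, offset, length); part numbers are 1-based.
        work = queue.Queue()
        part_no, offset = 1, 0
        while offset < file_size:
            work.put((part_no, offset, min(chunk_size, file_size - offset)))
            part_no, offset = part_no + 1, offset + chunk_size

        etags, lock = [], threading.Lock()

        def worker():
            # Each worker drains the queue; real code would also need per-part
            # retries and an abort path on failure.
            while True:
                try:
                    no, off, length = work.get_nowait()
                except queue.Empty:
                    return
                etag = upload_part(upload_id, no, off, length)
                with lock:
                    etags.append((no, etag))

        threads = [threading.Thread(target=worker) for _ in range(min(max_threads, part_no - 1))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # S3 requires the part list in ascending part-number order when completing.
        complete(upload_id, sorted(etags))

With this shape, object_put() only needs the size check and the chunk size option, and both sync and put go through it unchanged.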

Sorry for so many comments! :)

Michal

@mludvig mludvig closed this Dec 30, 2011
@mludvig
Contributor
mludvig commented Dec 30, 2011

By the way, I have just revived some old multipart/threading work done by Adys and merged it into the s3tools/s3cmd@adys-threading-multipart branch. It's not ideal but it is close to being acceptable. If you decide to work on this feature again I suggest you base it on that code. Thanks!

@GeoffreyPlitt

+1 would love @t3rm1n4l to finish what he started :)

@mezzin
mezzin commented Aug 28, 2012

Any status updates on parallel uploading? We have a lot of syncs running with s3cmd, and this would speed up the process dramatically.

@ghost
ghost commented Oct 12, 2012

That's also what we need.

@Taytay
Taytay commented Nov 17, 2012

In my Googling, I came across this fork of s3cmd that supports parallel sync:
https://github.com/pearltrees/s3cmd-modification

In my tests, using 100 workers to upload a LOT of little files sped things up by orders of magnitude. (I haven't played around with it enough to know what a truly good worker count is.)

@ramilexe

Any news?
