[v2] Bug fix and optimization in s3transfer's download #9345
base: v2
Conversation
* Bump PyInstaller to 6.11.1

  This includes pinning `macholib` because of this change in `PyInstaller` that makes `macholib` a build dependency. pyinstaller/pyinstaller#5883

  Co-authored-by: Github Actions <>
  Co-authored-by: Steve Yoo <hssyoo@amazon.com>
  Co-authored-by: Alex Shovlin <shovlia@amazon.com>
Is it worth running this new benchmark continuously?
Mostly asking for ergonomics, since GitHub won't even show the file diff now. If we don't want to run it as part of the new suite, I'm wondering whether it would be better to have a separate benchmarks-s3-advanced.json (or whatever naming).
@ashovlin Agreed that would be a good idea; I iterated on this feedback in the latest revision.
Issue #, if available:
Context
This PR addresses a bug that occurs when an IncompleteReadError is encountered while downloading a multipart AWS S3 object to a non-seekable stream (e.g. stdout). In this case, all bytes of the object with an offset larger than the offset of the incomplete read are queued in a buffer, and the CLI terminates before the buffer is flushed. As a result, (1) the object is not fully written to the stream, and (2) the CLI's memory usage grows linearly with the number of object bytes past the offset of the incomplete read, which means most of the object may be held in memory no matter how large the object is.
The changes to s3transfer in this PR only impact users downloading multipart AWS S3 objects to a non-seekable stream (e.g. stdout).
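To make the failure mode concrete, the sketch below illustrates the deferred-write pattern used when writing out-of-order parts to a non-seekable stream, along with the two behavioral changes this PR describes: overwriting a smaller pending write at the same offset, and queueing only the not-yet-written tail of a retried part. This is a hypothetical illustration, not s3transfer's actual DeferQueue implementation.

```python
class DeferQueue:
    """Hypothetical sketch of a defer queue for in-order writes to a
    non-seekable stream. Parts may arrive out of order; a part can only
    be flushed once all bytes before it have been written."""

    def __init__(self):
        self._writes = {}       # offset -> pending bytes
        self._next_offset = 0   # next byte offset the stream expects

    def request_writes(self, offset, data):
        """Queue (offset, data) and return the writes that are now
        flushable in order, as a list of (offset, bytes) tuples."""
        if offset + len(data) <= self._next_offset:
            # Entirely already written (e.g. a duplicate retry): drop it.
            return []
        if offset < self._next_offset:
            # A retry overlapping bytes already flushed: keep only the
            # tail that has not been written yet.
            data = data[self._next_offset - offset:]
            offset = self._next_offset
        pending = self._writes.get(offset)
        # Overwrite a pending write at the same offset only when the new
        # request carries more data, so a retried full part replaces a
        # previously queued incomplete part.
        if pending is None or len(data) > len(pending):
            self._writes[offset] = data
        # Drain every write that is now contiguous with the stream.
        flushable = []
        while self._next_offset in self._writes:
            chunk = self._writes.pop(self._next_offset)
            flushable.append((self._next_offset, chunk))
            self._next_offset += len(chunk)
        return flushable
```

For example, if an incomplete part at offset 3 (`b"d"`) is flushed and the retry later arrives with the full part (`b"def"`), only the unwritten tail `b"ef"` is queued, and the download drains correctly without buffering the rest of the object indefinitely.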
Description of changes
s3transfer
* Fixed a bug that occurs when an IncompleteReadError is encountered while downloading a multipart AWS S3 object to a non-seekable stream (e.g. stdout).
* Updated DeferQueue so that it will overwrite pending writes to the same offset with whichever write request has the most data. So, when the request with the incomplete read is retried, it will overwrite the incomplete data in the queue, assuming the retry contains more data (e.g. the full part).
* Updated DeferQueue so that if it has already dequeued an incomplete part, it will queue the subset of the bytes in the retry that were not previously dequeued.
Performance Benchmark Scripts
* Added two benchmark definitions to benchmarks.json: one simulates a standard multipart download of a 10GB file (1,192 parts), and the other simulates the same download except an IncompleteReadError occurs (with retry) on the 32nd part.
* Added a --debug-dir <path> option to make benchmark-definition debugging easier. When this path is specified, all output of the child (benchmarked) processes gets written to files in this directory.
Description of tests
Correctness
* Added unit tests for the new behavior of DeferQueue. Also updated names/documentation of existing related tests.
* Without the changes to DeferQueue contained in this PR, I verified that the full object does NOT successfully get written.
Performance Testing
* Ran the benchmarks with and without the changes to DeferQueue contained in this PR and observed significant improvements in memory usage and execution time. The results are summarized below.
Summary
Raw Data
Each cell in the table below is the average value of the metric taken over 5 iterations.
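For reference, averaging each metric across iterations can be sketched in a few lines. The field names below are hypothetical; the actual benchmark output schema may differ.

```python
from statistics import mean

def summarize(iterations, metrics=("total_time_secs", "max_memory_bytes")):
    """Average each metric across benchmark iterations.

    `iterations` is a list of per-run dicts, e.g.
    [{"total_time_secs": 40.0, "max_memory_bytes": 500.0}, ...]
    (hypothetical schema). Returns one dict of per-metric averages,
    corresponding to a single cell per metric in the table above.
    """
    return {m: mean(run[m] for run in iterations) for m in metrics}
```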
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.