
s3store: Performance improvements #478

Merged
merged 17 commits into v2 from feat/parallel-parts on May 18, 2021
Conversation

Acconut
Member

@Acconut Acconut commented May 16, 2021

This PR attempts to improve the performance (especially the throughput) of uploads using the S3Store. It does this with the following changes:

  • The information retrieved during a GetInfo call is now cached, so that WriteChunk and FinishUpload calls do not have to repeat the calls to list the parts on S3 etc.
  • The calls to S3 for GetInfo are now executed concurrently instead of sequentially to reduce the total time.
  • The UploadPart calls inside WriteChunk are now executed concurrently instead of sequentially. This allows us to better utilize the available throughput to S3. The concurrency is limited by a semaphore so that the system is not drained of resources (see the sketch after this list).
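A minimal sketch of the semaphore-limited concurrency described in the last point, assuming a hypothetical uploadPart helper and a made-up concurrency limit rather than the actual S3Store API (a buffered channel serves as the counting semaphore):

```go
package main

import (
	"fmt"
	"sync"
)

// uploadPart is a hypothetical stand-in for the S3 UploadPart call.
func uploadPart(partNumber int, data []byte) error {
	fmt.Printf("uploading part %d (%d bytes)\n", partNumber, len(data))
	return nil
}

func uploadParts(parts [][]byte) error {
	// A buffered channel acts as a counting semaphore so that at most
	// maxConcurrency uploads are in flight at any time.
	const maxConcurrency = 10
	sem := make(chan struct{}, maxConcurrency)

	var wg sync.WaitGroup
	var mu sync.Mutex
	var firstErr error

	for i, data := range parts {
		sem <- struct{}{} // acquire a slot; blocks once the limit is reached
		wg.Add(1)
		go func(partNumber int, data []byte) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			if err := uploadPart(partNumber, data); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(i+1, data)
	}

	wg.Wait()
	return firstErr
}

func main() {
	parts := [][]byte{make([]byte, 5<<20), make([]byte, 5<<20), make([]byte, 3<<20)}
	if err := uploadParts(parts); err != nil {
		fmt.Println("upload failed:", err)
	}
}
```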

Remaining tasks:

  • Address TODO comments
  • Implement more resilient error handling (in another PR)
  • Make pre-release for testing

⚠️ This PR is best released in a new major version because it changes the behavior of the S3 store quite significantly.

@acj I remember that you spent some time in the past working on the S3Store, so I figured you might be interested in these changes. Let me know if you have any feedback :)

@Acconut Acconut marked this pull request as ready for review May 18, 2021 08:23
@Acconut Acconut changed the base branch from master to v2 May 18, 2021 08:27
@Acconut
Member Author

Acconut commented May 18, 2021

I am going to merge this into a new v2 branch which will be used as the base for working on the next major release.

@Acconut Acconut merged commit 8fd1836 into v2 May 18, 2021
@Acconut Acconut deleted the feat/parallel-parts branch May 18, 2021 08:29
@acj
Contributor

acj commented Jul 6, 2021

@Acconut Sorry for the long delay. The improvements seem reasonable to me, though I'm still concerned (and possibly confused) about what happens if a part fails to upload when we're making API calls in parallel. For example, if your upload has three parts (1, 2, 3) and 1 and 3 succeed but 2 fails, what happens? My understanding was that the offset calculation (e.g. at the bottom of fetchInfo) assumed that all successfully-uploaded parts were consecutive, which isn't true in my example, and that the upload would be broken at that point.
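To illustrate the concern, here is a hypothetical sketch of an offset calculation that simply sums the sizes of all parts S3 lists; the part struct and sizes are invented for the example and do not reflect the actual fetchInfo code:

```go
package main

import "fmt"

// part mirrors, for this sketch only, the information a ListParts call returns.
type part struct {
	number int
	size   int64
}

func main() {
	// Parts 1 and 3 uploaded successfully; part 2 failed and is not listed.
	listed := []part{{number: 1, size: 5 << 20}, {number: 3, size: 5 << 20}}

	// Summing every listed part reports 10 MiB, even though only the first
	// 5 MiB are contiguous, so a client resuming at this offset would leave
	// a hole in the object.
	var offset int64
	for _, p := range listed {
		offset += p.size
	}
	fmt.Println("summed offset:", offset)
}
```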

We ran tusd in production for a while with a similar parallel part upload mechanism, a prototype of the code in #402, and there were occasional random failures that seemed to be caused by failed parts (not certain of this, but it seemed likely). That's why I didn't use parallelism for requests to S3 in #402.

My understanding might not be correct, though, and I'd love to see this improvement in tusd. I hope this context helps. :)

Thanks for your continued work on this project.

@Acconut
Member Author

Acconut commented Jul 9, 2021

Thanks for the feedback. Your concerns are justified :) Right now, such a situation would lead to a corrupted upload, but I think we can solve this in the future by storing a list of usable parts on S3, instead of always fetching a list of all uploaded parts and treating every one of them as usable (i.e. if we upload parts 1, 2, 3 and part 2 fails, only part 1 would be considered usable). That should make sure the upload never ends up in a corrupted state.
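A rough sketch of that future idea, with entirely hypothetical field and function names and the simplifying assumption of a fixed part size: the upload's info object would record only the contiguous prefix of successfully uploaded parts, so a failed middle part never counts toward the offset.

```go
package main

import "fmt"

// FileInfo is a hypothetical stand-in for the upload's info object stored on
// S3; the field names are made up for this sketch, not the current tusd model.
type FileInfo struct {
	ID     string
	Offset int64
	// UsableParts records only the contiguous prefix of successfully
	// uploaded parts, so a failed middle part is never counted.
	UsableParts []int
}

// markUsable extends the usable prefix with newly completed part numbers
// (sorted ascending). A gap, e.g. part 2 failing while part 3 succeeded,
// stops the offset from advancing past the failure. A fixed part size is
// assumed purely to keep the sketch short.
func (fi *FileInfo) markUsable(completed []int, partSize int64) {
	next := len(fi.UsableParts) + 1
	for _, p := range completed {
		if p != next {
			break
		}
		fi.UsableParts = append(fi.UsableParts, p)
		fi.Offset += partSize
		next++
	}
}

func main() {
	fi := &FileInfo{ID: "example-upload"}
	// Parts 1 and 3 finished, part 2 failed: only part 1 becomes usable.
	fi.markUsable([]int{1, 3}, 5<<20)
	fmt.Printf("usable parts: %v, offset: %d\n", fi.UsableParts, fi.Offset)
}
```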

However, I first wanted to measure the real-world impact of this PR, which is why I merged it into the v2 branch: there we can test the pre-release without affecting users on the mainline version. That being said, we haven't experienced problems with uploading parts to S3 in production, so this was not a pressing issue for us.

So, all in all, this should be improved in the future :)
