
Introduce prototype for upload API, based on S3 #5

Merged
rohanpm merged 3 commits into release-engineering:master from s3-test on Jul 27, 2020

Conversation

rohanpm
Member

@rohanpm rohanpm commented Jul 22, 2020

This API can be used to upload objects.
Rather than inventing our own API for this, let's use a subset of
the true S3 API. This allows usage of existing clients (such as
boto) to upload content through exodus-gw. The examples/s3-upload
script can be used to demonstrate this.

This implementation is prototype-level and comes with a TODO list.
Some major missing pieces are:

  • performance testing to verify that this doesn't perform notably
    worse than using S3 directly
  • a proper way to handle configuration (it currently allows the boto
    library to use whatever config is discovered by default, meaning
    only a single environment is supported)

This commit also experiments a bit more on how to document these
APIs. There is now a separate API user guide and reference, as it's
not clear if there's a decent way of including general usage info
into the redoc-hosted reference.
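
As a rough illustration of the intent (a sketch, not code from this PR; the endpoint URL and bucket name below are placeholders), an existing boto3 client can simply be pointed at exodus-gw as if it were S3:

# Hypothetical usage sketch: an off-the-shelf boto3 client pointed at
# exodus-gw instead of real S3. The endpoint URL and bucket name are
# placeholders, not values defined by this PR.
import boto3

s3 = boto3.client("s3", endpoint_url="https://exodus-gw.example.com")

# Any existing S3 upload code path then works unchanged:
s3.put_object(Bucket="my-bucket", Key="some/key", Body=b"hello, exodus-gw")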

@rohanpm
Member Author

rohanpm commented Jul 22, 2020

@nathanegillett

This isn't ready for proper review yet; I need other PRs to land first. But I've pushed it so you can have a look in case you're also prototyping APIs.

However, the way things are done here isn't always a great example to follow: it's making the API compatible with S3, which means it doesn't follow the preferred conventions of FastAPI.

@rohanpm
Member Author

rohanpm commented Jul 23, 2020

Manually published docs at https://rohanpm.github.io/exodus-gw/ so they can be reviewed.

@rohanpm rohanpm marked this pull request as ready for review July 23, 2020 03:32
@rohanpm rohanpm requested a review from negillett as a code owner July 23, 2020 03:32
@dichn
Collaborator

dichn commented Jul 23, 2020

@rohanpm In order to make exodus-gw access localstack (for integration testing), I made some changes like this:

async def object_put(bucket: str, key: str, request: Request):
     # Single-part upload handler: entire object is written via one PUT.
     body = await request.body()
 
-    async with aioboto3.client("s3") as s3:
+    async with aioboto3.client("s3", endpoint_url='http://localhost:4572') as s3:
         response = await s3.put_object(Bucket=bucket, Key=key, Body=body)
 
     return Response(headers={"ETag": response["ETag"]})

It works with the endpoint_url argument added. With the original version, it's kind of difficult to configure the integration test.
Should this be in scope for this PR?
Everything else looks good to me.

Previously, all request bodies were awaited in full before starting
any requests to S3. That's unnecessarily slow and also means that
all data from the user's request has to be held in memory.

Fix it up so that requests can be streamed properly (which requires
a small tweak to make botocore and aiobotocore work together).

Note that this does require the caller always to provide the
Content-MD5 and Content-Length headers. The higher-level boto
methods such as upload_file already do this by default, so this
shouldn't be cumbersome (and we probably would have required
Content-MD5 by policy anyway).

In my test environment with 20MB chunk size and 10 threads,
this reduced the typical overhead of using exodus-gw for uploads
from about 10-20% to essentially zero.
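
A minimal sketch of the streaming idea (a hypothetical adapter; the actual tweak in this commit may differ): wrap the incoming request stream in an object exposing an async read method and hand it to put_object, passing the length and MD5 from the caller's headers.

import aioboto3
from fastapi import Request, Response

class RequestReader:
    # Hypothetical adapter: exposes the streaming request body via the
    # read() interface expected by the S3 client, so chunks are
    # forwarded as they arrive rather than buffered in memory first.
    def __init__(self, request: Request):
        self._chunks = request.stream().__aiter__()

    async def read(self, size: int = -1) -> bytes:
        try:
            return await self._chunks.__anext__()
        except StopAsyncIteration:
            return b""

async def object_put(bucket: str, key: str, request: Request):
    async with aioboto3.client("s3") as s3:
        response = await s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=RequestReader(request),
            # The caller must always provide these headers.
            ContentMD5=request.headers["Content-MD5"],
            ContentLength=int(request.headers["Content-Length"]),
        )
    return Response(headers={"ETag": response["ETag"]})
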
To assist with testing, make it possible to override the default
S3 endpoint URL by setting EXODUS_GW_S3_ENDPOINT_URL. It can be
used to (e.g.) test against localstack.

Note that, at some point, we expect to support multiple target
environments and have some method of managing their configuration;
most likely this env var will be dropped at that time.
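
One way the override might be wired up (a sketch only; the exact wiring in this commit may differ):

import os
import aioboto3

def s3_client():
    # When EXODUS_GW_S3_ENDPOINT_URL is unset, endpoint_url is None and
    # boto discovers the real S3 endpoint as usual; when set (e.g. to a
    # localstack URL such as http://localhost:4572), all S3 traffic is
    # redirected there.
    return aioboto3.client(
        "s3",
        endpoint_url=os.environ.get("EXODUS_GW_S3_ENDPOINT_URL"),
    )

# usage: async with s3_client() as s3: ...
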
@rohanpm
Member Author

rohanpm commented Jul 24, 2020

Added two commits just now:

  • enable streaming of requests, after I figured out how to get that working. Performance now seems to be roughly on par with using S3 directly, though I've seen some unexplained spikes at times.
  • added an (undocumented) environment variable EXODUS_GW_S3_ENDPOINT_URL which can be used for the localstack testing case mentioned by @dichen16.

OK to merge this? I think it'd be good to get something in here that others can start building on, even if the API is later expected to change.

@lmilbaum
Contributor

@rohanpm This is a large PR, in the sense that it contains a lot of code changes. Is it possible to break it into small, atomic PRs?

@rohanpm
Member Author

rohanpm commented Jul 26, 2020

@lioramilbaum

Is it possible to break it into small, atomic PRs?

In general I'm very much in support of that, and I do tend to make small PRs with focused goals, but in a case like this (early development of a new project) I'm struggling to find an efficient workflow for it. Maybe I'm missing some option or tool in GitHub, but I don't see any way to make one pull request depend on another.

Taking this PR as an example, one way I could have split it up is:

  1. tweaks to the publishing of docs
  2. initial implementation of upload APIs
  3. add env var to make endpoint configurable
  4. support streaming requests

The problem is the dependencies between them: (2) arguably depends on (1), and (3) and (4) depend on (2).
I can make those separate PRs, but if (for example) PR4 includes some commits from PR1 and PR2, then the UI is going to show all the changes from PR1 and PR2 again in PR4, with no hint to reviewers that those changes will be added by some other PR and should be ignored here. Reviewers can be told in a comment "please note that this depends on PRs 1 and 2, so ignore the changes introduced by those", but I doubt it's practical for reviewers to do that when the UI is not helping them. PR4 wouldn't appear little and atomic unless the previous PRs were merged first.

A simple way to fix this, at the cost of time, is to wait for review of each PR to be completed before asking for review on the next one... but that's extremely inefficient. From my point of view, every review takes at least overnight (or longer if it doesn't pass on the first attempt). That means making one bigger PR has a chance of getting merged in a day or two, while splitting it into several smaller, dependent PRs would need at least a full week to get them all merged even if everything goes perfectly.

I'm interested in solutions to this problem (and not only for this project), so if you have some experience or suggestions on how to manage dependencies between PRs please let me know.

@lmilbaum
Contributor

@rohanpm You have raised great discussion points. Let's continue the discussion over an email thread or, even better, a meeting. For this PR specifically, I do not (yet) have the tools to provide an efficient code review.

@rohanpm rohanpm merged commit ac273c3 into release-engineering:master Jul 27, 2020
@rohanpm rohanpm deleted the s3-test branch July 27, 2020 23:04