
Introduce prototype for upload API, based on S3 #5

Merged
rohanpm merged 3 commits into release-engineering:master from s3-test on Jul 27, 2020

Conversation

rohanpm
Member

@rohanpm rohanpm commented Jul 22, 2020

This API can be used to upload objects.
Rather than inventing our own API for this, let's use a subset of
the true S3 API. This allows usage of existing clients (such as
boto) to upload content through exodus-gw. The examples/s3-upload
script can be used to demonstrate this.

This implementation is prototype-level and comes with a TODO list.
Some major missing pieces are:

  • performance testing to verify that this doesn't perform notably
    worse than using S3 directly
  • a proper way to handle configuration (it currently allows the boto
    library to use whatever config is discovered by default, meaning
    only a single environment is supported)

This commit also experiments a bit more on how to document these
APIs. There is now a separate API user guide and reference, as it's
not clear if there's a decent way of including general usage info
into the redoc-hosted reference.
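
As a rough illustration of the intent (a sketch, not code from this PR; the endpoint URL and bucket name below are placeholders), an existing boto3 client can simply be pointed at exodus-gw as if it were S3:

# Hypothetical usage sketch: an off-the-shelf boto3 client pointed at
# exodus-gw instead of real S3. The endpoint URL and bucket name are
# placeholders, not values defined by this PR.
import boto3

s3 = boto3.client("s3", endpoint_url="https://exodus-gw.example.com")

# Any existing S3 upload code path then works unchanged:
s3.put_object(Bucket="my-bucket", Key="some/key", Body=b"hello, exodus-gw")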

@rohanpm
Member Author

rohanpm commented Jul 22, 2020

@nathanegillett

This isn't ready for proper review yet; I need other PRs to land first. But I've pushed it so you can have a look in case you're also prototyping APIs.

However, the way things are done here isn't always a great example to follow: it's making the API compatible with S3, which means it doesn't follow the preferred conventions of FastAPI.

@rohanpm
Member Author

rohanpm commented Jul 23, 2020

Manually published docs at https://rohanpm.github.io/exodus-gw/ so they can be reviewed.

@rohanpm rohanpm marked this pull request as ready for review July 23, 2020 03:32
@rohanpm rohanpm requested a review from negillett as a code owner July 23, 2020 03:32
@dichn
Collaborator

dichn commented Jul 23, 2020

@rohanpm In order to make exodus-gw access localstack (for integration testing), I made some changes like this:

async def object_put(bucket: str, key: str, request: Request):
     # Single-part upload handler: entire object is written via one PUT.
     body = await request.body()
 
-    async with aioboto3.client("s3") as s3:
+    async with aioboto3.client("s3", endpoint_url='http://localhost:4572') as s3:
         response = await s3.put_object(Bucket=bucket, Key=key, Body=body)
 
     return Response(headers={"ETag": response["ETag"]})

It works with the endpoint_url argument added. With the original version, it's kind of difficult to configure the integration test.
Should this be in scope for this PR?
Everything else looks good to me.

Previously, all request bodies were awaited in full before starting
any requests to S3. That's unnecessarily slow and also means that
all data from the user's request has to be held in memory.

Fix it up so that requests can be streamed properly (which requires
a small tweak to make botocore and aiobotocore work together).

Note that this does require the caller always to provide the
Content-MD5 and Content-Length headers. The higher-level boto
methods such as upload_file already do this by default, so this
shouldn't be cumbersome (and we probably would have required
Content-MD5 by policy anyway).

In my test environment with 20MB chunk size and 10 threads,
this reduced the typical overhead of using exodus-gw for uploads
from about 10-20% to essentially zero.
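
A minimal sketch of the streaming idea (a hypothetical adapter; the actual tweak in this commit may differ): wrap the incoming request stream in an object exposing an async read method and hand it to put_object, passing the length and MD5 from the caller's headers.

import aioboto3
from fastapi import Request, Response

class RequestReader:
    # Hypothetical adapter: exposes the streaming request body via the
    # read() interface expected by the S3 client, so chunks are
    # forwarded as they arrive rather than buffered in memory first.
    def __init__(self, request: Request):
        self._chunks = request.stream().__aiter__()

    async def read(self, size: int = -1) -> bytes:
        try:
            return await self._chunks.__anext__()
        except StopAsyncIteration:
            return b""

async def object_put(bucket: str, key: str, request: Request):
    async with aioboto3.client("s3") as s3:
        response = await s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=RequestReader(request),
            # The caller must always provide these headers.
            ContentMD5=request.headers["Content-MD5"],
            ContentLength=int(request.headers["Content-Length"]),
        )
    return Response(headers={"ETag": response["ETag"]})
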
To assist with testing, make it possible to override the default
S3 endpoint URL by setting EXODUS_GW_S3_ENDPOINT_URL. It can be
used to (e.g.) test against localstack.

Note that, at some point, we expect to support multiple target
environments and have some method of managing their configuration;
most likely this env var will be dropped at that time.
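
One way the override might be wired up (a sketch only; the exact wiring in this commit may differ):

import os
import aioboto3

def s3_client():
    # When EXODUS_GW_S3_ENDPOINT_URL is unset, endpoint_url is None and
    # boto discovers the real S3 endpoint as usual; when set (e.g. to a
    # localstack URL such as http://localhost:4572), all S3 traffic is
    # redirected there.
    return aioboto3.client(
        "s3",
        endpoint_url=os.environ.get("EXODUS_GW_S3_ENDPOINT_URL"),
    )

# usage: async with s3_client() as s3: ...
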
@rohanpm
Member Author

rohanpm commented Jul 24, 2020

Added two commits just now:

  • enable streaming of requests, after I figured out how to get that working. Performance now seems to be roughly on par with using S3 directly, though I've seen some unexplained spikes at times.
  • added an (undocumented) environment variable EXODUS_GW_S3_ENDPOINT_URL which can be used for the localstack testing case mentioned by @dichen16.

OK to merge this? I think it'd be good to get something in here that others can start building on, even if the API is later expected to change.

@lmilbaum
Contributor

@rohanpm This is a large PR, in the sense that it contains a lot of code changes. Is it possible to break it into small, atomic PRs?

@rohanpm
Member Author

rohanpm commented Jul 26, 2020

@lioramilbaum

Is it possible to break it into small, atomic PRs?

In general I'm very much in support of that, and I do tend to make small PRs with focused goals, but in a case like this (early development of a new project) I'm struggling to find an efficient workflow for it. Maybe I'm missing some option or tool in GitHub, but I don't see any way to make one pull request depend on another.

Taking this PR as an example, one way I could have split it up is:

  1. tweaks to the publishing of docs
  2. initial implementation of upload APIs
  3. add env var to make endpoint configurable
  4. support streaming requests

The problem is the dependencies between them: (2) arguably depends on (1), and (3) and (4) depend on (2).
I can make those separate PRs, but if (for example) PR4 includes some commits from PR1 and PR2, then the UI is going to show all the changes from PR1 and PR2 again in PR4, with no hint to reviewers that those changes will be added by some other PR and should be ignored here. Reviewers can be told in a comment "please note that this depends on PRs 1 and 2, so ignore the changes introduced by those", but I doubt it's practical for reviewers to do that when the UI is not helping them. PR4 wouldn't appear little and atomic unless the previous PRs were merged first.

A simple way to fix this, at the cost of time, is to wait for review of each PR to be completed before asking for review on the next one... but that's extremely inefficient. From my point of view, every review takes at least overnight (or longer if it doesn't pass on the first attempt). That means making one bigger PR has a chance of getting merged in a day or two, while splitting it into several smaller, dependent PRs would need at least a full week to get them all merged even if everything goes perfectly.

I'm interested in solutions to this problem (and not only for this project), so if you have some experience or suggestions on how to manage dependencies between PRs please let me know.

@lmilbaum
Contributor

@rohanpm You have raised great discussion points. Let's continue the discussion over an email thread or, even better, a meeting. For this PR specifically, I do not (yet) have the tools to provide an efficient code review.

@rohanpm rohanpm merged commit ac273c3 into release-engineering:master Jul 27, 2020
@rohanpm rohanpm deleted the s3-test branch July 27, 2020 23:04