Object service proposal first draft #116

jhford · 2018-03-23T12:52:43Z

This proposal was discussed during the Berlin work week on March 22, 2018. I'm going to the PR stage directly as we've already had a lot of discussions about this with almost the whole team present. I believe that this PR reflects the design we've been speaking about. My feeling was also that we'd agreed that a concrete proposal was the best step to take next.

Let's discuss this! I'm flagging people for review who were most actively involved in the discussion.

jonasfj · 2018-03-24T10:06:31Z

rfcs/0116-artifact-service.md

+
+# Summary
+
+Taskcluster tasks often product Artifacts during the course of being run.


produce artifacts

jonasfj · 2018-03-24T10:07:04Z

rfcs/0116-artifact-service.md

@@ -0,0 +1,196 @@
+# RFC 116 Artifact Service


It's not an artifacts service... it's a generic blob/object storage service. Please use said terminology.

jonasfj

The service is used for artifacts, but should be generic object/blob storage service, that implements an object/blob store by wrapping providers from different clouds.

Terminology should be changed, I view this as blocking.

jonasfj · 2018-03-24T10:07:51Z

rfcs/0116-artifact-service.md

+
+Taskcluster tasks often product Artifacts during the course of being run.
+Examples include browser archives, json metadata and log files.  We currently
+store Artifacts in a single region of S3 and split management of Artifacts


"Artifacts" is in upper case why?

jonasfj · 2018-03-24T10:08:57Z

rfcs/0116-artifact-service.md

+Taskcluster tasks often product Artifacts during the course of being run.
+Examples include browser archives, json metadata and log files.  We currently
+store Artifacts in a single region of S3 and split management of Artifacts
+between the Queue and Cloud Mirror.  This project is to take Artifact handling


terminology matters, as I understood it the idea was that queue would still own artifacts.
But this service would provide the blob/object storage used, instead of S3.

We talked about how ec2-manager is the API we wish EC2 had. I think the same is true here:

This is the API we wish Amazon S3, DigitalOcean Spaces, Ceph, Azure Blob Storage, etc. had, with a focus on caching near compute resources, automatic expiration, and secure content hashing.

Maybe re-working this introductory section around that focus, with artifacts only mentioned in the motivation section, would help illustrate this focus.

It might be a useful thought exercise to imagine this as a replacement storage mechanism for tooltool -- the tooltool upload process is pretty bad, and picks regions at random so this isn't an entirely bad idea. Actually doing the work is out of scope, but building a service that could support tooltool is probably useful to help the design be general.

jonasfj · 2018-03-24T10:10:10Z

rfcs/0116-artifact-service.md

+* The Artifact service will start from the Queue's `artifacts.js` file
+* Auth might not be done with scopes and tc-auth, but a shared secret between
+  Queue and Artifacts
+* The Queue will block new tasks after task resolution by not issuing new


new artifacts

jonasfj · 2018-03-24T10:14:55Z

rfcs/0116-artifact-service.md

+  transferLength: 1234,
+  transferSha256: 'abcdef1234',
+  contentEncdoing: 
+  expiration: new Date(),


Don't we have to specify parts?

That's an implementation detail we should finalize as we progress. I want to spend a bit of time with other object stores before we solidify multipart uploads across services.

I think the public API is one of the things to settle before we start implementing this.

If this will match the existing blob storage type, it might be OK to just refer to that -- if that's sufficiently general to support other clouds' multipart implementations.

jonasfj · 2018-03-24T10:15:45Z

rfcs/0116-artifact-service.md

+
+```
+DELETE /artifacts/:name
+DELETE /caches/artifacts/:name


I suggest that we don't allow this API end-point, we only allow deletion through expiration.

Keeping thing minimal.

Tooltool doesn't support deletion at all, and TC doesn't for artifacts, either, so I tend to agree. If this is really easy to implement, fine, but if the implementation is complicated or racy then perhaps not. It will probably be useful once or twice to delete some accidentally-uploaded private data. That's happened before with tooltool, for example (and required some manual database and S3 operations to clear up).

it'll be racy by definition...

As in there is a possibility for racing between creating and deleting the resource, especially considering that uploads can retry.

If we really need to purge a secret that shoudn't have been uploaded, it's better open the database... People could put such a secret in the task definition too... this could always happen.

which would you prefer to do if you were the one cleaning up a chemspill level leak:

objadm --auth-token abcdef1234 delete our-pgp-private-key.txt

or

Figure out how the service stores data

Figure out which tools need to be used to access the database

Figure out the credentials for the database

Edit the database schemas underlying

Figure out how to access every underlying object store

Figure out how to map the :name to each underlying object store's version of the name

Figure out how to delete things safely from all the regions

Delete all the files.

By the way: don't delete anything you're not supposed to. This could be as easy as accidentally forgetting a WHERE clause on your DELETE.

We wouldn't be exposing this part of the API to anyone other than the Queue and ourselves. "Just opening the database" is a gross simplification here, and assumes that the people who will be operating this service are knowledgable on Postgres, Node.js, Taskcluster services. We wouldn't be deleting things often, but it is pretty much expected that if you create and delete the same resource at the same time, you'd have weird behaviour.

As someone who does understand these things, in production, I'd rather use a unit tested and known-to-work option than opening a sql client and the S3 management console.

As for cache clearing, it's by definition and per the HTTP spec that it's very clear to be a best-effort initiation of a delete.

jonasfj · 2018-03-24T10:19:21Z

rfcs/0116-artifact-service.md

+
+# Open Questions
+
+* Will we move old artifacts over?


Most of the artifact we use a recent, or so I'm guessing. Besides we can keep cloud-mirror around for those..

Yes, I see this as either a new artifact type in the queue or, if it's still unused, a different implementation of the new "blob" type.

we actually don't want to keep cloud-mirror around in this world. CM is only useful for high volume requests from CI. The chances that any file is requested by a CI system in high volume more than a couple days after creation is low enough that I'd gamble the interregion storage costs.

jonasfj · 2018-03-24T10:19:28Z

rfcs/0116-artifact-service.md

+# Open Questions
+
+* Will we move old artifacts over?
+* Will we move deprecated artifact types (s3, azure) over?


Let's migrate workers, and remaining artifact types will be low volume... and eventually completely removed.

jonasfj · 2018-03-24T10:19:47Z

rfcs/0116-artifact-service.md

+
+* Will we move old artifacts over?
+* Will we move deprecated artifact types (s3, azure) over?
+* How will we support data centers where we have no native object store


I would suggest pull-through caches

This would be a different kind of thing to configure -- the service would need to know that it can copy from S3 region to S3 region, but for other regions it can redirect to a cache URL that will pull through. That said, a pull-through cache and a CDN are pretty similar, so perhaps supporting both would be beneficial.

There are also some S3-like services out there which we could deploy, including Ceph.

Yep, pull-through cache and CDN would could be implemented with something like a "redirector" function, basically something which takes the name and plugs it into a URL pattern.

It seems to me that setting up the specific in-dc portion is out of scope for this RFC, but that support for this sort of redirecting would be in-scope

jonasfj · 2018-03-24T10:19:55Z

rfcs/0116-artifact-service.md

+* Will we move old artifacts over?
+* Will we move deprecated artifact types (s3, azure) over?
+* How will we support data centers where we have no native object store
+* What auth scheme will we use between Queue and Artifacts


JWT in querystring

I suspect we could use JWTs to authenticate pull-through cache requests too... Not entirely sure..

that seems reasonable to me. For the storage providers, we'll likely need to use their specific authentication systems. For example, S3 requires aws request signing. Since a pull through cache would be implemented as any other 'cdn' type thing, we could build this into the redirector logic.

jonasfj · 2018-03-24T10:22:11Z

rfcs/0116-artifact-service.md

+* Will we move deprecated artifact types (s3, azure) over?
+* How will we support data centers where we have no native object store
+* What auth scheme will we use between Queue and Artifacts
+* Will we use Content Addressable storage?


Yes, please this would be awesome...

Biggest hurdle here is the security issues... But we if can avoid unnecessary uploads that would be awesome.
We should brain storm on this, I'm not sure how to solve it... and what the security aspects are... Obviously, a hash isn't enough to prove that you're in possession of a file and therefore able to change its name...

It would. I also think this is something we don't need to figure out right now, since the mapping of name that the queue would use is unrelated to the internal storage name.

Let's decide to not decide on this point until we're further along. maybe

Can we make it an open question how to deduplicate private artifacts...

jonasfj · 2018-03-24T10:25:26Z

rfcs/0116-artifact-service.md

+the worker and not the request to the artifact service.  If the origin is an
+IPv4 or IPv6, a mapping of the address and IP will occur to find the backing
+storage.  If the origin is not an IPv4 or IPv6 address and is identical to a
+set of known and mappable identifiers (e.g. `s3_us-west-2`), then that will be


I'm not sure we should allow identifiers, other than IPs, but I'm not sure...

In the general case, I think this could be useful. For example, when glandium wants to upload something from a coffeeshop in Nagoya, he should be able to specify 'ap-northeast-1` to get a quick upload, without requiring us to configure the IP range of that coffeeshop in the service (but requiring that we configure that region)

yes, that's the idea. I think in the general case, this is a critical feature and not one that's really all that difficult to implement. We can verify that the configured identifiers aren't valid IP addresses, which makes this safe to do. This would also support things like having the queue use a specific region or service for its unit tests. It would also allow something like saying "All private artifacts must go to s3-us-west-2 for auditing reasons" or any of a bunch of other possible rules based on payload.

That said, I'm sure we could use some sort of geo-ip database service to do 'smart' things. I think using a Geo-ip service for this would be an extreme amount of overkill.

Until there's a specific concern which cannot be addressed, let's stick with these identifiers.

The counter argument against allowing this is that the service is supposed to abstract the fact that the storage is geographically distributed. Hence, we shouldn't allow people to peak past the abstraction.

It's a nit :)

Besides glandium will upload through the queue which will just take the IP of the caller as origin.

The assumption is that ip blocks are published for every single cloud provider and that any possible user of this service has ip blocks of all of their users. I don't think we can make the latter assumption.

It's also super annoying as a person debugging artifacts to have to launch an EC2 instance in eu-central-1 to be able to get that artifact's copy in eu-central-1.

origin is just parameter, you can make up and IP for debugging...

I don't have strong objections... just saying we could reduce the surface exposed here.

If we add/remove origins, then what do we do to strings that are no-longer valid? treat them as public internet, like IPs for which we have no cache. I guess that could work.

But it means that origin will be a free form string field, and not a ip-pattern or choice from enum.

My thinking was that if it is not an IPv4/v6 and not a known id, that it would generate an error, so that misconfigured usage is immediately apparent.

but that would make us break compatibility whenever we remove an origin, or rename an origin.

hiding internals well, is a good way to avoid compatibility breaks :)

jonasfj · 2018-03-24T10:27:44Z

rfcs/0116-artifact-service.md

+* We want to consolidate the handling of Artifacts into a single service
+* We want to build generalised software when possible as participants of a free
+  software community
+


Could we add something about how someone who runs their own data center could point the queue at something else that implements the same API as this blob/object storage service thing..

A large portion of CoT's complexity is dedicated to making sure artifacts at rest haven't been altered. If we could save sha metadata with the artifacts which could a) be compared with the uploaded sha, and b) be compared with the downloaded contents, that would go a long way to simplifying our security checks.

Are there any plans to add such information to the artifact service?

@escapewindow, the new artifact kind storageType: 'blob', see createArtifact, actually has contentSha256. CoT could definitely use this in some way.

It's not immediately obvious to me how we'll do it.. as we'll still need to know which artifacts were covered by the CoT certificate. Since artifacts like task log can't be under the CoT. Or maybe in future we put that under CoT and accept that certain errors will be more silent, that might be okay, since errors computing CoT is an error in the worker... or network error between worker and queue.

yep, the service will always do as strong checks as possible to ensure that we allow only uploading of a known object (e.g. same Sha256 value), and that downloads will allow information to ensure that the object is what it should've been when uploaded. The download verification will need to happen in the client, as there's nothing in the HTTP spec stronger than Content-MD5 for verification.

At minimum, even if the Sha256 cannot be verified during upload, we can ensure that the downloaded artifact has metadata on what the Sha256 value was expected from the Queue/Uploader and provide that information for consumers of the service.

I plan to adapt taskcluster-lib-artifact to support this automatic verification when downloading from this service.

Note: this requires support from the underlying storage provider. This might be a reason to use something like Ceph as Dustin suggested over a pull through cache.

jonasfj · 2018-03-24T10:30:13Z

The big conceptual issue for me is making sure this has nothing to do with artifacts, but rather that it's an object/blob storage service, implemented by wrapping existing object/blob stores.

I suggest calling it something like object/blob/content-storage-service... especially we get the content-addressable thing right and do deduplication :)

djmitche · 2018-03-26T15:57:04Z

I like "taskcluster-content-storage" :)

I do want to review this, but my queue is quite long at the moment, so it might be a few days.

jonasfj · 2018-03-26T16:15:17Z

I want to review is in more detail too, but I would like to see the terminology resolved first, so I can better talk about things :)

@djmitche, so far the aim have been to make this a non-taskcluster branded service, as in it would use JWTs for authentication... and would in effect just be a thing we use to build our services on top of...

We could also aim to make it a taskcluster service that uses tc-auth, and offers a generic interface for anyone wants blob storage that is cached across regions on demand... I'm not sure it would be useful for sscache, etc... tooltool maybe, but it could be interesting..

djmitche · 2018-03-26T16:19:46Z

OK, that's fair -- the taskcluster- prefix is optional then, but that's a pretty strong commitment to non-taskcluster-specificity :)

djmitche

A few other more general things I'd like to see discussed:

Can this be configured to use CDNs? Is that what's behind the "DELETE /caches/" endpoint -- that it would invalidate the object on any associated CDNs?
Since this service requires a "frontend service" (Queue or the Tooltool server), it might be good to have a name for that service (probably "frontend service" isn't a good choice!) and to see what the overall interaction between the user, that service, and this service would look like both for uploads and for downloads.

In general I'd like to see more detail here, but I like what I see so far (aside from use of "artifact" everywhere...)

djmitche · 2018-03-26T21:11:01Z

rfcs/0116-artifact-service.md

+# Summary
+
+Taskcluster tasks often product Artifacts during the course of being run.
+Examples include browser archives, json metadata and log files.  We currently


What's a browser archive? I know this paragraph is just illustrative, but I'm curious :)

example: the content of firefox-53-win32.zip

Ah, OK, those seem to be referred to as "installers" :)

true, but in the case of Linux and Mac, they're not really installers in the sense that the windows. I can change the phrasing if you'd like. Do you think "browser installers" would be good?

djmitche · 2018-03-26T21:28:30Z

rfcs/0116-artifact-service.md

+Taskcluster tasks often product Artifacts during the course of being run.
+Examples include browser archives, json metadata and log files.  We currently
+store Artifacts in a single region of S3 and split management of Artifacts
+between the Queue and Cloud Mirror.  This project is to take Artifact handling


We talked about how ec2-manager is the API we wish EC2 had. I think the same is true here:

This is the API we wish Amazon S3, DigitalOcean Spaces, Ceph, Azure Blob Storage, etc. had, with a focus on caching near compute resources, automatic expiration, and secure content hashing.

Maybe re-working this introductory section around that focus, with artifacts only mentioned in the motivation section, would help illustrate this focus.

It might be a useful thought exercise to imagine this as a replacement storage mechanism for tooltool -- the tooltool upload process is pretty bad, and picks regions at random so this isn't an entirely bad idea. Actually doing the work is out of scope, but building a service that could support tooltool is probably useful to help the design be general.

djmitche · 2018-03-26T21:46:15Z

rfcs/0116-artifact-service.md

+
+# Details
+
+* The Artifact service will start from the Queue's `artifacts.js` file


I don't understand what this means -- is the service implemented in the https://github.com/taskcluster/taskcluster-queue repository??

I think he want's to implement those API end-points, which makes no sense to me..

I can remove that detail, it's more just that we have some of this already implemented in the Queue and I wanted to salvage as much of that code as possible.

djmitche · 2018-03-26T21:51:13Z

rfcs/0116-artifact-service.md

+
+* The Artifact service will start from the Queue's `artifacts.js` file
+* Auth might not be done with scopes and tc-auth, but a shared secret between
+  Queue and Artifacts


This needs to be fleshed out. I assume that if this is not a TC-specific service, then it definitely won't use scopes. But if this service is implementing its own access control, we'll need to do an RRA and think about the failure modes. For example, if the shared secret is exposed, could someone overwrite an existing blob? Add a blob that looks like an artifact on another task?

We need to settle on a specific auth mechanism. JWT looks fine to me so far, I'll play around with it some.

We should block overwriting of artifacts in general, since that's not likely something we actually want ever.

I presume the namespace of this service here to not really be a 'task' based namespace, but that the Queue would internally implement its own 'task/run/artifactName' namespacing. If someone got the secret from this service, it would almost certainly be stored beside the same secrets that would allow for raw access to the storage system and so isn't really a concern.

That secret could also be leaked from the Queue, which wouldn't have the underlying storage credentials. That would then allow the bearer of that token to create new artifacts with any name. That would allow someone to pre-emptively create an artifact for a task, but the real task would fail if we block overwriting and would alert us to the name being already taken.

At some level though, a secret which allows an operation must be kept secret. This is no more risky than S3 itself. Right now, the secret used in the Queue is a set of raw S3 credentials. If you lose your access token and client id, someone can change your stored objects.

djmitche · 2018-03-26T21:51:52Z

rfcs/0116-artifact-service.md

+* The redirecting logic of the Artifact service will adhere to the HTTP spec
+  using 300-series redirects where possible
+* Where content-negotiation is impossible, the service will block creation of
+  artifacts when incoming requests aren't compatible with required response


I don't understand this..

respond 406 if resource is gzipped and Accept: gzip is missing...

I still don't understand it - respond 406 on which request?

so you're Client A and requesting resource X which is stored with gzip content-encoding in s3. You submit a request with the headers {} (i.e. none). Because we know that we're going to serve a response which must have {Content-Encoding: gzip}, but we don't know for sure if you can accept gzip encoding, we fail with and send a 406. When the client is updated to send headers {Accept-Encoding: gzip}, we send the 300 series redirect to S3.

The request which would possibly send the 406 would be the first from this service. The queue would 302 to this service, the service would look up the underying storage and see that it's going to force a content encoding but that necoding isn't in the accept-encoding header, then this service would send a 406.

Thanks, that makes a lot more sense. Probably including a scenario like that in the doc would help :)

djmitche · 2018-03-26T22:10:01Z

rfcs/0116-artifact-service.md

+closest to the request's IP address.  If needed, this will initiate from the
+original backing store into a different storage location.  It will wait up to a
+configurable amount of time before redirecting to the non-optimal original
+backing store.


Is this always the "original" backing store, or might it be another closer source? For example, if an artifact was uploaded to us-east-1 and is already copied to us-west-2, it would make sense to copy from us-west-2 for a us-west-1-originating request. That sounds pretty fancy, but it's probably a good idea to leave the possibility of doing such things open.

fair enough. s/to the non-optimal original backing store/to a non-optimal copy/

djmitche · 2018-03-26T22:10:44Z

rfcs/0116-artifact-service.md

+client is willing to wait for an artifact.  The default value should be based
+on the size of the artifact.  For example, it will wait 2s for each 100MB of
+artifact size.  While waiting, the service will issue intermediate redirects
+to itself to reduce the waiting time.


This is really clever :)

djmitche · 2018-03-26T22:11:15Z

rfcs/0116-artifact-service.md

+GET/PUT/DELETE /errors/:name
+```
+
+Since redirects aren't artifacts and errors aren't artifacts, they will have


100% agreed

djmitche · 2018-03-26T22:12:14Z

rfcs/0116-artifact-service.md

+
+# Open Questions
+
+* Will we move old artifacts over?


Yes, I see this as either a new artifact type in the queue or, if it's still unused, a different implementation of the new "blob" type.

djmitche · 2018-03-26T22:17:04Z

rfcs/0116-artifact-service.md

+
+* Will we move old artifacts over?
+* Will we move deprecated artifact types (s3, azure) over?
+* How will we support data centers where we have no native object store


This would be a different kind of thing to configure -- the service would need to know that it can copy from S3 region to S3 region, but for other regions it can redirect to a cache URL that will pull through. That said, a pull-through cache and a CDN are pretty similar, so perhaps supporting both would be beneficial.

There are also some S3-like services out there which we could deploy, including Ceph.

jonasfj · 2018-04-05T17:32:10Z

rfcs/0116-artifact-service.md

+  contentSha256: 'abcdef1234',
+  transferLength: 1234,
+  transferSha256: 'abcdef1234',
+  contentEncdoing: 


contentEncdoing -> contentEncoding

jonasfj · 2018-04-05T17:33:05Z

rfcs/0116-artifact-service.md

+### Retreival
+
+```
+GET /artifacts/:name[?max_time=30]


can we rename it ?timeout=30, as the client is expected to timeout the request in 30s... So the server should use this as a hint..

But the server would use the max_time as a hint. The client does not timeout the request at all here, this is about being able to handle response handlers which take >30s to run through.

jonasfj · 2018-04-05T17:34:37Z

rfcs/0116-artifact-service.md

+Example inside heroku:
+```
+10:00:00 <-- GET /artifacts/my_file?max_time=90
+10:00:25 --> 302 /artifacts/my_file?max_time=65


This is probably unwise as most clients stop following redirects after 5-7 hops. We can keep a TCP connection alive longer by writing whitespace in HTTP, or some other hack like that.

Okay, I'm not sure.. just saying there is something to think about here.

I'm not sure where you're getting that 5-7 amount. Clients can choose to follow more redirects if they want. This API also doesn't force us a specific mechanism of waiting, since the user says "Hey, I'll wait 200 seconds" and we just do our best to stall them for 200s.

Just like outside of heroku, we cannot write to the body of a response before we know where the 302 should redirect to as headers would've had to have been fully sent by the time the first byte is written. We could redirect to a waiter endpoint, but then that waiter endpoint would require another 302 every single time since we'd have to redirect back to the original endpoint. And since this is all HTTP spec compliant stuff, there's nothing stopping us from changing between the two approaches whenever we choose.

By the time we write whitespace, we'd need to know if the request redirected to the original bucket or to the copy.

It's something to think about for sure, but I'm not too concerned. Based on this totally, 100% legit, A-Grade site, browsers follow >=20 redirects, which is 10 minutes of waiting in heroku. Curl does 50 by default on my mac. curl -LI https://httpbin.org/redirect/1000

Okay, it's possible I'm wrong.

I think I noticed a lot of clients in npm, golang that defaults to something like 5-7 redirects.

golang by default follows 10 redirects (not bad).

npm request follows 10 by default.

npm got is also 10 redirects

python requests defaults to 30 redirects

npm superagent defaults to 5 redirects (we already don't like this)

wget defaults to 20 redirects (could be build dependent)

curl defaults to 50 redirects if following redirects is enabled with -L

So maybe it's not such a crazy idea after all :)

djmitche

Even if we only dedup publiic artfacts, that would be a huge win. Maybe doing so is out of spec in the initial implementation, but the proposal should probably consider it so that we don't need to break anything to implement it later. Note that we will need a little bit of protocol support for deduplication, as a POST to the object service for an object that already exists needs to return something to the client to indicate that it need do nothing further (except perhaps PATCH?). Maybe that's just an empty list of parts -- let's be explicit, anyway.

Regarding origin names: the simplest initial approach is to only allow IPv4/6. If it turns out to be useful to add names later, that is a backward-compatible change.

Regarding deletion: I agree that having a tested deletion API is better than mucking about under duress (having mucked about under duress). I'm not worried about the raciness from the user's perspective -- if I delete an artifact that you are about to download, it's OK to fail (with a reasonable error code). What I'm worried about is that such a deletion will crash background processes or that those background processes might accidentally "undelete" the object by mirroring it back from a region that the DELETE operation hadn't gotten to yet. That said, in a content-addressible architecture, this is a lot easier: we keep the content, but delete the names, making it inaccessible and letting lifetime policies delete it eventually (or mucking about in the backend service if we really want it dead).

I have a vague idea of what the overall upload interaction looks like, but I'd like to see it in more detail here. The bit about post-upload actions is kind of vague -- what are the requirements there? What happens if that method is not called?

I think it's pretty key here that the retrieval process doesn't require anything more than a standard (but full-featured) HTTP client. That is probably worth mentioning!

A few additional questions:

Does this service just distinguish "public" and "private" or does it have other levels of access control? I don't see any such distinctions in the document, actually.
For private artifacts, how can we remove the authentication information from the URL? The risk is that if a private HTML artifact is loaded in a browser, and a user clicks a link, the referrer field contains that URL and would allow the link target to access the artifact. It'd be neat to be able to fix this!
What are the timing constraints on an upload? I expect the generalized requests will have a limited lifetime -- presumably each request must start by a given time. Must the upload be finished by a certain time, too? What if the uploader crashes before finishing the upload? Can I re-call the POST endpoint with the same body in that case?
I'm keen to know how the JWTs will work. JWTs are a way of communicating a set of claims, so between two parties (frontend service and object service) I don't see a lot of benefit over just using a bearer token. But I'd love to be surprised :)

djmitche · 2018-04-06T13:22:55Z

rfcs/0116-artifact-service.md

+
+### Generalised request format
+There will be a concept of a generalised HTTP request.  These requests are used
+to transmit metadata of an HTTP request which the artifact service requires the


object service

djmitche · 2018-04-06T13:23:04Z

rfcs/0116-artifact-service.md

+
+An example usage of this format is uploading a file.  The creation endpoint of
+this service will generate all of the HTTP request metadata that the uploading
+worker will use to upload file.  Then the worker will match these HTTP requests


s/worker/uploader/g

djmitche · 2018-04-06T13:25:43Z

rfcs/0116-artifact-service.md

+
+Where present, the `origin` rest parameter in this service for will be either
+an IPv4 address, IPv6 address, or an identifier.  This parameter will specify
+the source location of the ultimately originating request.  This means the


the source location for the uploaded data

djmitche · 2018-04-06T13:26:14Z

rfcs/0116-artifact-service.md

+Where present, the `origin` rest parameter in this service for will be either
+an IPv4 address, IPv6 address, or an identifier.  This parameter will specify
+the source location of the ultimately originating request.  This means the
+origin of the request to the Queue from the worker and not the request to the


s/Queue/frontend service/g

djmitche · 2018-04-06T13:28:51Z

rfcs/0116-artifact-service.md

+  transferLength: 1234,
+  transferSha256: 'abcdef1234',
+  contentEncdoing: 
+  expiration: new Date(),


If this will match the existing blob storage type, it might be OK to just refer to that -- if that's sufficiently general to support other clouds' multipart implementations.

djmitche · 2018-04-06T13:29:17Z

rfcs/0116-artifact-service.md

+
+The `PATCH` endpoint will be sent without a request body and will be used to
+perform any post-upload actions.  These include things like the commit step on
+an S3 multipart upload.  This endpoint would be run by the Queue and not by the


s/Queue/frontend service/

djmitche · 2018-04-06T13:31:46Z

rfcs/0116-artifact-service.md

+DELETE /caches/artifacts/:name
+```
+
+This service has at least one copy of each stored file.  Any copies above 


above what?

djmitche · 2018-04-06T13:35:29Z

rfcs/0116-artifact-service.md

+10:00:50 --> 302 /artifacts/my_file?max_time=40
+10:00:50 <-- GET /artifacts/my_file?max_time=40
+10:00:55 --> 302 http://us-west-2.s3.amazonaws/artifacts/my_file
+```


I like this interaction diagram -- a similar thing for uploads would be great.

djmitche · 2018-04-06T13:37:10Z

rfcs/0116-artifact-service.md

+This endpoint returns a 302 redirect to the location of the object which is
+closest to the request's IP address.  If needed, this will initiate a copy from
+the original backing store into a different storage location.  It will wait up
+to a configurable amount of time before redirecting to a non-optimal copy.


There is an origin involved here, too, right? Can that be provided as a query parameter?

djmitche · 2018-04-06T13:59:08Z

I should also add, this is looking good. I'm trying to envision using this outside of taskcluster, and I think it still makes sense in that context..

jonasfj · 2018-04-06T23:42:05Z

@djmitche, look at how jwts are used in webhooktunnel, or statsum. We could also make it a TC service. And I agree, we need a brainstorming session on how to deal with private vs public artifacts.. we could make that a queue only thing too... As it's the only one that knows names anyways.

…

On Fri, 6 Apr 2018, 15.59 Dustin J. Mitchell, ***@***.***> wrote: I should also add, this is looking good. I'm trying to envision using this outside of taskcluster, and I think it still makes sense in that context.. — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#116 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAJI5CfzHi-8C9oGjdbqyINkkGZFhe4Pks5tl3StgaJpZM4S4qqW> .

djmitche · 2018-04-07T13:44:19Z

JWTs in statusum and webhooktunnel are used by three parties (issuer, client, and server). There they make sense: the client presents a JWT containing signed claims from the issuer. Those claims become promises about the bearer of the token ("I checked this client's identity, here's what I know.."). Would we follow the same model here? What would act as the issuer? I'm not saying JWTs are not a good plan -- I just need to know more about the plan :) The object service needs to know that one object can be read with no authentication, while others require authentication to read. If the queue handles that, and always redirects to a signed URL even for public artifacts, then users will copy those signed URLs and be sad when they don't work an hour later. Tooltool has "INTERNAL" and "PUBLIC" visibilities, which have exactly the same semantics: a simple GET for a public blob will get you the data, while an INTERNAL blob requires a token.

jonasfj · 2018-04-09T13:25:19Z

I think queue would be the `issuer` when it signs JWT, adds it to a URL and redirects the client to said URL.

…

-- Regards Jonas Finnemann Jensen. 2018-04-07 15:44 GMT+02:00 Dustin J. Mitchell <notifications@github.com>:

JWTs in statusum and webhooktunnel are used by three parties (issuer, client, and server). There they make sense: the client presents a JWT containing signed claims from the issuer. Those claims become promises about the bearer of the token ("I checked this client's identity, here's what I know.."). Would we follow the same model here? What would act as the issuer? I'm not saying JWTs are not a good plan -- I just need to know more about the plan :) The object service needs to know that one object can be read with no authentication, while others require authentication to read. If the queue handles that, and always redirects to a signed URL even for public artifacts, then users will copy those signed URLs and be sad when they don't work an hour later. Tooltool has "INTERNAL" and "PUBLIC" visibilities, which have exactly the same semantics: a simple GET for a public blob will get you the data, while an INTERNAL blob requires a token. — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#116 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAJI5Cax8h52J8-tK1rKESwatXFZ4_Hjks5tmMKzgaJpZM4S4qqW> .

djmitche · 2018-04-09T14:07:29Z

Ah, I thought the JWTs were used to authenticate the calls from the frontend service to the object service. I suppose the queue could issue those to itself!

In any case, if we're using asymmetric keys to sign the JWTs, then the object service will need a way to configure the keys it accepts, maybe using a set of jwks endpoints? Then every frontend server would need to provide such an endpoint, and the object service could just be configured with a list of endpoints for services it trusts.

jonasfj · 2018-04-09T17:43:11Z

I would just use a symmetric key as we do in the other services. Asymmetric algorithms are a bit slow. (maybe it's not so bad anymore)

…

-- Regards Jonas Finnemann Jensen. 2018-04-09 16:07 GMT+02:00 Dustin J. Mitchell <notifications@github.com>:

Ah, I thought the JWTs were used to authenticate the calls from the frontend service to the object service. I suppose the queue could issue those to itself! In any case, if we're using asymmetric keys to sign the JWTs, then the object service will need a way to configure the keys it accepts, maybe using a set of jwks endpoints <https://auth0.com/docs/jwks>? Then every frontend server would need to provide such an endpoint, and the object service could just be configured with a list of endpoints for services it trusts. — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#116 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAJI5JzVwj6GtE5lzDKyU5Rj63r63Ysmks5tm2sigaJpZM4S4qqW> .

jvehent · 2018-04-09T18:51:43Z

I would just use a symmetric key as we do in the other services.

+1 for not adding additional key management methods and keeping everything the same.

djmitche

A few editing things inline. There's nothing in here I don't like, just stuff I don't see addressed!

Remaining concerns from my perspective (reiterating some from my previous comment):

Are the object service's API methods idempotent? Could/should we make them so?
How does the object service say "yep, I already have that file" if deduplication were implemented?
The DELETE-caches endpoint doesn't make much sense in a deduplicated world -- the DELETE operation is really only deleting the name, not (necessarily) the underlying data. I know deduplication is not an initial feature here, but I think it is wise to build an API that is compatible with deduplication.
What are the timing constraints on an upload?
What happens if the post-upload method isn't called? I assume that we'll reflect what S3 does and time that upload out after some long duration. What is that duration?
Does this service just distinguish "public" and "private" or does it have other levels of access control?
For private artifacts, how can we remove the authentication information from the URL?
How are requests to the object service authenticated?

djmitche · 2018-04-12T17:28:06Z

rfcs/0116-artifact-service.md

+* We will support uploading objects to different regions and services so that
  we can minimize interregion transfer costs for initial object creation

+### Content Negitiation


*Negotiation

djmitche · 2018-04-12T17:32:10Z

rfcs/0116-artifact-service.md

+The completion endpoint is used to perform any post-upload validation required.
+This might include the "Complete Multipart Upload" for S3 based objects, or
+possible a post-upload hash and verification step for a locally managed file
+server.


This is partially redundant with the last paragraph. Perhaps combine them?

djmitche · 2018-04-12T17:33:55Z

rfcs/0116-artifact-service.md

-`blob` storage type in the queue.
+an S3 multipart upload.  This endpoint would be run by the frontend service and
+not by the uploader, and the frontend service would also need a complimentary
+method for workers to call when they complete the upload.  This is what is


workers -> uploaders

djmitche · 2018-04-12T17:34:45Z

rfcs/0116-artifact-service.md

+This is a simplified view, meant to highlight the key interactions rather than
+every detail.  For the purpose of this example, the uploader speaks directly
+with the object service.  In reality this interaction would have something like
+the Queue as an intermediary between the uploader and the object service.


This is the bit of the interaction I'd like to understand! I think the above documentation clarifies it enough, though.

djmitche · 2018-04-12T17:36:06Z

rfcs/0116-artifact-service.md

+                          method: 'PUT'
+                        }, {
+                          url: 'http://s3.com/object-abcd123?partNumber=2&uploadId=u-123',
+                          headers: {authorization: 'signature-request-1'},


djmitche · 2018-04-12T17:38:51Z

rfcs/0116-artifact-service.md

+those copies which are created in other regions or services to provide more
+local copies of each object.  This cache purging must return a `202 Accepted`
+response and not `200/204` due to the nature of how caches work in the real
+world.


I'm not sure what this means.. the object being deleted isn't named "/caches/objects/:name", so HTTP semantics don't apply here anyway.

djmitche · 2018-04-12T17:46:23Z

rfcs/0116-artifact-service.md

-DELETE /artifacts/:name
-DELETE /caches/artifacts/:name
+DELETE /objects/:name
+DELETE /caches/objects/:name


From a user's perspective, is this a useful distinction? The use-cases I see are "I regret having uploaded this object (e.g., sec issue)" and "this object is expired". In either case, I think I would use the first endpoint.

djmitche · 2018-04-19T18:35:51Z

------> #120

jhford force-pushed the artifact-service branch from a61138a to a354884 Compare March 23, 2018 12:54

Artifact service proposal first draft

5800370

jhford force-pushed the artifact-service branch from a354884 to 5800370 Compare March 23, 2018 12:56

jhford requested review from ccooper, djmitche, imbstack, jonasfj and petemoore March 23, 2018 13:00

jhford self-assigned this Mar 23, 2018

jonasfj reviewed Mar 24, 2018

View reviewed changes

rfcs/0116-artifact-service.md Outdated

# Summary

Taskcluster tasks often product Artifacts during the course of being run.

Copy link

jonasfj Mar 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

produce artifacts

jonasfj reviewed Mar 24, 2018

View reviewed changes

jonasfj suggested changes Mar 24, 2018

View reviewed changes

jonasfj reviewed Mar 24, 2018

View reviewed changes

djmitche reviewed Mar 26, 2018

View reviewed changes

djmitche added the Phase: Proposal label Mar 27, 2018

jhford added 2 commits April 5, 2018 19:00

Address first round of feedback

34cfc25

More changes, including renaming service

e645ae8

jonasfj changed the title ~~Artifact service proposal first draft~~ Object service proposal first draft Apr 5, 2018

jonasfj reviewed Apr 5, 2018

View reviewed changes

djmitche reviewed Apr 6, 2018

View reviewed changes

Another round of feedback integrated

f43a644

djmitche self-requested a review April 12, 2018 16:15

djmitche reviewed Apr 12, 2018

View reviewed changes

jhford mentioned this pull request Apr 18, 2018

Object service (second round) #120

Merged

djmitche closed this Apr 19, 2018

djmitche removed the Phase: Proposal label Apr 19, 2018

ccooper removed their request for review April 26, 2018 21:31


		# Summary

		Taskcluster tasks often product Artifacts during the course of being run.


		# Details

		* The Artifact service will start from the Queue's `artifacts.js` file

Object service proposal first draft #116

Object service proposal first draft #116

Uh oh!

Conversation

jhford commented Mar 23, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonasfj left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonasfj Mar 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jonasfj Mar 24, 2018 •

edited

Loading

jhford Apr 5, 2018 •

edited

Loading