
Prevent workers from using same workerId #71

Closed
wants to merge 1 commit into from

Conversation

djmitche
Contributor

@djmitche djmitche commented Feb 7, 2018

Originating from bug 1374978:

What about if the provisioner continued to provide credentials, as it currently does, but instead of granting the current scopes it grants:

  • assume:worker-id:*
  • assume:worker-type:aws-provisioner-v1/<workerType>

it could instead grant the single scope:

  • queue:register-worker:<provisionerId>/<workerType>/<slugid>

Then once the worker starts up, it could call queue.registerWorker(provisionerId, workerType, slugId, workerGroup, workerId) using the provisioner-provided credentials, and this queue call would return new temporary credentials for the worker, iff no credentials have previously been provided for this slugId (otherwise would return a HTTP 403).

The queue would need to maintain a list of "used" aws-provisioner-v1 slugIds for at least 96 hours (since 96 hours is currently maximum lifetime of an aws-provisioner-v1 worker). Optionally, a worker could "deregister" as a last step before terminating (on a best effort basis). Deregistering might be overkill.
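The one-time-use semantics described above can be sketched as follows. This is a hypothetical in-memory model, not the real queue implementation: the actual service would persist the "used" slugIds, and all names here are illustrative.

```javascript
// Hypothetical sketch of one-time-use registerWorker semantics; the real
// queue would persist this state rather than hold it in memory.
const REGISTRATION_TTL_MS = 96 * 60 * 60 * 1000; // 96 hours, per the comment above

const usedSlugIds = new Map(); // slugId -> expiry timestamp (ms)

function registerWorker(provisionerId, workerType, slugId, workerGroup, workerId) {
  const now = Date.now();
  const expiry = usedSlugIds.get(slugId);
  if (expiry !== undefined && expiry > now) {
    // Credentials were already issued for this slugId: refuse (HTTP 403).
    return {status: 403};
  }
  usedSlugIds.set(slugId, now + REGISTRATION_TTL_MS);
  // Stand-in for issuing temporary worker credentials.
  return {status: 200, credentials: {workerGroup, workerId}};
}
```

A second registration attempt with the same slugId is rejected even if it claims a different workerId, which is exactly the impersonation protection the proposal is after.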

Another thing I like about this approach is that registerWorker is a great placeholder for providing other pertinent metadata, such as a JSON schema for the worker's task payload and version information about the worker ({"name": "generic-worker", "version": "v10.0.5", ....}). Worker registration would therefore logically only be necessary for provisioned workers that require temporary credentials, so people can still write their throwaway/dev workers without needing to make this call (since they can create their own credentials for the worker).

This would be a generic solution independent of the provisioner (e.g. it can be used by scl3-puppet, aws-provisioner-v1, packet-net-v1, ....), and removes one thing that all provisioners might otherwise need to implement, placing it in the queue, which is already somewhat of a worker administrator.

Lastly, it somewhat simplifies the analysis of provisioned worker pools, since all workers are registered and can be easily counted (although this can currently be calculated as a side effect of task claim/reclaims and claimWork polling, counting registration/deregistration calls is simpler).

@petemoore
Member Author

For more context, the reason I propose this is to deal with the current limitation that the aws provisioner currently needs to grant assume:worker-id:* to workers since it can't reliably know what their worker group (region) and workerId (instanceId) are. Therefore workers have the potential to impersonate other workers with different instanceIds and in different regions.

This approach would ensure that each worker can only claim to be in one region and have one instanceId, so even if it claims invalid values, it can't claim the same values as another worker, so we can still safely isolate the workers from interfering with each other.

@djmitche
Contributor

The schema metadata part would be useful for #22.

What is the utility of the slugId?

The registrations would then be "kept alive" by claimWork and reclaimTask calls, with expirations occurring pretty aggressively, so that we can have an accurate notion of the active workers, and not keep an infinite list of workers that once were :)

@petemoore
Member Author

petemoore commented Jun 22, 2017

@djmitche

What is the utility of the slugId?

To guarantee one-time-use semantics.

Consider the following use case if the slugId is removed. The provisioner grants credentials to a worker. The worker uses the credentials to register with the queue, and then begins claiming tasks. A task is executed which is able to compromise the provisioner-provided credentials (maybe they are accidentally logged in a system log which is accessible to the task user). The attacker then uses the credentials to register a fake worker of the same workerType, at which point it receives valid worker credentials. It can then use these to claim real tasks, and post malicious artifacts.

The slugId uniqueness means that once the queue has registered a worker with a given slugId, it will refuse to allow any future worker registrations that use that same slugid. Since the slugId is burned into the temporary credentials provided by the provisioner, the worker cannot use an alternative slugId (since it would not have the temporary scopes to do so).

@djmitche
Contributor

Oo, I like that -- it means that the creds provided to the host are automatically single-use-only. I think we'd need to think a little more about how that could work in the hardware case, because we also want to limit workerGroup and workerId there, and have an easy (configuration-free) way of saying that workers with a specific prefix (e.g., com.mozilla.releng.scl3.test.t-yosmite-*) can only start up as one or a few worker types. We're currently doing that with a role containing the reversed hostname, but that wouldn't work with the slugid.

@djmitche
Contributor

We discussed a variant of this proposal today in SF. Here's what we determined:

The model is three nested REST entities: provisioner, worker type, and worker

  • declareProvisioner
    • declares a provisionerId (do not expire)
    • declares "actions" that can be taken on workers
    • may later declare policies for managing those
  • declareWorkerType
    • declares a worker type, with docs and a payload schema
    • expires can be set to weeks for "production" workerTypes, or an hour for test workerTypes
  • declareWorker
    • declares a worker id, worker type, expires, boot time, and worker metadata (worker.extra)
    • expiration is extended by claimWork; an expired worker is deleted

This data is queried as follows:

  • list/get provisioner (just returning the declared data)
  • delete provisioner (administrative operation)
  • list/get worker type (by provisionerId; get returns declared data, worker count, and pending count)
  • list/get worker (by worker type; get returns declared data, including worker.extra)
  • get worker last claims (returns a best-effort list of recent taskIds claimed and when they were claimed)

Things that are still undefined:

  • scopes required for declaration
  • how actions are defined
    • defined at the provisioner level or the worker type level?
    • schema for definition?
  • the stuff above regarding slugids (@jonasfj and @petemoore are discussing this now)

Short-term implementation of this would involve the queue gathering data from claimWork, so that we don't need to modify any workers. Most of these fields will be null.
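The declareWorker/claimWork lifecycle determined above can be modeled roughly like this. This is an illustrative in-memory sketch only; the function names follow the proposal, but the parameters and durations are assumptions, not the real queue API.

```javascript
// Illustrative model of the declareWorker / claimWork lifecycle:
// a declaration has an expiry, claimWork extends it, expired workers are deleted.
const workers = new Map(); // "workerGroup/workerId" -> {workerType, extra, expires}

function declareWorker(workerGroup, workerId, workerType, extra, lifetimeMs) {
  workers.set(`${workerGroup}/${workerId}`, {
    workerType,
    extra, // worker metadata (worker.extra)
    expires: Date.now() + lifetimeMs,
  });
}

// claimWork extends the declaration's expiration.
function claimWork(workerGroup, workerId, lifetimeMs) {
  const worker = workers.get(`${workerGroup}/${workerId}`);
  if (worker) worker.expires = Date.now() + lifetimeMs;
}

// An expired worker is simply deleted, keeping the list of workers current.
function expireWorkers() {
  const now = Date.now();
  for (const [key, worker] of workers) {
    if (worker.expires <= now) workers.delete(key);
  }
}
```

Aggressive expiration plus extension-on-claim is what keeps the worker list an accurate notion of *active* workers rather than "an infinite list of workers that once were".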

@djmitche
Contributor

Pete and Jonas agreed that we will not use slugids or return credentials from declareWorker.

@djmitche
Contributor

This will require putting some data in postgres, but this is not #65. The tables we will need are something like (with * indicating primary key):

Provisioners

  • provisionerId *

WorkerTypes

  • provisionerId *
  • workerType *
  • docs
  • payload schema
  • expires

Workers

  • workerGroup *
  • workerId *
  • provisionerId *
  • workerType *
  • extra
  • boot time
  • expires
  • created (set on first declaration)
  • last claims
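The Workers table's "created (set on first declaration)" rule implies upsert semantics keyed on the composite primary key. A hedged sketch, with a hypothetical helper name and an in-memory map standing in for postgres:

```javascript
// Sketch of upserting into the Workers table described above; the composite
// primary key (workerGroup, workerId, provisionerId, workerType) becomes the map key.
const workersTable = new Map();

function upsertWorker(workerGroup, workerId, provisionerId, workerType, fields) {
  const key = [workerGroup, workerId, provisionerId, workerType].join('/');
  const existing = workersTable.get(key);
  workersTable.set(key, {
    ...fields,
    // `created` is set on the first declaration only, then preserved.
    created: existing ? existing.created : Date.now(),
  });
  return workersTable.get(key);
}
```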

@djmitche djmitche assigned djmitche and helfi92 and unassigned djmitche Jun 30, 2017
@djmitche djmitche changed the title queue.registerWorker call for gaining worker credentials and providing metadata declare provisioners, worker types, and workers Jun 30, 2017
@jhford
Contributor

jhford commented Jul 6, 2017

Aside from a registry of provisioners, I don't agree with this RFC. Besides the registering of provisioners, the information stored within is mostly a duplicate of information stored in the various provisioners or the queue, or could better be derived from other sources. Because it's a best effort service and because it does things like TTL on registration, the information in it cannot be trusted to make any real assertions.

Instead, I wonder if a better way would be to log all the tasks created for a given ProvisionerId/WorkerType in the Queue to a datastore. We could store information about which worker claims a given task in the same datastore. If we have a registry of provisioners, we can take the provisionerId from the task definition and make calls to the corresponding provisioner when that information is needed.

The advantage to doing this is that there's no registration. This means that the information is available for every workerType and every worker automatically, without the complexity of registration. It also means that we can trust this information a lot more, not least because it's based on what's actually happening, not an opt-in-best-effort-expiring-registration.

As well, the various (future) provisioners should already know which workerTypes they have, and for workerTypes which do not use a provisioner, a simple "fake provisioner" could be built which implements the best-effort-opt-in-registration model.

@petemoore
Member Author

petemoore commented Jul 6, 2017

Taking a step back, the original RFC was simply about deprecating the use of assume:worker-id:* (see original bug comment) by implementing a mechanism to ensure that one worker couldn't impersonate another worker. My original proposal was to solve this by adding a registerWorker endpoint which did not allow two callers to use the same workerId. I am open to alternative solutions.

During a meeting where unfortunately I was not present, this RFC was adapted (title changed, proposal changed) also to incorporate the extended topics of workerType and provisionerId registration.

Perhaps these extended concepts could be relocated to a second RFC, as they have wider implications.

The objectives I would like to meet are:

  1. I would like a mechanism that enforces no two workers can claim the same workerId.
  2. I would like to be able to declare a json schema for validating task definitions of a given provisionerId/workerType, such that the queue can resolve tasks as exception/malformed-payload on task submission (and not need to wait for worker to do this), and the Task Creator can validate the task definition interactively during editing.
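Objective 2 above can be sketched as follows. This is a minimal stand-in: a real queue would run full JSON-schema validation against the declared schema, and the helper names and schema shape here are purely illustrative.

```javascript
// Minimal stand-in for per-workerType payload validation at task submission.
const payloadSchemas = new Map(); // "provisionerId/workerType" -> {required: [...]}

function declarePayloadSchema(provisionerId, workerType, schema) {
  payloadSchemas.set(`${provisionerId}/${workerType}`, schema);
}

// Returns null on success, or the exception reason the queue could use to
// resolve the task immediately, without waiting for a worker to reject it.
function validateTaskPayload(provisionerId, workerType, payload) {
  const schema = payloadSchemas.get(`${provisionerId}/${workerType}`);
  if (!schema) return null; // no schema declared: nothing to enforce
  for (const field of schema.required) {
    if (!(field in payload)) return 'exception/malformed-payload';
  }
  return null;
}
```

The same declared schema is what would let the Task Creator validate a task definition interactively during editing.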

The other topics that came up in this RFC are:

  1. How can we get a handle on the various provisionerIds active at any time (where the provisionerId can also be a "fake" provisioner, or one that doesn't feed into the generic provisioner, such as releng-hardware which is a static pool rather than something that actually provisions workers). Do we see value in this, or is the data too transient to be useful?
  2. How can we see which provisionerIds have been used historically?
  3. How do we see, at a snapshot in time, which workerGroup/workerIds we believe to be active for a given provisionerId/workerType?
  4. How can we audit which workers claimed which tasks, historically?

It is questions 1-4 above that I believe should not be part of this PR, as they are scope creep from the original RFC.

@petemoore petemoore changed the title declare provisioners, worker types, and workers Prevent workers from claiming same workerId Jul 6, 2017
@petemoore petemoore changed the title Prevent workers from claiming same workerId Prevent workers from using same workerId Jul 6, 2017
@gregarndt

#82 has been created to address the expanded problem we are attempting to solve

@jhford
Contributor

jhford commented Jul 10, 2017

In order to achieve this, I wonder if the service which provides our temporary credentials (e.g. aws-provisioner, tc-host-secrets) could be updated to also generate a worker-id in the form of a slugid which is unique to each worker. If we then issued the temporary credentials with the scope assume:worker-id:<slugid> instead of assume:worker-id:* we could ensure that each worker would only be able to access the things it should be able to.

An example could be that in order to resolve a particular claiming of a task, that the same worker-id be used as was originally used to claim the task.

If we did it this way, would we still need a registry?

@petemoore
Member Author

petemoore commented Jul 10, 2017

Historically we used instanceId for workerId as it was useful for mapping data back to Amazon and associating papertrail logs to workers etc. If we used something other than the instanceId for the workerId, it certainly would overcome the issue of two workers impersonating each other, although I wonder if we might still have the problem that a worker could pretend to be running on a different instanceId (e.g. if it logs its instanceId or reports it in a task log).

I think if we could find a way to keep the workerId as the instanceId, it might make troubleshooting easier (as there is one less level of indirection when connecting to workers or auditing logs etc). That said, we probably already blindly trust the reported IP address of a worker, so maybe there is little or no advantage of using real instanceId for workerId. What do others think?

@jhford
Contributor

jhford commented Jul 10, 2017

I agree that instance-id is a much nicer thing to use for the reasons you mentioned. The problem is that there's no guarantees that instance ids are unique over a period of time. The only guarantee that I'm aware of is that two instances aren't using the same id at the same time in a given region.

It's exceedingly unlikely, but there is a chance that two different instances with the same instance id occur within the validity period of a set of temporary credentials.

I think it's possible to map the incoming request back to see if it originates from an instance with an id of i-1234, but that only works if the instance makes the connection directly and not through any sort of caching or other proxy system. Since that would involve querying the EC2 api, it would also open us up to errors or inaccuracies from relying on the api being consistent.

For example: 10.0.0.1 is initially given to i-1234. This instance shuts down and we boot a new machine, i-1235, which is also given 10.0.0.1. The EC2 API's eventual consistency is very slow, and in the meantime i-1235 asks the provisioner for credentials saying it's i-1234 and gets the credentials for i-1234. Again, this is an extremely unlikely occurrence, but still possible.

One other thing to note is that instance-ids are allocated per-region, so if we were to use them we'd need to add the region as well.

What if we used the region/instance-id in user-facing areas, but used this generated slugid in authentication related places?

@jonasfj

jonasfj commented Jul 10, 2017

In order to acheive this, I wonder if the service which provides our temporary credentials (e.g. aws-provisioner, tc-host-secrets) could be updated to also generate a worker-id

I think @jhford is spot on here!
I agree that it would be neat if we could use instance-id, but at its heart it is the responsibility of the aws-provisioner/tc-host-secrets to issue unique worker-ids.

I imagine that workerGroup/workerId could be used for:

  • debugging,
  • audit logging
  • blacklisting unhealthy workers or throttling workers that are misbehaving

The degree to which we guarantee uniqueness and security of workerIds can be provisioner-specific.

@jhford, what if, when workers call the aws-provisioner to obtain temporary credentials, they reported their instance-id to the aws-provisioner, which uses it when issuing the worker-id? As you've pointed out, the aws-provisioner can't verify the instance-id, but at this point in time the worker hasn't been running any tasks yet. Hence, the worker is only compromised if the AMI it was started from was compromised.

Alternatively, the workers could provide their EC2 hostname to the aws-provisioner. The aws-provisioner can then lookup the hostname and verify the IP of the call and embed the hostname in the workerId.

Note: The biggest concern security-wise is tasks that escape the container. If the AMI is compromised, then most likely so is aws-provisioner.

@petemoore
Member Author

petemoore commented Jul 11, 2017

@jhford, what if when workers call aws-provisioner to obtain temporary credentials they reported their instance-id to the aws-provisioner, which uses it when issuing worker-id

I think we've come full circle now. The original proposal was that the provisioner provided a scope bearing a slugId (to guarantee uniqueness), and that the worker declared its workerId in a subsequent call which would then be uniquely bound to the slugId provided by the provisioner. The queue at that point was to take care of persisting the association between the two, to make sure that the same workerId could not be associated to a different slugId.

The difference here is just that the queue (as a global entity, across provisioners) assures workerId uniqueness, and is singularly responsible for maintaining mappings, rather than each provisioner needing to maintain slugId <=> workerId mappings, and implement the same logic.
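The slugId <=> workerId binding the queue would maintain can be sketched as a pair of maps. An illustrative in-memory model with hypothetical names; the real queue would persist this:

```javascript
// Once a workerId is bound to a slugId, neither side can be re-registered
// against a different partner: the queue refuses the conflicting registration.
const workerIdBySlug = new Map();
const slugByWorkerId = new Map();

function bindWorker(slugId, workerId) {
  const boundWorker = workerIdBySlug.get(slugId);
  const boundSlug = slugByWorkerId.get(workerId);
  if ((boundWorker !== undefined && boundWorker !== workerId) ||
      (boundSlug !== undefined && boundSlug !== slugId)) {
    return false; // conflict: refuse the registration
  }
  workerIdBySlug.set(slugId, workerId);
  slugByWorkerId.set(workerId, slugId);
  return true;
}
```

Because the queue holds these maps globally, no per-provisioner slugId <=> workerId bookkeeping is needed.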

@djmitche
Contributor

djmitche commented Feb 5, 2018

@walac does the new provisioner secrets stuff support this effort?

@jhford
Contributor

jhford commented Feb 5, 2018

@djmitche it does

@djmitche
Contributor

Closing as this hasn't seen any updates in a while. It's always easy to re-open.
