Prevent workers from using same workerId #71
For more context, the reason I propose this is to deal with the current limitation that the aws provisioner currently needs to grant This approach would ensure that each worker can only claim to be in one region and have one instanceId, so even if it claims invalid values, it can't claim the same values as another worker, so we can still safely isolate the workers from interfering with each other. |
The schema metadata part would be useful for #22. What is the utility of the slugId? The registrations would then be "kept alive" by |
To guarantee one-time-use semantics. Consider the following use case if the slugId is removed. The provisioner grants credentials to a worker. The worker uses the credentials to register with the queue, and then begins claiming tasks. A task is executed which is able to compromise the provisioner-provided credentials (maybe they are accidentally logged in a system log which is accessible to the task user). The attacker then uses the credentials to register a fake worker of the same workerType, at which point it receives valid worker credentials. It can then use these to claim real tasks and post malicious artifacts. The slugId uniqueness means that once the queue has registered a worker with a given slugId, it will refuse any future worker registration that uses that same slugId. Since the slugId is burned into the temporary credentials provided by the provisioner, the worker cannot use an alternative slugId (since it would not have the temporary scopes to do so). |
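To make the last point concrete, here is a minimal sketch of how a scope check would stop a worker from registering under a different slugId. It assumes the usual scope-satisfaction rule (an exact match, or a held scope ending in `*` that matches as a prefix). The helper names `satisfies` and `register_scope` and the sample worker type are illustrative only, not part of any real client library.

```python
def satisfies(held_scopes, required_scope):
    """Return True if any held scope grants required_scope.

    Assumed rule: a held scope matches exactly, or, if it ends in '*',
    it matches any required scope starting with the text before the '*'.
    """
    for held in held_scopes:
        if held == required_scope:
            return True
        if held.endswith("*") and required_scope.startswith(held[:-1]):
            return True
    return False


def register_scope(provisioner_id, worker_type, slug_id):
    # Scope format taken from the proposal in this issue.
    return f"queue:register-worker:{provisioner_id}/{worker_type}/{slug_id}"


# Hypothetical credentials issued by the provisioner for a single worker.
held = [register_scope("aws-provisioner-v1", "example-worker-type", "slugA")]

# The worker can register with the slugId burned into its credentials...
assert satisfies(
    held, register_scope("aws-provisioner-v1", "example-worker-type", "slugA")
)
# ...but not with any other slugId.
assert not satisfies(
    held, register_scope("aws-provisioner-v1", "example-worker-type", "slugB")
)
```

Because the held scope carries no trailing `*`, only the exact slugId issued by the provisioner passes the check.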
Oo, I like that -- it means that the creds provided to the host are automatically single-use-only. I think we'd need to do a little more contemplating how that could work in the hardware case, because we also want to limit workerGroup and workerId there, and have an easy (configuration-free) way of saying that workers with a specific prefix (e.g., |
We discussed a variant of this proposal today in SF. Here's what we determined: The model is three nested REST entities: provisioner, worker type, and worker
This data is queried as follows:
Things that are still undefined:
Short-term implementation of this would involve the queue gathering data from claimWork, so that we don't need to modify any workers. Most of these fields will be null. |
Pete and Jonas agreed that we will not use slugids or return credentials from |
This will require putting some data in postgres, but this is not #65. The tables we will need are something like (with
|
Aside from a registry of provisioners, I don't agree with this RFC. Besides the registration of provisioners, the information stored within is mostly a duplicate of information stored in the various provisioners or the queue, or could better be derived from other sources. Because it's a best-effort service and because it does things like TTL on registration, the information in it cannot be trusted to make any real assertions. Instead, I wonder if a better way would be to log all the tasks created for a given provisionerId/workerType in the queue to a datastore. We could store information about which worker claims a given task in the same datastore. If we have a registry of provisioners, we can take the provisionerId from the task definition and make calls to the corresponding provisioner when that information is needed. The advantage of doing this is that there's no registration. This means that the information is available for every workerType and every worker automatically, without the complexity of registration. It also means that we can trust this information a lot more, not least because it's based on what's actually happening, not an opt-in-best-effort-expiring-registration. As well, the various (future) provisioners should already know which workerTypes they have, and for those which do not use a provisioner, a simple "fake provisioner" could be built which does the best-effort-opt-in-registration model. |
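As a rough illustration of the claim-log alternative described above, the sketch below derives a per-workerType view of workers purely from recorded task claims, with no registration step. The `Claim` record and its fields are hypothetical; the thread does not specify what the queue would actually store.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Claim:
    # Hypothetical record written by the queue each time a task is claimed.
    provisioner_id: str
    worker_type: str
    worker_group: str
    worker_id: str
    task_id: str
    claimed_at: datetime


def last_seen_by_worker(claims):
    """Map (provisionerId, workerType) -> {(workerGroup, workerId): latest claim time}."""
    seen = defaultdict(dict)
    for c in claims:
        workers = seen[(c.provisioner_id, c.worker_type)]
        key = (c.worker_group, c.worker_id)
        if key not in workers or c.claimed_at > workers[key]:
            workers[key] = c.claimed_at
    return dict(seen)


# Sample data, invented for the example.
claims = [
    Claim("aws-provisioner-v1", "example-worker-type", "us-west-2", "i-0001",
          "task-1", datetime(2017, 7, 1, 12, 0, tzinfo=timezone.utc)),
    Claim("aws-provisioner-v1", "example-worker-type", "us-west-2", "i-0001",
          "task-2", datetime(2017, 7, 1, 13, 0, tzinfo=timezone.utc)),
    Claim("aws-provisioner-v1", "example-worker-type", "us-east-1", "i-0002",
          "task-3", datetime(2017, 7, 1, 12, 30, tzinfo=timezone.utc)),
]

workers = last_seen_by_worker(claims)[("aws-provisioner-v1", "example-worker-type")]
print(len(workers))  # number of distinct workers seen for this workerType
```

A view built this way only covers workers that have claimed at least one task, which is the trade-off against an explicit registration call.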
Taking a step back, the original RFC was simply about deprecating the use of During a meeting where unfortunately I was not present, this RFC was adapted (title changed, proposal changed) also to incorporate the extended topics of workerType and provisionerId registration. Perhaps these extended concepts could be relocated to a second RFC, as they have wider implications. The objectives I would like to meet are:
The other topics that came up in this RFC are:
It is the questions 1-4 above that I believe should not be part of this PR, as this is scope creep from the original RFC. |
#82 has been created to address the expanded problem we are attempting to solve |
In order to achieve this, I wonder if the service which provides our temporary credentials (e.g. aws-provisioner, tc-host-secrets) could be updated to also generate a An example could be that, in order to resolve a particular claim of a task, the same worker-id must be used as was originally used to claim the task. If we did it this way, would we still need a registry? |
Historically we used instanceId for workerId as it was useful for mapping data back to Amazon and associating papertrail logs to workers etc. If we used something other than the instanceId for the workerId, it certainly would overcome the issue of two workers impersonating each other, although I wonder if we might still have the problem that a worker could pretend to be running on a different instanceId (e.g. if it logs its instanceId or reports it in a task log). I think if we could find a way to keep the workerId as the instanceId, it might make troubleshooting easier (as there is one less level of indirection when connecting to workers or auditing logs etc). That said, we probably already blindly trust the reported IP address of a worker, so maybe there is little or no advantage of using real instanceId for workerId. What do others think? |
I agree that instance-id is a much nicer thing to use for the reasons you mentioned. The problem is that there are no guarantees that instance ids are unique over a period of time. The only guarantee that I'm aware of is that two instances aren't using the same id at the same time in a given region. It's exceedingly unlikely, but there is a chance that two different instances with the same instance id occur within the validity period of a set of temporary credentials. I think it's possible to map the incoming request back to see if it originates from an instance with an id of The example is One other thing to note is that instance-ids are allocated per-region, so if we were to use them we'd need to add the region as well. What if we used the region/instance-id in user-facing areas, but used this generated slugId in authentication-related places? |
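The split suggested at the end of that comment could look roughly like the sketch below: a readable region/instance-id pair for user-facing output, plus a separately generated identifier for authentication. It assumes a slugId is the URL-safe base64 encoding of a random (v4) UUID with the padding stripped, which yields 22 characters. The `WorkerIdentity` class and its field names are hypothetical.

```python
import base64
import uuid
from dataclasses import dataclass, field


def new_slugid():
    """URL-safe base64 of a random v4 UUID with '=' padding removed (22 chars)."""
    return base64.urlsafe_b64encode(uuid.uuid4().bytes).rstrip(b"=").decode("ascii")


@dataclass(frozen=True)
class WorkerIdentity:
    region: str        # instance ids are allocated per region
    instance_id: str   # human-friendly, but not guaranteed unique over time
    auth_id: str = field(default_factory=new_slugid)  # used for authentication

    @property
    def display_name(self):
        # What users would see in logs and UIs.
        return f"{self.region}/{self.instance_id}"


# Two instances that happen to reuse an instance id remain distinguishable
# wherever the generated auth_id is used. Values are invented for the example.
first = WorkerIdentity("us-west-2", "i-0123456789abcdef0")
second = WorkerIdentity("us-west-2", "i-0123456789abcdef0")
print(first.display_name == second.display_name)  # same user-facing name
print(first.auth_id != second.auth_id)            # different authentication id
```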
I think @jhford is spot on here! I imagine that
The degree to which we guarantee uniqueness and security of @jhford, what if, when workers call aws-provisioner to obtain temporary credentials, they reported their instance-id to the aws-provisioner, which would use it when issuing the worker-id? As you've pointed out, the aws-provisioner can't verify the instance-id, but at this point in time the worker hasn't run any tasks yet. Hence, the worker is only compromised if the AMI it was started from was compromised. Alternatively, the workers could provide their EC2 hostname to the aws-provisioner. The aws-provisioner can then look up the hostname, verify the IP of the call, and embed the hostname in the Note: The biggest concern security-wise is tasks that escape the container. If the AMI is compromised, then most likely so is aws-provisioner. |
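The hostname check mentioned above might be sketched as follows. This is an assumption-laden illustration, not how aws-provisioner works: the function name `hostname_matches_caller` is made up, and the resolver is passed in as a parameter so the example can run without network access (the default is the standard library's `socket.gethostbyname_ex`).

```python
import socket


def hostname_matches_caller(hostname, caller_ip, resolve=socket.gethostbyname_ex):
    """Return True if hostname resolves to the IP the request came from.

    resolve must return a (name, aliases, ip_addresses) triple, like
    socket.gethostbyname_ex. Resolution failures count as a mismatch.
    """
    try:
        _name, _aliases, addresses = resolve(hostname)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return False
    return caller_ip in addresses


# Stand-in resolver with invented data, so the example needs no DNS lookup.
def fake_resolve(hostname):
    table = {"worker-1.example.internal": ["10.0.0.5"]}
    if hostname not in table:
        raise OSError("unknown host")
    return (hostname, [], table[hostname])


print(hostname_matches_caller("worker-1.example.internal", "10.0.0.5", fake_resolve))
print(hostname_matches_caller("worker-1.example.internal", "10.0.0.9", fake_resolve))
```

The check only ties the claimed hostname to the source address of the call; it says nothing about whether the software on that host is trustworthy.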
I think we've come full circle now. The original proposal was that provisioner provided a scope bearing a slugId (to guarantee uniqueness), and that the worker declared its workerId in a subsequent call which would then be uniquely bound to the slugId provided by the provisioner. The queue at that point was to take care of persisting the association between the two, to make sure that the same workerId could not be associated to a different slugId. The difference here is just that the queue (as a global entity, across provisioners) assures workerId uniqueness, and is singularly responsible for maintaining mappings, rather than each provisioner needing to maintain slugId <=> workerId mappings, and implement the same logic. |
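A minimal sketch of the bookkeeping described above, with the queue as the single place that binds a slugId to a worker. The class and exception names are hypothetical, and an in-memory dict stands in for whatever persistent store the queue would really use.

```python
class BindingConflict(Exception):
    """Raised when a slugId or worker is already bound to something else."""


class WorkerBindings:
    def __init__(self):
        self._slug_by_worker = {}
        self._worker_by_slug = {}

    def bind(self, slug_id, worker_key):
        """Bind slug_id to worker_key, refusing any conflicting association.

        worker_key is assumed to be a tuple such as
        (provisionerId, workerType, workerGroup, workerId).
        """
        bound_slug = self._slug_by_worker.get(worker_key)
        if bound_slug is not None and bound_slug != slug_id:
            raise BindingConflict(f"{worker_key} is already bound to another slugId")
        bound_worker = self._worker_by_slug.get(slug_id)
        if bound_worker is not None and bound_worker != worker_key:
            raise BindingConflict(f"slugId {slug_id} is already bound to another worker")
        self._slug_by_worker[worker_key] = slug_id
        self._worker_by_slug[slug_id] = worker_key


bindings = WorkerBindings()
worker = ("aws-provisioner-v1", "example-worker-type", "us-west-2", "i-0001")
bindings.bind("slugA", worker)

# A second worker presenting a different slugId cannot claim the same workerId.
try:
    bindings.bind("slugB", worker)
except BindingConflict as exc:
    print("refused:", exc)
```

Keeping both directions of the mapping in one service is what lets the check apply across provisioners rather than being reimplemented in each.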
@walac does the new provisioner secrets stuff support this effort? |
@djmitche it does |
Closing as this hasn't seen any updates in a while. It's always easy to re-open. |
Originating from bug 1374978:
What if the provisioner continued to provide credentials, as it currently does, but instead of granting the current scopes:
assume:worker-id:*
assume:worker-type:aws-provisioner-v1/<workerType>
it could instead grant the single scope:
queue:register-worker:<provisionerId>/<workerType>/<slugid>
Then once the worker starts up, it could call
queue.registerWorker(provisionerId, workerType, slugId, workerGroup, workerId)
using the provisioner-provided credentials, and this queue call would return new temporary credentials for the worker, iff no credentials have previously been provided for this slugId (otherwise it would return an HTTP 403).
The queue would need to maintain a list of "used" aws-provisioner-v1 slugIds for at least 96 hours (since 96 hours is currently the maximum lifetime of an aws-provisioner-v1 worker). Optionally, a worker could "deregister" as a last step before terminating (on a best-effort basis). Deregistering might be overkill.
Another thing I like about this approach is that registerWorker is a great placeholder to provide other pertinent metadata, such as a JSON schema for the worker's task payload, or version information about the worker ({"name": "generic-worker", "version": "v10.0.5", ....}). Worker registration would therefore logically only be necessary for provisioned workers that require temporary credentials, so people can still write their throwaway/dev workers without needing to make this call (since they can create their own credentials for the worker).
This would be a generic solution independent of provisioner (e.g. it can be used by scl3-puppet, aws-provisioner-v1, packet-net-v1, ....), and removes one thing that all provisioners might otherwise need to implement, placing it in the queue, which already is somewhat of a worker administrator.
Lastly, it somewhat simplifies the analysis of provisioned worker pools, since all workers are registered and can be easily counted (although this can currently be calculated as a side effect of task claim/reclaims and claimWork polling, counting registration/deregistration calls is simpler).
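The one-time registration rule and the 96-hour retention window from the proposal can be sketched as below. This is only an illustration under stated assumptions: `RegistrationLedger` and `SlugIdAlreadyUsed` are invented names, the store is an in-memory dict, and the returned dict is a placeholder for the temporary credentials a real queue would issue.

```python
from datetime import datetime, timedelta, timezone

# 96 hours is the maximum worker lifetime quoted in the proposal.
RETENTION = timedelta(hours=96)


class SlugIdAlreadyUsed(Exception):
    """A real service would translate this into an HTTP 403 response."""


class RegistrationLedger:
    def __init__(self, retention=RETENTION):
        self._retention = retention
        self._used = {}  # slugId -> time of first registration

    def register(self, slug_id, now=None):
        now = now or datetime.now(timezone.utc)
        self._purge(now)
        if slug_id in self._used:
            raise SlugIdAlreadyUsed(slug_id)
        self._used[slug_id] = now
        # Placeholder for the temporary worker credentials.
        return {"slugId": slug_id, "registeredAt": now}

    def _purge(self, now):
        # Forget slugIds older than the retention window. This is only safe
        # if the provisioner-issued credentials have expired by then.
        cutoff = now - self._retention
        self._used = {s: t for s, t in self._used.items() if t >= cutoff}


ledger = RegistrationLedger()
t0 = datetime(2017, 7, 1, tzinfo=timezone.utc)
ledger.register("slugA", now=t0)
try:
    ledger.register("slugA", now=t0 + timedelta(hours=1))
except SlugIdAlreadyUsed:
    print("second registration refused")
```

A deregister call, if wanted, would not need to remove the slugId from the ledger; the entry has to stay until the window passes for the one-time guarantee to hold.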