Conversation
|
@escapewindow I think we talked about this in irc before the SF all-hands. Does this sound useful? Any adjustments to make? |
|
re: I feel like we should address the queue's limit on the number of tasks a task can depend on in another way. Creating dummy tasks somewhat arbitrarily breaks up the dependencies, with the consequence of an unnecessarily complex dependency tree that morphs (and thus obscures) the true dependency relationships.

I believe the reason we limit the total number of dependencies is so that they can still be inlined in the task definition without creating a task definition of unlimited size. Perhaps an alternative approach would be to allow a task to provide a dependencyList: a task artifact, which is a json file containing a list of taskIds. This also is not optimal, as it adds a layer of indirection (you need to grab the artifact to see what the dependencies are), but it does limit the size of the task definition, which addresses the problem statement that we don't want task definitions of unlimited size. However, it adds work for tools that analyse task dependencies.

I'm not sure my suggestion is any better than the original, but I am concerned that we introduce additional complexity (also for decision tasks that need to build "fake" dependency relationships) as well as to the platform as a whole (special-casing worker types), rather than tackle the root problem of how we allow an unlimited number of task dependencies without having inordinately sized task definitions (or maybe even that isn't such a bad thing). |
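To make the dependencyList indirection concrete, here is a small sketch. Note this is entirely hypothetical: `dependencyList` and the artifact name are illustrative names from the comment above, not a real Queue feature.

```javascript
// Hypothetical "dependencyList" indirection: instead of inlining thousands of
// taskIds, the task definition points at an artifact on another task that
// holds the real list. All names here are illustrative.
const taskDef = {
  provisionerId: 'aws-provisioner-v1',   // illustrative
  workerType: 'example',
  dependencyList: {taskId: 'deps-task', artifact: 'public/dependencies.json'},
};

// Tools that analyse dependencies now need one extra step: dereference the
// artifact before they can see the taskIds.
function getDependencies(task, fetchArtifact) {
  if (task.dependencyList) {
    const {taskId, artifact} = task.dependencyList;
    return fetchArtifact(taskId, artifact);  // returns the JSON list of taskIds
  }
  return task.dependencies || [];  // ordinary inline dependencies
}
```

This illustrates both halves of the trade-off: the task definition stays small, but every consumer pays the extra fetch.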
Yeah, do you think it's worth adding a special claim-and-resolve API endpoint for this? I kinda don't.. @jonasfj can you speak up regarding number of dependencies? I recall that limit is to control algorithmic complexity in the queue (O(n^2) in the number of dependencies). |
|
I think there is a quadratic complexity... it's
If it is I see no reason to introduce a limit here under say 10,000. If a task depends on 10,000 other tasks, then let it list 10,000 tasks that it depends on in its payload. More than 10,000 and we could foreseeably hit memory issues. This feels analogous to limiting filenames to 10 characters, as used to be the case in Windows, or limiting the length of a command and its arguments to 256 characters rather than e.g. 65536 characters. I think the setting of an absolute limit could be reasonable, but then it really should be something astonishingly high, that is unlikely ever to be required, and is only set in order to minimise risk of server crashes etc due to memory issues etc, rather than imposing a limit due to our ideals about how someone might want to use our tools, and then enforcing this on them. |
Thinking about this more, I see this as just a regular worker implementation, and I think it requires no action on our side. For example, Release Engineering can create an interface to allow a user to approve a release. This would simply claim and resolve a task, like you say. It is effectively like implementing their own simple worker, which just makes two API calls.

I think they should be able to name the provisionerId and workerType to values under their existing (scope) control, and the task they create can arbitrarily have a deadline that is several months into the future if it can sit around for several months without being actioned. In other words, I think all the mechanics are already in place to achieve this, including namespacing, without needing to carve out anything new.

And if we agree that introducing dummy worker types isn't needed at all (by increasing the limit on the number of dependent tasks to something more realistic) then maybe we can get away without needing to implement anything in this issue. 🎉 |
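The "implement your own simple worker with two API calls" idea can be sketched as follows. This assumes a taskcluster-client-style Queue object (`claimTask`, `reportCompleted`); the workerGroup/workerId values are made up for illustration.

```javascript
// Sketch of a trivial "approval worker": an approval UI claims a pending task
// and immediately resolves it, unblocking dependent tasks. The queue object is
// assumed to follow the taskcluster-client Queue shape; workerGroup/workerId
// values are illustrative.
async function approveRelease(queue, taskId, runId) {
  // claim the task, as any worker would
  await queue.claimTask(taskId, runId, {
    workerGroup: 'releng',
    workerId: 'shipit-approval',
  });
  // ...then resolve it right away
  await queue.reportCompleted(taskId, runId);
  return {taskId, runId, resolved: 'completed'};
}
```

The provisionerId/workerType pair named in the task definition would be under Release Engineering's existing scope control, so no platform change is needed.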
I think Jonas meant "it's not".

Dependencies aren't the only reason to want dummy tasks. We also want to use dummies to control startup of task graphs or to summarize their results, for example.

Regarding wait-for, what namespace would you suggest? Most projects only have aws-provisioner-v1/ and this definitely isn't AWS-related. Why not have a convention for this? |
@jonasfj Can you clarify if you meant I read this as
Can you elaborate?
I feel the concept of provisionerId isn't well defined - sometimes it refers to an actual provisioner service (e.g. aws-provisioner-v1), and sometimes it is just a namespace for worker types.

That said, it would be late in the day to change the term "provisioner" to something else, but in hindsight, something like workerTypeGroup might have been more appropriate. So for example, let's say Release Engineering wanted to resolve this task via a manual human approval in their ShipIt application, I could imagine them using a provisionerId/workerType pair under their own control.

The problem I see with using a global provisionerId is that it sits outside that existing namespacing. |
|
We've been talking about how to make a decision task for which the tasks don't start until after the decision task is complete, and for which you could re-run the decision task; the latest idea is to create a graph in which the other tasks all depend on a dummy task. I'm not totally sure that's the right idea, but it was another case where "dummy task" was a useful concept.

I recall it coming up in a few other conversations in SF, although details elude me, but the general idea is that it's an idea that pops up frequently, and running "echo hello world" in docker-worker is the best recommendation we have right now. I'd like to do a bit better.

I agree with thinking of "provisionerId" as "workerTypeGroup", and that trying to actually rename it is probably not worthwhile. And I see your point that there's no reason to stuff the "wait for" bit into the provisionerId any more than anywhere else in the whole worker type name. |
|
So let's consider the proposal adjusted to omit wait-for. |
An algorithm that handles N dependencies still needs a number of remote API operations linear in N. So while the number of remote API operations is linear in the size of the input, with N being sufficiently high that might not be acceptable in terms of stability.

Yeah, the magic number 100 is just a feeling, but so far those limits have worked well :) Maybe it could be 200, or 500, or 5000; the point is 100 is safe, so let's not push the envelope more than we have to... |
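For the complexity question, here is a toy model (not actual queue code) of why resolving dependent tasks can be linear rather than quadratic: with a per-task counter of unresolved dependencies and a reverse index from each task to its dependents, each dependency edge is touched exactly once across all resolutions, so total work is O(total edges).

```javascript
// Toy model of linear-time dependency resolution.
// tasks: {taskId: [dependencyIds]}
function makeGraph(tasks) {
  const remaining = {};   // unresolved-dependency counter per task
  const dependents = {};  // reverse index: taskId -> tasks that depend on it
  for (const [id, deps] of Object.entries(tasks)) {
    remaining[id] = deps.length;
    for (const d of deps) (dependents[d] = dependents[d] || []).push(id);
  }
  return {remaining, dependents};
}

// Resolving one task decrements each dependent's counter once; a task becomes
// schedulable when its counter hits zero. No pairwise (quadratic) scan needed.
function resolve(graph, id) {
  const schedulable = [];
  for (const dep of graph.dependents[id] || []) {
    if (--graph.remaining[dep] === 0) schedulable.push(dep);
  }
  return schedulable;
}
```

This is only a model of the algorithmic point, not of how the queue stores state; with a database backing, each decrement is still a remote operation, which is where the stability concern above comes from.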
My feeling here is we could do some testing to see what the potential range could realistically be. For example, we could set it to 10,000 and see how long various http requests take to complete (and if they might hit the 30s timeout).

I'm in favour of an upper limit, I just think it should be something higher, to make breaking up dependencies into multiple dummy tasks something that should almost never be required, just like you should almost never need to execute a command whose arguments have more than 65536 characters, or have a filename that has more than 256 characters, or a directory that has more than 65536 files inside of it, or an environment block that has more than 32768 bytes on Windows. All these examples are limits that were dramatically increased from former extremely conservative settings, such that 99.9% of use cases were covered by increasing the limit substantially (now I'm just making up statistics on the spot).

But you see the idea - a limit is reasonable, but let's set it way above what the normal use case is likely to be, and make sure, via testing, that it is still well within a range that we can comfortably handle (i.e. HTTP responses don't time out, we don't run out of memory, operations don't hog the OS and impact other resources etc). However, the last of these we can probably not concern ourselves with, since we've agreed it is O(N), and the overall change is likely to add less overhead rather than more, due to the fewer number of tasks to process, but the same total number of dependencies. |
|
I can see the usefulness of a task that automatically fails or succeeds. I would also possibly suggest a task that fails or succeeds based on the results of its dependencies. I'm not sure I see the usefulness of the wait-for task; isn't that solvable by scheduling it against a specific workerType? |
|
My feeling here is we could do some testing to see what the potential range could realistically be
We have some experience with this from task-graph-scheduler... where 1k caused issues.. not all the time, but sometimes... These issues can't be reliably reproduced because they are effects of network degradation on top of other things...

But if we moved to postgres it might be a different story... Granted, I still think we want a limit to avoid huge locks on the database...
[edit(@djmitche): fix email formatting]
|
|
OK, let me see if I can summarize the topics here:
It would be pretty cool if we could lift the dependency limit by a factor of 10 or more, but it's only tangentially related to this RFC, in that one of the uses of dummy tasks would be to work around the dependency limit. A few examples of other uses for dummy tasks, to help re-focus the RFC:
..and maybe @escapewindow has some more ideas as suggested in his comment above. |
|
(proposal updated to match) |
|
I wrote up http://escapewindow.dreamwidth.org/243472.html#dummy_jobs a while back. Some of these ideas might belong here; it's possible some others may either be unneeded or complex enough that we may want to split them out into some other thing.

- success / failure (status?) tasks
- timer tasks
- breakpoint tasks
- notification tasks
|
|
Sorry to let this sit for so long! I don't understand all of them, but some make good sense, especially the breakpoint. I'd like to start with the two currently in the proposal, and then we can look at adding more later.

I think there's value in having these sorts of "utility tasks" in some shared context that everyone has access to. However, now that I think about this, and now that I've gotten into the queue code, I don't think it makes sense to special-case these. Rather, I propose we build these in a super-simple service. I'll update the first comment accordingly. |
|
Pete's on PTO for a while, so I will wait until he's back to see if we can decide on this. |
++ sgtm :-) |
|
FCP ends August 30th. |
|
Okay, so this is a proposal for: success/fail dummy tasks? Finally, can we call it provisionerId: taskcluster or built-in? |
|
Off topic: the limit on the number of dependencies is possibly too small. We could likely increase it, but it's probably best done with the postgres migration or something. |
|
I like provisionerId built-in. Dependencies count is definitely off-topic :) |
|
crazy idea: it could be a json-e worker-type:

```js
task.provisionerId = 'static-v1'
task.workerType = 'json-e-v1'
task.payload = {
  // json-e expression, that given the variable "dependencies" as a list of task status
  // structures for each task.dependencies evaluates to:
  success: true,
  artifacts: {
    "public/live.log": {text: "hello world!"},
    "public/live.json": {json: {result: "ok"}},
    "public/release-binary": {url: "https://...."},
  }
}
```

Async support in json-e would be neat as it would allow for fetching and grepping dependencies/artifacts on-demand... |
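As a toy stand-in for what such a json-e worker might compute (the real proposal would evaluate an arbitrary json-e template; this hard-codes one policy, "succeed iff all dependencies completed", and the artifact names are illustrative):

```javascript
// Toy stand-in for the json-e worker idea: given the status of each
// dependency, compute this task's own resolution and artifacts.
// dependencies: [{taskId, state}, ...] where state is e.g. 'completed'/'failed'
function evaluatePayload(dependencies) {
  const failed = dependencies.filter(d => d.state !== 'completed');
  return {
    success: failed.length === 0,
    artifacts: {
      // mirror the shape of the sketch above: a small JSON summary artifact
      'public/live.json': {
        json: {
          result: failed.length === 0 ? 'ok' : 'failed',
          failed: failed.map(d => d.taskId),  // which dependencies broke it
        },
      },
    },
  };
}
```

A json-e template would let each task graph define this policy itself instead of baking one in, which is the appeal of the idea.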
|
That sounds more like a different kind of utility worker (sort of like lambda: one that doesn't run in any specific "environment" but just executes code given right in the task definition). Interesting idea, but not what I'm going for here. I feel like this is lacking motivation, really. Without a strong motivation we seem to wander a bit. I don't personally have a use for these workerTypes, and it's not clear the release mechanics would find them immediately useful either. I think I'll set this aside until we do have a strong motivating case. |
|
I think the only concrete use case we have is for the 'dummy' tasks to work around the dependency limit. |
|
I think status/notification tasks would be useful for porting releasetasks graphs, where status is just a single task you can monitor to see the status of all its dependencies. |
|
One of the use cases for dummy workers would be grouping parallel tasks by some property, platform for example. |
If that is the case, I think this is best solved by bumping the limit to something much higher, e.g. 10000 (assuming load testing shows no problems), and parking the dummy-tasks design. |
|
OK, we'll continue to leave the dependency-limit issue out for now (emphasis mine).

@escapewindow and @rail have described another concrete use for the "succeed" workerType. The notifications from that can come from taskcluster-notify, so this workerType doesn't need to do anything notification-related directly.

Regarding @jonasfj's suggestion about a single workerType, I think it's a minor implementation difference, and I think "a worker that always fails" is easier to comprehend and invites less feature-bolting-on than "a worker that resolves with a result based on its payload". So I'm going to stick with the two workerTypes.

I've updated the first comment to reflect:
Any additional adjustments (that do not relate to the dependency limit) before we re-try final comment here? |
|
..specifically @jonasfj @petemoore |
|
https://bugzilla.mozilla.org/show_bug.cgi?id=1318253#c3 -- a use of a dummy workerType (and using an outdated workerType at that) |
|
Well, that's been 5 months with no further comment, so I suspect we can go to a final comment period here :) |
|
This will finish final comment period in 6 days (I forgot to say so yesterday) |
Motivation
The Taskcluster team recommends use of dummy tasks for various purposes, and release engineering has also found them useful for various unusually-behaved tasks.
Proposal
Taskcluster should supply a collection of special-cased workerTypes with simple, predefined, useful behaviors, gathered under the built-in provisionerId [*].

- built-in/succeed -- When a task of this worker type is scheduled, it is immediately resolved as successful.
- built-in/fail -- When a task of this worker type is scheduled, it is immediately resolved as failed.
Scopes for these workerTypes would be given to assume:repo:*. Since the tasks do not do anything interesting, store any potentially-compromisable state, or allow pending tasks, everyone can share the same workerTypes.

[*] this is treating provisionerId as something closer to "workerTypeGroup", since there is no provisioner service associated with this provisionerId.
Implementation
The new workers would be implemented in a very simple, single-instance service called "taskcluster-built-in-workers" that simultaneously polls all of the given workerTypes.
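A minimal sketch of that polling loop, assuming a Queue client with claimWork / reportCompleted / reportFailed methods and simplifying each claim to a bare `{taskId, runId}` object; treat the details as illustrative rather than the actual service implementation:

```javascript
// One polling pass of a hypothetical taskcluster-built-in-workers service:
// claim work for each built-in workerType and resolve it immediately.
async function pollOnce(queue) {
  const resolved = [];
  for (const workerType of ['succeed', 'fail']) {
    const {tasks} = await queue.claimWork('built-in', workerType, {
      tasks: 32,                         // claim up to 32 tasks per call
      workerGroup: 'built-in',
      workerId: `built-in-${workerType}`,
    });
    for (const {taskId, runId} of tasks) {
      // the only "work" is resolving the task according to its workerType
      if (workerType === 'succeed') {
        await queue.reportCompleted(taskId, runId);
      } else {
        await queue.reportFailed(taskId, runId);
      }
      resolved.push(`${workerType}:${taskId}`);
    }
  }
  return resolved;
}
```

Since the service keeps no state and every claim resolves instantly, a single instance looping over both workerTypes is sufficient.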