Miscellaneous Workers #75

Merged
djmitche merged 2 commits into master from rfc75 on Feb 27, 2018

Conversation

@djmitche
Contributor

@djmitche djmitche commented Feb 7, 2018

Motivation

The Taskcluster team recommends dummy tasks for a variety of purposes, and release engineering has also found them useful for unusually-behaved tasks. Examples include:

  • sending a message or being able to check when some subset of a task group is complete (e.g., all Windows builds)
  • "faking out" production-only tasks when making development or staging task graphs (e.g., replace the balrog update task with built-in/succeed)

Proposal

Taskcluster should supply a collection of special-cased workerTypes with
simple, predefined, useful behaviors, gathered under the built-in provisionerId [*].

  • built-in/succeed -- When a task of this worker type is scheduled, it is
    immediately resolved as successful.

  • built-in/fail -- When a task of this worker type is scheduled, it is
    immediately resolved as failed.

Scopes for these workerTypes would be granted to assume:repo:*. Since the tasks do nothing interesting, store no potentially-compromisable state, and never accumulate pending tasks, everyone can share the same workerTypes.

[*] this is treating provisionerId as something closer to "workerTypeGroup", since there is no provisioner service associated with this provisionerId.
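For illustration, a task targeting one of these workerTypes might look like the following sketch (all identifiers, dates, and metadata values are placeholders, not part of the proposal):

// Hypothetical task definition using built-in/succeed; values are placeholders.
const taskcluster = require('taskcluster-client');
const task = {
  provisionerId: 'built-in',
  workerType: 'succeed',
  created: new Date().toJSON(),
  deadline: taskcluster.fromNowJSON('1 day'),
  dependencies: ['<taskId-1>', '<taskId-2>'],
  requires: 'all-completed',   // schedule only once every dependency has succeeded
  metadata: {
    name: 'group marker',
    description: 'Dummy task that resolves when its dependencies are complete',
    owner: 'someone@example.com',
    source: 'https://example.com/',
  },
  payload: {},                 // the built-in workers ignore the payload entirely
};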

Implementation

The new workers would be implemented in a very simple, single-instance service called "taskcluster-built-in-workers" that simultaneously polls all of the given workerTypes.
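A minimal sketch of what that service's polling loop could look like, assuming the Node taskcluster-client library (claimWork, reportCompleted, and reportFailed are existing queue APIs; the structure of the service itself is illustrative, not the final implementation):

// Sketch of the built-in workers service: claim tasks for each built-in
// workerType and resolve them immediately. Credentials are placeholders.
const taskcluster = require('taskcluster-client');
const queue = new taskcluster.Queue({
  credentials: {clientId: '...', accessToken: '...'},
});

const poll = async (workerType, resolve) => {
  while (true) {
    // long-poll for up to 32 pending tasks of this workerType
    const {tasks} = await queue.claimWork('built-in', workerType, {
      workerGroup: 'built-in',
      workerId: workerType,
      tasks: 32,
    });
    // resolve each claimed run right away
    await Promise.all(tasks.map(({status, runId}) => resolve(status.taskId, runId)));
  }
};

// run both loops concurrently (fire-and-forget is fine for this sketch)
poll('succeed', (taskId, runId) => queue.reportCompleted(taskId, runId));
poll('fail', (taskId, runId) => queue.reportFailed(taskId, runId));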

@djmitche djmitche self-assigned this Jul 3, 2017
@djmitche
Contributor Author

djmitche commented Jul 3, 2017

@escapewindow I think we talked about this on IRC before the SF all-hands. Does this sound useful? Any adjustments to make?

@petemoore
Member

re: wait-for/<workerType> - you can only resolve a task you have already claimed, so with current mechanics, it would require at least two API calls. Also note that if the second call is not within around 20 mins of the first call, you could end up with a state of exception/claim-expired.
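(For reference, a sketch of those two calls using the Node taskcluster-client — claimTask and reportCompleted are existing queue endpoints; the credentials and worker identifiers here are placeholders:)

// The claim-then-resolve flow described above, as a worker would perform it.
const taskcluster = require('taskcluster-client');
const queue = new taskcluster.Queue({
  credentials: {clientId: '...', accessToken: '...'},
});

const claimAndResolve = async (taskId, runId) => {
  // first call: claim the run
  await queue.claimTask(taskId, runId, {
    workerGroup: 'example-group',
    workerId: 'example-worker',
  });
  // second call: must land before the claim's takenUntil (~20 minutes),
  // or the run ends up as exception/claim-expired
  await queue.reportCompleted(taskId, runId);
};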

I feel like we should address the queue's limit on the number of tasks a task can depend on in another way - creating dummy tasks somewhat arbitrarily breaks up the dependencies, with the consequence of an unnecessarily complex dependency tree that morphs (and thus obscures) the true dependency relationships.

I believe the reason we limit the total number of dependencies is so that they can still be inlined in the task definition without creating a task definition of unlimited size. Perhaps an alternative approach would be to allow a task to provide a dependencyList: a task artifact, which is a JSON file containing a list of taskIds.

This is also not optimal, as it adds a layer of indirection (you need to fetch the artifact to see what the dependencies are), but it does limit the size of the task definition, which addresses the problem statement that we don't want task definitions of unlimited size. However, it adds work for tools that analyse task dependencies.

I'm not sure my suggestion is any better than the original, but I am concerned that we introduce additional complexity for decision tasks (which would need to build "fake" dependency relationships) and for the platform as a whole (special-casing worker types), rather than tackling the root problem: how do we allow an unlimited number of task dependencies without inordinately sized task definitions (or maybe unlimited-size definitions aren't even such a bad thing)?

@djmitche
Contributor Author

djmitche commented Jul 5, 2017

re: wait-for/ - you can only resolve a task you have already claimed, so with current mechanics, it would require at least two API calls.

Yeah, do you think it's worth adding a special claim-and-resolve API endpoint for this? I kinda don't..

@jonasfj can you speak up regarding number of dependencies? I recall that limit is to control algorithmic complexity in the queue (O(n^2) in the number of dependencies).

@jonasfj

jonasfj commented Jul 5, 2017

I thinking there is a quadratic complexity... it's O(N) but N*3 or so... hence, each dependency adds something like 2 or 3 operations... which slows things down -- and hence, creates instability if it goes too high...

@petemoore
Member

petemoore commented Jul 6, 2017

I thinking there is a quadratic complexity... it's O(N) but N*3 or so... hence, each dependency adds something like 2 or 3 operations... which slows things down -- and hence, creates instability if it goes too high...

If it is O(N) then there is no benefit from splitting up dependencies - the total processing time is the same; in fact, there is no doubt increased load when splitting into multiple tasks, as there is per-task overhead, in addition to the other disadvantages listed above.

I see no reason to introduce a limit here under, say, 10,000. If a task depends on 10,000 other tasks, then let it list those 10,000 tasks in its definition. More than 10,000 and we could foreseeably hit memory issues.

This feels analogous to limiting filenames to 10 characters, as used to be the case in Windows, or limiting the length of a command and its arguments to 256 characters rather than e.g. 65536 characters. Setting an absolute limit could be reasonable, but then it really should be something astonishingly high that is unlikely ever to be required, set only to minimise the risk of server crashes due to memory issues and the like - rather than a limit imposed because of our ideals about how someone might want to use our tools, and then enforced on them.

@petemoore
Member

petemoore commented Jul 6, 2017

re: wait-for/ - you can only resolve a task you have already claimed, so with current mechanics, it would require at least two API calls.

Yeah, do you think it's worth adding a special claim-and-resolve API endpoint for this? I kinda don't..

Thinking about this more, I see this as just a regular worker implementation, and I think it requires no action on our side. For example, Release Engineering can create an interface to allow a user to approve a release. This would simply claim and resolve a task, like you say. It is effectively like implementing their own simple worker, which just makes two API calls. I think they should be able to set the provisionerId and workerType to values under their existing (scope) control, and the task they create can have a deadline several months into the future if it needs to sit around that long without being actioned. In other words, I think all the mechanics are already in place, including namespacing, for this to be implemented without needing to carve out wait-for/<workerType> or define anything new on our side.

And if we agree that introducing dummy worker types isn't needed at all (by increasing the limit on number of dependent tasks to something more realistic) then maybe we can get away without needing to implement anything in this issue. 🎉

@djmitche
Contributor Author

djmitche commented Jul 6, 2017

If it is O(N) then there

I think Jonas meant "it's not O(N) but N*3" - that pesky habit of forgetting "not". Read as O(N^3), 10,000 dependencies gives 10^12 operations, which gets into memory/time issues. Even 100 dependencies gives 1,000,000 operations.

Dependencies aren't the only reason to want dummy tasks. We also want to use dummies to control startup of task graphs or to summarize their results, for example.

Regarding wait-for, what namespace would you suggest? Most projects only have aws-provisioner-v1/ and this definitely isn't AWS-related. Why not have a convention for this?

@petemoore
Member

petemoore commented Jul 7, 2017

I think Jonas meant "it's not O(N) but N*3" - that pesky habit of forgetting "not".

@jonasfj Can you clarify if you meant O(N) or O(N^3)?

I read this as O(N) based on "hence, each dependency adds something like 2 or 3 operations".

Dependencies aren't the only reason to want dummy tasks. We also want to use dummies to control startup of task graphs or to summarize their results, for example.

Can you elaborate?

Regarding wait-for, what namespace would you suggest? Most projects only have aws-provisioner-v1/ and this definitely isn't AWS-related. Why not have a convention for this?

I feel the concept of provisionerId isn't well defined - sometimes it is an actual provisioner (aws-provisioner-v1), sometimes the task is just a placeholder for artifacts and we use a non-existent provisioner (null-provisioner), sometimes it refers to a static farm of workers (releng-hardware), sometimes to an interface to another system (buildbot-bridge). My feeling is that if we expand this concept even further, to include semantic handling (e.g. wait-for), we move further away from the concept of a provisioner being something that provisions workers.

That said, it would be late in the day to change the term "provisioner" to something else, but in hindsight, something like workerTypeGroup might have been more appropriate.

So for example, let's say Release Engineering wanted to resolve this task via a manual human approval in their ShipIt application: I could imagine them using provisionerId = "shipIt" and workerType = "promoteRelease". It feels ugly using an application name for provisionerId, I agree - but I think that all comes down to the fact that including provisionerId in a task definition assumes that all workers are provisioned, which in the case of a web application is not the case. Because the application may have several different approval steps built in, it may wish to use a different workerType for each step, but use "shipIt" as the encompassing provisionerId. The problem is that what we call provisionerId is really a container for a set of worker types, so in this case that would be the application name "shipIt", and the worker types would represent the different types of work that application might want to do.

The problem I see with using a global wait-for provisionerId is that it is shared across teams/projects, so we have to be careful to ensure that no two projects have task-claim/task-resolve rights to the same worker types within the wait-for namespace (so they can't resolve each other's tasks). By allowing the provisionerId to vary, we have a simpler way of separating namespaces for different projects, and the provisionerId provides insight into which application is resolving these tasks.
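(For illustration, that per-project separation falls out of the queue's claim scopes; a hypothetical grant for the ShipIt example might look like this - the scope patterns are the queue's claim-task and worker-id forms, the names are invented:)

// Hypothetical scopes for a ShipIt client; only a client holding these
// can claim and resolve tasks under the shipIt provisionerId.
const shipItScopes = [
  'queue:claim-task:shipIt/promoteRelease',  // claim tasks of this provisionerId/workerType
  'queue:worker-id:shipIt/*',                // identify as a worker in the shipIt workerGroup
];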

@djmitche
Contributor Author

djmitche commented Jul 7, 2017

We've been talking about how to make a decision task whose downstream tasks don't start until after the decision task is complete, and which can be re-run; the latest idea is to create a graph like the following:

Decision <-- Dummy <-- Build <-- Test
                 \---- Build <-- Test

I'm not totally sure that's the right idea, but it was another case where "dummy task" was a useful concept. I recall it coming up in a few other conversations in SF, although the details elude me; the general point is that the need pops up frequently, and running "echo hello world" in docker-worker is the best recommendation we have right now. I'd like to do a bit better.

I agree with thinking of "provisionerId" as "workerTypeGroup", and that trying to actually rename it is probably not worthwhile. And I see your point that there's no reason to stuff the "wait for" bit into the provisionerId any more than anywhere else in the whole worker type name.

@djmitche
Contributor Author

djmitche commented Jul 7, 2017

So let's consider the proposal adjusted to omit wait-for, although how to represent that in an RFC is currently in flux..

@jonasfj

jonasfj commented Jul 7, 2017

I think Jonas meant "it's not O(N) but N*3" - that pesky habit of forgetting "not".

@jonasfj Can you clarify if you meant O(N) or O(N^3)?

An algorithm that takes N*c+k steps is still O(N). So when I say it's N*3 operations the complexity is still O(N), with c = 3 (which disappears in big-O notation). In practice, however, if we're talking N*3 remote API calls and N > 100 that starts to take some time and in this case it becomes slow.

So while the number of remote API operations is still linear in the size of the input, with N being sufficiently high that might not be acceptable in terms of stability. Yeah, the magic number 100 is just a feeling, but feelings like that have worked well so far :) Maybe it could be 200, or 500, or 5000; the point is 100 is safe, so let's not push the envelope more than we have to...
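(A back-of-envelope illustration of why linear can still be too slow - assuming roughly 3 queue operations per dependency and ~50ms per remote call, both invented numbers:)

// rough cost model: dependencies * 3 operations * 50ms each
const costSeconds = deps => deps * 3 * 0.05;
costSeconds(100);    // ~15 seconds - tolerable
costSeconds(10000);  // ~1500 seconds - far too slow to stay stable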

@petemoore
Member

petemoore commented Jul 7, 2017

So while the number of remote API operations is still linear in the size of the input, with N being sufficiently high that might not be acceptable in terms of stability. Yeah, the magic number 100 is just a feeling, but feelings like that have worked well so far :) Maybe it could be 200, or 500, or 5000; the point is 100 is safe, so let's not push the envelope more than we have to...

My feeling here is we could do some testing to see what the potential range could realistically be. For example, we could set it to 10,000 and see how long various HTTP requests take to complete (and whether they might hit the 30s timeout). I'm in favour of an upper limit, I just think it should be something higher, to make breaking up dependencies into multiple dummy tasks something that should almost never be required - just as you should almost never need to execute a command whose arguments have more than 65536 characters, or have a filename longer than 256 characters, or a directory with more than 65536 files inside it, or an environment block of more than 32768 bytes on Windows. All of these are limits that were dramatically increased from former extremely conservative settings, such that 99.9% of use cases were covered by the increase (now I'm just making up statistics on the spot). But you see the idea - a limit is reasonable, but let's set it way above what the normal use case is likely to be, and make sure via testing that it is still well within a range we can comfortably handle (i.e. HTTP responses don't time out, we don't run out of memory, operations don't hog the OS and impact other resources, etc.). The last of these we can probably not concern ourselves with, since we've agreed it is O(N), and the overall hit is likely to have less overhead rather than more, due to the smaller number of tasks to process but the same total number of dependencies.

@escapewindow
Contributor

I can see the usefulness of a task that automatically fails or succeeds. I would also possibly suggest a task that fails or succeeds based on the results of its dependencies.

I'm not sure I see the usefulness of the wait-for task; isn't that solvable by scheduling it against a specific workerType?

@jonasfj

jonasfj commented Jul 10, 2017 via email

@djmitche
Contributor Author

OK, let me see if I can summarize the topics here:

  • succeed / fail workers -- potentially useful (and maybe room for some more interesting behavior based on dependency results)
  • wait-for -- not useful
  • 100-dependency limit -- possibly too low

It would be pretty cool if we could lift the dependency limit by a factor of 10 or more, but it's only tangentially related to this RFC, in that one of the uses of dummy tasks would be to work around the dependency limit.

A few examples of other uses for dummy tasks, to help re-focus the RFC:

  • sending a pulse message when some subset of a task group is complete (e.g., all builds)
  • "faking out" production-only tasks when making development or staging task graphs

..and maybe @escapewindow has some more ideas as suggested in his comment above.

@djmitche djmitche changed the title from Miscellaneous Worker to Miscellaneous Workers on Jul 10, 2017
@djmitche
Contributor Author

(proposal updated to match)

@escapewindow
Contributor

escapewindow commented Jul 10, 2017

I wrote up http://escapewindow.dreamwidth.org/243472.html#dummy_jobs a while back. Some of those ideas might belong here; others may be unneeded, or complex enough that we'd want to split them out into something else.

success / failure (status?) tasks

  • I think it would be good to be able to specify worst_level, so the task only succeeds if all deps succeed.
  • someday we may want something like "7 of these 10 deps must pass", but I think we should probably wait for a real world need before writing it.
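(A purely hypothetical sketch of what such a status task's definition could look like - the workerType and payload fields below are invented for illustration and exist nowhere today:)

// Invented "status" dummy task with worst_level; field names are illustrative only.
const statusTask = {
  provisionerId: 'built-in',
  workerType: 'status',
  dependencies: ['<taskId-1>', '<taskId-2>'],
  requires: 'all-resolved',     // run even if some deps failed, so the outcome can be inspected
  payload: {
    worst_level: 'completed',   // succeed only if no dependency resolved worse than this
  },
};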

timer tasks

  • succeed / fail at absolute or relative time t (succeed/fail/worst_level 5min after this task is marked as pending, or succeed/fail/worst_level after 7pm once marked as pending, f/e)
  • it might be nice to have this functionality in tasks in general: Run this docker-worker task after all deps have completed, at/after time t. I don't know that we need this, however, and we can accomplish the same thing by having a dummy timer task blocking the docker-worker task.

breakpoint tasks

  • we accomplish this by specifying a nonexistent workerType atm

notification tasks

  • pulse
  • I'd love support for IRC, email, Slack, bugzilla comments, github comments in some fashion. This might be direct support, or a pulse listener that can act on special pulse messages.

@djmitche
Contributor Author

djmitche commented Aug 2, 2017

Sorry to let this sit for so long!

I don't understand all of them, but some make good sense, especially the breakpoint. I'd like to start with the two currently in the proposal, and then we can look at adding more later.

I think there's value in having these sorts of "utility tasks" in some shared context that everyone has access to. However, now that I think about this, and now that I've gotten into the queue code, I don't think it makes sense to special-case these. Rather, I propose we build these in a super-simple service like taskcluster-misc-workers that just runs as a single heroku dyno with no API or anything like that (at least initially).

I'll update the first comment accordingly.

@djmitche
Contributor Author

Pete's on PTO for a while, so I will wait until he's back to see if we can decide on this.

@petemoore
Member

Rather, I propose we build these in a super-simple service like taskcluster-misc-workers that just runs as a single heroku dyno with no API or anything like that (at least initially).

++ sgtm :-)

@djmitche
Contributor Author

FCP ends August 30th.

@jonasfj

jonasfj commented Aug 23, 2017

Okay, so this is a proposal for:

success/fail dummy tasks?
Can't it just be a dummy task with task.payload.success = true/false?

Finally, can we call it: provisionerId: taskcluster or built-in?

@jonasfj

jonasfj commented Aug 23, 2017

Off topic: the limit on the number of dependencies is possibly too small. We could likely increase it, but it's probably best done with the postgres migration or something.

@djmitche
Contributor Author

I like provisionerId built-in. I don't mind putting the output in the payload, but then what do we name the workerType? immediate?

Dependencies count is definitely off-topic :)

@jonasfj

jonasfj commented Aug 23, 2017

crazy idea: it could be a json-e worker-type:

task.provisionerId = 'static-v1'
task.workerType = 'json-e-v1'
task.payload = {
  // a json-e expression that, given the variable "dependencies" (a list of task-status
  // structures, one per entry in task.dependencies), evaluates to something like:
  success: true,
  artifacts: {
    "public/live.log": {text: "hello world!"},
    "public/live.json": {json: {result: "ok"}},
    "public/release-binary": {url: "https://...."},
  }
}

Async support in json-e would be neat as it would allow for fetching and grepping dependencies/artifacts on-demand...

@djmitche djmitche removed their assignment Aug 23, 2017
@djmitche
Contributor Author

That sounds more like a different kind of utility worker (sort of like lambda: one that doesn't run in any specific "environment" but just executes code given right in the task definition). Interesting idea, but not what I'm going for here.

I feel like this is lacking motivation, really. Without a strong motivation we seem to wander a bit. I don't personally have a use for these workerTypes, and it's not clear the release mechanics would find them immediately useful either.

I think I'll set this aside until we do have a strong motivating case.

@catlee

catlee commented Aug 23, 2017

I think the only concrete use case we have is for the 'dummy' tasks to work around the dependency limit.

@escapewindow
Contributor

I think status/notification tasks would be useful for porting releasetasks graphs, where status is just a single task you can monitor to see the status of all its dependencies.

@rail

rail commented Aug 24, 2017

One of the use cases for dummy workers would be grouping parallel tasks by some property - platform, for example:

mac chunk 1 <--\
mac chunk 2 <-- all mac done (notify QE)
win chunk 1 <--\
win chunk 2 <-- all win done (notify QE)

@petemoore
Member

I think the only concrete use case we have is for the 'dummy' tasks to work around the dependency limit.

If that is the case, I think this is best solved by bumping the limit to something much higher, e.g. 10000 (assuming load testing shows no problems) and us parking the dummy tasks design.

@djmitche
Contributor Author

OK, we'll continue to leave the dependency-limit issue out for now (emphasis mine).

@escapewindow and @rail have described another concrete use for the "succeed" workerType. The notifications from that can come from taskcluster-notify, so this workerType doesn't need to do anything notification-related directly. With all-resolved, that notification could occur regardless of the succeed/fail status of the dependent tasks. I think any deeper analysis of the dependents' status would necessitate a "normal" docker-worker task (or one of these fancy tc-λ tasks @jonasfj suggested), rather than embedding more logic in a custom worker implementation.
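(Concretely, rail's grouping example could then be a built-in/succeed task carrying a taskcluster-notify route - the route below follows tc-notify's notify.<type>.<address>.on-<event> pattern, with placeholder values:)

// "all mac done" as a built-in/succeed task plus a tc-notify route.
const groupTask = {
  provisionerId: 'built-in',
  workerType: 'succeed',
  dependencies: ['<mac chunk 1 taskId>', '<mac chunk 2 taskId>'],
  requires: 'all-resolved',   // fire whether the chunks succeeded or failed
  routes: ['notify.email.qe@example.com.on-completed'],
  payload: {},
};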

Regarding @jonasfj's suggestion about a single workerType, I think it's a minor implementation difference, and I think "a worker that always fails" is easier to comprehend and invites less feature-bolting-on than "a worker that resolves with a result based on its payload". So I'm going to stick with the two workerTypes.

I've updated the first comment to reflect:

  • provisionerId "built-in"
  • motivating use-case (hah, that was already there, and another one!)

Any additional adjustments (that do not relate to the dependency limit) before we re-try final comment here?

@djmitche
Contributor Author

..specifically @jonasfj @petemoore

@djmitche
Contributor Author

https://bugzilla.mozilla.org/show_bug.cgi?id=1318253#c3 -- a use of a dummy workerType (and using an outdated workerType at that)

@djmitche
Contributor Author

djmitche commented Feb 5, 2018

Well, that's been 5 months with no further comment, so I suspect we can go to a final comment period here :)

@djmitche
Contributor Author

djmitche commented Feb 6, 2018

This will finish final comment period in 6 days (I forgot to say so yesterday)

@djmitche djmitche merged commit 0092286 into master Feb 27, 2018