Add random assignment of tasks across eligible workers. #219

phillbaker · 2017-11-02T19:59:36Z

This is an attempt to address concourse/concourse#1741 (and maybe allow concourse/concourse#675 to be addressed in the future).

It hides a new scheduling "algorithm" behind a new command-line flag, which defaults to the current bucketing + random choice method. In the future this could be extended to other algorithms like round robin or least resource utilization. All feedback welcome!

Added test coverage of the pool option, but not of the new ATC flag, as the atc_command acceptance test seemed to cover only the non-trivial options.

cc @JohannesRudolph & thanks for pointing to the location of the scheduling code

jtarchie · 2017-11-02T21:31:54Z

@phillbaker, thanks! Have you done any testing of this on a real life deployment? It would be interesting to see the metric data points from the before and after.

phillbaker · 2017-11-03T03:21:09Z

@jtarchie thanks for taking a look.

Our deployment unfortunately doesn't capture the logs/metrics 😒, but it would be great to see a before/after of job durations, resource utilization, and job "gravity" over time. We're seeing issues build up only after a couple of weeks, so I'm also afraid using our deployment would make for a slow iteration cycle.

Any suggestions/examples for gathering that kind of data efficiently?

vito

Thanks for taking a swing at this!

I like the idea of exposing a flag, so we can opt in to this over time, and possibly experiment with other schedulers.

I have one idea to make the implementation better though: instead of injecting a bool, could we define a ContainerPlacementStrategy interface? The current strategy would then be a VolumeLocalityPlacementStrategy, and the new one would be a RandomPlacementStrategy. This would shrink the pool tests and allow these new strategies to be tested in isolation.

Then, command.go could switch on the name to construct the configured strategy.
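To make the suggestion concrete, here is a minimal sketch of what such an interface could look like. Everything in it is an assumption for illustration, not the code in this PR: the `Worker` type, its `LocalVolumes` field, the `Choose` method name, and the `NewStrategy` helper are hypothetical, and the "bucket" behaviour is modelled as "keep the workers that already hold the most relevant volumes, then pick one of them at random".

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// Worker is a hypothetical stand-in for the ATC's worker type.
type Worker struct {
	Name string
	// LocalVolumes counts the inputs already present on this worker (assumed field).
	LocalVolumes int
}

// ErrNoWorkers is returned when there is nothing to choose from.
var ErrNoWorkers = errors.New("no eligible workers")

// ContainerPlacementStrategy picks one worker out of the eligible set.
type ContainerPlacementStrategy interface {
	Choose(workers []Worker) (Worker, error)
}

// RandomPlacementStrategy picks uniformly at random among all eligible workers.
type RandomPlacementStrategy struct{}

func (RandomPlacementStrategy) Choose(workers []Worker) (Worker, error) {
	if len(workers) == 0 {
		return Worker{}, ErrNoWorkers
	}
	return workers[rand.Intn(len(workers))], nil
}

// VolumeLocalityPlacementStrategy keeps only the workers with the most local
// volumes, then picks randomly among that bucket.
type VolumeLocalityPlacementStrategy struct{}

func (VolumeLocalityPlacementStrategy) Choose(workers []Worker) (Worker, error) {
	most := -1
	var best []Worker
	for _, w := range workers {
		switch {
		case w.LocalVolumes > most:
			most = w.LocalVolumes
			best = []Worker{w}
		case w.LocalVolumes == most:
			best = append(best, w)
		}
	}
	return RandomPlacementStrategy{}.Choose(best)
}

// NewStrategy maps the flag value to a strategy, as command.go might do.
func NewStrategy(name string) (ContainerPlacementStrategy, error) {
	switch name {
	case "bucket":
		return VolumeLocalityPlacementStrategy{}, nil
	case "random":
		return RandomPlacementStrategy{}, nil
	default:
		return nil, fmt.Errorf("unknown scheduling method %q", name)
	}
}

func main() {
	workers := []Worker{{"a", 0}, {"b", 2}, {"c", 2}}
	for _, name := range []string{"bucket", "random"} {
		strategy, err := NewStrategy(name)
		if err != nil {
			panic(err)
		}
		w, _ := strategy.Choose(workers)
		fmt.Printf("%s -> %s\n", name, w.Name)
	}
}
```

With this shape, each strategy can be unit-tested against a plain slice of workers, and the pool only needs to know about the interface.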

Re: visualization, this probably really depends on the pipeline. I think there are particular flows that make the existing scheduler algorithm stress out particular workers, and while this new change may fix it, there may be interesting data around build time and network throughput between the ATC and workers instead, as it has to stream more data.

vito · 2017-11-06T15:49:18Z

atccmd/command.go

@@ -106,6 +106,7 @@ type ATCCommand struct {
 	ResourceCheckingInterval          time.Duration `long:"resource-checking-interval" default:"1m" description:"Interval on which to check for new versions of resources."`
 	OldResourceGracePeriod            time.Duration `long:"old-resource-grace-period" default:"5m" description:"How long to cache the result of a get step after a newer version of the resource is found."`
 	ResourceCacheCleanupInterval      time.Duration `long:"resource-cache-cleanup-interval" default:"30s" description:"Interval on which to cleanup old caches of resources."`
+	WorkerSchedulingMethod            string        `long:"worker-scheduling-method" default:"bucket" description:"How jobs shoudld be scheduled across the worker pool, options: 'bucket' or 'random'."`


This could be configured as choice:"bucket" choice:"random" instead: https://godoc.org/github.com/jessevdk/go-flags
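For readers unfamiliar with what the `choice` tags buy you: the parser rejects any value outside the listed options, so the command never has to validate the string itself. The sketch below imitates that behaviour using only the standard library `flag` package rather than go-flags, so it is a stand-in for the idea and not the go-flags API; `choiceValue` and `parseMethod` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// choiceValue is a hypothetical flag.Value that accepts only the listed choices.
type choiceValue struct {
	value   string
	choices []string
}

func (c *choiceValue) String() string { return c.value }

// Set rejects anything that is not one of the allowed choices.
func (c *choiceValue) Set(s string) error {
	for _, choice := range c.choices {
		if s == choice {
			c.value = s
			return nil
		}
	}
	return fmt.Errorf("allowed values are: %s", strings.Join(c.choices, ", "))
}

// parseMethod parses args and returns the selected scheduling method.
func parseMethod(args []string) (string, error) {
	method := &choiceValue{value: "bucket", choices: []string{"bucket", "random"}}
	fs := flag.NewFlagSet("atc", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	fs.Var(method, "worker-scheduling-method", "How jobs should be scheduled across the worker pool.")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return method.value, nil
}

func main() {
	for _, args := range [][]string{
		{},
		{"--worker-scheduling-method=random"},
		{"--worker-scheduling-method=round-robin"},
	} {
		method, err := parseMethod(args)
		fmt.Printf("%v -> %q, err=%v\n", args, method, err)
	}
}
```

The first two invocations succeed (the default and `random`), while the third fails at parse time because `round-robin` is not a listed option.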

JohannesRudolph · 2017-11-06T18:25:38Z

@vito's suggestions make sense from a code perspective. I was wondering if/how it would be possible to add better visualization of actual task distribution to the pipeline/UI.

I know that fly reports the containers & worker assignment, but issues would be a lot easier to debug if the UI could display which worker a task is running on (and by extension, the total build time of a task would be great as well). Or maybe add a "workers" screen to the UI where one could see task distribution across workers in real time (with running tasks and finished task containers visually separated)?

All of these are entirely separate issues though

phillbaker · 2017-11-07T04:52:24Z

> I have one idea to make the implementation better though: instead of injecting a bool, could we define a ContainerPlacementStrategy interface?

👍

Took a stab at this. If this looks like a good start, I'll finish up moving over the tests as well.

phillbaker · 2017-11-08T15:02:57Z

Updated as per your suggestion @vito, and finished moving the tests over. Let me know what you think!

vito

Awesome, thanks! I think there's one final touch-up to the flag before we should merge it in. The new approach looks great though!

vito · 2017-11-09T00:44:41Z

atccmd/command.go

@@ -106,6 +106,7 @@ type ATCCommand struct {
 	ResourceCheckingInterval          time.Duration `long:"resource-checking-interval" default:"1m" description:"Interval on which to check for new versions of resources."`
 	OldResourceGracePeriod            time.Duration `long:"old-resource-grace-period" default:"5m" description:"How long to cache the result of a get step after a newer version of the resource is found."`
 	ResourceCacheCleanupInterval      time.Duration `long:"resource-cache-cleanup-interval" default:"30s" description:"Interval on which to cleanup old caches of resources."`
+	WorkerSchedulingMethod            string        `long:"worker-scheduling-method" default:"bucket" choice:"bucket" choice:"random" description:"How jobs shoudld be scheduled across the worker pool."`


Could we rename this to ContainerPlacementStrategy, rename bucket to volume-locality, and have the description say "Method by which a worker is selected during container placement." or something?

I think that more precisely captures what it does.

phillbaker · 2017-11-09

Sounds good, updated!

This hides a new scheduling "algorithm" behind a new command-line flag, which defaults to the current bucketing + random choice method. In the future this could be extended to other algorithms like round robin or least resource utilization. The algorithms are extracted to a new scheduling interface.

topherbullock

Looks good! I like the approach of adding a Container Placement Strategy, and I can see where this idea could extend long term.

topherbullock · 2017-11-15T14:45:34Z

Thanks for the PR @phillbaker!

phillbaker force-pushed the add-pure-random-assignment branch from 9eabf53 to a1f36a6 on November 2, 2017 20:15

phillbaker force-pushed the add-pure-random-assignment branch 3 times, most recently from 60ef42d to 6a03965 on November 3, 2017 03:01

vito suggested changes Nov 6, 2017

phillbaker force-pushed the add-pure-random-assignment branch 2 times, most recently from 750cc7a to 8ef9542 on November 7, 2017 04:43

This was referenced Nov 7, 2017

Make web UI show task duration concourse/concourse#1788

Closed

Add debug info to step header concourse/concourse#1216

Closed

phillbaker force-pushed the add-pure-random-assignment branch 6 times, most recently from 09d10d9 to 3b83646 on November 8, 2017 15:02

vito suggested changes Nov 9, 2017

phillbaker force-pushed the add-pure-random-assignment branch from 3b83646 to fc52f6b on November 9, 2017 02:11

vito approved these changes Nov 9, 2017

vito requested a review from topherbullock November 9, 2017 14:04

topherbullock approved these changes Nov 13, 2017

topherbullock merged commit b109d39 into vmware-archive:master Nov 13, 2017

topherbullock mentioned this pull request Nov 13, 2017

Fix "worker gravity": Allow more random scheduling of jobs/tasks across workers concourse/concourse#1741

Closed

vito added this to the v3.7.0 milestone Nov 30, 2017

william-tran mentioned this pull request Jan 3, 2018

[stable/concourse] Upgrade to concourse 3.8.0 helm/charts#3203

Merged

marco-m mentioned this pull request May 21, 2019

Get rid of the random container placement strategy concourse/concourse#3888

Closed
