
Add random assignment of tasks across eligible workers. #219

Merged

Conversation

phillbaker
Contributor

This is an attempt to address concourse/concourse#1741 (and maybe allow concourse/concourse#675 to be addressed in the future).

It hides a new scheduling "algorithm" behind a new command-line flag which defaults to the current bucketing + random choice method. In the future this could be extended to other algorithms like round robin or least resource utilization. All feedback welcome!
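For anyone who wants to try it out, opting in would look roughly like this (a hypothetical invocation based on the flag in the diff below; all other required ATC flags are omitted, and the flag is renamed later in this thread):

atc --worker-scheduling-method random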

Added test coverage of the pool option, but not of the new ATC flag as the atc_command acceptance test only seemed to cover the non-trivial options.

cc @JohannesRudolph & thanks for pointing to the location of the scheduling code

@jtarchie
Contributor

jtarchie commented Nov 2, 2017

@phillbaker, thanks! Have you done any testing of this on a real-life deployment? It would be interesting to see the metric data points from the before and after.

@phillbaker phillbaker force-pushed the add-pure-random-assignment branch 3 times, most recently from 60ef42d to 6a03965 Compare November 3, 2017 03:01
@phillbaker
Contributor Author

@jtarchie thanks for taking a look.

Our deployment unfortunately doesn't capture the logs/metrics 😒, but it would be great to see a before/after comparison of job run times, resource utilization, and job "gravity" over time. We only see issues build up after a couple of weeks, so I'm also afraid that using our deployment would mean a slow iteration cycle.

Any suggestions/examples for gathering that kind of data efficiently?

Contributor

@vito vito left a comment


Thanks for taking a swing at this!

I like the idea of exposing a flag, so we can opt in to this over time, and possibly experiment with other schedulers.

I have one idea to make the implementation better though: instead of injecting a bool, could we define a ContainerPlacementStrategy interface? The current strategy would then be a VolumeLocalityPlacementStrategy, and the new one would be a RandomPlacementStrategy. This would shrink the pool tests and allow these new strategies to be tested in isolation.

Then, command.go could switch on the name to construct the configured strategy.
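For concreteness, a minimal sketch of what that interface could look like, assuming the existing Worker and ContainerSpec types from the worker package (the names follow the suggestion above, but the exact method signature is a guess, not the real ATC code):

import (
    "errors"
    "math/rand"
)

// ContainerPlacementStrategy picks one worker out of the workers that are
// compatible with the container being created.
type ContainerPlacementStrategy interface {
    Choose(compatible []Worker, spec ContainerSpec) (Worker, error)
}

// RandomPlacementStrategy ignores volume locality and picks any compatible
// worker uniformly at random.
type RandomPlacementStrategy struct{}

func (RandomPlacementStrategy) Choose(compatible []Worker, spec ContainerSpec) (Worker, error) {
    if len(compatible) == 0 {
        return nil, errors.New("no compatible workers")
    }
    return compatible[rand.Intn(len(compatible))], nil
}

// VolumeLocalityPlacementStrategy would keep the current behaviour: bucket
// workers by how many of the step's volumes they already hold locally, then
// pick randomly within the best bucket.
type VolumeLocalityPlacementStrategy struct{}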

Re: visualization, this probably really depends on the pipeline. I think there are particular flows that make the existing scheduler algorithm stress out particular workers, and while this new change may fix that, there may be interesting data around build time and network throughput between the ATC and workers instead, as the new strategy has to stream more data.

@@ -106,6 +106,7 @@ type ATCCommand struct {
ResourceCheckingInterval time.Duration `long:"resource-checking-interval" default:"1m" description:"Interval on which to check for new versions of resources."`
OldResourceGracePeriod time.Duration `long:"old-resource-grace-period" default:"5m" description:"How long to cache the result of a get step after a newer version of the resource is found."`
ResourceCacheCleanupInterval time.Duration `long:"resource-cache-cleanup-interval" default:"30s" description:"Interval on which to cleanup old caches of resources."`
WorkerSchedulingMethod string `long:"worker-scheduling-method" default:"bucket" description:"How jobs should be scheduled across the worker pool, options: 'bucket' or 'random'."`
Contributor


This could be configured as choice:"bucket" choice:"random" instead: https://godoc.org/github.com/jessevdk/go-flags
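For reference, a tiny standalone program showing how go-flags' repeated choice tag works (the field name mirrors the diff above; this is an illustrative sketch, not the ATC's actual wiring):

package main

import (
    "fmt"
    "os"

    flags "github.com/jessevdk/go-flags"
)

type options struct {
    // Repeating the choice key lists the allowed values; anything else
    // (e.g. --worker-scheduling-method=foo) is rejected at parse time.
    WorkerSchedulingMethod string `long:"worker-scheduling-method" default:"bucket" choice:"bucket" choice:"random" description:"How containers should be scheduled across the worker pool."`
}

func main() {
    var opts options
    if _, err := flags.ParseArgs(&opts, os.Args[1:]); err != nil {
        os.Exit(1) // go-flags already printed the parse error
    }
    fmt.Println("scheduling method:", opts.WorkerSchedulingMethod)
}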

@JohannesRudolph

@vito's suggestions make sense from a code perspective. I was wondering if/how it would be possible to add better visualization of actual task distribution to the pipeline/UI.

I know that fly reports containers and their worker assignments, but it would make issues a lot easier to debug if the UI could display which worker a task is running on (and by extension, the total build time of a task would be great as well). Or maybe add a "workers" screen to the UI where one could see task distribution across workers in real time (with running tasks and finished task containers visually separated)?

All of these are entirely separate issues, though.

@phillbaker phillbaker force-pushed the add-pure-random-assignment branch 2 times, most recently from 750cc7a to 8ef9542 Compare November 7, 2017 04:43
@phillbaker
Contributor Author

I have one idea to make the implementation better though: instead of injecting a bool, could we define a ContainerPlacementStrategy interface?

👍

Took a stab at this; if it looks like a good start, I'll finish moving the tests over as well.

@phillbaker
Contributor Author

Updated as per your suggestion @vito, and finished moving the tests over. Let me know what you think!

Contributor

@vito vito left a comment


Awesome, thanks! I think there's one final touch-up to the flag before we merge it in. The new approach looks great, though!

@@ -106,6 +106,7 @@ type ATCCommand struct {
ResourceCheckingInterval time.Duration `long:"resource-checking-interval" default:"1m" description:"Interval on which to check for new versions of resources."`
OldResourceGracePeriod time.Duration `long:"old-resource-grace-period" default:"5m" description:"How long to cache the result of a get step after a newer version of the resource is found."`
ResourceCacheCleanupInterval time.Duration `long:"resource-cache-cleanup-interval" default:"30s" description:"Interval on which to cleanup old caches of resources."`
WorkerSchedulingMethod string `long:"worker-scheduling-method" default:"bucket" choice:"bucket" choice:"random" description:"How jobs should be scheduled across the worker pool."`
Contributor


Could we rename this to ContainerPlacementStrategy, rename bucket to volume-locality, and have the description say "Method by which a worker is selected during container placement." or something?

I think that more precisely captures what it does.
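Concretely, the suggested rename would make the field look roughly like this (a sketch assembled from the comment above, not the actual final diff):

ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" description:"Method by which a worker is selected during container placement."`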

Contributor Author


Sounds good, updated!

This hides a new scheduling "algorithm" behind a new command-line flag which defaults to the current bucketing + random choice method. In the future this could be extended to other algorithms like round robin or least resource utilization.

The algorithms are extracted to a new scheduling interface.
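A rough sketch of the "switch on the name" wiring suggested earlier in the review; the constructor names here are assumptions for illustration, not necessarily the PR's actual code:

// In command.go, the flag value could be mapped to a strategy like this.
func (cmd *ATCCommand) chooseContainerPlacementStrategy() worker.ContainerPlacementStrategy {
    switch cmd.ContainerPlacementStrategy {
    case "random":
        return worker.NewRandomPlacementStrategy()
    default: // "volume-locality" is the default
        return worker.NewVolumeLocalityPlacementStrategy()
    }
}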
Contributor

@topherbullock topherbullock left a comment


Looks good! I like the approach of adding a Container Placement Strategy, and I can see where this idea could extend long term.

@topherbullock
Contributor

Thanks for the PR @phillbaker!

@vito vito added this to the v3.7.0 milestone Nov 30, 2017