-
Notifications
You must be signed in to change notification settings - Fork 115
Add random assignment of tasks across eligible workers. #219
Add random assignment of tasks across eligible workers. #219
Conversation
9eabf53
to
a1f36a6
Compare
@phillbaker, thanks! Have you done any testing of this on a real life deployment? It would be interesting to see the metric data points from the before and after. |
60ef42d
to
6a03965
Compare
@jtarchie thanks for taking a look. Our deployment unfortunately doesn't capture the logs/metrics 😒, but it would be great to see a before/after of both job length times, resource utilization and job "gravity" over time. We're seeing issues build only after a couple of weeks, so I'm also afraid using our deployment would have a slow iteration cycle. Any suggestions/examples for gathering that kind of data efficiently? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a swing at this!
I like the idea of exposing a flag, so we can opt in to this over time, and possibly experiment with other schedulers.
I have one idea to make the implementation better though: instead of injecting a bool
, could we define a ContainerPlacementStrategy
interface? The current strategy would then be a VolumeLocalityPlacementStrategy
, and the new one would be a RandomPlacementStrategy
. This would shrink the pool
tests and allow these new strategies to be tested in isolation.
Then, command.go
could switch on the name to construct the configured strategy.
Re: visualization, this probably really depends on the pipeline. I think there are particular flows that make the existing scheduler algorithm stress out particular workers, and while this new change may fix it, there may be interesting data around build time and network throughput between the ATC and workers instead, as it has to stream more data.
atccmd/command.go
Outdated
@@ -106,6 +106,7 @@ type ATCCommand struct { | |||
ResourceCheckingInterval time.Duration `long:"resource-checking-interval" default:"1m" description:"Interval on which to check for new versions of resources."` | |||
OldResourceGracePeriod time.Duration `long:"old-resource-grace-period" default:"5m" description:"How long to cache the result of a get step after a newer version of the resource is found."` | |||
ResourceCacheCleanupInterval time.Duration `long:"resource-cache-cleanup-interval" default:"30s" description:"Interval on which to cleanup old caches of resources."` | |||
WorkerSchedulingMethod string `long:"worker-scheduling-method" default:"bucket" description:"How jobs shoudld be scheduled across the worker pool, options: 'bucket' or 'random'."` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be configured as choice:"bucket" choice:"random"
instead: https://godoc.org/github.com/jessevdk/go-flags
@vito's suggestions make sense from a code perspective. I was wondering if/how it would be possible to add better visualization of actual task distribution to the pipeline/UI. I know that All of these are entirely separate issues though |
750cc7a
to
8ef9542
Compare
👍 Took a stab at this, if this looks like a good start, I'll finish up moving over the tests as well. |
09d10d9
to
3b83646
Compare
Updated as per your suggestion @vito, and finished moving the tests over. Let me know what you think! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks! I think there's one final touch-up to the flag before we should merge it in. The new approach looks great though!
atccmd/command.go
Outdated
@@ -106,6 +106,7 @@ type ATCCommand struct { | |||
ResourceCheckingInterval time.Duration `long:"resource-checking-interval" default:"1m" description:"Interval on which to check for new versions of resources."` | |||
OldResourceGracePeriod time.Duration `long:"old-resource-grace-period" default:"5m" description:"How long to cache the result of a get step after a newer version of the resource is found."` | |||
ResourceCacheCleanupInterval time.Duration `long:"resource-cache-cleanup-interval" default:"30s" description:"Interval on which to cleanup old caches of resources."` | |||
WorkerSchedulingMethod string `long:"worker-scheduling-method" default:"bucket" choice:"bucket" choice:"random" description:"How jobs shoudld be scheduled across the worker pool."` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rename this to ContainerPlacementStrategy
, rename bucket
to volume-locality
, and have the description say "Method by which a worker is selected during container placement." or something?
I think that more precisely captures what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, updated!
This hides a new scheduling "algorithm" behind a new commandline flag which defaults to the current buckting + random choice method. In the future this could be extended to to other algorithms like round robin or least resource utilization. The algorithms are extracted to a new scheduling interface.
3b83646
to
fc52f6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I like the approach of adding a Container Placement Strategy, and where I can see this idea extending long term.
Thanks for the PR @phillbaker! |
This is an attempt to address concourse/concourse#1741 (and maybe allow concourse/concourse#675 to be addressed in the future).
It hides a new scheduling "algorithm" behind a new commandline flag which defaults to the current bucketing + random choice method. In the future this could be extended to to other algorithms like round robin or least resource utilization. All feedback welcome!
Added test coverage of the pool option, but not of the new ATC flag as the atc_command acceptance test only seemed to cover the non-trivial options.
cc @JohannesRudolph & thanks for pointing to the location of the scheduling code