-
Notifications
You must be signed in to change notification settings - Fork 64
Implemented host-placement group constraints #284
Conversation
1b06a13
to
c1d49bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't made it all the way through this yet, but here are some initial comments
scheduler/docs/groups.md
Outdated
Example: | ||
```json | ||
{"groups": [{"uuid": "410198dd-c171-470e-aeb8-6ffa4e6a2ada", | ||
"host-placement": {"type": "unique"}], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing one more }
before the ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
scheduler/docs/groups.md
Outdated
"host-placement": {"type": "unique"}], | ||
"jobs": [{"uuid": "831836ee-d5c3-47f7-a052-f4c3ce502495", | ||
"group": "410198dd-c171-470e-aeb8-6ffa4e6a2ada". ...}, | ||
...]}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the 2 lines above, why not have only one set of ellipses? Also, there's an extra trailing }
it seems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the first one indicates there are more fields inside each job, and the second shows there can be more jobs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
scheduler/src/cook/mesos/util.clj
Outdated
[clojure.walk :as walk] | ||
[datomic.api :as d :refer (q)] | ||
[plumbing.core :refer (map-vals map-keys)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we combine this with the existing plumbing.core
require below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
|
||
(defn init-offer-cache | ||
[& init] | ||
(-> (if (nil? init) {} init) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe (or init {})
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
@@ -67,7 +67,7 @@ | |||
;;Networking | |||
[clj-http "2.0.0"] | |||
[io.netty/netty "3.10.1.Final"] | |||
[cc.qbits/jet "0.5.7" :exclusions [org.eclipse.jetty/jetty-io | |||
[cc.qbits/jet "0.6.4" :exclusions [org.eclipse.jetty/jetty-io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the jet upgrade needed for the host placement feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was needed for testing
scheduler/docs/groups.md
Outdated
```json | ||
{"groups": [{"uuid": "410198dd-c171-470e-aeb8-6ffa4e6a2ada", | ||
"host-placement": {"type": "attribute-equals", | ||
"parameters": {"attribute": "HOSTNAME"}}], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use availability zone since that's the example in the paragraph above? Also missing one more }
before the ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
scheduler/docs/groups.md
Outdated
"parameters": {"attribute": "HOSTNAME"}}], | ||
"jobs": [{"uuid": "831836ee-d5c3-47f7-a052-f4c3ce502495", | ||
"group": "410198dd-c171-470e-aeb8-6ffa4e6a2ada". ...}, | ||
...]}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same ellipses comment, and extra trailing }
scheduler/src/cook/components.clj
Outdated
@@ -19,6 +19,7 @@ | |||
[clj-time.core :as t] | |||
[clojure.java.io :as io] | |||
[clojure.pprint :refer (pprint)] | |||
[clojure.core.cache :as cache] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep these sorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
scheduler/src/cook/components.clj
Outdated
(cache/fifo-cache-factory :threshold max-size) | ||
(cache/ttl-cache-factory :ttl ttl-ms) | ||
atom)] | ||
cache)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: the let
is not really needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember this giving me trouble without the let... Let me try again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got rid of let 👍
scheduler/src/cook/components.clj
Outdated
@@ -346,6 +354,11 @@ | |||
task-constraints)) | |||
:offer-incubate-time-ms (fnk [[:config [:scheduler {offer-incubate-ms 15000}]]] | |||
offer-incubate-ms) | |||
:offer-cache (fnk [[:config [:scheduler {offer-cache nil}]]] | |||
(merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be indented too far in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
@@ -250,12 +249,20 @@ | |||
"A schema for the parameters of a host placement with type attribute-equals" | |||
{:attribute s/Str}) | |||
|
|||
(def Balanced-Parameters | |||
"A schema for the parameters of a host placement with type balanced" | |||
{:attribute s/Str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the attribute really be any string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much: any string that is a valid mesos attribute name
(ns cook.mesos.constraints | ||
(:require [clojure.tools.logging :as log] | ||
[cook.mesos.util :as util] | ||
[cook.mesos.group :as group] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we sort these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
[cook.mesos.util :as util] | ||
[cook.mesos.group :as group] | ||
[swiss.arrows :refer :all]) | ||
(import com.netflix.fenzo.VirtualMachineLease)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:import
instead of import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
(defn get-class-name | ||
"Returns the name of a class without package information." | ||
[object] | ||
(last (clojure.string/split (str (type object)) #"\."))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do (.getSimpleName (type object))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, knew there must have been a better way for this. Now using .getSimpleName 👍
@@ -539,5 +568,5 @@ | |||
@(d/transact conn [{:db/id :rebalancer/config | |||
:rebalancer.config/min-utilization-threshold 0.0 | |||
:rebalancer.config/safe-dru-threshold 0.0 | |||
:rebalancer.config/min-dru-diff 0.0000000001 | |||
:rebalancer.config/min-dru-diff 0.000000001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change to min-dru-diff
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inside a comment block, so it does not matter. But will change it back to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
(:value %)))) | ||
(into {})) | ||
cook-attributes {"HOSTNAME" (:hostname offer) | ||
"COOK_GPU?" (> (or (offer-resource-scalar offer "gpus") 0.0) 0)}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
(-> offer
(offer-resource-scalar "gpus")
(or 0.0)
(> 0))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is way better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with my pass
|
||
(ns cook.test.mesos.constraints | ||
(:use [clojure.test]) | ||
(:require [mesomatic.types :as mtypes] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we sort these?
[cook.mesos.constraints :as constraints] | ||
[cook.test.testutil :refer (restore-fresh-database! create-dummy-group create-dummy-job create-dummy-instance)] | ||
[datomic.api :as d :refer (db)]) | ||
(import org.mockito.Mockito |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:import
instead of import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
scored->expected (zipmap scored expected)] | ||
(doall (for [[scored expected] scored->expected] | ||
(is (= scored expected))))) | ||
))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: close parens can go on previous line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
(:require [clojure.core.async :as async] | ||
[clojure.core.cache :as cache] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indenting off by a little
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
079ce5d
to
4b47f93
Compare
Multiple code review changes, including better docstrings, many variable name changes, and filtering host->scored-tasks. Created a new namespace for constraints Generalized the balanced constraint to any attribute, exposed expected count parameter Multiple code review changes: made unconstrained filtering in rebalancer simpler and more efficient, improved naming of cotask functions, added host placement explanation to groups.md and other small changes Made offer cache maximum size and ttl configurable. Added rebalancer tests for balanced constraint CR changes: clarifications in group.md, extracted common code from fenzo constraints into shared functions, cotasks are now both unknown and running instances, host attributes now extracted from offers, renamed cotask-getter functions, added test for make-guuid->juuids Added many docstrings. Made sure that order of considerable jobs is preserved before handing to scheduleOnce. Wrote a test to check that considerable jobs are passed in right order. compute-preemption-decision now considers preemptions done in the same cycle to validate constraints. Refactored constraints.clj. Now all logic for constraints is in the evaluation function, which is actually a multi-method. This multi-method can then be wrapped to make it fenzo-compatible (in make-fenzo-hp-constraint) or rebalancer-compatible (in make-rebalacer-hp-constaint). Added unique-host constraint to the rebalancer. Refactored constraints.clj to define JobConstraint and GroupConstraint protocols, which make the code more modular, easier to modify and harder to unintentionally break. Now synthetic instances in state have slave-id, which is used to find attributes of constraints. Fixed lack of nil check when placement constraint is 'all'. Multiple minor improvements to address CR by shams Small improvements as suggested in CR by Wil Removed gpu-host-constraint-helper Multiple small CR improvements as suggested by Wil. More CR fixes as suggested by Wil, mostly on constraints.clj, msotly small refactoring and renaming Signed-off-by: Diego Torres Quintanilla <diego.torresquintanilla@twosigma.com>
4b47f93
to
4fec554
Compare
@pschorf Are you planning on taking a look at this? |
scheduler/src/cook/components.clj
Outdated
@@ -19,6 +19,7 @@ | |||
[clj-time.core :as t] | |||
[clojure.java.io :as io] | |||
[clojure.pprint :refer (pprint)] | |||
[clojure.core.cache :as cache] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
[this target-vm-resources target-vm-attrs target-vm-tasks-assigned] | ||
"Evaluates whether a vm with resources 'target-vm-resources', attributes 'target-vm-attrs' | ||
and TaskRequests (optional) 'target-vm-tasks-assigned' passes the job's constraint. Must return a | ||
2-element vector, where the first element is a boolean (whether the constraint was passsed) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about just returning the message if it failed, an nil otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think returning passes?
explicitly is convenient, as it encourages constraint code to name a variable called passes?
. Also, this is the format that Fenzo expects, so eventually both reason string and boolean will be in memory.
(concat running-cotasks assigned-cotasks))) | ||
|
||
(defn get-fenzo-cotasks | ||
"Given a group, a map of guuids to task-ids, a task-id and a Fenzo task-tracker-state, returns a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
guuids->guids
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
guuid = Group UUID (I use it in many other places)
(group-constraint-name [self] (get-class-name self)) | ||
(group-constraint-type [self] (get-group-constraint-type (:group self))) | ||
(group-constraint-evaluate [this target-attr-map cohost-attr-maps] | ||
(let [group (:group this) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this line necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well group
is reused in the next 2 lines and the last let definition.
Multiple code review changes, including better docstrings, many variable name changes, and filtering host->scored-tasks.
Created a new namespace for constraints
Generalized the balanced constraint to any attribute, exposed expected count parameter
Multiple code review changes: made unconstrained filtering in rebalancer simpler and more efficient, improved naming of cotask functions, added host placement explanation to groups.md and other small changes
Made offer cache maximum size and ttl configurable.
Added rebalancer tests for balanced constraint
CR changes: clarifications in group.md, extracted common code from fenzo constraints into shared functions, cotasks are now both unknown and running instances, host attributes now extracted from offers, renamed cotask-getter functions, added test for make-guuid->juuids
Added many docstrings. Made sure that order of considerable jobs is preserved before handing to scheduleOnce. Wrote a test to check that considerable jobs are passed in right order.
compute-preemption-decision now considers preemptions done in the same cycle to validate constraints.
Refactored constraints.clj. Now all logic for constraints is in the evaluation function, which is actually a multi-method. This multi-method can then be wrapped to make it fenzo-compatible (in make-fenzo-hp-constraint) or rebalancer-compatible (in make-rebalacer-hp-constaint).
Added unique-host constraint to the rebalancer. Refactored constraints.clj to define JobConstraint and GroupConstraint protocols, which make the code more modular, easier to modify and harder to unintentionally break.
Now synthetic instances in state have slave-id, which is used to find attributes of constraints. Fixed lack of nil check when placement constraint is 'all'.
Multiple minor improvements to address CR by shams
Small improvements as suggested in CR by Wil
Removed gpu-host-constraint-helper
Multiple small CR improvements as suggested by Wil.
More CR fixes as suggested by Wil, mostly on constraints.clj, msotly small refactoring and renaming
Signed-off-by: Diego Torres Quintanilla diego.torresquintanilla@twosigma.com