This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

Add Fenzo for scheduling #84

Closed
wants to merge 29 commits into from
Changes from 27 commits
Commits
d703bf8
Initial fenzo work (untested)
dgrnbrg Oct 29, 2015
f5fae89
Fix typo
dgrnbrg Nov 2, 2015
77b59c0
Got fenzo prototype running
dgrnbrg Nov 12, 2015
b0f2e54
Improve fenzo robustness
dgrnbrg Nov 13, 2015
5031d5d
Remove thread safety condition for incubating offers view
dgrnbrg Nov 13, 2015
a8155fb
Thunkified jobs, no longer running jobs twice
dgrnbrg Nov 16, 2015
5b49ae9
Remove thunk because it wasn't the actual bug
dgrnbrg Nov 16, 2015
cd870c9
Add backfilling for fenzo
dgrnbrg Nov 16, 2015
879109e
Exponential backoff of considerable size in queue
dgrnbrg Nov 17, 2015
0dfe74e
Try to sync fenzo and make a bit more robust
dgrnbrg Nov 17, 2015
7caceb1
More comments
dgrnbrg Nov 18, 2015
df5872b
Fix priority update logic for preemptor
dgrnbrg Nov 18, 2015
b767bcb
Remove test for removed function prefixes
dgrnbrg Nov 18, 2015
ba5d82a
Fix broken tests
dgrnbrg Nov 19, 2015
087f09d
Remove excess debugging
dgrnbrg Nov 19, 2015
ff512b6
Continued cleanup and improvements
dgrnbrg Nov 20, 2015
536d9ce
Add support for ports with Fenzo
dgrnbrg Nov 20, 2015
e850b5c
merge
dgrnbrg Nov 20, 2015
7605a45
Catch exceptions if task unassignemnt fails
dgrnbrg Nov 20, 2015
583c846
Test backfill filler
dgrnbrg Nov 20, 2015
5476f15
Fix api tests
dgrnbrg Nov 23, 2015
d807b04
Ensure we propagate the proper default port count as zero
dgrnbrg Nov 23, 2015
28199cd
Alternative impl for backfill
dgrnbrg Nov 23, 2015
27da88d
Add really important bit for backfill DRU penalty back in
dgrnbrg Nov 23, 2015
930d3ec
Remove global preemption of backfilled tasks
dgrnbrg Nov 24, 2015
aee8074
Add test for backfill job upgrades
dgrnbrg Dec 3, 2015
7fa71c0
merge
dgrnbrg Dec 3, 2015
0126080
updates for Li's comments
dgrnbrg Dec 9, 2015
0b7accf
Actually, true doesn't sort higher than false
dgrnbrg Dec 9, 2015
10 changes: 7 additions & 3 deletions scheduler/project.clj
@@ -31,17 +31,21 @@
[amalloy/ring-buffer "1.1"]
[lonocloud/synthread "1.0.4"]
[org.clojure/tools.namespace "0.2.4"]
-[org.clojure/core.cache "0.6.3"]
-[org.clojure/core.memoize "0.5.6"]
+[org.clojure/core.cache "0.6.4"]
+[org.clojure/core.memoize "0.5.8"]
[clj-time "0.9.0"]
-[org.clojure/core.async "0.1.346.0-17112a-alpha"]
+[org.clojure/core.async "0.2.374"]
Member
Some of the newer core.async releases seem to require Clojure 1.7. Is this upgrade safe?

Contributor Author
Good point, moved to 1.7

[prismatic/schema "0.2.1"
:exclusions [potemkin]]
[clojure-miniprofiler "0.4.0"]
[jarohen/chime "0.1.6"]
[org.clojure/data.priority-map "0.0.5"]
[swiss-arrows "1.0.0"]
[riddley "0.1.10"]
+[com.netflix.fenzo/fenzo-core "0.8.2"
+:exclusions [org.apache.mesos/mesos
+org.slf4j/slf4j-api
+org.slf4j/slf4j-simple]]

;;Logging
[org.clojure/tools.logging "0.2.6"]
8 changes: 4 additions & 4 deletions scheduler/src/cook/mesos.clj
@@ -81,21 +81,22 @@
datomic-report-chan (async/chan (async/sliding-buffer 4096))
mesos-pending-jobs-atom (atom [])
mesos-heartbeat-chan (async/chan (async/buffer 4096))
-{:keys [scheduler view-incubating-offers view-mature-offers]}
+current-driver (atom nil)
Contributor Author
Exposing this data here feels hacky...perhaps this file should be refactored?

Member
What is current-driver used for?
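
The thread does not answer this question. What the hunk itself shows is that `current-driver` is now bound earlier in the `let` and passed into `sched/create-datomic-scheduler`, so the atom has to exist before the driver does. A minimal sketch of that pattern, assuming the atom is filled in later; `make-scheduler`, `demo`, and `:fake-driver` are hypothetical stand-ins, not Cook's API:

```clojure
;; Hypothetical stand-ins for illustration only; not part of Cook.

(defn make-scheduler
  "Returns a function that reports whether a driver has been installed yet.
  The atom is captured now and dereferenced only when the function is called."
  [driver-atom]
  (fn driver-status []
    (if-let [driver @driver-atom]
      [:driver-ready driver]
      [:no-driver-yet])))

(defn demo
  "Shows why the atom must be bound before the scheduler in a sequential let."
  []
  (let [current-driver (atom nil)                  ; bound first, still empty
        scheduler (make-scheduler current-driver)  ; can already refer to it
        before (scheduler)
        _ (reset! current-driver :fake-driver)     ; assumption: set later, once a driver exists
        after (scheduler)]
    [before after]))
```

Because `let` bindings are evaluated in order, moving `current-driver (atom nil)` above the scheduler binding is what makes it available as an argument.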

+{:keys [scheduler view-incubating-offers]}
(sched/create-datomic-scheduler
mesos-datomic-conn
(fn set-or-create-framework-id [framework-id]
(curator/set-or-create
curator-framework
zk-framework-id
(.getBytes framework-id "UTF-8")))
+current-driver
mesos-pending-jobs-atom
mesos-heartbeat-chan
offer-incubate-time-ms
task-constraints)
framework-id (when-let [bytes (curator/get-or-nil curator-framework zk-framework-id)]
(String. bytes))
-current-driver (atom nil)
leader-selector (LeaderSelector.
curator-framework
zk-prefix
@@ -136,8 +137,7 @@
:driver driver
:mesos-master-hosts mesos-master-hosts
:pending-jobs-atom mesos-pending-jobs-atom
-:view-incubating-offers view-incubating-offers
-:view-mature-offers view-mature-offers}))
+:view-incubating-offers view-incubating-offers}))
(counters/inc! mesos-leader)
(async/tap mesos-datomic-mult datomic-report-chan)
(let [kill-monitor (cook.mesos.scheduler/monitor-tx-report-queue datomic-report-chan mesos-datomic-conn current-driver)]
15 changes: 8 additions & 7 deletions scheduler/src/cook/mesos/api.clj
@@ -55,7 +55,7 @@
:max-retries (s/both s/Int (s/pred pos? 'pos?))
:max-runtime (s/both s/Int (s/pred pos? 'pos?))
(s/optional-key :uris) [Uri]
-(s/optional-key :ports) [(s/pred zero? 'zero)] ;;TODO add to docs the limited uri/port support
+(s/optional-key :ports) (s/pred #(not (neg? %)) 'nonnegative?)
Contributor
Is ports allowed to be a list?

Contributor
I see it is the number of ports.
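
To make the outcome of this thread concrete, here is a minimal sketch in plain Clojure of treating `:ports` as a non-negative count and emitting transaction data only when the count is non-zero, mirroring the `(when (and ports (not (zero? ports))) ...)` form in the next hunk. The names `valid-port-count?` and `port-txn` are hypothetical, and `id` is an ordinary keyword here rather than a Datomic tempid:

```clojure
;; Hypothetical helpers for illustration only; not part of Cook.

(defn valid-port-count?
  "True when ports is a non-negative integer count, not a list of port numbers."
  [ports]
  (and (integer? ports) (not (neg? ports))))

(defn port-txn
  "Returns transaction data for the port count, or nil when no ports are requested."
  [id ports]
  (when (and ports (not (zero? ports)))
    [[:db/add id :job/port ports]]))
```

With this shape, a job that omits `:ports` and a job that asks for zero ports both produce no `:job/port` datom.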

(s/optional-key :env) {NonEmptyString s/Str}
:cpus PosDouble
:mem PosDouble
@@ -67,10 +67,8 @@
[conn jobs :- [Job]]
(doseq [{:keys [uuid command max-retries max-runtime priority cpus mem user name ports uris env]} jobs
:let [id (d/tempid :db.part/user)
-ports (mapv (fn [port]
-;;TODO this schema might not work b/c all ports are zero
-[:db/add id :job/port port])
-ports)
+ports (when (and ports (not (zero? ports)))
+[[:db/add id :job/port ports]])
uris (mapcat (fn [{:keys [value executable? cache? extract?]}]
(let [uri-id (d/tempid :db.part/user)
optional-params {:resource.uri/executable? executable?
@@ -144,7 +142,7 @@
:priority (or priority util/default-job-priority)
:max-retries max_retries
:max-runtime (or max_runtime Long/MAX_VALUE)
-:ports (or ports [])
+:ports (or ports 0)
:cpus (double cpus)
:mem (double mem)}
(when uris
@@ -239,7 +237,7 @@
:status (name (:job/state job))
:uris (:uris resources)
:env (util/job-ent->env job)
-;;TODO include ports
+:ports (:job/port job 0)
:instances
(map (fn [instance]
(let [hostname (:instance/hostname instance)
@@ -252,6 +250,9 @@
end (:instance/end-time instance)
base {:task_id (:instance/task-id instance)
:hostname hostname
+;;TODO validate that these show up in API
+:backfilled (:instance/backfilled? instance false)
Contributor Author
This isn't exposed in the Java API

Contributor Author
fixed

+:preempted (:instance/preempted? instance false)
:slave_id (:instance/slave-id instance)
:executor_id (:instance/executor-id instance)
:status (name (:instance/status instance))}
8 changes: 4 additions & 4 deletions scheduler/src/cook/mesos/rebalancer.clj
@@ -170,7 +170,7 @@
user->sorted-running-task-ents (->> running-task-ents
(group-by util/task-ent->user)
(map (fn [[user task-ents]]
-[user (into (sorted-set-by util/same-user-task-comparator) task-ents)]))
+[user (into (sorted-set-by (util/same-user-task-comparator true)) task-ents)]))
(into {}))
task->scored-task (dru/init-task->scored-task user->sorted-running-task-ents user->dru-divisors)]
(->State task->scored-task user->sorted-running-task-ents host->spare-resources user->dru-divisors)))
@@ -195,7 +195,7 @@
(reduce (fn [task-ents-by-user task-ent]
(let [user (util/task-ent->user task-ent)
f (if (= new-running-task-ent task-ent)
-(fnil conj (sorted-set-by util/same-user-task-comparator))
+(fnil conj (sorted-set-by (util/same-user-task-comparator true)))
disj)]
(update-in task-ents-by-user [user] f task-ent)))
user->sorted-running-task-ents
@@ -220,6 +220,7 @@
pending-job-ent]
(let [{pending-job-mem :mem pending-job-cpus :cpus} (util/job-ent->resources pending-job-ent)
pending-job-dru (compute-pending-job-dru state pending-job-ent)

;; This will preserve the ordering of task->scored-task
host->scored-tasks (->> task->scored-task
(vals)
@@ -302,7 +303,7 @@
(try
@(d/transact
conn
-;; Make :instance/status and :instance/preempted consistent to simplify the state machine.
+;; Make :instance/status and :instance/preempted? consistent to simplify the state machine.
;; We don't want to deal with {:instance/status :instance.stats/running, :instance/preempted? true}
;; all over the places.
(let [job-eid (:db/id (:job/_instance task-ent))
@@ -342,7 +343,6 @@
(fn [now]
(let [host->combined-offers
Contributor
Which offers is this seeing? Pre or post fenzo? If pre, change the name since we are no longer combining offers here.

(-<>> (view-incubating-offers)
(sched/combine-offers)
(map (fn [v]
[(:hostname v) (assoc v :time-observed now)]))
(into {}))]