Fenzo job placement, rebased and with fixes #145

mforsyth · 2016-06-22T14:17:38Z

No description provided.

wyegelwel · 2016-06-24T10:19:23Z

scheduler/src/cook/mesos/api.clj

@@ -243,7 +241,7 @@
                  :priority (or priority util/default-job-priority)
                  :max-retries max_retries
                  :max-runtime (or max_runtime Long/MAX_VALUE)
-                  :ports (or ports [])
+                  :ports (or ports 0)


What happens to clusters that previously used a list of ports? Will upgrading to fenzo cause db issues?

The original fenzo branch attempted to change the type of the Datomic attribute "port", which did cause issues. This branch leaves the original attribute in place (though it's no longer used) and introduces a new attribute, "ports", which is what populates this part of the API response.
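
For illustration, here is a minimal sketch of the "add a new attribute instead of retyping the old one" approach described above. The idents `:job/port` and `:job/ports`, their value types and cardinalities, and the helper `ports-for-response` are all assumptions made for this example; they are not taken from the Cook schema.

```clojure
;; Hypothetical schema fragments, written as plain data.
;; The legacy attribute is left untouched so existing databases stay valid.
(def legacy-port-attr
  {:db/ident       :job/port                 ; assumed name
   :db/valueType   :db.type/long             ; assumed type
   :db/cardinality :db.cardinality/many
   :db/doc         "Legacy attribute; no longer read or written."})

;; The new attribute gets its own ident rather than a changed type.
(def ports-attr
  {:db/ident       :job/ports                ; assumed name
   :db/valueType   :db.type/long
   :db/cardinality :db.cardinality/one
   :db/doc         "Number of ports requested by the job."})

(defn ports-for-response
  "Mirrors the (or ports 0) default from the diff: an entity that only
   carries the legacy attribute reports zero requested ports."
  [job-entity]
  (or (:job/ports job-entity) 0))
```

Under these assumptions, a job written before the upgrade is still readable and simply shows up with a port count of 0.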

mforsyth · 2016-06-27T14:19:09Z

@wyegelwel I've made adjustments based on your comments. Wrapped a block in timers/time! rather than using start & stop, and added dru-scale for metrics as a documented configuration option.
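
As a rough illustration of why wrapping a block can be preferable to paired start and stop calls, the sketch below uses a hypothetical macro, `time-block!`, built only on the standard library. It is not the metrics library's `timers/time!`; it only shows the general shape, in which the measurement is recorded even if the body throws.

```clojure
(defmacro time-block!
  "Hypothetical stand-in for a timer macro. Evaluates body, appends the
   elapsed nanoseconds to the atom `durations`, and returns body's value.
   The finally clause records the duration even when body throws."
  [durations & body]
  `(let [start# (System/nanoTime)]
     (try
       ~@body
       (finally
         (swap! ~durations conj (- (System/nanoTime) start#))))))

;; Usage: the duration is recorded without an explicit stop call.
(def durations (atom []))
(time-block! durations (reduce + (range 1000)))
```

With separate start and stop calls, an exception between the two would skip the stop unless the caller adds the try/finally by hand.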

wyegelwel · 2016-06-28T17:44:23Z

scheduler/src/cook/mesos/scheduler.clj

+  (let [t (System/currentTimeMillis)
+        leases (mapv #(->VirtualMachineLeaseAdapter % t) offers)
+        requests (mapv (fn [job]
+                         (let [job-id (:db/id job)]


Where do we use this job-id?

Nowhere. It looks like leftover cruft; there was probably a log statement that included it at some point.

wyegelwel · 2016-07-01T16:16:14Z

scheduler/src/cook/mesos/scheduler.clj

+   be accepted or rejected at the end of the function."
+  [conn driver ^TaskScheduler fenzo fid pending-jobs num-considerable offers-chan offers]
+  (log/debug "invoked handle-resource-offers!")
+  (let [offer-stash (atom nil)] ;; This is a way to ensure we never lose offers fenzo assigned if an errors occures in the middle of processing


Can you think of a way to avoid making the offer-stash an atom?

The issue (and the reason I believe an atom was used) is that a try block introduces its own lexical scope. If we want the try block's exception handling to reset the offer stash with a value that is set inside the try block, an atom is the simplest way to do that, since an atom (unlike a let-bound local) can acquire a value that is available outside the scope where it is set.

I believe we could avoid using an atom, but we would have to rearrange this whole function significantly and partially change the way exceptions are handled. We could put the parts that precede (reset! offer-stash) in one try block (maybe even a separate function) that returns a tuple of "considerable" and "matches", then declare the offer stash, then start a new try block that does the rest of what the existing function does (and handles exceptions in the same way).
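
To make the trade-off concrete, here is a small sketch of both shapes. Every name in it (`match-offers`, `launch!`, `handle-with-atom`, `handle-two-phase`, `return-offers!`) is a hypothetical stand-in, not a function from the Cook code base, and the matching and launching steps are reduced to toys.

```clojure
(defn match-offers
  "Toy matching step: returns the offers that were assigned."
  [offers]
  (filterv :usable? offers))

(defn launch!
  "Toy launch step that throws when fail? is true."
  [matched fail?]
  (when fail?
    (throw (ex-info "launch failed" {:matched matched})))
  :launched)

;; Variant 1: atom-based stash. The catch clause can see whatever was
;; stored before the failure, because the atom lives outside the try.
(defn handle-with-atom
  [offers fail? return-offers!]
  (let [stash (atom nil)]
    (try
      (let [matched (match-offers offers)]
        (reset! stash matched)
        (launch! matched fail?))
      (catch Exception _
        (return-offers! @stash)
        :failed))))

;; Variant 2: two phases, no atom. The first try produces the value,
;; an ordinary let binds it, and the second try can use it in its catch.
(defn handle-two-phase
  [offers fail? return-offers!]
  (let [matched (try
                  (match-offers offers)
                  (catch Exception _ nil))]
    (if (nil? matched)
      :failed
      (try
        (launch! matched fail?)
        (catch Exception _
          (return-offers! matched)
          :failed)))))
```

The second variant avoids mutable state, but it splits the error handling in two, which is the rearrangement cost mentioned above.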

wyegelwel · 2016-07-05T14:04:10Z

scheduler/docs/rebalancer-config.asc

+
+=== Significance of the parameters
+
+* safe-dru-threshold: Task with a DRU lower than safe-dru-threshold will not be preempted. If each DRU divisor is set to the corresponding per user share and safe-dru-threshold is set to 1.0, then tasks that consume resources in aggregate less than the user resource share will not be preempted.


Cook currently sets the DRU divisor to the user's share, correct? Or is this something different?

Correct - it's the user's share:
user->dru-divisors (->> all-users
                        (map (fn [user]
                               [user (share/get-share db user)]))
                        (into {}))

Then I think you can simplify that sentence to be:

"If safe-dru-threshold is set to 1.0, then tasks that consume resources in aggregate less than the user resource share will not be preempted"
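
A small worked example may help show what that sentence means in practice. It is a sketch under stated assumptions: a task's DRU is taken to be the largest ratio, across resources, of the user's running total (in some fixed task order) to the divisor. The names `cumulative-drus` and `safe-from-preemption?` and the sample numbers are invented for illustration.

```clojure
(defn cumulative-drus
  "Returns one DRU per task. Assumption: the DRU of the n-th task is the
   maximum over resources of (sum of the first n tasks' usage) / divisor."
  [tasks divisors]
  (let [zero (zipmap (keys divisors) (repeat 0))]
    (->> tasks
         (reductions #(merge-with + %1 %2) zero)
         rest ; drop the initial all-zero total
         (mapv (fn [total]
                 (apply max
                        (map (fn [[resource divisor]]
                               (/ (get total resource 0) divisor))
                             divisors)))))))

(defn safe-from-preemption?
  "A task whose DRU is below the threshold is not preempted."
  [dru safe-dru-threshold]
  (< dru safe-dru-threshold))

;; Divisors equal to a hypothetical user share, threshold 1.0.
(def share {:mem 100 :cpus 4})
(def tasks [{:mem 50 :cpus 1} {:mem 40 :cpus 1} {:mem 30 :cpus 1}])

(mapv #(safe-from-preemption? % 1.0) (cumulative-drus tasks share))
;; => [true true false]
```

The first two tasks together stay within the share (90 of 100 memory units), so they are safe; the third pushes the aggregate to 120 and becomes a preemption candidate.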

dgrnbrg · 2016-07-28T19:42:51Z

Congrats!

wyegelwel reviewed Jun 24, 2016
mforsyth force-pushed the mf/sims_with_fenzo branch 3 times, most recently from 4e55aba to dc804bb on June 27, 2016 14:16

mforsyth force-pushed the mf/sims_with_fenzo branch from dc804bb to 183858e on June 27, 2016 15:54

wyegelwel reviewed Jun 28, 2016
mforsyth force-pushed the mf/sims_with_fenzo branch 3 times, most recently from 126ad23 to e715a08 on June 29, 2016 13:46

wyegelwel reviewed Jul 1, 2016
wyegelwel mentioned this pull request Jul 4, 2016

Add gpu support to Cook #152

Closed

mforsyth force-pushed the mf/sims_with_fenzo branch from ab75523 to 7f8fcc4 on July 5, 2016 10:58

wyegelwel reviewed Jul 5, 2016
mforsyth added 27 commits July 28, 2016 09:59

scheduler: Log error before re-piping offer stash.

4cd319d

scheduler: Describe Fenzo fallback policy

4e76f8c

scheduler: Don't double count offer sizes in histogram on error.

73c3de0

scheduler: Reuse job-allowed-to-start?

3a32350

scheduler: If handling offers throws, don't penalize Fenzo.

cca2fe4

scheduler: Make millis->second conversion more explicit.

3540704

scheduler: Make a separate function for comparing with backfill

50e2bb5

scheduler: docs: Add a section about resource shares.

d47ae19

scheduler: Describe that backfilled tasks don't make a job running

0b90950

scheduler: Document why we return true on handle-resource-offers! error

addcf3d

scheduler: Document why scheduler still cares about backfilled tasks.

fe49ef5

scheduler: Make fenzo scaleback values configurable.

752bee2

scheduler: Allow upgrading backfilled unknown instances.

4ddc550

Hopefully this will avoid the situation where a persistent backfilled unknown instance causes basically everything to get marked as backfilled until it is reconciled.

scheduler: Remove upgraded backfilled instances from new scheduler contents

9bb7c3a

scheduler: Don't error when summing resources of incomplete jobs.

b77e37e

(Probably will only ever matter for unit tests).

scheduler: Optimize process-matches-for-backfill for readability.

c356e3d

scheduler: Add metrics for process-matches-for-backfill

b84bc90

scheduler: Store assigned ports for instances.

3aaf017

scheduler: Add missing docs for "ports" api request attribute

a8800f0

scheduler: Classify a job with running backfilled tasks as "running".

306ee16

scheduler: tests: Allow specifying job name in test fixtures.

6f6ea65

scheduler: tests: Add job names to text fixtures.

a41ef3d

(Helps for debugging).

scheduler: tests: Fix a test that was incorrectly passing before.

ab6beb3

(It was specifying incorrect behavior, where the head-of-considerable WAS matched, yet matched-head? was false).

scheduler: Explain why scheduler needs to consider backfilled instances

a31b7c7

scheduler: Don't penalize Fenzo when there are no matches at all.

f99e3b2

scheduler: Reset Fenzo considerable if it can't match top job.

659af43

Also, track the situation as it unfolds and log and record its progress via metrics.

scheduler: Document new fenzo-related scheduler config params.

97767cd

mforsyth force-pushed the mf/sims_with_fenzo branch from 2f0d3f8 to 97767cd on July 28, 2016 13:59

wyegelwel merged commit 7a49fbb into twosigma:master Jul 28, 2016
