-
Notifications
You must be signed in to change notification settings - Fork 63
Fenzo job placement, rebased and with fixes #145
Conversation
@@ -243,7 +241,7 @@ | |||
:priority (or priority util/default-job-priority) | |||
:max-retries max_retries | |||
:max-runtime (or max_runtime Long/MAX_VALUE) | |||
:ports (or ports []) | |||
:ports (or ports 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens to clusters that previously used a list of ports? Will upgrading to fenzo cause db issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original fenzo branch attempted to change the the type of the Datomic attribute "port", which did cause issues. This branch leaves the original attribute there (though it's no longer used), and introduces a new attribute "ports", which is what would populate this part of the API response.
4e55aba
to
dc804bb
Compare
@wyegelwel I've made adjustments based on your comments. Wrapped a block in timers/time! rather than using start & stop, and added dru-scale for metrics as a documented configuration option. |
dc804bb
to
183858e
Compare
(let [t (System/currentTimeMillis) | ||
leases (mapv #(->VirtualMachineLeaseAdapter % t) offers) | ||
requests (mapv (fn [job] | ||
(let [job-id (:db/id job)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we use this job-id
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nowhere. It looks like leftover cruft, there was probably a log statement including it at some point.
126ad23
to
e715a08
Compare
be accepted or rejected at the end of the function." | ||
[conn driver ^TaskScheduler fenzo fid pending-jobs num-considerable offers-chan offers] | ||
(log/debug "invoked handle-resource-offers!") | ||
(let [offer-stash (atom nil)] ;; This is a way to ensure we never lose offers fenzo assigned if an errors occures in the middle of processing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you think of way to avoid making the offer-stash an atom?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue (and reason I believe an atom was used) is that a Try Block introduces its own lexical scope. If we want the exception handling of the try block to reset the offer stash with a value that is set inside the try block, an atom is the simplest way to do that, since the atom (unlike vars) can acquire a value that is available outside of the scope where it is set.
I believe we could avoid using an atom, but we would have to rearrange this whole function significantly, and partially change the way exceptions are handled in order to do so. We could have the parts that precede (reset! offer-stash) in one try block (maybe even a separate function), that returns a tuple of "considerable" and "matches" ... then declare the offer stash, then start a new try block that does the rest of what the existing function does (and handles exceptions in the same way).
ab75523
to
7f8fcc4
Compare
|
||
=== Significance of the parameters | ||
|
||
* safe-dru-threshold: Task with a DRU lower than safe-dru-threshold will not be preempted. If each DRU divisor is set to the corresponding per user share and safe-dru-threshold is set to 1.0, then tasks that consume resources in aggregate less than the user resource share will not be preempted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cook currently sets the DRU divisor to the users share, correct? Or is this something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - it's the user's share:
user->dru-divisors (->> all-users
(map (fn [user]
[user (share/get-share db user)]))
(into {}))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I think you can simplify that sentence to be:
"If safe-dru-threshold is set to 1.0, then tasks that consume resources in aggregate less than the user resource share will not be preempted"
Hopefully this will avoid the situation where a persistent backfilled unknown instance causes basically everything to get marked as backfilled until it is reconciled.
(Probably will only ever matter for unit tests).
(Helps for debugging).
(It was specifying incorrect behavior, where the head-of-considerable WAS matched, yet matched-head? was false).
Also, track the situation as it unfolds and log and record its progress via metrics.
2f0d3f8
to
97767cd
Compare
Congrats! |
No description provided.