Skip to content
This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

Fenzo job placement, rebased and with fixes #145

Merged
merged 33 commits into from
Jul 28, 2016

Conversation

mforsyth
Copy link
Contributor

No description provided.

@@ -243,7 +241,7 @@
:priority (or priority util/default-job-priority)
:max-retries max_retries
:max-runtime (or max_runtime Long/MAX_VALUE)
:ports (or ports [])
:ports (or ports 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to clusters that previously used a list of ports? Will upgrading to fenzo cause db issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original fenzo branch attempted to change the the type of the Datomic attribute "port", which did cause issues. This branch leaves the original attribute there (though it's no longer used), and introduces a new attribute "ports", which is what would populate this part of the API response.

@mforsyth mforsyth force-pushed the mf/sims_with_fenzo branch 3 times, most recently from 4e55aba to dc804bb Compare June 27, 2016 14:16
@mforsyth
Copy link
Contributor Author

@wyegelwel I've made adjustments based on your comments. Wrapped a block in timers/time! rather than using start & stop, and added dru-scale for metrics as a documented configuration option.

(let [t (System/currentTimeMillis)
leases (mapv #(->VirtualMachineLeaseAdapter % t) offers)
requests (mapv (fn [job]
(let [job-id (:db/id job)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we use this job-id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nowhere. It looks like leftover cruft, there was probably a log statement including it at some point.

@mforsyth mforsyth force-pushed the mf/sims_with_fenzo branch 3 times, most recently from 126ad23 to e715a08 Compare June 29, 2016 13:46
be accepted or rejected at the end of the function."
[conn driver ^TaskScheduler fenzo fid pending-jobs num-considerable offers-chan offers]
(log/debug "invoked handle-resource-offers!")
(let [offer-stash (atom nil)] ;; This is a way to ensure we never lose offers fenzo assigned if an errors occures in the middle of processing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think of way to avoid making the offer-stash an atom?

Copy link
Contributor Author

@mforsyth mforsyth Jul 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue (and reason I believe an atom was used) is that a Try Block introduces its own lexical scope. If we want the exception handling of the try block to reset the offer stash with a value that is set inside the try block, an atom is the simplest way to do that, since the atom (unlike vars) can acquire a value that is available outside of the scope where it is set.

I believe we could avoid using an atom, but we would have to rearrange this whole function significantly, and partially change the way exceptions are handled in order to do so. We could have the parts that precede (reset! offer-stash) in one try block (maybe even a separate function), that returns a tuple of "considerable" and "matches" ... then declare the offer stash, then start a new try block that does the rest of what the existing function does (and handles exceptions in the same way).


=== Significance of the parameters

* safe-dru-threshold: Task with a DRU lower than safe-dru-threshold will not be preempted. If each DRU divisor is set to the corresponding per user share and safe-dru-threshold is set to 1.0, then tasks that consume resources in aggregate less than the user resource share will not be preempted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cook currently sets the DRU divisor to the users share, correct? Or is this something different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - it's the user's share:
user->dru-divisors (->> all-users
(map (fn [user]
[user (share/get-share db user)]))
(into {}))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I think you can simplify that sentence to be:

"If safe-dru-threshold is set to 1.0, then tasks that consume resources in aggregate less than the user resource share will not be preempted"

Hopefully this will avoid the situation where a persistent backfilled
unknown instance causes basically everything to get marked as
backfilled until it is reconciled.
(Probably will only ever matter for unit tests).
(It was specifying incorrect behavior, where the head-of-considerable
WAS matched, yet matched-head? was false).
Also, track the situation as it unfolds and log and record its progress
via metrics.
@wyegelwel wyegelwel merged commit 7a49fbb into twosigma:master Jul 28, 2016
@dgrnbrg
Copy link
Contributor

dgrnbrg commented Jul 28, 2016

Congrats!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants