Only reserves resources for workers that have recently called get_work #1272
Conversation
I'm confused. I thought you could only hold on to a resource by running a task that contains resources. No?
No, in get_work we also do speculative scheduling and reserve resources for higher-priority tasks. That way, if you have a worker with a lot of low-priority tasks, it can't hold on to all the resources just because it's the first to ask when one of its jobs finishes. The problem this fixes is that I was scheduling a very large backfill, and it included some high-priority tasks that blocked resource usage for the 30+ minutes that scheduling took.
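To make the behavior described above concrete, here is a minimal sketch of speculative resource reservation during get_work. The names and data shapes are illustrative only, not Luigi's actual scheduler API: pending tasks are walked in priority order, and a high-priority task that cannot run yet still reserves its resources so lower-priority tasks can't claim them first.

```python
def get_work(available, pending):
    """Return the id of the first runnable task, reserving resources
    for higher-priority tasks that cannot run yet.

    available: dict of resource name -> total units free
    pending:   list of (priority, task_id, needs) tuples, where needs
               is a dict of resource name -> units required
    """
    reserved = {}
    # Walk tasks from highest to lowest priority.
    for priority, task_id, needs in sorted(pending, key=lambda t: -t[0]):
        fits = all(
            available.get(r, 0) - reserved.get(r, 0) >= n
            for r, n in needs.items()
        )
        if fits:
            return task_id
        # Not enough free resources: reserve them anyway for this
        # higher-priority task, so no lower-priority task can grab
        # them in the meantime (the speculative part).
        for r, n in needs.items():
            reserved[r] = reserved.get(r, 0) + n
    return None


# With 1 gpu free, the high-priority task (needs 2) can't run, but it
# reserves the gpu, so the low-priority task is blocked too.
print(get_work({"gpu": 1},
               [(10, "high", {"gpu": 2}), (1, "low", {"gpu": 1})]))
```

Without the reservation step, the low-priority task would grab the single gpu here, and the high-priority task could be starved indefinitely.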
@@ -343,6 +343,7 @@ def test_do_not_lock_resources_when_not_ready(self):
        self.assertEqual('C', self.sch.get_work(worker='Y')['task_id'])

    def test_lock_resources_when_one_of_multiple_workers_is_ready(self):
        self.sch.get_work(worker='X')
Can you add a comment for these new lines explaining why you're doing them?
Even with the context of this PR, I'm not certain. I suppose it's to make the worker "active"?
yeah, comment added
I wonder if this really is a good solution. Maybe it would be better to "just implement" that the scheduling and get_work phases happen in parallel. Otherwise this LGTM.
I'd still want this solution because it helps when a worker stops requesting more jobs, for example after a SIGUSR1. I think long-term we probably want a better solution for how the reservations work, but I think that would involve basing reservations on get_work calls too.
Force-pushed from 57bc755 to 51a6c8e.
Ah, the SIGUSR1 case makes sense. I'm off now but I'll review it again and potentially merge by tomorrow.
Makes sense to me. Thanks @daveFNbuck!
Only reserves resources for workers that have recently called get_work
This prevents workers that are scheduling or being phased out from
holding on to resources that they're not ready to use yet or will never
request. To accomplish this, we only reserve resources for workers that
have recently made a get_work request. We define "recently" as within
the worker_disconnect_delay. Really this should be the wait_interval,
but the scheduler doesn't have access to that value.
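The recency check described in the commit message can be sketched as follows. This is a hedged illustration, not Luigi's actual implementation: the function names, the module-level dict, and the delay value are all hypothetical, standing in for the scheduler's per-worker bookkeeping and its worker_disconnect_delay setting.

```python
import time

# Illustrative stand-in for the scheduler's worker_disconnect_delay
# config value (the commit message notes wait_interval would be the
# more correct choice, but the scheduler can't see it).
WORKER_DISCONNECT_DELAY = 60.0  # seconds

# worker_id -> timestamp of that worker's most recent get_work call
last_get_work = {}


def record_get_work(worker_id, now=None):
    """Record that a worker just asked the scheduler for work."""
    last_get_work[worker_id] = time.time() if now is None else now


def should_reserve_for(worker_id, now=None):
    """Reserve resources only for workers that called get_work recently.

    Workers that are still scheduling, or that have stopped asking for
    work (e.g. after a SIGUSR1), fail this check and so can no longer
    pin resources they may never use.
    """
    now = time.time() if now is None else now
    last = last_get_work.get(worker_id)
    return last is not None and now - last < WORKER_DISCONNECT_DELAY
```

Usage: the scheduler would call record_get_work inside its get_work handler, and consult should_reserve_for before reserving resources for a worker's high-priority pending tasks.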