-
Notifications
You must be signed in to change notification settings - Fork 968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve: continue allocating if remain tasks is more than job's minMember #3430
Conversation
/assign @Monokaix @hwdef @william-wang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this PR does two things. But use one commit
Have splited them into two commits. |
@Monokaix @hwdef @wangyang0616 @william-wang please help to review it. thanks. |
I think the others are fine |
bfe1bb4
to
c4c8c74
Compare
The are 4 commits but the description has 2 commit. |
… tasks are not used up Signed-off-by: lowang-bh <lhui_wang@163.com>
…tring and has nothing to do with taskID Signed-off-by: lowang-bh <lhui_wang@163.com>
Signed-off-by: lowang-bh <lhui_wang@163.com> rebase from master Signed-off-by: lowang-bh <lhui_wang@163.com>
A little doubt, in what case will a work pod has higher priority than ps? And what if pods in a vcjob have the same priority? |
worker with GPU request, can have a higher priority, so as to allocate gpu pod first. |
With no priority setting, the task order will be default compare stragety at volcano/pkg/controllers/job/helpers/helpers.go Lines 51 to 66 in 3483685
|
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
… fit error so as to save predicating time fix comments: use taskRole to store the value of 'volcano.sh/task-spec' in taskInfo Signed-off-by: lowang-bh <lhui_wang@163.com> Signed-off-by: lowang-bh <lhui_wang@163.com> Signed-off-by: lowang-bh <lhui_wang@163.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: william-wang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
As PR #1459 has supported minAvailable at task level.
#3057 has improved and fixed the minResource calculation for a volcano job.
This PR continue to improve and fix the scheduling.
Consider this scene:
part of pods of a job is not schedulable, and left pods are schedulable, and the number can meet up the condition of min member.
example: cluster has only 4GPU and enough CP/MEM, a job with 1ps + 8worker, each worker request 1GPU and has higher priority than ps. So workers will be scheduled first and failed at 5th worker. After that, job will be failed scheduling and break out from allocate.
So this pr fix this issue: if allocated pod less than minMember and tasks are not used up, will continue on, untill the number of remained tasks can not meet up the condition of min member
commit 1: continue allocating and testcases about the feature change
commit 2: use string type to replace the taskID type, which indicate a role spec of each task in a job
commit 3: consider job's each task's min member when check if it can continue on allocating
commit 4: parse and store task spec in taskInfo, so that it doesn't need to parse it everywhere and whenerver it is used.
And add pre-check if task has fit error so as to save predicating time and avoid the useless computing. Testing with a 100master+100worker job has no obvious latancy comparing without this PR.
An Example Yaml Without Priority setting
with the following job yaml on a cluster with available CPU=10
Test Example Yaml
with the following job yaml on a cluster with available CPU<4
without this pr on master branch
with this pr