Description
Please describe your problem in detail
Volcano version: v1.11
Resources: two NVIDIA A100 nodes, each with 8 GPUs. Total resources: 16 GPUs, 245 CPUs, 3950Gi memory.
I want to configure two queues, online and offline, under a parent queue. All resources of the parent queue should be available to the online queue with higher priority; the offline queue should only use resources that are idle in the parent queue. If a new job submitted to the online queue can be made runnable by reclaiming resources from the offline queue, some tasks in the offline queue should be evicted.
Problem:
While verifying the reclaim action, I first submitted two jobs of 8 GPUs/115c/1750Gi/1 replica to the offline queue, so the offline queue used all resources of the parent queue. Then I submitted two more jobs of 2 GPUs/30c/400Gi/1 replica to the offline queue. The first two jobs were enqueued and running; the second two jobs could not enqueue because the resource quota was insufficient. Everything was normal up to this point.
Then I submitted a job of 1 GPU/15c/200Gi/1 replica to the online queue. One of the two tasks running in the offline queue was terminated, and the job in the online queue changed to running. However, the two jobs of 2 GPUs/30c/400Gi/1 replica in the offline queue still failed to enqueue, even though the parent queue actually had 7 GPUs/115c/2000Gi free.
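Accounting by hand after the reclaim, using the job sizes above: the parent queue has 16 GPUs/245c/3950Gi in total; one 8 GPU/115c/1750Gi offline job is still running and the new 1 GPU/15c/200Gi online job is running, so 16 - 8 - 1 = 7 GPUs, 245 - 115 - 15 = 115c, and 3950 - 1750 - 200 = 2000Gi should be free, which is more than enough for a 2 GPU/30c/400Gi job.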
My environment is configured as follows.
There are three queues: ailab-cv-a100, ailab-cv-a100-online, and ailab-cv-a100-offline. They are configured as follows:
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100
spec:
  parent: queue-a100
  capability:
    nvidia.com/gpu: "16"
    cpu: 245
    memory: 3950Gi
    #pods: 400
    vke.volcengine.com/eni-ip: 400
  deserved:
    nvidia.com/gpu: "16"
    cpu: 245
    memory: 3950Gi
    #pods: 400
    vke.volcengine.com/eni-ip: 400
  guarantee:
    resource:
      nvidia.com/gpu: "16"
      cpu: 245
      #pods: 400
      vke.volcengine.com/eni-ip: 400
      memory: 3950Gi
  reclaimable: false
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100-online
spec:
  parent: ailab-cv-a100
  deserved:
    cpu: 243
    memory: 3900Gi
    nvidia.com/gpu: 16
    #pods: 200
    vke.volcengine.com/eni-ip: 200
  reclaimable: false
  weight: 1
  priority: 100
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100-offline
spec:
  parent: ailab-cv-a100
  deserved:
    cpu: 1
    memory: 1Gi
    nvidia.com/gpu: "0"
    vke.volcengine.com/eni-ip: 200
    #pods: 200
  reclaimable: true
  weight: 1
  priority: 0
volcano-scheduler-configmap configuration:
# Source: volcano/templates/scheduler.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.NodeAffinityEnable: true
          predicate.NodePortsEnable: true
          predicate.TaintTolerationEnable: true
          predicate.PodAffinityEnable: true
          predicate.NodeVolumeLimitsEnable: true
          predicate.VolumeZoneEnable: true
          predicate.PodTopologySpreadEnable: true
          predicate.CacheEnable: true
          predicate.ProportionalEnable: true
          predicate.resources: nvidia.com/gpu
      # - name: overcommit
      # - name: proportion
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.weight: 100
          # cpu resource weight
          binpack.cpu: 1
          # memory resource weight
          binpack.memory: 1
          # other resource types, e.g. GPU
          binpack.resources: "nvidia.com/gpu"
          # weights for the other resource types, e.g. GPU
          binpack.resources.nvidia.com/gpu: 98
      - name: capacity
        enableHierarchy: true
Any other relevant information
Scheduler logs:
I0425 02:36:10.144463 1 enqueue.go:45] Enter Enqueue ...
I0425 02:36:10.144470 1 enqueue.go:63] Added Queue <ailab-cv-a100-offline> for Job <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247>
I0425 02:36:10.144480 1 enqueue.go:74] Added Job <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144486 1 enqueue.go:74] Added Job <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144503 1 priority.go:70] Priority JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> priority: 10, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> priority: 10
I0425 02:36:10.144514 1 gang.go:118] Gang JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> is ready: false, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> is ready: false
I0425 02:36:10.144528 1 drf.go:325] DRF JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> share state: 0, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> share state: 0
I0425 02:36:10.144537 1 enqueue.go:74] Added Job <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144541 1 priority.go:70] Priority JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> priority: 10, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> priority: 10
I0425 02:36:10.144547 1 gang.go:118] Gang JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> is ready: false, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> is ready: false
I0425 02:36:10.144551 1 drf.go:325] DRF JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> share state: 0, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> share state: 0
I0425 02:36:10.144557 1 enqueue.go:63] Added Queue <ailab-cv-a100-online> for Job <training-job/test-06cot-66e8b266-f8f7-404d-8c04-1755938b04eb>
I0425 02:36:10.144564 1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I0425 02:36:10.144568 1 priority.go:70] Priority JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> priority: 10, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> priority: 10
I0425 02:36:10.144584 1 gang.go:118] Gang JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> is ready: false, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> is ready: false
I0425 02:36:10.144587 1 drf.go:325] DRF JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> share state: 0, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> share state: 0
I0425 02:36:10.144615 1 capacity.go:229] job test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804 min resource <cpu 15000.00, memory 214748364800.00, nvidia.com/gpu 1000.00, pods 1.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00> allocated <cpu 115000.00, memory 1879048192000.00, pods 1.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144623 1 capacity.go:235] job test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804 inqueue false
I0425 02:36:10.144677 1 capacity.go:229] job test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7 min resource <cpu 60000.00, memory 966367641600.00, nvidia.com/gpu 4000.00, pods 1.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00> allocated <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00, pods 1.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144686 1 capacity.go:235] job test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7 inqueue false
I0425 02:36:10.144714 1 capacity.go:229] job test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247 min resource <cpu 30000.00, memory 429496729600.00, pods 1.00, nvidia.com/gpu 2000.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00> allocated <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00, pods 1.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144736 1 capacity.go:235] job test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247 inqueue false
I0425 02:36:10.144747 1 enqueue.go:104] Leaving Enqueue ...
According to https://github.com/volcano-sh/volcano/blob/release-1.11/pkg/scheduler/plugins/capacity/capacity.go#L224, the enqueue check is:
r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
For the 2-GPU job this gives: gpu = 2 (request) + 8 (allocated) + 8 (inqueue) - 0 = 18 > 16 (total GPUs), so the job cannot enqueue.
In this case, are the allocated and inqueue values counting the same running job twice?
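To make the question concrete, here is a minimal single-dimension sketch of that check in Go, using the GPU values from the capacity.go log lines above. The real plugin works on multi-dimensional Resource objects; the names and the one-dimension simplification here are mine, not the plugin code.

package main

import "fmt"

func main() {
	// GPU values (in milli-units) taken from the capacity.go:229 log line for job test-93cjh.
	const (
		capability = 16000.0 // queue capability for nvidia.com/gpu
		minReq     = 2000.0  // min resource of the pending 2-GPU job
		allocated  = 8000.0  // the running 8-GPU offline job
		inqueue    = 8000.0  // the same running job, apparently counted again as inqueue
		elastic    = 0.0
	)

	// Simplified form of: r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
	r := minReq + allocated + inqueue - elastic
	fmt.Printf("r = %.0f, capability = %.0f, enqueue allowed: %v\n", r, capability, r <= capability)
	// Prints: r = 18000, capability = 16000, enqueue allowed: false
	// If the running job were counted only once (in allocated or in inqueue, not both),
	// r would be 10000 and the 2-GPU job would be allowed to enqueue.
}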