
The queue resources are sufficient but enqueue failed using the capacity component #4235

@leizhenjie-yfd

Description

Please describe your problem in detail

Volcano version: v1.11
Resources: two NVIDIA A100 nodes, each with 8 GPUs. Total resources: 16 GPUs, CPU: 245, memory: 3950Gi

I want to configure two queues, online and offline, under a parent queue. All resources in the parent queue have higher priority for use by the online queue; the offline queue is used only when there are idle resources in the parent queue. If a new job is submitted to the online queue and it can be made runnable by reclaiming resources from the offline queue, some tasks in the offline queue will be evicted.

Problem:
While verifying the reclaim action, I first submitted two jobs with 8 GPUs/115c/1750Gi/1 replica each to the offline queue, and the offline queue used all the resources in the parent queue. Then I submitted two jobs with 2 GPUs/30c/400Gi/1 replica each to the offline queue. The first two jobs were enqueued and running; the second two could not be enqueued because the resource quota was insufficient. Everything is normal up to here.
Then I submitted a job with 1 GPU/15c/200Gi/1 replica to the online queue. One of the two tasks running in the offline queue was terminated, and the job in the online queue changed to Running. But the two jobs with 2 GPUs/30c/400Gi/1 replica in the offline queue still failed to enqueue, even though the parent queue actually has 7 GPUs/115c/2000Gi free.

My environment configuration is as follows:
There are three queues: ailab-cv-a100, ailab-cv-a100-online, and ailab-cv-a100-offline, configured as follows:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100
spec:
  parent: queue-a100
  capability:
    nvidia.com/gpu: "16"
    cpu: 245
    memory: 3950Gi
    #pods: 400
    vke.volcengine.com/eni-ip: 400
  deserved:
    nvidia.com/gpu: "16"
    cpu: 245
    memory: 3950Gi
    #pods: 400
    vke.volcengine.com/eni-ip: 400
  guarantee:
    resource:
      nvidia.com/gpu: "16"
      cpu: 245
      #pods: 400
      vke.volcengine.com/eni-ip: 400
      memory: 3950Gi
  reclaimable: false
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100-online
spec:
  parent: ailab-cv-a100
  deserved:
    cpu: 243
    memory: 3900Gi
    nvidia.com/gpu: 16
    #pods: 200
    vke.volcengine.com/eni-ip: 200
  reclaimable: false
  weight: 1
  priority: 100
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100-offline
spec:
  parent: ailab-cv-a100
  deserved:
    cpu: 1
    memory: 1Gi
    nvidia.com/gpu: "0"
    vke.volcengine.com/eni-ip: 200
    #pods: 200
  reclaimable: true
  weight: 1
  priority: 0

volcano-scheduler-configmap configuration:

# Source: volcano/templates/scheduler.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.NodeAffinityEnable: true
          predicate.NodePortsEnable: true
          predicate.TaintTolerationEnable: true
          predicate.PodAffinityEnable: true
          predicate.NodeVolumeLimitsEnable: true
          predicate.VolumeZoneEnable: true
          predicate.PodTopologySpreadEnable: true
          predicate.CacheEnable: true
          predicate.ProportionalEnable: true
          predicate.resources: nvidia.com/gpu
      # - name: overcommit
      # - name: proportion
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.weight: 100
          # weight for the CPU resource
          binpack.cpu: 1
          # weight for the memory resource
          binpack.memory: 1
          # other resource types, such as GPU
          binpack.resources: "nvidia.com/gpu"
          # weights for other resources, such as GPU
          binpack.resources.nvidia.com/gpu: 98
      - name: capacity
        enableHierarchy: true

Any other relevant information

Scheduler logs:

I0425 02:36:10.144463       1 enqueue.go:45] Enter Enqueue ...
I0425 02:36:10.144470       1 enqueue.go:63] Added Queue <ailab-cv-a100-offline> for Job <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247>
I0425 02:36:10.144480       1 enqueue.go:74] Added Job <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144486       1 enqueue.go:74] Added Job <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144503       1 priority.go:70] Priority JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> priority: 10, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> priority: 10
I0425 02:36:10.144514       1 gang.go:118] Gang JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> is ready: false, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> is ready: false
I0425 02:36:10.144528       1 drf.go:325] DRF JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> share state: 0, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> share state: 0
I0425 02:36:10.144537       1 enqueue.go:74] Added Job <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144541       1 priority.go:70] Priority JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> priority: 10, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> priority: 10
I0425 02:36:10.144547       1 gang.go:118] Gang JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> is ready: false, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> is ready: false
I0425 02:36:10.144551       1 drf.go:325] DRF JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> share state: 0, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> share state: 0
I0425 02:36:10.144557       1 enqueue.go:63] Added Queue <ailab-cv-a100-online> for Job <training-job/test-06cot-66e8b266-f8f7-404d-8c04-1755938b04eb>
I0425 02:36:10.144564       1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I0425 02:36:10.144568       1 priority.go:70] Priority JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> priority: 10, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> priority: 10
I0425 02:36:10.144584       1 gang.go:118] Gang JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> is ready: false, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> is ready: false
I0425 02:36:10.144587       1 drf.go:325] DRF JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> share state: 0, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> share state: 0
I0425 02:36:10.144615       1 capacity.go:229] job test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804 min resource <cpu 15000.00, memory 214748364800.00, nvidia.com/gpu 1000.00, pods 1.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00> allocated <cpu 115000.00, memory 1879048192000.00, pods 1.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144623       1 capacity.go:235] job test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804 inqueue false
I0425 02:36:10.144677       1 capacity.go:229] job test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7 min resource <cpu 60000.00, memory 966367641600.00, nvidia.com/gpu 4000.00, pods 1.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00> allocated <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00, pods 1.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144686       1 capacity.go:235] job test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7 inqueue false
I0425 02:36:10.144714       1 capacity.go:229] job test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247 min resource <cpu 30000.00, memory 429496729600.00, pods 1.00, nvidia.com/gpu 2000.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00> allocated <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00, pods 1.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144736       1 capacity.go:235] job test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247 inqueue false
I0425 02:36:10.144747       1 enqueue.go:104] Leaving Enqueue ...

According to https://github.com/volcano-sh/volcano/blob/release-1.11/pkg/scheduler/plugins/capacity/capacity.go#L224, the check is:
r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
For the 2-GPU job: gpu = 2 (request) + 8 (allocated) + 8 (inqueue) - 0 (elastic) = 18 > total of 16 GPUs, so the enqueue is rejected.
In this case, are the allocated and inqueue values counting the same running job twice?
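To make the suspected double-counting concrete, here is a minimal sketch in Go using plain floats in place of Volcano's `api.Resource` type (the variable names mirror the scheduler log above; they are illustrative, not the actual plugin code):

```go
package main

import "fmt"

func main() {
	// All values in milli-GPUs, taken from the capacity.go:229 log line.
	const capability = 16000.0 // queue capability: 16 GPUs
	minReq := 2000.0           // pending job min request: 2 GPUs
	allocated := 8000.0        // the running 8-GPU job, counted as allocated
	inqueue := 8000.0          // the same running job, apparently counted as inqueue too
	elastic := 0.0

	// Mirrors: r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
	r := minReq + allocated + inqueue - elastic
	fmt.Printf("r = %.0f, capability = %.0f, enqueue allowed: %v\n",
		r, capability, r <= capability)
	// prints: r = 18000, capability = 16000, enqueue allowed: false
}
```

If the running job were counted only once (either in allocated or in inqueue, not both), r would be 10000 ≤ 16000 and the 2-GPU job would enqueue.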

Labels: kind/question