
The queue resources are sufficient but enqueue failed using the capacity component #4235

@leizhenjie-yfd

Description

Please describe your problem in detail

Volcano version: v1.11
Resources: two NVIDIA A100 nodes, each with 8 GPUs. Total resources: 16 GPUs, CPU: 245, memory: 3950Gi

I want to configure two queues, online and offline, under a parent queue. All resources in the parent queue have higher priority for use by the online queue; the offline queue is used only when there are idle resources in the parent queue. If a new job is submitted to the online queue and it can be made runnable by reclaiming resources from the offline queue, some tasks in the offline queue will be evicted.

Problem:
While verifying the reclaim action, I first submitted two jobs with 8 GPUs/115c/1750Gi/1 replica each to the offline queue, and the offline queue used all the resources in the parent queue. Then I submitted two jobs with 2 GPUs/30c/400Gi/1 replica each to the offline queue. The first two jobs were enqueued and running; the second two could not be enqueued because the resource quota was insufficient. Everything is normal up to here.
Then I submitted a job with 1 GPU/15c/200Gi/1 replica to the online queue. One of the two tasks running in the offline queue was terminated, and the job in the online queue changed to Running. But the two jobs with 2 GPUs/30c/400Gi/1 replica in the offline queue still failed to enqueue, even though the parent queue actually has 7 GPUs/115c/2000Gi free.

My environment configuration is as follows:
There are three queues: ailab-cv-a100, ailab-cv-a100-online, and ailab-cv-a100-offline, configured as follows:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100
spec:
  parent: queue-a100
  capability:
    nvidia.com/gpu: "16"
    cpu: 245
    memory: 3950Gi
    #pods: 400
    vke.volcengine.com/eni-ip: 400
  deserved:
    nvidia.com/gpu: "16"
    cpu: 245
    memory: 3950Gi
    #pods: 400
    vke.volcengine.com/eni-ip: 400
  guarantee:
    resource:
      nvidia.com/gpu: "16"
      cpu: 245
      #pods: 400
      vke.volcengine.com/eni-ip: 400
      memory: 3950Gi
  reclaimable: false
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100-online
spec:
  parent: ailab-cv-a100
  deserved:
    cpu: 243
    memory: 3900Gi
    nvidia.com/gpu: 16
    #pods: 200
    vke.volcengine.com/eni-ip: 200
  reclaimable: false
  weight: 1
  priority: 100
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ailab-cv-a100-offline
spec:
  parent: ailab-cv-a100
  deserved:
    cpu: 1
    memory: 1Gi
    nvidia.com/gpu: "0"
    vke.volcengine.com/eni-ip: 200
    #pods: 200
  reclaimable: true
  weight: 1
  priority: 0

volcano-scheduler-configmap configuration:

# Source: volcano/templates/scheduler.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.NodeAffinityEnable: true
          predicate.NodePortsEnable: true
          predicate.TaintTolerationEnable: true
          predicate.PodAffinityEnable: true
          predicate.NodeVolumeLimitsEnable: true
          predicate.VolumeZoneEnable: true
          predicate.PodTopologySpreadEnable: true
          predicate.CacheEnable: true
          predicate.ProportionalEnable: true
          predicate.resources: nvidia.com/gpu
      # - name: overcommit
      # - name: proportion
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.weight: 100
          # weight for the CPU resource
          binpack.cpu: 1
          # weight for the memory resource
          binpack.memory: 1
          # other resource types, such as GPU
          binpack.resources: "nvidia.com/gpu"
          # weights for other resources, such as GPU
          binpack.resources.nvidia.com/gpu: 98
      - name: capacity
        enableHierarchy: true

Any other relevant information

Scheduler logs:

I0425 02:36:10.144463       1 enqueue.go:45] Enter Enqueue ...
I0425 02:36:10.144470       1 enqueue.go:63] Added Queue <ailab-cv-a100-offline> for Job <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247>
I0425 02:36:10.144480       1 enqueue.go:74] Added Job <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144486       1 enqueue.go:74] Added Job <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144503       1 priority.go:70] Priority JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> priority: 10, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> priority: 10
I0425 02:36:10.144514       1 gang.go:118] Gang JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> is ready: false, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> is ready: false
I0425 02:36:10.144528       1 drf.go:325] DRF JobOrderFn: <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> share state: 0, <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> share state: 0
I0425 02:36:10.144537       1 enqueue.go:74] Added Job <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> into Queue <ailab-cv-a100-offline>
I0425 02:36:10.144541       1 priority.go:70] Priority JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> priority: 10, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> priority: 10
I0425 02:36:10.144547       1 gang.go:118] Gang JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> is ready: false, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> is ready: false
I0425 02:36:10.144551       1 drf.go:325] DRF JobOrderFn: <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> share state: 0, <training-job/test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804> share state: 0
I0425 02:36:10.144557       1 enqueue.go:63] Added Queue <ailab-cv-a100-online> for Job <training-job/test-06cot-66e8b266-f8f7-404d-8c04-1755938b04eb>
I0425 02:36:10.144564       1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I0425 02:36:10.144568       1 priority.go:70] Priority JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> priority: 10, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> priority: 10
I0425 02:36:10.144584       1 gang.go:118] Gang JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> is ready: false, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> is ready: false
I0425 02:36:10.144587       1 drf.go:325] DRF JobOrderFn: <training-job/test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247> share state: 0, <training-job/test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7> share state: 0
I0425 02:36:10.144615       1 capacity.go:229] job test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804 min resource <cpu 15000.00, memory 214748364800.00, nvidia.com/gpu 1000.00, pods 1.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00> allocated <cpu 115000.00, memory 1879048192000.00, pods 1.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144623       1 capacity.go:235] job test-90rbf-4c85612b-0642-4f76-8557-d6e89d5a3804 inqueue false
I0425 02:36:10.144677       1 capacity.go:229] job test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7 min resource <cpu 60000.00, memory 966367641600.00, nvidia.com/gpu 4000.00, pods 1.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00> allocated <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00, pods 1.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144686       1 capacity.go:235] job test-20ayg-4459d1aa-7767-4bb3-9d78-d8d2bf110ed7 inqueue false
I0425 02:36:10.144714       1 capacity.go:229] job test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247 min resource <cpu 30000.00, memory 429496729600.00, pods 1.00, nvidia.com/gpu 2000.00>, queue ailab-cv-a100-offline capability <cpu 245000.00, memory 4241280204800.00, hugepages-2Mi 0.00, nvidia.com/gpu 16000.00, pods 1055.00, vke.volcengine.com/eni-ip 400000.00, ephemeral-storage 9653232098914000.00, hugepages-1Gi 0.00> allocated <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, vke.volcengine.com/eni-ip 1000.00, pods 1.00> inqueue <cpu 115000.00, memory 1879048192000.00, nvidia.com/gpu 8000.00, pods 1.00> elastic <cpu 0.00, memory 0.00, vke.volcengine.com/eni-ip 1000.00>
I0425 02:36:10.144736       1 capacity.go:235] job test-93cjh-19ed638a-488a-4f92-b16b-f2fce49cb247 inqueue false
I0425 02:36:10.144747       1 enqueue.go:104] Leaving Enqueue ...

According to https://github.com/volcano-sh/volcano/blob/release-1.11/pkg/scheduler/plugins/capacity/capacity.go#L224, the check is:
r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
For the 2-GPU job: gpu = 2 (request) + 8 (allocated) + 8 (inqueue) - 0 (elastic) = 18 > total of 16 GPUs, so the enqueue is rejected.
In this case, are the allocated and inqueue values counting the same running job twice?
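To make the suspected double-counting concrete, here is a minimal sketch in Go using plain floats in place of Volcano's `api.Resource` type (the variable names mirror the scheduler log above; they are illustrative, not the actual plugin code):

```go
package main

import "fmt"

func main() {
	// All values in milli-GPUs, taken from the capacity.go:229 log line.
	const capability = 16000.0 // queue capability: 16 GPUs
	minReq := 2000.0           // pending job min request: 2 GPUs
	allocated := 8000.0        // the running 8-GPU job, counted as allocated
	inqueue := 8000.0          // the same running job, apparently counted as inqueue too
	elastic := 0.0

	// Mirrors: r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
	r := minReq + allocated + inqueue - elastic
	fmt.Printf("r = %.0f, capability = %.0f, enqueue allowed: %v\n",
		r, capability, r <= capability)
	// prints: r = 18000, capability = 16000, enqueue allowed: false
}
```

If the running job were counted only once (either in allocated or in inqueue, not both), r would be 10000 ≤ 16000 and the 2-GPU job would enqueue.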

Labels: kind/question