
Resource requests to YARN time out when submitting a job with the gpu-beta version #27

Closed
zhanglistar opened this issue Apr 12, 2018 · 1 comment

@zhanglistar

The cluster does have resources available.
AM log:

18/04/12 10:53:03 INFO ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
18/04/12 10:53:04 INFO Utilities: input path: hdfs://gpu1:8020/user/tmp/data/tensorflow
18/04/12 10:53:04 INFO ApplicationMaster: XLearning application needs 2 worker and 1 ps containers in fact
18/04/12 10:53:04 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:2, yarn.io/gpu: 2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Create ps container request: Capability[<memory:4096, vCores:2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/04/12 10:53:05 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000002 on host gpu3 , with the resource <memory:4096, vCores:2>
18/04/12 10:53:05 INFO RMCallbackHandler: Current acquired worker container 0 / 2 ps container 1 / 1
18/04/12 10:53:06 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/04/12 10:53:06 INFO ApplicationMaster: Try to allocate 2 worker containers
18/04/12 10:53:07 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000003 on host gpu3 , with the resource <memory:4096, vCores:2, yarn.io/gpu: 2>
18/04/12 10:53:07 INFO RMCallbackHandler: Current acquired worker container 1 / 2 ps container 1 / 1
18/04/12 11:03:08 INFO ApplicationMaster: Container waiting except the allocated expiry time. Maybe the Cluster available resources are not satisfied the user need. Please resubmit !
18/04/12 11:03:08 INFO ApplicationMaster: Unregister Application
18/04/12 11:03:08 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
18/04/12 11:03:08 INFO ApplicationMaster: Application failed.
ResourceManager log:
clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>

run.sh:
--worker-memory 4G \
--worker-num 2 \
--worker-cores 2 \
--worker-gcores 2 \
--ps-memory 4G \
--ps-num 1 \
--ps-cores 2 \
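For reference, a minimal sketch (not part of the original report) that totals the ask from these run.sh flags and compares it against the cluster resources in the RM log above. It shows the raw cluster capacity easily covers the request, which points to a scheduler-side limit rather than a genuine resource shortage:

```python
# Hypothetical sanity check: does the aggregate request from run.sh
# fit the clusterResource reported by the ResourceManager?

workers, ps = 2, 1
request = {
    "memory_mb": workers * 4096 + ps * 4096,  # --worker-memory 4G, --ps-memory 4G
    "vcores":    workers * 2 + ps * 2,        # --worker-cores 2, --ps-cores 2
    "gpus":      workers * 2,                 # --worker-gcores 2 (ps uses no GPU)
}
# From the RM log: clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>
cluster = {"memory_mb": 400000, "vcores": 36, "gpus": 8}

for k in request:
    assert request[k] <= cluster[k], f"cluster cannot satisfy {k}"
print("cluster capacity can satisfy the full request")
```

Since capacity is sufficient, the hang at "worker container 1 / 2" suggests the scheduler is refusing to hand out the second worker container for policy reasons.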

zhanglistar (the issue author) commented Apr 12, 2018

Found the root cause: it is YARN's own scheduling limit policy, yarn.scheduler.capacity.A.minimum-user-limit-percent.
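For readers hitting the same symptom, a hedged sketch of what the relevant CapacityScheduler setting might look like in capacity-scheduler.xml, assuming a queue named A as in the property above (queue name and value are illustrative; minimum-user-limit-percent caps how little of the queue a single user may be squeezed down to when other users compete, and a low effective per-user share can leave a second container unallocated):

```xml
<!-- capacity-scheduler.xml: hypothetical queue "A" -->
<property>
  <name>yarn.scheduler.capacity.A.minimum-user-limit-percent</name>
  <!-- Illustrative value: 100 lets one user consume the queue's full capacity. -->
  <value>100</value>
</property>
```

After changing capacity-scheduler.xml, the queues can be reloaded without restarting the ResourceManager via `yarn rmadmin -refreshQueues`.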
