
Resource requests to YARN time out when submitting a job with the gpu-beta version #27

Closed
zhanglistar opened this issue Apr 12, 2018 · 1 comment

@zhanglistar

The cluster does have resources available.
AM log:

18/04/12 10:53:03 INFO ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
18/04/12 10:53:04 INFO Utilities: input path: hdfs://gpu1:8020/user/tmp/data/tensorflow
18/04/12 10:53:04 INFO ApplicationMaster: XLearning application needs 2 worker and 1 ps containers in fact
18/04/12 10:53:04 INFO ApplicationMaster: Create worker container request: Capability[<memory:4096, vCores:2, yarn.io/gpu: 2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Create ps container request: Capability[<memory:4096, vCores:2>]Priority[3]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
18/04/12 10:53:04 INFO ApplicationMaster: Try to allocate 1 ps/server containers
18/04/12 10:53:05 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000002 on host gpu3 , with the resource <memory:4096, vCores:2>
18/04/12 10:53:05 INFO RMCallbackHandler: Current acquired worker container 0 / 2 ps container 1 / 1
18/04/12 10:53:06 INFO ApplicationMaster: Total 1 ps containers has allocated.
18/04/12 10:53:06 INFO ApplicationMaster: Try to allocate 2 worker containers
18/04/12 10:53:07 INFO RMCallbackHandler: Acquired container container_1523451270416_0005_01_000003 on host gpu3 , with the resource <memory:4096, vCores:2, yarn.io/gpu: 2>
18/04/12 10:53:07 INFO RMCallbackHandler: Current acquired worker container 1 / 2 ps container 1 / 1
18/04/12 11:03:08 INFO ApplicationMaster: Container waiting except the allocated expiry time. Maybe the Cluster available resources are not satisfied the user need. Please resubmit !
18/04/12 11:03:08 INFO ApplicationMaster: Unregister Application
18/04/12 11:03:08 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
18/04/12 11:03:08 INFO ApplicationMaster: Application failed.
ResourceManager log:
clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>

run.sh:
--worker-memory 4G \
--worker-num 2 \
--worker-cores 2 \
--worker-gcores 2 \
--ps-memory 4G \
--ps-num 1 \
--ps-cores 2 \
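For reference, a minimal sketch (not part of the original report) that totals the ask from these run.sh flags and compares it against the cluster resources in the RM log above. It shows the raw cluster capacity easily covers the request, which points to a scheduler-side limit rather than a genuine resource shortage:

```python
# Hypothetical sanity check: does the aggregate request from run.sh
# fit the clusterResource reported by the ResourceManager?

workers, ps = 2, 1
request = {
    "memory_mb": workers * 4096 + ps * 4096,  # --worker-memory 4G, --ps-memory 4G
    "vcores":    workers * 2 + ps * 2,        # --worker-cores 2, --ps-cores 2
    "gpus":      workers * 2,                 # --worker-gcores 2 (ps uses no GPU)
}
# From the RM log: clusterResource=<memory:400000, vCores:36, yarn.io/gpu: 8>
cluster = {"memory_mb": 400000, "vcores": 36, "gpus": 8}

for k in request:
    assert request[k] <= cluster[k], f"cluster cannot satisfy {k}"
print("cluster capacity can satisfy the full request")
```

Since capacity is sufficient, the hang at "worker container 1 / 2" suggests the scheduler is refusing to hand out the second worker container for policy reasons.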

zhanglistar (the issue author) commented Apr 12, 2018

Found the root cause: it is YARN's own scheduling limit policy, yarn.scheduler.capacity.A.minimum-user-limit-percent.
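For readers hitting the same symptom, a hedged sketch of what the relevant CapacityScheduler setting might look like in capacity-scheduler.xml, assuming a queue named A as in the property above (queue name and value are illustrative; minimum-user-limit-percent caps how little of the queue a single user may be squeezed down to when other users compete, and a low effective per-user share can leave a second container unallocated):

```xml
<!-- capacity-scheduler.xml: hypothetical queue "A" -->
<property>
  <name>yarn.scheduler.capacity.A.minimum-user-limit-percent</name>
  <!-- Illustrative value: 100 lets one user consume the queue's full capacity. -->
  <value>100</value>
</property>
```

After changing capacity-scheduler.xml, the queues can be reloaded without restarting the ResourceManager via `yarn rmadmin -refreshQueues`.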
