Problems with resource deadlock warning. #5790

Closed
robertnishihara opened this issue Sep 26, 2019 · 5 comments
Labels
stale The issue is stale. It will be closed within 7 days unless there are further conversation

Comments

@robertnishihara
Collaborator

robertnishihara commented Sep 26, 2019

The resource deadlock warning will fire spuriously (some fraction of the time) during the following workload.

import ray

ray.init(num_gpus=2, num_cpus=2)

@ray.remote(num_gpus=1)
def f():
    return

@ray.remote(num_gpus=1)
def g():
    ray.get(f.remote())

ray.get([g.remote() for _ in range(2)])

The reason is that the task is still waiting for a worker to start up, so the warning fires even though nothing is actually deadlocked.

If we then subsequently run something which actually does deadlock, like

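@ray.remote(num_gpus=1)
def f():
    return

@ray.remote(num_gpus=1)
def g():
    # g keeps its GPU while it blocks on f, which needs a GPU of its own.
    ray.get(f.remote())

# Two copies of g claim both GPUs, so neither f can ever be scheduled.
ray.get([g.remote() for _ in range(2)])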

it may not print any warning, because the warning is emitted at most once every debug_dump_period_ milliseconds.
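To illustrate how an "at most once per period" rule can hide a real deadlock, here is a minimal sketch in plain Python. It is not Ray's implementation: the class `RateLimitedWarning`, its `period_ms` parameter, and the timestamps are hypothetical stand-ins for the behavior described above.

```python
class RateLimitedWarning:
    """Hypothetical helper: emit a warning at most once per `period_ms`."""

    def __init__(self, period_ms):
        self.period_ms = period_ms
        self.last_emitted_ms = None  # no warning emitted yet

    def maybe_warn(self, now_ms, message):
        """Return True if the warning was emitted, False if it was suppressed."""
        recently_emitted = (
            self.last_emitted_ms is not None
            and now_ms - self.last_emitted_ms < self.period_ms
        )
        if recently_emitted:
            return False
        self.last_emitted_ms = now_ms
        print(message)
        return True


# Assumed period of 10 seconds, chosen only for the example.
limiter = RateLimitedWarning(period_ms=10_000)

# A spurious warning (worker still starting) uses up the current period.
spurious_emitted = limiter.maybe_warn(0, "possible resource deadlock")

# A real deadlock detected 3 seconds later is suppressed.
real_emitted = limiter.maybe_warn(3_000, "possible resource deadlock")

# Once the period has elapsed, the warning can be emitted again.
later_emitted = limiter.maybe_warn(10_000, "possible resource deadlock")
```

Under these assumptions the spurious warning consumes the slot, and the genuine deadlock is only reported once the period has passed.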

@robertnishihara
Collaborator Author

#5789 is related.

@robertnishihara
Collaborator Author

cc @edoakes

@ericl
Contributor

ericl commented Sep 26, 2019

This is working as designed -- a warning will eventually show up. But feel free to ignore these warnings if the cluster is still scaling up.

We could add another delay though to reduce the false positives.
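One way such a delay could look is sketched below in plain Python. This is only an illustration of the idea, not a proposed patch to Ray: `DelayedDeadlockWarning`, `grace_ms`, and the `looks_stuck` flag are hypothetical names. The warning is reported only if the "stuck" condition has persisted for the whole grace period, so a short wait for a worker to start does not trigger it.

```python
class DelayedDeadlockWarning:
    """Hypothetical helper: warn only after the stuck state lasts `grace_ms`."""

    def __init__(self, grace_ms):
        self.grace_ms = grace_ms
        self.stuck_since_ms = None  # time at which the current stuck streak began

    def observe(self, now_ms, looks_stuck):
        """Record one check; return True if a warning should be shown now."""
        if not looks_stuck:
            # Progress was made, so reset the streak.
            self.stuck_since_ms = None
            return False
        if self.stuck_since_ms is None:
            self.stuck_since_ms = now_ms
        return now_ms - self.stuck_since_ms >= self.grace_ms


# Assumed grace period of 5 seconds, chosen only for the example.
detector = DelayedDeadlockWarning(grace_ms=5_000)

# Transient stall: a worker is starting, then the task gets scheduled.
transient_first = detector.observe(0, looks_stuck=True)
transient_resolved = detector.observe(500, looks_stuck=False)

# Persistent stall: still stuck 5 seconds after it was first observed.
persistent_first = detector.observe(1_000, looks_stuck=True)
persistent_later = detector.observe(6_000, looks_stuck=True)
```

The trade-off is that a real deadlock is reported `grace_ms` later than it would be without the delay.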

@stale

stale bot commented Nov 14, 2020

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public Slack channel.

stale bot added the stale label Nov 14, 2020
@robertnishihara
Collaborator Author

The behavior seems to have changed here. Now if I run

import ray

ray.init(num_gpus=2, num_cpus=2)

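@ray.remote(num_gpus=1)
def f():
    return

@ray.remote(num_gpus=1)
def g():
    # g keeps its GPU while it blocks on f, which needs a GPU of its own.
    ray.get(f.remote())

# Two copies of g claim both GPUs, so neither f can ever be scheduled.
ray.get([g.remote() for _ in range(2)])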

it deadlocks (as it should), but there is no warning (either in stdout or in the dashboard).
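The resource arithmetic behind this deadlock can be sketched without Ray. The helper below is hypothetical and assumes that a task keeps its GPU allocation while it is blocked in ray.get, which matches the deadlock reported here.

```python
def child_can_be_scheduled(total_gpus, blocked_parents, gpus_per_parent, gpus_per_child):
    """Return True if a child task still fits once every blocked parent holds its GPUs.

    Hypothetical model: blocked parents do not release their GPUs.
    """
    free_gpus = total_gpus - blocked_parents * gpus_per_parent
    return free_gpus >= gpus_per_child


# The workload above: 2 GPUs, two copies of g (1 GPU each), each waiting on f (1 GPU).
reported_case = child_can_be_scheduled(
    total_gpus=2, blocked_parents=2, gpus_per_parent=1, gpus_per_child=1
)

# With a third GPU, one f could run and the parents would unblock in turn.
with_spare_gpu = child_can_be_scheduled(
    total_gpus=3, blocked_parents=2, gpus_per_parent=1, gpus_per_child=1
)
```

In the reported case no GPU is left for f, so both copies of g wait forever; a warning would be expected exactly when this check fails and stays failed.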
