Using ray as a task scheduler gives a warning #7256

Closed
MihaMuskinja opened this issue Feb 21, 2020 · 1 comment
Labels: P3 (Issue moderate in impact or severity), question (Just a question :))

Comments

@MihaMuskinja

I have a simple Ray application that schedules tasks across multiple nodes on the Cori Haswell HPC system. Typically I run ~10 nodes with 32 worker processes each, but sometimes I get the following warning. Is this expected behavior in this case? I also observe that it takes a while for all tasks to start. Please see below for some details about the implementation.

2020-02-20 17:27:03,853 WARNING worker.py:1063 -- The actor or task with ID f209dbd4313af752ffffffff0300 is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node:10.128.2.121: 1.000000}, {CPU: 32.000000}, {memory: 103.906250 GiB}, {object_store_memory: 12.841797 GiB}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

Tasks are just command-line executables, which are executed remotely via a subprocess.call like this:

import subprocess

import ray

@ray.remote
def execute(job):
    # Run the command-line job and return its exit code
    return subprocess.call(job)

I construct the Ray cluster manually with srun commands, e.g. something like this:

# start ray head node
srun -N1 -n1 -w $SLURMD_NODENAME \
    ray start --head --redis-port=$RAY_REDIS_PORT --redis-password=$RAY_REDIS_PASSWORD --num-cpus=$RAY_NWORKERS --block &

... wait for head node to start ...

# start ray on all other nodes
srun -x $SLURMD_NODENAME -N`expr $SLURM_JOB_NUM_NODES - 1` -n`expr $SLURM_JOB_NUM_NODES - 1` \
    ray start --address $RAY_HEAD_IP:$RAY_REDIS_PORT --redis-password $RAY_REDIS_PASSWORD --num-cpus=$RAY_NWORKERS --block &

... wait for ray to start on all nodes ...
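The driver then connects to this cluster with something like the following (a sketch; it assumes the same environment variables are exported and uses the address and redis_password arguments of ray.init from the ray 0.8.x API):

import os
import ray

# Connect the driver to the cluster started above; RAY_HEAD_IP,
# RAY_REDIS_PORT, and RAY_REDIS_PASSWORD are assumed to be set in the
# environment, matching the srun commands.
ray.init(
    address="{}:{}".format(os.environ["RAY_HEAD_IP"], os.environ["RAY_REDIS_PORT"]),
    redis_password=os.environ["RAY_REDIS_PASSWORD"],
)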

In my driver application I execute tasks in a for loop:

# Submit every task; Ray schedules them across the cluster
remaining_ids = []
for task in tasks:
    remaining_ids.append(execute.remote(task))
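To then wait for the tasks to finish, something like the following sketch works (ray.wait and ray.get from the standard API; the print is just for illustration):

# Drain results as tasks finish; each result is the subprocess exit code
while remaining_ids:
    done_ids, remaining_ids = ray.wait(remaining_ids, num_returns=1)
    print("task finished with exit code", ray.get(done_ids[0]))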

Ray version and other system information: Python 3.7, ray 0.8.1, Cori Haswell HPC

@MihaMuskinja MihaMuskinja added the question Just a question :) label Feb 21, 2020
@ericl ericl added the P3 Issue moderate in impact or severity label Mar 5, 2020
@richardliaw (Contributor)

This is a duplicate of #15933
