Using ray as a task scheduler gives a warning #7256

Closed
MihaMuskinja opened this issue Feb 21, 2020 · 1 comment
Labels: P3 (Issue moderate in impact or severity), question (Just a question :))

Comments

@MihaMuskinja

I have a simple Ray application that schedules tasks across multiple nodes on the Cori Haswell HPC system. Typically I run ~10 nodes with 32 worker processes each, but sometimes I get the following warning. Is this expected behavior in this case? I also observe that it takes a while for all tasks to start. Please see below for some details about the implementation.

2020-02-20 17:27:03,853 WARNING worker.py:1063 -- The actor or task with ID f209dbd4313af752ffffffff0300 is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node:10.128.2.121: 1.000000}, {CPU: 32.000000}, {memory: 103.906250 GiB}, {object_store_memory: 12.841797 GiB}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

Tasks are just command-line executables, which are executed remotely via a subprocess.call like this:

import subprocess

import ray

@ray.remote
def execute(job):
    # Run the command-line job and return its exit code
    return subprocess.call(job)

I construct the Ray cluster manually with srun commands, e.g. something like this:

# start ray head node
srun -N1 -n1 -w $SLURMD_NODENAME \
    ray start --head --redis-port=$RAY_REDIS_PORT --redis-password=$RAY_REDIS_PASSWORD --num-cpus=$RAY_NWORKERS --block &

... wait for head node to start ...

# start ray on all other nodes
srun -x $SLURMD_NODENAME -N`expr $SLURM_JOB_NUM_NODES - 1` -n`expr $SLURM_JOB_NUM_NODES - 1` \
    ray start --address $RAY_HEAD_IP:$RAY_REDIS_PORT --redis-password $RAY_REDIS_PASSWORD --num-cpus=$RAY_NWORKERS --block &

... wait for ray to start on all nodes ...
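The driver then connects to this cluster with something like the following (a sketch; it assumes the same environment variables are exported and uses the address and redis_password arguments of ray.init from the ray 0.8.x API):

import os
import ray

# Connect the driver to the cluster started above; RAY_HEAD_IP,
# RAY_REDIS_PORT, and RAY_REDIS_PASSWORD are assumed to be set in the
# environment, matching the srun commands.
ray.init(
    address="{}:{}".format(os.environ["RAY_HEAD_IP"], os.environ["RAY_REDIS_PORT"]),
    redis_password=os.environ["RAY_REDIS_PASSWORD"],
)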

In my driver application I execute tasks in a for loop:

# Submit every task; Ray schedules them across the cluster
remaining_ids = []
for task in tasks:
    remaining_ids.append(execute.remote(task))
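To then wait for the tasks to finish, something like the following sketch works (ray.wait and ray.get from the standard API; the print is just for illustration):

# Drain results as tasks finish; each result is the subprocess exit code
while remaining_ids:
    done_ids, remaining_ids = ray.wait(remaining_ids, num_returns=1)
    print("task finished with exit code", ray.get(done_ids[0]))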

Ray version and other system information: Python 3.7, ray 0.8.1, Cori Haswell HPC

@MihaMuskinja MihaMuskinja added the question Just a question :) label Feb 21, 2020
@ericl ericl added the P3 Issue moderate in impact or severity label Mar 5, 2020
@richardliaw (Contributor)

This is a duplicate of #15933
