I have a simple Ray application that schedules tasks on multiple nodes of the Cori Haswell HPC system. Typically I run ~10 nodes with 32 worker processes each, but sometimes I get the warning below. Is this expected behavior in this setup? I can also observe that it takes a while for all tasks to start. Some details about the implementation follow.
2020-02-20 17:27:03,853 WARNING worker.py:1063 -- The actor or task with ID f209dbd4313af752ffffffff0300 is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node:10.128.2.121: 1.000000}, {CPU: 32.000000}, {memory: 103.906250 GiB}, {object_store_memory: 12.841797 GiB}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
Tasks are just command-line executables which are executed remotely via a subprocess.call, like this:
import subprocess

import ray

@ray.remote
def execute(job):
    # subprocess.call blocks until the command exits and returns its exit code
    return subprocess.call(job)
I construct a Ray cluster manually with srun commands, e.g. something like this:
# start ray head node
srun -N1 -n1 -w $SLURMD_NODENAME \
ray start --head --redis-port=$RAY_REDIS_PORT --redis-password=$RAY_REDIS_PASSWORD --num-cpus=$RAY_NWORKERS --block &
... wait for head node to start ...
# start ray on all other nodes
srun -x $SLURMD_NODENAME -N`expr $SLURM_JOB_NUM_NODES - 1` -n`expr $SLURM_JOB_NUM_NODES - 1` \
ray start --address $RAY_HEAD_IP:$RAY_REDIS_PORT --redis-password $RAY_REDIS_PASSWORD --num-cpus=$RAY_NWORKERS --block &
... wait for ray to start on all nodes ...
In my driver application I execute tasks in a for loop:
remaining_ids = []
for task in tasks:
    remaining_ids.append(execute.remote(task))
Ray version and other system information: Python 3.7, ray 0.8.1, Cori Haswell HPC