
ray Aborted on a cluster managed by slurm #14426

Closed
chegrane opened this issue Mar 1, 2021 · 4 comments
Labels: bug, triage, stale

Comments

chegrane commented Mar 1, 2021

What is the problem?

Ray aborts immediately after the call to ray.init().

Sys: Linux cedar1.cedar.computecanada.ca 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 GNU/Linux
Python 3.8
ray version : '1.1.0'

Reproduction (REQUIRED)

(VENV_py3.8) [ibra@cedar1 src_code]$ python
Python 3.8.2 (default, May 15 2020, 20:21:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.__version__
'1.1.0'
>>> ray.init()
2021-03-01 12:40:20,400 INFO services.py:1171 -- View the Ray dashboard at http://127.0.0.1:8265
Aborted
(VENV_py3.8) [ibra@cedar1 src_code]$
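
One hedged thing to try at this point (a suggestion, not part of the original report) is to cap the object store that ray.init() allocates, since by default it is sized from the node's total memory and may exceed what the login node actually permits:

# Hedged check: object_store_memory is a real ray.init() argument (in bytes);
# the 1 GB value here is an arbitrary small size just for this test.
python -c 'import ray; ray.init(object_store_memory=10**9); print("ray started ok")'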

Note:
This may be related to Slurm, since it is a cluster. However, according to the cluster documentation, small programs can be executed without going through the job scheduler.

Edit: I used Slurm to launch my program as a job, but I get the same problem:

(VENV_py3.8) [ibra@cedar1 src_code]$ bash ./lunch_ray_cluster_cedar.sh
2021-03-01 13:02:12,500 INFO services.py:1171 -- View the Ray dashboard at http://127.0.0.1:8265
./lunch_ray_cluster_cedar.sh: line 7:  4663 Aborted                 python test_ray_one_node.py 1

My script is as follows:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-xxxxxxxxx_name_of_prof_xxxxxx
#SBATCH --output=%x___.out
#SBATCH --mincpus 32
#SBATCH --mem-per-cpu=1024M 
python test_ray_one_node.py 1
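
A hedged variant of this script (not from the report) would size Ray explicitly from the Slurm allocation instead of letting ray.init() guess the node's resources; it assumes test_ray_one_node.py is changed to connect with ray.init(address="auto"):

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-xxxxxxxxx_name_of_prof_xxxxxx
#SBATCH --output=%x___.out
#SBATCH --mincpus 32
#SBATCH --mem-per-cpu=1024M

# Start a Ray head sized to the allocation; --num-cpus and --object-store-memory
# are real `ray start` flags, and SLURM_CPUS_ON_NODE is set by Slurm inside a job.
ray start --head \
    --num-cpus="${SLURM_CPUS_ON_NODE:-32}" \
    --object-store-memory=$((8 * 1024 * 1024 * 1024))  # 8 GiB, well below the job's ~32 GB

# Assumption: the script calls ray.init(address="auto") in this variant.
python test_ray_one_node.py 1

ray stop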

chegrane added the bug and triage labels on Mar 1, 2021
rkooo567 commented Mar 1, 2021

Can you show us logs inside /tmp/ray/session_latest/logs/raylet.err?

chegrane commented Mar 2, 2021

@rkooo567
The following log is from running Python directly on the command line:

[2021-03-02 06:37:44,220 E 53311 53360] dlmalloc.cc:112: mmap failed with error: Cannot allocate memory
[2021-03-02 06:37:44,220 E 53311 53360] logging.cc:414: *** Aborted at 1614695864 (unix time) try "date -d @1614695864" if you are using GNU date ***
[2021-03-02 06:37:44,221 E 53311 53360] logging.cc:414: PC: @ 0x0 (unknown)
[2021-03-02 06:37:44,222 E 53311 53360] logging.cc:414: *** SIGSEGV (@0x18) received by PID 53311 (TID 0x7f8900f96700) from PID 24; stack trace: ***
[2021-03-02 06:37:44,222 E 53311 53360] logging.cc:414: @ 0x55f9c7a5ba0f google::(anonymous namespace)::FailureSignalHandler()
[2021-03-02 06:37:44,223 E 53311 53360] logging.cc:414: @ 0x7f8905e47e90 (unknown)
[2021-03-02 06:37:44,224 E 53311 53360] logging.cc:414: @ 0x55f9c76d5db9 dlmalloc
[2021-03-02 06:37:44,225 E 53311 53360] logging.cc:414: @ 0x55f9c76d6790 plasma::internal_memalign()
[2021-03-02 06:37:44,226 E 53311 53360] logging.cc:414: @ 0x55f9c76d0e71 plasma::PlasmaAllocator::Memalign()
[2021-03-02 06:37:44,227 E 53311 53360] logging.cc:414: @ 0x55f9c76cc0e2 plasma::PlasmaStoreRunner::Start()
[2021-03-02 06:37:44,228 E 53311 53360] logging.cc:414: @ 0x55f9c766dc3c std::thread::_State_impl<>::_M_run()
[2021-03-02 06:37:44,228 E 53311 53360] logging.cc:414: @ 0x55f9c7dcb660 execute_native_thread_routine
[2021-03-02 06:37:44,230 E 53311 53360] logging.cc:414: @ 0x7f8905e3e1f4 start_thread
[2021-03-02 06:37:44,232 E 53311 53360] logging.cc:414: @ 0x7f890508716f __GI___clone

And the following log is from running under Slurm sbatch:

[2021-03-02 06:41:44,781 E 621 655] dlmalloc.cc:112: mmap failed with error: Cannot allocate memory
[2021-03-02 06:41:44,781 E 621 655] logging.cc:414: *** Aborted at 1614696104 (unix time) try "date -d @1614696104" if you are using GNU date ***
[2021-03-02 06:41:44,781 E 621 655] logging.cc:414: PC: @ 0x0 (unknown)
[2021-03-02 06:41:44,782 E 621 655] logging.cc:414: *** SIGSEGV (@0x18) received by PID 621 (TID 0x7f4962ffd700) from PID 24; stack trace: ***
[2021-03-02 06:41:44,783 E 621 655] logging.cc:414: @ 0x55f49c67ca0f google::(anonymous namespace)::FailureSignalHandler()
[2021-03-02 06:41:44,784 E 621 655] logging.cc:414: @ 0x7f496bf98e90 (unknown)
[2021-03-02 06:41:44,785 E 621 655] logging.cc:414: @ 0x55f49c2f6db9 dlmalloc
[2021-03-02 06:41:44,785 E 621 655] logging.cc:414: @ 0x55f49c2f7790 plasma::internal_memalign()
[2021-03-02 06:41:44,786 E 621 655] logging.cc:414: @ 0x55f49c2f1e71 plasma::PlasmaAllocator::Memalign()
[2021-03-02 06:41:44,787 E 621 655] logging.cc:414: @ 0x55f49c2ed0e2 plasma::PlasmaStoreRunner::Start()
[2021-03-02 06:41:44,788 E 621 655] logging.cc:414: @ 0x55f49c28ec3c std::thread::_State_impl<>::_M_run()
[2021-03-02 06:41:44,789 E 621 655] logging.cc:414: @ 0x55f49c9ec660 execute_native_thread_routine
[2021-03-02 06:41:44,789 E 621 655] logging.cc:414: @ 0x7f496bf8f1f4 start_thread
[2021-03-02 06:41:44,793 E 621 655] logging.cc:414: @ 0x7f496b1d816f __GI___clone
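
Both traces show the plasma object store aborting inside dlmalloc when its initial mmap of shared memory fails ("Cannot allocate memory"), which usually points at /dev/shm size or per-process memory limits on the node rather than at the Python code. Some hedged shell checks (not from the thread) that can narrow this down:

df -h /dev/shm                 # size and usage of the shared-memory filesystem Ray maps by default
ulimit -v                      # per-process virtual-memory limit, if the login node sets one
ulimit -l                      # locked-memory limit
grep -i commit /proc/meminfo   # overcommit accounting on the node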

chegrane changed the title from "ray Aborted" to "ray Aborted on a cluster managed by slurm" on Mar 3, 2021
chegrane commented Mar 3, 2021

Dear @rkooo567

I found where the problem was.

In fact, as I suspected in my first post, it is related to Slurm, since I am using a cluster. To confirm or rule out this relation, I tested by submitting a job to Slurm with the script provided in my first post.

The error was in the command line used to submit the job:

  • bash ./lunch_ray_cluster_cedar.sh, which is wrong
  • sbatch ./lunch_ray_cluster_cedar.sh, which is correct

The program worked fine without any problem.

As a result, it looks like on a cluster managed by Slurm, we have to go through the scheduler for Ray to work properly.

But the problem remains in the case where we want to launch a small program without submitting a job to Slurm.
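
For that remaining case, a hedged alternative (not confirmed in this thread) is to request a short interactive allocation instead of running on the login node, so the test still goes through Slurm but without a batch script:

# salloc opens an interactive shell inside a Slurm allocation; the resource
# values here are placeholders matching the batch script above.
salloc --time=00:15:00 --cpus-per-task=4 --mem=8G --account=def-xxxxxxxxx_name_of_prof_xxxxxx
# then, inside the allocation:
python test_ray_one_node.py 1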

stale bot commented Jul 1, 2021

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label on Jul 1, 2021