
ray Aborted on a cluster managed by slurm #14426

Closed
chegrane opened this issue Mar 1, 2021 · 4 comments
Labels: bug, triage, stale

Comments

chegrane commented Mar 1, 2021

What is the problem?

Ray aborts immediately after the call to ray.init().

Sys: Linux cedar1.cedar.computecanada.ca 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 GNU/Linux
Python 3.8
ray version : '1.1.0'

Reproduction (REQUIRED)

(VENV_py3.8) [ibra@cedar1 src_code]$ python
Python 3.8.2 (default, May 15 2020, 20:21:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.__version__
'1.1.0'
>>> ray.init()
2021-03-01 12:40:20,400 INFO services.py:1171 -- View the Ray dashboard at http://127.0.0.1:8265
Aborted
(VENV_py3.8) [ibra@cedar1 src_code]$
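
One hedged thing to try at this point (a suggestion, not part of the original report) is to cap the object store that ray.init() allocates, since by default it is sized from the node's total memory and may exceed what the login node actually permits:

# Hedged check: object_store_memory is a real ray.init() argument (in bytes);
# the 1 GB value here is an arbitrary small size just for this test.
python -c 'import ray; ray.init(object_store_memory=10**9); print("ray started ok")'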

Note:
This may be related to Slurm, since it is a cluster. However, according to the cluster documentation, small programs can be executed without going through the job scheduler.

Edit: I used Slurm to launch my program as a job, but I get the same problem:

(VENV_py3.8) [ibra@cedar1 src_code]$ bash ./lunch_ray_cluster_cedar.sh
2021-03-01 13:02:12,500 INFO services.py:1171 -- View the Ray dashboard at http://127.0.0.1:8265
./lunch_ray_cluster_cedar.sh: line 7:  4663 Aborted                 python test_ray_one_node.py 1

My script is as follows:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-xxxxxxxxx_name_of_prof_xxxxxx
#SBATCH --output=%x___.out
#SBATCH --mincpus 32
#SBATCH --mem-per-cpu=1024M 
python test_ray_one_node.py 1
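
A hedged variant of this script (not from the report) would size Ray explicitly from the Slurm allocation instead of letting ray.init() guess the node's resources; it assumes test_ray_one_node.py is changed to connect with ray.init(address="auto"):

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-xxxxxxxxx_name_of_prof_xxxxxx
#SBATCH --output=%x___.out
#SBATCH --mincpus 32
#SBATCH --mem-per-cpu=1024M

# Start a Ray head sized to the allocation; --num-cpus and --object-store-memory
# are real `ray start` flags, and SLURM_CPUS_ON_NODE is set by Slurm inside a job.
ray start --head \
    --num-cpus="${SLURM_CPUS_ON_NODE:-32}" \
    --object-store-memory=$((8 * 1024 * 1024 * 1024))  # 8 GiB, well below the job's ~32 GB

# Assumption: the script calls ray.init(address="auto") in this variant.
python test_ray_one_node.py 1

ray stop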

chegrane added the bug and triage labels on Mar 1, 2021
rkooo567 commented Mar 1, 2021

Can you show us logs inside /tmp/ray/session_latest/logs/raylet.err?

chegrane commented Mar 2, 2021

@rkooo567
The following log is from running Python directly on the command line:

[2021-03-02 06:37:44,220 E 53311 53360] dlmalloc.cc:112: mmap failed with error: Cannot allocate memory
[2021-03-02 06:37:44,220 E 53311 53360] logging.cc:414: *** Aborted at 1614695864 (unix time) try "date -d @1614695864" if you are using GNU date ***
[2021-03-02 06:37:44,221 E 53311 53360] logging.cc:414: PC: @ 0x0 (unknown)
[2021-03-02 06:37:44,222 E 53311 53360] logging.cc:414: *** SIGSEGV (@0x18) received by PID 53311 (TID 0x7f8900f96700) from PID 24; stack trace: ***
[2021-03-02 06:37:44,222 E 53311 53360] logging.cc:414: @ 0x55f9c7a5ba0f google::(anonymous namespace)::FailureSignalHandler()
[2021-03-02 06:37:44,223 E 53311 53360] logging.cc:414: @ 0x7f8905e47e90 (unknown)
[2021-03-02 06:37:44,224 E 53311 53360] logging.cc:414: @ 0x55f9c76d5db9 dlmalloc
[2021-03-02 06:37:44,225 E 53311 53360] logging.cc:414: @ 0x55f9c76d6790 plasma::internal_memalign()
[2021-03-02 06:37:44,226 E 53311 53360] logging.cc:414: @ 0x55f9c76d0e71 plasma::PlasmaAllocator::Memalign()
[2021-03-02 06:37:44,227 E 53311 53360] logging.cc:414: @ 0x55f9c76cc0e2 plasma::PlasmaStoreRunner::Start()
[2021-03-02 06:37:44,228 E 53311 53360] logging.cc:414: @ 0x55f9c766dc3c std::thread::_State_impl<>::_M_run()
[2021-03-02 06:37:44,228 E 53311 53360] logging.cc:414: @ 0x55f9c7dcb660 execute_native_thread_routine
[2021-03-02 06:37:44,230 E 53311 53360] logging.cc:414: @ 0x7f8905e3e1f4 start_thread
[2021-03-02 06:37:44,232 E 53311 53360] logging.cc:414: @ 0x7f890508716f __GI___clone

And the following log is from running under Slurm sbatch:

[2021-03-02 06:41:44,781 E 621 655] dlmalloc.cc:112: mmap failed with error: Cannot allocate memory
[2021-03-02 06:41:44,781 E 621 655] logging.cc:414: *** Aborted at 1614696104 (unix time) try "date -d @1614696104" if you are using GNU date ***
[2021-03-02 06:41:44,781 E 621 655] logging.cc:414: PC: @ 0x0 (unknown)
[2021-03-02 06:41:44,782 E 621 655] logging.cc:414: *** SIGSEGV (@0x18) received by PID 621 (TID 0x7f4962ffd700) from PID 24; stack trace: ***
[2021-03-02 06:41:44,783 E 621 655] logging.cc:414: @ 0x55f49c67ca0f google::(anonymous namespace)::FailureSignalHandler()
[2021-03-02 06:41:44,784 E 621 655] logging.cc:414: @ 0x7f496bf98e90 (unknown)
[2021-03-02 06:41:44,785 E 621 655] logging.cc:414: @ 0x55f49c2f6db9 dlmalloc
[2021-03-02 06:41:44,785 E 621 655] logging.cc:414: @ 0x55f49c2f7790 plasma::internal_memalign()
[2021-03-02 06:41:44,786 E 621 655] logging.cc:414: @ 0x55f49c2f1e71 plasma::PlasmaAllocator::Memalign()
[2021-03-02 06:41:44,787 E 621 655] logging.cc:414: @ 0x55f49c2ed0e2 plasma::PlasmaStoreRunner::Start()
[2021-03-02 06:41:44,788 E 621 655] logging.cc:414: @ 0x55f49c28ec3c std::thread::_State_impl<>::_M_run()
[2021-03-02 06:41:44,789 E 621 655] logging.cc:414: @ 0x55f49c9ec660 execute_native_thread_routine
[2021-03-02 06:41:44,789 E 621 655] logging.cc:414: @ 0x7f496bf8f1f4 start_thread
[2021-03-02 06:41:44,793 E 621 655] logging.cc:414: @ 0x7f496b1d816f __GI___clone
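
Both traces show the plasma object store aborting inside dlmalloc when its initial mmap of shared memory fails ("Cannot allocate memory"), which usually points at /dev/shm size or per-process memory limits on the node rather than at the Python code. Some hedged shell checks (not from the thread) that can narrow this down:

df -h /dev/shm                 # size and usage of the shared-memory filesystem Ray maps by default
ulimit -v                      # per-process virtual-memory limit, if the login node sets one
ulimit -l                      # locked-memory limit
grep -i commit /proc/meminfo   # overcommit accounting on the node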

chegrane changed the title from "ray Aborted" to "ray Aborted on a cluster managed by slurm" on Mar 3, 2021
chegrane commented Mar 3, 2021

Dear @rkooo567

I found where the problem was.

In fact, as I suspected in my first post, it is related to Slurm, since I am using a cluster. To confirm or rule out this relation, I tested by submitting a job to Slurm with the script provided in my first post.

The error was in the command line used to submit the job:

  • bash ./lunch_ray_cluster_cedar.sh, which is wrong
  • sbatch ./lunch_ray_cluster_cedar.sh, which is correct

The program worked fine without any problem.

As a result, it looks like on a cluster managed by Slurm, we have to go through the scheduler for Ray to work properly.

But the problem remains in the case where we want to launch a small program without submitting a job to Slurm.
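
For that remaining case, a hedged alternative (not confirmed in this thread) is to request a short interactive allocation instead of running on the login node, so the test still goes through Slurm but without a batch script:

# salloc opens an interactive shell inside a Slurm allocation; the resource
# values here are placeholders matching the batch script above.
salloc --time=00:15:00 --cpus-per-task=4 --mem=8G --account=def-xxxxxxxxx_name_of_prof_xxxxxx
# then, inside the allocation:
python test_ray_one_node.py 1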

stale bot commented Jul 1, 2021

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label on Jul 1, 2021