
[core] use fallocate for fallback allocation to avoid SIGBUS #16824

Merged: 9 commits merged into ray-project:master on Jul 7, 2021

Conversation

@scv119 (Contributor) commented Jul 1, 2021:

Recently we hit a bug where fallback allocation crashes with a SIGBUS error when /tmp is full. Ideally we'd like to throw an OOM error instead of SIGBUS.

To address the problem, this PR uses fallocate, which guarantees that follow-up write accesses won't fail if the fallocate call succeeds. Note that this only works on Linux.
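
A minimal sketch of the approach (illustrative only; the helper name and error handling here are assumptions, not the PR's actual dlmalloc.cc code): reserve the file's blocks with fallocate before mmap'ing, so a full disk surfaces as ENOSPC at allocation time rather than as a SIGBUS on a later write.

```cpp
// Illustrative sketch, not the PR's actual dlmalloc.cc code.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // fallocate(2) is Linux-specific
#endif
#include <fcntl.h>
#include <sys/mman.h>

#include <cstddef>

// Returns the mapped address, or nullptr with errno set (e.g. ENOSPC).
void *MmapWithReservedDisk(int fd, size_t size) {
  // If fallocate succeeds, subsequent writes to the mapping cannot fail
  // for lack of disk space; on a full disk it fails with ENOSPC, which
  // the caller can surface as an out-of-disk / OOM error.
  if (fallocate(fd, /*mode=*/0, /*offset=*/0, static_cast<off_t>(size)) != 0) {
    return nullptr;
  }
  void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  return ptr == MAP_FAILED ? nullptr : ptr;
}
```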

Another option is posix_fallocate, which falls back to emulation on non-Linux systems. This is also worth considering, but it may end up with different performance behavior depending on whether fallocate or the emulation is used. The man page notes some additional caveats (a call sketch follows the quote):

   In the glibc implementation, posix_fallocate() is implemented
   using the fallocate(2) system call, which is MT-safe.  If the
   underlying filesystem does not support fallocate(2), then the
   operation is emulated with the following caveats:

   * The emulation is inefficient.

   * There is a race condition where concurrent writes from another
     thread or process could be overwritten with null bytes.

   * There is a race condition where concurrent file size increases
     by another thread or process could result in a file whose size
     is smaller than expected.

   * If fd has been opened with the O_APPEND or O_WRONLY flags, the
     function fails with the error EBADF.
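
For reference, a sketch of how the portable variant could be called (again illustrative, not something this PR ships); note that unlike fallocate, posix_fallocate reports failure through its return value rather than errno:

```cpp
// Illustrative sketch of the portable alternative discussed above.
#include <fcntl.h>

#include <cstdio>
#include <cstring>

// Returns true if `size` bytes are reserved starting at offset 0.
bool ReserveDiskPortable(int fd, off_t size) {
  int err = posix_fallocate(fd, /*offset=*/0, size);
  if (err != 0) {
    // e.g. ENOSPC when out of disk, or EBADF per the caveat above.
    std::fprintf(stderr, "posix_fallocate: %s\n", std::strerror(err));
    return false;
  }
  return true;
}
```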

Tested with the shuffler:

(pid=36628) Epoch 0 done on consumer 1.
(pid=36628) Starting epoch 1 on consumer 1.
2021-07-06 03:37:57,350	ERROR worker.py:79 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::Consumer.consume() (pid=36616, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 493, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 514, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 384, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::shuffle_reduce() (pid=38802, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 563, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1670, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 152, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object 31ffddd68e69d833ffffffffffffffffffffffff0100000001000000 in object store because it is full. Object size is 1466584036 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=36616) Epoch 0 done on consumer 3.
(pid=36616) Starting epoch 1 on consumer 3.
2021-07-06 03:38:02,257	ERROR worker.py:79 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::Consumer.consume() (pid=36625, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 493, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 514, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 384, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::shuffle_reduce() (pid=38830, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 563, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1670, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 152, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object 0fb939667ee64409ffffffffffffffffffffffff0100000001000000 in object store because it is full. Object size is 1466027204 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=36625) Consuming batch on consumer 2 for epoch 0.

In the raylet.out log:

[2021-07-06 03:37:33,758 I 36462 36479] dlmalloc.cc:122: create_and_mmap_buffer(1466028040, /tmp/ray/plasmaXXXXXX)
[2021-07-06 03:37:33,758 D 36462 36479] dlmalloc.cc:158: Preallocating fallback allocation using fallocate
[2021-07-06 03:37:34,540 E 36462 36479] dlmalloc.cc:167: Out of disk space with fallocate error: No space left on device

@scv119 linked an issue on Jul 1, 2021 that may be closed by this pull request

if (allocated_once && RayConfig::instance().plasma_unlimited()) {
  if (!MAP_POPULATE) {
    RAY_LOG(WARNING)
        << "Fallback allocation: MAP_POPULATE is not available on this platform.";
@ericl (Contributor) commented on this diff, Jul 1, 2021:

This might be too spammy since we will call this very often; consider LOG_DEBUG.

@ericl (Contributor) left a comment:

It might be a bit difficult to unit test, but could you at least manually test and check what happens if the disk is full and we trigger a fallback allocation?

@ericl added the @author-action-required label on Jul 1, 2021
@ericl (Contributor) commented Jul 1, 2021:

LGTM pending testing (it would also be great to add the workload to the nightly tests, though I'm not sure how easy it would be to replicate the out-of-disk issue; we might consider an alternate workload that specifically stresses disk space):

# First fill up memory
...

# Allocate from filesystem until out of disk
with pytest.raises(OutOfMemoryError):
    while True:
        refs.append(ray.put(big_array))

    RAY_LOG(DEBUG) << "Enable MAP_POPULATE for fallback allocation.";
    flags |= MAP_POPULATE;
  }

  *pointer = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, *fd, 0);
  if (*pointer == MAP_FAILED) {
A Contributor commented on this diff:

Maybe we could change this to say out of disk space?

@scv119 (Contributor, Author) commented Jul 2, 2021:

Huh, SIGBUS still happens with this fix. Investigating.

@scv119 (Contributor, Author) commented Jul 2, 2021:

Confirmed: SIGBUS still happens even with MAP_POPULATE set:

[2021-07-02 03:18:08,731 I 27428 27457] dlmalloc.cc:114: create_and_mmap_buffer(29335560, /tmp/ray/plasmaXXXXXX)
[2021-07-02 03:18:08,731 D 27428 27457] dlmalloc.cc:153: Enable MAP_POPULATE for fallback allocation.
[2021-07-02 03:18:08,745 D 27428 27457] dlmalloc.cc:202: 0x7f4e56891008 = fake_mmap(29335560)
[2021-07-02 03:18:08,745 D 27428 27457] store.cc:373: create object c361e4b8895ba43cffffffffffffffffffffffff0100000023000000 succeeded

And later, on the worker node:

[2021-07-02 03:18:25,925 D 45086 45254] gcs_server_address_updater.cc:53: Getting gcs server address from raylet.
[2021-07-02 03:18:26,593 E 45086 96377] logging.cc:440: *** Aborted at 1625195906 (unix time) try "date -d @1625195906" if you are using GNU date ***
[2021-07-02 03:18:26,694 E 45086 96377] logging.cc:440: PC: @                0x0 (unknown)
[2021-07-02 03:18:26,711 E 45086 96377] logging.cc:440: *** SIGBUS (@0x7f444b62e000) received by PID 45086 (TID 0x7f45657fc700) from

The proper fix might be installing a signal handler...

@scv119 added the do-not-merge label on Jul 2, 2021
@ericl (Contributor) commented Jul 2, 2021 via email

@scv119 (Contributor, Author) commented Jul 2, 2021:

That's a good point. Actually, SIGBUS happens on the worker (client) side, which does mmap too, without the MAP_POPULATE flag. I can try enabling that to see what happens.

Update: SIGBUS happens even when mmap succeeds with the MAP_POPULATE flag.

@scv119 (Contributor, Author) commented Jul 3, 2021:

Doesn't work as expected

@scv119 closed this on Jul 3, 2021
@scv119 reopened this on Jul 5, 2021
@scv119 changed the title from "[core] best effort enable MAP_POPULATE for fallback allocation" to "[core] use fallocate for fallback allocation to avoid SIGBUS" on Jul 5, 2021
@scv119 (Contributor, Author) commented Jul 5, 2021:

Using fallocate instead

@scv119 removed the @author-action-required and do-not-merge labels on Jul 6, 2021
@ericl (Contributor) left a comment:

LGTM. Can we add a test that fills up disk space and checks the error return?

A clever way to do this nondestructively would be to set the fallback dir to /dev/shm as well.

@ericl added the @author-action-required label on Jul 6, 2021
@clarkzinzow (Contributor) commented:

Also @scv119 have you confirmed that this fixes #16540?

@scv119 (Contributor, Author) commented Jul 6, 2021:

@clarkzinzow this turns the SIGBUS crash into OOM exceptions, but I'm not sure if the shuffle loader can recover from the OOM. I can confirm ray-project/ray_shuffling_data_loader#14 fixed the issue.

@rkooo567 (Contributor) commented Jul 6, 2021:

The fix is to use a large disk for spilling. I think this will help you figure out the problem earlier (but we shouldn't raise OOM; we should raise an out-of-disk-space error instead).

@rkooo567 (Contributor) commented Jul 6, 2021:

@ericl @scv119 Btw, how are we deleting the fallback-allocated files now? (And is the behavior the same after using fallocate?)

@scv119 (Contributor, Author) commented Jul 6, 2021:

Yup, I think the behavior is the same.

@ericl (Contributor) commented Jul 7, 2021:

Tests seem to be failing on:

============================= test session starts ==============================
platform linux -- Python 3.6.13, pytest-5.4.3, py-1.10.0, pluggy-0.13.1 -- /opt/miniconda/bin/python3
cachedir: .pytest_cache
rootdir: /root/.cache/bazel/_bazel_root/5fe90af4e7d1ed9fcf52f59e39e126f5/execroot/com_github_ray_project_ray/bazel-out/k8-opt/bin/python/ray/tests/test_plasma_unlimited.runfiles/com_github_ray_project_ray
plugins: asyncio-0.15.1, rerunfailures-10.1, sugar-0.9.4, timeout-1.4.2
collecting ... collected 8 items

::test_fallback_when_spilling_impossible_on_put PASSED                   [ 12%]
::test_spilling_when_possible_on_put PASSED                              [ 25%]
::test_fallback_when_spilling_impossible_on_get PASSED                   [ 37%]
::test_spilling_when_possible_on_get PASSED                              [ 50%]
::test_task_unlimited PASSED                                             [ 62%]
::test_task_unlimited_multiget_args PASSED                               [ 75%]
::test_fd_reuse_no_memory_corruption PASSED                              [ 87%]
::test_fallback_allocation_failure ================================================================================

@scv119 removed the @author-action-required label on Jul 7, 2021
@ericl merged commit 0421fa1 into ray-project:master on Jul 7, 2021

Successfully merging this pull request may close these issues.

[Core] Plasma SIGBUS on shuffling data loader workload