
[core] use fallocate for fallback allocation to avoid SIGBUS #16824

Merged: 9 commits merged into ray-project:master on Jul 7, 2021

Conversation

@scv119 (Contributor) commented Jul 1, 2021:

Recently we hit a bug where fallback allocation crashes with a SIGBUS error when /tmp is full. Ideally we'd like to throw an OOM error instead of SIGBUS.

To address the problem, this PR uses fallocate, which guarantees that follow-up write accesses won't fail if the fallocate call succeeds. Note that this only works on Linux.
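
A minimal sketch of the approach (illustrative only; the helper name and error handling here are assumptions, not the PR's actual dlmalloc.cc code): reserve the file's blocks with fallocate before mmap'ing, so a full disk surfaces as ENOSPC at allocation time rather than as a SIGBUS on a later write.

```cpp
// Illustrative sketch, not the PR's actual dlmalloc.cc code.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // fallocate(2) is Linux-specific
#endif
#include <fcntl.h>
#include <sys/mman.h>

#include <cstddef>

// Returns the mapped address, or nullptr with errno set (e.g. ENOSPC).
void *MmapWithReservedDisk(int fd, size_t size) {
  // If fallocate succeeds, subsequent writes to the mapping cannot fail
  // for lack of disk space; on a full disk it fails with ENOSPC, which
  // the caller can surface as an out-of-disk / OOM error.
  if (fallocate(fd, /*mode=*/0, /*offset=*/0, static_cast<off_t>(size)) != 0) {
    return nullptr;
  }
  void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  return ptr == MAP_FAILED ? nullptr : ptr;
}
```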

Another option is posix_fallocate, which falls back to emulation on non-Linux systems. This is also worth considering, but it may end up with different performance behavior depending on whether fallocate or the emulation is used. The man page notes some additional caveats (a call sketch follows the quote):

   In the glibc implementation, posix_fallocate() is implemented
   using the fallocate(2) system call, which is MT-safe.  If the
   underlying filesystem does not support fallocate(2), then the
   operation is emulated with the following caveats:

   * The emulation is inefficient.

   * There is a race condition where concurrent writes from another
     thread or process could be overwritten with null bytes.

   * There is a race condition where concurrent file size increases
     by another thread or process could result in a file whose size
     is smaller than expected.

   * If fd has been opened with the O_APPEND or O_WRONLY flags, the
     function fails with the error EBADF.
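
For reference, a sketch of how the portable variant could be called (again illustrative, not something this PR ships); note that unlike fallocate, posix_fallocate reports failure through its return value rather than errno:

```cpp
// Illustrative sketch of the portable alternative discussed above.
#include <fcntl.h>

#include <cstdio>
#include <cstring>

// Returns true if `size` bytes are reserved starting at offset 0.
bool ReserveDiskPortable(int fd, off_t size) {
  int err = posix_fallocate(fd, /*offset=*/0, size);
  if (err != 0) {
    // e.g. ENOSPC when out of disk, or EBADF per the caveat above.
    std::fprintf(stderr, "posix_fallocate: %s\n", std::strerror(err));
    return false;
  }
  return true;
}
```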

Tested with the shuffler:

(pid=36628) Epoch 0 done on consumer 1.
(pid=36628) Starting epoch 1 on consumer 1.
2021-07-06 03:37:57,350	ERROR worker.py:79 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::Consumer.consume() (pid=36616, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 493, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 514, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 384, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::shuffle_reduce() (pid=38802, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 563, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1670, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 152, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object 31ffddd68e69d833ffffffffffffffffffffffff0100000001000000 in object store because it is full. Object size is 1466584036 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=36616) Epoch 0 done on consumer 3.
(pid=36616) Starting epoch 1 on consumer 3.
2021-07-06 03:38:02,257	ERROR worker.py:79 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::Consumer.consume() (pid=36625, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 493, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 514, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 384, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::shuffle_reduce() (pid=38830, ip=172.31.10.69)
  File "python/ray/_raylet.pyx", line 563, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1670, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 152, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object 0fb939667ee64409ffffffffffffffffffffffff0100000001000000 in object store because it is full. Object size is 1466027204 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=36625) Consuming batch on consumer 2 for epoch 0.

In the raylet.out log:

[2021-07-06 03:37:33,758 I 36462 36479] dlmalloc.cc:122: create_and_mmap_buffer(1466028040, /tmp/ray/plasmaXXXXXX)
[2021-07-06 03:37:33,758 D 36462 36479] dlmalloc.cc:158: Preallocating fallback allocation using fallocate
[2021-07-06 03:37:34,540 E 36462 36479] dlmalloc.cc:167: Out of disk space with fallocate error: No space left on device

@scv119 linked an issue on Jul 1, 2021 that may be closed by this pull request

if (allocated_once && RayConfig::instance().plasma_unlimited()) {
  if (!MAP_POPULATE) {
    RAY_LOG(WARNING)
        << "Fallback allocation: MAP_POPULATE is not available on this platform.";
@ericl (Contributor) commented on this diff, Jul 1, 2021:

This might be too spammy since we will call this very often; consider LOG_DEBUG.

@ericl (Contributor) left a comment:

It might be a bit difficult to unit test, but could you at least manually test and check what happens if the disk is full and we trigger a fallback allocation?

@ericl added the @author-action-required label on Jul 1, 2021
@ericl (Contributor) commented Jul 1, 2021:

LGTM pending testing (it would also be great to add the workload to the nightly tests, though I'm not sure how easy it would be to replicate the out-of-disk issue; we might consider an alternate workload that specifically stresses disk space):

# First fill up memory
...

# Allocate from filesystem until out of disk
with pytest.raises(OutOfMemoryError):
    while True:
        refs.append(ray.put(big_array))

    RAY_LOG(DEBUG) << "Enable MAP_POPULATE for fallback allocation.";
    flags |= MAP_POPULATE;
  }

  *pointer = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, *fd, 0);
  if (*pointer == MAP_FAILED) {
A Contributor commented on this diff:

Maybe we could change this to say out of disk space?

@scv119 (Contributor, Author) commented Jul 2, 2021:

Huh, SIGBUS still happens with this fix. Investigating.

@scv119 (Contributor, Author) commented Jul 2, 2021:

Confirmed: SIGBUS still happens even with MAP_POPULATE set:

[2021-07-02 03:18:08,731 I 27428 27457] dlmalloc.cc:114: create_and_mmap_buffer(29335560, /tmp/ray/plasmaXXXXXX)
[2021-07-02 03:18:08,731 D 27428 27457] dlmalloc.cc:153: Enable MAP_POPULATE for fallback allocation.
[2021-07-02 03:18:08,745 D 27428 27457] dlmalloc.cc:202: 0x7f4e56891008 = fake_mmap(29335560)
[2021-07-02 03:18:08,745 D 27428 27457] store.cc:373: create object c361e4b8895ba43cffffffffffffffffffffffff0100000023000000 succeeded

And later, on the worker node:

[2021-07-02 03:18:25,925 D 45086 45254] gcs_server_address_updater.cc:53: Getting gcs server address from raylet.
[2021-07-02 03:18:26,593 E 45086 96377] logging.cc:440: *** Aborted at 1625195906 (unix time) try "date -d @1625195906" if you are using GNU date ***
[2021-07-02 03:18:26,694 E 45086 96377] logging.cc:440: PC: @                0x0 (unknown)
[2021-07-02 03:18:26,711 E 45086 96377] logging.cc:440: *** SIGBUS (@0x7f444b62e000) received by PID 45086 (TID 0x7f45657fc700) from

The proper fix might be installing a signal handler...

@scv119 added the do-not-merge label on Jul 2, 2021
@ericl (Contributor) commented Jul 2, 2021 via email

@scv119 (Contributor, Author) commented Jul 2, 2021:

That's a good point. Actually, SIGBUS happens on the worker (client) side, which does mmap too, without the MAP_POPULATE flag. I can try enabling that to see what happens.

Update: SIGBUS happens even when mmap succeeds with the MAP_POPULATE flag.

@scv119 (Contributor, Author) commented Jul 3, 2021:

Doesn't work as expected

@scv119 closed this on Jul 3, 2021
@scv119 reopened this on Jul 5, 2021
@scv119 changed the title from "[core] best effort enable MAP_POPULATE for fallback allocation" to "[core] use fallocate for fallback allocation to avoid SIGBUS" on Jul 5, 2021
@scv119 (Contributor, Author) commented Jul 5, 2021:

Using fallocate instead

@scv119 removed the @author-action-required and do-not-merge labels on Jul 6, 2021
@ericl (Contributor) left a comment:

LGTM. Can we add a test that fills up disk space and checks the error return?

A clever way to do this nondestructively would be to set the fallback dir to /dev/shm as well.

@ericl added the @author-action-required label on Jul 6, 2021
@clarkzinzow (Contributor) commented:

Also @scv119 have you confirmed that this fixes #16540?

@scv119 (Contributor, Author) commented Jul 6, 2021:

@clarkzinzow this turns the SIGBUS crash into OOM exceptions, but I'm not sure if the shuffle loader can recover from the OOM. I can confirm ray-project/ray_shuffling_data_loader#14 fixed the issue.

@rkooo567 (Contributor) commented Jul 6, 2021:

The fix is to use a large disk for spilling. I think this will help you figure out the problem earlier (but we shouldn't raise OOM; we should raise an out-of-disk-space error instead).

@rkooo567 (Contributor) commented Jul 6, 2021:

@ericl @scv119 Btw, how are we deleting the fallback-allocated files now? (And is the behavior the same after using fallocate?)

@scv119 (Contributor, Author) commented Jul 6, 2021:

Yup, I think the behavior is the same.

@ericl (Contributor) commented Jul 7, 2021:

Tests seem to be failing on:

============================= test session starts ==============================
platform linux -- Python 3.6.13, pytest-5.4.3, py-1.10.0, pluggy-0.13.1 -- /opt/miniconda/bin/python3
cachedir: .pytest_cache
rootdir: /root/.cache/bazel/_bazel_root/5fe90af4e7d1ed9fcf52f59e39e126f5/execroot/com_github_ray_project_ray/bazel-out/k8-opt/bin/python/ray/tests/test_plasma_unlimited.runfiles/com_github_ray_project_ray
plugins: asyncio-0.15.1, rerunfailures-10.1, sugar-0.9.4, timeout-1.4.2
collecting ... collected 8 items

::test_fallback_when_spilling_impossible_on_put PASSED                   [ 12%]
::test_spilling_when_possible_on_put PASSED                              [ 25%]
::test_fallback_when_spilling_impossible_on_get PASSED                   [ 37%]
::test_spilling_when_possible_on_get PASSED                              [ 50%]
::test_task_unlimited PASSED                                             [ 62%]
::test_task_unlimited_multiget_args PASSED                               [ 75%]
::test_fd_reuse_no_memory_corruption PASSED                              [ 87%]
::test_fallback_allocation_failure ================================================================================

@scv119 removed the @author-action-required label on Jul 7, 2021
@ericl merged commit 0421fa1 into ray-project:master on Jul 7, 2021

Successfully merging this pull request may close these issues.

[Core] Plasma SIGBUS on shuffling data loader workload