[object_buffer_pool] reduce mutex lock scope in WriteChunk #43434

Merged
merged 1 commit into ray-project:master on Feb 29, 2024

Conversation

@sjoshi6 (Contributor) commented Feb 26, 2024

Why are these changes needed?

Object store network transfer performance is slow, and we observe a periodic burst followed by a gap in network usage.

A burst of inbound network traffic occurs at the beginning of each ray.get(obj_refs) call, then there is a wide gap with no network traffic, followed by another burst at the start of the next ray.get(obj_refs) call.

This looks like a processing bottleneck on the Pull side when payloads are received over the network.

We dug into the code in object_manager (https://github.com/ray-project/ray/tree/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager) and found the following:

  • Objects are transferred in chunks of size 5 MiB
  • When a PushRequest is received for a chunk, it is processed by ObjectManager::HandlePush (https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L562)
  • HandlePush internally calls ObjectManager::ReceiveObjectChunk (https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L623), which results in a call to ObjectBufferPool::WriteChunk (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L128)
  • The WriteChunk function is mutex guarded throughout its execution (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L122)
  • This includes the std::memcpy call for a 5 MiB payload (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L139)
  • The pool_mutex_ lock (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L215) is shared by all object_ids being received over the network

This makes us believe that even if chunks for different ObjectIds are received in parallel over the network, they are written sequentially, which would explain why we see a burst in network usage followed by a hole.
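
To make the bottleneck concrete, here is a minimal sketch of that pattern. The names and std::mutex locking are illustrative only (the actual ObjectBufferPool keeps richer per-chunk state behind its pool_mutex_ member), but the shape is the same: the pool-wide lock is held across the large copy.

```cpp
#include <cstddef>
#include <cstring>
#include <mutex>

std::mutex pool_mutex;  // illustrative stand-in for the pool-wide pool_mutex_

// Sketch of the pre-existing pattern: every chunk write, for every object,
// takes the same lock and holds it across the ~5 MiB copy, so writers
// for different objects serialize.
void WriteChunkGuarded(char *chunk_buffer, const char *data, std::size_t size) {
  std::lock_guard<std::mutex> lock(pool_mutex);
  // ... look up the chunk and mark it REFERENCED ...
  std::memcpy(chunk_buffer, data, size);  // large copy performed under the lock
  // ... seal the chunk / object if all chunks have arrived ...
}
```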

Changes

Write Chunk

  • Transition the chunk from REFERENCED to SEALED before releasing the lock
  • Increment / Decrement num_inflight_copies before / after the copy
  • Perform an unguarded memcpy of the chunk into the buffer
  • Reacquire the mutex lock and perform object_id level Seal and Release decisions (a simplified sketch of this flow follows below)
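
A minimal sketch of the restructured flow, using plain std::mutex and an illustrative num_inflight_copies counter; the real ObjectBufferPool::WriteChunk carries its own chunk-state bookkeeping, so treat the names below as assumptions for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <mutex>

std::mutex pool_mutex;        // shared pool-wide lock
int num_inflight_copies = 0;  // guarded by pool_mutex

void WriteChunkUnguardedCopy(char *chunk_buffer, const char *data,
                             std::size_t size) {
  {
    std::lock_guard<std::mutex> lock(pool_mutex);
    // Transition the chunk REFERENCED -> SEALED so no other writer claims
    // it, and record that a copy is in flight.
    ++num_inflight_copies;
  }

  // The ~5 MiB copy now runs without the pool-wide lock, so chunks of
  // different objects can be written in parallel.
  std::memcpy(chunk_buffer, data, size);

  {
    std::lock_guard<std::mutex> lock(pool_mutex);
    --num_inflight_copies;
    // ... object_id-level Seal / Release decisions happen here ...
  }
}
```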

AbortCreate

  • Wait to ensure num_inflight_copies == 0 before allowing the Release & Abort calls for the object_id
  • This check ensures that we do not release the underlying buffer while an unguarded copy is ongoing (see the sketch below)
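
A minimal sketch of that wait, expressed with a std::condition_variable for illustration. The mechanism shown is an assumption, not the actual implementation: the writer side would notify the condition variable after decrementing num_inflight_copies under the lock, whereas the real code may rely on its mutex's built-in condition waiting instead.

```cpp
#include <condition_variable>
#include <mutex>

std::mutex pool_mutex;
std::condition_variable copies_drained;  // notified when a copy finishes
int num_inflight_copies = 0;             // incremented/decremented by writers

// Illustrative AbortCreate: block until no unguarded memcpy is still
// touching the object's buffers; only then is it safe to Release and
// Abort the underlying allocation.
void AbortCreateSketch(/* object_id omitted */) {
  std::unique_lock<std::mutex> lock(pool_mutex);
  copies_drained.wait(lock, [] { return num_inflight_copies == 0; });
  // ... Release() / Abort() the object's underlying buffer here ...
}
```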

Tests

Before

  • Sampled network at 1 second frequency

speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-13 at 10 26 33 AM]

  • Sampled network at 100 millisecond frequency

speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-13 at 10 26 37 AM]

After

  • Sampled network at 1 second frequency

speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-21 at 1 22 08 PM]

  • Sampled network at 100 millisecond frequency

speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-21 at 1 19 10 PM]

Finished benchmark for total_size_MiB: 102400, block_size_MiB: 1024, parallel_block: None
	ray.wait(fetch_local=True) Gbps: 54.67178658153866
	ray.wait(fetch_local=True) total time s: 15.711823463439941
	ray.get() Gbps: 186301.24111362518
	ray.get() total time s: 0.004610776901245117

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Saurabh Vishwas Joshi <sjoshi@pinterest.com>
@sjoshi6 (Contributor, Author) commented Feb 28, 2024

@iycheng, @kevin85421, @jjyao

@fishbone merged commit 3167329 into ray-project:master on Feb 29, 2024
8 of 10 checks passed
iamyangchen pushed a commit to pinterest/ray that referenced this pull request Mar 25, 2024
iamyangchen added a commit to pinterest/ray that referenced this pull request Apr 12, 2024