[object_buffer_pool] reduce mutex lock scope in WriteChunk #43434

Merged
merged 1 commit into ray-project:master on Feb 29, 2024

Conversation

@sjoshi6 (Contributor) commented Feb 26, 2024

Why are these changes needed?

Object store network transfer performance is slow, and we observe a periodic burst followed by a gap in network usage.

A burst of inbound network traffic occurs at the beginning of each ray.get(obj_refs) call, then there is a wide gap with no network traffic, followed by another burst at the start of the next ray.get(obj_refs) call.

This looks like a processing bottleneck on the Pull side when payloads are received over the network.

We dug into the code in object_manager (https://github.com/ray-project/ray/tree/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager) and found the following:

  • Objects are transferred in chunks of size 5 MiB
  • When a PushRequest is received for a chunk, it is processed by ObjectManager::HandlePush (https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L562)
  • HandlePush internally calls ObjectManager::ReceiveObjectChunk (https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L623), which results in a call to ObjectBufferPool::WriteChunk (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L128)
  • The WriteChunk function is mutex guarded throughout its execution (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L122)
  • This includes the std::memcpy call for a 5 MiB payload (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L139)
  • The pool_mutex_ lock (https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L215) is shared by all object_ids being received over the network

This makes us believe that even if chunks for different ObjectIds are received in parallel over the network, they are written sequentially, which would explain why we see a burst in network usage followed by a hole.
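
To make the bottleneck concrete, here is a minimal sketch of that pattern. The names and std::mutex locking are illustrative only (the actual ObjectBufferPool keeps richer per-chunk state behind its pool_mutex_ member), but the shape is the same: the pool-wide lock is held across the large copy.

```cpp
#include <cstddef>
#include <cstring>
#include <mutex>

std::mutex pool_mutex;  // illustrative stand-in for the pool-wide pool_mutex_

// Sketch of the pre-existing pattern: every chunk write, for every object,
// takes the same lock and holds it across the ~5 MiB copy, so writers
// for different objects serialize.
void WriteChunkGuarded(char *chunk_buffer, const char *data, std::size_t size) {
  std::lock_guard<std::mutex> lock(pool_mutex);
  // ... look up the chunk and mark it REFERENCED ...
  std::memcpy(chunk_buffer, data, size);  // large copy performed under the lock
  // ... seal the chunk / object if all chunks have arrived ...
}
```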

Changes

Write Chunk

  • Transition the chunk from REFERENCED to SEALED before releasing the lock
  • Increment / Decrement num_inflight_copies before / after the copy
  • Perform an unguarded memcpy of the chunk into the buffer
  • Reacquire the mutex lock and perform object_id level Seal and Release decisions (a simplified sketch of this flow follows below)
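
A minimal sketch of the restructured flow, using plain std::mutex and an illustrative num_inflight_copies counter; the real ObjectBufferPool::WriteChunk carries its own chunk-state bookkeeping, so treat the names below as assumptions for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <mutex>

std::mutex pool_mutex;        // shared pool-wide lock
int num_inflight_copies = 0;  // guarded by pool_mutex

void WriteChunkUnguardedCopy(char *chunk_buffer, const char *data,
                             std::size_t size) {
  {
    std::lock_guard<std::mutex> lock(pool_mutex);
    // Transition the chunk REFERENCED -> SEALED so no other writer claims
    // it, and record that a copy is in flight.
    ++num_inflight_copies;
  }

  // The ~5 MiB copy now runs without the pool-wide lock, so chunks of
  // different objects can be written in parallel.
  std::memcpy(chunk_buffer, data, size);

  {
    std::lock_guard<std::mutex> lock(pool_mutex);
    --num_inflight_copies;
    // ... object_id-level Seal / Release decisions happen here ...
  }
}
```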

AbortCreate

  • Wait to ensure num_inflight_copies == 0 before allowing the Release & Abort calls for the object_id
  • This check ensures that we do not release the underlying buffer while an unguarded copy is ongoing (see the sketch below)
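
A minimal sketch of that wait, expressed with a std::condition_variable for illustration. The mechanism shown is an assumption, not the actual implementation: the writer side would notify the condition variable after decrementing num_inflight_copies under the lock, whereas the real code may rely on its mutex's built-in condition waiting instead.

```cpp
#include <condition_variable>
#include <mutex>

std::mutex pool_mutex;
std::condition_variable copies_drained;  // notified when a copy finishes
int num_inflight_copies = 0;             // incremented/decremented by writers

// Illustrative AbortCreate: block until no unguarded memcpy is still
// touching the object's buffers; only then is it safe to Release and
// Abort the underlying allocation.
void AbortCreateSketch(/* object_id omitted */) {
  std::unique_lock<std::mutex> lock(pool_mutex);
  copies_drained.wait(lock, [] { return num_inflight_copies == 0; });
  // ... Release() / Abort() the object's underlying buffer here ...
}
```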

Tests

Before

  • Sampled network at 1 second frequency

speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-13 at 10 26 33 AM]

  • Sampled network at 100 millisecond frequency

speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-13 at 10 26 37 AM]

After

  • Sampled network at 1 second frequency

speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-21 at 1 22 08 PM]

  • Sampled network at 100 millisecond frequency

speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0
[Screenshot 2024-02-21 at 1 19 10 PM]

Finished benchmark for total_size_MiB: 102400, block_size_MiB: 1024, parallel_block: None
	ray.wait(fetch_local=True) Gbps: 54.67178658153866
	ray.wait(fetch_local=True) total time s: 15.711823463439941
	ray.get() Gbps: 186301.24111362518
	ray.get() total time s: 0.004610776901245117

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Saurabh Vishwas Joshi <sjoshi@pinterest.com>
@sjoshi6 (Contributor, Author) commented Feb 28, 2024

@iycheng, @kevin85421, @jjyao

@fishbone merged commit 3167329 into ray-project:master on Feb 29, 2024
8 of 10 checks passed
iamyangchen pushed a commit to pinterest/ray that referenced this pull request Mar 25, 2024
iamyangchen added a commit to pinterest/ray that referenced this pull request Apr 12, 2024