
[runtime env] Fix segfault by sending RPC reply only after we're done accessing the request #28409

Merged · 2 commits · Sep 11, 2022

Conversation

architkulkarni
Contributor

Why are these changes needed?

Debugged together with @iycheng.

In the current design, the owner of the gRPC request object is the gRPC call, so when the gRPC reply is sent the request object is destroyed. This caused a race condition in the runtime env gRPC handler where sometimes a request object was being accessed after it was being destroyed, causing a segfault. This PR moves the gRPC reply after the request object is used.

As a followup, we should improve the design so the developer doesn't need to be aware of these memory issues.

Related issue number

Closes #28198

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119
Contributor

scv119 commented Sep 9, 2022

Hmm, is it possible to add a test, or better, to prevent this from happening again? cc @iycheng

@simon-mo
Contributor

simon-mo commented Sep 9, 2022

This is a classic hidden use-after-free. (Not for this PR:) Could the GCS_RPC_SEND_REPLY macro take ownership of the request and drop it, so this kind of situation can't happen anymore?

@fishbone
Contributor

Yeah, we need to improve this. I'll create a ticket.

@fishbone fishbone merged commit c8249f9 into ray-project:master Sep 11, 2022
@architkulkarni architkulkarni deleted the fix-runtime-env-segfault branch September 12, 2022 15:41
ilee300a pushed a commit to ilee300a/ray that referenced this pull request Sep 12, 2022
… accessing the request (ray-project#28409)

In the current design, the owner of the gRPC request object is the gRPC call, so when the gRPC reply is sent the request object is destroyed.  This caused a race condition in the runtime env gRPC handler where sometimes a request object was being accessed after it was being destroyed, causing a segfault.  This PR moves the gRPC reply after the request object is used.

As a followup, we should improve the design so the developer doesn't need to be aware of these memory issues.

Signed-off-by: ilee300a <ilee300@anyscale.com>
justinvyu pushed a commit to justinvyu/ray that referenced this pull request Sep 14, 2022
matthewdeng pushed a commit that referenced this pull request Sep 15, 2022
fishbone added a commit that referenced this pull request Sep 27, 2022
…on. (#28661)

It's easy to use the request after the call is finished, which leads to a segmentation fault. We've already had crashes due to this: #28409.
The root cause is that the lifecycle of the request is tied to the call rather than to the handler. If the handler calls the reply function, the request becomes invalid.

This unintuitive behavior makes it very easy to introduce bugs, and this PR fixes it:

The const ref is changed to a value and the data is moved into it. This decouples the lifecycle of the request from the call, and no perf regression is expected.
scv119 pushed a commit that referenced this pull request Sep 29, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Successfully merging this pull request may close these issues.

[Core] gcs_server SIGSEGV with ray 2.0.0