Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the lifetime of InferPayload #241

Merged
merged 12 commits into from
May 17, 2023
Merged

Fix the lifetime of InferPayload #241

merged 12 commits into from
May 17, 2023

Conversation

krishung5
Copy link
Contributor

@krishung5 krishung5 commented May 12, 2023

For the BLS non-decoupled case, there is a race condtion:

In Thread 1 ExecuteBLSRequest:
The line infer_response = response_future.get() is wait on Thread 2 to complete the future's promise.

In Thread 2 InferResponseComplete:
When the line p->SetValueForPrevPromise(std::move(infer_response)); is reached, the promise is completed inside the function at p->SetValueForPrevPromise(std::move(infer_response));. At this moment, Thread 1 is unblocked, and infer_payload will be destroyed at the end of the block scope in Thread 1. This makes it possible that the following objects in Thread 2 might be used after infer_payload being destroyed, which causes segfaults.

Thread 1 ExecuteBLSRequest:
{
std::shared_ptr<InferPayload> infer_payload = ...;
auto response_future = request_executor_->Infer(infer_request, infer_payload);
infer_response = response_future.get(); // Waits on Thread 2 to complete the future's promise
}

Thread 2 InferResponseComplete:
auto p = reinterpret_cast<InferPayload*>(userp); // here p is same infer_payload as in Thread 1
p->SetValueForPrevPromise(std::move(infer_response)):
   prev_promise_->set_value(std::move(infer_response)); // Unblocks Thread 1
   prev_promise_.reset(); // Use after free race with Thread 1 destroying infer_payload
   is_promise_set_ = true; // Use after free race with Thread 1 destroying infer_payload

Order of execution for lifetime race:
Thread 1 blocks on promise future
Thread 2 completes promise and unblocks Thread 1
Thread 1 destroys infer_payload at the end of the block scope
Thread 2 accesses infer_payload after it was destroyed

This PR fixed the lifetime of InferPayload to avoid the race condition.

Copy link
Member

@Tabrizian Tabrizian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why a mutex is required. Can you explain where concurrent access to any of the class objects happen? I can see that we may need the mutex for decoupled response but I don't see why it is required in non-decoupled response.

Copy link
Member

@Tabrizian Tabrizian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why a mutex is required. Can you explain where concurrent access to any of the class objects happen? I can see that we may need the mutex for decoupled response but I don't see why it is required in non-decoupled response.

src/infer_payload.cc Outdated Show resolved Hide resolved
src/infer_payload.cc Outdated Show resolved Hide resolved
src/infer_payload.cc Outdated Show resolved Hide resolved
src/infer_payload.cc Outdated Show resolved Hide resolved
src/infer_payload.cc Outdated Show resolved Hide resolved
Tabrizian
Tabrizian previously approved these changes May 12, 2023
Comment on lines +80 to +81
auto linfer_payload = reinterpret_cast<InferPayload*>(userp);
std::shared_ptr<InferPayload> infer_payload = linfer_payload->GetPtr();
Copy link
Contributor

@rmccorm4 rmccorm4 May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it valid to call raw_ptr->GetPtr() here?

I think it is because under the hood we know that userp was infer_payload.get(), where infer_payload was a shared_ptr, but I think it might a little sketchy/prone to future error if the caller changes the type from shared_ptr to something to unique or raw or something else in the future. If this is indeed correct and don't have a clean alternative, I think we should document the constraints with a comment or two here.

It is permitted to call shared_from_this only on a previously shared object, i.e. on an object managed by std::shared_ptr. Otherwise the behavior is undefined (until C++17)std::bad_weak_ptr is thrown (by the shared_ptr constructor from a default-constructed weak_this) (since C++17).

enable_shared_from_this provides the safe alternative to an expression like std::shared_ptr(this), which is likely to result in this being destructed more than once by multiple owners that are unaware of each other (see example below).

ref:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you summarize in words the current issue and what this PR is solving?

The issue appears to be some lifetime issues passing the payload back and forth, and it looks like we're trying to solve this by passing shared_ptr through (C++ <-> C API <-> C++) to automatically manage the lifetime on both sides.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR description based on my understanding. @Tabrizian Please make any changes if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you summarize in words the current issue and what this PR is solving?

The issue appears to be some lifetime issues passing the payload back and forth, and it looks like we're trying to solve this by passing shared_ptr through (C++ <-> C API <-> C++) to automatically manage the lifetime on both sides.

Given the current design of the infer_payload being created as a shared pointer by python_be and then passed to request_executor, I believe this is cleanest approach. The basic problem was that the lifetime was managed by the shared pointer but the callback wasn't using a shared pointer and thus never increased the reference count.

That being said - I think if we were able to refactor to have the request_executor manage the creation and lifetime of infer_payload that may be cleaner (it doesn't seem at first glance that python_be needs to hang on to a reference - just the future). The creation and destruction ideally would be all handled internal to request_executor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it valid to call raw_ptr->GetPtr() here?

I think it is because under the hood we know that userp was infer_payload.get(), where infer_payload was a shared_ptr, but I think it might a little sketchy/prone to future error if the caller changes the type from shared_ptr to something to unique or raw or something else in the future. If this is indeed correct and don't have a clean alternative, I think we should document the constraints with a comment or two here.

It is permitted to call shared_from_this only on a previously shared object, i.e. on an object managed by std::shared_ptr. Otherwise the behavior is undefined (until C++17)std::bad_weak_ptr is thrown (by the shared_ptr constructor from a default-constructed weak_this) (since C++17).
enable_shared_from_this provides the safe alternative to an expression like std::shared_ptr(this), which is likely to result in this being destructed more than once by multiple owners that are unaware of each other (see example below).

ref:

@Tabrizian also had an alternative that involved dynamically allocating a shared pointer which would be passed instead of the raw infer_payload. That method also maintains the lifetime of the object. In that case we would have a pointer to a shared pointer pointing to a infer_payload. For me that was more confusing - but open to reconsider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to use the enable_shared_from_this since it is cleaner. Added a comment to warn the API user about this caveat.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @krishung5 for updating the description. Looks accurate to me!

@krishung5 krishung5 changed the title Add mutex for InferPayload to make sure it's thread-safe during callback Fix the lifetime of InferPayload May 16, 2023
Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the future we think we can redesign to simplify the lifetime and callback management better.

@Tabrizian Tabrizian merged commit 6fa88ce into main May 17, 2023
2 checks passed
@Tabrizian Tabrizian deleted the krish-fix-bls-segfault branch May 17, 2023 20:54
Tabrizian added a commit that referenced this pull request May 17, 2023
* Add mutex for InferPayload to make sure it's thread-safe during callback

* Remove reset for the promise

* Address comment

* Remove destructor

* Fix lifetime of infer payload

* Make sure the mutex is unlocked before promise.set_value

* Revert "Make sure the mutex is unlocked before promise.set_value"

This reverts commit 2eb5c32.

* fix leak

* Serialize all the responses in decoupled BLS

* use enable_shared_from_this

* Add a warning about using "GetPtr"

* Remove the callback from mutex lock

---------

Co-authored-by: Iman Tabrizian <itabrizian@nvidia.com>
Tabrizian added a commit that referenced this pull request May 17, 2023
* Add mutex for InferPayload to make sure it's thread-safe during callback

* Remove reset for the promise

* Address comment

* Remove destructor

* Fix lifetime of infer payload

* Make sure the mutex is unlocked before promise.set_value

* Revert "Make sure the mutex is unlocked before promise.set_value"

This reverts commit 2eb5c32.

* fix leak

* Serialize all the responses in decoupled BLS

* use enable_shared_from_this

* Add a warning about using "GetPtr"

* Remove the callback from mutex lock

---------

Co-authored-by: Kris Hung <krish@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants