-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the lifetime of InferPayload #241
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why a mutex is required. Can you explain where concurrent access to any of the class objects happen? I can see that we may need the mutex for decoupled response but I don't see why it is required in non-decoupled response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why a mutex is required. Can you explain where concurrent access to any of the class objects happen? I can see that we may need the mutex for decoupled response but I don't see why it is required in non-decoupled response.
6e3bf07
to
aadc477
Compare
f0cf5d1
to
8da1d50
Compare
fd261c3
to
505833f
Compare
auto linfer_payload = reinterpret_cast<InferPayload*>(userp); | ||
std::shared_ptr<InferPayload> infer_payload = linfer_payload->GetPtr(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it valid to call raw_ptr->GetPtr()
here?
I think it is because under the hood we know that userp
was infer_payload.get()
, where infer_payload
was a shared_ptr
, but I think it might a little sketchy/prone to future error if the caller changes the type from shared_ptr to something to unique or raw or something else in the future. If this is indeed correct and don't have a clean alternative, I think we should document the constraints with a comment or two here.
It is permitted to call shared_from_this only on a previously shared object, i.e. on an object managed by std::shared_ptr. Otherwise the behavior is undefined (until C++17)std::bad_weak_ptr is thrown (by the shared_ptr constructor from a default-constructed weak_this) (since C++17).
enable_shared_from_this provides the safe alternative to an expression like std::shared_ptr(this), which is likely to result in this being destructed more than once by multiple owners that are unaware of each other (see example below).
ref:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also can you summarize in words the current issue and what this PR is solving?
The issue appears to be some lifetime issues passing the payload back and forth, and it looks like we're trying to solve this by passing shared_ptr
through (C++ <-> C API <-> C++
) to automatically manage the lifetime on both sides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR description based on my understanding. @Tabrizian Please make any changes if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also can you summarize in words the current issue and what this PR is solving?
The issue appears to be some lifetime issues passing the payload back and forth, and it looks like we're trying to solve this by passing
shared_ptr
through (C++ <-> C API <-> C++
) to automatically manage the lifetime on both sides.
Given the current design of the infer_payload being created as a shared pointer by python_be and then passed to request_executor, I believe this is cleanest approach. The basic problem was that the lifetime was managed by the shared pointer but the callback wasn't using a shared pointer and thus never increased the reference count.
That being said - I think if we were able to refactor to have the request_executor manage the creation and lifetime of infer_payload that may be cleaner (it doesn't seem at first glance that python_be needs to hang on to a reference - just the future). The creation and destruction ideally would be all handled internal to request_executor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it valid to call
raw_ptr->GetPtr()
here?I think it is because under the hood we know that
userp
wasinfer_payload.get()
, whereinfer_payload
was ashared_ptr
, but I think it might a little sketchy/prone to future error if the caller changes the type from shared_ptr to something to unique or raw or something else in the future. If this is indeed correct and don't have a clean alternative, I think we should document the constraints with a comment or two here.It is permitted to call shared_from_this only on a previously shared object, i.e. on an object managed by std::shared_ptr. Otherwise the behavior is undefined (until C++17)std::bad_weak_ptr is thrown (by the shared_ptr constructor from a default-constructed weak_this) (since C++17).
enable_shared_from_this provides the safe alternative to an expression like std::shared_ptr(this), which is likely to result in this being destructed more than once by multiple owners that are unaware of each other (see example below).ref:
@Tabrizian also had an alternative that involved dynamically allocating a shared pointer which would be passed instead of the raw infer_payload. That method also maintains the lifetime of the object. In that case we would have a pointer to a shared pointer pointing to a infer_payload. For me that was more confusing - but open to reconsider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to use the enable_shared_from_this
since it is cleaner. Added a comment to warn the API user about this caveat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @krishung5 for updating the description. Looks accurate to me!
bddfe4b
to
2b8898c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the future we think we can redesign to simplify the lifetime and callback management better.
* Add mutex for InferPayload to make sure it's thread-safe during callback * Remove reset for the promise * Address comment * Remove destructor * Fix lifetime of infer payload * Make sure the mutex is unlocked before promise.set_value * Revert "Make sure the mutex is unlocked before promise.set_value" This reverts commit 2eb5c32. * fix leak * Serialize all the responses in decoupled BLS * use enable_shared_from_this * Add a warning about using "GetPtr" * Remove the callback from mutex lock --------- Co-authored-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add mutex for InferPayload to make sure it's thread-safe during callback * Remove reset for the promise * Address comment * Remove destructor * Fix lifetime of infer payload * Make sure the mutex is unlocked before promise.set_value * Revert "Make sure the mutex is unlocked before promise.set_value" This reverts commit 2eb5c32. * fix leak * Serialize all the responses in decoupled BLS * use enable_shared_from_this * Add a warning about using "GetPtr" * Remove the callback from mutex lock --------- Co-authored-by: Kris Hung <krish@nvidia.com>
For the BLS non-decoupled case, there is a race condtion:
In Thread 1 ExecuteBLSRequest:
The line infer_response = response_future.get() is wait on Thread 2 to complete the future's promise.
In Thread 2 InferResponseComplete:
When the line p->SetValueForPrevPromise(std::move(infer_response)); is reached, the promise is completed inside the function at p->SetValueForPrevPromise(std::move(infer_response));. At this moment, Thread 1 is unblocked, and
infer_payload
will be destroyed at the end of the block scope in Thread 1. This makes it possible that the following objects in Thread 2 might be used afterinfer_payload
being destroyed, which causes segfaults.This PR fixed the lifetime of
InferPayload
to avoid the race condition.