Fix the lifetime of InferPayload #241

krishung5 · 2023-05-12T04:43:20Z

For the BLS non-decoupled case, there is a race condtion:

In Thread 1 ExecuteBLSRequest:
The line infer_response = response_future.get() is wait on Thread 2 to complete the future's promise.

In Thread 2 InferResponseComplete:
When the line p->SetValueForPrevPromise(std::move(infer_response)); is reached, the promise is completed inside the function at p->SetValueForPrevPromise(std::move(infer_response));. At this moment, Thread 1 is unblocked, and infer_payload will be destroyed at the end of the block scope in Thread 1. This makes it possible that the following objects in Thread 2 might be used after infer_payload being destroyed, which causes segfaults.

Thread 1 ExecuteBLSRequest:
{
std::shared_ptr<InferPayload> infer_payload = ...;
auto response_future = request_executor_->Infer(infer_request, infer_payload);
infer_response = response_future.get(); // Waits on Thread 2 to complete the future's promise
}

Thread 2 InferResponseComplete:
auto p = reinterpret_cast<InferPayload*>(userp); // here p is same infer_payload as in Thread 1
p->SetValueForPrevPromise(std::move(infer_response)):
   prev_promise_->set_value(std::move(infer_response)); // Unblocks Thread 1
   prev_promise_.reset(); // Use after free race with Thread 1 destroying infer_payload
   is_promise_set_ = true; // Use after free race with Thread 1 destroying infer_payload

Order of execution for lifetime race:
Thread 1 blocks on promise future
Thread 2 completes promise and unblocks Thread 1
Thread 1 destroys infer_payload at the end of the block scope
Thread 2 accesses infer_payload after it was destroyed

This PR fixed the lifetime of InferPayload to avoid the race condition.

Tabrizian

I don't understand why a mutex is required. Can you explain where concurrent access to any of the class objects happen? I can see that we may need the mutex for decoupled response but I don't see why it is required in non-decoupled response.

Tabrizian

I don't understand why a mutex is required. Can you explain where concurrent access to any of the class objects happen? I can see that we may need the mutex for decoupled response but I don't see why it is required in non-decoupled response.

src/infer_payload.cc

This reverts commit 2eb5c32.

rmccorm4 · 2023-05-16T22:15:00Z

src/request_executor.cc

+  auto linfer_payload = reinterpret_cast<InferPayload*>(userp);
+  std::shared_ptr<InferPayload> infer_payload = linfer_payload->GetPtr();


Is it valid to call raw_ptr->GetPtr() here?

I think it is because under the hood we know that userp was infer_payload.get(), where infer_payload was a shared_ptr, but I think it might a little sketchy/prone to future error if the caller changes the type from shared_ptr to something to unique or raw or something else in the future. If this is indeed correct and don't have a clean alternative, I think we should document the constraints with a comment or two here.

It is permitted to call shared_from_this only on a previously shared object, i.e. on an object managed by std::shared_ptr. Otherwise the behavior is undefined (until C++17)std::bad_weak_ptr is thrown (by the shared_ptr constructor from a default-constructed weak_this) (since C++17).

enable_shared_from_this provides the safe alternative to an expression like std::shared_ptr(this), which is likely to result in this being destructed more than once by multiple owners that are unaware of each other (see example below).

ref:

https://en.cppreference.com/w/cpp/memory/enable_shared_from_this

https://stackoverflow.com/a/5548314

Also can you summarize in words the current issue and what this PR is solving?

The issue appears to be some lifetime issues passing the payload back and forth, and it looks like we're trying to solve this by passing shared_ptr through (C++ <-> C API <-> C++) to automatically manage the lifetime on both sides.

Updated the PR description based on my understanding. @Tabrizian Please make any changes if needed.

Also can you summarize in words the current issue and what this PR is solving?

The issue appears to be some lifetime issues passing the payload back and forth, and it looks like we're trying to solve this by passing shared_ptr through (C++ <-> C API <-> C++) to automatically manage the lifetime on both sides.

Given the current design of the infer_payload being created as a shared pointer by python_be and then passed to request_executor, I believe this is cleanest approach. The basic problem was that the lifetime was managed by the shared pointer but the callback wasn't using a shared pointer and thus never increased the reference count.

That being said - I think if we were able to refactor to have the request_executor manage the creation and lifetime of infer_payload that may be cleaner (it doesn't seem at first glance that python_be needs to hang on to a reference - just the future). The creation and destruction ideally would be all handled internal to request_executor.

Is it valid to call raw_ptr->GetPtr() here?

I think it is because under the hood we know that userp was infer_payload.get(), where infer_payload was a shared_ptr, but I think it might a little sketchy/prone to future error if the caller changes the type from shared_ptr to something to unique or raw or something else in the future. If this is indeed correct and don't have a clean alternative, I think we should document the constraints with a comment or two here.

It is permitted to call shared_from_this only on a previously shared object, i.e. on an object managed by std::shared_ptr. Otherwise the behavior is undefined (until C++17)std::bad_weak_ptr is thrown (by the shared_ptr constructor from a default-constructed weak_this) (since C++17).
enable_shared_from_this provides the safe alternative to an expression like std::shared_ptr(this), which is likely to result in this being destructed more than once by multiple owners that are unaware of each other (see example below).

ref:

https://en.cppreference.com/w/cpp/memory/enable_shared_from_this

https://stackoverflow.com/a/5548314

@Tabrizian also had an alternative that involved dynamically allocating a shared pointer which would be passed instead of the raw infer_payload. That method also maintains the lifetime of the object. In that case we would have a pointer to a shared pointer pointing to a infer_payload. For me that was more confusing - but open to reconsider.

I'm inclined to use the enable_shared_from_this since it is cleaner. Added a comment to warn the API user about this caveat.

Thanks @krishung5 for updating the description. Looks accurate to me!

nnshah1

For the future we think we can redesign to simplify the lifetime and callback management better.

* Add mutex for InferPayload to make sure it's thread-safe during callback * Remove reset for the promise * Address comment * Remove destructor * Fix lifetime of infer payload * Make sure the mutex is unlocked before promise.set_value * Revert "Make sure the mutex is unlocked before promise.set_value" This reverts commit 2eb5c32. * fix leak * Serialize all the responses in decoupled BLS * use enable_shared_from_this * Add a warning about using "GetPtr" * Remove the callback from mutex lock --------- Co-authored-by: Iman Tabrizian <itabrizian@nvidia.com>

* Add mutex for InferPayload to make sure it's thread-safe during callback * Remove reset for the promise * Address comment * Remove destructor * Fix lifetime of infer payload * Make sure the mutex is unlocked before promise.set_value * Revert "Make sure the mutex is unlocked before promise.set_value" This reverts commit 2eb5c32. * fix leak * Serialize all the responses in decoupled BLS * use enable_shared_from_this * Add a warning about using "GetPtr" * Remove the callback from mutex lock --------- Co-authored-by: Kris Hung <krish@nvidia.com>

Add mutex for InferPayload to make sure it's thread-safe during callback

f2acd4e

krishung5 requested review from Tabrizian and nnshah1 May 12, 2023 04:43

Tabrizian reviewed May 12, 2023

View reviewed changes

nnshah1 reviewed May 12, 2023

View reviewed changes