Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handling grpc cancellation edge-case:: Cancelling at step START #7325

Merged
merged 9 commits into from
Jun 6, 2024

Conversation

oandreeva-nv
Copy link
Contributor

What does the PR do?

This PR fixes an issue with a gRPC ModelInferHandler stopping accepting requests after a number of cancellation received.

Main root-cause and pre-conditions before hand:
At some point cancellation notification is received at step START. When this happens, we never skip this block

Since, we never went skip it, we don't create new state for future incoming requests, i.e. call StartNewRequest

Thus, completion queue becomes exhausted at some point. gRPC requests come, but there is nothing that accepts it on Triton's side.

Introduced changes make sure we create new request handler in those situations.
Added test logic:

I start server and send large amount of inference requests and cancel them right away. Pre fix, the clear identification that there are no ModelInferHandler's for any in-coming request is the server stops logging "New request handler for ModelInfer", i.e
grep -c "New request handler for ModelInfer" doesn't change. In all pre-fix scenarious it happens after
"Cancellation notification received for ModelInferHandler, rpc_ok=1, context 0, [0-9]* step START" was logged 4 times, 2 times for 1 request. Since we start 2 ModelInferHandler's threads initially, that make sense, as no new Infer handlers were created to handle incoming requests and Triton just keeps processing what it already has.

After fix, StartNewRequest is called properly and "New request handler for ModelInfer" keeps growing, as well as "Cancellation notification received for ModelInferHandler, rpc_ok=1, context 0, [0-9]* step START".

Checklist`

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:
    pre-fix: 15596577
    post-fix: 15596692

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

@oandreeva-nv oandreeva-nv added module: server Issues related to the server core and frontends PR: fix A bug fix labels Jun 5, 2024
@oandreeva-nv oandreeva-nv requested review from kthui, tanmayv25 and nnshah1 and removed request for kthui June 5, 2024 18:49
@@ -694,6 +694,16 @@ ModelInferHandler::Process(InferHandler::State* state, bool rpc_ok)
// Handle notification for cancellation which can be raised
// asynchronously if detected on the network.
if (state->IsGrpcContextCancelled()) {
if (rpc_ok && (state->step_ == Steps::START) &&
(state->context_->step_ != Steps::CANCELLED)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what does the second clause here imply?

!= Steps::Cancelled ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid calling StartNewRequest twice, at first we fall into HandleCancellation and go through this block, which returns true for resume, so we will go into if (state->IsGrpcContextCancelled()) loop for the second time but this time state->context_->step_ is CANCELLED

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late to the game, but what is the reasoning of not moving the original "StartNewRequest() if at START" to before handling the cancellation? Although I think other code needs to be moved around as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% aware of all underlying processes, meaning state->step_ and state->context_->step_ combinations. This change helps to address the bug with known symptoms. Refactoring if the Process logic needs proper time and testing IMHO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kthui thoughts? If feasible, this can be done as follow-up and by someone else. Want to make sure if there is room for improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think there is definitely room for improvement/refactoring, i.e. I think the if (shutdown) { ... } could also be moved into the if (state->step_ == Steps::START) { ... } else ... block, so all procedures for Steps::START would be inside the if (state->step_ == Steps::START) { ... } block, but it can be done as a follow-up later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jira ticket: DLIS-6831

nnshah1
nnshah1 previously approved these changes Jun 5, 2024
Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - had question on the condition

@tanmayv25
Copy link
Contributor

I think we would need similar fix for ModelStreamInferHandler.

@oandreeva-nv oandreeva-nv merged commit 42742a3 into main Jun 6, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: server Issues related to the server core and frontends PR: fix A bug fix
Development

Successfully merging this pull request may close these issues.

None yet

5 participants