Add health check, make async Engine more robust #3015

Merged: 8 commits into vllm-project:main on Mar 4, 2024

Conversation

@Yard1 (Collaborator) commented on Feb 23, 2024:

For production use cases, we want to be able to detect Engine failures, especially ones that can happen silently (e.g. due to NCCL timeouts). This PR adds a health check method (currently only checking the health of Ray workers) and makes the async engine more robust by adding a timeout for each iteration, as well as better error reporting.
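For orientation, here is a minimal sketch of the two mechanisms described above: a per-iteration timeout around the engine step, and a health check that surfaces a dead background loop. All names (`AsyncEngineSketch`, `ENGINE_ITERATION_TIMEOUT_S`, `step_async`, `check_health`) are illustrative assumptions, not the actual vLLM implementation.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Assumed per-iteration timeout; the real constant name and value may differ.
ENGINE_ITERATION_TIMEOUT_S = 60.0


class AsyncEngineDeadError(RuntimeError):
    """Raised when the background engine loop is no longer running."""


class AsyncEngineSketch:
    """Toy model of the robustness pattern, not the vLLM AsyncLLMEngine API."""

    def __init__(self, step_async: Callable[[], Awaitable[None]]):
        self._step_async = step_async
        self._background_task: Optional[asyncio.Task] = None
        self._errored_with: Optional[BaseException] = None

    def start_background_loop(self) -> None:
        # Must be called from inside a running event loop.
        self._background_task = asyncio.create_task(self._run_loop())

    async def _run_loop(self) -> None:
        try:
            while True:
                # Bound each iteration so silent hangs (e.g. NCCL timeouts)
                # turn into visible errors instead of a stuck loop.
                await asyncio.wait_for(self._step_async(),
                                       timeout=ENGINE_ITERATION_TIMEOUT_S)
        except BaseException as exc:
            # Remember the failure so later calls report it instead of hanging.
            self._errored_with = exc
            raise

    async def check_health(self) -> None:
        """Raise if the engine has died; a real check would also ping the Ray workers."""
        if self._errored_with is not None:
            raise AsyncEngineDeadError("Background loop errored.") from self._errored_with
        if self._background_task is None or self._background_task.done():
            raise AsyncEngineDeadError("Background loop is not running.")
```

A server could expose `check_health()` through a liveness endpoint so an orchestrator can restart a wedged engine instead of waiting indefinitely.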

@zhuohan123 self-assigned this on Mar 2, 2024
@zhuohan123 (Collaborator) left a comment:

Thanks for the contribution! In general LGTM. Left some small questions.

(Outdated review threads on vllm/engine/llm_engine.py and vllm/engine/async_llm_engine.py, both resolved.)
Comment on lines 42 to 44:

    finally:
        if exception:
            error_callback(exception)

Collaborator: We raise errors in both the try branch and the except branch. Then what does the finally here do?

Yard1 (author): We want to run the error callback even after we re-raise an exception in the except block.

@njhill (Collaborator): I think you could just do this in the except block though, before re-raising (things will still run in the same order).
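For illustration, a minimal, self-contained sketch of the two equivalent orderings discussed in this thread (`step` and `error_callback` are placeholder names):

```python
# Pattern from the snippet above: remember the exception, invoke the callback in `finally`.
def run_with_finally(step, error_callback):
    exception = None
    try:
        step()
    except Exception as exc:
        exception = exc
        raise
    finally:
        # Runs after the `raise` statement but before the exception leaves this function.
        if exception:
            error_callback(exception)


# Reviewer's alternative: call the callback in `except` before re-raising.
# The callback still fires before the exception propagates, so the observable order is the same.
def run_with_except(step, error_callback):
    try:
        step()
    except Exception as exc:
        error_callback(exc)
        raise
```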

Comment on lines 174 to 178:

    async def wait_for_new_requests(self, clear: bool):
        if not self.has_new_requests():
            await self.new_requests_event.wait()
        if clear:
            self.new_requests_event.clear()

Collaborator: Why don't we always clear this flag?

Suggested change (drop the clear parameter and always clear the event):

    async def wait_for_new_requests(self):
        if not self.has_new_requests():
            await self.new_requests_event.wait()
        self.new_requests_event.clear()

Collaborator: Also, what's the reason behind this change? Why do we need to move the clear call from get_new_and_finished_requests to here?

Yard1 (author): Yes, we can always clear it. The reason for the change is to ensure the event is cleared as soon as we have new requests.
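As a standalone illustration of the pattern being discussed, here is a toy request tracker built around `asyncio.Event`. Only `has_new_requests`, `new_requests_event`, and `wait_for_new_requests` appear in the diff above; the class name and the `add_request`/`get_new_requests` helpers are assumptions.

```python
import asyncio
from typing import Any, List


class RequestTrackerSketch:
    """Toy model of the new-requests event handling, not the vLLM class."""

    def __init__(self) -> None:
        self._new_requests: List[Any] = []
        self.new_requests_event = asyncio.Event()

    def add_request(self, request: Any) -> None:
        self._new_requests.append(request)
        # Wake up the engine loop blocked in wait_for_new_requests().
        self.new_requests_event.set()

    def has_new_requests(self) -> bool:
        return bool(self._new_requests)

    async def wait_for_new_requests(self) -> None:
        if not self.has_new_requests():
            await self.new_requests_event.wait()
        # Clearing here (right after new requests arrive) means the next call
        # blocks until another request is added.
        self.new_requests_event.clear()

    def get_new_requests(self) -> List[Any]:
        new, self._new_requests = self._new_requests, []
        return new
```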

Yard1 and others added 4 commits on March 4, 2024 at 11:00 (co-authored by Zhuohan Li <zhuohan123@gmail.com>).
@njhill (Collaborator) left a comment:

Thanks @Yard1, this looks great.

(Outdated review thread on vllm/engine/async_llm_engine.py, resolved.)

Comment on lines +175 to +177:

    if not self.has_new_requests():
        await self.new_requests_event.wait()
    self.new_requests_event.clear()

@njhill (Collaborator): Suggestion to only clear before waiting.

Suggested change:

    if not self.has_new_requests():
        self.new_requests_event.clear()
        if not self.has_new_requests():
            await self.new_requests_event.wait()

Yard1 (author): Hmm, can you explain why we should do it like that?

@njhill (Collaborator): Just to avoid flip-flopping the event; it only needs to be cleared when you're actually about to wait on it. But I guess with Python/asyncio it doesn't matter anyway.

Yard1 (author): Yeah, I think it should be fine.

@Yard1 enabled auto-merge (squash) on March 4, 2024 at 21:44.
@Yard1 merged commit ff578ca into vllm-project:main on Mar 4, 2024; 22 checks passed.
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request on Mar 26, 2024 (co-authored by Zhuohan Li <zhuohan123@gmail.com>).