
[CI/Build] Adding functionality to reset the node's GPUs before processing. #4213

Merged: 10 commits, merged Apr 25, 2024


Alexei-V-Ivanov-AMD
Contributor

This change helps avoid the "HIP out-of-memory" errors seen while running the tests.

In essence, the test script requests a complete reset of all GPUs on the node it runs on and waits for that reset to finish.

This fully decouples separate test runs: if a previous job terminated abnormally, it can no longer affect the node's performance during the current one.

The reset is not immediate, so the script must repeatedly check the GPUs' state and wait until they report a "clean" condition before proceeding.
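A minimal Python sketch of the request-then-poll pattern described above. The state file, the `"reset"` and `"clean"` strings, and the helper names `request_gpu_reset` and `wait_for_clean_gpus` are all hypothetical and are not taken from the PR's actual script; the sketch only illustrates waiting, with a timeout, until the node reports its GPUs as clean.

```python
import time
from pathlib import Path


def request_gpu_reset(state_file: Path) -> None:
    """Ask a (hypothetical) node-side agent to reset all GPUs.

    Assumption: the agent watches state_file and performs the reset
    when it sees "reset" written to it.
    """
    state_file.write_text("reset\n")


def wait_for_clean_gpus(state_file: Path, poll_interval: float = 3.0,
                        timeout: float = 600.0) -> bool:
    """Poll state_file until it reads "clean" or the timeout expires.

    Returns True once the (hypothetical) agent reports a clean state,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Exact match, so a value such as "unclean" is not accepted.
            if state_file.read_text().strip() == "clean":
                return True
        except FileNotFoundError:
            pass  # The file may be briefly absent while the reset runs.
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A CI wrapper built this way could call `request_gpu_reset` first and fail the job if `wait_for_clean_gpus` returns False, rather than starting tests on GPUs that may still hold memory from an earlier run. The timeout keeps a stuck reset from hanging the pipeline indefinitely.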

simon-mo merged commit 7ee82be into vllm-project:main on Apr 25, 2024
18 of 19 checks passed
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024