Add gpu memory wait before test_async_tp #28893
d654505 to fae3d94
fae3d94 to 716a483
Seems like this works. Can you add it to the E2E tests as well? Or, maybe better, add a test util file that just calls this function, so that we can insert it into the test pipeline without polluting the tests?
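For illustration, a minimal sketch of what such a standalone wait utility could look like. This is not the code from this PR: the name `wait_for_free_memory`, its parameters, and the injected `get_used_bytes` callback are all hypothetical, and the memory query is passed in so the polling logic can run without a GPU.

```python
import time
from typing import Callable, Sequence


def wait_for_free_memory(
    device_ids: Sequence[int],
    get_used_bytes: Callable[[int], int],  # hypothetical per-device memory query
    threshold_bytes: int,
    timeout_s: float = 120.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every device uses at most threshold_bytes.

    Returns True once all devices are below the threshold, or False if
    the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Devices that are still holding more memory than allowed.
        busy = [d for d in device_ids if get_used_bytes(d) > threshold_bytes]
        if not busy:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

In a real pipeline step, `get_used_bytes` would wrap whatever memory query the test suite already uses, and the script would exit non-zero when the function returns False.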
9770b6d to f7e0570
f7e0570 to f1a4aa5
ProExpertProg left a comment
I like the env var approach! I think the fixture should take the CUDA_VISIBLE_DEVICES variable into account: these tests should only be running on 2xH200 at a time, and the CI failure implies that the fixture is waiting on all 8 GPUs on the node.
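One way to restrict the wait to the visible devices is sketched below. The helper name `visible_device_ids` and its parameters are hypothetical, and it assumes CUDA_VISIBLE_DEVICES holds comma-separated integer indices; UUID-style entries are rejected here rather than resolved.

```python
import os
from typing import List, Mapping, Optional


def visible_device_ids(
    total_devices: int,
    env: Optional[Mapping[str, str]] = None,
) -> List[int]:
    """Return the physical GPU indices a wait fixture should check.

    If CUDA_VISIBLE_DEVICES is unset, every device on the node is returned.
    An empty value means no devices are visible.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return list(range(total_devices))

    ids: List[int] = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        if not token.isdigit():
            # Assumption: only integer indices are supported in this sketch.
            raise ValueError(f"unsupported CUDA_VISIBLE_DEVICES entry: {token!r}")
        ids.append(int(token))
    return ids
```

With `CUDA_VISIBLE_DEVICES=2,3` on an 8-GPU node, a fixture built on this helper would wait on devices 2 and 3 only, instead of all 8.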
This pull request has merge conflicts that must be resolved before it can be
Signed-off-by: angelayi <yiangela7@gmail.com>
f1a4aa5 to f377621
Signed-off-by: angelayi <yiangela7@gmail.com>
2b4f712 to 0ae9032
Okay, this looks good now! We finally fixed the failure with the E2E model tests. There's now a DBO failure, which I assume snuck in from Lucas's refactor PR because this test failed before the DBO tests could run.
Attempting to fix test failure