[proto] Enable GPU tests on prototype #6665
Conversation
Also there might be an opportunity to simplify this using the generic Linux jobs (with GPU support) that @seemethere has built. You can find the documentation here: https://github.com/pytorch/test-infra/wiki/Writing-generic-linux-jobs
Ohh, this is nice and a much better option. TIL
@huydhn @osalpekar thanks for the review, I'll try to add that.
Failures in the CUDA vs CPU tests are expected; I'll xfail them in the PR that enables the CUDA tests, and we'll fix the inconsistency in a follow-up PR.
Unless I have missed something, these are all just closeness-related. Thus, we only have to adjust the tolerances in our test suite. Or, to put it differently: there is likely no bug in our implementation. I'm OK with doing that in a follow-up.
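For illustration, here is a minimal sketch of what loosening the tolerances could look like; the rtol/atol values, tensor shape, and the standalone closeness_kwargs dict are placeholders, not what the suite should actually adopt.

import torch
from torch.testing import assert_close

# Placeholder tolerances, looser than the float32 defaults used by assert_close.
closeness_kwargs = {"rtol": 1e-4, "atol": 1e-4}

output_cpu = torch.rand(3, 32, 32)
# Fall back to a CPU copy so the sketch also runs on machines without CUDA.
output_cuda = output_cpu.cuda() if torch.cuda.is_available() else output_cpu.clone()

# check_device=False lets us compare the CUDA output against its CPU reference.
assert_close(output_cuda, output_cpu, check_device=False, **closeness_kwargs)

In the actual suite, the adjusted values would presumably live in each kernel's closeness_kwargs (already forwarded via **info.closeness_kwargs in the snippet below) rather than in a shared constant.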
Given that there are very few differences between the GPU workflow and the CPU one, can we maybe merge the files?
try:
    assert_close(output_cuda, output_cpu, check_device=False, **info.closeness_kwargs)
except AssertionError:
    pytest.xfail("CUDA vs CPU tolerance issue to be fixed")
This effectively disables this test. Either we should add proper xfails to the KernelInfo's, or simply comment out this test with a FIXME note. Otherwise we are wasting resources.
This is a temporary three-line fix. If I understand correctly, what you suggest is to mark the specific tests that can vary on GPU, etc. Taking into account that you wanted to fix the problem, we can keep things as they are.
I agree, fixing the individual tests is overkill here. But as is, this test is running with no information gain. assert_close will either pass or raise an AssertionError. Since we catch that and turn it into an xfail, there is no way this test can fail at all. Thus, we are better off just disabling the test completely, e.g. by commenting it out as I suggested, to get the same information but without wasted CI resources.
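For reference, a minimal pytest sketch of the collection-time alternative; the kernel names are hypothetical, and the real suite would attach the mark through its KernelInfo machinery rather than a hand-written parametrize list. A strict xfail keeps reporting the known-bad case as XFAIL while the tolerance issue exists, and turns into a loud XPASS failure once it is fixed, so the mark cannot silently outlive the bug it documents.

import pytest
import torch
from torch.testing import assert_close

# Hypothetical set of kernels with a known CUDA vs CPU tolerance issue.
KERNELS_WITH_CUDA_TOLERANCE_ISSUES = {"hypothetical_flaky_kernel"}

@pytest.mark.parametrize(
    "kernel_name",
    [
        "well_behaved_kernel",
        pytest.param(
            "hypothetical_flaky_kernel",
            marks=pytest.mark.xfail(
                reason="CUDA vs CPU tolerance issue to be fixed", strict=True
            ),
        ),
    ],
)
def test_cuda_vs_cpu(kernel_name):
    output_cpu = torch.full((3, 8, 8), 0.5)
    # Simulate the small device mismatch the flagged kernels currently exhibit.
    offset = 1e-3 if kernel_name in KERNELS_WITH_CUDA_TOLERANCE_ISSUES else 0.0
    output_cuda = output_cpu + offset
    # No try/except here: a mismatch surfaces as XFAIL for the marked kernel and as
    # a plain failure for unmarked ones, so the test carries information either way.
    assert_close(output_cuda, output_cpu, check_device=False)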
I see your point, but I think it is OK to keep it as is here, since it still shows that the majority of ops are passing on CUDA.
As for wasted resources, running the cuda_vs_cpu tests takes around 7 seconds.
IMO, it is complicated to refactor the configuration for both the self-hosted and GHA runners. This can be done by someone else with better GHA knowledge.
Agreed. I can take that up in a follow-up.
Stamping to unblock.
Hey @vfdev-5! You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py
@osalpekar I merged this PR to enable GPU tests for the prototype module, and now on another PR I have an issue with starting the container with:
Do you have any ideas why this could happen? Thanks
Summary:
* [proto][WIP] Enable GPU tests on prototype
* Update prototype-tests.yml
* tests on gpu as separate file
* Removed matrix setup
* Update prototype-tests-gpu.yml
* Update prototype-tests-gpu.yml
* Added --gpus=all flag
* Added xfail for cuda vs cpu tolerance issue
* Update prototype-tests-gpu.yml

Reviewed By: YosuaMichael
Differential Revision: D40588168
fbshipit-source-id: 884a4045b343f93517b27cc3303c5eb6131a8895
cc @seemethere @bjuncek @pmeier