Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[proto] Enable GPU tests on prototype #6665

Merged
merged 14 commits into from
Oct 21, 2022
Merged

[proto] Enable GPU tests on prototype #6665

merged 14 commits into from
Oct 21, 2022

Conversation

vfdev-5
Copy link
Collaborator

@vfdev-5 vfdev-5 commented Sep 29, 2022

@vfdev-5 vfdev-5 marked this pull request as draft September 29, 2022 08:08
@osalpekar
Copy link
Member

Also there might be an opportunity to simplify this using the generic Linux jobs (with GPU support) that @seemethere has built. You can find the documentation here: https://github.com/pytorch/test-infra/wiki/Writing-generic-linux-jobs

@huydhn
Copy link
Contributor

huydhn commented Oct 17, 2022

Also there might be an opportunity to simplify this using the generic Linux jobs (with GPU support) that @seemethere has built. You can find the documentation here: https://github.com/pytorch/test-infra/wiki/Writing-generic-linux-jobs

Ohh, this is nice and a much better option. TIL

@vfdev-5
Copy link
Collaborator Author

vfdev-5 commented Oct 17, 2022

@huydhn @osalpekar thanks for the review, I'll try to add --gpus=all flag to see if this enables the CI. Using generic linux jobs could be potentially fine, the only limitation I see from the example is that we should now create a single script vs multiple steps...

@vfdev-5 vfdev-5 changed the title [proto][WIP] Enable GPU tests on prototype [proto] Enable GPU tests on prototype Oct 18, 2022
@vfdev-5 vfdev-5 marked this pull request as ready for review October 18, 2022 07:54
@vfdev-5
Copy link
Collaborator Author

vfdev-5 commented Oct 18, 2022

It is expected to have failures in cuda vs cpu tests, I'll xfail them in the PR that enables cuda tests and we'll fix the inconsistency in a follow-up PR

@pmeier
Copy link
Collaborator

pmeier commented Oct 18, 2022

Unless I have missed something, there are all just closeness related. Thus, we only have to adjust the tolerances in our test suite. Or to put it differently: there is likely no bug in our implementation. I'm ok with doing that in a follow-up.

@vfdev-5 vfdev-5 requested a review from pmeier October 18, 2022 08:18
Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there are very few differences between the GPU workflow and the CPU one, can we maybe merge the files?

Comment on lines +174 to +177
try:
assert_close(output_cuda, output_cpu, check_device=False, **info.closeness_kwargs)
except AssertionError:
pytest.xfail("CUDA vs CPU tolerance issue to be fixed")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effectively disables this test. Either we should add proper xfails to the KernelInfo's or simply comment out this test with a FXIME note. Otherwise we are wasting resources.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary fix with 3 lines. What you suggest if I understand correctly is to mark specific tests which can vary on GPU etc. Taking into account that you wanted to fix the problem we can keep things like that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, fixing the individual tests is overkill here. But as is, this test is running with no information gain. assert_close will either pass or raise an AssertionError. Since we catch that and turn it into an xfail, there is no way this test can fail at all. Thus, we are better off just disabling the test completely, e.g. by commenting it out as I suggested, to get the same information but without wasted CI resources.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point but I think it is ok to keep like as here as it still shows that majority of ops are passing on cuda.
As for wasted resources, run cuda_vs_cpu tests takes around 7 seconds.

.github/workflows/prototype-tests-gpu.yml Outdated Show resolved Hide resolved
@vfdev-5
Copy link
Collaborator Author

vfdev-5 commented Oct 18, 2022

Given that there are very few differences between the GPU workflow and the CPU one, can we maybe merge the files?

IMO, this is complicated to refactor configuration for self-hosted and GHA runners. Can be done by someone else with better GHA knowledge.

@pmeier
Copy link
Collaborator

pmeier commented Oct 18, 2022

IMO, this is complicated to refactor configuration for self-hosted and GHA runners. Can be done by someone else with better GHA knowledge.

Agreed. I can take that up in a follow-up.

@vfdev-5 vfdev-5 requested a review from pmeier October 20, 2022 10:46
Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamping to unblock.

@vfdev-5 vfdev-5 merged commit d0de55d into main Oct 21, 2022
@vfdev-5 vfdev-5 deleted the proto-enable-gpu-tests branch October 21, 2022 11:28
@github-actions
Copy link

Hey @vfdev-5!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@vfdev-5
Copy link
Collaborator Author

vfdev-5 commented Oct 21, 2022

@osalpekar I merged this PR to enable GPU tests for prototype module and now on another PR I have an issue with starting the container with --gpus=all option:

Do you have any ideas why this could happen ? Thanks

vfdev-5 added a commit that referenced this pull request Oct 21, 2022
vfdev-5 added a commit that referenced this pull request Oct 21, 2022
facebook-github-bot pushed a commit that referenced this pull request Oct 21, 2022
Summary:
* [proto][WIP] Enable GPU tests on prototype

* Update prototype-tests.yml

* tests on gpu as separate file

* Removed matrix setup

* Update prototype-tests-gpu.yml

* Update prototype-tests-gpu.yml

* Added --gpus=all flag

* Added xfail for cuda vs cpu tolerance issue

* Update prototype-tests-gpu.yml

Reviewed By: YosuaMichael

Differential Revision: D40588168

fbshipit-source-id: 884a4045b343f93517b27cc3303c5eb6131a8895
facebook-github-bot pushed a commit that referenced this pull request Oct 21, 2022
Summary: This reverts commit d0de55d.

Reviewed By: YosuaMichael

Differential Revision: D40588158

fbshipit-source-id: 877a172d7dd807b0c90255bd14129a90768bcc76
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants