
Conversation

@pruthvistony (Collaborator) commented Jun 28, 2022

`torch.cuda.is_bf16_supported()` returns False on ROCm, which is incorrect, since BF16 is supported on all AMD GPU architectures: gfx906, gfx908, and gfx90a.

cc @jithunnair-amd
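
A minimal repro sketch of the reported behavior, assuming a ROCm build of PyTorch with a visible AMD GPU (the printed values describe the report, not guaranteed output):

```python
import torch

# On a ROCm build, torch.version.hip is set (e.g. "5.1.0") and
# torch.version.cuda is None.
print(torch.version.hip)

# Per this report, the check below incorrectly printed False on ROCm,
# even on BF16-capable hardware (gfx906, gfx908, gfx90a).
print(torch.cuda.is_bf16_supported())
```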

@pytorch-bot pytorch-bot bot added the module: rocm AMD GPU support for Pytorch label Jun 28, 2022
@facebook-github-bot (Contributor) commented Jun 28, 2022

✅ No Failures (0 Pending)

As of commit 390f75c (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.

@pruthvistony pruthvistony added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Jun 28, 2022
@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 28, 2022
@pruthvistony (Collaborator, Author) commented Jun 28, 2022

This is NOT good :( I don't know how I missed it. Thanks @jeffdaily.

@pruthvistony (Collaborator, Author) commented

This issue was reported by an internal user who was running an FP32 model, converting it to BF16, and encountered an error using the above API. I didn't find any UT actively using this API for the ROCm case; it is used in the TEST_CUDA scenario.
However, this API needs an update for the ROCm backend.
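
A hedged sketch of the workflow described above (the model and shapes are illustrative, not from the report): a BF16 cast gated on the support check, so the incorrect False on ROCm silently keeps the model in FP32.

```python
import torch

# Hypothetical FP32 model standing in for the internal user's model.
model = torch.nn.Linear(64, 64).cuda()

if torch.cuda.is_bf16_supported():
    # Never reached on ROCm before this fix, despite BF16-capable hardware.
    model = model.to(torch.bfloat16)
    x = torch.randn(8, 64, device="cuda", dtype=torch.bfloat16)
else:
    x = torch.randn(8, 64, device="cuda")  # unintended FP32 fallback

out = model(x)
```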

@jeffdaily jeffdaily changed the title from "Updated bf16 check for ROCm" to "[ROCm] torch.cuda.is_bf16_supported() returns True" Aug 1, 2022
@jeffdaily (Collaborator) left a comment

LGTM. CI failures are not related to this change. Rebase requested to aid merging.

@jeffdaily (Collaborator) commented

@pytorchbot rebase

@pytorch-bot bot commented Aug 1, 2022

You don't have permissions to rebase this PR, only the PR author and pytorch organization members may rebase this PR.

@pruthvistony (Collaborator, Author) commented

@pytorchbot rebase

@pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator) commented

Rebase failed due to Command `git -C /home/runner/work/pytorch/pytorch push -f https://github.com/ROCmSoftwarePlatform/pytorch.git pull/80410/head:rocm_bf16_check` returned non-zero exit code 128

remote: Permission to ROCmSoftwarePlatform/pytorch.git denied to pytorchmergebot.
fatal: unable to access 'https://github.com/ROCmSoftwarePlatform/pytorch.git/': The requested URL returned error: 403

Raised by https://github.com/pytorch/pytorch/actions/runs/2778258239

@pruthvistony (Collaborator, Author) commented

@malfet,
Can you please review and help merge this PR?

@malfet (Contributor) commented Aug 3, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a merge job. Check the current status here

@github-actions bot (Contributor) commented Aug 3, 2022

Hey @pruthvistony.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 4, 2022
Summary:
`torch.cuda.is_bf16_supported()` returns False on ROCm, which is incorrect, since BF16 is supported on all AMD GPU architectures: gfx906, gfx908, and gfx90a.

cc jithunnair-amd

Pull Request resolved: #80410
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/b57188760be857a9d4c49b5dfa2efd1f78c06af8

Reviewed By: kit1980

Differential Revision: D38394982

fbshipit-source-id: 036dbaa9eb1b3e62ca3dcaf0b61127dc4d981f32
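
For context, a sketch of the general shape of such a fix, not necessarily the exact merged patch: short-circuit to True when running under ROCm (where `torch.version.hip` is set), and keep the CUDA compute-capability check otherwise.

```python
import torch

def is_bf16_supported():
    # ROCm builds set torch.version.hip; BF16 is supported across
    # gfx906, gfx908, and gfx90a, so report True for the ROCm backend.
    if torch.version.hip is not None:
        return True
    # CUDA path (assumed here): BF16 needs CUDA 11+ and compute capability >= 8.0.
    cu_vers = torch.version.cuda
    if cu_vers is None:
        return False
    cuda_maj_ok = int(cu_vers.split(".")[0]) >= 11
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    return props.major >= 8 and cuda_maj_ok
```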
jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Sep 13, 2022
`torch.cuda.is_bf16_supported()` returns False on ROCm, which is incorrect, since BF16 is supported on all AMD GPU architectures: gfx906, gfx908, and gfx90a.

cc @jithunnair-amd
Pull Request resolved: pytorch#80410
Approved by: https://github.com/jeffdaily, https://github.com/malfet
Labels
ciflow/periodic · ciflow/trunk · cla signed · Merged · module: rocm · open source · triaged