Conversation

@jeffdaily (Collaborator) commented Jan 13, 2021

nvcc's `--fmad=false` is not valid for the HIP compiler. Upcoming ROCm releases will start treating unrecognized compiler flags as an error.
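
For context, a minimal sketch of the kind of compiler-flag guard involved (illustrative only: the function name and structure are assumptions, not the actual PyTorch fuser source):

```cpp
// Illustrative sketch, not the actual PyTorch source: pick the
// "disable fused multiply-add" flag that the active compiler understands.
#include <string>

std::string disableFmaFlag() {
#ifdef __HIP_PLATFORM_HCC__
  // hipcc/hip-clang does not recognize nvcc's --fmad=false; the clang-style
  // equivalent is to turn off floating-point contraction.
  return "-ffp-contract=off";
#else
  // nvcc accepts --fmad=false to disable fused multiply-add generation.
  return "--fmad=false";
#endif
}
```

The point is that the nvcc spelling cannot simply be forwarded to hipcc; newer ROCm toolchains reject unrecognized flags instead of ignoring them.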

@facebook-github-bot facebook-github-bot added cla signed oncall: jit Add this issue/PR to JIT oncall triage queue labels Jan 13, 2021
@facebook-github-bot (Contributor) commented Jan 13, 2021

💊 CI failures summary and remediations

As of commit 7e2fb17 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@jeffdaily jeffdaily added the module: rocm AMD GPU support for Pytorch label Jan 13, 2021
@jeffdaily (Collaborator Author)

How am I supposed to resolve the clang-format error?

@jeffdaily (Collaborator Author)

ROCm CI failures are not related to this change. Will retest.

@jeffdaily (Collaborator Author)

@pytorchbot retest this please

@jeffdaily (Collaborator Author)

Again, ROCm CI failures are not related to this change. There was a JIT-related commit in master that broke ROCm that was recently reverted. Will retest.

@jeffdaily (Collaborator Author)

@pytorchbot retest this please

@mrshenli mrshenli added the module: cuda Related to torch.cuda, and CUDA support in general label Jan 15, 2021
@mrshenli mrshenli requested review from ngimel and suo January 15, 2021 03:17
@mrshenli (Contributor)

> How am I supposed to resolve the clang-format error?

Both the lint and test failures seem unrelated to your change. I will try landing this PR.

@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel ngimel removed their request for review January 20, 2021 03:54
@mrshenli (Contributor)

Hey @jeffdaily, are the test errors real?

Jan 14 03:43:43 ======================================================================
Jan 14 03:43:43 ERROR [0.125s]: test_lerp (__main__.TestTEFuser)
Jan 14 03:43:43 ----------------------------------------------------------------------
Jan 14 03:43:43 Traceback (most recent call last):
Jan 14 03:43:43   File "test_jit_fuser_te.py", line 611, in test_lerp
Jan 14 03:43:43     start = torch.randn(4, 1, dtype=torch.float, device=device)
Jan 14 03:43:43 RuntimeError: CUDA error: an illegal memory access was encountered
Jan 14 03:43:43 
Jan 14 03:43:43 ======================================================================
Jan 14 03:43:43 ERROR [0.827s]: test_lstm (__main__.TestTEFuser)
Jan 14 03:43:43 ----------------------------------------------------------------------
Jan 14 03:43:43 Traceback (most recent call last):
Jan 14 03:43:43   File "test_jit_fuser_te.py", line 913, in test_lstm
Jan 14 03:43:43     inputs = get_lstm_inputs(device, training=True)
Jan 14 03:43:43   File "/var/lib/jenkins/workspace/test/test_jit.py", line 229, in get_lstm_inputs
Jan 14 03:43:43     input = torch.randn(*input_shape, dtype=torch.float, device=device, requires_grad=training)
Jan 14 03:43:43 RuntimeError: CUDA error: an illegal memory access was encountered

@jeffdaily (Collaborator Author)

@mrshenli where did those errors come from? The last ROCm CI jobs in this PR showed green. Unless... does our ROCm CI not cover the JIT tests? I was leaning heavily on our CI here, and I wouldn't have submitted this PR for review had it not first passed CI.

@mrshenli (Contributor)

Hey @jeffdaily, it's this one below. The failed CI tests are from pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2, which is not ROCm, but I am not sure whether this is related.

https://app.circleci.com/jobs/github/pytorch/pytorch/10148068?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

@jeffdaily (Collaborator Author)

The only code change in this PR was protected by an ifdef, so CUDA failures shouldn't be related. However, I have confirmed that the change in this PR is not currently covered by ROCm CI. `test_jit_cuda_fuser` is the only test that covers this change, and it falls into the `JIT_EXECUTOR_TESTS` category, which is excluded by default for both the test1 and test2 groups.

That said, I am going to revise this PR. Our internal teams do need to skip unsupported compiler flags, but updating our CI scripts to cover this test is outside the scope of this PR. (Meaning, there could be other unrelated JIT test failures if we enabled all of them for ROCm CI.) Since the ROCm change to use `-ffp-contract=off` will not be properly tested by CI runs, I will remove the use of the flag. That will satisfy the immediate need of our developers until a proper fix is in place and JIT CI is fully enabled for ROCm.
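
A rough sketch of the revised behavior described above (hedged: the function name, env-var handling, and std::cerr warning are illustrative assumptions; the real change sits in PyTorch's CUDA fuser compilation path and would use its own warning macros):

```cpp
// Illustrative sketch of "warn instead of passing an unsupported flag".
// Not the exact PyTorch diff.
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

void appendFmaFlags(std::vector<std::string>& args) {
  const char* env = std::getenv("PYTORCH_CUDA_FUSER_DISABLE_FMA");
  const bool disable_fma = env && std::string(env) != "0";
  if (!disable_fma) {
    return;
  }
#ifdef __HIP_PLATFORM_HCC__
  // On ROCm, forward no flag at all: -ffp-contract=off is untested by CI,
  // and --fmad=false would be rejected by upcoming HIP compilers.
  std::cerr << "PYTORCH_CUDA_FUSER_DISABLE_FMA is not supported on ROCm; "
               "ignoring.\n";
#else
  args.emplace_back("--fmad=false");
#endif
}
```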

@jeffdaily jeffdaily changed the title PYTORCH_CUDA_FUSER_DISABLE_FMA on ROCm uses -ffp-contract=off [ROCm] warn unsupported PYTORCH_CUDA_FUSER_DISABLE_FMA Jan 26, 2021
@jeffdaily (Collaborator Author)

Accidentally committed the revision in a hipified file. Fixed in 12f0b4a.

@jeffdaily jeffdaily requested review from mrshenli and removed request for suo January 26, 2021 18:41
@mrshenli (Contributor)

Hey @jeffdaily, sorry that I dropped the ball on this. Is this still relevant? If yes, I will land when all tests pass.

@jeffdaily (Collaborator Author) commented Feb 15, 2021

@mrshenli no worries. Yes, this PR is still needed for upcoming HIP compiler changes.

I merged upstream hoping it would resolve the CI failures, but note that all CI failures were unrelated to this change.

@jeffdaily (Collaborator Author)

@mrshenli there is no way all these new CI failures are due to this PR: tensorpipe, sccache failing to connect, fbgemm. Even the ROCm CI failed to build. It's like CI gets worse every time I rebase.

@mrshenli (Contributor)

Hey @jeffdaily, it looks like this PR unintentionally included third-party changes (e.g., fbgemm, kineto, tensorpipe). Could you please remove those from this PR?

@jeffdaily jeffdaily force-pushed the rocm_fuser_disable_fma branch from d63242f to 7e2fb17 Compare February 15, 2021 21:58
@jeffdaily (Collaborator Author)

facepalm. I can't wait to put this 5-line change behind us.

@codecov (bot) commented Feb 16, 2021

Codecov Report

Merging #50508 (7e2fb17) into master (df837d0) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #50508   +/-   ##
=======================================
  Coverage   80.63%   80.63%           
=======================================
  Files        1959     1959           
  Lines      214878   214878           
=======================================
+ Hits       173269   173271    +2     
+ Misses      41609    41607    -2     

@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@mrshenli merged this pull request in 4df8e77.

xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
nvcc's `--fmad=false` is not valid for the HIP compiler.  Upcoming ROCm releases will start treating unrecognized compiler flags as an error.

Pull Request resolved: pytorch#50508

Reviewed By: albanD

Differential Revision: D25920291

Pulled By: mrshenli

fbshipit-source-id: c0ff3b74dd07f3d0661ba29efafaab291ef3621c
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022

jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022