Conversation

@jeffdaily (Collaborator) commented Jan 13, 2021

nvcc's `--fmad=false` is not valid for the HIP compiler. Upcoming ROCm releases will start treating unrecognized compiler flags as an error.
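
For context, a minimal sketch of the kind of compiler-flag guard involved (illustrative only: the function name and structure are assumptions, not the actual PyTorch fuser source):

```cpp
// Illustrative sketch, not the actual PyTorch source: pick the
// "disable fused multiply-add" flag that the active compiler understands.
#include <string>

std::string disableFmaFlag() {
#ifdef __HIP_PLATFORM_HCC__
  // hipcc/hip-clang does not recognize nvcc's --fmad=false; the clang-style
  // equivalent is to turn off floating-point contraction.
  return "-ffp-contract=off";
#else
  // nvcc accepts --fmad=false to disable fused multiply-add generation.
  return "--fmad=false";
#endif
}
```

The point is that the nvcc spelling cannot simply be forwarded to hipcc; newer ROCm toolchains reject unrecognized flags instead of ignoring them.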

@facebook-github-bot facebook-github-bot added cla signed oncall: jit Add this issue/PR to JIT oncall triage queue labels Jan 13, 2021
@facebook-github-bot (Contributor) commented Jan 13, 2021

💊 CI failures summary and remediations

As of commit 7e2fb17 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@jeffdaily jeffdaily added the module: rocm AMD GPU support for Pytorch label Jan 13, 2021
@jeffdaily (Collaborator Author)

How am I supposed to resolve the clang-format error?

@jeffdaily (Collaborator Author)

ROCm CI failures are not related to this change. Will retest.

@jeffdaily (Collaborator Author)

@pytorchbot retest this please

@jeffdaily (Collaborator Author)

Again, ROCm CI failures are not related to this change. There was a JIT-related commit in master that broke ROCm that was recently reverted. Will retest.

@jeffdaily (Collaborator Author)

@pytorchbot retest this please

@mrshenli mrshenli added the module: cuda Related to torch.cuda, and CUDA support in general label Jan 15, 2021
@mrshenli mrshenli requested review from ngimel and suo January 15, 2021 03:17
@mrshenli (Contributor)

> How am I supposed to resolve the clang-format error?

Both the lint and test failures seem unrelated to your change. I will try landing this PR.

@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel ngimel removed their request for review January 20, 2021 03:54
@mrshenli (Contributor)

Hey @jeffdaily, are the test errors real?

Jan 14 03:43:43 ======================================================================
Jan 14 03:43:43 ERROR [0.125s]: test_lerp (__main__.TestTEFuser)
Jan 14 03:43:43 ----------------------------------------------------------------------
Jan 14 03:43:43 Traceback (most recent call last):
Jan 14 03:43:43   File "test_jit_fuser_te.py", line 611, in test_lerp
Jan 14 03:43:43     start = torch.randn(4, 1, dtype=torch.float, device=device)
Jan 14 03:43:43 RuntimeError: CUDA error: an illegal memory access was encountered
Jan 14 03:43:43 
Jan 14 03:43:43 ======================================================================
Jan 14 03:43:43 ERROR [0.827s]: test_lstm (__main__.TestTEFuser)
Jan 14 03:43:43 ----------------------------------------------------------------------
Jan 14 03:43:43 Traceback (most recent call last):
Jan 14 03:43:43   File "test_jit_fuser_te.py", line 913, in test_lstm
Jan 14 03:43:43     inputs = get_lstm_inputs(device, training=True)
Jan 14 03:43:43   File "/var/lib/jenkins/workspace/test/test_jit.py", line 229, in get_lstm_inputs
Jan 14 03:43:43     input = torch.randn(*input_shape, dtype=torch.float, device=device, requires_grad=training)
Jan 14 03:43:43 RuntimeError: CUDA error: an illegal memory access was encountered

@jeffdaily (Collaborator Author)

@mrshenli where did those errors come from? The last ROCm CI jobs in this PR showed green. Unless... does our ROCm CI not cover the JIT tests? I was leaning heavily on our CI here, and I wouldn't have submitted this PR for review had it not first passed CI.

@mrshenli (Contributor)

Hey @jeffdaily, it's this one below. The failed CI tests are from pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2, which is not ROCm, but I am not sure whether this is related.

https://app.circleci.com/jobs/github/pytorch/pytorch/10148068?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

@jeffdaily (Collaborator Author)

The only code change in this PR was protected by an ifdef, so CUDA failures shouldn't be related. However, I have confirmed that the change in this PR is not currently covered by ROCm CI. `test_jit_cuda_fuser` is the only test that covers this change, and it falls into the `JIT_EXECUTOR_TESTS` category, which is excluded by default for both the test1 and test2 groups.

That said, I am going to revise this PR. Our internal teams do need to skip unsupported compiler flags, but updating our CI scripts to cover this test is outside the scope of this PR. (Meaning, there could be other unrelated JIT test failures if we enabled all of them for ROCm CI.) Since the ROCm change to use `-ffp-contract=off` will not be properly tested by CI runs, I will remove the use of the flag. That will satisfy the immediate need of our developers until a proper fix is in place and JIT CI is fully enabled for ROCm.
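
A rough sketch of the revised behavior described above (hedged: the function name, env-var handling, and std::cerr warning are illustrative assumptions; the real change sits in PyTorch's CUDA fuser compilation path and would use its own warning macros):

```cpp
// Illustrative sketch of "warn instead of passing an unsupported flag".
// Not the exact PyTorch diff.
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

void appendFmaFlags(std::vector<std::string>& args) {
  const char* env = std::getenv("PYTORCH_CUDA_FUSER_DISABLE_FMA");
  const bool disable_fma = env && std::string(env) != "0";
  if (!disable_fma) {
    return;
  }
#ifdef __HIP_PLATFORM_HCC__
  // On ROCm, forward no flag at all: -ffp-contract=off is untested by CI,
  // and --fmad=false would be rejected by upcoming HIP compilers.
  std::cerr << "PYTORCH_CUDA_FUSER_DISABLE_FMA is not supported on ROCm; "
               "ignoring.\n";
#else
  args.emplace_back("--fmad=false");
#endif
}
```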

@jeffdaily jeffdaily changed the title PYTORCH_CUDA_FUSER_DISABLE_FMA on ROCm uses -ffp-contract=off [ROCm] warn unsupported PYTORCH_CUDA_FUSER_DISABLE_FMA Jan 26, 2021
@jeffdaily (Collaborator Author)

Accidentally committed the revision in a hipified file. Fixed in 12f0b4a.

@jeffdaily jeffdaily requested review from mrshenli and removed request for suo January 26, 2021 18:41
@mrshenli (Contributor)

Hey @jeffdaily, sorry that I dropped the ball on this. Is this still relevant? If yes, I will land when all tests pass.

@jeffdaily (Collaborator Author) commented Feb 15, 2021

@mrshenli no worries. Yes, this PR is still needed for upcoming HIP compiler changes.

I merged upstream hoping it would resolve the CI failures, but note that all CI failures were unrelated to this change.

@jeffdaily (Collaborator Author)

@mrshenli there is no way all these new CI failures are due to this PR: tensorpipe, sccache failing to connect, fbgemm. Even the ROCm CI failed to build. It's like CI gets worse every time I rebase.

@mrshenli (Contributor)

Hey @jeffdaily, it looks like this PR unintentionally included third-party changes (e.g., fbgemm, kineto, tensorpipe). Could you please remove those from this PR?

@jeffdaily jeffdaily force-pushed the rocm_fuser_disable_fma branch from d63242f to 7e2fb17 Compare February 15, 2021 21:58
@jeffdaily (Collaborator Author)

facepalm. I can't wait to put this 5-line change behind us.

@codecov (bot) commented Feb 16, 2021

Codecov Report

Merging #50508 (7e2fb17) into master (df837d0) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #50508   +/-   ##
=======================================
  Coverage   80.63%   80.63%           
=======================================
  Files        1959     1959           
  Lines      214878   214878           
=======================================
+ Hits       173269   173271    +2     
+ Misses      41609    41607    -2     

@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@mrshenli merged this pull request in 4df8e77.

xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
nvcc's `--fmad=false` is not valid for the HIP compiler.  Upcoming ROCm releases will start treating unrecognized compiler flags as an error.

Pull Request resolved: pytorch#50508

Reviewed By: albanD

Differential Revision: D25920291

Pulled By: mrshenli

fbshipit-source-id: c0ff3b74dd07f3d0661ba29efafaab291ef3621c
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022

jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022