
Conversation

@CaoE (Collaborator) commented Jul 27, 2023

Add backward check for test_memory_format.

@pytorch-bot bot commented Jul 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106104

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit b2924b1 with merge base a0cfaf0:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@CaoE CaoE force-pushed the add_backward_check_cl branch from b98b481 to ba4e739 Compare July 27, 2023 04:54
@CaoE CaoE added the topic: not user facing, ciflow/trunk, and ciflow/periodic labels Jul 27, 2023
@CaoE CaoE force-pushed the add_backward_check_cl branch 9 times, most recently from 32a7f8b to ba6de4b Compare July 31, 2023 02:59
@CaoE CaoE requested review from jgong5 and mingfeima August 1, 2023 01:20
@mikaylagawarecki mikaylagawarecki self-requested a review August 1, 2023 02:47
@jgong5 (Collaborator) left a comment

For the cases where the backward memory format checks are disabled, are they issues that need to be fixed?

@CaoE (Collaborator, Author) commented Aug 1, 2023

> For the cases where the backward memory format checks are disabled, are they issues that need to be fixed?

I will collect such issues. If there are no related existing issues, I will create corresponding GitHub issues.

@CaoE (Collaborator, Author) commented Aug 15, 2023

@mikaylagawarecki For the test breakages on MPS and CUDA, I created corresponding issues #107214, #107199, and #107201.

I have no Mac machine to reproduce the MPS test breakage, so I'm waiting for replies on whether #107214 is a real issue. After that, this PR should be ready.
From my perspective, the code is ready to be reviewed. Would you mind reviewing this draft PR first?

@mikaylagawarecki (Contributor) left a comment

@CaoE Thank you very much for your hard work with this PR as well as filing issues for the failing tests!!

Regarding the failing Mac test, I think the issue you filed is sufficient, and it is OK to skip this test when landing this PR.

Separately, what do you think of removing the corresponding tests in test_nn.py?

if isinstance(desired_outputs, torch.Tensor):
    desired_outputs = (desired_outputs,)
# === Do backward pass. ===
ref_diff_outputs = tuple(t for t in desired_outputs if _req_grad(t))
@mikaylagawarecki (Contributor) commented Aug 15, 2023

nit: This is an edge case, but this line might not be modular if something in desired_outputs is a TensorList (I'm not sure whether that is ever the case though); we could use pytree.tree_flatten instead of _traverse_object. Similarly below for diff_outputs.

@CaoE (Collaborator, Author) replied

Used pytree.tree_flatten instead.
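
For context, a minimal sketch of what that looks like (hand-written for illustration, not the PR's exact code; the sample nested outputs and the _req_grad helper below are assumptions):

import torch
import torch.utils._pytree as pytree

def _req_grad(t):
    # assumed helper: tensors that require grad are treated as differentiable outputs
    return isinstance(t, torch.Tensor) and t.requires_grad

# desired_outputs may contain nested TensorLists; tree_flatten handles that uniformly
desired_outputs = (torch.randn(2, 3, requires_grad=True),
                   [torch.randn(2, 3), torch.randn(2, 3, requires_grad=True)])
flat_outputs, _ = pytree.tree_flatten(desired_outputs)
ref_diff_outputs = tuple(t for t in flat_outputs if _req_grad(t))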


if (
    input_mem_format != torch.contiguous_format
    or module_mem_formats != torch.contiguous_format
@mikaylagawarecki (Contributor) commented

The second check will always be True here because we are checking the list instead of the current item.

Suggested change:
-    or module_mem_formats != torch.contiguous_format
+    or module_mem_format != torch.contiguous_format

@CaoE (Collaborator, Author) replied

Comment on lines 743 to 749
grad_outputs = tuple(
    torch.rand_like(t)
    for t in diff_outputs
)
grad_outputs = tuple(
    t1.copy_(t2)
    for (t1, t2) in zip(grad_outputs, ref_grad_outputs)
)
@mikaylagawarecki (Contributor) commented

nit: maybe this would be a bit cleaner but up to you!

Suggested change:
-    grad_outputs = tuple(
-        torch.rand_like(t)
-        for t in diff_outputs
-    )
-    grad_outputs = tuple(
-        t1.copy_(t2)
-        for (t1, t2) in zip(grad_outputs, ref_grad_outputs)
-    )
+    grad_outputs = tuple(
+        torch.empty_like(t1).copy_(t2)
+        for (t1, t2) in zip(diff_outputs, ref_grad_outputs)
+    )

@CaoE (Collaborator, Author) replied

It is cleaner. Fixed as suggested.

    ref_diff_outputs,
    ref_diff_inputs,
    grad_outputs=ref_grad_outputs,
    allow_unused=True,
@mikaylagawarecki (Contributor) commented

for my understanding, why do we set allow_unused=True?

@CaoE (Collaborator, Author) replied

Thanks for your comments. I removed allow_unused=True.

ModuleInfo(torch.nn.AdaptiveAvgPool2d,
           gradcheck_nondet_tol=GRADCHECK_NONDET_TOL,
           # Fails on backward check if output size is 1x1
           gradcheck_memformat=False,
@mikaylagawarecki (Contributor) commented

Really appreciate the detailed comments here regarding why each of these is set to False!

nit: Separately, is there any way we could make this an xfail? I'm hoping that if these are fixed, whoever sends the fixing PR will get a signal to un-xfail these tests. I worry that if we make this a flag, re-enabling the tests might fall through the cracks.

I'm thinking maybe something to the effect of

DecorateInfo(
    unittest.expectedFailure, 'TestModule',
    'test_memory_format',
    active_if=lambda p: p['training']
)

This seems reasonable to me because the non-backward version will still be tested if training=False and we can change the check here to if training and len(ref_diff_outputs) > 0

Let me know your thoughts on this!

@CaoE (Collaborator, Author) replied

Thanks for your suggestion! It's a good idea. I just found that there is active_if, which lets a DecorateInfo be activated according to the input parameters of the test.
Modified as suggested.
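
For reference, the resulting entry would look roughly like this (a sketch assembled from the snippets above; the exact skips= placement and formatting in the PR may differ):

ModuleInfo(torch.nn.AdaptiveAvgPool2d,
           gradcheck_nondet_tol=GRADCHECK_NONDET_TOL,
           skips=(
               # Fails on backward check if output size is 1x1
               DecorateInfo(
                   unittest.expectedFailure,
                   'TestModule',
                   'test_memory_format',
                   active_if=lambda p: p['training'],
               ),)
           ),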

@mikaylagawarecki (Contributor) commented Aug 15, 2023

One more thing -- do you think it makes sense to extend this to test that the channels_last params/buffers of the model also have gradients of the correct memory format? This could be a followup PR (and if you would prefer, I could send that PR instead, but I'm curious to get your thoughts as a developer working on channels_last_3d).

I was looking into #107199, and the complexity of the code paths / amount of branching made me wonder whether we might have silent correctness issues for the memory format of gradients of params/buffers as well.
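
The kind of check being discussed might look roughly like this (a hand-written sketch, not part of this PR; the module and shapes are arbitrary):

import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(memory_format=torch.channels_last)
x = torch.randn(2, 3, 16, 16).to(memory_format=torch.channels_last)
conv(x).sum().backward()
# the question: does the parameter gradient preserve the channels_last layout?
print(conv.weight.grad.is_contiguous(memory_format=torch.channels_last))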

@CaoE CaoE force-pushed the add_backward_check_cl branch from 499aaf6 to d2e84d4 Compare August 16, 2023 03:55
DecorateInfo(skipMPS),)
DecorateInfo(skipMPS),
# Fails on backward check if output size is 1x1
DecorateInfo(
@mikaylagawarecki (Contributor) commented

Seems like the issue is that this gives an unexpected success when run using inductor. Do you know what the issue is here for 1x1 outputs in eager mode? Would it be possible to fix it?

Otherwise I am okay with skipping the test and filing an issue

@CaoE (Collaborator, Author) replied

For 1x1 outputs, mean is applied on CPU instead of AdaptiveAvgPool2d, so the grad will be channels-first. At first I added expectedFailure for CPU, but it still fails on CUDA. I will check this further.
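
For illustration, the behavior can be reproduced with something like this (a rough CPU sketch, not the test code itself):

import torch

x = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last).requires_grad_()
torch.nn.AdaptiveAvgPool2d(1)(x).sum().backward()
# with a 1x1 output the kernel falls back to mean, so the input grad may come
# back contiguous (channels-first) instead of channels_last
print(x.grad.is_contiguous(memory_format=torch.channels_last))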

@CaoE (Collaborator, Author) replied

For the unexpected success when run using inductor, please see #107861.

@mikaylagawarecki (Contributor) commented

Hm, how come this doesn't give an unexpected success in CI on this PR anymore?

@CaoE (Collaborator, Author) commented Aug 25, 2023

This is because I added:

# See https://github.com/pytorch/pytorch/issues/107861
# When inductor tests are turned on, the setting of requires_grad will be lost
for t1, t2 in zip(
    torch.utils._pytree.tree_flatten(d_args)[0],
    torch.utils._pytree.tree_flatten(module_input.forward_input.args)[0],
):
    t1.requires_grad_(t2.requires_grad)
for t1, t2 in zip(
    torch.utils._pytree.tree_flatten(d_kwargs)[0],
    torch.utils._pytree.tree_flatten(module_input.forward_input.kwargs)[0],
):
    # restore requires_grad on kwargs, mirroring the args loop above
    t1.requires_grad_(t2.requires_grad)

When inductor is turned on, this succeeds because backwards are not executed.

@mikaylagawarecki (Contributor) commented

Oh thank you! I couldn't see the diff because it was force-pushed; merging now.

@CaoE CaoE force-pushed the add_backward_check_cl branch 6 times, most recently from edb86f1 to e088221 Compare August 20, 2023 13:43
@huydhn (Contributor) commented Aug 22, 2023

@pytorchbot drci

@CaoE CaoE force-pushed the add_backward_check_cl branch from e088221 to 20ab3e0 Compare August 24, 2023 10:29
@mikaylagawarecki (Contributor) commented

@pytorchbot rebase

@pytorchmergebot (Collaborator) commented

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented

Successfully rebased add_backward_check_cl onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add_backward_check_cl && git pull --rebase)

@CaoE (Collaborator, Author) commented Aug 25, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / linux-focal-rocm5.6-py3.8 / test (distributed, 1, 2, linux.rocm.gpu)

Details for Dev Infra team (raised by workflow job).

@CaoE CaoE force-pushed the add_backward_check_cl branch from 0beb7a6 to b2924b1 Compare August 25, 2023 05:11
@CaoE (Collaborator, Author) commented Aug 25, 2023

> @CaoE thanks again for adding this test. Following up,
>
>   1. I will do an in-depth review of add channel last 3d support for maxpool3d on CPU #97775 (Could you update that to make sure this test is running on that PR)
>   2. Would you be interested in removing the related test_nn.py tests?

@mikaylagawarecki Sorry for the slow reply. I've been occupied by some urgent tasks recently.
I may not have much time to do this in the short term, but I can gradually remove these tests in later PRs.

@CaoE (Collaborator, Author) commented Aug 25, 2023

@mikaylagawarecki (Contributor) commented

@pytorchbot merge -f "macos test_multilayer_var failures are unrelated"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

voznesenskym pushed a commit that referenced this pull request Aug 27, 2023
Add backward check for test_memory_format.

Pull Request resolved: #106104
Approved by: https://github.com/mikaylagawarecki
Labels
ciflow/inductor, ciflow/periodic, ciflow/trunk, Merged, open source, Reverted, topic: not user facing