
Conversation

@thomasjpfan
Contributor

@thomasjpfan thomasjpfan commented Sep 8, 2021

Continues #61935

This PR adds:

  1. test_cpu_gpu_results_are_equal, which checks that results on CPU and CUDA match (a rough sketch follows below).
  2. allow_devices for modules, to filter tests by device without skipping them.

Note: this PR does not include the gradgrad check, to keep it smaller.
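For context, here is a minimal sketch of what such a CPU/CUDA parity check can look like. The standalone helper below, its name, and its signature are illustrative only and not the actual test_cpu_gpu_results_are_equal added by this PR:

import torch
from torch.testing._internal.common_utils import freeze_rng_state

def check_cpu_gpu_parity(module_cls, ctor_args, input_shape, dtype=torch.float32):
    # Build the module on CPU and an identical copy on CUDA.
    cpu_module = module_cls(*ctor_args).to("cpu", dtype)
    gpu_module = module_cls(*ctor_args).to("cuda", dtype)
    gpu_module.load_state_dict(cpu_module.state_dict())

    cpu_input = torch.randn(input_shape, dtype=dtype, requires_grad=True)
    gpu_input = cpu_input.detach().to("cuda").requires_grad_(True)

    # Run both forward passes under the same RNG state, then compare
    # outputs and input gradients.
    with freeze_rng_state():
        cpu_output = cpu_module(cpu_input)
    with freeze_rng_state():
        gpu_output = gpu_module(gpu_input)
    torch.testing.assert_close(cpu_output, gpu_output.cpu())

    cpu_output.sum().backward()
    gpu_output.sum().backward()
    torch.testing.assert_close(cpu_input.grad, gpu_input.grad.cpu())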

cc @albanD @mruberry @jbschlosser @walterddr

@thomasjpfan thomasjpfan added the module: nn (Related to torch.nn) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Sep 8, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Sep 8, 2021


💊 CI failures summary and remediations

As of commit b6713c5 (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-py3.8-gcc9-coverage / test (distributed, 1, 1, linux.2xlarge) (1/3)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-09-13T23:15:50.8263415Z ----------------------------------------------------------------------
2021-09-13T23:15:50.8264305Z Traceback (most recent call last):
2021-09-13T23:15:50.8265884Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
2021-09-13T23:15:50.8267115Z     self._join_processes(fn)
2021-09-13T23:15:50.8268714Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
2021-09-13T23:15:50.8270064Z     self._check_return_codes(elapsed_time)
2021-09-13T23:15:50.8303494Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 692, in _check_return_codes
2021-09-13T23:15:50.8304314Z     self.assertEqual(
2021-09-13T23:15:50.8305165Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1676, in assertEqual
2021-09-13T23:15:50.8305953Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-09-13T23:15:50.8306765Z AssertionError: False is not true : Scalars failed to compare as equal! -6 != 0
2021-09-13T23:15:50.8307727Z Expect process 3 exit code to match Process 0 exit code of 0, but got -6
2021-09-13T23:15:50.8308083Z 
2021-09-13T23:15:50.8308615Z ----------------------------------------------------------------------
2021-09-13T23:15:50.8309067Z Ran 85 tests in 112.727s
2021-09-13T23:15:50.8309278Z 
2021-09-13T23:15:50.8309623Z FAILED (failures=1, skipped=31)
2021-09-13T23:15:50.8309888Z 
2021-09-13T23:15:50.8310223Z Generating XML reports...
2021-09-13T23:15:50.8311198Z Generated XML report: test-reports/python-unittest/distributed.test_c10d_gloo/TEST-CommTest-20210913231358.xml
2021-09-13T23:15:50.8322333Z Generated XML report: test-reports/python-unittest/distributed.test_c10d_gloo/TEST-DistributedDataParallelTest-20210913231358.xml

See GitHub Actions build win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (2/3)

Step: "Store Test Reports" (full log | diagnosis details | 🔁 rerun)

2021-09-13T23:02:06.8417306Z   c:\jenkins\miniconda3\lib\site-packages\coverage\execfile.py(247): run
2021-09-13T23:02:06.8417993Z   c:\jenkins\miniconda3\lib\site-packages\coverage\cmdline.py(746): do_run
2021-09-13T23:02:06.8418731Z   c:\jenkins\miniconda3\lib\site-packages\coverage\cmdline.py(588): command_line
2021-09-13T23:02:06.8419428Z   c:\jenkins\miniconda3\lib\site-packages\coverage\cmdline.py(871): main
2021-09-13T23:02:06.8420062Z   C:\Jenkins\Miniconda3\Scripts\coverage.exe\__main__.py(7): <module>
2021-09-13T23:02:06.8420612Z   c:\jenkins\miniconda3\lib\runpy.py(87): _run_code
2021-09-13T23:02:06.8421125Z   c:\jenkins\miniconda3\lib\runpy.py(194): _run_module_as_main
2021-09-13T23:02:06.8421424Z 
2021-09-13T23:02:06.8421671Z ok (0.003s)
2021-09-13T23:02:06.8452262Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.005s)
2021-09-13T23:02:06.8468909Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2021-09-13T23:02:06.8469844Z ok (0.002s)
2021-09-13T23:02:06.8501973Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.003s)
2021-09-13T23:02:06.8560882Z   test_chained_then (__main__.TestFuture) ... ok (0.006s)
2021-09-13T23:02:06.9695780Z   test_collect_all (__main__.TestFuture) ... ok (0.113s)
2021-09-13T23:02:06.9717543Z   test_done (__main__.TestFuture) ... ok (0.000s)
2021-09-13T23:02:06.9754593Z   test_done_exception (__main__.TestFuture) ... ok (0.000s)
2021-09-13T23:02:06.9801035Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.000s)
2021-09-13T23:02:06.9831170Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2021-09-13T23:02:06.9831860Z 
2021-09-13T23:02:06.9832082Z At:

See GitHub Actions build win-vs2019-cuda10.2-py3 / test (default, 1, 1, windows.8xlarge.nvidia.gpu) (3/3)

Step: "Store Test Reports" (full log | diagnosis details | 🔁 rerun)

2021-09-13T23:15:40.0711236Z + PYTORCH_FINAL_PACKAGE_DIR=/c/1231329030/build-results/
2021-09-13T23:15:40.0788636Z ++ cygpath -w /c/1231329030/build-results/
2021-09-13T23:15:40.0918960Z + PYTORCH_FINAL_PACKAGE_DIR_WIN='C:\1231329030\build-results\'
2021-09-13T23:15:40.0920005Z + export PYTORCH_FINAL_PACKAGE_DIR_WIN
2021-09-13T23:15:40.0920608Z + export PYTORCH_TEST_SKIP_NOARCH=1
2021-09-13T23:15:40.0921119Z + PYTORCH_TEST_SKIP_NOARCH=1
2021-09-13T23:15:40.0921876Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/build/torch
2021-09-13T23:15:40.1219032Z + CI_SCRIPTS_DIR=/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts
2021-09-13T23:15:40.1220230Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts
2021-09-13T23:15:40.1465166Z ++ ls '/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts/*'
2021-09-13T23:15:40.1667226Z ls: cannot access '/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts/*': No such file or directory
2021-09-13T23:15:40.1670761Z + '[' -n '' ']'
2021-09-13T23:15:40.1672118Z + export SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/.jenkins/pytorch/win-test-helpers
2021-09-13T23:15:40.1673911Z + SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/.jenkins/pytorch/win-test-helpers
2021-09-13T23:15:40.1674749Z + IN_PULL_REQUEST=
2021-09-13T23:15:40.1675103Z + '[' -n '' ']'
2021-09-13T23:15:40.1675577Z + [[ win-vs2019-cuda10.2-py3 == *cuda11* ]]
2021-09-13T23:15:40.1676328Z + run_tests
2021-09-13T23:15:40.1677206Z + for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe
2021-09-13T23:15:40.1678712Z + [[ -x /c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe ]]
2021-09-13T23:15:40.1679525Z + '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe'

This comment was automatically generated by Dr. CI.

@codecov

codecov bot commented Sep 8, 2021

Codecov Report

Merging #64694 (ddaa953) into master (3d976d9) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head ddaa953 differs from pull request most recent head f4ba7fd. Consider uploading reports for the commit f4ba7fd to get more accurate results.

@@            Coverage Diff             @@
##           master   #64694      +/-   ##
==========================================
- Coverage   66.69%   66.68%   -0.01%     
==========================================
  Files         718      718              
  Lines       92693    92694       +1     
==========================================
- Hits        61817    61814       -3     
- Misses      30876    30880       +4     

module_inputs_cpu = module_info.module_inputs_func(module_info, device="cpu", dtype=dtype,
                                                   requires_grad=True)

def make_leafs_input(items):
Contributor Author

This is for cases where the inputs are not leaves. For example, in NLLLoss the forward_input args are not leaves:

import torch
from torch.testing import make_tensor

input = make_tensor((15, 10), device='cpu', dtype=torch.float32, requires_grad=True).log_softmax(dim=1)
input.is_leaf
# False

Collaborator

Why do you need them to be leaves?
And the only way for them not to be leaves is if they require gradients, so it's not a problem for gradcheck.

Contributor Author

@thomasjpfan thomasjpfan Sep 10, 2021

This PR is checking CPU+GPU parity of .grad, and backward() only accumulates gradients in the leaves:

import torch
from torch.testing import make_tensor
import torch.nn as nn

input = (make_tensor((15, 10), device='cpu', dtype=torch.float32, requires_grad=True)
         .log_softmax(dim=1))
print("input.requires_grad:", input.requires_grad)

target = torch.empty(15, device='cpu').uniform_().mul(10).floor().long()

nll = nn.NLLLoss()
loss = nll(input, target)
loss.backward(retain_graph=True)
print("input.grad:", input.grad)

# input.requires_grad: True
# input.grad: None

Collaborator

In that case,

for item in items:
    if isinstance(item, torch.Tensor):
        item.retain_grad()

Contributor

@jbschlosser jbschlosser left a comment

Looking pretty good, thanks for the update! A few comments below.

Comment on lines 166 to 176
def _make_leafs(item):
    if isinstance(item, dict):
        for i in item.values():
            _make_leafs(i)
    elif isinstance(item, tuple):
        for i in item:
            _make_leafs(i)
    else:
        if not isinstance(item, torch.Tensor) or item.is_leaf:
            return
        old_requires_grad = item.requires_grad
Contributor

Hm, this is interesting. Is this needed because of NLLLoss's ModuleInputs? Looks like the input there is created with requires_grad=True then log_softmax() is called, creating a graph which breaks the test.

I wonder if we want this somewhere more general to enforce that all tensors with requires_grad=True coming from module_inputs_funcs are guaranteed to have an empty graph. Wouldn't this be needed for the gradcheck tests as well?

Also if we do make it more general and called automagically, it probably should handle lists as well.
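For illustration, a generalized recursive version that also walks lists could look something like the sketch below (assuming in-place detaching plus restoring requires_grad is an acceptable way to turn these tensors into leaves; this is not the code in the PR):

import torch

def _make_leafs(item):
    # Recurse through common containers so every nested tensor is reached.
    if isinstance(item, dict):
        for i in item.values():
            _make_leafs(i)
    elif isinstance(item, (list, tuple)):
        for i in item:
            _make_leafs(i)
    elif isinstance(item, torch.Tensor) and not item.is_leaf:
        # Detach in place to drop the autograd history, then restore the
        # original requires_grad flag so the tensor becomes a leaf again.
        old_requires_grad = item.requires_grad
        item.detach_()
        item.requires_grad_(old_requires_grad)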

Contributor Author

Wouldn't this be needed for the gradcheck tests as well?

gradcheck will check the input and call retain_grad:

inp.retain_grad()

I updated this function to do the same.
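A sketch of what the updated helper might look like, mirroring gradcheck's retain_grad() call (the exact code in the PR may differ):

import torch

def _retain_grad(item):
    # Walk nested containers and ask autograd to populate .grad on any
    # non-leaf tensor that requires grad, as gradcheck does for its inputs.
    if isinstance(item, dict):
        for i in item.values():
            _retain_grad(i)
    elif isinstance(item, (list, tuple)):
        for i in item:
            _retain_grad(i)
    elif isinstance(item, torch.Tensor) and item.requires_grad and not item.is_leaf:
        item.retain_grad()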

# === Compare forward output between cpu and gpu ===
with freeze_rng_state():
    cpu_output = cpu_module(*cpu_forward_args, **cpu_forward_kwargs)
with freeze_rng_state():
Contributor

nit: I don't think you need this second freeze_rng_state call

module_inputs_cpu = module_info.module_inputs_func(module_info, device="cpu", dtype=dtype,
                                                   requires_grad=True)

def _retain_grad(item):
Contributor

Do you mind adding a comment explaining the reason this is necessary?

@thomasjpfan
Contributor Author

Depending on what the issue is in #64444 (comment), this PR may also cause issues.

@jbschlosser
Contributor

Depending on what the issue is in #64444 (comment), this PR may also cause issues.

@thomasjpfan FYI a @skipIfTBB decorator is being added in #64942 to address the failures

@thomasjpfan
Contributor Author

@jbschlosser I wonder if that is why the number of threads is set to one in common_nn:

def _do_test(self, test_case, module, input):
    num_threads = torch.get_num_threads()
    torch.set_num_threads(1)

@jbschlosser
Contributor

@jbschlosser I wonder if that is why the number of threads is set to one in common_nn:

def _do_test(self, test_case, module, input):
    num_threads = torch.get_num_threads()
    torch.set_num_threads(1)

Yeah good find, we should probably replicate that for the ModuleInfo tests.
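One way to replicate that for the ModuleInfo tests would be a small context manager wrapped around the comparison, for example (an illustrative sketch, not code from this PR):

import contextlib
import torch

@contextlib.contextmanager
def single_threaded():
    # Temporarily force single-threaded execution, as common_nn does,
    # restoring the previous thread count afterwards.
    num_threads = torch.get_num_threads()
    torch.set_num_threads(1)
    try:
        yield
    finally:
        torch.set_num_threads(num_threads)

The CPU/GPU comparison body would then run inside a with single_threaded(): block.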

facebook-github-bot pushed a commit that referenced this pull request Nov 29, 2021
Summary:
Continuation of #64694; fixes issues with the diff there

Pull Request resolved: #68097

Reviewed By: mruberry

Differential Revision: D32300650

Pulled By: jbschlosser

fbshipit-source-id: f3a5e72b019d4eddd7202854999eab61fffc9006
@jbschlosser
Contributor

Finished in #68097

PaliC added a commit that referenced this pull request Nov 30, 2021
Summary:
Continuation of #64694; fixes issues with the diff there

Reviewed By: mruberry

Differential Revision: D32300650

Pulled By: jbschlosser

fbshipit-source-id: f3a5e72b019d4eddd7202854999eab61fffc9006

[ghstack-poisoned]

Labels

cla signed, module: nn (Related to torch.nn), open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants