
Conversation

@thomasjpfan
Contributor

@thomasjpfan thomasjpfan commented Sep 8, 2021

Continues #61935

This PR adds:

  1. test_cpu_gpu_results_are_equal, which checks that results on CPU and CUDA match (a rough sketch follows below).
  2. allow_devices for modules, to filter tests by device without skipping them.

Note: this PR does not include the gradgrad check, to keep it smaller.
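For context, here is a minimal sketch of what such a CPU/CUDA parity check can look like. The standalone helper below, its name, and its signature are illustrative only and not the actual test_cpu_gpu_results_are_equal added by this PR:

import torch
from torch.testing._internal.common_utils import freeze_rng_state

def check_cpu_gpu_parity(module_cls, ctor_args, input_shape, dtype=torch.float32):
    # Build the module on CPU and an identical copy on CUDA.
    cpu_module = module_cls(*ctor_args).to("cpu", dtype)
    gpu_module = module_cls(*ctor_args).to("cuda", dtype)
    gpu_module.load_state_dict(cpu_module.state_dict())

    cpu_input = torch.randn(input_shape, dtype=dtype, requires_grad=True)
    gpu_input = cpu_input.detach().to("cuda").requires_grad_(True)

    # Run both forward passes under the same RNG state, then compare
    # outputs and input gradients.
    with freeze_rng_state():
        cpu_output = cpu_module(cpu_input)
    with freeze_rng_state():
        gpu_output = gpu_module(gpu_input)
    torch.testing.assert_close(cpu_output, gpu_output.cpu())

    cpu_output.sum().backward()
    gpu_output.sum().backward()
    torch.testing.assert_close(cpu_input.grad, gpu_input.grad.cpu())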

cc @albanD @mruberry @jbschlosser @walterddr

@thomasjpfan thomasjpfan added the module: nn (Related to torch.nn) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Sep 8, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Sep 8, 2021


💊 CI failures summary and remediations

As of commit b6713c5 (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-py3.8-gcc9-coverage / test (distributed, 1, 1, linux.2xlarge) (1/3)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-09-13T23:15:50.8263415Z ----------------------------------------------------------------------
2021-09-13T23:15:50.8264305Z Traceback (most recent call last):
2021-09-13T23:15:50.8265884Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
2021-09-13T23:15:50.8267115Z     self._join_processes(fn)
2021-09-13T23:15:50.8268714Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
2021-09-13T23:15:50.8270064Z     self._check_return_codes(elapsed_time)
2021-09-13T23:15:50.8303494Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 692, in _check_return_codes
2021-09-13T23:15:50.8304314Z     self.assertEqual(
2021-09-13T23:15:50.8305165Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1676, in assertEqual
2021-09-13T23:15:50.8305953Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-09-13T23:15:50.8306765Z AssertionError: False is not true : Scalars failed to compare as equal! -6 != 0
2021-09-13T23:15:50.8307727Z Expect process 3 exit code to match Process 0 exit code of 0, but got -6
2021-09-13T23:15:50.8308083Z 
2021-09-13T23:15:50.8308615Z ----------------------------------------------------------------------
2021-09-13T23:15:50.8309067Z Ran 85 tests in 112.727s
2021-09-13T23:15:50.8309278Z 
2021-09-13T23:15:50.8309623Z FAILED (failures=1, skipped=31)
2021-09-13T23:15:50.8309888Z 
2021-09-13T23:15:50.8310223Z Generating XML reports...
2021-09-13T23:15:50.8311198Z Generated XML report: test-reports/python-unittest/distributed.test_c10d_gloo/TEST-CommTest-20210913231358.xml
2021-09-13T23:15:50.8322333Z Generated XML report: test-reports/python-unittest/distributed.test_c10d_gloo/TEST-DistributedDataParallelTest-20210913231358.xml

See GitHub Actions build win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (2/3)

Step: "Store Test Reports" (full log | diagnosis details | 🔁 rerun)

2021-09-13T23:02:06.8417306Z   c:\jenkins\miniconda3\lib\site-packages\coverage\execfile.py(247): run
2021-09-13T23:02:06.8417993Z   c:\jenkins\miniconda3\lib\site-packages\coverage\cmdline.py(746): do_run
2021-09-13T23:02:06.8418731Z   c:\jenkins\miniconda3\lib\site-packages\coverage\cmdline.py(588): command_line
2021-09-13T23:02:06.8419428Z   c:\jenkins\miniconda3\lib\site-packages\coverage\cmdline.py(871): main
2021-09-13T23:02:06.8420062Z   C:\Jenkins\Miniconda3\Scripts\coverage.exe\__main__.py(7): <module>
2021-09-13T23:02:06.8420612Z   c:\jenkins\miniconda3\lib\runpy.py(87): _run_code
2021-09-13T23:02:06.8421125Z   c:\jenkins\miniconda3\lib\runpy.py(194): _run_module_as_main
2021-09-13T23:02:06.8421424Z 
2021-09-13T23:02:06.8421671Z ok (0.003s)
2021-09-13T23:02:06.8452262Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.005s)
2021-09-13T23:02:06.8468909Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2021-09-13T23:02:06.8469844Z ok (0.002s)
2021-09-13T23:02:06.8501973Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.003s)
2021-09-13T23:02:06.8560882Z   test_chained_then (__main__.TestFuture) ... ok (0.006s)
2021-09-13T23:02:06.9695780Z   test_collect_all (__main__.TestFuture) ... ok (0.113s)
2021-09-13T23:02:06.9717543Z   test_done (__main__.TestFuture) ... ok (0.000s)
2021-09-13T23:02:06.9754593Z   test_done_exception (__main__.TestFuture) ... ok (0.000s)
2021-09-13T23:02:06.9801035Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.000s)
2021-09-13T23:02:06.9831170Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2021-09-13T23:02:06.9831860Z 
2021-09-13T23:02:06.9832082Z At:

See GitHub Actions build win-vs2019-cuda10.2-py3 / test (default, 1, 1, windows.8xlarge.nvidia.gpu) (3/3)

Step: "Store Test Reports" (full log | diagnosis details | 🔁 rerun)

2021-09-13T23:15:40.0711236Z + PYTORCH_FINAL_PACKAGE_DIR=/c/1231329030/build-results/
2021-09-13T23:15:40.0788636Z ++ cygpath -w /c/1231329030/build-results/
2021-09-13T23:15:40.0918960Z + PYTORCH_FINAL_PACKAGE_DIR_WIN='C:\1231329030\build-results\'
2021-09-13T23:15:40.0920005Z + export PYTORCH_FINAL_PACKAGE_DIR_WIN
2021-09-13T23:15:40.0920608Z + export PYTORCH_TEST_SKIP_NOARCH=1
2021-09-13T23:15:40.0921119Z + PYTORCH_TEST_SKIP_NOARCH=1
2021-09-13T23:15:40.0921876Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/build/torch
2021-09-13T23:15:40.1219032Z + CI_SCRIPTS_DIR=/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts
2021-09-13T23:15:40.1220230Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts
2021-09-13T23:15:40.1465166Z ++ ls '/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts/*'
2021-09-13T23:15:40.1667226Z ls: cannot access '/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/build/win_tmp/ci_scripts/*': No such file or directory
2021-09-13T23:15:40.1670761Z + '[' -n '' ']'
2021-09-13T23:15:40.1672118Z + export SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/.jenkins/pytorch/win-test-helpers
2021-09-13T23:15:40.1673911Z + SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/pytorch-1231329030/.jenkins/pytorch/win-test-helpers
2021-09-13T23:15:40.1674749Z + IN_PULL_REQUEST=
2021-09-13T23:15:40.1675103Z + '[' -n '' ']'
2021-09-13T23:15:40.1675577Z + [[ win-vs2019-cuda10.2-py3 == *cuda11* ]]
2021-09-13T23:15:40.1676328Z + run_tests
2021-09-13T23:15:40.1677206Z + for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe
2021-09-13T23:15:40.1678712Z + [[ -x /c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe ]]
2021-09-13T23:15:40.1679525Z + '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe'

This comment was automatically generated by Dr. CI.

@codecov

codecov bot commented Sep 8, 2021

Codecov Report

Merging #64694 (ddaa953) into master (3d976d9) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head ddaa953 differs from pull request most recent head f4ba7fd. Consider uploading reports for the commit f4ba7fd to get more accurate results.

@@            Coverage Diff             @@
##           master   #64694      +/-   ##
==========================================
- Coverage   66.69%   66.68%   -0.01%     
==========================================
  Files         718      718              
  Lines       92693    92694       +1     
==========================================
- Hits        61817    61814       -3     
- Misses      30876    30880       +4     

module_inputs_cpu = module_info.module_inputs_func(module_info, device="cpu", dtype=dtype,
                                                   requires_grad=True)

def make_leafs_input(items):
Contributor Author

This is for cases where the inputs are not leaves. For example, in NLLLoss the forward_input args are not leaves:

import torch
from torch.testing import make_tensor

input = make_tensor((15, 10), device='cpu', dtype=torch.float32, requires_grad=True).log_softmax(dim=1)
input.is_leaf
# False

Collaborator

Why do you need them to be leaves?
And the only way for them not to be leaves is if they require gradients, so it's not a problem for gradcheck.

Contributor Author

@thomasjpfan thomasjpfan Sep 10, 2021

This PR is checking CPU+GPU parity of .grad, and backward() only accumulates gradients in the leaves:

import torch
from torch.testing import make_tensor
import torch.nn as nn

input = (make_tensor((15, 10), device='cpu', dtype=torch.float32, requires_grad=True)
         .log_softmax(dim=1))
print("input.requires_grad:", input.requires_grad)

target = torch.empty(15, device='cpu').uniform_().mul(10).floor().long()

nll = nn.NLLLoss()
loss = nll(input, target)
loss.backward(retain_graph=True)
print("input.grad:", input.grad)

# input.requires_grad: True
# input.grad: None

Collaborator

In that case,

for item in items:
    if isinstance(item, torch.Tensor):
        item.retain_grad()

Contributor

@jbschlosser jbschlosser left a comment

Looking pretty good, thanks for the update! A few comments below.

Comment on lines 166 to 176
def _make_leafs(item):
    if isinstance(item, dict):
        for i in item.values():
            _make_leafs(i)
    elif isinstance(item, tuple):
        for i in item:
            _make_leafs(i)
    else:
        if not isinstance(item, torch.Tensor) or item.is_leaf:
            return
        old_requires_grad = item.requires_grad
Contributor

Hm, this is interesting. Is this needed because of NLLLoss's ModuleInputs? Looks like the input there is created with requires_grad=True then log_softmax() is called, creating a graph which breaks the test.

I wonder if we want this somewhere more general to enforce that all tensors with requires_grad=True coming from module_inputs_funcs are guaranteed to have an empty graph. Wouldn't this be needed for the gradcheck tests as well?

Also if we do make it more general and called automagically, it probably should handle lists as well.
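For illustration, a generalized recursive version that also walks lists could look something like the sketch below (assuming in-place detaching plus restoring requires_grad is an acceptable way to turn these tensors into leaves; this is not the code in the PR):

import torch

def _make_leafs(item):
    # Recurse through common containers so every nested tensor is reached.
    if isinstance(item, dict):
        for i in item.values():
            _make_leafs(i)
    elif isinstance(item, (list, tuple)):
        for i in item:
            _make_leafs(i)
    elif isinstance(item, torch.Tensor) and not item.is_leaf:
        # Detach in place to drop the autograd history, then restore the
        # original requires_grad flag so the tensor becomes a leaf again.
        old_requires_grad = item.requires_grad
        item.detach_()
        item.requires_grad_(old_requires_grad)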

Contributor Author

Wouldn't this be needed for the gradcheck tests as well?

gradcheck will check the input and call retain_grad:

inp.retain_grad()

I updated this function to do the same.
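A sketch of what the updated helper might look like, mirroring gradcheck's retain_grad() call (the exact code in the PR may differ):

import torch

def _retain_grad(item):
    # Walk nested containers and ask autograd to populate .grad on any
    # non-leaf tensor that requires grad, as gradcheck does for its inputs.
    if isinstance(item, dict):
        for i in item.values():
            _retain_grad(i)
    elif isinstance(item, (list, tuple)):
        for i in item:
            _retain_grad(i)
    elif isinstance(item, torch.Tensor) and item.requires_grad and not item.is_leaf:
        item.retain_grad()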

# === Compare forward output between cpu and gpu ===
with freeze_rng_state():
    cpu_output = cpu_module(*cpu_forward_args, **cpu_forward_kwargs)
with freeze_rng_state():
Contributor

nit: I don't think you need this second freeze_rng_state call

module_inputs_cpu = module_info.module_inputs_func(module_info, device="cpu", dtype=dtype,
                                                   requires_grad=True)

def _retain_grad(item):
Contributor

Do you mind adding a comment explaining the reason this is necessary?

@thomasjpfan
Contributor Author

Depending on what the issue is in #64444 (comment), this PR may also cause issues.

@jbschlosser
Contributor

Depending on what the issue is in #64444 (comment), this PR may also cause issues.

@thomasjpfan FYI a @skipIfTBB decorator is being added in #64942 to address the failures

@thomasjpfan
Contributor Author

@jbschlosser I wonder if that is why the number of threads is set to one in common_nn:

def _do_test(self, test_case, module, input):
    num_threads = torch.get_num_threads()
    torch.set_num_threads(1)

@jbschlosser
Contributor

@jbschlosser I wonder if that is why the number of threads is set to one in common_nn:

def _do_test(self, test_case, module, input):
    num_threads = torch.get_num_threads()
    torch.set_num_threads(1)

Yeah good find, we should probably replicate that for the ModuleInfo tests.
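One way to replicate that for the ModuleInfo tests would be a small context manager wrapped around the comparison, for example (an illustrative sketch, not code from this PR):

import contextlib
import torch

@contextlib.contextmanager
def single_threaded():
    # Temporarily force single-threaded execution, as common_nn does,
    # restoring the previous thread count afterwards.
    num_threads = torch.get_num_threads()
    torch.set_num_threads(1)
    try:
        yield
    finally:
        torch.set_num_threads(num_threads)

The CPU/GPU comparison body would then run inside a with single_threaded(): block.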

facebook-github-bot pushed a commit that referenced this pull request Nov 29, 2021
Summary:
Continuation of #64694; fixes issues with the diff there

Pull Request resolved: #68097

Reviewed By: mruberry

Differential Revision: D32300650

Pulled By: jbschlosser

fbshipit-source-id: f3a5e72b019d4eddd7202854999eab61fffc9006
@jbschlosser
Contributor

Finished in #68097

PaliC added a commit that referenced this pull request Nov 30, 2021
Summary:
Continuation of #64694; fixes issues with the diff there

Reviewed By: mruberry

Differential Revision: D32300650

Pulled By: jbschlosser

fbshipit-source-id: f3a5e72b019d4eddd7202854999eab61fffc9006

[ghstack-poisoned]

Labels

cla signed, module: nn (Related to torch.nn), open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants