use torch.testing.assert_close for internal test suite #58981

Closed
wants to merge 72 commits

Conversation

pmeier
Collaborator

@pmeier pmeier commented May 26, 2021

With this PR we partially use the newly added torch.testing.assert_close for the internal numeric comparisons in torch.testing._internal.common_utils.TestCase.assertEqual, which is widely used in our test suite.
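
For illustration, a minimal sketch of what the delegation means for a test author (assuming the forwarding described above; the example values are arbitrary):

import torch
from torch.testing._internal.common_utils import TestCase


class ExampleTest(TestCase):
    def test_close(self):
        # this comparison now fails via torch.testing.assert_close
        # rather than the old comparison machinery
        self.assertEqual(torch.tensor(1.0), torch.tensor(1.01))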

Some limitations:

@facebook-github-bot
Contributor

facebook-github-bot commented May 26, 2021

💊 CI failures summary and remediations

As of commit d3ad51c (more details on the Dr. CI page):


  • 15/16 failures possibly* introduced in this PR
    • 1/15 non-scanned failure(s)
  • 1/16 broken upstream at merge base b03b45a on Jul 23 from 2:02am to 6:02am

🕵️ 14 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build Windows CI (pytorch-win-vs2019-cpu-py3) / test (default, 2, 2, windows.4xlarge) (1/14)

Step: "Run test scripts" (full log | diagnosis details | 🔁 rerun)

2021-07-23T16:56:18.6642207Z RuntimeError: test_torch failed!
2021-07-23T16:56:17.4340693Z Generated XML report: test-reports\dist-gloo\test_torch\TEST-TestTorchDeviceTypeCPU-20210723165600.xml
2021-07-23T16:56:17.4342004Z Generated XML report: test-reports\dist-gloo\test_torch\TEST-TestVitalSignsCudaCPU-20210723165600.xml
2021-07-23T16:56:17.6198531Z [TORCH_VITAL] CUDA.used		 False
2021-07-23T16:56:17.6199097Z [TORCH_VITAL] Dataloader.basic_unit_test		 TEST_VALUE_STRING
2021-07-23T16:56:17.6199633Z [TORCH_VITAL] Dataloader.enabled		 True
2021-07-23T16:56:18.6639737Z Traceback (most recent call last):
2021-07-23T16:56:18.6640597Z   File "run_test.py", line 1089, in <module>
2021-07-23T16:56:18.6640955Z     main()
2021-07-23T16:56:18.6641330Z   File "run_test.py", line 1068, in main
2021-07-23T16:56:18.6641778Z     raise RuntimeError(err_message)
2021-07-23T16:56:18.6642207Z RuntimeError: test_torch failed!
2021-07-23T16:56:18.8753654Z 
2021-07-23T16:56:18.8754345Z (base) C:\actions-runner\_work\pytorch\pytorch\pytorch-1060029330\test>popd
2021-07-23T16:56:18.8758724Z 
2021-07-23T16:56:18.8759273Z (base) C:\actions-runner\_work\pytorch\pytorch\pytorch-1060029330>if ERRORLEVEL 1 exit /b 1 
2021-07-23T16:56:18.8784303Z + cleanup
2021-07-23T16:56:18.8784646Z + retcode=1
2021-07-23T16:56:18.8784906Z + set +x
2021-07-23T16:56:18.8817710Z ##[error]Process completed with exit code 1.
2021-07-23T16:56:18.8961629Z ##[group]Run # -ir => recursive include all files in pattern
2021-07-23T16:56:18.8962289Z # -ir => recursive include all files in pattern

See GitHub Actions build Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / test (default, 2, 2, linux.2xlarge) (2/14)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-07-23T15:01:56.3035890Z AssertionError: Scalars are not close!
2021-07-23T15:01:56.3029769Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
2021-07-23T15:01:56.3030534Z     return fn(slf, *args, **kwargs)
2021-07-23T15:01:56.3031304Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
2021-07-23T15:01:56.3031918Z     return fn(slf, *args, **kwargs)
2021-07-23T15:01:56.3032336Z   File "test_linalg.py", line 885, in test_det
2021-07-23T15:01:56.3032804Z     self.assertEqual(actual, expected)
2021-07-23T15:01:56.3033641Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1440, in assertEqual
2021-07-23T15:01:56.3034198Z     msg=msg,
2021-07-23T15:01:56.3034866Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_asserts.py", line 982, in assert_close
2021-07-23T15:01:56.3035465Z     raise error_meta.to_error()
2021-07-23T15:01:56.3035890Z AssertionError: Scalars are not close!
2021-07-23T15:01:56.3036183Z 
2021-07-23T15:01:56.3036635Z Absolute difference: inf (up to 1e-07 allowed)
2021-07-23T15:01:56.3037229Z Relative difference: inf (up to 1e-07 allowed)
2021-07-23T15:01:56.3037532Z 
2021-07-23T15:01:56.3038012Z ----------------------------------------------------------------------
2021-07-23T15:01:56.3038433Z Ran 2510 tests in 90.637s
2021-07-23T15:01:56.3038641Z 
2021-07-23T15:01:56.3038953Z FAILED (failures=2, skipped=51)
2021-07-23T15:01:56.3039202Z 
2021-07-23T15:01:56.3039526Z Generating XML reports...

See GitHub Actions build Windows CI (pytorch-win-vs2019-cpu-py3) / test (default, 1, 2, windows.4xlarge) (3/14)

Step: "Run test scripts" (full log | diagnosis details | 🔁 rerun)

2021-07-23T16:20:08.9211486Z RuntimeError: test_autograd failed!
2021-07-23T16:20:08.5945489Z Generated XML report: test-reports\python-unittest\test_autograd\TEST-TestAutogradDeviceTypeCPU-20210723161952.xml
2021-07-23T16:20:08.5947033Z Generated XML report: test-reports\python-unittest\test_autograd\TEST-TestAutogradForwardMode-20210723161952.xml
2021-07-23T16:20:08.5948489Z Generated XML report: test-reports\python-unittest\test_autograd\TEST-TestAutogradFunctional-20210723161952.xml
2021-07-23T16:20:08.5950000Z Generated XML report: test-reports\python-unittest\test_autograd\TEST-TestAutogradInferenceMode-20210723161952.xml
2021-07-23T16:20:08.5951528Z Generated XML report: test-reports\python-unittest\test_autograd\TEST-TestMultithreadAutograd-20210723161952.xml
2021-07-23T16:20:08.9208845Z Traceback (most recent call last):
2021-07-23T16:20:08.9209761Z   File "run_test.py", line 1089, in <module>
2021-07-23T16:20:08.9210141Z     main()
2021-07-23T16:20:08.9210604Z   File "run_test.py", line 1068, in main
2021-07-23T16:20:08.9211059Z     raise RuntimeError(err_message)
2021-07-23T16:20:08.9211486Z RuntimeError: test_autograd failed!
2021-07-23T16:20:09.0977535Z 
2021-07-23T16:20:09.0978386Z (base) C:\actions-runner\_work\pytorch\pytorch\pytorch-1060029330\test>if ERRORLEVEL 1 exit /b 1 
2021-07-23T16:20:09.1001187Z + cleanup
2021-07-23T16:20:09.1001497Z + retcode=1
2021-07-23T16:20:09.1001755Z + set +x
2021-07-23T16:20:09.1031290Z ##[error]Process completed with exit code 1.
2021-07-23T16:20:09.1168411Z ##[group]Run # -ir => recursive include all files in pattern
2021-07-23T16:20:09.1169060Z # -ir => recursive include all files in pattern
2021-07-23T16:20:09.1169593Z 7z a "test-reports-$Env:TEST_CONFIG.zip" -ir'!test\*.xml'
2021-07-23T16:20:09.1187282Z shell: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.EXE -command ". '{0}'"

See GitHub Actions build Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / test (default, 1, 2, linux.2xlarge) (4/14)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-07-23T15:05:58.5000783Z test_remote_mess...yUniqueId(created_on=0, local_id=0) to be created.
2021-07-23T15:05:34.5549213Z frame #11: <unknown function> + 0x4272271 (0x7f7472540271 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-07-23T15:05:34.5551756Z frame #12: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7f746e06a783 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-07-23T15:05:34.5553698Z frame #13: <unknown function> + 0xc92bd (0x7f746df972bd in /opt/conda/lib/libstdc++.so.6)
2021-07-23T15:05:34.5555518Z frame #14: <unknown function> + 0x76ba (0x7f74884816ba in /lib/x86_64-linux-gnu/libpthread.so.0)
2021-07-23T15:05:34.5557095Z frame #15: clone + 0x6d (0x7f74881b751d in /lib/x86_64-linux-gnu/libc.so.6)
2021-07-23T15:05:34.5557815Z 
2021-07-23T15:05:34.8162714Z ok (3.317s)
2021-07-23T15:05:42.0387346Z   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.222s)
2021-07-23T15:05:49.2610855Z   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.222s)
2021-07-23T15:05:55.4821213Z   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (6.221s)
2021-07-23T15:05:58.5000783Z   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
2021-07-23T15:05:58.5004100Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
2021-07-23T15:05:58.5007213Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7fafc0722a99 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-07-23T15:05:58.5010049Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7fafc071f042 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-07-23T15:05:58.5013283Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7fafc07209de in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-07-23T15:05:58.5016440Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4a4 (0x7fafc4b92c94 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-07-23T15:05:58.5020441Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x71 (0x7fafc4b83fb1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-07-23T15:05:58.5024774Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x12a (0x7fafccf8289a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
2021-07-23T15:05:58.5029225Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x14c (0x7fafc4b886dc in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-07-23T15:05:58.5033402Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7fafccf7f765 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
2021-07-23T15:05:58.5036573Z frame #8: <unknown function> + 0x424335a (0x7fafc4b8535a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

See GitHub Actions build Linux CI (pytorch-linux-bionic-py3.8-gcc9-coverage) / test (default, 1, 2, linux.2xlarge) (5/14)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-07-23T15:31:04.9815302Z AssertionError: Scalars are not close!
2021-07-23T15:31:04.9809055Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
2021-07-23T15:31:04.9809814Z     return fn(slf, *args, **kwargs)
2021-07-23T15:31:04.9810550Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
2021-07-23T15:31:04.9811154Z     return fn(slf, *args, **kwargs)
2021-07-23T15:31:04.9811645Z   File "test_linalg.py", line 885, in test_det
2021-07-23T15:31:04.9812110Z     self.assertEqual(actual, expected)
2021-07-23T15:31:04.9812914Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1430, in assertEqual
2021-07-23T15:31:04.9813578Z     torch.testing.assert_close(
2021-07-23T15:31:04.9814322Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_asserts.py", line 982, in assert_close
2021-07-23T15:31:04.9814880Z     raise error_meta.to_error()
2021-07-23T15:31:04.9815302Z AssertionError: Scalars are not close!
2021-07-23T15:31:04.9815570Z 
2021-07-23T15:31:04.9816021Z Absolute difference: inf (up to 1e-07 allowed)
2021-07-23T15:31:04.9816606Z Relative difference: inf (up to 1e-07 allowed)
2021-07-23T15:31:04.9816884Z 
2021-07-23T15:31:04.9817362Z ----------------------------------------------------------------------
2021-07-23T15:31:04.9817760Z Ran 2510 tests in 170.806s
2021-07-23T15:31:04.9817962Z 
2021-07-23T15:31:04.9818347Z FAILED (failures=2, skipped=51)
2021-07-23T15:31:04.9818620Z 
2021-07-23T15:31:04.9818917Z Generating XML reports...

See GitHub Actions build Linux CI (pytorch-linux-bionic-py3.8-gcc9-coverage) / test (default, 2, 2, linux.2xlarge) (6/14)

Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-07-23T15:15:53.6670455Z AssertionError: Scalars are not close!
2021-07-23T15:15:53.6664129Z ======================================================================
2021-07-23T15:15:53.6664849Z FAIL [0.021s]: test_finfo (__main__.TestDTypeInfo)
2021-07-23T15:15:53.6665704Z ----------------------------------------------------------------------
2021-07-23T15:15:53.6666240Z Traceback (most recent call last):
2021-07-23T15:15:53.6666688Z   File "test_type_info.py", line 46, in test_finfo
2021-07-23T15:15:53.6667175Z     self.assertEqual(xinfo.max, xninfo.max)
2021-07-23T15:15:53.6668044Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1430, in assertEqual
2021-07-23T15:15:53.6668718Z     torch.testing.assert_close(
2021-07-23T15:15:53.6669460Z   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_asserts.py", line 982, in assert_close
2021-07-23T15:15:53.6670040Z     raise error_meta.to_error()
2021-07-23T15:15:53.6670455Z AssertionError: Scalars are not close!
2021-07-23T15:15:53.6670739Z 
2021-07-23T15:15:53.6671194Z Absolute difference: inf (up to 1e-07 allowed)
2021-07-23T15:15:53.6671769Z Relative difference: inf (up to 1e-07 allowed)
2021-07-23T15:15:53.6672060Z 
2021-07-23T15:15:53.6672523Z ----------------------------------------------------------------------
2021-07-23T15:15:53.6673100Z Ran 3 tests in 0.027s
2021-07-23T15:15:53.6673303Z 
2021-07-23T15:15:53.6673572Z FAILED (failures=1)
2021-07-23T15:15:53.6673784Z 
2021-07-23T15:15:53.6674095Z Generating XML reports...

See CircleCI build pytorch_linux_bionic_py3_6_clang9_noarch_test (7/14)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 23 15:33:35 AssertionError: Scalars are not close!
Jul 23 15:33:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:33:35     return fn(slf, *args, **kwargs)
Jul 23 15:33:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:33:35     return fn(slf, *args, **kwargs)
Jul 23 15:33:35   File "test_linalg.py", line 885, in test_det
Jul 23 15:33:35     self.assertEqual(actual, expected)
Jul 23 15:33:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1440, in assertEqual
Jul 23 15:33:35     msg=msg,
Jul 23 15:33:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_asserts.py", line 982, in assert_close
Jul 23 15:33:35     raise error_meta.to_error()
Jul 23 15:33:35 AssertionError: Scalars are not close!
Jul 23 15:33:35 
Jul 23 15:33:35 Absolute difference: inf (up to 1e-07 allowed)
Jul 23 15:33:35 Relative difference: inf (up to 1e-07 allowed)
Jul 23 15:33:35 
Jul 23 15:33:36 ----------------------------------------------------------------------
Jul 23 15:33:36 Ran 5020 tests in 130.543s
Jul 23 15:33:36 
Jul 23 15:33:36 FAILED (failures=2, skipped=1406)
Jul 23 15:33:36 
Jul 23 15:33:36 Generating XML reports...

See CircleCI build pytorch_macos_10_13_py3_test (8/14)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jul 23 15:20:32 AssertionError: Scalars are not close!
Jul 23 15:20:32   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:20:32     return fn(slf, *args, **kwargs)
Jul 23 15:20:32   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:20:32     return fn(slf, *args, **kwargs)
Jul 23 15:20:32   File "test_linalg.py", line 885, in test_det
Jul 23 15:20:32     self.assertEqual(actual, expected)
Jul 23 15:20:32   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1440, in assertEqual
Jul 23 15:20:32     msg=msg,
Jul 23 15:20:32   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_asserts.py", line 982, in assert_close
Jul 23 15:20:32     raise error_meta.to_error()
Jul 23 15:20:32 AssertionError: Scalars are not close!
Jul 23 15:20:32 
Jul 23 15:20:32 Absolute difference: inf (up to 1e-07 allowed)
Jul 23 15:20:32 Relative difference: inf (up to 1e-07 allowed)
Jul 23 15:20:32 
Jul 23 15:20:32 ----------------------------------------------------------------------
Jul 23 15:20:32 Ran 2510 tests in 199.900s
Jul 23 15:20:32 
Jul 23 15:20:32 FAILED (failures=2, skipped=51)
Jul 23 15:20:32 
Jul 23 15:20:32 Generating XML reports...

See CircleCI build pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test1 (9/14)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 23 15:52:11 AssertionError: Scalars are not close!
Jul 23 15:52:11   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:52:11     return fn(slf, *args, **kwargs)
Jul 23 15:52:11   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:52:11     return fn(slf, *args, **kwargs)
Jul 23 15:52:11   File "/var/lib/jenkins/workspace/test/test_linalg.py", line 885, in test_det
Jul 23 15:52:11     self.assertEqual(actual, expected)
Jul 23 15:52:11   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1430, in assertEqual
Jul 23 15:52:11     torch.testing.assert_close(
Jul 23 15:52:11   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_asserts.py", line 982, in assert_close
Jul 23 15:52:11     raise error_meta.to_error()
Jul 23 15:52:11 AssertionError: Scalars are not close!
Jul 23 15:52:11 
Jul 23 15:52:11 Absolute difference: inf (up to 1e-07 allowed)
Jul 23 15:52:11 Relative difference: inf (up to 1e-07 allowed)
Jul 23 15:52:11 
Jul 23 15:52:11 ----------------------------------------------------------------------
Jul 23 15:52:11 Ran 2504 tests in 300.696s
Jul 23 15:52:11 
Jul 23 15:52:11 FAILED (failures=2, skipped=45)
Jul 23 15:52:11 
Jul 23 15:52:11 Generating XML reports...

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (10/14)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 23 16:07:43 RuntimeError: test_torch failed!
Jul 23 16:07:43 Generated XML report: test-reports/python-unittest/test_torch/TEST-TestTorchDeviceTypeCPU-20210723160626.xml
Jul 23 16:07:43 Generated XML report: test-reports/python-unittest/test_torch/TEST-TestVitalSignsCudaCPU-20210723160626.xml
Jul 23 16:07:43 [TORCH_VITAL] Dataloader.enabled		 True
Jul 23 16:07:43 [TORCH_VITAL] Dataloader.basic_unit_test		 TEST_VALUE_STRING
Jul 23 16:07:43 [TORCH_VITAL] CUDA.used		 False
Jul 23 16:07:43 Traceback (most recent call last):
Jul 23 16:07:43   File "test/run_test.py", line 1089, in <module>
Jul 23 16:07:43     main()
Jul 23 16:07:43   File "test/run_test.py", line 1068, in main
Jul 23 16:07:43     raise RuntimeError(err_message)
Jul 23 16:07:43 RuntimeError: test_torch failed!
Jul 23 16:07:44 + cleanup
Jul 23 16:07:44 + retcode=1
Jul 23 16:07:44 + set +x
Jul 23 16:07:44 =================== sccache compilation log ===================
Jul 23 16:07:44 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jul 23 16:07:44 Compile requests                      30
Jul 23 16:07:44 Compile requests executed             26
Jul 23 16:07:44 Cache hits                             2
Jul 23 16:07:44 Cache hits (C/C++)                     2
Jul 23 16:07:44 Cache misses                          24

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (11/14)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 23 15:35:01 AssertionError: Scalars are not close!
Jul 23 15:35:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:35:01     return fn(slf, *args, **kwargs)
Jul 23 15:35:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 15:35:01     return fn(slf, *args, **kwargs)
Jul 23 15:35:01   File "test_linalg.py", line 885, in test_det
Jul 23 15:35:01     self.assertEqual(actual, expected)
Jul 23 15:35:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1440, in assertEqual
Jul 23 15:35:01     msg=msg,
Jul 23 15:35:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_asserts.py", line 982, in assert_close
Jul 23 15:35:01     raise error_meta.to_error()
Jul 23 15:35:01 AssertionError: Scalars are not close!
Jul 23 15:35:01 
Jul 23 15:35:01 Absolute difference: inf (up to 1e-07 allowed)
Jul 23 15:35:01 Relative difference: inf (up to 1e-07 allowed)
Jul 23 15:35:01 
Jul 23 15:35:01 ----------------------------------------------------------------------
Jul 23 15:35:01 Ran 2510 tests in 130.045s
Jul 23 15:35:01 
Jul 23 15:35:01 FAILED (failures=2, skipped=51)
Jul 23 15:35:01 
Jul 23 15:35:01 Generating XML reports...

See CircleCI build pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test2 (12/14)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 23 15:45:57 RuntimeError: test_torch failed!
Jul 23 15:45:57 Generated XML report: test-reports/python-unittest/test_torch/TEST-TestTorchDeviceTypeCUDA-20210723154417.xml
Jul 23 15:45:57 Generated XML report: test-reports/python-unittest/test_torch/TEST-TestVitalSignsCudaCUDA-20210723154417.xml
Jul 23 15:45:57 [TORCH_VITAL] Dataloader.enabled		 True
Jul 23 15:45:57 [TORCH_VITAL] Dataloader.basic_unit_test		 TEST_VALUE_STRING
Jul 23 15:45:57 [TORCH_VITAL] CUDA.used		 true
Jul 23 15:45:57 Traceback (most recent call last):
Jul 23 15:45:57   File "/var/lib/jenkins/workspace/test/run_test.py", line 1089, in <module>
Jul 23 15:45:57     main()
Jul 23 15:45:57   File "/var/lib/jenkins/workspace/test/run_test.py", line 1068, in main
Jul 23 15:45:57     raise RuntimeError(err_message)
Jul 23 15:45:57 RuntimeError: test_torch failed!
Jul 23 15:45:58 
Jul 23 15:45:58 real	7m18.958s
Jul 23 15:45:58 user	10m24.095s
Jul 23 15:45:58 sys	3m14.486s
Jul 23 15:45:58 + cleanup
Jul 23 15:45:58 + retcode=1
Jul 23 15:45:58 + set +x
Jul 23 15:45:58 =================== sccache compilation log ===================
Jul 23 15:45:58 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jul 23 15:45:58 Compile requests                     64

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (13/14)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 23 16:03:15 AssertionError: Scalars are not close!
Jul 23 16:03:15   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 16:03:15     return fn(slf, *args, **kwargs)
Jul 23 16:03:15   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 814, in dep_fn
Jul 23 16:03:15     return fn(slf, *args, **kwargs)
Jul 23 16:03:15   File "test_linalg.py", line 885, in test_det
Jul 23 16:03:15     self.assertEqual(actual, expected)
Jul 23 16:03:15   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1440, in assertEqual
Jul 23 16:03:15     msg=msg,
Jul 23 16:03:15   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_asserts.py", line 982, in assert_close
Jul 23 16:03:15     raise error_meta.to_error()
Jul 23 16:03:15 AssertionError: Scalars are not close!
Jul 23 16:03:15 
Jul 23 16:03:15 Absolute difference: inf (up to 1e-07 allowed)
Jul 23 16:03:15 Relative difference: inf (up to 1e-07 allowed)
Jul 23 16:03:15 
Jul 23 16:03:16 ----------------------------------------------------------------------
Jul 23 16:03:16 Ran 2504 tests in 310.248s
Jul 23 16:03:16 
Jul 23 16:03:16 FAILED (failures=2, skipped=45)
Jul 23 16:03:16 
Jul 23 16:03:16 Generating XML reports...

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test1 (14/14)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 23 15:41:06 RuntimeError: test_linalg failed!
Jul 23 15:41:06     #167 0x55d5d7195196 in main /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Programs/python.c:69
Jul 23 15:41:06     #168 0x7fa32cc4983f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
Jul 23 15:41:06     #169 0x55d5d722533d in _start (/opt/conda/bin/python3.6+0x1a733d)
Jul 23 15:41:06 
Jul 23 15:41:06 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/c10/util/complex.h:166:15 in 
Jul 23 15:41:06 Traceback (most recent call last):
Jul 23 15:41:06   File "test/run_test.py", line 1089, in <module>
Jul 23 15:41:06     main()
Jul 23 15:41:06   File "test/run_test.py", line 1068, in main
Jul 23 15:41:06     raise RuntimeError(err_message)
Jul 23 15:41:06 RuntimeError: test_linalg failed!
Jul 23 15:41:07 + cleanup
Jul 23 15:41:07 + retcode=1
Jul 23 15:41:07 + set +x
Jul 23 15:41:07 =================== sccache compilation log ===================
Jul 23 15:41:07 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jul 23 15:41:07 Compile requests                      28
Jul 23 15:41:07 Compile requests executed             26
Jul 23 15:41:07 Cache hits                             2
Jul 23 15:41:07 Cache hits (C/C++)                     2
Jul 23 15:41:07 Cache misses                          24

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

ci.pytorch.org: 1 failed


@pmeier
Collaborator Author

pmeier commented May 27, 2021

Blocked by #59067.

@rgommers rgommers added the module: testing Issues related to the torch.testing module (not tests) label Jun 7, 2021
@pmeier
Collaborator Author

pmeier commented Jun 23, 2021

Complex comparison with conjugate bit set (for example see this CI run) will be fixed in #60522.

@pmeier
Collaborator Author

pmeier commented Jun 23, 2021

@mruberry there are multiple tests that fail now because the values are not close within the default tolerances, whereas they passed before. For example see this CI run.

Although torch.testing.assert_close uses the same default tolerances as self.assertEqual, we select them slightly differently. self.assertEqual selects the higher tolerance of the two dtypes

def _getDefaultRtolAndAtol(self, dtype0, dtype1):
    rtol = max(self.dtype_precisions.get(dtype0, (0, 0))[0],
               self.dtype_precisions.get(dtype1, (0, 0))[0])
    atol = max(self.dtype_precisions.get(dtype0, (0, 0))[1],
               self.dtype_precisions.get(dtype1, (0, 0))[1])
    return rtol, atol

and afterwards equalizes the dtype. torch.testing.assert_close does it the other way around:

def _get_default_rtol_and_atol(actual: Tensor, expected: Tensor) -> Tuple[float, float]:
    dtype = actual.dtype if actual.dtype == expected.dtype else torch.promote_types(actual.dtype, expected.dtype)
    return _DTYPE_PRECISIONS.get(dtype, (0.0, 0.0))

I think the way it is done in self.assertEqual makes more sense, especially when comparing against a reference that might have a more precise dtype:

>>> actual = torch.tensor(0.99, dtype=torch.bfloat16)
>>> expected = torch.tensor(1.0, dtype=torch.float64)
>>> torch.testing.assert_close(actual, expected, check_dtype=False)
AssertionError: Tensors are not close!

Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: 0.01171875 at 0 (up to 1e-07 allowed)
Greatest relative difference: 0.01171875 at 0 (up to 1e-07 allowed)
>>> torch.testing.assert_close(actual, expected.to(actual), check_dtype=False)
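
For illustration, selecting the tolerances from both dtypes before promotion (here the assumed bfloat16 defaults of rtol=1.6e-2, atol=1e-5) would also accept the comparison above without any casting:

>>> torch.testing.assert_close(actual, expected, rtol=1.6e-2, atol=1e-5, check_dtype=False)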

@pmeier pmeier requested a review from mruberry June 23, 2021 09:03
@pmeier
Collaborator Author

pmeier commented Jun 23, 2021

@mruberry There are multiple failures throughout the test suite, because self.assertEqual treats 0d tensors and nd tensors with a single element the same when comparing them to a Python scalar. torch.testing.assert_close (IMO correctly) reports the discrepancy. For an example see this CI run.

It is hard for me to judge whether these are actual bugs or were just overlooked while writing the tests. Thoughts?
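
A minimal sketch of the kind of discrepancy meant here (behavior of torch.testing.assert_close's shape check; self.assertEqual accepted both comparisons before):

>>> torch.testing.assert_close(torch.tensor(1.0), 1.0)    # 0d tensor vs. scalar: passes
>>> torch.testing.assert_close(torch.tensor([1.0]), 1.0)  # raises: shape (1,) does not match ()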

@pmeier
Collaborator Author

pmeier commented Jun 23, 2021

@mruberry There are multiple failures throughout the test suite like this:

AssertionError: Except for scalars, type equality is required, but got <class 'torch.nn.parameter.Parameter'> and <class 'torch.Tensor'> instead.

As discussed offline and, for example, in #56544 (comment), this could be fixed by an additional check_type_equality: bool = True kwarg. We could also have a special "relaxed" value that would allow subclasses, but not arbitrary input type combinations. With that, numpy.ndarray vs. torch.Tensor would still fail, whereas torch.nn.Parameter vs. torch.Tensor would pass.

I vote against making torch.nn.Parameter a special case, since even within our test suite there exist cases where it wouldn't be enough. For example, test/test_jit.py::TestScript::test_script_method_torch_function_overload defines a custom tensor class, which could not be checked for closeness in our current design.
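
A hypothetical sketch of the "relaxed" mode described above; the helper and its kwarg are illustrations only, not the actual API:

def _check_types(actual, expected, *, allow_subclasses: bool = True) -> None:
    if type(actual) is type(expected):
        return
    if allow_subclasses and (
        isinstance(actual, type(expected)) or isinstance(expected, type(actual))
    ):
        # e.g. torch.nn.Parameter vs. torch.Tensor would pass here ...
        return
    # ... whereas unrelated types such as numpy.ndarray vs. torch.Tensor still fail
    raise AssertionError(
        "Except for scalars, type equality is required, "
        f"but got {type(actual)} and {type(expected)} instead."
    )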

pmeier added a commit that referenced this pull request Jun 23, 2021
`torch.isclose` does not do this for bool tensors, which results in a test failure since subtraction (`abs(actual - expected)`) is not supported for them (see #58981). Since the `dtype` is already checked at this point, we can safely move the upcasting before `torch.isclose` is invoked.

[ghstack-poisoned]
pmeier added a commit that referenced this pull request Jun 23, 2021
`torch.isclose` does not do this for bool tensors, which results in a test failure since subtraction (`abs(actual - expected)`) is not supported for them (see #58981). Since the `dtype` is already checked at this point, we can safely move the upcasting before `torch.isclose` is invoked.

ghstack-source-id: 7214ffb92f4851e8f5edc9100ed81bdcd5f9db4d
Pull Request resolved: #60536
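
A minimal sketch of the upcasting described in the commit message above; the helper name and the intermediate dtype are assumptions for illustration:

import torch

def _isclose_with_bool_support(actual: torch.Tensor, expected: torch.Tensor) -> torch.Tensor:
    # the dtypes were already checked to match at this point, so upcasting both inputs is safe
    if actual.dtype is torch.bool:
        actual = actual.to(torch.int64)
        expected = expected.to(torch.int64)
    return torch.isclose(actual, expected)
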
@mruberry
Collaborator

mruberry commented Jun 24, 2021

@mruberry There are multiple failures throughout the test suite, because self.assertEqual treats 0d tensors and nd tensors with a single element the same when comparing them to a Python scalar. torch.testing.assert_close (IMO correctly) reports the discrepancy. For an example see this CI run.

It is hard for me to judge whether these are actual bugs or were just overlooked while writing the tests. Thoughts?

The few examples I looked at just seem like they're taking advantage of the "shorthand" that assertEqual allowed. I think changing them is fine.

One option for adopting assert_close as the guts of assertEqual (instead of compareTensors) would be to have each test class define an attribute, like "use_assert_close", and if that attribute is true, assertEqual calls assert_close internally. That would allow going test class by test class if switching everything at once is too laborious.
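
A hypothetical sketch of that opt-in mechanism; the attribute name and the fallback are assumptions for illustration:

import unittest

import torch


class TestCase(unittest.TestCase):
    use_assert_close: bool = False  # flipped to True one test class at a time

    def assertEqual(self, actual, expected, **kwargs):
        if self.use_assert_close:
            torch.testing.assert_close(actual, expected, **kwargs)
        else:
            super().assertEqual(actual, expected)  # stand-in for the legacy comparison logic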

@mruberry
Collaborator

@mruberry There are multiple failures throughout the test suite, because self.assertEqual treats 0d tensors and nd tensors with a single element the same when comparing them to a Python scalar. torch.testing.assert_close (IMO correctly) reports the discrepancy. For an example see this CI run.

It is hard for me to judge whether these are actual bugs or were just overlooked while writing the tests. Thoughts?

@mruberry There are multiple failures throughout the test suite like this:

AssertionError: Except for scalars, type equality is required, but got <class 'torch.nn.parameter.Parameter'> and <class 'torch.Tensor'> instead.

As discussed offline and, for example, in #56544 (comment), this could be fixed by an additional check_type_equality: bool = True kwarg. We could also have a special "relaxed" value that would allow subclasses, but not arbitrary input type combinations. With that, numpy.ndarray vs. torch.Tensor would still fail, whereas torch.nn.Parameter vs. torch.Tensor would pass.

I vote against making torch.nn.Parameter a special case, since even within our test suite there exist cases where it wouldn't be enough. For example, test/test_jit.py::TestScript::test_script_method_torch_function_overload defines a custom tensor class, which could not be checked for closeness in our current design.

Would your overall proposal be something like allow_subclasses=True and check_strides=False for the default kwarg design of assert_close, then?

pmeier added a commit that referenced this pull request Jun 24, 2021
We now have three out of three datapoints that `check_stride` will be `partial`'ed to `False`:

- `torch`: #58981 (comment)
- `torchvision`: #56544 (comment)
- `kornia`: https://github.com/kornia/kornia/blob/9041c42b410e6a4bbb664c7134a120be80aa2265/test/utils.py#L25

Given that strides are in most cases an implementation detail, IMO we should change the default to `False`. In cases where matching strides is a requirement for closeness / equality, it can always be set to `True` manually.

[ghstack-poisoned]
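
A minimal sketch of the `partial` pattern referenced above; the wrapper name is just an illustration:

import functools

import torch

assert_close = functools.partial(torch.testing.assert_close, check_stride=False)

# strides differ ((3, 1) vs. (1, 2)) but shape, dtype, and values match, so this passes
assert_close(torch.ones(2, 3), torch.ones(2, 3).t().contiguous().t())
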
pmeier added a commit that referenced this pull request Dec 16, 2021
…ase.assertEqual`"

Supersedes #58981.

cc mruberry

[ghstack-poisoned]
pmeier added a commit that referenced this pull request Jan 13, 2022
…ase.assertEqual`"

Supersedes #58981.

cc mruberry

Differential Revision: [D33542994](https://our.internmc.facebook.com/intern/diff/D33542994)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jan 27, 2022
Summary:
Pull Request resolved: #67796

Supersedes #58981.

cc mruberry

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542994

Pulled By: mruberry

fbshipit-source-id: 527099f5fdc154fd95ee48cd19f0a85eeec43443
pytorchmergebot pushed a commit that referenced this pull request Jan 27, 2022
Summary:
Pull Request resolved: #67796

Supersedes #58981.

cc mruberry

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542994

Pulled By: mruberry

fbshipit-source-id: 527099f5fdc154fd95ee48cd19f0a85eeec43443
(cherry picked from commit 1a58915)