pytorch_windows_vs2019_py36_cuda10.1_test1 started to fail frequently, which doesn't look like a regression specific to a particular PR #49558

Closed
malfet opened this issue Dec 17, 2020 · 13 comments
Labels
high priority module: ci Related to continuous integration module: windows Windows support for PyTorch triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@malfet
Contributor

malfet commented Dec 17, 2020

@malfet malfet added high priority module: windows Windows support for PyTorch module: ci Related to continuous integration triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Dec 17, 2020
@malfet
Contributor Author

malfet commented Dec 18, 2020

The failure often reproduces if one runs InvalidGradients and DeepReentrant back to back:

(base) circleci@PACKER-5FD865D6 C:\Users\circleci\project\build\win_tmp\build\torch\lib>test_api.exe --gtest_filter=CustomAutogradTest.InvalidGradients:CustomAutogradTest.DeepReentrant
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = CustomAutogradTest.InvalidGradients:CustomAutogradTest.DeepReentrant-*_CUDA:*_MultiCUDA
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from CustomAutogradTest
[ RUN      ] CustomAutogradTest.InvalidGradients
[       OK ] CustomAutogradTest.InvalidGradients (444 ms)
[ RUN      ] CustomAutogradTest.DeepReentrant
unknown file: error: SEH exception with code 0xc00000fd thrown in the test body.
[  FAILED  ] CustomAutogradTest.DeepReentrant (6 ms)
[----------] 2 tests from CustomAutogradTest (451 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (453 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CustomAutogradTest.DeepReentrant

 1 FAILED TEST

@malfet
Contributor Author

malfet commented Dec 18, 2020

DeepReentrant always fails in WinDbg:

ModLoad: 00007ff7`68d50000 00007ff7`6952c000   test_api.exe
ModLoad: 00007ffd`0bdc0000 00007ffd`0bfad000   ntdll.dll
ModLoad: 00007ffd`097b0000 00007ffd`09863000   C:\Windows\System32\KERNEL32.DLL
ModLoad: 00007ffd`08c40000 00007ffd`08ed5000   C:\Windows\System32\KERNELBASE.dll
ModLoad: 00007ffd`08ee0000 00007ffd`08fda000   C:\Windows\System32\ucrtbase.dll
ModLoad: 00007ffc`fb510000 00007ffc`fb579000   C:\Users\circleci\project\build\win_tmp\build\torch\lib\c10.dll
ModLoad: 00007ffc`c4f60000 00007ffc`d0b03000   C:\Users\circleci\project\build\win_tmp\build\torch\lib\torch_cpu.dll
ModLoad: 00007ffc`d8920000 00007ffc`d89b2000   C:\Windows\SYSTEM32\MSVCP140.dll
ModLoad: 00007ffc`d89c0000 00007ffc`d89d9000   C:\Windows\SYSTEM32\VCRUNTIME140.dll
ModLoad: 00007ffc`d93d0000 00007ffc`d93dc000   C:\Windows\SYSTEM32\VCRUNTIME140_1.dll
ModLoad: 0000017f`979d0000 0000017f`979dc000   C:\Windows\SYSTEM32\VCRUNTIME140_1.dll
ModLoad: 0000017f`979b0000 0000017f`979c9000   C:\Windows\SYSTEM32\VCRUNTIME140.dll
ModLoad: 00007ffc`f7450000 00007ffc`f765f000   C:\Users\circleci\project\build\win_tmp\build\torch\lib\libiomp5md.dll
ModLoad: 00007ffc`f5540000 00007ffc`f586c000   C:\Users\circleci\project\build\win_tmp\build\torch\lib\fbgemm.dll
ModLoad: 00007ffd`01910000 00007ffd`01afd000   C:\Windows\SYSTEM32\dbghelp.dll
ModLoad: 00007ffc`fb4d0000 00007ffc`fb510000   C:\Users\circleci\project\build\win_tmp\build\torch\lib\asmjit.dll
(c70.49c): Break instruction exception - code 80000003 (first chance)
ntdll!LdrpDoDebuggerBreak+0x30:
00007ffd`0be9250c cc              int     3
0:000> g
ModLoad: 00007ffd`09140000 00007ffd`091e3000   C:\Windows\System32\advapi32.dll
ModLoad: 00007ffd`09520000 00007ffd`095be000   C:\Windows\System32\msvcrt.dll
ModLoad: 00007ffd`09df0000 00007ffd`09e8f000   C:\Windows\System32\sechost.dll
ModLoad: 00007ffd`093f0000 00007ffd`09512000   C:\Windows\System32\RPCRT4.dll
ModLoad: 00007ffd`07780000 00007ffd`0778c000   C:\Windows\SYSTEM32\CRYPTBASE.DLL
ModLoad: 00007ffd`08200000 00007ffd`0827f000   C:\Windows\System32\bcryptPrimitives.dll
ModLoad: 00007ffd`07da0000 00007ffd`07db1000   C:\Windows\System32\kernel.appcore.dll
(c70.49c): Stack overflow - code c00000fd (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlpAllocateHeap+0x31:
00007ffd`0bdd1a21 c744243401000000 mov     dword ptr [rsp+34h],1 ss:00000017`bfe03fc4=00000000
0:000> k
 # Child-SP          RetAddr           Call Site
00 00000017`bfe03f90 00007ffd`0bdcfbad ntdll!RtlpAllocateHeap+0x31
01 00000017`bfe04200 00007ffd`08eeb5c6 ntdll!RtlpAllocateHeapInternal+0x98d
02 00000017`bfe042f0 00007ffc`ce377133 ucrtbase!_malloc_base+0x36
03 00000017`bfe04320 00007ffc`ca72530b torch_cpu!caffe2::TensorShapes::default_instance+0x148bb93
04 00000017`bfe04350 00007ffc`ca9e734c torch_cpu!at::detail::empty_cpu+0x22b
05 00000017`bfe04540 00007ffc`caec24ab torch_cpu!at::native::empty_strided_cpu+0xcc
06 00000017`bfe045f0 00007ffc`caeb81ad torch_cpu!at::zeros_outf+0xf79ab
07 00000017`bfe04660 00007ffc`cac1ba75 torch_cpu!at::zeros_outf+0xed6ad
08 00000017`bfe046d0 00007ffc`cacc1ac0 torch_cpu!at::native::mkldnn_sigmoid_+0xb39d5
09 00000017`bfe04790 00007ffc`cade5eb9 torch_cpu!at::native::mkldnn_sigmoid_+0x159a20
0a 00000017`bfe04990 00007ffc`cade291d torch_cpu!at::zeros_outf+0x1b3b9
0b 00000017`bfe04a30 00007ffc`cac1ba75 torch_cpu!at::zeros_outf+0x17e1d
0c 00000017`bfe04aa0 00007ffc`cacc1ac0 torch_cpu!at::native::mkldnn_sigmoid_+0xb39d5
0d 00000017`bfe04b60 00007ffc`cad61f46 torch_cpu!at::native::mkldnn_sigmoid_+0x159a20
0e 00000017`bfe04d60 00007ffc`ca9df353 torch_cpu!at::empty_strided+0x166
0f 00000017`bfe04e40 00007ffc`ca9defc8 torch_cpu!at::native::to_dense_backward+0x373
10 00000017`bfe04ef0 00007ffc`cafd4401 torch_cpu!at::native::to+0x78
11 00000017`bfe04f40 00007ffc`cafb6f60 torch_cpu!at::zeros_outf+0x209901
12 00000017`bfe04f80 00007ffc`cb02920f torch_cpu!at::zeros_outf+0x1ec460
13 00000017`bfe04fc0 00007ffc`cb03437c torch_cpu!at::is_custom_op+0x13e6f
14 00000017`bfe05060 00007ffc`cb063d48 torch_cpu!at::is_custom_op+0x1efdc
15 00000017`bfe05190 00007ffc`ca7154ea torch_cpu!at::Tensor::to+0xb8
16 00000017`bfe05200 00007ffc`ca713319 torch_cpu!at::TensorIteratorBase::compute_types+0x5fa
17 00000017`bfe05400 00007ffc`ca71359f torch_cpu!at::TensorIteratorBase::build+0x79
18 00000017`bfe05590 00007ffc`ca713286 torch_cpu!at::TensorIteratorBase::build_binary_op+0xbf
19 00000017`bfe05680 00007ffc`ca81bb64 torch_cpu!at::TensorIterator::binary_op+0x56
1a 00000017`bfe056c0 00007ffc`ca81bcf6 torch_cpu!at::native::sub+0x64
1b 00000017`bfe05a50 00007ffc`caf0a73e torch_cpu!at::native::sub+0x96
1c 00000017`bfe05ad0 00007ffc`caf001c5 torch_cpu!at::zeros_outf+0x13fc3e
1d 00000017`bfe05b60 00007ffc`cac1795c torch_cpu!at::zeros_outf+0x1356c5
1e 00000017`bfe05bd0 00007ffc`cacb389b torch_cpu!at::native::mkldnn_sigmoid_+0xaf8bc
1f 00000017`bfe05ca0 00007ffc`cadba4cd torch_cpu!at::native::mkldnn_sigmoid_+0x14b7fb
20 00000017`bfe05df0 00007ffc`cbeb3959 torch_cpu!at::sub+0xcd
21 00000017`bfe05e90 00007ffc`cbe92be5 torch_cpu!torch::autograd::GraphRoot::apply+0x32959
22 00000017`bfe06000 00007ffc`cac1795c torch_cpu!torch::autograd::GraphRoot::apply+0x11be5
23 00000017`bfe06070 00007ffc`cacb389b torch_cpu!at::native::mkldnn_sigmoid_+0xaf8bc
24 00000017`bfe06140 00007ffc`cb06185d torch_cpu!at::native::mkldnn_sigmoid_+0x14b7fb
25 00000017`bfe06290 00007ff7`68de8802 torch_cpu!at::Tensor::sub+0xcd
26 00000017`bfe06330 00007ff7`68db2dbf test_api!c10::ivalue::Future::exception_ptr+0x502
27 00000017`bfe06490 00007ff7`68de4b17 test_api!caffe2::TypeMeta::_typeMetaData<caffe2::detail::_Uninitialized>+0xa7f
28 00000017`bfe065e0 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x8777
29 00000017`bfe06700 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
2a 00000017`bfe06840 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
2b 00000017`bfe06940 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
2c 00000017`bfe09270 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
2d 00000017`bfe09710 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
2e 00000017`bfe09a30 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
2f 00000017`bfe09c90 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
30 00000017`bfe09f50 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
31 00000017`bfe0a210 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
32 00000017`bfe0a290 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
33 00000017`bfe0a370 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
34 00000017`bfe0a3c0 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
35 00000017`bfe0a480 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
36 00000017`bfe0a5f0 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
37 00000017`bfe0a660 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
38 00000017`bfe0a720 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
39 00000017`bfe0a840 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
3a 00000017`bfe0a980 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
3b 00000017`bfe0aa80 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
3c 00000017`bfe0d3b0 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
3d 00000017`bfe0d850 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
3e 00000017`bfe0db70 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
3f 00000017`bfe0ddd0 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
40 00000017`bfe0e090 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
41 00000017`bfe0e350 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
42 00000017`bfe0e3d0 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
43 00000017`bfe0e4b0 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
44 00000017`bfe0e500 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
45 00000017`bfe0e5c0 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
46 00000017`bfe0e730 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
47 00000017`bfe0e7a0 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
48 00000017`bfe0e860 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
49 00000017`bfe0e980 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
4a 00000017`bfe0eac0 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
4b 00000017`bfe0ebc0 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
4c 00000017`bfe114f0 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
4d 00000017`bfe11990 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
4e 00000017`bfe11cb0 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
4f 00000017`bfe11f10 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
50 00000017`bfe121d0 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
51 00000017`bfe12490 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
52 00000017`bfe12510 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
53 00000017`bfe125f0 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
54 00000017`bfe12640 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
55 00000017`bfe12700 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
56 00000017`bfe12870 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
57 00000017`bfe128e0 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
58 00000017`bfe129a0 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
59 00000017`bfe12ac0 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
5a 00000017`bfe12c00 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
5b 00000017`bfe12d00 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
5c 00000017`bfe15630 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
5d 00000017`bfe15ad0 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
5e 00000017`bfe15df0 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
5f 00000017`bfe16050 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
60 00000017`bfe16310 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
61 00000017`bfe165d0 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
62 00000017`bfe16650 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
63 00000017`bfe16730 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
64 00000017`bfe16780 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
65 00000017`bfe16840 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
66 00000017`bfe169b0 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
67 00000017`bfe16a20 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
68 00000017`bfe16ae0 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
69 00000017`bfe16c00 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
6a 00000017`bfe16d40 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
6b 00000017`bfe16e40 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
6c 00000017`bfe19770 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
6d 00000017`bfe19c10 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
6e 00000017`bfe19f30 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
6f 00000017`bfe1a190 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
70 00000017`bfe1a450 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
71 00000017`bfe1a710 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
72 00000017`bfe1a790 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
73 00000017`bfe1a870 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
74 00000017`bfe1a8c0 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
75 00000017`bfe1a980 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
76 00000017`bfe1aaf0 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
77 00000017`bfe1ab60 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
78 00000017`bfe1ac20 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
79 00000017`bfe1ad40 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
7a 00000017`bfe1ae80 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
7b 00000017`bfe1af80 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
7c 00000017`bfe1d8b0 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
7d 00000017`bfe1dd50 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
7e 00000017`bfe1e070 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
7f 00000017`bfe1e2d0 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
80 00000017`bfe1e590 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
81 00000017`bfe1e850 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
82 00000017`bfe1e8d0 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
83 00000017`bfe1e9b0 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
84 00000017`bfe1ea00 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
85 00000017`bfe1eac0 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
86 00000017`bfe1ec30 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
87 00000017`bfe1eca0 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
88 00000017`bfe1ed60 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
89 00000017`bfe1ee80 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
8a 00000017`bfe1efc0 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
8b 00000017`bfe1f0c0 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
8c 00000017`bfe219f0 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
8d 00000017`bfe21e90 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
8e 00000017`bfe221b0 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
8f 00000017`bfe22410 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
90 00000017`bfe226d0 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
91 00000017`bfe22990 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
92 00000017`bfe22a10 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
93 00000017`bfe22af0 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
94 00000017`bfe22b40 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
95 00000017`bfe22c00 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
96 00000017`bfe22d70 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
97 00000017`bfe22de0 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
98 00000017`bfe22ea0 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
99 00000017`bfe22fc0 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
9a 00000017`bfe23100 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
9b 00000017`bfe23200 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
9c 00000017`bfe25b30 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
9d 00000017`bfe25fd0 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
9e 00000017`bfe262f0 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
9f 00000017`bfe26550 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
a0 00000017`bfe26810 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
a1 00000017`bfe26ad0 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
a2 00000017`bfe26b50 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
a3 00000017`bfe26c30 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
a4 00000017`bfe26c80 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
a5 00000017`bfe26d40 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
a6 00000017`bfe26eb0 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
a7 00000017`bfe26f20 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
a8 00000017`bfe26fe0 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
a9 00000017`bfe27100 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
aa 00000017`bfe27240 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
ab 00000017`bfe27340 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
ac 00000017`bfe29c70 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
ad 00000017`bfe2a110 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
ae 00000017`bfe2a430 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
af 00000017`bfe2a690 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
b0 00000017`bfe2a950 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
b1 00000017`bfe2ac10 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
b2 00000017`bfe2ac90 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
b3 00000017`bfe2ad70 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
b4 00000017`bfe2adc0 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
b5 00000017`bfe2ae80 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
b6 00000017`bfe2aff0 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
b7 00000017`bfe2b060 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
b8 00000017`bfe2b120 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
b9 00000017`bfe2b240 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
ba 00000017`bfe2b380 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
bb 00000017`bfe2b480 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
bc 00000017`bfe2ddb0 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
bd 00000017`bfe2e250 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
be 00000017`bfe2e570 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
bf 00000017`bfe2e7d0 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
c0 00000017`bfe2ea90 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
c1 00000017`bfe2ed50 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
c2 00000017`bfe2edd0 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
c3 00000017`bfe2eeb0 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
c4 00000017`bfe2ef00 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
c5 00000017`bfe2efc0 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
c6 00000017`bfe2f130 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
c7 00000017`bfe2f1a0 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
c8 00000017`bfe2f260 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
c9 00000017`bfe2f380 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
ca 00000017`bfe2f4c0 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
cb 00000017`bfe2f5c0 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
cc 00000017`bfe31ef0 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
cd 00000017`bfe32390 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
ce 00000017`bfe326b0 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
cf 00000017`bfe32910 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
d0 00000017`bfe32bd0 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
d1 00000017`bfe32e90 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
d2 00000017`bfe32f10 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
d3 00000017`bfe32ff0 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
d4 00000017`bfe33040 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
d5 00000017`bfe33100 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
d6 00000017`bfe33270 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
d7 00000017`bfe332e0 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
d8 00000017`bfe333a0 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
d9 00000017`bfe334c0 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
da 00000017`bfe33600 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
db 00000017`bfe33700 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
dc 00000017`bfe36030 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
dd 00000017`bfe364d0 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
de 00000017`bfe367f0 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
df 00000017`bfe36a50 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
e0 00000017`bfe36d10 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
e1 00000017`bfe36fd0 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
e2 00000017`bfe37050 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
e3 00000017`bfe37130 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
e4 00000017`bfe37180 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
e5 00000017`bfe37240 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
e6 00000017`bfe373b0 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
e7 00000017`bfe37420 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
e8 00000017`bfe374e0 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
e9 00000017`bfe37600 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
ea 00000017`bfe37740 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
eb 00000017`bfe37840 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
ec 00000017`bfe3a170 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
ed 00000017`bfe3a610 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
ee 00000017`bfe3a930 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
ef 00000017`bfe3ab90 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
f0 00000017`bfe3ae50 00007ffc`cc3ee66a torch_cpu!torch::autograd::Engine::make_anomaly_metadata+0x49f
f1 00000017`bfe3b110 00007ffc`cc7cfb0a torch_cpu!torch::autograd::backward+0x6a
f2 00000017`bfe3b190 00007ffc`cc7d073f torch_cpu!torch::jit::mobile::SequentialSampler::save+0x1e56a
f3 00000017`bfe3b270 00007ffc`cb029940 torch_cpu!torch::autograd::VariableType::allCUDATypes+0x9ff
f4 00000017`bfe3b2c0 00007ffc`cb035ba7 torch_cpu!at::is_custom_op+0x145a0
f5 00000017`bfe3b380 00007ffc`cb039ae4 torch_cpu!at::is_custom_op+0x20807
f6 00000017`bfe3b4f0 00007ffc`ca6bee17 torch_cpu!at::Tensor::_backward+0xe4
f7 00000017`bfe3b560 00007ff7`68de4b6b torch_cpu!at::Tensor::backward+0x127
f8 00000017`bfe3b620 00007ff7`68ddcac7 test_api!c10::ivalue::Future::addCallback+0x87cb
f9 00000017`bfe3b740 00007ffc`cb0edb9a test_api!c10::ivalue::Future::addCallback+0x727
fa 00000017`bfe3b880 00007ffc`cc3f5faa torch_cpu!torch::autograd::Node::operator()+0x2da
fb 00000017`bfe3b980 00007ffc`cc3f697b torch_cpu!torch::autograd::Engine::add_thread_pool_task+0x6aa
fc 00000017`bfe3e2b0 00007ffc`cc3faebb torch_cpu!torch::autograd::Engine::evaluate_function+0x3eb
fd 00000017`bfe3e750 00007ffc`cc3f8555 torch_cpu!torch::autograd::Engine::thread_main+0x59b
fe 00000017`bfe3ea70 00007ffc`cc3f7efb torch_cpu!torch::autograd::Engine::execute_with_graph_task+0x355
ff 00000017`bfe3ecd0 00007ffc`cc3eecdf torch_cpu!torch::autograd::Engine::execute+0x52b
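
For a rough sense of scale, the repeating block in the trace above can be turned into a per-level stack cost. The sketch below subtracts the Child-SP values of frames 38 and 48 (two consecutive frames labelled `test_api!c10::ivalue::Future::addCallback+0x87cb`) and divides an assumed stack size by the result. The 1 MiB figure is the MSVC linker default and is only an assumption here, since the thread does not say how test_api.exe is linked; the constant and function names are illustrative and do not come from PyTorch.

```cpp
#include <cstdint>

// Child-SP values copied from frames 38 and 48 of the WinDbg trace above.
constexpr std::uint64_t kFrame38Sp = 0x00000017bfe0a720ULL;
constexpr std::uint64_t kFrame48Sp = 0x00000017bfe0e860ULL;

// ASSUMPTION: the main thread has a 1 MiB stack (the MSVC linker default).
// The issue thread does not state the actual stack reserve of test_api.exe.
constexpr std::uint64_t kAssumedStackBytes = 1ULL << 20;

// Stack bytes consumed by one reentrant backward() level, taken as the
// distance between two consecutive occurrences of the same frame.
constexpr std::uint64_t bytes_per_level() {
  return kFrame48Sp - kFrame38Sp;
}

// Upper bound on the number of nested levels that fit in the assumed stack,
// ignoring whatever the rest of the program already uses.
constexpr std::uint64_t max_levels() {
  return kAssumedStackBytes / bytes_per_level();
}
```

Under these assumptions each reentrant level costs about 16 KiB of stack, so only a few dozen nested `backward()` calls fit before the stack is exhausted.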

@malfet
Contributor Author

malfet commented Dec 18, 2020

Doing a little bit of bisecting: e9d7d37 passes the WinDbg test,
676bfa6 passes as well,
47c65f8 always fails,
and on 1047957 it's fixed again.

@skyline75489
Contributor

Thanks for the research, @malfet. It seems the stack overflow exception that is meant to be caught is not being handled properly? I'm new to the codebase; can you point out where the issue may come from?

And is e9d7d37 the last good commit? Wonder which commit is the first bad one.

@mszhanyi
Collaborator

mszhanyi commented Dec 18, 2020

@malfet
I think it's caused by the PR #47725.
pytorch_windows_vs2019_py36_cuda10.1_test1 passed after reverting commit 1b6d18a in #49583

@skyline75489
Contributor

skyline75489 commented Dec 18, 2020

If the stack is to be believed, the related logic is somewhere around here:

if (current_depth >= max_recursion_depth_) {
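
To illustrate what a check of this shape is typically for, here is a minimal sketch of a recursion-depth guard. It is not PyTorch's engine code: every name in it (`kMaxRecursionDepth`, `DepthGuard`, `run_nested`) is hypothetical. It assumes, based only on the `Engine::add_thread_pool_task` frames in the trace, that work is handed to another thread once the limit is reached, so that the nested call continues on a fresh stack.

```cpp
#include <cassert>
#include <thread>

// Hypothetical names throughout; this is not PyTorch's engine code.
constexpr int kMaxRecursionDepth = 8;  // illustrative limit, not PyTorch's value

// Number of nested calls currently on this thread's stack.
thread_local int current_depth = 0;

// Increments the per-thread depth on entry and restores it on scope exit.
struct DepthGuard {
  DepthGuard() { ++current_depth; }
  ~DepthGuard() { --current_depth; }
};

// Nests `remaining` more levels and returns how many times the work had to
// be handed off to a fresh thread because the depth limit was reached.
int run_nested(int remaining) {
  assert(remaining >= 0);
  if (remaining == 0) {
    return 0;
  }
  if (current_depth >= kMaxRecursionDepth) {
    // Limit reached: continue on a new thread, whose stack and thread_local
    // depth counter both start from zero. The caller blocks until it is done.
    int handoffs = 0;
    std::thread worker([&] { handoffs = 1 + run_nested(remaining); });
    worker.join();
    return handoffs;
  }
  DepthGuard guard;
  return run_nested(remaining - 1);
}
```

With the limit set to 8, `run_nested(20)` nests 8 levels on the calling thread, 8 on a second thread and 4 on a third, so no single stack ever holds more than 8 levels. If the limit is too high for the available stack, the overflow happens before the hand-off is ever reached.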

@malfet
Contributor Author

malfet commented Dec 18, 2020

@mszhanyi Anything CUDA-related is an unlikely culprit, as the same pattern of failures happens on CPU; see https://app.circleci.com/pipelines/github/pytorch/pytorch/252793/workflows/21c2af00-79d0-46b4-b7a7-85b9a6463c66/jobs/9719246 for example.

@swolchok
Contributor

I'm trying to debug this, but I am having trouble with basic things like running test_api.exe (it fails with a missing c10.dll on the command line and seems to do nothing in WinDbg). Is there a doc on how to get this running?

@malfet
Contributor Author

malfet commented Dec 18, 2020

@swolchok, just copy it along with the OpenMP library into the torch/lib folder and run it from there.

@swolchok
Contributor

This is likely due to #49359: #49359 (comment)

@swolchok
Contributor

The linked explanation accounts for the DeepReentrant failure, but what about the other test failures? I have no idea what's going on there.

@peterjc123
Collaborator

@swolchok What do you mean by "other" tests here? If you are talking about the failed tests other than DeepReentrant in that job, I think they are somehow caused by the fact that we didn't handle the SEH exception while running test_api.exe. So I guess that when the SEH exception goes away, they will return to their normal state.

@malfet
Contributor Author

malfet commented May 23, 2022

Irrelevant, as we stopped building with CUDA 10.1 a while back; closing.

@malfet malfet closed this as completed May 23, 2022
6 participants