
Parametrizations depending on several inputs #60530

Closed
wants to merge 13 commits

Conversation

lezcano
Collaborator

@lezcano lezcano commented Jun 23, 2021

Resubmit of #58488

A line in `test_nn.py` had been changed, as caught in #58488 (comment).

I reverted that line, which should never have been changed. I reckon that should solve the issue.
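
For context, a minimal sketch of the kind of parametrization this feature enables: one whose `forward` takes several tensors and whose `right_inverse` returns a tuple, registered through `torch.nn.utils.parametrize`. The `RankOne` class, the module, and the shapes below are illustrative only, not taken from this PR's tests.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class RankOne(nn.Module):
    def forward(self, u, v):
        # Builds the parametrized weight from two unconstrained tensors.
        return u.unsqueeze(-1) @ v.unsqueeze(-2)

    def right_inverse(self, W):
        # Maps an existing weight back to the tuple of tensors that
        # `forward` expects; here, scaled leading singular vectors.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        return U[..., 0] * S[0].sqrt(), Vh[0] * S[0].sqrt()

linear = nn.Linear(4, 4)
parametrize.register_parametrization(linear, "weight", RankOne())
print(linear.weight.shape)  # weight is now recomputed from two parameters
```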

@lezcano lezcano added the module: nn Related to torch.nn label Jun 23, 2021
@lezcano lezcano requested a review from albanD June 23, 2021 10:54
@facebook-github-bot
Contributor

facebook-github-bot commented Jun 23, 2021

💊 CI failures summary and remediations

As of commit bca087e (more details on the Dr. CI page and at hud.pytorch.org/pr/60530):


  • 5/6 failures possibly* introduced in this PR
    • 2/5 non-scanned failure(s)
  • 1/6 broken upstream at merge base 90cd57e from Jun 21 until Jun 23

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_paralleltbb_linux_xenial_py3_6_gcc5_4_test (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 23 15:33:58 frame #13: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7f9fa1603bd3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:33:58 frame #14: <unknown function> + 0xc8421 (0x7f9fa32e5421 in /opt/conda/lib/libstdc++.so.6)
Jun 23 15:33:58 frame #15: <unknown function> + 0x76ba (0x7f9fb0e4f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
Jun 23 15:33:58 frame #16: clone + 0x6d (0x7f9fb0b8551d in /lib/x86_64-linux-gnu/libc.so.6)
Jun 23 15:33:58 
Jun 23 15:33:58 ok (4.555s)
Jun 23 15:34:14   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (16.090s)
Jun 23 15:34:24   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (9.968s)
Jun 23 15:34:29   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (4.556s)
Jun 23 15:34:37   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.566s)
Jun 23 15:34:42   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:552] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jun 23 15:34:42 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Jun 23 15:34:42 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f462ec2ded9 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:34:42 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7f462ec29fd2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:34:42 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7f462ec2b96e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:34:42 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f4627578b64 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 23 15:34:42 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x71 (0x7f4627568b81 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 23 15:34:42 frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xc8 (0x7f462ff8b9f8 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 23 15:34:42 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x194 (0x7f462756d654 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 23 15:34:42 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7f462ff8aff5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 23 15:34:42 frame #8: <unknown function> + 0x426522a (0x7f462756a22a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

See CircleCI build pytorch_libtorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build (2/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Jun 23 13:37:27 ++++ extract_trap_cmd
Jun 23 13:37:27 ++++ printf '%s\n' ''
Jun 23 13:37:27 +++ printf '%s\n' cleanup
Jun 23 13:37:27 ++ trap -- '
Jun 23 13:37:27 cleanup' EXIT
Jun 23 13:37:27 ++ [[ pytorch-libtorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7-build != *pytorch-win-* ]]
Jun 23 13:37:27 ++ which sccache
Jun 23 13:37:28 ++ sccache --stop-server
Jun 23 13:37:28 ++ true
Jun 23 13:37:28 ++ rm /var/lib/jenkins/sccache_error.log
Jun 23 13:37:28 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory
Jun 23 13:37:28 ++ true
Jun 23 13:37:28 ++ [[ -n '' ]]
Jun 23 13:37:28 ++ [[ pytorch-libtorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7-build == *rocm* ]]
Jun 23 13:37:28 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jun 23 13:37:28 ++ SCCACHE_IDLE_TIMEOUT=1200
Jun 23 13:37:28 ++ RUST_LOG=sccache::server=error
Jun 23 13:37:28 ++ sccache --start-server
Jun 23 13:37:28 sccache: Starting the server...
Jun 23 13:37:28 ++ sccache --zero-stats
Jun 23 13:37:28 Compile requests                      0

1 failure not recognized by patterns:

  • Job: CircleCI binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test, Step: "Run in docker"

3 jobs timed out:

  • binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test
  • pytorch_libtorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build
  • pytorch_libtorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@lezcano lezcano removed the request for review from jbschlosser June 23, 2021 10:55
@albanD
Collaborator

albanD commented Jun 23, 2021

Thanks!
I'm triggering the master jobs to make sure everything is all good now.

@facebook-github-bot
Contributor

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@albanD merged this pull request in 3a838e4.

asuhan pushed a commit to asuhan/pytorch that referenced this pull request Jun 28, 2021
Summary:
Resubmit of pytorch#58488

There was a line that had been changed in `test_nn.py` as caught in pytorch#58488 (comment)

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: pytorch#60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3
asuhan pushed a commit that referenced this pull request Jun 30, 2021
Summary:
Resubmit of #58488

There was a line that had been changed in `test_nn.py` as caught in #58488 (comment)

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: #60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3