
Parametrizations depending on several inputs #60530

Closed
wants to merge 13 commits

Conversation

lezcano
Collaborator

@lezcano lezcano commented Jun 23, 2021

Resubmit of #58488

A line in `test_nn.py` had been changed, as caught in #58488 (comment).

I reverted that line, which should never have been changed. I reckon that should solve the issue.
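
For context, a minimal sketch of the kind of parametrization this feature enables: one whose `forward` takes several tensors and whose `right_inverse` returns a tuple, registered through `torch.nn.utils.parametrize`. The `RankOne` class, the module, and the shapes below are illustrative only, not taken from this PR's tests.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class RankOne(nn.Module):
    def forward(self, u, v):
        # Builds the parametrized weight from two unconstrained tensors.
        return u.unsqueeze(-1) @ v.unsqueeze(-2)

    def right_inverse(self, W):
        # Maps an existing weight back to the tuple of tensors that
        # `forward` expects; here, scaled leading singular vectors.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        return U[..., 0] * S[0].sqrt(), Vh[0] * S[0].sqrt()

linear = nn.Linear(4, 4)
parametrize.register_parametrization(linear, "weight", RankOne())
print(linear.weight.shape)  # weight is now recomputed from two parameters
```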

@lezcano lezcano added the module: nn Related to torch.nn label Jun 23, 2021
@lezcano lezcano requested a review from albanD June 23, 2021 10:54
@facebook-github-bot
Contributor

facebook-github-bot commented Jun 23, 2021

💊 CI failures summary and remediations

As of commit bca087e (more details on the Dr. CI page and at hud.pytorch.org/pr/60530):


  • 5/6 failures possibly* introduced in this PR
    • 2/5 non-scanned failure(s)
  • 1/6 broken upstream at merge base 90cd57e from Jun 21 until Jun 23

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_paralleltbb_linux_xenial_py3_6_gcc5_4_test (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 23 15:33:58 frame #13: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7f9fa1603bd3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:33:58 frame #14: <unknown function> + 0xc8421 (0x7f9fa32e5421 in /opt/conda/lib/libstdc++.so.6)
Jun 23 15:33:58 frame #15: <unknown function> + 0x76ba (0x7f9fb0e4f6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
Jun 23 15:33:58 frame #16: clone + 0x6d (0x7f9fb0b8551d in /lib/x86_64-linux-gnu/libc.so.6)
Jun 23 15:33:58 
Jun 23 15:33:58 ok (4.555s)
Jun 23 15:34:14   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (16.090s)
Jun 23 15:34:24   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (9.968s)
Jun 23 15:34:29   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (4.556s)
Jun 23 15:34:37   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.566s)
Jun 23 15:34:42   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:552] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jun 23 15:34:42 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Jun 23 15:34:42 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f462ec2ded9 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:34:42 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7f462ec29fd2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:34:42 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7f462ec2b96e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 23 15:34:42 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f4627578b64 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 23 15:34:42 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x71 (0x7f4627568b81 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 23 15:34:42 frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xc8 (0x7f462ff8b9f8 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 23 15:34:42 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x194 (0x7f462756d654 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 23 15:34:42 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7f462ff8aff5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 23 15:34:42 frame #8: <unknown function> + 0x426522a (0x7f462756a22a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

See CircleCI build pytorch_libtorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build (2/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Jun 23 13:37:27 ++++ extract_trap_cmd
Jun 23 13:37:27 ++++ printf '%s\n' ''
Jun 23 13:37:27 +++ printf '%s\n' cleanup
Jun 23 13:37:27 ++ trap -- '
Jun 23 13:37:27 cleanup' EXIT
Jun 23 13:37:27 ++ [[ pytorch-libtorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7-build != *pytorch-win-* ]]
Jun 23 13:37:27 ++ which sccache
Jun 23 13:37:28 ++ sccache --stop-server
Jun 23 13:37:28 ++ true
Jun 23 13:37:28 ++ rm /var/lib/jenkins/sccache_error.log
Jun 23 13:37:28 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory
Jun 23 13:37:28 ++ true
Jun 23 13:37:28 ++ [[ -n '' ]]
Jun 23 13:37:28 ++ [[ pytorch-libtorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7-build == *rocm* ]]
Jun 23 13:37:28 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jun 23 13:37:28 ++ SCCACHE_IDLE_TIMEOUT=1200
Jun 23 13:37:28 ++ RUST_LOG=sccache::server=error
Jun 23 13:37:28 ++ sccache --start-server
Jun 23 13:37:28 sccache: Starting the server...
Jun 23 13:37:28 ++ sccache --zero-stats
Jun 23 13:37:28 Compile requests                      0

1 failure not recognized by patterns:

  • Job: CircleCI binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test, Step: "Run in docker"

3 jobs timed out:

  • binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test
  • pytorch_libtorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build
  • pytorch_libtorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@lezcano lezcano removed the request for review from jbschlosser June 23, 2021 10:55
@albanD
Collaborator

albanD commented Jun 23, 2021

Thanks!
I'm triggering the master jobs to make sure everything is all good now.

@facebook-github-bot
Contributor

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@albanD merged this pull request in 3a838e4.

asuhan pushed a commit to asuhan/pytorch that referenced this pull request Jun 28, 2021
Summary:
Resubmit of pytorch#58488

There was a line that had been changed in `test_nn.py` as caught in pytorch#58488 (comment)

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: pytorch#60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3
asuhan pushed a commit that referenced this pull request Jun 30, 2021
Summary:
Resubmit of #58488

There was a line that had been changed in `test_nn.py` as caught in #58488 (comment)

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: #60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3