
[ROCm] r1.15 rccl upstream patch #34532

Merged

Conversation

@jeffdaily (Contributor) commented Nov 22, 2019

This patch is a backport of current RCCL support in master for the r1.15 branch. RCCL support was not complete in the r1.15 branch, and since this is the last V1 release branch, it is important to have this feature here.

Further, without this PR, the r1.15 branch will not build for the latest ROCm release due to missing clang 10-based header files. See #31849 for the same change to master.
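
For context, a ROCm build of the r1.15 branch typically looks something like the following. This is a minimal sketch assuming the usual TensorFlow ROCm flow; the --config=rocm option and the pip package target are standard TensorFlow build machinery, not something added by this PR:

    # Illustrative ROCm build of the r1.15 branch (assumed flow, not part of this PR).
    ./configure                      # answer "y" when prompted for ROCm support
    bazel build --config=rocm //tensorflow/tools/pip_package:build_pip_package
    ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg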

jeffdaily added 2 commits Nov 22, 2019
This patch is a backport of current RCCL support in master for the r1.15 branch.
@jeffdaily jeffdaily requested a review from chsigg as a code owner Nov 22, 2019
@tensorflow-bot tensorflow-bot bot added the size:L label Nov 22, 2019
@googlebot googlebot added the cla: yes label Nov 22, 2019
@jeffdaily jeffdaily changed the title [ROCm] R1.15 rccl upstream patch [ROCm] r1.15 rccl upstream patch Nov 22, 2019
@jeffdaily (Contributor, Author) commented Nov 25, 2019

Thank you @mihaimaruseac for approving. I was concerned by the number of failing checks, even though the build succeeded for both the ROCm and non-ROCm paths on our test systems. I tried reproducing the sanity test failure locally, but the indicated failures were unrelated to this PR. I hope you were able to otherwise test this satisfactorily.

@mihaimaruseac (Collaborator) commented Nov 25, 2019

We have some issues with the tests on the old release branches. We're working on fixes.

@sunway513 (Contributor) commented Nov 26, 2019

Hi @mihaimaruseac, could you provide an updated ETA for merging this PR into the r1.15 release branch?

@mihaimaruseac (Collaborator) commented Nov 26, 2019

I think first week of December?

@sunway513 (Contributor) commented Dec 3, 2019

@mihaimaruseac gentle ping, thanks!

@jeffdaily (Contributor, Author) commented Dec 9, 2019

@mihaimaruseac gentle ping, thanks.

@mihaimaruseac (Collaborator) commented Dec 9, 2019

Apologies. I haven't yet had a chance to investigate why the CI fails on the release branches. It seems to be picking up some configuration from master, but I haven't had time to dig into it further.

@jeffdaily (Contributor, Author) commented Dec 18, 2019

@mihaimaruseac gentle ping, thanks. This is also blocking the subsequent PR #34769.

@mihaimaruseac (Collaborator) commented Dec 20, 2019

Apologies for the delay. I just tried to get #33981 merged so that the builds would run against the r1.15 branch instead of master. However, the builds use remote execution, and at the moment our remote execution is only configured for HEAD.

I'll try over the holidays, bring in @gunan too, and see what we can do.

@mihaimaruseac (Collaborator) commented Dec 26, 2019

I got the Linux builds and one Windows build to pass on the 2.0 equivalent of #33981 (#33982). I'll try to get the others too, but by the start of the new year I will cherry-pick these changes into #33981 with as many presubmits passing as possible.

@gunan (Member) commented Dec 27, 2019

Actually, remote builds do not care which branch we are running from. All remote build parameters, Docker containers, and compiler options are baked into the branch, under the third_party/toolchains/preconfig folder.
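
For illustration, a branch-pinned remote build of the kind described above might be invoked as follows. This is a hypothetical sketch; the config name and the crosstool path under third_party/toolchains/preconfig are assumptions that differ per branch, not values taken from this PR:

    # Hypothetical remote build using the toolchains checked into the branch
    # (the crosstool path is illustrative only).
    bazel build \
      --config=remote \
      --crosstool_top=//third_party/toolchains/preconfig/ubuntu16.04/gcc7_manylinux2010:toolchain \
      //tensorflow/tools/pip_package:build_pip_package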

@deven-amd (Contributor) commented Jan 15, 2020

@mihaimaruseac @gunan, gentle ping.

Is there anything we can or need to do on our end to help out?

@mihaimaruseac (Collaborator) commented Jan 15, 2020

#33981 was merged less than an hour ago. This means we can attempt running presubmits on the branch.

Some will probably fail at the moment since the VMs changed, but I will run the presubmits again later tonight, when the VMs for this branch can be reused.

@mihaimaruseac (Collaborator) commented Jan 15, 2020

The Windows Bazel GPU and Ubuntu Sanity failures are expected at this time; I'll retry them later.

@mihaimaruseac (Collaborator) commented Jan 15, 2020

@jeffdaily @deven-amd can you please check the Linux GPU build? I'm still going to run it on the 1.15 VMs around 8 hours from now, just in case the failure is related to the VMs rather than to the PR.

Once this is merged, I think #34769 and #35230 are next. Both of them are gated on this one, right?

@mihaimaruseac (Collaborator) commented Jan 16, 2020

According to https://source.cloud.google.com/results/invocations/efa3a582-4b1e-472a-af44-1bad8131553e/targets/%2F%2Ftensorflow%2Fcore%2Fkernels:collective_nccl_test_gpu/log (the only failure on the proper VMs), we have some failures introduced by this PR (the other PRs on the branch are completely green):

2020-01-16 06:03:24.050562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 1628 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
2020-01-16 06:03:24.154879: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-01-16 06:03:24.154914: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(+0x10462bd)[0x7f0ca2e412bd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f0ca1bef390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f0ca0b9c428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f0ca0b9e02a]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(+0x15df094)[0x7f0ca33da094]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f0ca2e1ef27]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f0ca2e1f7bf]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f0ca2e26e81]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f0ca2e24578]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(+0x1685a8f)[0x7f0ca3480a8f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f0ca1be56ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0ca0c6e41d]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
	tensorflow::CurrentStackTrace()


	gsignal
	abort

	tensorflow::EventMgr::PollEvents(bool, absl::InlinedVector<tensorflow::EventMgr::InUse, 4ul, std::allocator<tensorflow::EventMgr::InUse> >*)
	tensorflow::EventMgr::PollLoop()
	Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
	std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)


	clone
*** End stack trace ***
external/bazel_tools/tools/test/test-setup.sh: line 310:    16 Aborted                 (core dumped) "${TEST_PATH}" "$@" 2>&1
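
For reference, the failing target from the log above can be exercised directly with Bazel. A minimal sketch for a local CUDA machine follows; the --config=cuda flag and the repeat count are assumptions about the local setup, not taken from the CI configuration:

    # Re-run the failing GPU test locally, repeating it to catch the flaky crash.
    bazel test --config=cuda \
      --test_output=errors \
      --runs_per_test=10 \
      //tensorflow/core/kernels:collective_nccl_test_gpu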
@jeffdaily (Contributor, Author) commented Jan 20, 2020

@mihaimaruseac thank you for bringing this test to my attention. I am trying to reproduce it now on our platform.

@googlebot commented Jan 20, 2020

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only "@googlebot I consent." in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).


@googlebot googlebot added cla: no and removed cla: yes labels Jan 20, 2020
@jerryyin jerryyin force-pushed the ROCmSoftwarePlatform:r1.15-rccl-upstream-patch branch from 0f9dc72 to c954406 Jan 20, 2020
@googlebot commented Jan 20, 2020

CLAs look good, thanks!


@googlebot googlebot added cla: yes and removed cla: no labels Jan 20, 2020
@jeffdaily jeffdaily requested a review from mihaimaruseac Jan 20, 2020
@jeffdaily (Contributor, Author) commented Jan 20, 2020

@mihaimaruseac I was able to reproduce the failing test more than once. After this change (c954406), I was no longer able to reproduce it. Let's watch CI now. Thanks.

@mihaimaruseac mihaimaruseac merged commit 360b2e3 into tensorflow:r1.15 Jan 21, 2020
7 of 9 checks passed
MacOS CPU Python3: Internal CI build started.
MacOS Python2 and CC: Internal CI build started.
Android Demo App: Internal CI build successful
Linux GPU: Internal CI build successful
Ubuntu CPU: Internal CI build successful
Ubuntu Sanity: Internal CI build successful
Windows Bazel: Internal CI build successful
Windows Bazel GPU: Internal CI build successful
cla/google: All necessary CLAs are signed
@mihaimaruseac (Collaborator) commented Jan 21, 2020

Thank you
