[ROCm] This change replaces the original assert for detecting multiple #49232

rsanthanam-amd · 2021-05-17T13:35:59Z

NCCL managers in favor of a warning log message.

The original assert was added to protect against a potential deadlock
scenario (that is specific to ROCm) when multiple hosts are emulated
using a single host. This can happen in this scenario because two
different NCCL streams can map to the same hardware queue and block
each other from making forward progress.

This deadlock seems to occur only in this specific scenario.

However, the assert was preventing other scenarios involving multiple
NCCL managers from executing even though they did not cause a deadlock.

For example, TF creates a single eager context (which creates a NCCL manager)
prior to running a model in eager mode. However, other eager contexts can be
created for single use purposes which results in additional NCCL managers being
created but not used during the model run. In these scenarios, deadlock is
impossible and thus the creation of multiple NCCL managers should be allowed
to proceed.

The specific case which is prompting this change is that in addition to the
regular eager context, an additional context is created by the MLIR constant
folding optimization to compute the constant at compile time so that it can
be used as a replacement later in the compilation process. This additional
context is only used once and cannot cause a deadlock.

This change enables these cases to execute successfully.

In the future, the original assert will have to be reinstated with
additional logic restricting its execution to the specific scenario(s)
which cause the deadlock.
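
To make the trade-off above concrete, here is a minimal Python sketch. It is not the TensorFlow C++ code touched by this PR; `ManagerRegistry`, its `strict` flag, and the logger name are hypothetical names chosen for illustration. It only models counting manager instances and either failing hard (the original assert) or logging a warning (this change) when a second one appears.

```python
import logging

logger = logging.getLogger("nccl_sketch")  # hypothetical logger name


class ManagerRegistry:
    """Toy model that counts how many managers have been created."""

    def __init__(self, strict):
        # strict=True models the original assert, strict=False the warning.
        self.strict = strict
        self.count = 0

    def register(self):
        """Record one more manager and react if it is not the first."""
        self.count += 1
        if self.count > 1:
            msg = "multiple NCCL managers detected (%d)" % self.count
            if self.strict:
                raise RuntimeError(msg)
            logger.warning(msg)
        return self.count
```

With `strict=False`, a short-lived second context (such as the one described for constant folding) registers without aborting the process; with `strict=True`, the second `register()` call raises.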

rsanthanam-amd · 2021-05-17T13:39:48Z

FYI.

/cc @cheshire @chsigg

rsanthanam-amd · 2021-05-21T12:21:34Z

@chsigg Gentle ping.

rsanthanam-amd · 2021-05-27T12:07:49Z

@chsigg Gentle ping.

rsanthanam-amd · 2021-06-05T14:28:46Z

@cheshire gentle ping.

cheshire · 2021-06-07T18:28:57Z

@jurahul any thoughts on this?

cheshire · 2021-06-07T18:34:59Z

@rsanthanam-amd sorry I am confused. In general, we try to avoid warning log messages: if it's an error, we should check at least dynamically, and if it's not an error, we should not print anything.

@jpienaar can we avoid creating the NCCL manager from the TF eager context used by the MLIR bridge?

Could you clarify the connection with ROCm?

jpienaar · 2021-06-07T18:48:49Z

> @rsanthanam-amd sorry I am confused. In general, we try to avoid warning log messages: if it's an error, we should check at least dynamically, and if it's not an error, we should not print anything.
>
> @jpienaar can we avoid creating the NCCL manager from the TF eager context used by the MLIR bridge?
>
> Could you clarify the connection with ROCm?

I'm surprised a NCCL manager is created given GPU count is set to 0 there - at least I can't think of where a NCCL manager would be useful without any GPU devices enabled. I think we should be able to avoid that.

cheshire · 2021-06-07T18:58:42Z

This is a PR from ~21 days ago, is it possible that the bug is no longer there?

gbaned · 2021-06-30T15:15:19Z

@cheshire, @jpienaar, @rsanthanam-amd Any update on this PR? Please. Thanks!

cheshire · 2021-06-30T15:25:44Z

@rsanthanam-amd do you still hit the original issue?

rsanthanam-amd · 2021-06-30T15:29:08Z

@cheshire I recently tried to reproduce the problem but the model I am using is experiencing problems because of the keras decoupling. I will look into this as soon as I can sort out that problem.

cheshire · 2021-06-30T15:31:22Z

@rsanthanam-amd is this the issue about checking isinstance on OptimizerV2, by chance?

rsanthanam-amd · 2021-06-30T15:32:22Z

@cheshire Yes, exactly!

cheshire · 2021-06-30T21:59:11Z

@rsanthanam-amd could you please describe exactly how you hit it?

rsanthanam-amd · 2021-06-30T22:21:28Z

@cheshire The model I am working with uses a LARS optimizer. Here is the error message: "TypeError: "inner_optimizer" must be an instance of OptimizerV2, but got: <lars_optimizer.LARSOptimizer object at 0x7f6b390250f0>"

cheshire · 2021-06-30T23:32:51Z

So the underlying issue is that you are mixing Keras classes from `import keras` and from `from tensorflow.python import keras`. If you remove all imports of the latter module, the issue should be resolved. Where is your LARSOptimizer definition?
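
The failure mode described here can be reproduced without TensorFlow or Keras. The sketch below is illustrative only: it builds two stand-in modules (`keras_standalone` and `keras_bundled`, both hypothetical names) that each define their own `OptimizerV2` class, which is an assumption about how two Keras copies would behave when imported through different paths.

```python
import types


def make_module(name):
    """Create a stand-in module that defines its own OptimizerV2 class."""
    mod = types.ModuleType(name)

    class OptimizerV2:
        pass

    mod.OptimizerV2 = OptimizerV2
    return mod


# Two separate copies of "the same" class, as if imported via different paths.
standalone = make_module("keras_standalone")  # hypothetical stand-in
bundled = make_module("keras_bundled")        # hypothetical stand-in


class LARSOptimizer(bundled.OptimizerV2):
    """Model-side optimizer that subclasses the bundled copy."""


opt = LARSOptimizer()

# The class names match, but the class objects are distinct.
same_name = standalone.OptimizerV2.__name__ == bundled.OptimizerV2.__name__
passes_bundled = isinstance(opt, bundled.OptimizerV2)
passes_standalone = isinstance(opt, standalone.OptimizerV2)
```

Because `isinstance` compares class objects rather than names, a check against the other copy fails even though both classes are called `OptimizerV2`. Importing from a single package everywhere removes the mismatch.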

rsanthanam-amd · 2021-07-01T13:24:44Z

@cheshire Thank you so much! Your suggestion worked and I got the model running. This LARSOptimizer seems to be part of the model source.

I synced our ROCm fork with this repo on Monday and I can confirm that in our synced repo, this multiple nccl manager problem still exists.

jpienaar · 2021-07-01T14:10:09Z

Could you report the call stacks of where the NcclManagers are created?

rsanthanam-amd · 2021-07-01T15:44:55Z

Call stack No. 1

#0 0x00007f2ef0b2cfb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2cf7f11834 in tensorflow::MaybeCreateNcclCommunicator() ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f2cf7efe26a in tensorflow::EagerContext::EagerContext(tensorflow::SessionOptions const&, tensorflow::ContextDevicePlacementPolicy, bool, tensorflow::DeviceMgr*, bool, tensorflow::Rendezvous*, tensorflow::DistributedFunctionLibraryRuntime*, tensorflow::CollectiveExecutorMgrInterface*, bool) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f2cf15ae71e in TFE_NewContext () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f2ce18c73b5 in pybind11::cpp_function::initialize<pybind11_init__pywrap_tfe(pybind11::module&)::{lambda(TFE_ContextOptions const*)#8}, pybind11::object, TFE_ContextOptions const*, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::return_value_policy>(pybind11_init__pywrap_tfe(pybind11::module&)::{lambda(TFE_ContextOptions const*)#8}&&, pybind11::object ()(TFE_ContextOptions const), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::return_value_policy const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tfe.so
#5 0x00007f2ce18c4e96 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tfe.so

Call stack No. 2

#0 0x00007f2ef0b2cfb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2cf7f11834 in tensorflow::MaybeCreateNcclCommunicator() ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f2cf7efe26a in tensorflow::EagerContext::EagerContext(tensorflow::SessionOptions const&, tensorflow::ContextDevicePlacementPolicy, bool, tensorflow::DeviceMgr*, bool, tensorflow::Rendezvous*, tensorflow::DistributedFunctionLibraryRuntime*, tensorflow::CollectiveExecutorMgrInterface*, bool) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f2cf15ae71e in TFE_NewContext () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f2cf0541fa0 in mlir::TF::ConstantFoldFallbackHook(mlir::Operation*, llvm::ArrayRefmlir::Attribute, llvm::SmallVectorImplmlir::OpFoldResult&)::{lambda()#2}::operator()() const [clone .isra.161] () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f2cf15a755f in mlir::TF::ConstantFoldFallbackHook(mlir::Operation*, llvm::ArrayRefmlir::Attribute, llvm::SmallVectorImplmlir::OpFoldResult&)
() from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f2cf8ac3595 in mlir::TF::(anonymous namespace)::TFConstantFoldInterface::fold(mlir::Operation*, llvm::ArrayRefmlir::Attribute, llvm::SmallVectorImplmlir::OpFoldResult&) const () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f2cfd2fe25a in mlir::Operation::fold(llvm::ArrayRefmlir::Attribute, llvm::SmallVectorImplmlir::OpFoldResult&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f2cfa0e87b0 in mlir::OperationFolder::tryToFold(mlir::OpBuilder&, mlir::Operation*, llvm::SmallVectorImplmlir::Value&, llvm::function_ref<void (mlir::Operation*)>) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f2cfa0e9a04 in mlir::OperationFolder::tryToFold(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, llvm::function_ref<void (mlir::Operation*)>, bool*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f2cfa0ebc41 in mlir::applyPatternsAndFoldGreedily(llvm::MutableArrayRefmlir::Region, mlir::FrozenRewritePatternSet const&, mlir::GreedyRewriteConfig) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f2cfa457ebb in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f2cfa458212 in mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_deletemlir::Pass >, mlir::Pass> >, mlir::Operation, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f2cfa456b71 in mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f2cfa45808f in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007f2cfa458212 in mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_deletemlir::Pass >, mlir::Pass> >, mlir::Operation, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#16 0x00007f2cfa4597fa in mlir::PassManager::runPasses(mlir::Operation*, mlir::AnalysisManager) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#17 0x00007f2cfa45a1a5 in mlir::PassManager::run(mlir::Operation*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#18 0x00007f2cf858c55d in tensorflow::CompileGraphSetup(mlir::ModuleOp, llvm::ArrayReftensorflow::XlaArgument, std::vector<int, std::allocator >, llvm::SmallVector<tensorflow::TensorOrResourceShape, 4u>&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#19 0x00007f2cf858ce1f in tensorflow::BuildHloFromModule(mlir::ModuleOp, xla::XlaBuilder&, llvm::ArrayRefxla::XlaOp, std::vector<xla::XlaOp, std::allocatorxla::XlaOp >&, llvm::ArrayReftensorflow::XlaArgument, llvm::StringRef, llvm::MutableArrayRef<std::unique_ptr<mlir::Pass, std::default_deletemlir::Pass > >) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#20 0x00007f2cf8591b3b in tensorflow::BuildHloFromGraph(tensorflow::Graph const&, xla::XlaBuilder&, llvm::ArrayRefxla::XlaOp, std::vector<xla::XlaOp, std::allocatorxla::XlaOp >&, llvm::ArrayReftensorflow::XlaArgument, llvm::ArrayRef<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, llvm::StringRef, tensorflow::FunctionLibraryDefinition const&, tensorflow::GraphDebugInfo const&, llvm::MutableArrayRef<std::unique_ptr<mlir::Pass, std::default_deletemlir::Pass > >) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#21 0x00007f2cf85434fd in tensorflow::MlirXlaOpKernel::ConstructXlaOp(tensorflow::XlaOpKernelContext) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#22 0x00007f2cf85447f9 in tensorflow::MlirXlaOpKernel::Compile(tensorflow::XlaOpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#23 0x00007f2cf8577d66 in tensorflow::XlaOpKernel::Compute(tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#24 0x00007f2cf8e60137 in tensorflow::XlaCompilationDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#25 0x00007f2cf85620f9 in tensorflow::GraphCompiler::Compile() ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#26 0x00007f2cf8574d44 in tensorflow::XlaCompiler::CompileGraph(tensorflow::XlaCompiler::CompileOptions const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::unique_ptr<tensorflow::Graph, std::default_deletetensorflow::Graph >, absl::lts_20210324::Span<tensorflow::XlaArgument const>, tensorflow::XlaCompilationResult*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#27 0x00007f2cf857711a in tensorflow::XlaCompiler::CompileFunction(tensorflow::XlaCompiler::CompileOptions const&, tensorflow::NameAttrList const&, absl::lts_20210324::Span<tensorflow::XlaArgument const>, tensorflow::XlaCompilationResult*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#28 0x00007f2cf8544cd0 in std::_Function_handler<tensorflow::Status (tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, tensorflow::XlaCompilationResult*), tensorflow::XlaCompilationCache::Compile(tensorflow::XlaCompiler::Options const&, tensorflow::NameAttrList const&, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, tensorflow::XlaCompiler::CompileOptions const&, tensorflow::XlaCompilationCache::CompileMode, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**)::{lambda(tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, tensorflow::XlaCompilationResult*)#1}>::_M_invoke(std::_Any_data const&, tensorflow::XlaCompiler*&&, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, tensorflow::XlaCompilationResult*&&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#29 0x00007f2cf854bb8f in tensorflow::XlaCompilationCache::CompileStrict(tensorflow::XlaCompilationCache::Entry*, tensorflow::XlaCompiler::Options const&, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::function<tensorflow::Status (tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, tensorflow::XlaCompilationResult*)> const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#30 0x00007f2cf854d9dc in tensorflow::XlaCompilationCache::CompileImpl(tensorflow::XlaCompiler::Options const&, tensorflow::NameAttrList const&, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, std::function<tensorflow::Status (tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, tensorflow::XlaCompilationResult*)> const&, tensorflow::XlaCompilationCache::CompileMode, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#31 0x00007f2cf854e836 in tensorflow::XlaCompilationCache::Compile(tensorflow::XlaCompiler::Options const&, tensorflow::NameAttrList const&, std::vector<tensorflow::XlaArgument, std::allocatortensorflow::XlaArgument > const&, tensorflow::XlaCompiler::CompileOptions const&, tensorflow::XlaCompilationCache::CompileMode, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#32 0x00007f2cf1df6da9 in tensorflow::CompileToLocalExecutable(tensorflow::OpKernelContext*, tensorflow::NameAttrList const&, bool, tensorflow::XlaPlatformInfo const&, absl::lts_20210324::Span<tensorflow::Tensor const* const>, absl::lts_20210324::Span<tensorflow::VariableInfo const>, absl::lts_20210324::Span, tensorflow::XlaCompilationCache::CompileMode, bool, xla::LocalClient**, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#33 0x00007f2cf1df981c in tensorflow::XlaLocalLaunchBase::Compute(tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#34 0x00007f2ce40495f9 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#35 0x00007f2ce414242f in tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::Process(tensorflow::PropagatorState::TaggedNode, long) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#36 0x00007f2ce4143718 in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::RunTask<tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::ScheduleReady(absl::lts_20210324::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocatortensorflow::PropagatorState::TaggedNode >, tensorflow::PropagatorState::TaggedNodeReadyQueue)::{lambda()#2}>(tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::ScheduleReady(absl::lts_20210324::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocatortensorflow::PropagatorState::TaggedNode >, tensorflow::PropagatorState::TaggedNodeReadyQueue)::{lambda()#2}&&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#37 0x00007f2cf15f2181 in Eigen::ThreadPoolTempltensorflow::thread::EigenEnvironment::WorkerLoop(int) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#38 0x00007f2cf15ef873 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#39 0x00007f2ce45dc507 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#40 0x00007f2ef08d66db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#41 0x00007f2ef0c0f71f in clone () from /lib/x86_64-linux-gnu/libc.so.6

jpienaar · 2021-07-02T01:35:40Z

There is a pending change here, out for review, that avoids the init when 0 GPU devices are configured; just waiting for the reviews to complete.

jpienaar · 2021-07-02T21:58:30Z

fc158eb should avoid creating these now where we have 0 GPUs set.
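
The idea described in this comment, skipping creation when no GPUs are configured, can be sketched as a simple guard. This is not the content of the referenced commit; `maybe_create_communicator`, `gpu_count`, and `factory` are hypothetical names used only to illustrate the pattern.

```python
def maybe_create_communicator(gpu_count, factory):
    """Return a new communicator, or None when no GPU devices are configured.

    `factory` is any zero-argument callable that builds the communicator.
    It is only invoked when at least one GPU is configured.
    """
    if gpu_count == 0:
        return None
    return factory()
```

Under this pattern, a context configured with zero GPUs (as mentioned earlier in the thread for the constant-folding context) never triggers the factory, so it cannot add a second manager.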

gbaned · 2021-07-08T11:21:29Z

@rsanthanam-amd Can you please check @jpienaar's comments and keep us posted ? Thanks!

rsanthanam-amd · 2021-07-08T11:42:56Z

@jpienaar @gbaned I verified that jpienaar's fix does indeed address the issue and I am able to run my model without the need for the PR.

This PR can be canceled.

rsanthanam-amd · 2021-07-14T15:07:39Z

@cheshire Since @jpienaar fixed the underlying issue, can I cancel and close this PR?

rsanthanam-amd requested a review from chsigg as a code owner May 17, 2021 13:36

google-ml-butler bot added the size:S CL Change Size: Small label May 17, 2021

google-cla bot added the cla: yes label May 17, 2021

rsanthanam-amd changed the title ~~This change replaces the original assert for detecting multiple~~ [ROCm] This change replaces the original assert for detecting multiple May 17, 2021

deven-amd added the kokoro:force-run Tests on submitted change label May 17, 2021

kokoro-team removed the kokoro:force-run Tests on submitted change label May 17, 2021

gbaned self-assigned this May 18, 2021

gbaned added this to Assigned Reviewer in PR Queue via automation May 18, 2021

gbaned added comp:gpu GPU related issues awaiting review Pull request awaiting review labels May 19, 2021

gbaned added stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Jul 8, 2021

cheshire closed this Jul 14, 2021

PR Queue automation moved this from Assigned Reviewer to Closed/Rejected Jul 14, 2021

rsanthanam-amd deleted the upstream_rocm_multiple_nccl_manager_fix branch July 21, 2021 13:20

akuegel mentioned this pull request Dec 13, 2022

[ROCm] Fix for multiple NCCL manager issue. #58090

Closed
