[ROCm] This change replaces the original assert for detecting multiple #49232

Conversation

rsanthanam-amd
Contributor

NCCL managers in favor of a warning log message.

The original assert was added to protect against a potential deadlock
scenario (specific to ROCm) that arises when multiple hosts are emulated
using a single host. In that configuration, two different NCCL streams can
map to the same hardware queue and block each other from making forward
progress.

This deadlock appears to occur only in that specific configuration.

However, the assert was also preventing other scenarios involving multiple
NCCL managers from executing, even though those scenarios do not cause a deadlock.

For example, TF creates a single eager context (which creates an NCCL manager)
prior to running a model in eager mode. However, other eager contexts can be
created for single-use purposes, which results in additional NCCL managers being
created but not used during the model run. In these scenarios deadlock is
impossible, and thus the creation of multiple NCCL managers should be allowed
to proceed.

The specific case prompting this change is that, in addition to the regular
eager context, an additional context is created by the MLIR constant folding
optimization to compute the constant at compile time so that it can be used
as a replacement later in the compilation process. This additional context is
only used once and cannot cause a deadlock.

This change enables these cases to execute successfully.

In the future, the original assert will have to be reinstated with additional
logic that restricts it to the specific scenario(s) that cause the deadlock.
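
For illustration, here is a minimal C++ sketch of the kind of change described above. This is not the TensorFlow source; the class, counter, and message are invented for the example, and the real change would presumably go through TensorFlow's LOG(WARNING) machinery rather than std::cerr.

#include <atomic>
#include <iostream>

// Illustrative stand-in for a per-process NCCL manager with an instance counter.
class DummyNcclManager {
 public:
  DummyNcclManager() {
    const int count = instance_count_.fetch_add(1) + 1;
    // Before: hard-fail on a second manager, because two NCCL streams that
    // map to the same ROCm hardware queue can block each other (deadlock).
    //   assert(count == 1 && "multiple NCCL managers not supported on ROCm");
    // After: allow it but warn, since benign cases (e.g. a short-lived eager
    // context used only for MLIR constant folding) also create extra managers.
    if (count > 1) {
      std::cerr << "WARNING: " << count << " NCCL managers exist; this may "
                   "deadlock on ROCm if both issue collectives concurrently.\n";
    }
  }

 private:
  static std::atomic<int> instance_count_;
};

std::atomic<int> DummyNcclManager::instance_count_{0};

int main() {
  DummyNcclManager first;   // first manager: no warning
  DummyNcclManager second;  // second manager: warning instead of an assert failure
  return 0;
}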

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label May 17, 2021
@google-cla google-cla bot added the cla: yes label May 17, 2021
@rsanthanam-amd
Contributor Author

FYI.

/cc @cheshire @chsigg

@rsanthanam-amd rsanthanam-amd changed the title from "This change replaces the original assert for detecting multiple" to "[ROCm] This change replaces the original assert for detecting multiple" on May 17, 2021
@deven-amd deven-amd added the kokoro:force-run Tests on submitted change label May 17, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 17, 2021
@gbaned gbaned self-assigned this May 18, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 18, 2021
@gbaned gbaned added comp:gpu GPU related issues awaiting review Pull request awaiting review labels May 19, 2021
@rsanthanam-amd
Contributor Author

@chsigg Gentle ping.

@rsanthanam-amd
Contributor Author

@chsigg Gentle ping.

@rsanthanam-amd
Contributor Author

@cheshire gentle ping.

@cheshire
Member

cheshire commented Jun 7, 2021

@jurahul any thoughts on this?

@cheshire
Member

cheshire commented Jun 7, 2021

@rsanthanam-amd sorry I am confused. In general, we try to avoid warning log messages: if it's an error, we should check at least dynamically, and if it's not an error, we should not print anything.

@jpienaar can we avoid creating the NCCL manager from the TF eager context used by the MLIR bridge?

Could you clarify the connection with ROCm?

@jpienaar
Member

jpienaar commented Jun 7, 2021

> @rsanthanam-amd sorry I am confused. In general, we try to avoid warning log messages: if it's an error, we should check at least dynamically, and if it's not an error, we should not print anything.
>
> @jpienaar can we avoid creating the NCCL manager from the TF eager context used by the MLIR bridge?
>
> Could you clarify the connection with ROCm?

I'm surprised an NCCL manager is created given the GPU count is set to 0 there - at least I can't think of where an NCCL manager would be useful without any GPU devices enabled. I think we should be able to avoid that.
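
A rough sketch of the kind of guard being suggested here, in C++ with invented names; this is not the actual TensorFlow API or the eventual fix, just the shape of the idea (skip NCCL setup entirely when no GPU devices are configured):

#include <iostream>
#include <memory>

// Illustrative stand-in; not TensorFlow's internal communicator type.
struct DummyNcclCommunicator {};

// Hypothetical helper: only build a communicator when at least one GPU is
// visible, so contexts created with zero GPUs (e.g. the context used for MLIR
// constant folding) never construct an NCCL manager in the first place.
std::unique_ptr<DummyNcclCommunicator> MaybeCreateCommunicator(int gpu_count) {
  if (gpu_count == 0) {
    return nullptr;  // nothing to coordinate across devices; skip NCCL
  }
  return std::make_unique<DummyNcclCommunicator>();
}

int main() {
  auto comm = MaybeCreateCommunicator(/*gpu_count=*/0);
  std::cout << (comm ? "created" : "skipped") << "\n";  // prints "skipped"
  return 0;
}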

@cheshire
Member

cheshire commented Jun 7, 2021

This PR is from ~21 days ago; is it possible that the bug is no longer there?

@gbaned
Contributor

gbaned commented Jun 30, 2021

@cheshire, @jpienaar, @rsanthanam-amd Any update on this PR, please? Thanks!

@cheshire
Member

@rsanthanam-amd do you still hit the original issue?

@rsanthanam-amd
Contributor Author

@cheshire I recently tried to reproduce the problem, but the model I am using is running into issues because of the Keras decoupling. I will look into this as soon as I can sort out that problem.

@cheshire
Member

@rsanthanam-amd is this the issue about checking isinstance on OptimizerV2, by chance?

@rsanthanam-amd
Contributor Author

@cheshire Yes, exactly!

@cheshire
Member

@rsanthanam-amd could you please describe exactly how you hit it?

@rsanthanam-amd
Contributor Author

@cheshire The model I am working with uses a LARS optimizer. Here is the error message: "TypeError: "inner_optimizer" must be an instance of OptimizerV2, but got: <lars_optimizer.LARSOptimizer object at 0x7f6b390250f0>"

@cheshire
Member

So the underlying issue is that you are mixing Keras classes imported via "import keras" with classes imported via "from tensorflow.python import keras". If you remove all imports of the latter, the issue should be resolved. Where is your LARSOptimizer definition?

@rsanthanam-amd
Contributor Author

@cheshire Thank you so much! Your suggestion worked and I got the model running. The LARSOptimizer seems to be part of the model source.

I synced our ROCm fork with this repo on Monday, and I can confirm that the multiple NCCL manager problem still exists in the synced repo.

@jpienaar
Member

jpienaar commented Jul 1, 2021

Could you report the call stacks of where the NcclManagers are created?

@rsanthanam-amd
Contributor Author

Call stack No. 1

#0 0x00007f2ef0b2cfb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2cf7f11834 in tensorflow::MaybeCreateNcclCommunicator() ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f2cf7efe26a in tensorflow::EagerContext::EagerContext(tensorflow::SessionOptions const&, tensorflow::ContextDevicePlacementPolicy, bool, tensorflow::DeviceMgr*, bool, tensorflow::Rendezvous*, tensorflow::DistributedFunctionLibraryRuntime*, tensorflow::CollectiveExecutorMgrInterface*, bool) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f2cf15ae71e in TFE_NewContext () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f2ce18c73b5 in pybind11::cpp_function::initialize<pybind11_init__pywrap_tfe(pybind11::module&)::{lambda(TFE_ContextOptions const*)#8}, pybind11::object, TFE_ContextOptions const*, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::return_value_policy>(pybind11_init__pywrap_tfe(pybind11::module&)::{lambda(TFE_ContextOptions const*)#8}&&, pybind11::object (*)(TFE_ContextOptions const*), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::return_value_policy const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tfe.so
#5 0x00007f2ce18c4e96 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tfe.so

Call stack No. 2

#0 0x00007f2ef0b2cfb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2cf7f11834 in tensorflow::MaybeCreateNcclCommunicator() ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f2cf7efe26a in tensorflow::EagerContext::EagerContext(tensorflow::SessionOptions const&, tensorflow::ContextDevicePlacementPolicy, bool, tensorflow::DeviceMgr*, bool, tensorflow::Rendezvous*, tensorflow::DistributedFunctionLibraryRuntime*, tensorflow::CollectiveExecutorMgrInterface*, bool) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f2cf15ae71e in TFE_NewContext () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f2cf0541fa0 in mlir::TF::ConstantFoldFallbackHook(mlir::Operation*, llvm::ArrayRef<mlir::Attribute>, llvm::SmallVectorImpl<mlir::OpFoldResult>&)::{lambda()#2}::operator()() const [clone .isra.161] () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f2cf15a755f in mlir::TF::ConstantFoldFallbackHook(mlir::Operation*, llvm::ArrayRef<mlir::Attribute>, llvm::SmallVectorImpl<mlir::OpFoldResult>&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f2cf8ac3595 in mlir::TF::(anonymous namespace)::TFConstantFoldInterface::fold(mlir::Operation*, llvm::ArrayRef<mlir::Attribute>, llvm::SmallVectorImpl<mlir::OpFoldResult>&) const () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f2cfd2fe25a in mlir::Operation::fold(llvm::ArrayRef<mlir::Attribute>, llvm::SmallVectorImpl<mlir::OpFoldResult>&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f2cfa0e87b0 in mlir::OperationFolder::tryToFold(mlir::OpBuilder&, mlir::Operation*, llvm::SmallVectorImpl<mlir::Value>&, llvm::function_ref<void (mlir::Operation*)>) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f2cfa0e9a04 in mlir::OperationFolder::tryToFold(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, llvm::function_ref<void (mlir::Operation*)>, bool*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f2cfa0ebc41 in mlir::applyPatternsAndFoldGreedily(llvm::MutableArrayRef<mlir::Region>, mlir::FrozenRewritePatternSet const&, mlir::GreedyRewriteConfig) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f2cfa457ebb in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f2cfa458212 in mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> >, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f2cfa456b71 in mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f2cfa45808f in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007f2cfa458212 in mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> >, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#16 0x00007f2cfa4597fa in mlir::PassManager::runPasses(mlir::Operation*, mlir::AnalysisManager) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#17 0x00007f2cfa45a1a5 in mlir::PassManager::run(mlir::Operation*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#18 0x00007f2cf858c55d in tensorflow::CompileGraphSetup(mlir::ModuleOp, llvm::ArrayRef<tensorflow::XlaArgument>, std::vector<int, std::allocator<int> >, llvm::SmallVector<tensorflow::TensorOrResourceShape, 4u>&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#19 0x00007f2cf858ce1f in tensorflow::BuildHloFromModule(mlir::ModuleOp, xla::XlaBuilder&, llvm::ArrayRef<xla::XlaOp>, std::vector<xla::XlaOp, std::allocator<xla::XlaOp> >&, llvm::ArrayRef<tensorflow::XlaArgument>, llvm::StringRef, llvm::MutableArrayRef<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> > >) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#20 0x00007f2cf8591b3b in tensorflow::BuildHloFromGraph(tensorflow::Graph const&, xla::XlaBuilder&, llvm::ArrayRef<xla::XlaOp>, std::vector<xla::XlaOp, std::allocator<xla::XlaOp> >&, llvm::ArrayRef<tensorflow::XlaArgument>, llvm::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, llvm::StringRef, tensorflow::FunctionLibraryDefinition const&, tensorflow::GraphDebugInfo const&, llvm::MutableArrayRef<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> > >) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#21 0x00007f2cf85434fd in tensorflow::MlirXlaOpKernel::ConstructXlaOp(tensorflow::XlaOpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#22 0x00007f2cf85447f9 in tensorflow::MlirXlaOpKernel::Compile(tensorflow::XlaOpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#23 0x00007f2cf8577d66 in tensorflow::XlaOpKernel::Compute(tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#24 0x00007f2cf8e60137 in tensorflow::XlaCompilationDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#25 0x00007f2cf85620f9 in tensorflow::GraphCompiler::Compile() ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#26 0x00007f2cf8574d44 in tensorflow::XlaCompiler::CompileGraph(tensorflow::XlaCompiler::CompileOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<tensorflow::Graph, std::default_delete<tensorflow::Graph> >, absl::lts_20210324::Span<tensorflow::XlaArgument const>, tensorflow::XlaCompilationResult*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#27 0x00007f2cf857711a in tensorflow::XlaCompiler::CompileFunction(tensorflow::XlaCompiler::CompileOptions const&, tensorflow::NameAttrList const&, absl::lts_20210324::Span<tensorflow::XlaArgument const>, tensorflow::XlaCompilationResult*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#28 0x00007f2cf8544cd0 in std::_Function_handler<tensorflow::Status (tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, tensorflow::XlaCompilationResult*), tensorflow::XlaCompilationCache::Compile(tensorflow::XlaCompiler::Options const&, tensorflow::NameAttrList const&, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, tensorflow::XlaCompiler::CompileOptions const&, tensorflow::XlaCompilationCache::CompileMode, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**)::{lambda(tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, tensorflow::XlaCompilationResult*)#1}>::_M_invoke(std::_Any_data const&, tensorflow::XlaCompiler*&&, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, tensorflow::XlaCompilationResult*&&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#29 0x00007f2cf854bb8f in tensorflow::XlaCompilationCache::CompileStrict(tensorflow::XlaCompilationCache::Entry*, tensorflow::XlaCompiler::Options const&, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<tensorflow::Status (tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, tensorflow::XlaCompilationResult*)> const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#30 0x00007f2cf854d9dc in tensorflow::XlaCompilationCache::CompileImpl(tensorflow::XlaCompiler::Options const&, tensorflow::NameAttrList const&, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, std::function<tensorflow::Status (tensorflow::XlaCompiler*, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, tensorflow::XlaCompilationResult*)> const&, tensorflow::XlaCompilationCache::CompileMode, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#31 0x00007f2cf854e836 in tensorflow::XlaCompilationCache::Compile(tensorflow::XlaCompiler::Options const&, tensorflow::NameAttrList const&, std::vector<tensorflow::XlaArgument, std::allocator<tensorflow::XlaArgument> > const&, tensorflow::XlaCompiler::CompileOptions const&, tensorflow::XlaCompilationCache::CompileMode, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#32 0x00007f2cf1df6da9 in tensorflow::CompileToLocalExecutable(tensorflow::OpKernelContext*, tensorflow::NameAttrList const&, bool, tensorflow::XlaPlatformInfo const&, absl::lts_20210324::Span<tensorflow::Tensor const* const>, absl::lts_20210324::Span<tensorflow::VariableInfo const>, absl::lts_20210324::Span, tensorflow::XlaCompilationCache::CompileMode, bool, xla::LocalClient**, tensorflow::XlaCompilationResult const**, xla::LocalExecutable**) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#33 0x00007f2cf1df981c in tensorflow::XlaLocalLaunchBase::Compute(tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#34 0x00007f2ce40495f9 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#35 0x00007f2ce414242f in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::Process(tensorflow::PropagatorState::TaggedNode, long) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#36 0x00007f2ce4143718 in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::RunTask<tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::ScheduleReady(absl::lts_20210324::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::PropagatorState::TaggedNode> >*, tensorflow::PropagatorState::TaggedNodeReadyQueue*)::{lambda()#2}>(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::ScheduleReady(absl::lts_20210324::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::PropagatorState::TaggedNode> >*, tensorflow::PropagatorState::TaggedNodeReadyQueue*)::{lambda()#2}&&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#37 0x00007f2cf15f2181 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#38 0x00007f2cf15ef873 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#39 0x00007f2ce45dc507 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#40 0x00007f2ef08d66db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#41 0x00007f2ef0c0f71f in clone () from /lib/x86_64-linux-gnu/libc.so.6

@jpienaar
Member

jpienaar commented Jul 2, 2021

A change is pending review here to avoid the init when 0 GPU devices are configured; just waiting for reviews to complete.

@jpienaar
Member

jpienaar commented Jul 2, 2021

fc158eb should avoid creating these now where we have 0 GPUs set.

@gbaned
Contributor

gbaned commented Jul 8, 2021

@rsanthanam-amd Can you please check @jpienaar's comments and keep us posted? Thanks!

@gbaned gbaned added stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Jul 8, 2021
@rsanthanam-amd
Contributor Author

rsanthanam-amd commented Jul 8, 2021

@jpienaar @gbaned I verified that jpienaar's fix does indeed address the issue, and I am able to run my model without needing this PR.

This PR can be canceled.

@rsanthanam-amd
Contributor Author

@cheshire Since @jpienaar fixed the underlying issue, can I cancel and close this PR?

@cheshire cheshire closed this Jul 14, 2021
PR Queue automation moved this from Assigned Reviewer to Closed/Rejected Jul 14, 2021
@rsanthanam-amd rsanthanam-amd deleted the upstream_rocm_multiple_nccl_manager_fix branch July 21, 2021 13:20