Protobufs are linked into multiple shared libraries loaded from python #8394
Comments
Thank you for reporting! We saw something similar a couple of weeks ago. Currently we don't have safeguards for this. @keveman @martinwicke @jhseu @gunan
@allenlavoie fyi @gunan Perhaps we should run non-mac clang builds in our CI?
Running clang builds is in the plans, but there are still a few higher priority items on the list.
* OpenCL Improvements * Registers Scatter and ScatterNd Ops for SYCL * Registers Stack op for SYCL * Fixes No sycl buffer found error for debug ops * Registers MatMul and Transpose Ops to SYCL device for double * Extends analyzer_cli_test.py test to cover SYCL * Fixes Transpose Op for double when on SYCL * Bumps Eigen version to fix double precision issue on SYCL * Extends SessionDebugTestBase to cover SYCL * Register SYCL implementations for random ops * Avoid functions that might not be defined on SYCL device (#51) * Avoid functions that might not be defined on SYCL device * Simplify by using Eigen math functions * OpenCL improvements - Bumps Eigen Version - Refactors Ops registration - Introduces workaround for Const Op related to the difference between CUDA which uses pointers and OpenCL that uses buffers/accessors - Extends memory types to cover DEVICE_SYCL as well - Introduces GetSYCLDevice() method that returns list of supported devices with GPU device having the highest priority ( doesn't include blacklisted devices ) - ::internal::Transpose -> tensorflow::internal::Transpose in order to avoid compilation reported error - re-introduces fix for bugged string replacement causing a lot of compilation warnings -c -> --include - Adds sycl_runtime to bazels ARRAY_DEPS - Replicates TF_CALL_GPU_PROXY_TYPES for SYCL * [OpenCL] Fixes an issue caused by switch to aligned allocator for sycl buffer (#53) * [Build] Use gcc/g++ as a host compiler to avoid #8394 (#54) * [OpenCL] Fixes Scatter Op * Fix testSimple and testConst in stack_op_test (#3) * Fix testSimple and testConst in stack_op_test * Create a specialisation of DoParallelConcatUpdate for SyclDevice and register it * Guard all code in TENSORFLOW_USE_SYCL * Do not use sycl device for int32 * Registration of the Sycl version is now looking like the one for the GPU * Remove added empty line * Register batch normalization kernels for OpenCL (#61) * [OpenCL] RandomGamma has no GPU friendly implementation (#57) * 
[OpenCL] Compatibility fixes for TensorFlow 1.1.0-rc1 * [OpenCL] Implements BatchMatmul Op for SYCL * Lowercase the device name when GPU or SYCL returned * [OpenCL] kernel_estimator_test.py assertEqual-> assertAlmostEqual due to floating point representation on the device * [Eigen] Version bump * GPU device name string manipulation is not needed anymore * [OpenCL] Adds SYCL to device backwards compatibility * [OpenCL] Extends core_rnn_test.py to run for SYCL device * [OpenCL] Minor optimizations for build script * [OpenCL] Enables skip folder list in build script * [OpenCL] Fixes ApplyAdamOp for Sycl device * [OpenCL] SYCL device improvements * [OpenCL] Fixes debug_ops's SEGFAULT for SYCL device * [Build] Adds hexagon to skipped folders list * [OpenCL] Removes EnterLameDuckMode from SYCL device and allocator * [OpenCL] Registers Unique Op for SYCL device * [OpenCL][Temporary] Disables tests for SYCL target due to features not being implemented yet Tests affected: - tensorflow/contrib/memory_stats/python/kernel_tests/memory_stats_ops_test.py - tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py - tensorflow/python/kernel_tests/conv_ops_test.py - tensorflow/python/kernel_tests/depthwise_conv_op_test.py - tensorflow/python/kernel_tests/pooling_ops_3d_test.py - tensorflow/python/kernel_tests/pooling_ops_test.py - tensorflow/python/kernel_tests/scatter_nd_ops_test.py - tensorflow/python/training/adam_test.py - tensorflow/python/training/localhost_cluster_performance_test.py - tensorflow/python/training/training_ops_test.py * [OpenCL][Temporary] Disables failing tests for SYCL in order to establish regression baseline Tests affected: - tensorflow/python/debug/cli/analyzer_cli_test.py - tensorflow/python/debug/lib/session_debug_testlib.py - tensorflow/python/debug/lib/stepper_test.py - tensorflow/python/kernel_tests/unstack_op_test.py - tensorflow/python/ops/image_ops_test.py * [OpenCL] Take options.config.device_count() into consideration * [OpenCL] Fixes 
compilation warning * [OpenCL] device:SYCL:0 -> sycl:0 * [OpenCL] Removes unwanted flags in building script Removes flags given to computecpp that enable SIMD instructions Removes duplicate flags * bool -> const bool * [OpenCL] sycl in test_util.gpu_device_name() -> is_sycl_enabled() * [OpenCL][Temporary] Disables failing tests for SYCL in order to establish regression baseline Test affected: - tensorflow/contrib/stateless/python/kernel_tests/stateless_random_ops_test.py * Imports test_util from tensorflow.python.framework * [OpenCL] Fixes formatting in Python code * [OpenCL] Extends session_test.py to cover SYCL device * [OpenCL] Cleans singleton class * [OpenCL] Keeping CUDA happy * [OpenCL][Temporary] Disables failing tests for SYCL in order to establish regression baseline Test affected: - tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py - tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_ops_test.py * Added support for building with SYCL on ARM. * Acts on the review feedback from: - #9117 (comment) - #9117 (comment) * [OpenCL] Fixes scatter_nd_op_test * Fixes auto-merge mistake * [OpenCL] struct SyclDevice -> class SyclDevice * Revert "[OpenCL] struct SyclDevice -> class SyclDevice" This reverts commit addd433. * [OpenCL] Reverting refactoring commit. As requested in the review #9117 (comment) This change set will be re-introduced in smaller chunks. * Revert "[OpenCL] device:SYCL:0 -> sycl:0" This reverts commit cf16e60. * Revert "[OpenCL] Adds SYCL to device backwards compatibility" This reverts commit b8401b5. 
* Acts on the feedback from #9117 (comment) * control_flow_ops_py_test.py expects device name to be lower cased * Acts on the feedback from #9117 (comment) * Removes debug print * Removes not needed partial specialisation * [OpenCL] Registers ScatterNdFunctor for SYCL device * [OpenCL] Make it compile * [OpenCL] Follow gpu_device changes * [OpenCL] Adds cxx_builtin_include_directory for python lib Fixes bazels missing undeclared inclusions that appeared after merge with TensorFlow upstream * [OpenCL] Fixes Constant Op * [OpenCL] gXX-4.8 -> gXX * [OpenCL] Removes -D_GLIBCXX_USE_CXX11_ABI=0 as it breaks default compiler setup for Ubuntu 16.04 * Revert "[OpenCL] kernel_estimator_test.py assertEqual-> assertAlmostEqual due to floating point representation on the device" This reverts commit 06c50c0. * [OpenCL] CPU allocator is a singleton we should not delete it
Clang build is now running nightly. Closing this issue.
@gunan, the issue is still there. There is a patch applied to work around it, but I'm just saying that you may want to keep the issue open until a proper solution is possible with bazel.
We have some longer-term projects which may triage this on the side.
You are right, it is hard to fix without changes to bazel.
@ilya-biryukov: Could you kindly recheck if the patch is still necessary? I've not managed to reproduce the problem with Clang 3.9.
@tkoeppe, sorry for the late response. AFAIK there were changes to tensorflow's build files that should have fixed that problem. It should be fine now; I'll double-check that it's the case and will report if there are any issues left.
@ilya-biryukov: Thank you very much! I've already removed the patch internally; I'm not sure if this has landed yet.
@gnossen gave a great overview in grpc/grpc#24992 of the overall problem. A Python process that uses both protobuf _and_ another native library linking in libprotobuf can frequently crash. This seems to frequently affect tensorflow as well: tensorflow/tensorflow#8394, tensorflow/tensorflow#9525 (comment), tensorflow/tensorflow#24976, tensorflow/tensorflow#35573, https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/contrib/makefile/rename_protobuf.sh, tensorflow/tensorflow#16104. Testing locally, this fixes both the crashes when linking in multiple versions of protobuf and the `DescriptorPool` clashes (e.g. Python and native code import different versions of the same message).
Issue description
Tensorflow currently fails with the following error if compiled using `clang` in `-c opt` mode when trying to import the `tensorflow.contrib` package in Python.
Python code reproducing the problem is very simple (a script, `run.py`, that only imports `tensorflow.contrib`):
Program output:
The short story is that protobufs are getting statically linked into two shared libraries, both of which get loaded at runtime and that causes the error.
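The short story above can be modeled in a few lines of Python. This is only a toy sketch, not protobuf or TensorFlow code: the `Library` class, the `PROTO_FILE` name, and the idea of a per-library "once" flag are assumptions introduced for illustration. It shows why a guard that is evaluated separately in each library does not prevent a second registration when the registration itself always lands in the first library's database.

```python
PROTO_FILE = "example.proto"  # illustrative name, not taken from a real build


class Library:
    """Toy model of a shared library that statically links the generated code."""

    def __init__(self, name):
        self.name = name
        self.registered_files = set()  # this copy's descriptor database
        self.once_done = False         # this copy's "already initialized" guard


def add_descriptors_impl(owner):
    """Unguarded registration: fails if the file is already present."""
    if PROTO_FILE in owner.registered_files:
        raise RuntimeError(f"{PROTO_FILE} already registered in {owner.name}")
    owner.registered_files.add(PROTO_FILE)


def load_all(libraries, guard_inlined):
    """Run each library's static initializer in load order.

    Globally visible symbols are assumed to resolve to the first library
    that defines them, so registration always targets that library.
    """
    owner = libraries[0]
    for lib in libraries:
        # Inlined guard: each library checks its own flag.
        # Out-of-line guard: the call resolves to the owner's copy and flag.
        guard = lib if guard_inlined else owner
        if not guard.once_done:
            guard.once_done = True
            add_descriptors_impl(owner)


def demo(guard_inlined):
    libs = [
        Library("_pywrap_tensorflow_internal.so"),
        Library("_pywrap_tensorflow_print_model_analysis_lib.so"),
    ]
    try:
        load_all(libs, guard_inlined)
        return "ok"
    except RuntimeError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    print("guard out of line:", demo(guard_inlined=False))
    print("guard inlined:    ", demo(guard_inlined=True))
```

In this model the out-of-line guard run succeeds, while the inlined guard run raises on the second library, because its own flag is still unset even though the shared database already holds the file.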
Here's the full breakdown of what happens:
1. Protobufs (`//tensorflow/core:protos_all_cc`) get compiled as a static library.
2. Protobufs (`//tensorflow/core:protos_all_cc`) get statically linked into two separate shared libraries: `_pywrap_tensorflow_internal.so` and `_pywrap_tensorflow_print_model_analysis_lib.so`.
3. `clang` inlines the initialization function (`AddDescriptors`) inside `example.pb.cc` (look for it in `bazel-genfiles`) into the global initialization code of both shared libraries.
4. `python run.py` starts running. While processing Python's import statement, the dynamic linker gets called to load `_pywrap_tensorflow_internal.so`. Static initialization code inside `example.pb.cc` is run, registering it in the protobuf database of `_pywrap_tensorflow_internal.so`.
5. `_pywrap_tensorflow_print_model_analysis_lib.so` gets loaded. Since Python calls `dlopen` with `RTLD_GLOBAL`, the dynamic linker finds an existing symbol for `AddDescriptorsImpl` in `_pywrap_tensorflow_internal.so` and uses that for all later calls to that function (including calls coming from `_pywrap_tensorflow_print_model_analysis_lib.so`).
6. Static initialization code inside `example.pb.cc` is run again (for `_pywrap_tensorflow_print_model_analysis_lib.so`). It calls `AddDescriptorsImpl` and ends up in the function from `_pywrap_tensorflow_internal.so`, which tries to register the same file again in the protobuf database of `_pywrap_tensorflow_internal.so`, leading to the specified error.
Here are a few observations that may be interesting:
- It doesn't break with `gcc`, because `gcc` doesn't inline `AddDescriptors` into the global initialization code of the libraries. The dynamic linker then merges those two functions into one, and that function has a proper check for being called multiple times (`AddDescriptorsImpl`, which is what gets called after inlining, doesn't). But note that it may break too if `gcc` starts inlining `AddDescriptors` in a newer version.
Environment info
Operating System: ubuntu 14.04
Installed version of CUDA and cuDNN: none
Commit hash (`git rev-parse HEAD`): ff9682b5f493ae7ad912da29789668dbf50d5e1f
Output of `bazel version`:
Build label: 0.4.4
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Feb 1 18:54:21 2017 (1485975261)
Build timestamp: 1485975261
Build timestamp as int: 1485975261
Repro
Make sure `clang` is installed. My version is `3.8.0-2ubuntu3~trusty4`, but that shouldn't matter.
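Since the failure described in this issue depends on extension modules being loaded with `RTLD_GLOBAL`, it may help while reproducing to check which `dlopen` flags the interpreter is currently using. The following is a minimal sketch, assuming a Unix-like platform where `sys.getdlopenflags()` and the `os.RTLD_*` constants are available; the helper names `describe_dlopen_flags` and `uses_global_symbols` are made up for this example.

```python
import os
import sys


def describe_dlopen_flags(flags):
    """Return the names of the RTLD_* bits that are set in `flags`."""
    candidates = (
        "RTLD_LAZY",
        "RTLD_NOW",
        "RTLD_GLOBAL",
        "RTLD_NODELETE",
        "RTLD_NOLOAD",
        "RTLD_DEEPBIND",
    )
    names = []
    for name in candidates:
        bit = getattr(os, name, None)  # not every constant exists everywhere
        if bit and flags & bit == bit:
            names.append(name)
    return names


def uses_global_symbols(flags):
    """True if libraries opened with `flags` export symbols process-wide."""
    return bool(flags & os.RTLD_GLOBAL)


if __name__ == "__main__":
    current = sys.getdlopenflags()
    print("dlopen flags:", describe_dlopen_flags(current))
    print("RTLD_GLOBAL set:", uses_global_symbols(current))
```

The flags can be changed at runtime with `sys.setdlopenflags()`, so the value seen at interpreter start is not necessarily the one in effect when a particular extension module gets loaded; checking right before and after the import in question is more informative.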