Windows C++ tensorflow_cc.dll has overlapping memory address between string gpu options for "allocator type" and "visible device list" #39439
Comments
So clearly it is a compilation and linking problem. These attributes are part of the same protobuf message: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/core/protobuf/config.proto so the symbol they address will have the same name, which is ?fixed_address_empty_string@internal@protobuf@google@@3v?$ExplicitlyConstructed@V?$basic_string@DU?$char_traits@D@std@@v?$allocator@D@2@@std@@@123@A How is it possible for the build to produce a different memory address for the same protobuf symbol from the message in the .cc file above? |
Mentioning similar issue: |
Where is the object file for this config.proto file mentioned above I could find
I could try linking against that rather than exposing the symbol from But I have yet to do a symbol dump from that file to see if
|
Here is another sign of hope: |
Mentioning @ttdd11 @Steroes @ZhuoranLyu @brantl @sitting-duck who have been near this issue before. |
OK, here it is, from my continuous integration test
Here is
Now I will see if I can get it to work somehow so those two things are not on top of each other. |
Can anyone describe to me how https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/core/protobuf/config.proto Becomes those two I do not understand the protobuf and bazel process particularly well. |
Here is the contents of found in in the attached .zip files |
Here are the |
The same code linked against a static build of TensorFlow under Linux does not share the same address. |
@sanjoy if you need any additional information: everything I do is triggered by repeatable scripts in a CI environment. This is not a roulette process but a repeatable one. |
@gunan Do you have anyone working on this? |
Does the issue occur on a newer version? |
I will make an attempt, but r1.12+ requires hardware instructions (I think AVX, AVX2 and SSE4) which are more modern than some legacy hardware I wanted to support. I will see if I can do the build on a cloud box. I think others have reported this on 1.15; I will see if I can find a reference to that first before spinning up a whole new build platform. Sam |
I think I need static linking to work a little more "out of the box". My hacked solution without linking to protocol buffers ( leaves me with the following missing symbols Without betting property on it, these all come from
here are a list of my libs
based on
this is coming from the information in
|
It all becomes a can of worms using a The character limit comes into play over and over; with a 65K character limit you get stuffed at every turn. The 16-bit history of the Windows operating system has a lot to answer for. |
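One common way to live with a command-line length limit is to split the linker inputs into groups that each stay under a character budget (for example, one response file per group). The following is a minimal sketch of that splitting step only. ChunkArguments is a hypothetical helper name, and the budget is a parameter because the 65K figure is simply the number quoted in this thread, not a value checked here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `args` into groups whose space-joined length
// does not exceed `limit` characters. An argument that is longer than
// `limit` on its own still gets a group to itself, which will exceed the
// budget; callers would need to handle that case separately.
std::vector<std::vector<std::string>> ChunkArguments(
    const std::vector<std::string>& args, std::size_t limit) {
  std::vector<std::vector<std::string>> chunks;
  std::size_t used = 0;  // characters consumed by the current group
  for (const std::string& arg : args) {
    // Appending to a non-empty group costs one separating space.
    const bool start_new = chunks.empty() || used + 1 + arg.size() > limit;
    if (start_new) {
      chunks.emplace_back();
      used = 0;
    }
    used += (chunks.back().empty() ? 0 : 1) + arg.size();
    chunks.back().push_back(arg);
  }
  return chunks;
}
```

Each returned group could then be written to its own response file, or archived separately and combined in a later step.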
Maybe if I just add the |
Simply adding
to the objects being linked, you go from 34 missing symbols to 164 missing symbols. I really need an expert's guide on how to run the following C++ code on Windows 10
Down to the operating system, compiler, bazel version, CUDA version, etc. The issue is not isolated to We do not have a solution for static linking, but that is a requirement for the above code to run correctly. My hacked solution doesn't work beyond the simple driver version, under It was based around adding the inputs to the Thereby using the inputs for I am unsure what causes this behaviour. |
As an example, if the core/platform/windows/port.cc file has the symbols for port::InitMain, how can I find which artifact has the symbols for the compiled version of that code? This is one of the missing symbols that was in the list above, when dynamically linking. So if the inputs to the .dll are there when statically linking, why is this symbol missing? Maybe if I can work this out I can apply similar logic to the remaining 33 missing symbols. Sam |
OK I followed this up with a recursive grep through the So the symbols should be there twice! One curiosity I have is the see:
|
libtensorflow_cc.so-2.params.zip
But if we are in the situation where including This is before I started hacking. see file attached, |
We have established two things:
- Including the same symbols twice when linking can cause unexpected behavior.
- TensorFlow includes the same port.o file, with the same symbols, at least twice.

Can we make sure the inputs into a bazel target have a directed acyclic graph representation? It seems my approach of getting the union set of all of the object files under Linux and macOS achieves this outcome. But we will be faced with a .a file that is larger than 2 GB, to say nothing of feeding the arguments via lists smaller than 65K characters. How do we proceed? Sam |
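A quick way to confirm that an object file such as port.o is listed more than once is to count how often each path appears in the collected linker inputs. This is a minimal sketch under the assumption that the lines of the .params files have already been read into a list of strings. FindDuplicateInputs is a hypothetical name, and it compares paths textually, so two different spellings of the same file would not be matched.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns every input path that occurs more than once,
// in sorted order. Paths are compared as plain strings.
std::vector<std::string> FindDuplicateInputs(
    const std::vector<std::string>& inputs) {
  std::map<std::string, int> counts;
  for (const std::string& path : inputs) {
    ++counts[path];
  }
  std::vector<std::string> duplicates;
  for (const auto& entry : counts) {
    if (entry.second > 1) {
      duplicates.push_back(entry.first);
    }
  }
  return duplicates;
}
```

An empty result would suggest the input list is already a set; any path that comes back is a candidate for the double-inclusion problem discussed above.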
Including the same symbol twice is an ODR (one definition rule) violation, which is UB and prone to errors. We have several ODR violations in TF; we need to identify and fix them. Thanks for finding out that |
I guess the next step would be to find the size of the union of all the required object files by a recursive search of the .params files, then devise a linking strategy with a complexity less than n factorial (my last attempt failed after a six-hour timeout). Then see if the final linkage is under 2 GB. I am guessing that is a testable binary outcome: the file will have a size greater or less than 2 GB. Sam |
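The size check proposed here can be estimated before any linking is attempted. The sketch below sums the sizes of the unique object files and compares the total against a limit. ObjectFile, UniqueInputBytes and FitsUnderLimit are hypothetical names, and interpreting the 2 GB figure as 2 GiB is an assumption. The sum is only a lower bound, since a real archive adds headers and a symbol table.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical record: one linker input and its size on disk in bytes.
struct ObjectFile {
  std::string path;
  std::uint64_t bytes;
};

// Assumption: "2 GB" is read as 2 GiB.
constexpr std::uint64_t kTwoGiB = 2ull * 1024 * 1024 * 1024;

// Sums the sizes of the inputs, counting each distinct path only once.
std::uint64_t UniqueInputBytes(const std::vector<ObjectFile>& inputs) {
  std::set<std::string> seen;
  std::uint64_t total = 0;
  for (const ObjectFile& input : inputs) {
    if (seen.insert(input.path).second) {  // true only on first occurrence
      total += input.bytes;
    }
  }
  return total;
}

// True when the deduplicated payload is strictly smaller than `limit`.
bool FitsUnderLimit(const std::vector<ObjectFile>& inputs,
                    std::uint64_t limit) {
  return UniqueInputBytes(inputs) < limit;
}
```

If FitsUnderLimit(inputs, kTwoGiB) is already false for the deduplicated set, the archive cannot fit and a different strategy is needed before spending hours on the link.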
Sorry for maybe a stupid question, but what workaround is the "final" one? However, this leads to a crash: I thought that this error might have something to do with what is described in this thread, since for me the pointers for Any way to navigate around this is appreciated! |
I wish you well. The solution is blocked by two issues:
- The DLL cannot be dynamically linked because the symbols from libprotobuf will overlap with the symbols from TensorFlow.
- Statically linking leads to an object file symbol archive which is larger than 2 GB.

So while there is this I think this may take several hours to accumulate all of the symbols in the obj files to make a static archive. |
@kognat-docs, |
It is unclear to me how to apply the large archive option to the bazel build. If one of the bazel Windows team could point it out, that would be great. Sam |
How do I create a static archive in Windows 10 greater than 2 GB using bazel? |
If there were clear instructions on how to create a static archive in Windows 10 greater than 2 GB using bazel, then this long overdue issue could be closed. |
I think we should ask this on Bazel repo? |
Sounds like a good idea! |
Hi there, we are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help. This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information. |
So the underlying surprising feature is that protobuf doesn't make separate allocations for each empty std::string in its structures. Instead it tries to have a fixed empty string The bug is then that (for unknown-to-me build reasons) either there are multiple There's an unsafe workaround -- the |
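The failure mode described in this comment can be modelled in a few lines. The following is a simplified sketch and not protobuf's actual implementation: StringField, Outcome and SetBothFields are hypothetical names. It assumes that a string field points at a shared empty-string default until first written, and that the setter decides whether to allocate by comparing the field's pointer with the default's address as seen by the calling module.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical model of a string field that shares one empty default.
class StringField {
 public:
  explicit StringField(std::string* shared_default) : ptr_(shared_default) {}

  // `known_default` is where the *calling* module thinks the default lives.
  void Set(const std::string& value, const std::string* known_default) {
    if (ptr_ == known_default) {
      // Still pointing at the default: allocate private storage.
      owned_ = std::make_unique<std::string>(value);
      ptr_ = owned_.get();
    } else {
      // Believed to be private storage already: overwrite in place.
      *ptr_ = value;
    }
  }

  const std::string* address() const { return ptr_; }
  const std::string& value() const { return *ptr_; }

 private:
  std::string* ptr_;
  std::unique_ptr<std::string> owned_;
};

struct Outcome {
  bool same_address;
  std::string allocator_type;
  std::string visible_device_list;
};

// Both fields are created by the "library" and written by the "caller".
Outcome SetBothFields(std::string* default_seen_by_library,
                      std::string* default_seen_by_caller) {
  StringField allocator_type(default_seen_by_library);
  StringField visible_device_list(default_seen_by_library);
  allocator_type.Set("BFC", default_seen_by_caller);
  visible_device_list.Set("0", default_seen_by_caller);
  return {allocator_type.address() == visible_device_list.address(),
          allocator_type.value(), visible_device_list.value()};
}
```

When both modules agree on a single default, each field ends up with its own allocation. When the library and the caller each hold their own copy of the default, the comparison never matches, both writes land in the library's shared string, and the two fields report the same address and the last value written, which matches the overlap reported in this issue.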
I am grateful for your insight into this issue. It seems that the creation of multiple definitions, and thereby multiple symbols in the shared library, is the cause of the incorrect behaviour. When it is possible to create a static library, such as on Linux and OSX, we are left with a single definition during the linking of the static archive. So the unsafe workaround seems to be less work than fixing all of the multiple definition problems in such a large code base, but arena needs to raise an error to prevent multiple definitions from squatting on the same memory address without ownership. Sam |
https://developers.google.com/protocol-buffers/docs/reference/arenas Leaving for reference |
System information
You can collect some of this information using our environment capture
script
You can also obtain the TensorFlow version with:
NA
Describe the current behavior
I am creating a session as follows, adapted from the original code
which results in
Describe the expected behavior
The session is created and runs on GPU 0 only, using only 80% of available memory
Standalone code to reproduce the issue
Other info / logs
Please see the following issues
#16291
fo40225/tensorflow-windows-wheel#39
I have built my tensorflow.dll as follows:
$ENV:USE_BAZEL_VERSION="0.19.2"
$ENV:PYTHON_BIN_PATH=C:\ProgramData\Anaconda3\python.exe
$ENV:Path += ";C:\msys64\usr\bin"
$ENV:Path += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2\bin"
$ENV:Path += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2\extras\CUPTI\libx64"
$ENV:Path += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\cudnn-9.2-windows10-x64-v7.5.0.56\cuda\bin"
$ENV:BAZEL_SH = "C:\msys64\usr\bin\bash.exe"
$ENV:CUDA_TOOLKIT_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2"
$ENV:TF_CUDA_VERSION="9.2"
$ENV:CUDNN_INSTALL_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\cudnn-9.2-windows10-x64-v7.5.0.56\cuda"
$ENV:TF_CUDNN_VERSION="7"
$ENV:TF_NCCL_VERSION="1"
$ENV:TF_CUDA_COMPUTE_CAPABILITIES="3.5,3.7,5.0,5.2,6.0,6.1"
$ENV:TF_CUDA_CLANG="0"
$ENV:TF_NEED_CUDA="1"
$ENV:TF_NEED_ROCM="0"
$ENV:TF_NEED_OPENCL_SYCL="0"
$params = "configure.py",""
Remove-Item -Recurse -Force "C:\Windows\system32\config\systemprofile_bazel_SYSTEM\install\75b09cf1ac98c0ffb0534079b30efcc4"
cmd /c "ECHO Y" | & python.exe @params
bazel.exe clean --expunge
bazel.exe build --copt=-nvcc_options=disable-warnings --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac,-no_mac --announce_rc --test_timeout 300,450,1200,3600 --test_size_filters=small,medium --jobs=12 //tensorflow:libtensorflow_cc.so //tensorflow:libtensorflow_framework.so
Edits have been made to the following files:
within tensorflow/BUILD
becomes
and within tf_cc_shared_object, the function of tensorflow/BUILD,
becomes
The contents of tf_exported_symbols_msvc.lds are
As documented by #22047 (comment)
My software is linked against libprotobuf.lib from https://mirror.bazel.build/github.com/google/protobuf/archive/v3.6.0.tar.gz built as
I also tried editing tensorflow\tf_version_script.lds to include
I also tried the TF_EXPORT macro from #include "tensorflow/core/platform/macros.h" in tensorflow/core/public/session_options.h and tensorflow/core/common_runtime/session_options.cc as suggested by https://github.com/sitting-duck/stuff/tree/master/ai/tensorflow/build_tensorflow_1.14_source_for_Windows
Do you have any suggestions about how to make sure that the GPU options for allocator type and visible device list do not share the same memory but we still have a monolithic DLL under Windows?