-
Notifications
You must be signed in to change notification settings - Fork 75.2k
Description
Hello tensorflow team,
I have been starting to use your tensorflow debugger but have run into the issue that when I try and use it on a multi-gpu model I get ValueError: Duplicate node name: 'n/_0'.
Inspecting things closer, I saw that the issue originated from the run_metadata, whose partition graphs have many _Send and _HostRecv ops with names like 'n/_0'. These ops are replicated with identical names across my towers which is what is causing the issue.
Looking through the tensorflow code, I believe I tracked where this name is set down to graph_partition.cc:195 where the edge's source name is used as the prefix 'n'. Unfortunately, I have not been able to figure out why the source's name is only 'n', but that seems to be the root of the issue here.
I should add that I never set any tensor name to 'n' anywhere in my own code. Plus, I see certain tests in your codebase rely on names such as 'n/_0' which indicates to me the name is being set somewhere internally in the tensorflow code.
Any help you can provide would be much appreciated!
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
I didn't find any related issues.
Environment info
Operating System: Ubuntu 14.04.5 LTS (running in a singularity container on a CentOS 6.7 host).
Installed version of CUDA and cuDNN:
I am using CUDA 8.0 with NVIDIA driver 367.48, and cuDNN v5.1 .
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
libOpenCL.so
libOpenCL.so.1
libOpenCL.so.1.0
libOpenCL.so.1.0.0
libcublas.so
libcublas.so.8.0
libcublas.so.8.0.45
libcublas_device.a
libcublas_static.a
libcudadevrt.a
libcudart.so
libcudart.so.8.0
libcudart.so.8.0.44
libcudart_static.a
libcudnn.so
libcudnn.so.5
libcudnn.so.5.1.5
libcudnn_static.a
libcufft.so
libcufft.so.8.0
libcufft.so.8.0.44
libcufft_static.a
libcufftw.so
libcufftw.so.8.0
libcufftw.so.8.0.44
libcufftw_static.a
libcuinj64.so
libcuinj64.so.8.0
libcuinj64.so.8.0.44
libculibos.a
libcurand.so
libcurand.so.8.0
libcurand.so.8.0.44
libcurand_static.a
libcusolver.so
libcusolver.so.8.0
libcusolver.so.8.0.44
libcusolver_static.a
libcusparse.so
libcusparse.so.8.0
libcusparse.so.8.0.44
libcusparse_static.a
libnppc.so
libnppc.so.8.0
libnppc.so.8.0.44
libnppc_static.a
libnppi.so
libnppi.so.8.0
libnppi.so.8.0.44
libnppi_static.a
libnppial.so
libnppial.so.8.0
libnppial.so.8.0.44
libnppicc.so
libnppicc.so.8.0
libnppicc.so.8.0.44
libnppicom.so
libnppicom.so.8.0
libnppicom.so.8.0.44
libnppidei.so
libnppidei.so.8.0
libnppidei.so.8.0.44
libnppif.so
libnppif.so.8.0
libnppif.so.8.0.44
libnppig.so
libnppig.so.8.0
libnppig.so.8.0.44
libnppim.so
libnppim.so.8.0
libnppim.so.8.0.44
libnppist.so
libnppist.so.8.0
libnppist.so.8.0.44
libnppisu.so
libnppisu.so.8.0
libnppisu.so.8.0.44
libnppitc.so
libnppitc.so.8.0
libnppitc.so.8.0.44
libnpps.so
libnpps.so.8.0
libnpps.so.8.0.44
libnpps_static.a
libnvToolsExt.so
libnvToolsExt.so.1
libnvToolsExt.so.1.0.0
libnvblas.so
libnvblas.so.8.0
libnvblas.so.8.0.44
libnvgraph.so
libnvgraph.so.8.0
libnvgraph.so.8.0.44
libnvgraph_static.a
libnvrtc-builtins.so
libnvrtc-builtins.so.8.0
libnvrtc-builtins.so.8.0.44
libnvrtc.so
libnvrtc.so.8.0
libnvrtc.so.8.0.44
stubs
- A link to the pip package you installed:
I installed tensorflow usingpip install tensorflow-gpu==0.12.1 - The output from
python -c "import tensorflow; print(tensorflow.__version__)".
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.12.1
What other attempted solutions have you tried?
The single GPU case works fine.
Logs or other output that would be helpful
Here is the dump of some of the problematic nodes.
node {
name: "n/_0"
op: "_Send"
input: "__copy_TOWER0/Const_0"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
attr {
key: "client_terminated"
value {
b: false
}
}
attr {
key: "recv_device"
value {
s: "/job:localhost/replica:0/task:0/gpu:0"
}
}
attr {
key: "send_device"
value {
s: "/job:localhost/replica:0/task:0/gpu:0"
}
}
attr {
key: "send_device_incarnation"
value {
i: 0
}
}
attr {
key: "tensor_name"
value {
s: "edge_545___copy_TOWER0/Const_0"
}
}
}
node {
name: "n/_1"
op: "_HostRecv"
input: "^n/_0"
attr {
key: "client_terminated"
value {
b: false
}
}
attr {
key: "recv_device"
value {
s: "/job:localhost/replica:0/task:0/gpu:0"
}
}
attr {
key: "send_device"
value {
s: "/job:localhost/replica:0/task:0/gpu:0"
}
}
attr {
key: "send_device_incarnation"
value {
i: 0
}
}
attr {
key: "tensor_name"
value {
s: "edge_545___copy_TOWER0/Const_0"
}
}
attr {
key: "tensor_type"
value {
type: DT_FLOAT
}
}
}
node {
name: "n/_2"
op: "_Send"
input: "__copy_TOWER0/Sub_0"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
attr {
key: "client_terminated"
value {
b: false
}
}
attr {
key: "recv_device"
value {
s: "/job:localhost/replica:0/task:0/gpu:0"
}
}
attr {
key: "send_device"
value {
s: "/job:localhost/replica:0/task:0/gpu:0"
}
}
attr {
key: "send_device_incarnation"
value {
i: 0
}
}
attr {
key: "tensor_name"
value {
s: "edge_551___copy_TOWER0/Sub_0"
}
}
}
End of backtrace at crash point
/home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/framework.py(419)run()
-> run_end_resp = self.on_run_end(run_end_req)
/home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/local_cli_wrapper.py(262)on_run_end()
-> self._dump_root, partition_graphs=partition_graphs)
/home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/debug_data.py(407)__init__()
-> self._load_partition_graphs(partition_graphs)
> /home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/debug_data.py(493)_load_partition_graphs()
-> raise ValueError("Duplicate node name: '%s'" % node.name)