Skip to content

Bug - tfdbg + multi-gpu gives ValueError: Duplicate node name: 'n/_0' #7051

@raphtown

Description

@raphtown

Hello tensorflow team,

I have been starting to use your tensorflow debugger but have run into the issue that when I try and use it on a multi-gpu model I get ValueError: Duplicate node name: 'n/_0'.

Inspecting things closer, I saw that the issue originated from the run_metadata, whose partition graphs have many _Send and _HostRecv ops with names like 'n/_0'. These ops are replicated with identical names across my towers which is what is causing the issue.

Looking through the tensorflow code, I believe I tracked where this name is set down to graph_partition.cc:195 where the edge's source name is used as the prefix 'n'. Unfortunately, I have not been able to figure out why the source's name is only 'n', but that seems to be the root of the issue here.

I should add that I never set any tensor name to 'n' anywhere in my own code. Plus, I see certain tests in your codebase rely on names such as 'n/_0' which indicates to me the name is being set somewhere internally in the tensorflow code.

Any help you can provide would be much appreciated!

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

I didn't find any related issues.

Environment info

Operating System: Ubuntu 14.04.5 LTS (running in a singularity container on a CentOS 6.7 host).

Installed version of CUDA and cuDNN:
I am using CUDA 8.0 with NVIDIA driver 367.48, and cuDNN v5.1 .
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

libOpenCL.so
libOpenCL.so.1
libOpenCL.so.1.0
libOpenCL.so.1.0.0
libcublas.so
libcublas.so.8.0
libcublas.so.8.0.45
libcublas_device.a
libcublas_static.a
libcudadevrt.a
libcudart.so
libcudart.so.8.0
libcudart.so.8.0.44
libcudart_static.a
libcudnn.so
libcudnn.so.5
libcudnn.so.5.1.5
libcudnn_static.a
libcufft.so
libcufft.so.8.0
libcufft.so.8.0.44
libcufft_static.a
libcufftw.so
libcufftw.so.8.0
libcufftw.so.8.0.44
libcufftw_static.a
libcuinj64.so
libcuinj64.so.8.0
libcuinj64.so.8.0.44
libculibos.a
libcurand.so
libcurand.so.8.0
libcurand.so.8.0.44
libcurand_static.a
libcusolver.so
libcusolver.so.8.0
libcusolver.so.8.0.44
libcusolver_static.a
libcusparse.so
libcusparse.so.8.0
libcusparse.so.8.0.44
libcusparse_static.a
libnppc.so
libnppc.so.8.0
libnppc.so.8.0.44
libnppc_static.a
libnppi.so
libnppi.so.8.0
libnppi.so.8.0.44
libnppi_static.a
libnppial.so
libnppial.so.8.0
libnppial.so.8.0.44
libnppicc.so
libnppicc.so.8.0
libnppicc.so.8.0.44
libnppicom.so
libnppicom.so.8.0
libnppicom.so.8.0.44
libnppidei.so
libnppidei.so.8.0
libnppidei.so.8.0.44
libnppif.so
libnppif.so.8.0
libnppif.so.8.0.44
libnppig.so
libnppig.so.8.0
libnppig.so.8.0.44
libnppim.so
libnppim.so.8.0
libnppim.so.8.0.44
libnppist.so
libnppist.so.8.0
libnppist.so.8.0.44
libnppisu.so
libnppisu.so.8.0
libnppisu.so.8.0.44
libnppitc.so
libnppitc.so.8.0
libnppitc.so.8.0.44
libnpps.so
libnpps.so.8.0
libnpps.so.8.0.44
libnpps_static.a
libnvToolsExt.so
libnvToolsExt.so.1
libnvToolsExt.so.1.0.0
libnvblas.so
libnvblas.so.8.0
libnvblas.so.8.0.44
libnvgraph.so
libnvgraph.so.8.0
libnvgraph.so.8.0.44
libnvgraph_static.a
libnvrtc-builtins.so
libnvrtc-builtins.so.8.0
libnvrtc-builtins.so.8.0.44
libnvrtc.so
libnvrtc.so.8.0
libnvrtc.so.8.0.44
stubs
  1. A link to the pip package you installed:
    I installed tensorflow using pip install tensorflow-gpu==0.12.1
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)".
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.12.1 

What other attempted solutions have you tried?

The single GPU case works fine.

Logs or other output that would be helpful

Here is the dump of some of the problematic nodes.

node {
  name: "n/_0"
  op: "_Send"
  input: "__copy_TOWER0/Const_0"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "client_terminated"
    value {
      b: false
    }
  }
  attr {
    key: "recv_device"
    value {
      s: "/job:localhost/replica:0/task:0/gpu:0"
    }
  }
  attr {
    key: "send_device"
    value {
      s: "/job:localhost/replica:0/task:0/gpu:0"
    }
  }
  attr {
    key: "send_device_incarnation"
    value {
      i: 0
    }
  }
  attr {
    key: "tensor_name"
    value {
      s: "edge_545___copy_TOWER0/Const_0"
    }
  }
}
node {
  name: "n/_1"
  op: "_HostRecv"
  input: "^n/_0"
  attr {
    key: "client_terminated"
    value {
      b: false
    }
  }
  attr {
    key: "recv_device"
    value {
      s: "/job:localhost/replica:0/task:0/gpu:0"
    }
  }
  attr {
    key: "send_device"
    value {
      s: "/job:localhost/replica:0/task:0/gpu:0"
    }
  }
  attr {
    key: "send_device_incarnation"
    value {
      i: 0
    }
  }
  attr {
    key: "tensor_name"
    value {
      s: "edge_545___copy_TOWER0/Const_0"
    }
  }
  attr {
    key: "tensor_type"
    value {
      type: DT_FLOAT
    }
  }
}
node {
  name: "n/_2"
  op: "_Send"
  input: "__copy_TOWER0/Sub_0"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "client_terminated"
    value {
      b: false
    }
  }
  attr {
    key: "recv_device"
    value {
      s: "/job:localhost/replica:0/task:0/gpu:0"
    }
  }
  attr {
    key: "send_device"
    value {
      s: "/job:localhost/replica:0/task:0/gpu:0"
    }
  }
  attr {
    key: "send_device_incarnation"
    value {
      i: 0
    }
  }
  attr {
    key: "tensor_name"
    value {
      s: "edge_551___copy_TOWER0/Sub_0"
    }
  }
}

End of backtrace at crash point

  /home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/framework.py(419)run()                 
-> run_end_resp = self.on_run_end(run_end_req)                                                                              
  /home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/wrappers/local_cli_wrapper.py(262)on_run_end()  
-> self._dump_root, partition_graphs=partition_graphs)                                                                      
  /home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/debug_data.py(407)__init__()                    
-> self._load_partition_graphs(partition_graphs)                                                                            
> /home/raphtown/.local/lib/python2.7/site-packages/tensorflow/python/debug/debug_data.py(493)_load_partition_graphs()      
-> raise ValueError("Duplicate node name: '%s'" % node.name)                                                                

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions