
tensorflow conflicts with nn.DataParallel #2230

Closed · lanpa opened this issue Jul 28, 2017 · 5 comments

@lanpa (Collaborator) commented Jul 28, 2017

Environment: two GTX 1080 GPUs.
Minimal reproducible code:

import torch
from torch.autograd import Variable

import tensorflow as tf
with tf.device('/cpu:0'):
    emb = tf.Variable([[1,2],[3,4]], name="embedding")

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(emb.initializer)

model = torch.nn.Linear(128, 1).cuda()
model = torch.nn.DataParallel(model).cuda()

data = Variable(torch.Tensor(8,128)).cuda()
x = model(data)

error message:

  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 225, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 64, in replicate
    return replicate(module, device_ids)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast(devices)(*params)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, self.target_gpus)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 49, in broadcast_coalesced
    raise RuntimeError('all tensors must be on devices[0]')
RuntimeError: all tensors must be on devices[0]

Removing either model = torch.nn.DataParallel(model).cuda() or the sess.run call makes the code work fine.

@cdluminate (Contributor)

What about setting the environment variable CUDA_VISIBLE_DEVICES=1?

See also http://pytorch.org/docs/master/notes/cuda.html
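For example (a minimal sketch; the variable has to be set before CUDA is initialized by either framework, so it goes at the very top of the script, or on the command line as CUDA_VISIBLE_DEVICES=1 python script.py):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # must be set before CUDA is initialized

import torch
import tensorflow as tf

# Both frameworks now see a single GPU (physical GPU 1 mapped to index 0),
# so there is no second device for the current device to drift to.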

@lanpa (Collaborator, Author) commented Jul 30, 2017

CUDA_VISIBLE_DEVICES=1 makes only one GPU do the work, which defeats the purpose of DataParallel. I think the interesting thing is that another program might cause PyTorch to move its GPU tensors around (and cause strange errors?).

@apaszke (Contributor) commented Aug 7, 2017

Can you print torch.cuda.current_device() after you run the TF initializer?
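Something like this on top of the repro above (a sketch of where the prints could go; it shows the current device index and the device count before and after the TF session):

import torch
print(torch.cuda.current_device(), torch.cuda.device_count())

import tensorflow as tf
with tf.device('/cpu:0'):
    emb = tf.Variable([[1, 2], [3, 4]], name="embedding")

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(emb.initializer)

print(torch.cuda.current_device(), torch.cuda.device_count())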

@lanpa (Collaborator, Author) commented Aug 7, 2017

Oops, it moves to another GPU after TF initializes!
https://gist.github.com/anonymous/411931230de42bcecd8a9dd535c64e6b

before import tf:
0 2
after import tf:
0 2
2017-08-07 14:00:54.239520: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 14:00:54.239548: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 14:00:54.239557: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 14:00:54.239564: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 14:00:54.239571: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 14:00:54.341517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-07 14:00:54.341880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.797
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 5.60GiB
2017-08-07 14:00:54.446014: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x2d66880 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-08-07 14:00:54.446348: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-07 14:00:54.446694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.797
pciBusID 0000:02:00.0
Total memory: 7.92GiB
Free memory: 5.60GiB
2017-08-07 14:00:54.447293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 
2017-08-07 14:00:54.447318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y 
2017-08-07 14:00:54.447325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y 
2017-08-07 14:00:54.447342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
2017-08-07 14:00:54.447355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:02:00.0)
after init:
1 2
after init (outside session):
1 2
Traceback (most recent call last):
  File "bug.py", line 24, in <module>
    x = model(data)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 65, in replicate
    return replicate(module, device_ids)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast(devices)(*params)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 18, in forward
    outputs = comm.broadcast_coalesced(inputs, self.target_gpus)
  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 52, in broadcast_coalesced
    raise RuntimeError('all tensors must be on devices[0]')
RuntimeError: all tensors must be on devices[0]

@colesbury (Member)

It's too bad that TensorFlow changes the current device, but this is the expected PyTorch behavior. The model must be on device_ids[0]. device_ids defaults to 0, 1, 2, .... If your model is on device 7, you must manually specify device_ids.
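For example, if the current device has been left at 1 (a sketch, assuming a two-GPU machine like the one above):

model = torch.nn.Linear(128, 1).cuda(1)                   # model lives on the current device (1)
model = torch.nn.DataParallel(model, device_ids=[1, 0])   # device_ids[0] must match the model's device
data = Variable(torch.Tensor(8, 128)).cuda(1)
x = model(data)                                            # output is gathered on device_ids[0], i.e. GPU 1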

Or, just set the current device after the TensorFlow call:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(emb.initializer)

torch.cuda.set_device(0)  # set the device back to 0

model = torch.nn.Linear(128, 1).cuda()
model = torch.nn.DataParallel(model).cuda()

data = Variable(torch.Tensor(8,128)).cuda()
x = model(data)
