Input tensor on GPU in C++ API #5902

Closed
larsmennen opened this issue Nov 28, 2016 · 17 comments

@larsmennen larsmennen commented Nov 28, 2016

I am trying to feed a Tensor (using the C++ API) that has memory allocated on the GPU (using GPUBFCAllocator) into a network.
Now, the placeholder in the network is on the GPU (I checked this in TensorBoard), and the memory allocated for the input tensor is on the GPU, but whenever I run the network, nvprof --print-gpu-trace shows me [CUDA memcpy HtoD] and [CUDA memcpy DtoH] before the computations (i.e. convolutions etc.) start.
This suggests to me that the input tensor is being copied to CPU memory, and then back to GPU memory.

While debugging this, I found multiple hints in the source that seem to suggest the CPU is always used as the device to feed tensors from.
See e.g.:


std::unique_ptr<Device> device = GetCPUDevice(env);

  1. Is this analysis correct?
  2. How can one feed in a tensor whose memory is allocated in GPU memory, without copying back and forth to CPU memory? If this is currently not possible, I think this would be a good feature to add.
    Especially when one wants to combine TensorFlow input/output with other algorithms (not in TF), one might want to keep data on the GPU to avoid host-to-device and device-to-host transfers.

Thanks in advance.

Environment info

Operating System: Ubuntu 16.04.1 LTS

Installed version of CUDA and cuDNN: CUDA 8.0, cuDNN 5.1.5
Output of ls -l /usr/local/cuda/lib64/libcud*:

-rw-r--r-- 1 root root 558720 Sep 15 00:02 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root     16 Sep 15 00:05 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root     19 Sep 15 00:05 /usr/local/cuda/lib64/libcudart.so.8.0 -> libcudart.so.8.0.44
-rw-r--r-- 1 root root 415432 Sep 15 00:02 /usr/local/cuda/lib64/libcudart.so.8.0.44
-rw-r--r-- 1 root root 775162 Sep 15 00:02 /usr/local/cuda/lib64/libcudart_static.a
lrwxrwxrwx 1 root root     43 Oct  3 17:03 /usr/local/cuda/lib64/libcudnn.so -> /usr/local/cudnn-8.0-v5.1/lib64/libcudnn.so
lrwxrwxrwx 1 root root     45 Oct  3 17:03 /usr/local/cuda/lib64/libcudnn.so.5 -> /usr/local/cudnn-8.0-v5.1/lib64/libcudnn.so.5
lrwxrwxrwx 1 root root     49 Oct  3 17:03 /usr/local/cuda/lib64/libcudnn.so.5.1.5 -> /usr/local/cudnn-8.0-v5.1/lib64/libcudnn.so.5.1.5
lrwxrwxrwx 1 root root     49 Oct  3 17:03 /usr/local/cuda/lib64/libcudnn_static.a -> /usr/local/cudnn-8.0-v5.1/lib64/libcudnn_static.a

Tensorflow installed from source:

  1. The commit hash (git rev-parse HEAD): a507438
  2. The output of bazel version:
Build label: 0.4.0
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Nov 2 17:54:14 2016 (1478109254)
Build timestamp: 1478109254
Build timestamp as int: 1478109254

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

This should give the general idea.

// Allocate the input tensor directly in GPU memory via a TF GPU allocator.
tensorflow::GPUBFCAllocator* allocator = new tensorflow::GPUBFCAllocator(0, sizeof(float) * height * width * 3);
tensorflow::Tensor input_tensor(allocator, tensorflow::DataType::DT_FLOAT, tensorflow::TensorShape({ 1, height, width, 3 }));
std::vector<tensorflow::Tensor>* outputs = new std::vector<tensorflow::Tensor>;
// <copy some data into the allocated space>
// <create a new session, load graph etc.>  Note: the "input_layer" node is placed on the GPU.
session->Run({ { "input_layer", input_tensor } }, { "output_layer" }, { }, outputs);

Logs or other output that would be helpful

Partial output of nvprof --print-gpu-trace:

   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
249.54ms  4.7360us                    -               -         -         -         -  1.0039KB  207.01MB/s  GeForce GTX 107         1         7  [CUDA memset]
932.67ms  44.799us                    -               -         -         -         -  384.00KB  8.1745GB/s  GeForce GTX 107         1         7  [CUDA memcpy HtoD]
933.22ms  15.584us            (12 32 1)        (32 8 1)        20        0B        0B         -           -  GeForce GTX 107         1         7  void cv::cudev::grid_transform_detail::transformSmart<int=4, unsigned char, float, cv::cudev::saturate_cast_func<unsigned char, float>, cv::cudev::WithOutMask>(cv::cudev::GlobPtr<unsigned char>, cv::cudev::grid_transform_detail::transformSmart<int=4, unsigned char, float, cv::cudev::saturate_cast_func<unsigned char, float, float>, cv::cudev::WithOutMask>, unsigned char, float, int, int) [158]
933.29ms  121.12us                    -               -         -         -         -  1.5000MB  12.094GB/s  GeForce GTX 107         1         7  [CUDA memcpy DtoH]
1.24468s  2.1120us                    -               -         -         -         -        4B  1.8062MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24492s     992ns                    -               -         -         -         -        4B  3.8455MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24522s     992ns                    -               -         -         -         -        4B  3.8455MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24553s     992ns                    -               -         -         -         -        4B  3.8455MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24594s  1.0880us                    -               -         -         -         -        4B  3.5062MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24632s     992ns                    -               -         -         -         -        4B  3.8455MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24800s     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24816s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24830s  38.879us                    -               -         -         -         -  288.00KB  7.0644GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24846s  1.0880us                    -               -         -         -         -      512B  448.79MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24858s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24868s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24874s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24963s  106.62us                    -               -         -         -         -  1.1250MB  10.304GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24981s  1.4080us                    -               -         -         -         -  1.0000KB  693.58MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.24987s  1.1840us                    -               -         -         -         -  1.0000KB  824.80MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25153s  196.60us                    -               -         -         -         -  2.2500MB  11.176GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25189s  1.4080us                    -               -         -         -         -  1.0000KB  693.58MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25196s  1.2800us                    -               -         -         -         -  1.0000KB  762.94MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25250s  180.92us                    -               -         -         -         -  2.2500MB  12.145GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25276s  1.4080us                    -               -         -         -         -  1.0000KB  693.58MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25281s  1.1840us                    -               -         -         -         -  1.0000KB  824.80MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25332s  180.67us                    -               -         -         -         -  2.2500MB  12.162GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25355s  1.4400us                    -               -         -         -         -  1.0000KB  678.17MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25359s  1.1520us                    -               -         -         -         -  1.0000KB  847.71MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25406s  191.36us                    -               -         -         -         -  2.2500MB  11.483GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25431s  1.4400us                    -               -         -         -         -  1.0000KB  678.17MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25435s  1.1520us                    -               -         -         -         -  1.0000KB  847.71MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25461s  90.910us                    -               -         -         -         -  1.1250MB  12.085GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25477s  1.3120us                    -               -         -         -         -  1.0000KB  744.33MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25482s  1.1520us                    -               -         -         -         -  1.0000KB  847.71MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25486s  1.0560us                    -               -         -         -         -       68B  61.411MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25490s  4.1280us                    -               -         -         -         -  38.250KB  8.8367GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25500s  1.1840us                    -               -         -         -         -      256B  206.20MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25506s  1.0560us                    -               -         -         -         -      256B  231.19MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25511s  12.288us                    -               -         -         -         -  144.00KB  11.176GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25523s  1.0880us                    -               -         -         -         -      256B  224.39MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25526s  1.0560us                    -               -         -         -         -      256B  231.19MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25532s  12.320us                    -               -         -         -         -  144.00KB  11.147GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25542s  1.0880us                    -               -         -         -         -      256B  224.39MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25545s  1.0880us                    -               -         -         -         -      256B  224.39MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25549s  1.6640us                    -               -         -         -         -  6.7500KB  3.8686GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25552s  1.0880us                    -               -         -         -         -      256B  224.39MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25555s  1.1840us                    -               -         -         -         -      256B  206.20MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25559s  1.0880us                    -               -         -         -         -      256B  224.39MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25562s  1.0880us                    -               -         -         -         -      256B  224.39MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25570s  23.487us                    -               -         -         -         -  288.00KB  11.694GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25580s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25584s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25597s  45.919us                    -               -         -         -         -  576.00KB  11.963GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25607s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25610s  1.0880us                    -               -         -         -         -      512B  448.79MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25624s  45.919us                    -               -         -         -         -  576.00KB  11.963GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25634s  1.0880us                    -               -         -         -         -      512B  448.79MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25638s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25641s  1.4720us                    -               -         -         -         -  2.0000KB  1.2958GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25645s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25649s  1.1520us                    -               -         -         -         -  1.0000KB  847.71MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25652s  1.1840us                    -               -         -         -         -  1.0000KB  824.80MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25656s  1.1520us                    -               -         -         -         -  1.0000KB  847.71MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25659s  1.0880us                    -               -         -         -         -      512B  448.79MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25662s  1.1200us                    -               -         -         -         -      512B  435.97MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25666s  1.0880us                    -               -         -         -         -      256B  224.39MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25669s  1.0560us                    -               -         -         -         -      256B  231.19MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25673s  1.1520us                    -               -         -         -         -  1.0000KB  847.71MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25676s  1.1840us                    -               -         -         -         -  1.0000KB  824.80MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25680s  1.1520us                    -               -         -         -         -  1.0000KB  847.71MB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.25965s  360.89us                    -               -         -         -         -  4.5000MB  12.177GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26011s  1.5360us                    -               -         -         -         -  2.0000KB  1.2418GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26015s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26543s  719.60us                    -               -         -         -         -  9.0000MB  12.214GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26626s  1.8240us                    -               -         -         -         -  2.0000KB  1.0457GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26630s  1.2480us                    -               -         -         -         -  2.0000KB  1.5283GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26788s  727.98us                    -               -         -         -         -  9.0000MB  12.073GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26872s  1.5040us                    -               -         -         -         -  2.0000KB  1.2682GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.26876s  1.2480us                    -               -         -         -         -  2.0000KB  1.5283GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27030s  718.67us                    -               -         -         -         -  9.0000MB  12.230GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27109s  1.5680us                    -               -         -         -         -  2.0000KB  1.2164GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27112s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27268s  718.58us                    -               -         -         -         -  9.0000MB  12.231GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27349s  1.5040us                    -               -         -         -         -  2.0000KB  1.2682GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27359s  1.6000us                    -               -         -         -         -  2.0000KB  1.1921GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27437s  360.12us                    -               -         -         -         -  4.5000MB  12.203GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27479s  1.5360us                    -               -         -         -         -  2.0000KB  1.2418GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27484s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27487s  1.2480us                    -               -         -         -         -  2.0000KB  1.5283GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27491s  1.3120us                    -               -         -         -         -  2.0000KB  1.4538GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27495s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27651s  719.02us                    -               -         -         -         -  9.0000MB  12.224GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27732s  1.5040us                    -               -         -         -         -  2.0000KB  1.2682GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27736s  1.3440us                    -               -         -         -         -  2.0000KB  1.4192GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27890s  719.02us                    -               -         -         -         -  9.0000MB  12.224GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27970s  1.5350us                    -               -         -         -         -  2.0000KB  1.2426GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.27976s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28131s  719.09us                    -               -         -         -         -  9.0000MB  12.223GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28212s  1.5360us                    -               -         -         -         -  2.0000KB  1.2418GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28222s  1.2480us                    -               -         -         -         -  2.0000KB  1.5283GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28377s  718.06us                    -               -         -         -         -  9.0000MB  12.240GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28463s  1.6320us                    -               -         -         -         -  2.0000KB  1.1687GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28468s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28619s  719.86us                    -               -         -         -         -  9.0000MB  12.209GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28703s  1.5040us                    -               -         -         -         -  2.0000KB  1.2682GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28707s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28863s  719.28us                    -               -         -         -         -  9.0000MB  12.219GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28946s  1.5040us                    -               -         -         -         -  2.0000KB  1.2682GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28951s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28955s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28960s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.28964s  1.2800us                    -               -         -         -         -  2.0000KB  1.4901GB/s  GeForce GTX 107         1        14  [CUDA memcpy HtoD]
1.29034s  20.768us                    -               -         -         -         -  1.5000MB  70.534GB/s  GeForce GTX 107         1        14  [CUDA memcpy DtoD]
1.29047s  30.015us             (16 1 1)      (1024 1 1)        25        0B        0B         -           -  GeForce GTX 107         1        13  void tensorflow::functor::SwapDimension1And2InTensor3<float>(int, float const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3<float>*) [936]
1.29050s  3.7120us              (2 1 1)      (1024 1 1)        27        0B        0B         -           -  GeForce GTX 107         1        13  void tensorflow::functor::SwapDimension0And2InTensor3<float>(int, float const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension0And2InTensor3<float>*) [942]
1.59580s     864ns                    -               -         -         -         -      112B  123.62MB/s  GeForce GTX 107         1         7  [CUDA memcpy HtoD]
<computations start>
@poxvoculi poxvoculi (Member) commented Nov 28, 2016

Thanks for your clear and detailed issue report.

Your analysis is correct. GPU tensors are always fed by establishing a CPU tensor and copying its value to the GPU. In the other direction, when reading the output of a graph computation, if the value you want to read is produced on the GPU, it must be copied back to the CPU before producing I/O, e.g. writing to a file or window.

The basic reason is that the GPU is simply an auxiliary processor with very limited capability to engage in I/O operations. It is not feasible to, e.g., read from local disk directly into GPU memory without first staging through CPU memory. In principle this should not pose a performance problem, so long as you're able to buffer input into CPU RAM ahead of its being needed on the GPU.

If you want to combine a TF program with a non-TF program that both run on GPU, without copying data back and forth to CPU RAM, I think you will need to arrange for input/output values to be in Vars that are local to the GPU, and somehow make them accessible to the non-TF program. Alternatively, maybe you can package your non-TF program as a single TF Op.
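
For illustration only, here is a rough sketch of what a GPU-resident Var might look like when the graph is built with the C++ ops API. The name gpu_input, the shape, and the device string are made up for the example; this is not a complete program.

#include "tensorflow/cc/framework/scope.h"
#include "tensorflow/cc/ops/standard_ops.h"

using namespace tensorflow;

// Build part of a graph in which the input lives in a GPU-resident Variable
// rather than a fed placeholder. "gpu_input" and the shape are hypothetical.
Scope root = Scope::NewRootScope();
Scope gpu = root.WithDevice("/device:GPU:0");
auto gpu_input = ops::Variable(gpu.WithOpName("gpu_input"),
                               {1, 224, 224, 3}, DT_FLOAT);
// Downstream ops read gpu_input directly; its buffer stays on the GPU between
// Session::Run() calls, so the feed never has to pass through CPU RAM.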

@gaoquanbing-ausxin gaoquanbing-ausxin commented Nov 29, 2016

qq

@larsmennen larsmennen (Author) commented Nov 29, 2016

Thanks @poxvoculi for your quick reply and detailed explanation.

I understand that reading from or writing to disk directly from GPU memory, without passing through CPU memory, is not feasible. However, that is not what I am trying to do. I am trying to combine a TF program with a non-TF program that both run on the GPU, as you mention.

Your last suggestion (using Vars) seems feasible, but it also seems like a bit of a workaround. Especially in high-performance environments, I can imagine that I would not be the only one using this functionality.

The fact that data is copied to CPU memory and then back to GPU memory when running a graph with an input tensor in GPU memory suggests to me that TF is detecting that this tensor is in GPU memory.
So wouldn't it be easier to work directly from an input tensor if it is already in GPU memory, and only transfer it CPU -> GPU if it is in CPU memory?
Similarly, there could be an option to specify whether output tensors should be copied back to CPU memory or not.
I think this would give the needed flexibility.

Thanks, and I look forward to hearing your thoughts.

@poxvoculi poxvoculi (Member) commented Nov 29, 2016

OK, if your real question is "How do I get a GPU-using TensorFlow program to cooperate effectively with a non-TF GPU program?" that's a whole collection of difficult issues.

The short, easy answer is to pipeline them, with the inputs/outputs staged through CPU RAM, as it sounds like you're doing. Anything beyond that gets into tricky stuff that I probably shouldn't even suggest.

Just for illustrative purposes, here are some thoughts about some of the problems.

Suppose you're willing to invoke the programs as separate processes, and just want to avoid the CPU RAM staging of values passed between them. TensorFlow allocates all GPU RAM by default, because that allows us to do memory management more efficiently than by using cudaMalloc. Via an option one can request that TF only allocate part of the total. TF can only apply operations to Tensor-typed memory regions that it has allocated. It should be possible to declare a Var to be resident on a GPU, which provides long-term storage with a static location, but there's no way to specify an address at which that Var should allocate. Once the Var has been established, a TF graph program can read and store its value. With considerable hackery on your part, it might be possible to modify the TF runtime to capture the GPU RAM address of an allocated Var in which you're interested, and make it available to another process. While the TF program is still live, holding the allocated memory, but idle, it might be possible to start another program which uses the GPU and can read/write, via pointer, a location it hasn't allocated itself. (GPUs are not (yet) virtualizable, so I think a second process will see the same memory contents and address space as the first.) So this might point to a crude way of alternating the actions of two separate GPU-using programs on some shared memory, so long as only one of them is a TF program.

Alternatively, you might want to call a non-TF program in the middle of a TF graph, e.g. by wrapping it in a new Op. This could work, if you're willing and able to rewrite that non-TF program to obey the TF GPU runtime assumptions about use of stream executor contexts, memory allocators, etc., which are pretty non-standard in the CUDA world. Simply trying to link and call a pre-existing GPU utility is unlikely to work.
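
To make the Op-wrapping idea a bit more concrete, here is a minimal sketch of the boilerplate such a wrapper would need. The op name WrapExternalKernel is invented, and the hard part (adapting the external code to TF's stream and allocator model) is deliberately left as a comment.

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

REGISTER_OP("WrapExternalKernel")
    .Input("in: float")
    .Output("out: float");

// GPU kernel that would hand TF-allocated device buffers to the external code.
class WrapExternalKernelOp : public OpKernel {
 public:
  explicit WrapExternalKernelOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    const Tensor& input = ctx->input(0);  // device-resident buffer on GPU
    Tensor* output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input.shape(), &output));
    // Here the external GPU routine would be invoked on the input/output
    // buffers, using TF's CUDA stream rather than its own (omitted).
  }
};

REGISTER_KERNEL_BUILDER(Name("WrapExternalKernel").Device(DEVICE_GPU),
                        WrapExternalKernelOp);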

Hope this clarifies things.

@larsmennen larsmennen (Author) commented Nov 30, 2016

Thank you again for your quick and detailed reply.

That is exactly the question. My problem does not relate to calling a non-TF program in the middle of a TF graph.

Let's say program A runs on the GPU (non-TF) and I want to feed its output (which is in GPU memory) into a TF network (program B, which uses the TF C++ API).
I understand that the easiest way is to copy the output of A back to CPU memory and then run program B. TF will then internally copy the data to GPU memory and start running the graph.

Now, I'm looking for a way to bypass staging through CPU RAM, because from a performance point of view it seems unnecessary. It would be more efficient if we could do a memory copy from GPU memory to GPU memory, i.e. copy the output of program A into the memory allocated for the input tensor of program B, both living in GPU memory.

As you mention there are some problems there.
I understand that TensorFlow cannot operate on memory that it didn't allocate, but I am using a TF allocator:

tensorflow::GPUBFCAllocator* allocator = new tensorflow::GPUBFCAllocator(0, sizeof(float) * height * width * 3);
tensorflow::Tensor input_tensor(allocator, tensorflow::DataType::DT_FLOAT, tensorflow::TensorShape({ 1, height, width, 3 }));
// <copy output data from program A into the GPU memory allocated by input_tensor using a GPU->GPU copy>

So TF has allocated a Tensor in GPU memory holding our input data (i.e. the output of program A), and I do not see why TF would still insist on staging this through CPU RAM. All of this can be done at runtime.
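
For concreteness, the GPU-to-GPU copy I have in mind would be something along these lines. This is just a sketch; the helper name and its arguments are made up.

#include <cuda_runtime.h>
#include "tensorflow/core/framework/tensor.h"

// Sketch: copy program A's device-resident output into the GPU-backed TF
// tensor allocated above, without ever touching CPU RAM.
void CopyFromProgramA(const float* program_a_output_dev,  // device pointer from A
                      tensorflow::Tensor* input_tensor,   // GPU-allocated TF tensor
                      size_t num_floats) {
  // flat<float>().data() points at the tensor's backing buffer, which is in
  // GPU memory because the tensor was created with a GPU allocator.
  float* dst = input_tensor->flat<float>().data();
  cudaMemcpy(dst, program_a_output_dev, num_floats * sizeof(float),
             cudaMemcpyDeviceToDevice);
}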

I think it would be a good feature addition to skip staging through CPU RAM if the input tensor has its memory allocated on the GPU using a TF GPU allocator.
Similarly, there could be a flag to keep output tensors in GPU memory and not stage them back to CPU RAM. It would then of course be the user's responsibility to deal with this, but it would allow users to avoid wasting time copying back and forth to CPU RAM in performance-critical environments.

The solution using Vars could work, but it feels like a bit of a workaround; making this a proper feature could be useful to other users who are trying to optimize setups where TF gets its input/output from other programs operating on GPU memory.

What are your thoughts on this?

@poxvoculi poxvoculi (Member) commented Nov 30, 2016

The session->Run() function computes the subgraph closure induced by the specified outputs; in other words, it computes forward from consts, Vars, or fed Nodes only along paths that lead to the requested outputs. Given this situation, if you have already figured out how to copy the values computed by A into a TF-allocated tensor, why not copy them into a GPU-resident Var instead of into some other kind of tensor, and skip the explicit feed argument to Run()? If you're working in C++, you can use DMAHelper to get the address of the backing buffer for the memcpy.

I agree it could be handy to have Session::Run() recognize that a feed tensor may already be GPU resident. I guess that whoever wrote that code assumed it would never happen, so it could take a while to discover everything that needs to be fixed.
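
As a rough sketch of the DMAHelper route (names are placeholders; the tensor here is assumed to be the one backing the GPU-resident Var):

#include <cuda_runtime.h>
#include "tensorflow/core/common_runtime/dma_helper.h"
#include "tensorflow/core/framework/tensor.h"

// DMAHelper exposes the raw address of a Tensor's backing buffer, so another
// GPU routine can write into a GPU-resident Var's storage device-to-device.
void WriteIntoGpuVar(tensorflow::Tensor* var_tensor,  // tensor backing the Var
                     const void* src_device_ptr,      // e.g. output of program A
                     size_t num_bytes) {
  void* dst = tensorflow::DMAHelper::base(var_tensor);
  cudaMemcpy(dst, src_device_ptr, num_bytes, cudaMemcpyDeviceToDevice);
}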

@larsmennen larsmennen (Author) commented Dec 1, 2016

Yes, using Vars would indeed be a solution for now, but I still think (as you mention) that support for GPU-resident feed tensors would be welcome.
Do you think this could be flagged as 'enhancement'?

Thank you for the pointer to DMAHelper.

@harigov harigov commented Jan 19, 2018

Is this support being considered for a future release? It could help tasks that need to eke out every last bit of performance.

@poxvoculi poxvoculi (Member) commented Jan 19, 2018

@asimshankar Is the C API going to support auto-recognition of GPU-resident tensors supplied to Session::Run()?

@poxvoculi poxvoculi assigned poxvoculi and asimshankar and unassigned poxvoculi Jan 19, 2018
@asimshankar asimshankar (Contributor) commented Jan 20, 2018

A proper implementation would likely involve some interface changes as well (perhaps the Tensor object in C++ should also point to the device/allocator that backs its TensorBuffer). Nobody is actively working on this at this time. If anyone would like to contribute a change, we'd welcome that (though it probably makes sense to discuss the plan before sending a PR).

A hacky workaround might be to create dummy input/output operations that avoid the copying. For example, the Tensor object provided to Session::Run might just contain enough metadata for the kernel to materialize a Tensor object referencing the GPU memory of interest.

@yaira24 yaira24 commented May 6, 2018

@larsmennen, did you solve this issue? I have the same problem.

@tkurmann tkurmann commented May 15, 2018

I would also be interested in knowing how to solve this issue. From my understanding, tensorflow::ops::Variable does not expose the tensor so that we can memcpy into it. Is there any other way to get the CUDA pointer from the Variable?

@omaralvarez omaralvarez commented Aug 6, 2018

Same problem here. I'm working with video in GPU memory, and copying data to the CPU and back to the GPU again is a problem: I'm working in real time, and the copies kill performance. Is there any update on this issue?

@asimshankar asimshankar (Contributor) commented Aug 6, 2018

Update: we do have an experimental means of feeding and fetching Tensors backed by GPU memory, added in commit a1d6179.

This is subject to change, but it's something y'all may want to try. See the unit test TEST(DirectSessionTest, FeedAndFetchTensorsInDeviceMemory) for an example.

This is all in the runtime implementation; there are no current plans to surface it in higher-level APIs (or Python) at this time.
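
Roughly, the mechanism looks like this (a sketch modeled on that test; the node names and device string are placeholders, and the API may change):

#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow/core/public/session.h"

using namespace tensorflow;

// Sketch: declare at callable-creation time that the feed and fetch live in
// GPU device memory, then run the callable with a GPU-backed input tensor.
Status RunWithDeviceMemory(Session* session, const Tensor& gpu_input,
                           std::vector<Tensor>* outputs) {
  CallableOptions opts;
  opts.add_feed("input_layer:0");
  opts.add_fetch("output_layer:0");
  const string gpu = "/job:localhost/replica:0/task:0/device:GPU:0";
  (*opts.mutable_feed_devices())["input_layer:0"] = gpu;
  (*opts.mutable_fetch_devices())["output_layer:0"] = gpu;
  opts.set_fetch_skip_sync(true);  // the test sets this when fetching device memory

  Session::CallableHandle handle;
  TF_RETURN_IF_ERROR(session->MakeCallable(opts, &handle));
  TF_RETURN_IF_ERROR(
      session->RunCallable(handle, {gpu_input}, outputs, /*run_metadata=*/nullptr));
  return session->ReleaseCallable(handle);
}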

@kidtronnix kidtronnix commented Aug 28, 2018

+1 on this feature, especially for python.

Let me give a concrete use case: for same-data, different-parameter/architecture jobs, we have the same dataset being used to train and evaluate over and over again. If we have 100 different combinations to try, this means we will copy from disk -> CPU memory -> GPU memory 100 times. All NVIDIA documentation suggests limiting data transfer between device and host as the primary optimization.

Because of this, TF is currently a blocker for us in certain production use cases.

@AndreyFilippov AndreyFilippov commented Sep 13, 2018

+1 too, need to feed from GPU memory

@fierval fierval commented Oct 29, 2018

Tried @asimshankar's suggested experimental code, and it works, thank you! Is there any time frame for when this will become an API that isn't liable to change at any moment?
