Low GPU utilization with tfjs-node-gpu #468

Closed
brannondorsey opened this issue Jun 25, 2018 · 10 comments

@brannondorsey
Contributor

TensorFlow.js version

  "dependencies": {
    "@tensorflow/tfjs": "^0.11.4",
    "@tensorflow/tfjs-node": "^0.1.5",
    "@tensorflow/tfjs-node-gpu": "^0.1.7",
}

Browser version

N/A. Node v8.9.4. Ubuntu 16.04

Describe the problem or feature request

Using tfjs-node-gpu, I can't seem to get GPU utilization above ~0-3%. I have CUDA 9 and CuDNN 7.1 installed, am importing @tensorflow/tfjs-node-gpu, and am setting the "tensorflow" backend with tf.setBackend('tensorflow'). CPU usage is at 100% on one core, but GPU utilization is practically none. I've tried tfjs-examples/baseball-node (replacing import '@tensorflow/tfjs-node' with import '@tensorflow/tfjs-node-gpu', of course) as well as my own custom LSTM code. Does tfjs-node-gpu actually run its operations on the GPU?
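
For reference, this is roughly how I'm wiring things up (a minimal sketch; the import/require style depends on your build setup):

// Minimal setup sketch: import the core library, register the GPU binding,
// and select the 'tensorflow' backend.
const tf = require('@tensorflow/tfjs');
require('@tensorflow/tfjs-node-gpu');

tf.setBackend('tensorflow');
console.log(tf.getBackend()); // expect 'tensorflow'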

Code to reproduce the bug / link to feature request

# assumes CUDA 9, CuDNN 7.1, and latest nvidia drivers are already installed
git clone https://github.com/tensorflow/tfjs-examples
cd tfjs-examples/baseball-node

# replace tfjs-node import with tfjs-node-gpu
sed -i s/tfjs-node/tfjs-node-gpu/ src/server/server.ts

# install dependencies and download data
yarn add @tensorflow/tfjs-node-gpu
yarn && yarn download-data

# start the server
yarn start-server

Now open another terminal and watch GPU usage. Note that if you are running the process on the same GPU as an X window server, GPU usage will likely be greater than 3% because of that process. I've tested this on a dedicated GPU running no other processes using the CUDA_VISIBLE_DEVICES env var.

# monitor GPU utilization
watch -n 0.1 nvidia-smi
@nkreeger
Contributor

nkreeger commented Jun 27, 2018

Hi Brannon,

Apologies for the delay - I was out on holiday. So for some neural networks, the GPU can actually be slower than the CPU. This happens because there is a cost to copy tensor data from host memory over to GPU memory. The baseball network is a simple 4-layer network of nothing more than relus and a sigmoid at the end. These types of networks are slower on GPU because of all the copying to GPU memory.

If you want to take advantage of the GPU, use a network that has some level of pooling and/or convolutions. For example, in the tfjs-examples repo, we have an MNIST example that runs entirely in Node:

https://github.com/tensorflow/tfjs-examples/tree/master/mnist-node

This runs super fast on the GPU since convolutions are well optimized for CUDA. Try running that example while watching with your nvidia-smi tool.
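
For a rough sense of the kind of workload that keeps the GPU busy, something along these lines (a sketch only - layer sizes are arbitrary and the data is random) should show noticeably higher utilization in nvidia-smi than the baseball model:

// Sketch: a small conv/pool network trained on random data, just to exercise the GPU.
const tf = require('@tensorflow/tfjs');
require('@tensorflow/tfjs-node-gpu');
tf.setBackend('tensorflow');

const model = tf.sequential();
model.add(tf.layers.conv2d({inputShape: [28, 28, 1], filters: 32, kernelSize: 3, activation: 'relu'}));
model.add(tf.layers.maxPooling2d({poolSize: 2}));
model.add(tf.layers.conv2d({filters: 64, kernelSize: 3, activation: 'relu'}));
model.add(tf.layers.maxPooling2d({poolSize: 2}));
model.add(tf.layers.flatten());
model.add(tf.layers.dense({units: 10, activation: 'softmax'}));
model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy']});

// Random inputs/labels are enough to watch utilization while training runs.
const xs = tf.randomNormal([512, 28, 28, 1]);
const ys = tf.oneHot(tf.randomUniform([512], 0, 10, 'int32'), 10).toFloat();

model.fit(xs, ys, {epochs: 5, batchSize: 128}).then(() => console.log('done'));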

@brannondorsey
Contributor Author

Ah, I see. Running the mnist-node example on a dedicated GTX 1060 with no other GPU processes does generate ~20% GPU utilization. What mechanism is used to automagically determine whether a model graph will be run on the GPU or stay on the CPU? As I mentioned in the OP, I also tried this on my own custom RNN (browser code here). There I would have expected the GPU to be used, as it is with a nearly identical model in Keras.

If my memory serves me correctly, Python TensorFlow gives the option to specify the CPU/GPU device explicitly. Does no such functionality exist in tfjs-node?

@nkreeger
Contributor

Python + Keras uses graph-based execution, which can run faster on the GPU. We use the eager-style API from TensorFlow, which does not actually have a graph of placeholders - it allocates new Tensors for op output. This is probably why you see Keras utilizing the GPU more.

We do not have a device API yet - it's something we're considering down the road once we introduce TPU support. For now, we default to Tensor placement using the default settings in TensorFlow eager (i.e. copy all non-int32 tensors).

@brannondorsey
Contributor Author

Gotcha. Thanks for that clarification. I've revisited the char-rnn tfjs-node-gpu example I was telling you about, and it looks like it is indeed running on the GPU, as memory is allocated, but GPU utilization is ~1%. If I'm understanding you correctly, this is because tfjs-node-gpu is using TF Eager mode. So I should expect the same type of model to run at ~1% GPU utilization if it were written in Python using TF Eager mode as well, correct?

Does tfjs-node-gpu intend to add support for graph-based execution at some point in the near future? Unless I'm missing something, this "Eager mode only" behavior creates some significant performance hurdles, no? In general, how does tfjs-node-gpu compare in performance to similar implementations in Keras?

I ask because I'm writing some documentation for my team and am beginning to consider a JavaScript-first approach to common high-level ML tasks. A year ago that would have seemed like a crazy idea, but with tfjs, maybe not so. Basically, I'm curious whether tfjs-node-gpu will ever be comparable in performance to Keras and Python TensorFlow.

@brannondorsey
Contributor Author

@nkreeger, any thoughts on a few of these last questions?

@aligajani

@brannondorsey My opinion on the last question: tfjs won't be as fast as tfpy.

@AbhimanyuAryan

AbhimanyuAryan commented Jun 6, 2020

@nkreeger something is wrong with training using tfjs-node-gpu. It's still training on the CPU. I'm using a 2080 Ti. This is the MNIST example from the tfjs repository.

(tf) abhimanyuaryan@hackintosh:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

[screenshot attached: Selection_012]

@joshverd

joshverd commented Sep 5, 2020

I'm seeing epoch cycles take over double the time on my GPU when compared to my CPU. Is there any way to improve this?

@jeffcrouse

Sorry to piggyback on this issue, but I think I am having a similar problem. I'm trying to use @tensorflow/tfjs-node-gpu and @tensorflow-models/body-pix on Windows (CUDA 11.2 and cuDNN 8.1). A single inference is taking almost 2000ms. Meanwhile, the WebGL demo takes about 15ms per frame.

I see this in my terminal when I run my script (https://gist.github.com/jeffcrouse/750f26afdaedb4d6cd0a523ed591dccc):

C:\Users\jeff\Desktop\tfjs_test>node index.js
2022-01-21 10:20:14.576371: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-21 10:20:15.138140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3965 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5
2022-01-21 10:20:16.174617: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8101
1996 milliseconds

And I see a spike in my GPU usage, but it's still SUPER SLOW.

To be totally honest, I can't follow most of the discussion above, but I'm curious if anyone can explain to me why the browser WebGL performance would be 130x better than GPU-backed node? I understand the idea of copying data to the GPU being slow, but why isn't this an issue for the browser?

Thanks in advance!
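
For context, the script boils down to roughly this (a trimmed sketch, not the exact gist; the image path is a placeholder):

// Trimmed sketch of the test: load body-pix, decode one image, time one segmentation.
const fs = require('fs');
const tf = require('@tensorflow/tfjs-node-gpu');
const bodyPix = require('@tensorflow-models/body-pix');

async function main() {
  const net = await bodyPix.load();
  const image = tf.node.decodeImage(fs.readFileSync('./frame.jpg'));

  const t0 = Date.now();
  await net.segmentPerson(image);
  console.log(`${Date.now() - t0} milliseconds`);

  image.dispose();
}

main();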

@f4z3k4s

f4z3k4s commented Jan 27, 2022

We actually experience the same. Running our model on the CPU takes ~400ms; running it on the GPU takes ~3000ms. This happens on a server with two NVIDIA GeForce RTX 3090s and CUDA 11.6 with cuDNN 8.3. Relevant logs:

2022-01-27 22:48:03.044007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 19758 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6
2022-01-27 22:48:03.044598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22307 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:b4:00.0, compute capability: 8.6
2022-01-27 22:48:04.985189: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8302
2022-01-27 22:48:06.383271: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

I can confirm that CUDA is installed correctly, as I am able to utilize it with several other tools.

This does not happen in the browser, though; running on WebGL is way faster than CPU inference.


UPDATE: I have to admit that I was only testing this by doing a single inference instead of 100s or 1000s. I created test suites for larger numbers of inferences, and it is indeed true that copying the model to GPU memory is what takes a lot of the time. After that's done, GPU inference is way faster than CPU inference:

GPU info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:65:00.0 Off |                  N/A |
|  0%   26C    P8    34W / 390W |   2552MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:B4:00.0 Off |                  N/A |
|  0%   28C    P8    24W / 350W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    360790      C   ...9TtSrW0h-py3.7/bin/python     2549MiB |

CPU info:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping:                        4
CPU MHz:                         1000.089
CPU max MHz:                     3200.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4600.00
Virtualization:                  VT-x
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        12 MiB
L3 cache:                        16.5 MiB
NUMA node0 CPU(s):               0-23

The following are the results of averaging 100 inferences on a warm GPU (the model is loaded into GPU memory and not disposed between model.execute calls); a sketch of the timing loop follows the table:

Model            GPU        CPU
yolov5 s model   61.9 ms    146.6 ms
yolov5 m model   73.9 ms    255.1 ms
yolov5 l model   85.1 ms    386.4 ms
yolov5 x model   97.3 ms    609.1 ms
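
The timing loop looked roughly like this (a sketch; the model URL and the 640x640 input shape are placeholders for our YOLOv5 graph models):

// Sketch: one warm-up execute so the weight upload to GPU memory is excluded,
// then the average over 100 executions.
const tf = require('@tensorflow/tfjs-node-gpu');

async function benchmark(modelUrl, runs = 100) {
  const model = await tf.loadGraphModel(modelUrl);
  const input = tf.zeros([1, 640, 640, 3]);

  tf.dispose(model.execute(input)); // warm-up: pays the one-time copy to GPU memory

  const t0 = Date.now();
  for (let i = 0; i < runs; i++) {
    tf.dispose(model.execute(input));
  }
  console.log(`avg over ${runs} runs: ${((Date.now() - t0) / runs).toFixed(1)} ms`);

  input.dispose();
}

benchmark('file://./yolov5s_web_model/model.json');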
