Low GPU utilization with tfjs-node-gpu #468

Closed
brannondorsey opened this issue Jun 25, 2018 · 10 comments

@brannondorsey
Contributor

TensorFlow.js version

  "dependencies": {
    "@tensorflow/tfjs": "^0.11.4",
    "@tensorflow/tfjs-node": "^0.1.5",
    "@tensorflow/tfjs-node-gpu": "^0.1.7",
}

Browser version

N/A. Node v8.9.4. Ubuntu 16.04

Describe the problem or feature request

Using tfjs-node-gpu, I can't seem to get GPU utilization above ~0-3%. I have CUDA 9 and CuDNN 7.1 installed, am importing @tensorflow/tfjs-node-gpu, and am setting the "tensorflow" backend with tf.setBackend('tensorflow'). CPU usage is at 100% on one core, but GPU utilization is practically none. I've tried tfjs-examples/baseball-node (replacing import '@tensorflow/tfjs-node' with import '@tensorflow/tfjs-node-gpu', of course) as well as my own custom LSTM code. Does tfjs-node-gpu actually run its operations on the GPU?
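
For reference, this is roughly how I'm wiring things up (a minimal sketch; the import/require style depends on your build setup):

// Minimal setup sketch: import the core library, register the GPU binding,
// and select the 'tensorflow' backend.
const tf = require('@tensorflow/tfjs');
require('@tensorflow/tfjs-node-gpu');

tf.setBackend('tensorflow');
console.log(tf.getBackend()); // expect 'tensorflow'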

Code to reproduce the bug / link to feature request

# assumes CUDA 9, CuDNN 7.1, and latest nvidia drivers are already installed
git clone https://github.com/tensorflow/tfjs-examples
cd tfjs-examples/baseball-node

# replace tfjs-node import with tfjs-node-gpu
sed -i s/tfjs-node/tfjs-node-gpu/ src/server/server.ts

# install dependencies and download data
yarn add @tensorflow/tfjs-node-gpu
yarn && yarn download-data

# start the server
yarn start-server

Now open another terminal and watch GPU usage. Note that if you are running the process on the same GPU as an X window server, GPU usage will likely be greater than 3% because of that process. I've tested this on a dedicated GPU running no other processes using the CUDA_VISIBLE_DEVICES env var.

# monitor GPU utilization
watch -n 0.1 nvidia-smi
@nkreeger
Contributor

nkreeger commented Jun 27, 2018

Hi Brannon,

Apologies for the delay - I was out on holiday. So for some neural networks, the GPU can actually be slower than the CPU. This happens because there is a cost to copy tensor data from host memory over to GPU memory. The baseball network is a simple 4-layer network of nothing more than relus and a sigmoid at the end. These types of networks are slower on GPU because of all the copying to GPU memory.

If you want to take advantage of the GPU, use a network that has some level of pooling and/or convolutions. For example, in the tfjs-examples repo, we have an MNIST example that runs entirely in Node:

https://github.com/tensorflow/tfjs-examples/tree/master/mnist-node

This runs super fast on the GPU since convolutions are well optimized for CUDA. Try running that example while watching with your nvidia-smi tool.
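
For a rough sense of the kind of workload that keeps the GPU busy, something along these lines (a sketch only - layer sizes are arbitrary and the data is random) should show noticeably higher utilization in nvidia-smi than the baseball model:

// Sketch: a small conv/pool network trained on random data, just to exercise the GPU.
const tf = require('@tensorflow/tfjs');
require('@tensorflow/tfjs-node-gpu');
tf.setBackend('tensorflow');

const model = tf.sequential();
model.add(tf.layers.conv2d({inputShape: [28, 28, 1], filters: 32, kernelSize: 3, activation: 'relu'}));
model.add(tf.layers.maxPooling2d({poolSize: 2}));
model.add(tf.layers.conv2d({filters: 64, kernelSize: 3, activation: 'relu'}));
model.add(tf.layers.maxPooling2d({poolSize: 2}));
model.add(tf.layers.flatten());
model.add(tf.layers.dense({units: 10, activation: 'softmax'}));
model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy']});

// Random inputs/labels are enough to watch utilization while training runs.
const xs = tf.randomNormal([512, 28, 28, 1]);
const ys = tf.oneHot(tf.randomUniform([512], 0, 10, 'int32'), 10).toFloat();

model.fit(xs, ys, {epochs: 5, batchSize: 128}).then(() => console.log('done'));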

@brannondorsey
Contributor Author

Ah, I see. Running the mnist-node example on a dedicated GTX 1060 with no other GPU processes does generate ~20% GPU utilization. What mechanism is used to automagically determine whether a model graph will be run on the GPU or stay on the CPU? As I mentioned in the OP, I also tried this on my own custom RNN (browser code here). There I would have expected the GPU to be used, as it is with a nearly identical model in Keras.

If my memory serves me correctly, Python TensorFlow gives the option to specify the CPU/GPU device explicitly. Does no such functionality exist in tfjs-node?

@nkreeger
Contributor

Python + Keras uses graph-based execution, which can run faster on the GPU. We use the eager-style API from TensorFlow, which does not actually have a graph of placeholders - it allocates new Tensors for op output. This is probably why you see Keras utilizing the GPU more.

We do not have a device API yet - it's something we're considering down the road once we introduce TPU support. For now, we default to Tensor placement using the default settings in TensorFlow eager (i.e. copy all non-int32 tensors).

@brannondorsey
Contributor Author

Gotcha. Thanks for that clarification. I've revisited the char-rnn tfjs-node-gpu example I was telling you about, and it looks like it is indeed running on the GPU, as memory is allocated, but GPU utilization is ~1%. If I'm understanding you correctly, this is because tfjs-node-gpu is using TF Eager mode. So I should expect the same type of model to run at ~1% GPU utilization if it were written in Python using TF Eager mode as well, correct?

Does tfjs-node-gpu intend to add support for graph-based execution at some point in the near future? Unless I'm missing something, this "Eager mode only" behavior creates some significant performance hurdles, no? In general, how does tfjs-node-gpu compare in performance to similar implementations in Keras?

I ask because I'm writing some documentation for my team and am beginning to consider a JavaScript-first approach to common high-level ML tasks. A year ago that would have seemed like a crazy idea, but with tfjs, maybe not so. Basically, I'm curious whether tfjs-node-gpu will ever be comparable in performance to Keras and Python TensorFlow.

@brannondorsey
Contributor Author

@nkreeger, any thoughts on a few of these last questions?

@aligajani

@brannondorsey My opinion on the last question: tfjs won't be as fast as tfpy.

@AbhimanyuAryan

AbhimanyuAryan commented Jun 6, 2020

@nkreeger something is wrong with training using tfjs-node-gpu. It's still training on the CPU. I'm using a 2080 Ti. This is the MNIST example from the tfjs repository.

(tf) abhimanyuaryan@hackintosh:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

[screenshot attached: Selection_012]

@joshverd

joshverd commented Sep 5, 2020

I'm seeing epoch cycles take over double the time on my GPU when compared to my CPU. Is there any way to improve this?

@jeffcrouse

Sorry to piggyback on this issue, but I think I am having a similar problem. I'm trying to use @tensorflow/tfjs-node-gpu and @tensorflow-models/body-pix on Windows (CUDA 11.2 and cuDNN 8.1). A single inference is taking almost 2000ms. Meanwhile, the WebGL demo takes about 15ms per frame.

I see this in my terminal when I run my script (https://gist.github.com/jeffcrouse/750f26afdaedb4d6cd0a523ed591dccc):

C:\Users\jeff\Desktop\tfjs_test>node index.js
2022-01-21 10:20:14.576371: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-21 10:20:15.138140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3965 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5
2022-01-21 10:20:16.174617: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8101
1996 milliseconds

And I see a spike in my GPU usage, but it's still SUPER SLOW.

To be totally honest, I can't follow most of the discussion above, but I'm curious if anyone can explain to me why the browser WebGL performance would be 130x better than GPU-backed node? I understand the idea of copying data to the GPU being slow, but why isn't this an issue for the browser?

Thanks in advance!
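
For context, the script boils down to roughly this (a trimmed sketch, not the exact gist; the image path is a placeholder):

// Trimmed sketch of the test: load body-pix, decode one image, time one segmentation.
const fs = require('fs');
const tf = require('@tensorflow/tfjs-node-gpu');
const bodyPix = require('@tensorflow-models/body-pix');

async function main() {
  const net = await bodyPix.load();
  const image = tf.node.decodeImage(fs.readFileSync('./frame.jpg'));

  const t0 = Date.now();
  await net.segmentPerson(image);
  console.log(`${Date.now() - t0} milliseconds`);

  image.dispose();
}

main();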

@f4z3k4s

f4z3k4s commented Jan 27, 2022

We actually experience the same. Running our model on the CPU takes ~400ms; running it on the GPU takes ~3000ms. This happens on a server with two NVIDIA GeForce RTX 3090s and CUDA 11.6 with cuDNN 8.3. Relevant logs:

2022-01-27 22:48:03.044007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 19758 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6
2022-01-27 22:48:03.044598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22307 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:b4:00.0, compute capability: 8.6
2022-01-27 22:48:04.985189: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8302
2022-01-27 22:48:06.383271: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

I can confirm that CUDA is installed correctly, as I am able to utilize it with several other tools.

This does not happen in the browser, though; running on WebGL is way faster than CPU inference.


UPDATE: I have to admit that I was only testing this by doing a single inference instead of 100s or 1000s. I created test suites for larger numbers of inferences, and it is indeed true that copying the model to GPU memory is what takes a lot of the time. After that's done, GPU inference is way faster than CPU inference:

GPU info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:65:00.0 Off |                  N/A |
|  0%   26C    P8    34W / 390W |   2552MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:B4:00.0 Off |                  N/A |
|  0%   28C    P8    24W / 350W |      3MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    360790      C   ...9TtSrW0h-py3.7/bin/python     2549MiB |

CPU info:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping:                        4
CPU MHz:                         1000.089
CPU max MHz:                     3200.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4600.00
Virtualization:                  VT-x
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        12 MiB
L3 cache:                        16.5 MiB
NUMA node0 CPU(s):               0-23

The following are the results of averaging 100 inferences on a warm GPU (the model is loaded into GPU memory and not disposed between model.execute calls); a sketch of the timing loop follows the table:

Model            GPU        CPU
yolov5 s model   61.9 ms    146.6 ms
yolov5 m model   73.9 ms    255.1 ms
yolov5 l model   85.1 ms    386.4 ms
yolov5 x model   97.3 ms    609.1 ms
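
The timing loop looked roughly like this (a sketch; the model URL and the 640x640 input shape are placeholders for our YOLOv5 graph models):

// Sketch: one warm-up execute so the weight upload to GPU memory is excluded,
// then the average over 100 executions.
const tf = require('@tensorflow/tfjs-node-gpu');

async function benchmark(modelUrl, runs = 100) {
  const model = await tf.loadGraphModel(modelUrl);
  const input = tf.zeros([1, 640, 640, 3]);

  tf.dispose(model.execute(input)); // warm-up: pays the one-time copy to GPU memory

  const t0 = Date.now();
  for (let i = 0; i < runs; i++) {
    tf.dispose(model.execute(input));
  }
  console.log(`avg over ${runs} runs: ${((Date.now() - t0) / runs).toFixed(1)} ms`);

  input.dispose();
}

benchmark('file://./yolov5s_web_model/model.json');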
