
Very slow TF GPU performance on Windows compared to Linux #6964

Closed
OevreFlataeker opened this issue Jun 5, 2019 · 20 comments
Labels: models:research, type:performance

Comments

@OevreFlataeker

OevreFlataeker commented Jun 5, 2019

System information

  • What is the top-level directory of the model you are using:
    models/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Windows 10 Build 1809, Anaconda 2019.03
  • TensorFlow installed from (source or binary):
Binary (Anaconda's tensorflow-gpu 1.13.1), but I also tried building from source (using Bazel)
  • TensorFlow version (use command below):
    1.13.1
  • Bazel version (if compiling from source):
    0.26 (during source compile)
  • CUDA/cuDNN version:
    10.0/7.6.0
  • GPU model and memory:
    GTX 970 4 GB
  • Exact command to reproduce:
    python models/research/object_detection/model_main.py --pipeline_config_path=pre-trained-model/pipeline.config --model_dir=training --num_train_steps=100000
(with e.g. faster_rcnn_resnet101_kitti being unpacked to "pre-trained-model")

Describe the problem

I'm fairly new to TF and am experimenting with the object_detection API for retraining existing models to detect custom objects.

I have two systems to experiment on: Windows 10 with a GTX 970 (4 GB) and Ubuntu 16.04 with a GTX 750 Ti (4 GB). Both are set up with the same TF, Anaconda, CUDA, and cuDNN versions and train the same model. I see huge differences between the Linux GPU performance and the Windows GPU performance during training and inference.

Some examples:

Model: faster_rcnn_resnet101_kitti:
Training Linux: ca. 90 sec for 100 steps
Training Windows: ca. 280 sec for 100 steps
Inference Linux: ca. 0.3 sec per image
Inference Windows: ca. 9-12 (!) sec per image

Model: ssd_mobilenet_v2_coco:
Training Linux: 140 sec for 100 steps
Training Windows: 270-360 sec for 100 steps
Inference Linux: 0.03 sec per image
Inference Windows: 0.3 sec per image

(Analogous results for ssd_inception_v2_coco and also faster_rcnn_resnet50_coco)

During training on Linux, the GTX 750 Ti is heavily used (> 90%), but on Windows I only see occasional peaks and the GPU is otherwise more or less idling at 10-15%. The Windows version DOES properly detect the GPU and logs the successful GPU placement during training and inference. CPU usage is "normal" on Linux (varying between 20-60%), whereas on Windows the CPU usage is at or near 100%. It "feels" as if TensorFlow says it uses the GPU on Windows but is actually doing everything on the CPU. nvidia-smi shows the TF Python process running on the GPU on both Windows and Linux.
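To rule out a silent CPU fallback, here is a minimal check (a sketch against the TF 1.13 API used above; the random matmul is just a throwaway workload, and nvidia-smi's utilization column is more telling than its process list):

# Ask TF 1.x whether the CUDA GPU is usable and log where ops are placed (sketch, TF 1.13 API):
python -c "import tensorflow as tf; print(tf.test.is_gpu_available(cuda_only=True))"
python -c "import tensorflow as tf; s = tf.Session(config=tf.ConfigProto(log_device_placement=True)); print(s.run(tf.matmul(tf.random_normal([1000, 1000]), tf.random_normal([1000, 1000]))).shape)"
# Watch GPU utilization and memory once per second while training runs:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1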

The CPU on both systems is a Core i5-3470 @ 3.2 GHz, with 24 GB RAM (Linux) and 32 GB RAM (Windows). Judging by the GPU alone, I would expect the Windows version to be 2-3x faster than the Linux version. I've also tried to build TF from source using Bazel but did not observe any improvement, in contrast to this post: #1942 (comment)

Source code / logs

None

@rlewkowicz

@jvishnuvardhan Any comment on this? There are lots of old bug reports about abysmal performance on Windows. One of them was confirmed and assigned, and then nothing happened.

#1942

I've also been hit by these drastic performance problems. I've tried conda, pip, source, everything. Performance is REALLY bad on Windows.

@OevreFlataeker
Author

Thank you. So - in general? Or only for a specific model?

@rlewkowicz

rlewkowicz commented Jun 17, 2019

All of the above. In particular I'm using https://github.com/zzh8829/yolov3-tf2 (a badass implementation and accompanying toolchain, I must add). Using the "tiny" version I get 7-15 ms per frame on Linux and 30-60 ms on Windows. Using Microsoft VoTT makes annotations a breeze (not that it's part of this toolchain, I just learned about it from this repo). My turnaround times on a fresh dataset are pretty low now.

I think it's related to CUDA performance on Windows. I started poking around at PyTorch issues and they have the same problem, which leads me to believe the root cause is Windows.

https://stackoverflow.com/questions/19944429/cuda-performance-penalty-when-running-in-windows

It would be interesting to see if "TCC" solves the issue. We'd need someone with a non-consumer card (because my $1,500 2080 Ti isn't enough, thanks NVIDIA) to enable TCC and see if they can alleviate the issue.
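For reference, the driver model switch itself is just an nvidia-smi call from an elevated prompt (a sketch; it requires a TCC-capable card and admin rights, usually a reboot, and the GPU can then no longer drive a display):

# Show the current driver model, then switch GPU 0 to TCC (1 = TCC, 0 = WDDM); run as Administrator:
nvidia-smi --query-gpu=driver_model.current --format=csv
nvidia-smi -i 0 -dm 1
# Switch back to the default WDDM model:
nvidia-smi -i 0 -dm 0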

@OevreFlataeker
Author

I'm far from those values, I have to say, but also far from a 2080 Ti ;-)
Need to check out VoTT, looks very promising! Hm, but is nearly everyone in the TF scene really running Linux for object_detection, and no one Windows?

@gowthamkpr added the models:research and type:performance labels on Sep 10, 2019
@tensorflowbutler
Member

Hi There,
We are checking to see if you still need help on this, as this seems to be an old issue. Please update this issue with the latest information, code snippet to reproduce your issue and error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.

@IamSierraCharlie

Just a point on this. I have kept the same hardware, the same images, the same everything, really. I've gone from Windows to Linux (Ubuntu 20) and am seeing a massive performance improvement: from 140-170 seconds per hundred steps down to 42-43 seconds per hundred steps. Unless you have no other option, it may be best to avoid training on Windows.

@pouryahoseini

I also confirm that training time on Windows 10 is at least 3 times slower than on Ubuntu 20.04, with the same models and the same RTX 3090, CUDA 11.0, cuDNN 8, and TensorFlow 2.4.0 on both. There is no specific warning on either the Windows or the Linux side.

@rjghendrikx

I also have the same issue with an RTX 3090, CUDA 11, cuDNN 8 and TF 2.4 on Windows.
It seems that NVIDIA Ampere requires CUDA >= 11.1, while the specific TensorFlow version that supports the RTX 3090 is tested against 11.0. In other words, this platform is quite new (a few months old) and I am guessing that future updates will resolve these performance issues.

@trzy

trzy commented Mar 29, 2021

I'm seeing performance that is as fast as Linux on CUDA 11.2 and TF2 for a short time (typically 1-3 epochs in Keras for a model like VGG-16) and then a mysterious drop in performance. Has anyone else observed this?
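If it helps narrow things down, one thing worth logging while the slowdown happens is whether clocks, temperature, or power draw change at the same time (a sketch; field names as listed by nvidia-smi --help-query-gpu):

# Log utilization, SM clock, temperature and power every 5 seconds alongside training (sketch):
nvidia-smi --query-gpu=timestamp,utilization.gpu,clocks.sm,temperature.gpu,power.draw --format=csv -l 5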

@mab001

mab001 commented May 10, 2021

I have the same issue with an RTX 3080, CUDA 11.2, cuDNN 8.1 and TF 2.6 (tf-nightly). It runs perfectly smoothly on Ubuntu, but the same setup on Windows is at least 3-4x slower.

@GF-Huang

Has anybody tried running on WSL? Is it still slower than native Linux?

@fperezgamonal

fperezgamonal commented Mar 31, 2023

Just to add to the responses above: in my case I see a drop in performance after every epoch in a custom loop; that is, each epoch gets slower and slower. In fact, after only 10 epochs, each epoch takes more than 2x as long.

It must be said that with the exact same code and library versions (CUDA 11.2, cuDNN 8.1.1, TensorFlow 2.9.2, etc.) on Ubuntu 22.04 there's no such drop in performance, and it's 25% faster from the very start (this gap could easily be explained by Windows reserving up to ~20% of the GPU's VRAM for itself, whereas Ubuntu only takes around 5%, if I'm not mistaken).

Other team members train on Windows 10, and while they do experience a performance drop, in their case it is constant; it does not get worse as training progresses. I must investigate further to check whether the architecture (ResNet-18) or some other particularity of my code/system setup affects this.
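One way to check whether this is memory-related (e.g. the working set slowly spilling into WDDM-managed system memory) is to log per-process GPU memory across epochs (a sketch; under WDDM nvidia-smi may report N/A here, in which case Task Manager's per-process dedicated GPU memory column is an alternative):

# Log per-process GPU memory every 30 seconds to see whether it keeps growing across epochs (sketch):
nvidia-smi --query-compute-apps=timestamp,pid,process_name,used_memory --format=csv -l 30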

@OevreFlataeker
Author

I have to say I am astonished that so many people are observing this behavior, but it just seems to be the way it is? Is this something the TensorFlow team could do anything about, or is it purely an NVIDIA topic with respect to their CUDA/driver implementation on Windows compared to Linux? What do you think?

@fperezgamonal

I'm astonished as well. Luckily, I've been using Ubuntu exclusively at home and dual-booting at work from the beginning, running all my experiments there for performance reasons.

I don't know if you're aware of this, but the TensorFlow devs have announced that version 2.10.0 is the last with native GPU support on Windows. Later versions will need to use WSL (Windows Subsystem for Linux) or simply fall back to the CPU... So I think they've clearly decided to drop native Windows users altogether.
[screenshot of the TensorFlow announcement referenced above]
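For anyone who wants to stay on native Windows with GPU support for now, the practical consequence is pinning the last native-GPU release (a sketch, following my reading of the install docs; the versions shown are the CUDA/cuDNN pair documented for TF 2.10):

conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
python -m pip install "tensorflow<2.11"
# Verify install:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"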

Regarding your technical question about why there's this slowdown, I'm still researching it, after noticing the difference in allocated memory a few years ago. I think it relates to the WDDM stack management. This super old StackOverflow post explains it a little (although I did not dive into the hardware details, as they're beyond my knowledge).
The important bit is that, supposedly, Quadro and Titan GPUs can run in a different operation mode, TCC, which is reportedly significantly faster, getting closer to Linux performance. The only caveat is that the GPU can then no longer drive a display.

I suppose these significant drops in performance do not affect gaming, or else the entire gaming community would be on fire (or, to look on the bright side, we might finally get some decent gaming support on Linux :) )

Have a nice one!

@OevreFlataeker
Author

Thanks for those insights! Interesting development. I am fine with it if it works decently with WSL2. I am just surprised, as said, because I presume many people have their beefy GPU in a Windows box for the latest and greatest games and will hardly buy a dedicated one for a Linux box just for doing ML from time to time. At least that is how it is in my case, being an ML hobbyist (RTX 2060 on Win10)... (Though I do have a GTX 750 Ti in my Linux server just for ML, but of course this is nothing compared to current models.)

@fperezgamonal

I guess it should work decently with WSL2 and the performance should be fairly close. However, I have not tested this yet, as I prefer to use my RTX 3080 Ti on Linux with a TensorFlow pip package compiled from source for maximum performance (and I love the Linux environment anyway; it's my working environment, and the scripting is just far superior IMHO).

@OevreFlataeker
Author

OevreFlataeker commented Mar 31, 2023

Hm, I just followed the setup instructions (installing conda and all prerequisites), but it seems WSL2 can't see my RTX?

$ ...
$ ...
$ CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
$ echo ${CUDNN_PATH}
/home/xxx/anaconda3/lib/python3.10/site-packages/nvidia/cudnn
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib
$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-03-31 11:30:32.107966: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[]
$ python3
Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-03-31 11:42:08.723618: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> print(tf.reduce_sum(tf.random.normal([1000, 1000])))
tf.Tensor(449.34717, shape=(), dtype=float32)
>>> print(tf.config.list_physical_devices('GPU'))
[]
>>>

Shouldn't a wealth of debug info normally be printed on import? (The NVIDIA driver is 516.94.) But apparently it recognizes it is an RTX?

@fperezgamonal

fperezgamonal commented Mar 31, 2023

Hello,

I haven't had time to test it on my end. I'll try to do so on Monday and let you know if it detects my RTX 3080 Ti!

Cheers!

@OevreFlataeker
Author

Thanks, it's likely an error on my side, yet I don't see it. I found out I hadn't done https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local after I set up WSL2, but it still doesn't work:

(base) $ CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
(base) $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib
(base) $ python3
Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-03-31 17:52:17.404049: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> print(tf.config.list_physical_devices('GPU'))
2023-03-31 17:52:20.904015: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (100)
[]
>>>
(base) $ nvidia-smi
Failed to initialize NVML: GPU access blocked by the operating system
Failed to properly shut down NVML: GPU access blocked by the operating system

Wondering about the "access blocked by the operating system" error...

@OevreFlataeker
Author

OevreFlataeker commented Mar 31, 2023

It was a matter of reinstalling/updating WSL2 after the setup I already had done before!
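Concretely, the part that mattered (a sketch of the commands involved; the first two run from an elevated Windows PowerShell/cmd prompt, and no NVIDIA driver is installed inside the distro since WSL2 reuses the Windows driver):

# From Windows: update WSL2 and restart it so the GPU driver passthrough is refreshed (sketch):
wsl --update
wsl --shutdown
# Back inside the Ubuntu distro, the WSL-provided CUDA libraries and nvidia-smi should now be available:
ls /usr/lib/wsl/lib
nvidia-smi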

(base) $ nvidia-smi
Fri Mar 31 18:01:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 531.41       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060         On | 00000000:01:00.0  On |                  N/A |
| 11%   47C    P8               16W / 160W|   1043MiB /  6144MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        22      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
(base) $ CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
(base) $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib
(base) $ python3
Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-03-31 18:01:20.240763: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> print(tf.config.list_physical_devices('GPU'))
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>>

So, steps to follow: reinstall/update WSL2 as above, then finish up with the actual TensorFlow install instructions:

conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib
# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Done and should work!
