
TENSORFLOW 1.14 STYLEGAN 2 PERFORMANCE ISSUE ON RTX 3090 MULTIPLE GPU #44200

Closed
Thunder003 opened this issue Oct 21, 2020 · 22 comments
Assignees
Labels
comp:gpu GPU related issues TF 1.14 for issues seen with TF 1.14 type:performance Performance Issue

Comments

@Thunder003

Thunder003 commented Oct 21, 2020

I am running the StyleGAN 2 model on 4x RTX 3090, and training takes much longer to start up than on 1x RTX 3090. Once training does start, though, it finishes earlier on 4x than on 1x. I am using CUDA 11.1 and TensorFlow 1.14 with both GPU setups.

Secondly, with 1x RTX 2080 Ti (CUDA 10.2, TensorFlow 1.14), training starts in much less time than with 1x RTX 3090 (CUDA 11.1, TensorFlow 1.14). Roughly, it takes 5 minutes on 1x RTX 2080 Ti, 30-35 minutes on 1x RTX 3090, and 1.5 hours on 4x RTX 3090 to start training for one of the datasets.

I'll be grateful if anyone can help me resolve this issue.

I am using Ubuntu 16.04, a Core™ i9-10980XE CPU, and 32 GB RAM in both the 2080 Ti and 3090 machines.
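A plausible explanation for the slow start-up (an inference, not confirmed in this thread): TF 1.14 binaries contain no compiled kernels for Ampere (compute capability 8.6), so CUDA JIT-compiles everything from PTX at start-up, once per GPU process. A quick way to confirm the compute capability (the `compute_cap` query field assumes a fairly recent nvidia-smi; older drivers can use the CUDA `deviceQuery` sample instead):

```shell
# Print each GPU's name and compute capability; an RTX 3090 reports 8.6,
# which TF 1.14's pre-built CUDA kernels were never compiled for.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```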

@Thunder003 Thunder003 added the type:performance Performance Issue label Oct 21, 2020
@ravikyram ravikyram added comp:gpu GPU related issues TF 1.14 for issues seen with TF 1.14 labels Oct 21, 2020
@ravikyram
Contributor

@Thunder003

Can you please share standalone code to reproduce the issue in our environment? It helps us localize the issue faster. Thanks!

@ravikyram ravikyram assigned ymodak and unassigned ravikyram Oct 21, 2020
@Thunder003
Author

@ravikyram check the code here: https://github.com/NVlabs/stylegan2


@Thunder003 Thunder003 changed the title TENSORFLOW 1.4 STYLEAGN 2 PERFORMANCE ISSUE ON RTX 3090ti MULTIPLE GPU TENSORFLOW 1.14 STYLEAGN 2 PERFORMANCE ISSUE ON RTX 3090ti MULTIPLE GPU Oct 21, 2020
@Thunder003 Thunder003 changed the title TENSORFLOW 1.14 STYLEAGN 2 PERFORMANCE ISSUE ON RTX 3090ti MULTIPLE GPU TENSORFLOW 1.14 STYLEAGN 2 PERFORMANCE ISSUE ON RTX 3090 MULTIPLE GPU Oct 22, 2020
@ymodak
Contributor

ymodak commented Oct 23, 2020

Are you building TensorFlow from source against those CUDA versions (10.2 and 11.1)?
If you are using pre-built pip packages, then I suspect your GPU is not being utilized for computing, because the TF (2.3) packages we currently ship support CUDA 10.1.
See tested build configurations to know more.

On a side note, TF 1.14 is out of the support window; you may want to try the latest TF versions, such as 2.3, which offer much better performance. Thanks!

@ymodak ymodak added the stat:awaiting response Status - Awaiting response from author label Oct 23, 2020
@Thunder003
Author

@ymodak as mentioned on the StyleGAN 2 GitHub page (https://github.com/NVlabs/stylegan2), it is compatible with TF 1.14 and 1.15 only, and 1.15 is not working with CUDA 11.1. Since I am using an NVIDIA RTX 3090, an Ampere GPU requiring CUDA 11.1, one possible cause is the version mismatch described at https://www.tensorflow.org/install/gpu. Can you please confirm whether this is the main problem, and if so, suggest a solution for it?

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 26, 2020
@mihaimaruseac
Collaborator

You should switch to TF 2.x. We no longer fix code on TF 1.x.

@ymodak
Contributor

ymodak commented Oct 26, 2020

@Thunder003 That's correct. Your configuration is not using the GPU's computing power due to incompatible CUDA versions. We do not provide TF binaries that support CUDA 11.1 at the moment.
For this you may try building TF from source yourself.

On a side note, the current tf-nightly version supports CUDA 11.0.
If you are okay with an unstable version (tf-nightly), you can give it a try, or wait for the upcoming stable TF 2.4 release.
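A minimal sketch of trying the nightly route suggested above (package name as of late 2020; exact CUDA support varies by nightly, so verify against the release notes):

```shell
python -m pip install --upgrade pip
python -m pip install tf-nightly-gpu    # nightly wheels at the time were built against CUDA 11.0
python -c "import tensorflow as tf; print(tf.__version__, tf.test.is_built_with_cuda())"
```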

@ymodak ymodak added the stat:awaiting response Status - Awaiting response from author label Oct 26, 2020
@Thunder003
Author

Thanks, @ymodak, for your answer. I'm trying to build from source but getting this error:

Could not find any cudnn.h matching version '8' in any subdirectory:
''
'include'
'include/cuda'
'include/*-linux-gnu'
'extras/CUPTI/include'
'include/cuda/CUPTI'
of:
'/lib'
'/lib/x86_64-linux-gnu'
'/usr'
'/usr/include/'
'/usr/include/cudnn.h'
'/usr/local/cuda'
'/usr/local/cuda-11.1'
'/usr/local/cuda-11.1/targets/x86_64-linux/lib'
Asking for detailed CUDA configuration...

I have checked that cuDNN is properly installed at /usr/include/cudnn.h. Following this, I copy-pasted a cudnn.h file into /usr/local/cuda/ and the libcudnn* cuDNN installation files into /usr/local/cuda. Can you please tell me a solution for this?

I am building TF 1.14 with CUDA 11.1 & cuDNN 8.0.4
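For what it's worth, this exact configure failure is commonly reported with cuDNN 8: the CUDNN_MAJOR/MINOR/PATCHLEVEL defines moved from cudnn.h into a new cudnn_version.h, which TF 1.14's configure script does not read. A hedged sketch of the usual workaround (paths assume a Debian-style cuDNN install; adjust to where your headers actually live):

```shell
# TF 1.14's configure greps cudnn.h for the version macros, but cuDNN 8
# keeps them in cudnn_version.h - append them back where configure looks.
sudo sh -c 'grep -E "define CUDNN_(MAJOR|MINOR|PATCHLEVEL)" \
    /usr/include/cudnn_version.h >> /usr/include/cudnn.h'
export TF_CUDNN_VERSION=8
export CUDNN_INSTALL_PATH=/usr   # headers in /usr/include, libs in /usr/lib/x86_64-linux-gnu
./configure
```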

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 31, 2020
@mihaimaruseac
Collaborator

Note that the code might also need to change to support newer versions of CUDA.

@psycho2012

psycho2012 commented Nov 3, 2020

Oh, I have the same problem.

I have one RTX 3090 with CUDA 11.1 on Windows 10. I used Conda to install cudatoolkit, which provides CUDA 10.1 and cuDNN 7.6, and I have TensorFlow 1.14 installed. TensorFlow recognized my RTX 3090 fine, but it spent a very long time to begin, or to finish (I had to go to sleep ...), the training process.

I wonder: if I build TensorFlow 1.14 (yes, I need this old version) from source against CUDA 11 and cuDNN 8, can I use the RTX 3090 properly? Thanks.

@Thunder003
Author

@mihaimaruseac, yes, the code might need to change. To make sure, I'm trying to build from source, but I got stuck on another problem (as mentioned in my earlier comment). Can you take a look at that? Or, if you think it's off-topic, I can raise another issue for it.

@psycho2012 have you gotten any good images with TF 1.14 on an RTX 3090? I'm getting only black images with an RTX 3090 running CUDA 11.1 and TF 1.14 (probably a compatibility issue between TF and the CUDA version). If you are getting good images with StyleGAN, can you tell me which versions of CUDA, cuDNN, and TF you are using with the RTX 3090?


@psycho2012

@Thunder003 I didn't get any good results.

@Thunder003
Author

@psycho2012 can you tell me the versions of CUDA and cuDNN you used with TF 1.14 on the RTX 3090?

@psycho2012

@Thunder003 CUDA 11.1 and cuDNN 8.0.4, but I failed to run TF 1.14 on the GPU. I also tried to build TF 1.14 from source, but that failed too. Based on my experiments, I think TF 1.14 cannot support CUDA 11.1.

@Thunder003
Author

Thunder003 commented Nov 5, 2020

@psycho2012 thanks for your answer. It seems TF 1.14 is incompatible with CUDA 11.1. One thing that has me scratching my head is the kind of error I'm getting when building from source (TF 1.14, CUDA 11.1, cuDNN 8.0.4); please check the image. I have checked the paths and files, and they are correct. Are you getting stuck at the same step?

(screenshot of the configure error)

I have also added more paths for the libcudnn files, but the error persists. If you are not stuck at this step, can you tell me the paths you provided, just for reference (I know they may differ)?

@psycho2012

@Thunder003 I just tried on Windows but also failed at the configuration step. Maybe TF 1.14 does not support CUDA 11.1.

@Thunder003
Author

@psycho2012 thanks for your answer. I have confirmed from another source as well that TF 1.14 is incompatible with CUDA 11.1, and the error probably popped up because of that.
I'm closing the issue now.

@C-SJK

C-SJK commented Nov 12, 2020


Can I ask how you solved the problem? I have the same trouble.

@Thunder003
Author

Thunder003 commented Nov 12, 2020

@C-SJK for the start-up time, you can increase the CUDA cache size. But you still may not get good results; I'm getting only black images in the output. I assume there is an incompatibility between TF 1.14 and CUDA 11.1.
I saw that NGC uses TF 1.15.4 with CUDA 11.1, but for me TF 1.15 is not able to detect the GPU.
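The cache-size tip above refers to CUDA's JIT compute cache; a sketch of the environment variables involved (the 2 GB value is illustrative, and the training command is stylegan2's entry point with arguments elided):

```shell
# Keep JIT-compiled PTX kernels between runs instead of recompiling each start-up
export CUDA_CACHE_MAXSIZE=2147483648            # raise the cache limit to 2 GB
export CUDA_CACHE_PATH="$HOME/.nv/ComputeCache" # default location, made explicit
python run_training.py ...
```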

@C-SJK

C-SJK commented Nov 12, 2020


I know about NGC.
Do you know how to use it?
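Roughly, NGC images are run through Docker with GPU passthrough; a sketch (the 20.11-tf1-py3 tag is my guess at the then-current TF 1.15.4 / CUDA 11.1 image; check the NGC catalog for the right tag):

```shell
# Pull and start NVIDIA's TF1 container with all GPUs and the current dir mounted
docker run --gpus all -it --rm \
    -v "$PWD":/workspace \
    nvcr.io/nvidia/tensorflow:20.11-tf1-py3
```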

@C-SJK

C-SJK commented Nov 12, 2020


Thanks, I get it!

@JulianPinzaru

JulianPinzaru commented Nov 17, 2020


@psycho2012 did you manage to build TF 1.14 and maybe run stylegan2 on top of it?
I am struggling to make it work (Ubuntu 18.04, RTX 3090).
The sm_86 error is what I run into. I tried changing an nvcc option from 0 to 1 as suggested, and ran into a segmentation fault.

Setting up TensorFlow plugin "upfirdn_2d.cu": iulian device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
Compiling... Loading... Done.
Segmentation fault (core dumped)

or

venv/lib/python3.7/site-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/username/.cache/dnnlib/tflib-cudacache/fused_bias_act_b854f54134b47f099f4349d891d819ed.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

depending on that tweak:

compile_opts += f' --compiler-options \'{" ".join(tf.sysconfig.get_compile_flags())}\''
# compile_opts += f' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=1\''
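For what it's worth, the `undefined symbol: ..._cxx1112basic_string...` variant looks like the classic libstdc++ dual-ABI mismatch: the `__cxx11` fragment in the mangled name means the plugin was compiled with `_GLIBCXX_USE_CXX11_ABI=1`, while pre-built TF 1.x wheels were compiled with `0`. A tiny self-contained check of that reading (pure string inspection, no TensorFlow needed):

```python
# The mangled symbol from the error message above
symbol = ("_ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112"
          "basic_stringIcSt11char_traitsIcESaIcEEE")

# "__cxx11" inside std::string's mangling marks the new (C++11) libstdc++ ABI;
# its presence means the plugin was built with _GLIBCXX_USE_CXX11_ABI=1.
uses_new_abi = "__cxx11" in symbol
print(uses_new_abi)  # True
```

This is why the commented-out compile option above (forcing the ABI flag to match the TF wheel) changes which error you get.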

@johndpope
Contributor

As stylegan2 doesn't have an issue tracker, I piggybacked this issue onto a git commit.

related - NVlabs/stylegan2@23f8bed

There is a script we can run to upgrade stylegan2 to use TensorFlow 2, automagically:

tf_upgrade_v2 \
  --intree stylegan2/ \
  --outtree stylegan2-tf2/ \
  --reportfile report.txt

However, we need to address the following failed conversions. I beseech NVIDIA to create an "UNSUPPORTED" TensorFlow 2 branch which we can all fix.

Using member tf.contrib.memory_stats.MaxBytesInUse in deprecated module tf.contrib; cannot be converted automatically.
Using member tf.contrib.opt.ScipyOptimizerInterface in deprecated module tf.contrib; cannot be converted automatically.
Using member tf.contrib.opt.GGTOptimizer in deprecated module tf.contrib; cannot be converted automatically.
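For what it's worth, a sketch of how those leftover tf.contrib calls might be mapped by hand (my guesses, not an official migration table; `get_memory_info` requires a recent TF 2.x, and each replacement must be verified per release):

```python
# Hypothetical replacement table for the tf.contrib symbols tf_upgrade_v2
# could not convert automatically.
contrib_replacements = {
    # peak GPU memory use, formerly tf.contrib.memory_stats.MaxBytesInUse()
    "tf.contrib.memory_stats.MaxBytesInUse":
        "tf.config.experimental.get_memory_info('GPU:0')['peak']",
    # no direct core replacement; SciPy-driven optimization must be hand-rolled
    "tf.contrib.opt.ScipyOptimizerInterface": None,
    # GGT optimizer was dropped; substitute a core optimizer such as Adam
    "tf.contrib.opt.GGTOptimizer": "tf.keras.optimizers.Adam",
}

for old, new in contrib_replacements.items():
    print(f"{old} -> {new or 'no direct equivalent'}")
```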

It would probably help if someone from TensorFlow could also guide support in getting this over the line. The NVIDIA labs don't want to support TensorFlow 2, but it seems push has come to shove here. Unless we're holding out for a StyleGAN 3 to drop. FYI @tkarras
