
CUDA 9RC + cuDNN7 #12474

Closed
tfboyd opened this issue Aug 22, 2017 · 44 comments
Labels: type:feature Feature requests

tfboyd (Member) commented Aug 22, 2017

Things have moved forward. I strongly suggest building from head with CUDA 9 and cuDNN 7. All of the necessary code should be in the TF 1.4 tag, but given we are still working on new features for FP16, I would build from head if that is of interest. I do not like to share anything I have not personally tested, as I know how frustrating trying to get things to compile can be.

Everything below this line is OUTDATED as of 19-Oct

This is an unofficial and wholly unsupported patch that makes it possible to compile TensorFlow with CUDA 9RC + cuDNN 7 or CUDA 8 + cuDNN 7.

During testing on V100 (Volta) with ResNet-50 FP32, CUDA 9RC + cuDNN 7 was significantly faster than CUDA 8 + cuDNN 6, which was not a surprise. I am about to test on P100s. I am sharing this patch informally so those who are interested can play with cuDNN 7 as well as CUDA 9RC before we have the official release. As we have more interesting code, e.g. FP16 models, I will share it in this issue. I expect NVIDIA will start to submit official cuDNN 7 patches very soon.

Note: This patch may work on more recent versions of TensorFlow, but it will likely bit-rot, so keep that in mind. Applying the cuDNN 7 patch and then fast-forwarding the branch might be the best approach. My git skills are not strong, so do what you think is best.

  1. Download the patches.

  2. Clone the tensorflow repo:
    git clone https://github.com/tensorflow/tensorflow.git

  3. Check out the revision that the TensorFlow patch applies to:
    git checkout db596594b5653b43fcb558a4753b39904bb62cbd~

  4. Apply the TensorFlow patch:
    git apply ~/Downloads/0001-CUDA-9.0-and-cuDNN-7.0-support.patch

  5. Run ./configure. When it asks for the CUDA version, enter 9.0 (or 8). When it asks for the cuDNN version, enter 7.0.1 (entering '7' worked fine for me). Make sure you give the paths to the right CUDA and cuDNN versions and have your ldconfig or LD_LIBRARY_PATH set to point to the CUDA 9 folder.
    ./configure

  6. Attempt to build TensorFlow, so that Eigen is downloaded. This build will fail when building for CUDA 9RC but will succeed for CUDA 8:
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

  7. Apply the Eigen patch:
    cd -P bazel-out/../../../external/eigen_archive
    patch -p1 < ~/Downloads/eigen.f3a22f35b044.cuda9.diff

  8. Build TensorFlow successfully (the full sequence is consolidated into a single sketch below):
    cd -
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
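For convenience, here is the whole sequence above as a single shell sketch. The patch filenames and ~/Downloads paths are taken verbatim from the steps; adjust them to wherever you saved the patches:

    # Steps 2-4: clone, check out the known-good revision, apply the TF patch
    git clone https://github.com/tensorflow/tensorflow.git
    cd tensorflow
    git checkout db596594b5653b43fcb558a4753b39904bb62cbd~
    git apply ~/Downloads/0001-CUDA-9.0-and-cuDNN-7.0-support.patch
    # Step 5: configure for CUDA 9.0 (or 8) and cuDNN 7 when prompted
    ./configure
    # Step 6: the first build attempt downloads Eigen
    # (it fails under CUDA 9RC but succeeds under CUDA 8, so don't abort on error)
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package || true
    # Steps 7-8: patch Eigen inside bazel's external dependency tree, then rebuild
    cd -P bazel-out/../../../external/eigen_archive
    patch -p1 < ~/Downloads/eigen.f3a22f35b044.cuda9.diff
    cd -
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package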

I have run this process myself on Ubuntu 14.04 with Python 2.7.

Thanks to NVIDIA for the early patch and to @reedwm, who created most of these instructions.

If you are using Python 2.7 and gcc 4.8+, here is a .whl built by following the instructions above, plus one built with CUDA 8 and cuDNN 7 that I have yet to test. To stress again: this was created by me and not the TensorFlow build team. My/our goal is to engage with anyone that wants to try this out and try to have a little fun. :-)

@tfboyd tfboyd self-assigned this Aug 22, 2017
@tfboyd tfboyd added the type:feature Feature requests label Aug 22, 2017
byronyi (Contributor) commented Aug 22, 2017

Thanks Toby. Looking forward to the performance numbers on P100.

tanmayb123 commented

Quick question (sorry if this is not the right place to ask, and if I'm missing something really obvious on NVIDIA's website): do CUDA 9 & cuDNN 7 provide better performance on non-Volta (Pascal) GPUs? Is the difference noticeable when training neural nets?

tfboyd (Member, Author) commented Aug 24, 2017 via email

tanmayb123 commented Aug 25, 2017

I'm seeing a bit of a different result. With CUDA 8 & cuDNN 6, training a simple but deep Keras CNN model took me 225 seconds, but with CUDA 9 & cuDNN 7, it took 241 seconds (on a P100).

tfboyd (Member, Author) commented Aug 25, 2017

My test compared CUDA 8 + cuDNN 7 against CUDA 8 + cuDNN 6. My CUDA 9RC + cuDNN 7 testing was not apples to apples, as I was using a slightly different version of TF than I did for CUDA 8 + cuDNN 7, but I expect they perform about the same. Compared to CUDA 8 + cuDNN 7, ResNet-50 was about the same, and I saw a regression with VGG16 at 8x P100 SXM running replicated NCCL. With 1 GPU I saw some differences, but not large enough to be certain. There are things to sort out.

tfboyd (Member, Author) commented Aug 26, 2017

@terrylewis60 It looks like you are subscribed to the issue. If you unsubscribe you will not get notifications. I would also greatly appreciate it if you removed the profanity and childish name calling. You are free to express yourself, although you are likely outside the community guidelines. It just seems rather silly to complain about something you subscribed to. Thank you.

domschl (Contributor) commented Aug 26, 2017

Any updates on this? While the patch works with the specific TF commit (also with Python 3), the TF version reports 1.2.1-rc1, so it is rather old (that version is incompatible with the current bazel 0.5.3, so bazel needs to be downgraded in order to build the patched version). The cuDNN 7 changes are so far not included in HEAD. Any chance of a branch that supports the most recent released upstream software/drivers?

tfboyd (Member, Author) commented Aug 26, 2017 via email

tfboyd (Member, Author) commented Aug 26, 2017 via email

domschl (Contributor) commented Aug 26, 2017

The patch almost works with r1.3; it fails on tensorflow/workspace.bzl:

    error: patch failed: tensorflow/workspace.bzl:664
    error: tensorflow/workspace.bzl: patch does not apply

[cmake anyone? ;-) ]

tfboyd (Member, Author) commented Aug 26, 2017 via email

domschl (Contributor) commented Aug 26, 2017

For now I have patched the patch so that it works with release r1.3:
0002-TF-1.3-CUDA-9.0-and-cuDNN-7.0-support.patch.txt

(Note: the attached patch file needs to be renamed; the .txt extension must be removed.)

This patch will only work with r1.3 (workspace.bzl changes almost daily, which breaks the patch).
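If you want to check whether either patch still applies to your checkout before touching the tree, git apply --check does a dry run (assuming you have renamed the attached file as noted above):

    git checkout v1.3.0
    git apply --check 0002-TF-1.3-CUDA-9.0-and-cuDNN-7.0-support.patch
    # prints nothing and exits 0 if the patch would apply cleanly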

tfboyd (Member, Author) commented Aug 26, 2017 via email

npanpaliya (Contributor) commented Aug 28, 2017

Could somebody please specify the required bazel version for this patch to work on r1.3?

domschl (Contributor) commented Aug 28, 2017

I've built with TF r1.3, the 0002 patch, and bazel 0.5.3 (the latest bazel release). The original 0001 patch (first post) did not build with bazel 0.5.3 but required a downgrade to 0.5.2.

npanpaliya (Contributor) commented

Thank you @domschl !

renganxu commented

Hi @tfboyd, how does the following command work?

    cd -P bazel-out/../../../external/eigen_archive

There is no "external/eigen_archive" folder in the tensorflow repo, only "third_party/eigen3/Eigen", so this command and your next patch command do not work.

domschl (Contributor) commented Aug 30, 2017

You need to try to build tf once. That pulls the external dependencies like eigen3.
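To make the path less magic: bazel-out is a convenience symlink that bazel creates in the workspace root during the first build, which is why it does not exist beforehand. cd -P resolves it to its physical location under bazel's output base, and ../../../external/eigen_archive lands in the downloaded Eigen sources. A more explicit route to the same place (bazel info is a standard bazel command; the exact layout under the output base can vary by bazel version, so treat this as a sketch):

    cd tensorflow                                          # your workspace root
    bazel info output_base                                 # e.g. ~/.cache/bazel/_bazel_$USER/<hash>
    ls "$(bazel info output_base)/external/eigen_archive"  # Eigen as bazel fetched it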

tfboyd (Member, Author) commented Aug 30, 2017 via email

renganxu commented Sep 8, 2017

Hi @tfboyd, I tried to use your patches, but after the first build step there is no "external/eigen_archive" folder at all, which means the command "cd -P bazel-out/../../../external/eigen_archive" fails. So when is this folder generated?

tfboyd (Member, Author) commented Sep 8, 2017

@hfutxrg It must be broken now. It will be in the official build in a few weeks or less. The build attempt in step 6, the one that downloads Eigen, is what creates that folder. But if it did not work for you, that is OK; this was for fun and it worked for a few people.

renganxu commented

Hi @tfboyd, how did you benchmark the performance of P100 and V100? I used the benchmarks mentioned in https://www.tensorflow.org/performance/benchmarks, but the performance results seem incorrect. With the real ImageNet dataset I got 905 images/sec with 1 P100 and 1234 images/sec with 4 P100s, but that benchmark page reports 219 images/sec with 1 P100 and 852 images/sec with 4. That is a huge difference.

My command to run the benchmark is:

    export CUDA_VISIBLE_DEVICES=0
    python tf_cnn_benchmarks.py \
      --model_name=resnet50 \
      --batch_size=64 \
      --num_batches=2000 \
      --num_gpus=1 \
      --data_dir=/cm/shared/scratch/database/tensorflow \
      --data_name="imagenet"

Is there anything wrong with my command? Also, that benchmark page shows similar performance with synthetic data, but if I remove "--data_dir" the performance is 12866 images/sec, which is incorrect. Any comments and suggestions are appreciated!

Thanks,
Rengan

tfboyd (Member, Author) commented Sep 13, 2017

You are running the trivial model, not resnet50; use --model=resnet50. I was totally confused for a second. I recently retested, and had NVIDIA test, on TensorFlow 1.3, and their numbers are slightly different now on the DGX-1, but by no more than 3-4%. We made a few tweaks to the model as we worked to ensure it converged to match the paper; nothing huge, just small changes. You can also use the benchmarks from the staging branch, but that requires 1.4. tf_cnn_benchmarks does not worry about backwards compatibility, as it is kind of a sandbox, just FYI. I am working on a better way to publish updated code and accept external PRs more often. It is more work than it looks.
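In other words, the run above exercised the default trivial model. Only the flag name needs to change; the rest of the command can stay exactly as it was:

    export CUDA_VISIBLE_DEVICES=0
    python tf_cnn_benchmarks.py \
      --model=resnet50 \
      --batch_size=64 \
      --num_batches=2000 \
      --num_gpus=1 \
      --data_dir=/cm/shared/scratch/database/tensorflow \
      --data_name="imagenet"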

renganxu commented

Thanks @tfboyd. That was a stupid mistake; after I changed to the correct option, the benchmark works correctly now.
Another question: how do we run the benchmark with 16-bit floating point (float16)? I didn't see any option to enable that.

tfboyd (Member, Author) commented Sep 13, 2017

fp16 is not ready just yet. We still need batch_norm updated to support it, and another push from the script. Give me a few more weeks and I should have everything you need. I (and we) very much want to share the code with you as soon as it functions, even if it has some issues. It will also be hard to keep using this patch, as it is getting very stale. So if all goes well, in a few weeks (I hope less) we can have everything official: CUDA 9RC, cuDNN 7, and FP16 with something to play with.

Zero worries about the mistake; there are a bunch of flags and it is easy to make one.

renganxu commented

Hi @tfboyd, in the benchmark code one of the variable-update modes is called "independent". This should never be used in real training, right? Because in real training all GPUs have to communicate with each other to update the gradients. The page https://www.tensorflow.org/performance/performance_models also doesn't mention "independent" variable update.

tfboyd (Member, Author) commented Sep 21, 2017

@hfutxrg Correct: independent means gradients are not shared between the towers. It is only useful for testing hardware and is not a "legit" way to measure actual training performance.

FYI, it looks like cuDNN 7 is mostly merged. I have not tested it myself yet, but just FYI. Trying to keep the thread updated as I have time.

cuDNN 7: #12503
CUDA 9: #12502

There will be follow-up PRs, and I hope to publish FP16 examples very soon. There is a review still in progress for batch norm.

alanpurple (Contributor) commented

Will this be in 1.4.0? Any estimated release date, please?

tfboyd (Member, Author) commented Oct 3, 2017

1.4 binaries will most likely be CUDA 8 + cuDNN 6. I suspect (which is not an estimate) that 1.5 will be built with CUDA 9 + cuDNN 7. I am testing the build now and working through some issues with CUDA 9 and cuDNN 7; maybe I am doing something wrong, I will find out tomorrow. I really want to have CUDA 8 + cuDNN 7 building from 1.4 source, but if we miss the release train, we miss it. There is no date for 1.4, but I suspect it will RC before the end of October. It is gated on finishing some features, but again, it most likely will not have prebuilt binaries for CUDA 9 and cuDNN 7.

FYI. I will close this issue when it builds from source with no "tricks". :-)

chunfuchen commented

@tfboyd Thanks for your great help integrating CUDA 9 RC + cuDNN 7. So if I have a V100 machine, I should be able to compile TensorFlow from source (the 1.3 branch) with your patches above and enjoy the speedup from the V100, right?

By the way, will your patch work with the official CUDA 9 (released a few days ago) and cuDNN 7? Or will you update the patch, or have you already done so?

Thanks again.

zhangb223 commented

@tfboyd

I always get errors when applying the eigen-...-diff file.

I can't run cd -P bazel-out/../../../external/eigen_archive successfully. I tried with the terminal in the tensorflow directory and in the tensorflow1.3 directory, but it doesn't work. Could you please tell me what bazel-out/../../../ means and what that directory is? Can I just substitute the exact path for bazel-out/../../../?

I also tried this: running cd -P bazel-out/(plus the exact location of)external/eigen_archive and then executing the patch. Errors happen at each hunk header, e.g. @@ -236,7 +236,7 @@ (the others have the same form), and the program says it can't find the file to patch at input line xx.

Do you know why these happen?

tfboyd (Member, Author) commented Oct 20, 2017

@zhangb223 @chunfuchen

I suggest taking TensorFlow from head and not using this patch. I will update the original comment.

CUDA 9 with no other changes speeds up V100s over CUDA 8 in my testing of ResNet-50 FP32. I hope to share the team's FP16 code and some results, but we are still working through a variety of items. Technically, the tf_cnn_benchmarks.py code currently published will do FP16, but I want to make sure it is running great, with great convergence (high accuracy that matches or almost matches FP32, e.g. ~75%) and scaling.
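For reference, building from head with no patches is just the standard flow from the original steps, minus the patching (same commands as above):

    git clone https://github.com/tensorflow/tensorflow.git
    cd tensorflow
    ./configure   # point it at CUDA 9 and cuDNN 7 when prompted
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package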

@tfboyd tfboyd closed this as completed Oct 20, 2017
tfboyd (Member, Author) commented Oct 20, 2017

I marked this closed but as always (or mostly) I am happy to answer questions and share status when I have time.

zhangb223 commented

Hello @tfboyd

I want to offer more information about the first error. When I execute cd -P tensorflow1.3/lib/python2.7/.../external/eigen_archive I can enter the path. However, if I add bazel-out after -P, the terminal reports: No such file or directory. So I want to know the function of bazel-out.

Is it a path or a command? I can't find bazel-out in the bazel package.

Thanks!

Arnab0989 commented

Hi, I am a newbie and want some help with TensorFlow installation.
My system:
OS: Windows 10
GPU: 940MX
I installed CUDA Toolkit 9, updated the driver, and installed cuDNN 7 successfully.
Even in Anaconda the TensorFlow installation is reported as successful, but the missing-DLL issues remain. I tried to install CUDA Toolkit 8, but it says the hardware is not supported for it. Please help me with a suitable link or some suggestions so that I can make it work.
The PC is a Lenovo IdeaPad 320.
Thank you for your kind support.

erikaemma commented Nov 22, 2017

Hi, my gcc version is 5.4.0 (Ubuntu 16.04), so I compiled and installed gcc 5.3.0 into /opt/gcc-5.3.0 (the gcc dependencies have already been written into /etc/ld.so.conf.d/gcc-5.3.0.conf and $HOME/.bashrc), and I have symlinked everything in /opt/gcc-5.3.0/bin into /usr/bin except the files whose names begin with 'x86_64-unknown-linux-', because I do not know the corresponding names in /usr/bin.

Even so, bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package fails with: gcc: error trying to exec 'cc1plus': execvp: No such file or directory

My question is: do I have to compile it with gcc 5.3.0?

Could anybody help me?

acc-to-learn commented

@Arnab0989 did you find a solution? I have exactly the same situation, and I didn't find any solution other than installing the old NVIDIA driver + cuDNN 6 + CUDA 8.

joeyearsley (Contributor) commented

Looking through the benchmark code, I see it mentions NCCL not supporting FP16; however, Caffe has NCCL with FP16 support.

Will this be added to TensorFlow for 1.5? I'm currently getting worse results with FP16 on internal benchmarks. My current hypotheses are that it is either NCCL or tf.layers. However, I've implemented a custom getter which I thought would circumvent the tf.layers issues.

tfboyd (Member, Author) commented Dec 14, 2017

@joeyearsley

Can you provide more specifics? I assume V100; where are you seeing FP16 worse than FP32 in terms of throughput? Single or multi-GPU, and which CUDA and cuDNN? That will help me understand if and where the problem might be, as well as get the right assistance and provide useful guidance.

tfboyd (Member, Author) commented Dec 14, 2017

@Arnab0989 and @plgkm6 I suggest opening another issue to get Windows help. This issue was just for tracking CUDA 9 and cuDNN 7 support in the source. I know that is a little nuanced, but you will get more attention for the Windows problem outside this thread with a separate issue. I believe the team is still working through build issues on Windows, but I am not positive.

joeyearsley (Contributor) commented Dec 14, 2017

@tfboyd Thanks for the quick reply. It's an internal benchmark which mirrors the TF benchmark but with medical images.

GPUs: 8x P100
CUDA: 9
cuDNN: 7
Multi-GPU type: NCCL replicated

General computation time: for example, with FP16 I can double my batch size to 8 (the max that fits in memory), but I only get 85 img/s vs. 100 img/s with FP32 and a batch size of 4.

The model is in NCHW format and uses fused BN.

tfboyd (Member, Author) commented Dec 14, 2017

@joeyearsley

I would not advise FP16 on P100s for training unless it happens to work better for your situation; it is fine for testing that it works. I believe the issue is that the V100 can do FP16 math while accumulating in FP32; the P100 cannot, so you have to cast and do other "stuff". Apologies for being vague: I am sharing what I have heard, but I do not have direct experience. I am pretty sure I am generally correct, but my details are likely off in some way. I will open an issue in the benchmark repo and link it here later today. Maybe I can get more data from the perf team.

reedwm (Member) commented Dec 14, 2017

/CC @chsigg, who is planning on implementing FP16 NCCL support. Not sure if this will be in by TF 1.5, though.

@joeyearsley As @tfboyd mentioned, P100s do not get a huge gain from FP16, and there is currently no NCCL support for FP16. I don't think tf.layers would slow things down in FP16. It might be NCCL-related; you could try running without NCCL for both FP16 and FP32 and compare.
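A minimal A/B sketch of that comparison with tf_cnn_benchmarks. The flag names --use_fp16 and --use_nccl are assumptions here; they have varied across revisions of the script, so check which flags your revision actually defines:

    # NCCL replicated vs. parameter server, first in FP16...
    python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=8 \
      --variable_update=replicated --use_nccl=True --use_fp16=True
    python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=8 \
      --variable_update=parameter_server --use_fp16=True
    # ...then repeat both runs without --use_fp16=True and compare img/s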

joeyearsley (Contributor) commented

@tfboyd @reedwm Thank you for the quick replies!

Having run that test (with and without FP16 on a parameter server), it seems that it is mostly the casting and other stuff.

Once again, many thanks!
