
On the way to latest CMake, VS2017, CUDA 9, cudNN 7, Win10 #14801

Closed

sylvain-bougnoux opened this issue Nov 22, 2017 · 8 comments
Labels: stat:community support (Status - Community Support)
sylvain-bougnoux commented Nov 22, 2017

Like many of us (#14126, #14691, #12052), I am trying to get TF 1.4 to build successfully on Windows using the latest version of everything. As far as I can judge I managed it, but with some hacks. As it would take me too long to finish on my own, I would like to share what I did to help finalize it. It is too early for a PR.

I am using CMake 3.9.6 (though 3.10 is out). My CMake skills are limited.
I am not building the Python bindings.
VS2017 is the Community edition.

Without GPU it is easy. The only issue is the heap overflow (C1002 or C1006, #11096). The trick is to reduce build parallelism with msbuild /m:4 /p:CL_MPCount=2 ... such that 4*2 is approximately the number of cores you actually have (at least it worked for me). Using /Zm2000 did not work for me, despite plenty of available memory (32 GB).
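Concretely, the reduced-parallelism invocation sketched above might look like this (the target and configuration names are illustrative, not taken from the actual build):

```bat
REM 4 parallel projects x 2 parallel cl.exe instances per project is roughly
REM the physical core count, which keeps the compiler's heap usage low
REM enough to avoid the C1002/C1006 heap overflow.
msbuild /m:4 /p:CL_MPCount=2 /p:Configuration=Release tf_tutorials_example_trainer.vcxproj
```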

With GPU it is trickier: tf_core_gpu_kernels.vcxproj does not compile at all. AFAIU, the CMake strategy changed after v3.6 to allow parallel builds: CUDA is now treated as another language. Without modifications, nvcc simply returns with error code 1 (or nothing happens, I am not sure). Here are my modifications (from v1.4).

From tensorflow/tensorflow/contrib/cmake/
1/ Adapt CMakeLists.txt a little:

  • Change CUDA 8.0 to CUDA 9.0 at l.223.
  • Add enable_language("CUDA") at l.224.
  • The set(CUDA_NVCC_FLAGS ...) directives no longer work. See below.
  • Add compute capabilities 6.0 and 6.1 at l.232, as well as at l.246. Might not be needed (it only affects performance).
  • Change 64_80 to 64_90 and 64_6 to 64_7 at l.247 and l.248, and similarly at l.272-276.
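As a rough sketch, the edited region of CMakeLists.txt could read as follows (the surrounding context and comments are assumptions, not the exact v1.4 file):

```cmake
# l.223: require CUDA 9.0 instead of 8.0
find_package(CUDA 9.0 REQUIRED)
# l.224: let CMake drive nvcc itself by treating CUDA as a first-class language
enable_language("CUDA")
# l.232 / l.246: append the Pascal compute capabilities 6.0 and 6.1
# to the existing capability list (optional, performance only)
# l.247-248 and l.272-276: CUDA 9 / cuDNN 7 runtime names,
# e.g. cudart64_80 -> cudart64_90 and cudnn64_6 -> cudnn64_7
```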

2/ in tf_core_kernels.cmake:

  • Add set_source_files_properties(${tf_core_gpu_kernels_srcs} PROPERTIES LANGUAGE CUDA) at l.209, so that '.cu.cc' extensions are recognized as CUDA files.
  • Rename cuda_add_library(...) to add_library(...) at l.210.
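Taken together, the two changes in tf_core_kernels.cmake would look roughly like this (the target name follows the text above; everything else is standard CMake):

```cmake
# l.209: without this, CMake treats the '.cu.cc' files as C++ and feeds
# them to cl.exe; marking them as CUDA routes them through nvcc instead
set_source_files_properties(${tf_core_gpu_kernels_srcs} PROPERTIES LANGUAGE CUDA)

# l.210: once CUDA is an enabled language, the plain add_library command
# replaces the old FindCUDA cuda_add_library macro
add_library(tf_core_gpu_kernels ${tf_core_gpu_kernels_srcs})
```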

3/ Edit (this is the trick) tf_core_gpu_kernels.vcxproj, in the Release section:

  • Wrap the cl.exe flags, i.e. /bigobj /nologo ... -Ob2, in a -Xcompiler="/bigobj ... -Ob2" directive at l.147. These flags are for the C++ compiler, not for nvcc, and cause the crash.
  • Add --expt-relaxed-constexpr just before it, still in the AdditionalOptions.
  • Switch PerformDeviceLink from false to true at l.164.
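The hand-edited Release section of tf_core_gpu_kernels.vcxproj would then look roughly like this (the element names follow the standard CUDA Visual Studio integration; the elided flag list depends on your configuration):

```xml
<CudaCompile>
  <!-- host-compiler flags must go through -Xcompiler so nvcc does not
       try to parse them itself; nvcc's own flags stay outside of it -->
  <AdditionalOptions>--expt-relaxed-constexpr -Xcompiler="/bigobj /nologo ... -Ob2" %(AdditionalOptions)</AdditionalOptions>
</CudaCompile>
<CudaLink>
  <!-- l.164: enable device linking -->
  <PerformDeviceLink>true</PerformDeviceLink>
</CudaLink>
```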

Then everything compiles (msbuild on tf_tutorials_example_trainer.vcxproj), and this tutorial works. The remaining point before a PR is to avoid the third step, i.e. pass the right directives to nvcc by understanding how CUDA_NVCC_FLAGS works, and to add the linking. I hope this solution will work without missing symbols (#6396).

Otherwise it is a nightmare: neither CUDA 8 nor CMake 3.6 is aware of VS2017. The CMake build is not incremental (#14194) and takes about 4-5 hours (it could use precompiled headers, especially in tf_core_kernels)...

@whatever1983

Very nice, thanks. The TensorFlow team should just release an official TF 1.4 CUDA 9 Win10 .whl build for its users. I don't get why they don't do so immediately.

@sylvain-bougnoux
Author

Since then I have discovered that, although everything ran fine, the VS2017 distribution (up to 15.4) introduced a bug in the /WHOLEARCHIVE trick, resulting in unfilled factories (session, device...), as previously mentioned. Therefore I am stuck when using VS2017 to link my application with TF.

@aluo-x

aluo-x commented Nov 24, 2017

@sylvain-bougnoux I don't believe the /WHOLEARCHIVE flag is used when building a Python .whl — am I mistaken?

@sylvain-bougnoux
Author

sylvain-bougnoux commented Nov 27, 2017 via email

@tatatodd
Contributor

Thanks for the notes @sylvain-bougnoux!

Adding @mrry @gunan @tfboyd since they might be interested in your notes on getting things working.

As the referenced bugs mention, support for CUDA 9 / cuDNN 7 is anticipated in TensorFlow 1.5.

Marking this as "community support" since the purpose of this issue seems to be to collect useful tips in making this all work.

@tatatodd tatatodd added the stat:community support Status - Community Support label Nov 28, 2017
@gunan gunan self-assigned this Nov 30, 2017
@gunan
Contributor

gunan commented Nov 30, 2017

On Windows, it looks like we have a bug with NVCC: building TF with CUDA 9 fails with a compiler crash. NVIDIA is helping investigate this, and once we have an update we will proceed.

I will mark this as a duplicate of #12052 and #14691, if you don't mind.

@gunan gunan closed this as completed Nov 30, 2017
@sylvain-bougnoux
Author

@gunan
Where is the crash you mention?
I could actually compile everything just by changing the parameters given to nvcc. As I could not run it properly (despite the example trainer) due to the /WHOLEARCHIVE bug, is my success an illusion?
AFAICT it is just a matter of fixing the CMake files (or Bazel).

@gunan
Contributor

gunan commented Nov 30, 2017

http://ci.tensorflow.org/job/tf-pr-win-cmake-gpu/19/console

21:31:54        "c:\tf_jenkins\home\workspace\tf-pr-win-cmake-gpu\cmake_build\tf_python_build_pip_package.vcxproj" (default target) (1) ->
21:31:54        "C:\tf_jenkins\home\workspace\tf-pr-win-cmake-gpu\cmake_build\pywrap_tensorflow_internal.vcxproj" (default target) (4) ->
21:31:54        "C:\tf_jenkins\home\workspace\tf-pr-win-cmake-gpu\cmake_build\tf_core_gpu_kernels.vcxproj" (default target) (38) ->
21:31:54        (CustomBuild target) -> 
21:31:54          CUSTOMBUILD : Internal error : assertion failed at: "C:/dvs/p4/build/sw/rel/gpu_drv/r384/r384_00/drivers/compiler/edg/EDG_4.12/src/lookup.c", line 2652 [C:\tf_jenkins\home\workspace\tf-pr-win-cmake-gpu\cmake_build\tf_core_gpu_kernels.vcxproj]

The location of the assertion failure seems to point to a line in NVIDIA proprietary code.
