Windows CI intermittent error: C2993: 'Derived': illegal type for non-type template parameter '__formal #25393
I got some sort of similar-ish error persistently on a test branch:
Not that persistently; a PR stacked on the failing one succeeded. So there is something nondeterministic going on here.
Another instance: CI on #23888
Looks like an NVCC bug. The workaround is protecting the Eigen headers with
Another one in https://app.circleci.com/jobs/github/pytorch/pytorch/3919969.
Can you try with 10.2?
Mingbo, can you put general information on the issue you are seeing here in this issue?
Latest word on this is from mingbo:
Hi. Did anyone have this issue persistently, i.e. such that a rebuild did not "fix" it? We've had this issue in another (closed-source) project. Exposing CUDA to fewer includes (such as boost) seems to help, but is not predictable.
It's never been persistent for us, but this is mostly from our Windows CI builds where we blow everything away and rebuild from scratch, which may help.
Yet another occurrence in the forum: https://discuss.pytorch.org/t/how-do-i-get-my-older-gpu-that-supports-cuda-to-work-with-pytorch-1-4/69005/13. It seems to be fixed by
Summary:

## Several flags

- `/MP[M]`: a flag for the compiler `cl`. It enables object-level multiprocessing. By default it spawns M processes, where M is the number of cores on the PC.
- `/maxcpucount:[M]`: a flag for the generator `msbuild`. It enables project-level multiprocessing. By default it spawns M processes, where M is the number of cores on the PC.
- `/p:CL_MPCount=[M]`: a flag for the generator `msbuild`. It makes the generator pass `/MP[M]` to the compiler.
- `/j[M]`: a flag for the generator `ninja`. It enables object-level multiprocessing. By default it spawns M processes, where M is the number of cores on the PC.

## Reason for the change

1. Object-level multiprocessing is preferred over project-level multiprocessing.
2. ~For ninja, we don't need to set `/MP`, otherwise M * M processes will be spawned.~ Actually, this is not correct, because in ninja configs there is only one source file per command. Therefore, the `/MP` switch should be useless.
3. For msbuild, if it is called through Python configuration scripts, then `/p:CL_MPCount=[M]` will be added; otherwise, we add `/MP` to `CMAKE_CXX_FLAGS`.
4. ~It may be a possible fix for #28271, #27463 and #25393, because `/MP` is also passed to `nvcc`.~ This is probably not true, because `/MP` should not be effective given that there is only one source file per command.

## Reference

1. https://docs.microsoft.com/en-us/cpp/build/reference/mp-build-with-multiple-processes?view=vs-2019
2. https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows
3. https://blog.kitware.com/cmake-building-with-all-your-cores/

Pull Request resolved: #33120
Differential Revision: D19817227
Pulled By: ezyang
fbshipit-source-id: f8d01f835016971729c7a8d8a0d1cb8a8c2c6a5f
I was able to reproduce the issue with a more verbose output with #33693: https://app.circleci.com/jobs/github/pytorch/pytorch/3919969. I then inspected the variables carefully and found out that the VC env is activated twice. We should really avoid that.
You are SO COOL!!!
I tried to build from scratch several times with this PR and could not reproduce the issue anymore. Let's assume it's fixed. |
I have also obtained this error with CUDA 10.1/cuDNN 7.6.4 on Win Server 2019.
@mstfbl this is a CUDA bug. Please see the above discussion and upgrade to CUDA 11.
@leezu I fixed the error by installing sccache for use with CUDA 10.1 and CUDA 10.2, as done here: https://github.com/pytorch/builder/blob/993e8b275e313641796db8a0b2869d5f3dd13828/windows/build_pytorch.bat#L98-L124
@mstfbl it's an intermittent error and the probability of occurrence varies depending on your system. So it's not really fixed, but you may have found a way to reduce the occurrence on your system, which may be sufficient in your case :)
Another instance, this time inside protobuf:
Here is the diff of the two inputs, one of which failed and the other didn't:
And the corresponding change in the output:
This failure reproduced reliably on the CI machine. Running nvcc manually multiple times, it always fails or always passes given the same inputs. I think it's a deterministic failure, but it is annoyingly sensitive to any change in the input file whatsoever.
There is an issue with the CUDA implementation which prevents proper execution. See pytorch/pytorch#25393. Tweaking the compiler settings reduces the errors, but it seems impossible to prevent them altogether.
Fix Readme and disable MSVC-CUDA 10.2
- Update to the new package status.
- Simplify the HIP-related INSTALL.md section.
- Disable the MSVC-CUDA 10.2 job. There is an issue with the CUDA implementation which prevents proper execution. See pytorch/pytorch#25393 and NVIDIA/thrust#1090. Tweaking the compiler settings reduces the errors, but it seems impossible to prevent them altogether.

Related PR: #852
See pytorch#65612 and pytorch#25393. Fixes pytorch#65648.
Closing due to age.
Wait, do we think this is fixed? It's not clear to me it is...
I think this was related to CUDA 10.2 builds on Windows, which we don't actually support anymore, so it is probably safe to keep this closed.
Kind of similar to #25389
Sometimes our Eigen build fails this way:
It doesn't seem to reliably repro.
cc @ezyang @gchanan @zou3519 @seemethere @peterjc123