Segfault in cufftXtMakePlanMany #1363
This is with CUDA version 11.6; the image is derived from here. I'm not sure how to check the cuFFT version but
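One possible way to check the cuFFT version from inside the image is to ask the library itself. The following is a minimal sketch, assuming that `libcufft` is on the loader path and that calling the cuFFT C function `cufftGetVersion` through `ctypes` is acceptable; the helper name `cufft_version` is hypothetical and not something used in this thread.

```python
import ctypes
import ctypes.util


def cufft_version(lib_name=None):
    """Return the integer reported by cufftGetVersion, or None if unavailable."""
    # Assumption: the library is discoverable as "cufft" unless a name/path is given.
    name = lib_name or ctypes.util.find_library("cufft")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        # The library could not be loaded (missing file or missing dependencies).
        return None
    version = ctypes.c_int(0)
    # cufftGetVersion(int *version) returns a cufftResult; 0 is CUFFT_SUCCESS.
    status = lib.cufftGetVersion(ctypes.byref(version))
    return version.value if status == 0 else None


if __name__ == "__main__":
    print(cufft_version())
```

The returned integer is the raw value from the library; it is left undecoded here on purpose.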
@albertz mentioned this issue on Slack. I don't use
It seems like the issue was with the image. It was derived from Simon's TF2.8 image, but torch was then installed into it. However, torch comes with its own pip packages for some CUDA-related libraries, and I guess they interfered with the original libraries in the image. I'm now training with an image without torch, and that has been running since yesterday evening.
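To see whether an image contains pip-installed CUDA libraries of this kind, one could list the installed distributions and filter by name. This is only a sketch: it assumes the relevant wheels use an `nvidia-` name prefix (for example `nvidia-cufft-cu11`), which is not stated in this thread, and the function name `cuda_pip_packages` is hypothetical.

```python
from importlib import metadata


def cuda_pip_packages(names=None):
    """Return sorted names of distributions that look like pip-shipped CUDA libraries."""
    if names is None:
        # Collect the names of everything installed in the current environment.
        names = [dist.metadata["Name"] for dist in metadata.distributions()]
    # Assumption: CUDA runtime wheels are published under an "nvidia-" prefix.
    return sorted(n for n in names if n and n.lower().startswith("nvidia-"))


if __name__ == "__main__":
    for name in cuda_pip_packages():
        print(name)
```

An empty result would suggest that only the CUDA libraries from the base image are present, at least as far as pip is concerned.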
I just had a very similar error again:
This was with TF2.14, CUDA 11.8, cuDNN 8600 (8.6?) and a recent RETURNN version. The image definition is below:
I wonder, that is libcufft version 10. But there is CuFFT 12 now. Maybe we should try with a newer CuFFT version? Edit: The libcufft filename versioning is confusing. It's not directly related to the actual CuFFT version (and also not to the CUDA version). We have that symlink
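Since the filename alone is ambiguous, it can help to see which real file each `libcufft.so*` entry resolves to. A minimal sketch, assuming the libraries live in a single directory that you pass in; the function name `resolve_cufft_libs` and any directory you choose are illustrative, not taken from the image discussed here.

```python
import glob
import os


def resolve_cufft_libs(lib_dir):
    """Map each libcufft.so* entry in lib_dir to the real file it points at."""
    pattern = os.path.join(lib_dir, "libcufft.so*")
    return {
        # realpath follows symlinks, so a symlink maps to its final target file.
        os.path.basename(path): os.path.basename(os.path.realpath(path))
        for path in sorted(glob.glob(pattern))
    }


if __name__ == "__main__":
    import sys

    # Pass the library directory as the first argument.
    if len(sys.argv) > 1:
        for link, target in resolve_cufft_libs(sys.argv[1]).items():
            print(f"{link} -> {target}")
```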
Can you maybe try
The TF2.15.0 release is just 16 hours old; is it maybe not yet available on Docker Hub?
I have a RETURNN training for a CTC model which keeps crashing with EOFError: Ran out of input. It usually trains for one or several epochs but then crashes after 1-2 hours.
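For context, "Ran out of input" is the message Python's `pickle` module gives when it is asked to load from a stream that has already ended. The sketch below only reproduces that message in isolation; that the crash here comes from reading pickled data from a process that died is an assumption, not something confirmed in this thread, and `load_or_describe` is a hypothetical helper.

```python
import io
import pickle


def load_or_describe(stream):
    """Unpickle one object from stream, or return the EOFError text if the stream is empty."""
    try:
        return pickle.load(stream)
    except EOFError as exc:
        # Raised when the stream ends before a complete pickle was read.
        return f"EOFError: {exc}"


if __name__ == "__main__":
    # An empty stream stands in for a pipe whose writer exited without sending data.
    print(load_or_describe(io.BytesIO(b"")))
    # A stream holding a complete pickle loads normally.
    print(load_or_describe(io.BytesIO(pickle.dumps({"epoch": 1}))))
```

If the assumption holds, the EOFError would be a symptom in the reading process, and the actual failure would need to be found in the process that stopped writing.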
Simon earlier referred me to this memory leak issue, but the fix is already included in the RASR version I use. I also tried requesting more memory, up to 20GB, which did not solve the issue, and the logs say the maximum memory usage is below 10GB.
When running with DEBUG_SIGNAL_HANDLER=1 and OPENBLAS_NUM_THREADS=1, I get this stack trace: