Segfault in cufftXtMakePlanMany #1363

Open
vieting opened this issue Jul 13, 2023 · 7 comments

vieting commented Jul 13, 2023

I have a RETURNN training for a CTC model which keeps crashing with EOFError: Ran out of input. It usually completes one epoch, sometimes several, but crashes after 1-2 hours.

Simon earlier referred me to this memory leak issue, but the fix is already included in the RASR version I use. I also tried requesting more memory, up to 20GB, which did not solve the issue; the logs also say the maximum memory usage is below 10GB.

When running with DEBUG_SIGNAL_HANDLER=1 and OPENBLAS_NUM_THREADS=1, I get this stack trace:

Signal handler: signal 11:
/u/zeyer/code/playground/signal_handler.so(signal_handler+0x34)[0x7f67700bd924]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f677a04f090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f677a04f00b]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f677a04f090]
/.singularity.d/libs/libcuda.so.1(+0x2b42f2)[0x7f67165cb2f2]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x4b935c)[0x7f66c23df35c]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x21b14f)[0x7f66c214114f]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x243b75)[0x7f66c2169b75]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x246116)[0x7f66c216c116]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x27ad3e)[0x7f66c21a0d3e]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x27b82c)[0x7f66c21a182c]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x272196)[0x7f66c2198196]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x226d35)[0x7f66c214cd35]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(cufftXtMakePlanMany+0x3d5)[0x7f66c2194855]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(cufftMakePlanMany64+0xef)[0x7f66c2190b5f]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(cufftMakePlanMany+0x10f)[0x7f66c219115f]
/usr/local/lib/tensorflow/libtensorflow_framework.so.2(_ZN15stream_executor3gpu11CUDAFftPlan10InitializeEPNS0_11GpuExecutorEPNS_6StreamEiPmS6_mmS6_mmNS_3fft4TypeEiPNS_16ScratchAllocatorE+0x33f)[0x7f673886ea9f]
/usr/local/lib/tensorflow/libtensorflow_framework.so.2(_ZN15stream_executor3gpu7CUDAFft37CreateBatchedPlanWithScratchAllocatorEPNS_6StreamEiPmS4_mmS4_mmNS_3fft4TypeEbiPNS_16ScratchAllocatorE+0x10a)[0x7f6738870a1a]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10FFTGPUBase5DoFFTEPNS_15OpKernelContextERKNS_6TensorEPmPS3_+0x38c)[0x7f674d4cc3fc]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow7FFTBase7ComputeEPNS_15OpKernelContextE+0x168)[0x7f674d3bcaf8]
/usr/local/lib/tensorflow/libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x32a)[0x7f6737a45a5a]
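Setting the two variables mentioned above before launching the training can be sketched as follows (the echo only confirms the values; the actual training command is omitted):

```shell
# Enable the native signal-handler output and single-threaded OpenBLAS
# (both variables are the ones mentioned in the report above):
export DEBUG_SIGNAL_HANDLER=1
export OPENBLAS_NUM_THREADS=1
# Confirm the values before launching the training:
echo "DEBUG_SIGNAL_HANDLER=$DEBUG_SIGNAL_HANDLER OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```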

vieting commented Jul 13, 2023

This is with CUDA version 11.6; the image is derived from here. I'm not sure how to check the cuFFT version, but /usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/include/cufft.h says #define CUFFT_VERSION 10900.
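Assuming NVIDIA's usual version encoding in cufft.h (major*1000 + minor*100 + patch; an assumption about the header's convention, not verified from this image), the reported value 10900 can be decoded in the shell:

```shell
# Decode CUFFT_VERSION=10900, assuming the major*1000 + minor*100 + patch
# encoding (an assumption, not verified here):
v=10900
major=$((v / 1000))
minor=$((v % 1000 / 100))
patch=$((v % 100))
echo "cuFFT $major.$minor.$patch"
```

Under that assumption, 10900 would correspond to cuFFT 10.9.0.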


vieting commented Jul 13, 2023

@albertz mentioned this issue on Slack. As far as I can tell, though, I don't use TF_FORCE_GPU_ALLOW_GROWTH or the corresponding gpu_options.allow_growth.
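One quick way to double-check that allow_growth is not being enabled through the environment (TensorFlow reads TF_FORCE_GPU_ALLOW_GROWTH at startup) is a small shell test:

```shell
# Check whether TF_FORCE_GPU_ALLOW_GROWTH is set to "true" in the
# current environment (TensorFlow reads this variable at startup):
if [ "${TF_FORCE_GPU_ALLOW_GROWTH:-}" = "true" ]; then
    growth_status="enabled"
else
    growth_status="not enabled"
fi
echo "allow_growth via environment: $growth_status"
```

This only covers the environment variable; gpu_options.allow_growth set programmatically in the config would have to be checked separately.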


vieting commented Jul 14, 2023

It seems like the issue was with the image. It was derived from Simon's tf2.8 image, but then torch was installed into that image. However, torch pulls in its own pip packages for some CUDA-related libraries, and I suspect they interfered with the libraries originally in the image. I'm now training with an image without torch, and that has been running since yesterday evening.
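To spot this kind of interference, one can filter pip's package list for the NVIDIA runtime wheels that torch pulls in. A sketch with illustrative sample data (the names mimic torch's real nvidia-* dependency wheels, but the exact packages and versions in the affected image are not confirmed):

```shell
# Illustrative sample of `pip3 list` output after installing torch with
# CUDA wheels (names/versions are made up for the example):
sample="torch 2.0.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cufft-cu11 10.9.0.58
tensorflow 2.8.0"
# The same filter run on `pip3 list` in a real image shows which NVIDIA
# runtime libraries pip installed alongside the ones baked into the image:
echo "$sample" | grep '^nvidia-'
```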


vieting commented Nov 15, 2023

I just had a very similar error again:

Signal handler: signal 11:                                                                                               
/var/tmp/vieting/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7f7f5bc9020b]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7f5f0f8520]                                                                
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f7f5f14c9fc]                                                      
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f7f5f0f8476]                                                              
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7f5f0f8520]                                                                
/.singularity.d/libs/libcuda.so.1(+0x2b42f2)[0x7f7f0caf52f2]                                                             
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x4b935c)[0x7f7e8f4b935c]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x21b14f)[0x7f7e8f21b14f]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x243b75)[0x7f7e8f243b75]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x246116)[0x7f7e8f246116]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x27cb8b)[0x7f7e8f27cb8b]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x27d38c)[0x7f7e8f27d38c]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x272196)[0x7f7e8f272196]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x226d35)[0x7f7e8f226d35]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(cufftXtMakePlanMany+0x3d5)[0x7f7e8f26e855]                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(cufftMakePlanMany64+0x5d)[0x7f7e8f26aacd]                        
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN15stream_executor3gpu11CUDAFftPlan10InitializeEPNS0_11GpuExecutorEPNS_6StreamEiPmS6_mmS6_mmNS_3fft4TypeEiPNS_16ScratchAllocatorE+0x9b8)[0x7f7f53dfb288]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN15stream_executor3gpu7CUDAFft37CreateBatchedPlanWithScratchAllocatorEPNS_6StreamEiPmS4_mmS4_mmNS_3fft4TypeEbiPNS_16ScratchAllocatorE+0xc5)[0x7f7f53dfd075]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2(_ZN10tensorflow10FFTGPUBase5DoFFTEPNS_15OpKernelContextERKNS_6TensorEPmPS3_+0xc28)[0x7f7f41080c88]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2(_ZN10tensorflow7FFTBase7ComputeEPNS_15OpKernelContextE+0x46a)[0x7f7f40ffb3ba]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x266)[0x7f7f53930d66]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1b21bcb)[0x7f7f538f9bcb]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN3tsl6thread16EigenEnvironmentEE10WorkerLoopEi+0x722)[0x7f7f52b91d62]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZSt13__invoke_implIvRZN3tsl6thread16EigenEnvironment12CreateThreadESt8functionIFvvEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_+0x41)[0x7f7f52b915a1]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x196e41b)[0x7f7f5374641b]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f7f5f14aac3]                                                                
/lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f7f5f1dca40]                                                               
Installed signal_handler.so.

This was with TF 2.14, CUDA 11.8, cuDNN 8600 (8.6?), and a recent RETURNN version. The image definition is below:

Bootstrap: docker                                                                                                        
From: tensorflow/tensorflow:2.14.0-gpu                                                                                      
Stage: build                                                                                                                
                                                                                                                         
%post                                                                                                                    
    apt update -y                                                                                                           
                                                                                                                         
    # all the fundamental basics, zsh is needed because calling the cache manager might launch the user shell
    DEBIAN_FRONTEND=noninteractive apt install -y wget git unzip gzip libssl-dev lsb-release zsh \                          
        bison libxml2-dev libopenblas-dev libsndfile1-dev libcrypto++-dev libcppunit-dev \                                  
        parallel xmlstarlet python3-lxml htop strace gdb sox python3-pip cmake ffmpeg vim                                   
                                                                                                                            
    # download the cache manager and place in /usr/local                                                                    
    cd /usr/local                                                                                                           
    git clone https://github.com/rwth-i6/cache-manager.git                                                                  
    cd bin                                                                                                                  
    ln -s ../cache-manager/cf cf                                                                                            
                                                                                                                            
    echo /usr/local/lib/python3.11/dist-packages/tensorflow > /etc/ld.so.conf.d/tensorflow.conf                             
    ldconfig                                                                                                                
                                                                                                                            
    apt install -y python3 python3-pip                                                                                      
                                                                                                                            
    # general                                                                                                               
    pip3 install -U pip setuptools wheel                                                                                    
    pip3 install ipdb                                                                                                       
                                                                                                                            
    # Returnn                                                                                                               
    pip3 install h5py six soundfile librosa==0.10 better-exchook dm-tree psutil                                             
                                                                                                                            
    # Sisyphus                                                                                                              
    pip3 install --ignore-installed psutil flask ipython                                                                    
    pip3 install git+https://github.com/rwth-i6/sisyphus                                                                    
                                                                                                                            
    # i6_core / i6_experiments                                                                                              
    pip3 install black==22.3.0 matplotlib typing-extensions typeguard  # sequitur-g2p==1.0.1668.23                          
                                                                                                                            
    # memory profiling                                                                                                      
    pip3 install memray objgraph Pympler

vieting reopened this Nov 15, 2023

albertz commented Nov 15, 2023

I wonder: that is libcufft version 10, but there is cuFFT 12 now. Maybe we should try a newer cuFFT version?

Edit: The libcufft filename versioning is confusing. It's not directly related to the actual cuFFT version (nor to the CUDA version). We have the symlink libcufft.so.10 -> libcufft.so.10.9.0.58. That file is part of libcufft-11-8 (ref), in the /usr/local/cuda-11.8 directory.
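The symlink chain can be inspected with readlink. A stand-in demo using a temp directory, since the real CUDA tree may not exist on the machine running this:

```shell
# Recreate the symlink layout described above in a temp dir:
tmp=$(mktemp -d)
touch "$tmp/libcufft.so.10.9.0.58"
ln -s libcufft.so.10.9.0.58 "$tmp/libcufft.so.10"
# On a real install the equivalent check would be something like:
#   readlink -f /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10
readlink "$tmp/libcufft.so.10"
```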


albertz commented Nov 15, 2023

Can you maybe try tensorflow/tensorflow:2.15.0-gpu as the base image? That comes with libcufft-12-2 (ref). I think that would be the file libcufft.so.11.0.8.91 or so? In all reports of this crash, I have always seen libcufft.so.10, never libcufft.so.11 (though maybe just because it's too new and hasn't been tried yet).
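If trying that, the only change to the def file posted earlier in the thread would be the From: line (assuming the 2.15.0-gpu tag is published on Docker Hub):

```
Bootstrap: docker
From: tensorflow/tensorflow:2.15.0-gpu
Stage: build
```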


vieting commented Nov 15, 2023

The TF 2.15.0 release is just 16 hours old; maybe it's not yet available on Docker Hub?

FATAL:   While performing build: conveyor failed to get: reading manifest 2.15.0-gpu in docker.io/tensorflow/tensorflow: manifest unknown: manifest unknown
