Segfault in cufftXtMakePlanMany #1363

Open
vieting opened this issue Jul 13, 2023 · 7 comments

vieting commented Jul 13, 2023

I have a RETURNN training for a CTC model which keeps crashing with EOFError: Ran out of input. It usually completes one epoch, sometimes several, but crashes after 1-2 hours.

Simon earlier referred me to this memory leak issue, but the fix is already included in the RASR version I use. I also tried requesting more memory, up to 20GB, which did not solve the issue; the logs also say the maximum memory usage is below 10GB.

When running with DEBUG_SIGNAL_HANDLER=1 and OPENBLAS_NUM_THREADS=1, I get this stack trace:

Signal handler: signal 11:
/u/zeyer/code/playground/signal_handler.so(signal_handler+0x34)[0x7f67700bd924]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f677a04f090]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f677a04f00b]
/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f677a04f090]
/.singularity.d/libs/libcuda.so.1(+0x2b42f2)[0x7f67165cb2f2]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x4b935c)[0x7f66c23df35c]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x21b14f)[0x7f66c214114f]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x243b75)[0x7f66c2169b75]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x246116)[0x7f66c216c116]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x27ad3e)[0x7f66c21a0d3e]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x27b82c)[0x7f66c21a182c]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x272196)[0x7f66c2198196]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(+0x226d35)[0x7f66c214cd35]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(cufftXtMakePlanMany+0x3d5)[0x7f66c2194855]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(cufftMakePlanMany64+0xef)[0x7f66c2190b5f]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/lib/libcufft.so.10(cufftMakePlanMany+0x10f)[0x7f66c219115f]
/usr/local/lib/tensorflow/libtensorflow_framework.so.2(_ZN15stream_executor3gpu11CUDAFftPlan10InitializeEPNS0_11GpuExecutorEPNS_6StreamEiPmS6_mmS6_mmNS_3fft4TypeEiPNS_16ScratchAllocatorE+0x33f)[0x7f673886ea9f]
/usr/local/lib/tensorflow/libtensorflow_framework.so.2(_ZN15stream_executor3gpu7CUDAFft37CreateBatchedPlanWithScratchAllocatorEPNS_6StreamEiPmS4_mmS4_mmNS_3fft4TypeEbiPNS_16ScratchAllocatorE+0x10a)[0x7f6738870a1a]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10FFTGPUBase5DoFFTEPNS_15OpKernelContextERKNS_6TensorEPmPS3_+0x38c)[0x7f674d4cc3fc]
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow7FFTBase7ComputeEPNS_15OpKernelContextE+0x168)[0x7f674d3bcaf8]
/usr/local/lib/tensorflow/libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x32a)[0x7f6737a45a5a]
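Setting the two variables mentioned above before launching the training can be sketched as follows (the echo only confirms the values; the actual training command is omitted):

```shell
# Enable the native signal-handler output and single-threaded OpenBLAS
# (both variables are the ones mentioned in the report above):
export DEBUG_SIGNAL_HANDLER=1
export OPENBLAS_NUM_THREADS=1
# Confirm the values before launching the training:
echo "DEBUG_SIGNAL_HANDLER=$DEBUG_SIGNAL_HANDLER OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```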

vieting commented Jul 13, 2023

This is with CUDA version 11.6; the image is derived from here. I'm not sure how to check the cuFFT version, but /usr/local/lib/python3.8/dist-packages/tensorflow/python/../../nvidia/cufft/include/cufft.h says #define CUFFT_VERSION 10900.
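Assuming NVIDIA's usual version encoding in cufft.h (major*1000 + minor*100 + patch; an assumption about the header's convention, not verified from this image), the reported value 10900 can be decoded in the shell:

```shell
# Decode CUFFT_VERSION=10900, assuming the major*1000 + minor*100 + patch
# encoding (an assumption, not verified here):
v=10900
major=$((v / 1000))
minor=$((v % 1000 / 100))
patch=$((v % 100))
echo "cuFFT $major.$minor.$patch"
```

Under that assumption, 10900 would correspond to cuFFT 10.9.0.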


vieting commented Jul 13, 2023

@albertz mentioned this issue on Slack. As far as I can tell, though, I don't use TF_FORCE_GPU_ALLOW_GROWTH or the corresponding gpu_options.allow_growth.
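One quick way to double-check that allow_growth is not being enabled through the environment (TensorFlow reads TF_FORCE_GPU_ALLOW_GROWTH at startup) is a small shell test:

```shell
# Check whether TF_FORCE_GPU_ALLOW_GROWTH is set to "true" in the
# current environment (TensorFlow reads this variable at startup):
if [ "${TF_FORCE_GPU_ALLOW_GROWTH:-}" = "true" ]; then
    growth_status="enabled"
else
    growth_status="not enabled"
fi
echo "allow_growth via environment: $growth_status"
```

This only covers the environment variable; gpu_options.allow_growth set programmatically in the config would have to be checked separately.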


vieting commented Jul 14, 2023

It seems like the issue was with the image. It was derived from Simon's tf2.8 image, but then torch was installed into that image. However, torch pulls in its own pip packages for some CUDA-related libraries, and I suspect they interfered with the libraries originally in the image. I'm now training with an image without torch, and that has been running since yesterday evening.
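To spot this kind of interference, one can filter pip's package list for the NVIDIA runtime wheels that torch pulls in. A sketch with illustrative sample data (the names mimic torch's real nvidia-* dependency wheels, but the exact packages and versions in the affected image are not confirmed):

```shell
# Illustrative sample of `pip3 list` output after installing torch with
# CUDA wheels (names/versions are made up for the example):
sample="torch 2.0.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cufft-cu11 10.9.0.58
tensorflow 2.8.0"
# The same filter run on `pip3 list` in a real image shows which NVIDIA
# runtime libraries pip installed alongside the ones baked into the image:
echo "$sample" | grep '^nvidia-'
```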


vieting commented Nov 15, 2023

I just had a very similar error again:

Signal handler: signal 11:                                                                                               
/var/tmp/vieting/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7f7f5bc9020b]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7f5f0f8520]                                                                
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f7f5f14c9fc]                                                      
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f7f5f0f8476]                                                              
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7f5f0f8520]                                                                
/.singularity.d/libs/libcuda.so.1(+0x2b42f2)[0x7f7f0caf52f2]                                                             
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x4b935c)[0x7f7e8f4b935c]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x21b14f)[0x7f7e8f21b14f]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x243b75)[0x7f7e8f243b75]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x246116)[0x7f7e8f246116]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x27cb8b)[0x7f7e8f27cb8b]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x27d38c)[0x7f7e8f27d38c]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x272196)[0x7f7e8f272196]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(+0x226d35)[0x7f7e8f226d35]                                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(cufftXtMakePlanMany+0x3d5)[0x7f7e8f26e855]                       
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10(cufftMakePlanMany64+0x5d)[0x7f7e8f26aacd]                        
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN15stream_executor3gpu11CUDAFftPlan10InitializeEPNS0_11GpuExecutorEPNS_6StreamEiPmS6_mmS6_mmNS_3fft4TypeEiPNS_16ScratchAllocatorE+0x9b8)[0x7f7f53dfb288]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN15stream_executor3gpu7CUDAFft37CreateBatchedPlanWithScratchAllocatorEPNS_6StreamEiPmS4_mmS4_mmNS_3fft4TypeEbiPNS_16ScratchAllocatorE+0xc5)[0x7f7f53dfd075]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2(_ZN10tensorflow10FFTGPUBase5DoFFTEPNS_15OpKernelContextERKNS_6TensorEPmPS3_+0xc28)[0x7f7f41080c88]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2(_ZN10tensorflow7FFTBase7ComputeEPNS_15OpKernelContextE+0x46a)[0x7f7f40ffb3ba]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x266)[0x7f7f53930d66]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1b21bcb)[0x7f7f538f9bcb]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN3tsl6thread16EigenEnvironmentEE10WorkerLoopEi+0x722)[0x7f7f52b91d62]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(_ZSt13__invoke_implIvRZN3tsl6thread16EigenEnvironment12CreateThreadESt8functionIFvvEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_+0x41)[0x7f7f52b915a1]
/usr/local/lib/python3.11/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x196e41b)[0x7f7f5374641b]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f7f5f14aac3]                                                                
/lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f7f5f1dca40]                                                               
Installed signal_handler.so.

This was with TF 2.14, CUDA 11.8, cuDNN 8600 (8.6?), and a recent RETURNN version. The image definition is below:

Bootstrap: docker                                                                                                        
From: tensorflow/tensorflow:2.14.0-gpu                                                                                      
Stage: build                                                                                                                
                                                                                                                         
%post                                                                                                                    
    apt update -y                                                                                                           
                                                                                                                         
    # all the fundamental basics, zsh is needed because calling the cache manager might launch the user shell
    DEBIAN_FRONTEND=noninteractive apt install -y wget git unzip gzip libssl-dev lsb-release zsh \                          
        bison libxml2-dev libopenblas-dev libsndfile1-dev libcrypto++-dev libcppunit-dev \                                  
        parallel xmlstarlet python3-lxml htop strace gdb sox python3-pip cmake ffmpeg vim                                   
                                                                                                                            
    # download the cache manager and place in /usr/local                                                                    
    cd /usr/local                                                                                                           
    git clone https://github.com/rwth-i6/cache-manager.git                                                                  
    cd bin                                                                                                                  
    ln -s ../cache-manager/cf cf                                                                                            
                                                                                                                            
    echo /usr/local/lib/python3.11/dist-packages/tensorflow > /etc/ld.so.conf.d/tensorflow.conf                             
    ldconfig                                                                                                                
                                                                                                                            
    apt install -y python3 python3-pip                                                                                      
                                                                                                                            
    # general                                                                                                               
    pip3 install -U pip setuptools wheel                                                                                    
    pip3 install ipdb                                                                                                       
                                                                                                                            
    # Returnn                                                                                                               
    pip3 install h5py six soundfile librosa==0.10 better-exchook dm-tree psutil                                             
                                                                                                                            
    # Sisyphus                                                                                                              
    pip3 install --ignore-installed psutil flask ipython                                                                    
    pip3 install git+https://github.com/rwth-i6/sisyphus                                                                    
                                                                                                                            
    # i6_core / i6_experiments                                                                                              
    pip3 install black==22.3.0 matplotlib typing-extensions typeguard  # sequitur-g2p==1.0.1668.23                          
                                                                                                                            
    # memory profiling                                                                                                      
    pip3 install memray objgraph Pympler

vieting reopened this Nov 15, 2023

albertz commented Nov 15, 2023

I wonder: that is libcufft version 10, but there is cuFFT 12 now. Maybe we should try a newer cuFFT version?

Edit: The libcufft filename versioning is confusing. It's not directly related to the actual cuFFT version (nor to the CUDA version). We have the symlink libcufft.so.10 -> libcufft.so.10.9.0.58. That file is part of libcufft-11-8 (ref), in the /usr/local/cuda-11.8 directory.
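The symlink chain can be inspected with readlink. A stand-in demo using a temp directory, since the real CUDA tree may not exist on the machine running this:

```shell
# Recreate the symlink layout described above in a temp dir:
tmp=$(mktemp -d)
touch "$tmp/libcufft.so.10.9.0.58"
ln -s libcufft.so.10.9.0.58 "$tmp/libcufft.so.10"
# On a real install the equivalent check would be something like:
#   readlink -f /usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.10
readlink "$tmp/libcufft.so.10"
```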


albertz commented Nov 15, 2023

Can you maybe try tensorflow/tensorflow:2.15.0-gpu as the base image? That comes with libcufft-12-2 (ref). I think that would be the file libcufft.so.11.0.8.91 or so? In all reports of this crash, I have always seen libcufft.so.10, never libcufft.so.11 (though maybe just because it's too new and hasn't been tried yet).
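If trying that, the only change to the def file posted earlier in the thread would be the From: line (assuming the 2.15.0-gpu tag is published on Docker Hub):

```
Bootstrap: docker
From: tensorflow/tensorflow:2.15.0-gpu
Stage: build
```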


vieting commented Nov 15, 2023

The TF 2.15.0 release is just 16 hours old; maybe it's not yet available on Docker Hub?

FATAL:   While performing build: conveyor failed to get: reading manifest 2.15.0-gpu in docker.io/tensorflow/tensorflow: manifest unknown: manifest unknown
