CUDA error: initialization error #1494

Closed
albertz opened this issue Jan 13, 2024 · 3 comments

Comments

@albertz
Member

albertz commented Jan 13, 2024

RETURNN train proc manager starting up, version 1.20240112.101454+git.e8d293ed 
Most recent trained model epoch: 523 file: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/models/epoch.523 
Run RETURNN...
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
Running in managed mode.
RETURNN starting up, version 1.20240112.101454+git.e8d293ed, date/time 2024-01-13-23-33-03 (UTC+0000), pid 1854, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config']
Hostname: cn-233
...
Set PYTORCH_CUDA_ALLOC_CONF='backend:cudaMallocAsync'.
...
Available CUDA devices:
  1/1: cuda:0
       name: NVIDIA GeForce RTX 3090
       total_memory: 23.7GB
       capability: 8.6
       device_index: 1
...
ep 524 devtrain eval, step 92, ctc_4 0.314, ctc_8 0.126, ce 0.157, fer 0.017, mem_usage:cuda 2.6GB
ep 524 devtrain eval, step 93, ctc_4 0.289, ctc_8 0.151, ce 0.155, fer 0.017, mem_usage:cuda 2.6GB
ep 524 devtrain eval, step 94, ctc_4 0.340, ctc_8 0.192, ce 0.165, fer 0.019, mem_usage:cuda 2.6GB
ep 524 devtrain eval, step 95, ctc_4 0.320, ctc_8 0.177, ce 0.194, fer 0.021, mem_usage:cuda 2.6GB
devtrain: score ctc_4 0.208 ctc_8 0.085 ce 0.137 error 0.011
Memory usage (cuda): alloc cur 1.6GB alloc peak 2.6GB reserved cur 3.3GB reserved peak 3.3GB
Starting training at epoch 525, global train step 256929
start epoch 525 global train step 256929 with effective learning rate 0.0003362128120603084 ...
Memory usage (cuda): alloc cur 1.6GB alloc peak 1.6GB reserved cur 3.3GB reserved peak 3.3GB
MEMORY: sub proc MPD seq order(3393) increased RSS: rss=720.7MB pss=696.2MB uss=695.0MB shared=25.7MB
MEMORY: sub proc MPD worker(3394) increased RSS: rss=145.5MB pss=121.3MB uss=120.1MB shared=25.4MB
MEMORY: sub proc MPD worker(3395) increased RSS: rss=148.7MB pss=124.5MB uss=123.4MB shared=25.3MB
MEMORY: sub proc MPD worker(3396) increased RSS: rss=152.3MB pss=128.1MB uss=126.9MB shared=25.4MB
MEMORY: sub proc MPD worker(3397) increased RSS: rss=152.7MB pss=128.5MB uss=127.4MB shared=25.3MB
MEMORY: sub proc TDL worker 0(3765) initial: rss=4.6GB pss=1.9GB uss=7.3MB shared=4.6GB
MEMORY: total (main 3354, 2024-01-13, 23:45:22, 21 procs): pss=12.7GB uss=8.2GB
[2024-01-13 23:45:24,558] INFO: Run time: 0:12:24 CPU: 0.40% RSS: 19.37GB VMS: 231.81GB
MEMORY: sub proc MPD worker(3394) increased RSS: rss=367.8MB pss=343.6MB uss=342.4MB shared=25.4MB
MEMORY: sub proc MPD worker(3395) increased RSS: rss=370.7MB pss=346.5MB uss=345.3MB shared=25.3MB
MEMORY: sub proc MPD worker(3396) increased RSS: rss=370.8MB pss=346.6MB uss=345.4MB shared=25.4MB
MEMORY: sub proc MPD worker(3397) increased RSS: rss=370.7MB pss=346.5MB uss=345.3MB shared=25.3MB
MEMORY: total (main 3354, 2024-01-13, 23:45:29, 21 procs): pss=13.6GB uss=9.1GB
MEMORY: sub proc MPD worker(3394) increased RSS: rss=723.8MB pss=695.6MB uss=693.3MB shared=30.5MB
MEMORY: sub proc MPD worker(3395) increased RSS: rss=734.7MB pss=702.7MB uss=698.3MB shared=36.4MB
MEMORY: sub proc MPD worker(3396) increased RSS: rss=736.9MB pss=704.2MB uss=700.0MB shared=37.0MB
MEMORY: sub proc MPD worker(3397) increased RSS: rss=741.2MB pss=707.8MB uss=703.1MB shared=38.2MB
MEMORY: total (main 3354, 2024-01-13, 23:45:36, 21 procs): pss=14.9GB uss=10.5GB
[2024-01-13 23:45:39,658] INFO: Run time: 0:12:39 CPU: 0.40% RSS: 21.80GB VMS: 240.96GB
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa29b912617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa29b8cd98d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa29b9cd9f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::ExchangeDevice(int) + 0x8a (0x7fa29b9cde5a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x521ce (0x7fa29b9d21ce in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x543d4 (0x7fa29b9d43d4 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x513c46 (0x7fa25c847c46 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x55ca7 (0x7fa29b8f7ca7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so) 
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7fa29b8efcb3 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so) 
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fa29b8efe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so) 
frame #10: <unknown function> + 0x7c84d8 (0x7fa25cafc4d8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames> 
frame #17: torch::utils::tensor_ctor(c10::DispatchKey, c10::ScalarType, torch::PythonArgs&) + 0x5f (0x7fa25ced4d3f in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so) 
frame #18: <unknown function> + 0x7b8c36 (0x7fa25caecc36 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #54: <unknown function> + 0x291b7 (0x7fa3088c01b7 in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #55: __libc_start_main + 0x7c (0x7fa3088c026c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #56: _start + 0x21 (0x401071 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11)

Fatal Python error: Aborted

Current thread 0x00007fa308896000 (most recent call first):
  Garbage-collecting
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/pipeline.py", line 47 in create_tensor
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/pipeline.py", line 61 in <listcomp>
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/pipeline.py", line 61 in collate_batch
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 42 in fetch
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308 in _worker_loop
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1039 in __init__
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 386 in _get_iterator
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 433 in __iter__
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 335 in train_epoch
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 236 in train
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 469 in execute_main_task
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 663 in main
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups, _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, matplotlib._c_internal_utils, PIL._imaging, matplotlib._path, kiwisolver._cext, matplotlib._image (total: 53)
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7fa29c83820b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fa3088d3f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fa30891de6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fa3088d3ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fa3088d3f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fa30891de6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fa3088d3ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7fa3088bf45c]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xa586a)[0x7fa29d55886a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb107a)[0x7fa29d56407a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb00d9)[0x7fa29d5630d9]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(__gxx_personality_v0+0x87)[0x7fa29d563807]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libgcc_s.so.1(+0x11184)[0x7fa308096184]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libgcc_s.so.1(_Unwind_Resume+0x12e)[0x7fa308096bbe]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x128f9)[0x7fa29b9928f9]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x513c46)[0x7fa25c847c46]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(+0x55ca7)[0x7fa29b8f7ca7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1e3)[0x7fa29b8efcb3]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7fa29b8efe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7c84d8)[0x7fa25cafc4d8]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f8d5)[0x7fa308e2a8d5]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f25c)[0x7fa308e2a25c]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x22bddf)[0x7fa308db6ddf]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyObject_GC_New+0x72)[0x7fa308db6d12]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(PyCMethod_New+0x6d)[0x7fa308d6db5d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1bfa4e)[0x7fa308d4aa4e]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(_ZN5torch5utils11tensor_ctorEN3c1011DispatchKeyENS1_10ScalarTypeERNS_10PythonArgsE+0x5f)[0x7fa25ced4d3f]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7b8c36)[0x7fa25caecc36]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e308e)[0x7fa308d6e08e]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyObject_MakeTpCall+0x71)[0x7fa308d502c1]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x72d)[0x7fa308d9311d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x206cf2)[0x7fa308d91cf2]
...
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_RunMain+0x2cc)[0x7fa308e29e8c]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_BytesMain+0x29)[0x7fa308e29ab9]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x291b7)[0x7fa3088c01b7]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(__libc_start_main+0x7c)[0x7fa3088c026c]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11(_start+0x21)[0x401071]
RuntimeError: DataLoader worker (pid(s) 3765) exited unexpectedly
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140338199617536)>, proc 3354.

...

RuntimeError: DataLoader worker (pid(s) 3765) exited unexpectedly

Module call stack:
(No module call frames.)
Run ['/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11', '/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py', '/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config']
RETURNN runtime: 0:02:00
RETURNN return code: 1
Most recent trained model epoch: 524 file: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/models/epoch.524
Most recent trained model epoch before RETURNN run: 524
-> trained successfully 0 epoch(s)
-> break
Total RETURNN num starts: 2
Total RETURNN runtime: 0:12:41
[2024-01-13 23:45:43,257] ERROR: Executed command failed:
[2024-01-13 23:45:43,258] ERROR: Cmd: ['/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11', '/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py', '/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config']
[2024-01-13 23:45:43,258] ERROR: Args: (1, ['/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11', '/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py', '/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ/output/returnn.config'])
[2024-01-13 23:45:43,258] ERROR: Return-Code: 1
[2024-01-13 23:45:43,259] INFO: Max resources: Run time: 0:12:43 CPU: 101.8% RSS: 28.36GB VMS: 308.95GB
--------------------- Slurm Task Epilog ------------------------
Job ID: 4110960
Time: Sun Jan 14 12:45:43 AM CET 2024
Elapsed Time: 00:12:54
Billing per second for TRES: billing=248,cpu=4,gres/gpu=1,mem=30G,node=1
Show resource usage with e.g.:
sacct -j 4110960 -o Elapsed,TotalCPU,UserCPU,SystemCPU,MaxRSS,ReqTRES%60,MaxDiskRead,MaxDiskWrite
--------------------- Slurm Task Epilog ------------------------
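A small triage sketch for logs like the one above. It only collects signals that already appear in this log: the CUDA initialization error, the `popen_fork.py` frames in the worker's Python stack, the `Garbage-collecting` marker, and the `TensorImpl` destructor frames. Taken together they would be consistent with a forked DataLoader worker touching CUDA while freeing a tensor (the CUDA runtime does not support use in a forked child once the parent has initialized it), but that reading is an assumption and is not confirmed in this thread. The function name `triage_worker_crash` and the returned keys are hypothetical, chosen for illustration.

```python
import re


def triage_worker_crash(log_text: str) -> dict:
    """Collect a few signals from a crash log (hypothetical helper, not part of RETURNN)."""
    # "DataLoader worker (pid(s) 3765) exited unexpectedly" may list several PIDs.
    pid_groups = re.findall(
        r"DataLoader worker \(pid\(s\) ([\d, ]+)\) exited unexpectedly", log_text
    )
    pids = {int(p) for group in pid_groups for p in group.replace(",", " ").split()}
    return {
        # The C++ exception message printed by c10::Error.
        "cuda_init_error": "CUDA error: initialization error" in log_text,
        # Frames from popen_fork.py mean the worker was created with fork().
        "worker_started_via_fork": "multiprocessing/popen_fork.py" in log_text,
        # faulthandler prints this when the abort happened inside a GC run.
        "crashed_during_gc": "Garbage-collecting" in log_text,
        # A tensor destructor appears in the native stack trace.
        "tensor_destructor_in_trace": "TensorImpl::~TensorImpl" in log_text,
        "worker_pids": sorted(pids),
    }


if __name__ == "__main__":
    # Excerpt of lines copied from the log above.
    excerpt = "\n".join(
        [
            "  what():  CUDA error: initialization error",
            "frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3",
            "  Garbage-collecting",
            '  File ".../multiprocessing/popen_fork.py", line 71 in _launch',
            "RuntimeError: DataLoader worker (pid(s) 3765) exited unexpectedly",
        ]
    )
    for key, value in triage_worker_crash(excerpt).items():
        print(f"{key}: {value}")
```

If all four flags are set, one thing to try (again an assumption, not something verified here) is starting the DataLoader workers with a non-fork start method, or making sure no CUDA tensors are alive in the parent when the workers are forked.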
@albertz
Member Author

albertz commented Jan 14, 2024

I got this again.

Job ID: 4132934
Job name: i6_core.returnn.training.ReturnnTrainingJob.l2dwBB9n7TqS.run
Host: cn-505
Date: Sun Jan 14 04:33:12 PM CET 2024
User: zeyer
Slurm account: hlt
Slurm partition: gpu_24gb
Work dir: 
------------------
Node usage:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3982687_1  gpu_24gb ReturnnT   hilmes  R    3:35:25      1 cn-505
4110271  gpu_24gb dual_inp schueman  R 1-07:05:13      1 cn-505
4102093_1  gpu_24gb i6_core.      jxu  R 2-10:35:44      1 cn-505
4132934_1  gpu_24gb i6_core.    zeyer  R       0:00      1 cn-505
------------------
Show launch script with:
sacct -B -j 
------------------
--------------------- Slurm Task Prolog ------------------------
[2024-01-14 15:33:15,221] INFO: Run time: 0:00:00 CPU: 76.60% RSS: 73MB VMS: 221MB
[2024-01-14 15:33:20,509] INFO: Start Job: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_100k/train work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS> Task: run
[2024-01-14 15:33:20,510] INFO: Inputs:
[2024-01-14 15:33:20,510] INFO: /u/zeyer/setups/combined/2021-05-31/tools/returnn
[2024-01-14 15:33:20,511] INFO: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
[2024-01-14 15:33:20,511] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.5ad18raRAWhr/output/out.ogg.zip
[2024-01-14 15:33:20,514] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.NSdIHfk1iw2M/output/out.ogg.zip
[2024-01-14 15:33:20,516] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.RvwLniNrgMit/output/out.ogg.zip
[2024-01-14 15:33:20,518] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.VN8PpcLm5r4s/output/out.ogg.zip
[2024-01-14 15:33:20,519] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.W2k1lPIRrws8/output/out.ogg.zip
[2024-01-14 15:33:20,521] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt/output/bpe.codes
[2024-01-14 15:33:20,522] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt/output/bpe.vocab
Uname: uname_result(system='Linux', node='cn-505', release='5.15.0-39-generic', version='#42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022', machine='x86_64')
Load: (3.46826171875, 3.33642578125, 3.37109375)
[2024-01-14 15:33:20,523] INFO: ------------------------------------------------------------
[2024-01-14 15:33:20,523] INFO: Starting subtask for arg id: 0 args: []
[2024-01-14 15:33:20,524] INFO: ------------------------------------------------------------
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
RETURNN train proc manager starting up, version 1.20240112.101454+git.e8d293ed
Most recent trained model epoch: 581 file: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/models/epoch.581
Run RETURNN...
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
Running in managed mode.
RETURNN starting up, version 1.20240112.101454+git.e8d293ed, date/time 2024-01-14-15-33-24 (UTC+0000), pid 3002473, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/returnn.config']
Hostname: cn-505
Installed native_signal_handler.so.
[2024-01-14 15:33:25,523] INFO: Run time: 0:00:10 CPU: 0.10% RSS: 450MB VMS: 6.52GB
MEMORY: main proc python3.11(3002473) initial: rss=328.0MB pss=307.9MB uss=301.2MB shared=26.8MB
MEMORY: sub proc python3.11(3002488) initial: rss=12.3MB pss=7.5MB uss=6.2MB shared=6.1MB
MEMORY: sub proc watch memory(3002489) initial: rss=48.2MB pss=31.8MB uss=26.7MB shared=21.6MB
MEMORY: total (main 3002473, 2024-01-14, 15:33:25, 3 procs): pss=347.2MB uss=334.1MB
Set PYTORCH_CUDA_ALLOC_CONF='backend:cudaMallocAsync'.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
CUDA_VISIBLE_DEVICES is set to '3'.
Available CUDA devices:
  1/1: cuda:0
       name: NVIDIA A10
       total_memory: 22.0GB
       capability: 8.6
       device_index: 3
...
Using device: cuda ('gpu' in config)
Using gpu device 3: NVIDIA A10
Using autocast (automatic mixed precision (AMP)) with dtype torch.bfloat16
...
Load model /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/models/epoch.581.pt
MEMORY: sub proc MPD seq order(3002721) increased RSS: rss=716.1MB pss=691.7MB uss=690.5MB shared=25.7MB
MEMORY: total (main 3002473, 2024-01-14, 15:33:57, 19 procs): pss=3.7GB uss=3.3GB
  epoch 581, global train step 284468
...
Evaluating dataset 'devtrain' 
[2024-01-14 15:41:22,230] INFO: Run time: 0:08:07 CPU: 0.20% RSS: 26.41GB VMS: 293.11GB
MEMORY: sub proc watch memory(3002489) increased RSS: rss=50.3MB pss=29.4MB uss=28.4MB shared=21.9MB 
MEMORY: sub proc MPD worker(3002722) increased RSS: rss=264.7MB pss=240.4MB uss=239.2MB shared=25.5MB
MEMORY: sub proc MPD worker(3002723) increased RSS: rss=241.2MB pss=216.8MB uss=215.6MB shared=25.6MB 
MEMORY: sub proc MPD worker(3002724) increased RSS: rss=265.2MB pss=240.9MB uss=239.7MB shared=25.5MB
MEMORY: sub proc MPD worker(3002725) increased RSS: rss=265.4MB pss=240.9MB uss=239.7MB shared=25.7MB 
MEMORY: sub proc TDL worker 0(3002942) increased RSS: rss=5.4GB pss=1.8GB uss=270.5MB shared=5.1GB
MEMORY: sub proc TDL worker 0(3002983) initial: rss=5.2GB pss=1.6GB uss=6.5MB shared=5.2GB 
MEMORY: total (main 3002473, 2024-01-14, 15:41:24, 22 procs): pss=15.6GB uss=10.0GB
ep 582 devtrain eval, step 0, ctc_4 0.271, ctc_8 0.108, ce 0.153, fer 0.013, mem_usage:cuda 3.2GB 
MEMORY: sub proc MPD worker(3002722) increased RSS: rss=723.4MB pss=698.6MB uss=697.4MB shared=25.9MB
MEMORY: sub proc MPD worker(3002723) increased RSS: rss=722.4MB pss=697.6MB uss=696.4MB shared=26.0MB
MEMORY: sub proc MPD worker(3002724) increased RSS: rss=722.4MB pss=697.6MB uss=696.4MB shared=25.9MB
MEMORY: sub proc MPD worker(3002725) increased RSS: rss=722.6MB pss=697.6MB uss=696.4MB shared=26.2MB
ep 582 devtrain eval, step 1, ctc_4 0.226, ctc_8 0.086, ce 0.153, fer 0.015, mem_usage:cuda 3.2GB
ep 582 devtrain eval, step 2, ctc_4 0.131, ctc_8 0.060, ce 0.147, fer 0.014, mem_usage:cuda 3.2GB
MEMORY: sub proc TDL worker 0(3002983) increased RSS: rss=5.3GB pss=1.6GB uss=107.0MB shared=5.2GB
MEMORY: total (main 3002473, 2024-01-14, 15:41:30, 22 procs): pss=17.5GB uss=11.9GB
ep 582 devtrain eval, step 3, ctc_4 0.228, ctc_8 0.092, ce 0.156, fer 0.017, mem_usage:cuda 3.2GB
ep 582 devtrain eval, step 4, ctc_4 0.233, ctc_8 0.115, ce 0.151, fer 0.011, mem_usage:cuda 3.2GB
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc758b92617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-pack
ages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc758b4d98d in /work/tools/users/zeyer/py-envs/py3.
11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc758ecd9f8 in /work/tools/users/zeyer/py-envs/py3.11-t
orch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::ExchangeDevice(int) + 0x8a (0x7fc758ecde5a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10
_cuda.so)
frame #4: <unknown function> + 0x521ce (0x7fc758ed21ce in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x543d4 (0x7fc758ed43d4 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x513c46 (0x7fc719a47c46 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_pytho
n.so)
frame #7: <unknown function> + 0x55ca7 (0x7fc758b77ca7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7fc758b6fcb3 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc1
0.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc758b6fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.
so)
frame #10: <unknown function> + 0x7c84d8 (0x7fc719cfc4d8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_pyth
on.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7fc719cfc865 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/
lib/libtorch_python.so)
<omitting python frames>
frame #24: <unknown function> + 0x92d7 (0x7fc7c4ea42d7 in /work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_6
4-linux-gnu.so)
frame #25: <unknown function> + 0xaf9d (0x7fc7c4ea5f9d in /work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_6
4-linux-gnu.so)
frame #26: <unknown function> + 0x9321 (0x7fc7c4ea4321 in /work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_6
4-linux-gnu.so)
frame #27: <unknown function> + 0x9f94 (0x7fc7c4ea4f94 in /work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_6
4-linux-gnu.so)
frame #28: <unknown function> + 0x9186 (0x7fc7c4ea4186 in /work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_6
4-linux-gnu.so)
frame #29: <unknown function> + 0x8cc2 (0x7fc7c4ea3cc2 in /work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_6
4-linux-gnu.so)
frame #30: <unknown function> + 0x12f41 (0x7fc7c4eadf41 in /work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_
64-linux-gnu.so)
frame #40: <unknown function> + 0x8523e (0x7fc7c5b9c23e in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #41: <unknown function> + 0x10617c (0x7fc7c5c1d17c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007fc68ffff640 (most recent call first):
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/socket.py", line 294 in accept
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/connection.py", line 608 in accept
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/connection.py", line 462 in accept
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/resource_sharer.py", line 138 in _serve
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap

Current thread 0x00007fc7c0eff640 (most recent call first):
  Garbage-collecting
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/reduction.py", line 51 in dumps
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 244 in _feed
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007fc7c5b14740 (most recent call first):
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/selectors.py", line 415 in select
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/connection.py", line 930 in wait
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/connection.py", line 423 in _poll
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/connection.py", line 256 in poll
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 275 in _worker_loop
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1039 in __init__
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 386 in _get_iterator
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 433 in __iter__
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 512 in eval_model
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 471 in train_epoch
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 236 in train
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 469 in execute_main_task
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 663 in main
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>
 
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups, _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, matplotlib._c_internal_utils, PIL._imaging, matplotlib._path, kiwisolver._cext, matplotlib._image (total: 53)
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7fc759a7d20b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fc7c5b53f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fc7c5b9de6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fc7c5b53ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fc7c5b53f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fc7c5b9de6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fc7c5b53ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7fc7c5b3f45c]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xa586a)[0x7fc75a7e486a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb107a)[0x7fc75a7f007a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb00d9)[0x7fc75a7ef0d9]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(__gxx_personality_v0+0x87)[0x7fc75a7ef807]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libgcc_s.so.1(+0x11184)[0x7fc7c4ff5184]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libgcc_s.so.1(_Unwind_Resume+0x12e)[0x7fc7c4ff5bbe]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x128f9)[0x7fc758e928f9]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x513c46)[0x7fc719a47c46]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(+0x55ca7)[0x7fc758b77ca7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1e3)[0x7fc758b6fcb3]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7fc758b6fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7c84d8)[0x7fc719cfc4d8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(_Z28THPVariable_subclass_deallocP7_object+0x305)[0x7fc
719cfc865]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1edb1d)[0x7fc7c5ffab1d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x2665e3)[0x7fc7c60735e3]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f7e9)[0x7fc7c60ac7e9]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f25c)[0x7fc7c60ac25c]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x22bddf)[0x7fc7c6038ddf]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyObject_GC_New+0x72)[0x7fc7c6038d12]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(PyCMethod_New+0x6d)[0x7fc7c5fefb5d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyObject_GenericGetAttrWithDict+0xd1)[0x7fc7c5ff1331]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyObject_LookupAttr+0x3c)[0x7fc7c5ff11fc]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x26762f)[0x7fc7c607462f]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e2fe2)[0x7fc7c5feffe2]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(PyObject_CallOneArg+0x4a)[0x7fc7c5fd2bba]
/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so(+0x92d7)[0x7fc7c4ea42d7]
/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so(+0xaf9d)[0x7fc7c4ea5f9d]
/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so(+0x9321)[0x7fc7c4ea4321]
/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so(+0x9f94)[0x7fc7c4ea4f94]
/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so(+0x9186)[0x7fc7c4ea4186]
/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so(+0x8cc2)[0x7fc7c4ea3cc2]
/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so(+0x12f41)[0x7fc7c4eadf41]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1c92d2)[0x7fc7c5fd62d2]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(PyObject_Vectorcall+0x38)[0x7fc7c5fd2678]
...
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x8523e)[0x7fc7c5b9c23e]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x10617c)[0x7fc7c5c1d17c]
ep 582 devtrain eval, step 5, ctc_4 0.221, ctc_8 0.084, ce 0.142, fer 0.012, mem_usage:cuda 3.2GB
ConnectionResetError: [Errno 104] Connection reset by peer
Module call stack:
(No module call frames.)
Error in sys.excepthook:
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/util/debug.py", line 150, in excepthook
    print("Unhandled exception %s in thread %s, proc %i." % (exc_type, threading.currentThread(), os.getpid()))
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 935, in __repr__
    status += " %s" % self._ident
              ~~~~~~^~~~~~~~~~~~~
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3002983) is killed by signal: Aborted. 
...
RETURNN runtime: 0:08:11
RETURNN return code: 1
Most recent trained model epoch: 582 file: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/models/epoch
.582
Most recent trained model epoch before RETURNN run: 581
-> trained successfully 1 epoch(s)
Try again, restart RETURNN...
...
ep 582 devtrain eval, step 91, ctc_4 0.364, ctc_8 0.192, ce 0.196, fer 0.021, mem_usage:cuda 2.6GB
ep 582 devtrain eval, step 92, ctc_4 0.340, ctc_8 0.157, ce 0.183, fer 0.019, mem_usage:cuda 2.6GB
ep 582 devtrain eval, step 93, ctc_4 0.322, ctc_8 0.180, ce 0.188, fer 0.022, mem_usage:cuda 2.6GB
ep 582 devtrain eval, step 94, ctc_4 0.373, ctc_8 0.215, ce 0.186, fer 0.021, mem_usage:cuda 2.6GB
ep 582 devtrain eval, step 95, ctc_4 0.344, ctc_8 0.181, ce 0.207, fer 0.022, mem_usage:cuda 2.6GB
devtrain: score ctc_4 0.225 ctc_8 0.099 ce 0.151 error 0.014
Memory usage (cuda): alloc cur 1.6GB alloc peak 2.6GB reserved cur 3.3GB reserved peak 3.3GB
Starting training at epoch 583, global train step 284958
start epoch 583 global train step 284958 with effective learning rate 0.0007711132375000001 ...
Memory usage (cuda): alloc cur 1.6GB alloc peak 1.6GB reserved cur 3.3GB reserved peak 3.3GB
[2024-01-14 15:42:52,502] INFO: Run time: 0:09:37 CPU: 0.00% RSS: 19.16GB VMS: 220.97GB
MEMORY: sub proc MPD seq order(3003067) increased RSS: rss=721.0MB pss=696.2MB uss=695.0MB shared=26.1MB
MEMORY: sub proc MPD worker(3003068) increased RSS: rss=328.7MB pss=304.4MB uss=303.3MB shared=25.5MB
MEMORY: sub proc MPD worker(3003069) increased RSS: rss=332.2MB pss=307.9MB uss=306.7MB shared=25.5MB
MEMORY: sub proc MPD worker(3003070) increased RSS: rss=350.5MB pss=326.0MB uss=324.8MB shared=25.7MB
MEMORY: sub proc MPD worker(3003071) increased RSS: rss=349.9MB pss=325.4MB uss=324.2MB shared=25.7MB
MEMORY: sub proc TDL worker 0(3003446) initial: rss=4.6GB pss=1.9GB uss=7.3MB shared=4.6GB
MEMORY: total (main 3003027, 2024-01-14, 15:42:56, 21 procs): pss=13.4GB uss=9.0GB
MEMORY: sub proc MPD worker(3003068) increased RSS: rss=813.7MB pss=754.6MB uss=740.2MB shared=73.5MB
MEMORY: sub proc MPD worker(3003069) increased RSS: rss=814.4MB pss=754.9MB uss=740.3MB shared=74.1MB
MEMORY: sub proc MPD worker(3003070) increased RSS: rss=814.2MB pss=754.5MB uss=739.9MB shared=74.3MB
MEMORY: sub proc MPD worker(3003071) increased RSS: rss=814.5MB pss=754.4MB uss=739.7MB shared=74.8MB
MEMORY: total (main 3003027, 2024-01-14, 15:43:02, 21 procs): pss=15.2GB uss=10.6GB
[2024-01-14 15:43:02,523] INFO: Run time: 0:09:47 CPU: 0.20% RSS: 21.76GB VMS: 230.32GB
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa6d5792617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-pack
ages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa6d574d98d in /work/tools/users/zeyer/py-envs/py3.
11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa6d5acd9f8 in /work/tools/users/zeyer/py-envs/py3.11-t
orch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::ExchangeDevice(int) + 0x8a (0x7fa6d5acde5a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10
_cuda.so)
frame #4: <unknown function> + 0x521ce (0x7fa6d5ad21ce in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x543d4 (0x7fa6d5ad43d4 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x513c46 (0x7fa696647c46 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_pytho
n.so)
frame #7: <unknown function> + 0x55ca7 (0x7fa6d5777ca7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7fa6d576fcb3 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc1
0.so) 
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fa6d576fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so) 
frame #10: <unknown function> + 0x7c84d8 (0x7fa6968fc4d8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames> 
frame #17: torch::utils::tensor_ctor(c10::DispatchKey, c10::ScalarType, torch::PythonArgs&) + 0x5f (0x7fa696cd4d3f in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so) 
frame #18: <unknown function> + 0x7b8c36 (0x7fa6968ecc36 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #54: <unknown function> + 0x291b7 (0x7fa7427531b7 in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #55: __libc_start_main + 0x7c (0x7fa74275326c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #56: _start + 0x21 (0x401071 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11)

Fatal Python error: Aborted

Current thread 0x00007fa742727740 (most recent call first):
  Garbage-collecting
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/pipeline.py", line 47 in create_tensor
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/pipeline.py", line 61 in <listcomp>
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/pipeline.py", line 61 in collate_batch
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 42 in fetch
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308 in _worker_loop
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1039 in __init__
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 386 in _get_iterator
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 433 in __iter__
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 335 in train_epoch
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 236 in train
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 469 in execute_main_task
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 663 in main
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, 
numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random.
_sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5
py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups,
 _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._
C._special, matplotlib._c_internal_utils, PIL._imaging, matplotlib._path, kiwisolver._cext, matplotlib._image (total: 53)
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7fa6d668e20b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fa742766f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fa7427b0e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fa742766ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fa742766f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fa7427b0e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fa742766ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7fa74275245c]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xa586a)[0x7fa6d73ef86a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb107a)[0x7fa6d73fb07a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb00d9)[0x7fa6d73fa0d9]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(__gxx_personality_v0+0x87)[0x7fa6d73fa807]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libgcc_s.so.1(+0x11184)[0x7fa741e28184]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libgcc_s.so.1(_Unwind_Resume+0x12e)[0x7fa741e28bbe]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x128f9)[0x7fa6d5a928f9]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x513c46)[0x7fa696647c46]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(+0x55ca7)[0x7fa6d5777ca7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1e3)[0x7fa6d576fcb3]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7fa6d576fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7c84d8)[0x7fa6968fc4d8]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f8d5)[0x7fa742cbf8d5]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x29f25c)[0x7fa742cbf25c]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x22bddf)[0x7fa742c4bddf]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyObject_GC_New+0x72)[0x7fa742c4bd12]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(PyCMethod_New+0x6d)[0x7fa742c02b5d]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1bfa4e)[0x7fa742bdfa4e]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(_ZN5torch5utils11tensor_ctorEN3c1011DispatchKeyENS1_10
ScalarTypeERNS_10PythonArgsE+0x5f)[0x7fa696cd4d3f]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7b8c36)[0x7fa6968ecc36]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1e308e)[0x7fa742c0308e]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyObject_MakeTpCall+0x71)[0x7fa742be52c1]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x72d)[0x7fa742c2811d]
...
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(_PyRun_AnyFileObject+0x43)[0x7fa742cb6a83]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_RunMain+0x2cc)[0x7fa742cbee8c]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(Py_BytesMain+0x29)[0x7fa742cbeab9]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x291b7)[0x7fa7427531b7]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(__libc_start_main+0x7c)[0x7fa74275326c]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11(_start+0x21)[0x401071]
RuntimeError: DataLoader worker (pid(s) 3003446) exited unexpectedly
...
Run ['/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11', '/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py', '/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/returnn.config']
RETURNN runtime: 0:01:32
RETURNN return code: 1
Most recent trained model epoch: 582 file: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/models/epoch.582
Most recent trained model epoch before RETURNN run: 582
-> trained successfully 0 epoch(s)
-> break
Total RETURNN num starts: 2
Total RETURNN runtime: 0:09:44
[2024-01-14 15:43:05,943] ERROR: Executed command failed:
[2024-01-14 15:43:05,943] ERROR: Cmd: ['/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11', '/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py', '/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/returnn.config']
[2024-01-14 15:43:05,943] ERROR: Args: (1, ['/work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11', '/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py', '/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS/output/returnn.config'])
[2024-01-14 15:43:05,944] ERROR: Return-Code: 1
[2024-01-14 15:43:05,946] INFO: Max resources: Run time: 0:09:50 CPU: 76.6% RSS: 27.79GB VMS: 294.50GB
--------------------- Slurm Task Epilog ------------------------
Job ID: 4132934
Time: Sun Jan 14 04:43:06 PM CET 2024
Elapsed Time: 00:09:54
Billing per second for TRES: billing=248,cpu=4,gres/gpu=1,mem=30G,node=1
Show resource usage with e.g.:
sacct -j 4132934 -o Elapsed,TotalCPU,UserCPU,SystemCPU,MaxRSS,ReqTRES%60,MaxDiskRead,MaxDiskWrite
--------------------- Slurm Task Epilog ------------------------

@albertz
Member Author

albertz commented Jan 14, 2024

I restarted and got the same crash again on cn-505, so I have now excluded that node. The job is waiting in the queue again.

@albertz
Member Author

albertz commented Jan 14, 2024

Maybe related:
pytorch/pytorch#21092
https://discuss.pytorch.org/t/runtimeerror-cuda-error-initialization-on-dataloader/162466

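Both point to a multiprocessing/fork issue in the DataLoader worker.

Looking at the stacks, that seems to be the case here: both crashes happen in a forked DataLoader worker (popen_fork.py), during garbage collection, inside the c10::TensorImpl destructor. So probably the process forked, and then the GC in the worker tried to free some CUDA tensor.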
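
To make the suspected mechanism concrete, here is a minimal stdlib-only sketch. It does not use CUDA or RETURNN code: `FakeDeviceBuffer`, `free_in_forked_worker` and the `guard_pid` flag are hypothetical names for illustration, and the "initialization error" string only stands in for the driver call that fails in a forked child. It shows that an object created in the parent is inherited as-is by a forked worker, and that a PID check is one possible way to avoid freeing it there.

```python
import multiprocessing as mp
import os


class FakeDeviceBuffer:
    """Hypothetical stand-in for a CUDA tensor owned by the process that created it."""

    def __init__(self):
        self.owner_pid = os.getpid()

    def free(self, guard_pid):
        in_owner = os.getpid() == self.owner_pid
        if guard_pid and not in_owner:
            return "skipped"  # guard: leave inherited handles alone
        if not in_owner:
            # Stands in for the driver call that fails in a forked child.
            return "initialization error"
        return "freed"


def _worker(buf, guard_pid, conn):
    # With the fork start method, `buf` is the parent's object, inherited as-is.
    conn.send(buf.free(guard_pid))
    conn.close()


def free_in_forked_worker(guard_pid):
    """Create a buffer in this process, then try to free it from a forked child."""
    ctx = mp.get_context("fork")  # POSIX only
    buf = FakeDeviceBuffer()
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_worker, args=(buf, guard_pid, child_conn))
    proc.start()
    result = parent_conn.recv()
    proc.join()
    return result
```

Another avenue, not verified for this setup, would be to avoid fork for the workers altogether: `torch.utils.data.DataLoader` accepts a `multiprocessing_context` argument, so a "spawn" context could be passed there, at the cost of slower worker startup.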
