Auto-restart hanging job if faulty node detected #176

Open
albertz opened this issue Jan 28, 2024 · 11 comments · May be fixed by #177

@albertz
Member

albertz commented Jan 28, 2024

Currently the admins already have some mechanism to detect faulty nodes and then drain and reboot them. However, if you are unlucky and it is your job that is currently hanging there, nothing is done about it: it needs manual inspection by the user and manual cancellation of the job (or waiting until the time limit is hit). This is of course very suboptimal, and we are thinking about solutions. It was suggested that such logic could also be implemented as part of the Sisyphus manager.

Sisyphus already detects potentially hanging jobs by checking whether the log file has not been updated recently, and then prints:

[2024-01-28 22:12:30,257] WARNING: Job marked as running but logging file has not been updated: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.R8gexaONIr8S> assume it is running

So, the question is, how to detect whether it is really hanging and should be canceled and restarted.

And then the job should not go into the error state, also not retry_error, but just be restarted? Or maybe, if it cannot be resumed, it should also automatically be cleared first?

And maybe Sisyphus should also keep a temporary local list of excluded nodes and temporarily add the node there (e.g. for 2h or so)?


Here are some possible hang detection checks. All of them would involve logging on to the node via SSH. (A rough sketch combining these checks follows after the list.)

  • Check if nvidia-smi returns an error.
    (Maybe some unrelated GPU is faulty, but anyway, if your proc hangs + this is the case, I think it's ok to cancel the job.)
    (This already covers a lot of cases I had, but not all.)

  • Check whether python -c "import torch; torch.cuda.init()" hangs.

  • Check py-spy dump -p <pid>, whether that hangs (given some timeout, maybe 10sec).
    (One problem: Which PID actually? All the (deep) children of slurm_script? Only the direct children would actually not cover my RETURNN training setups with the option use_train_proc_manager, where only the sub-sub-proc hangs.)
    (Alternative to py-spy: Maybe just strace, or gdb, or something else? strace output looks like "strace: Process 81401 attached" and then nothing more comes. py-spy also only works for Python, so it is not generic.)

  • Check dmesg for some common errors? But what exactly?
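
A minimal sketch of how these checks could be combined, assuming passwordless SSH to the node and a known job PID. The function names (run_on_node, node_looks_faulty) are made up for illustration and are not part of Sisyphus:

import subprocess

def run_on_node(node: str, cmd: str, timeout: float = 10.0) -> bool:
    """Return True if the command finishes successfully within the timeout."""
    try:
        subprocess.run(
            ["ssh", node, cmd],
            check=True, timeout=timeout,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

def node_looks_faulty(node: str, pid: int) -> bool:
    # nvidia-smi returns an error -> likely GPU/driver trouble.
    if not run_on_node(node, "nvidia-smi", timeout=30):
        return True
    # CUDA init hangs or fails.
    if not run_on_node(node, 'python3 -c "import torch; torch.cuda.init()"', timeout=60):
        return True
    # py-spy dump on the job process hangs (Python-only check).
    # Which PID to use here is the open question from above (deep children of slurm_script?).
    if not run_on_node(node, f"py-spy dump -p {pid}", timeout=10):
        return True
    return False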

@albertz
Member Author

albertz commented Jan 28, 2024

On dmesg, some common errors:

Very common:

[Jan28 18:18] pcieport 0000:80:03.0: AER: Corrected error received: 0000:80:03.0
[  +0.000013] pcieport 0000:80:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  +0.000004] pcieport 0000:80:03.0:   device [8086:6f08] error status/mask=00001000/00002000
[  +0.000006] pcieport 0000:80:03.0:    [12] Timeout               
[Jan28 21:42] pcieport 0000:80:03.0: AER: Corrected error received: 0000:80:03.0
[  +0.000012] pcieport 0000:80:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  +0.000005] pcieport 0000:80:03.0:   device [8086:6f08] error status/mask=00001000/00002000
[  +0.000006] pcieport 0000:80:03.0:    [12] Timeout               
[Jan28 22:09] pcieport 0000:80:03.0: AER: Corrected error received: 0000:80:03.0
[  +0.000011] pcieport 0000:80:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  +0.000005] pcieport 0000:80:03.0:   device [8086:6f08] error status/mask=00001000/00002000
[  +0.000005] pcieport 0000:80:03.0:    [12] Timeout  

Sometimes:

[Jan28 11:22] INFO: task PROC:PID blocked for more than 120 seconds.
...
[  +0.000063]  uvm_ioctl+0x1072/0x16f0 [nvidia_uvm]

(UVM-related functions on the stack. Note that the hanging proc here is not necessarily your own proc, but it hints at a problem with UVM.)

Rare:

[  +8.493819] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 0): Illegal Instruction Encoding
[  +0.000013] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 0, TPC 0): Physical Multiple Warp Errors
[  +0.000006] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x504648=0x9 0x504650=0x24 0x504644=0xd3eff2 0x50464c=0x17f
[  +0.000029] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 1): Illegal Instruction Encoding
[  +0.000005] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 0, TPC 1): Physical Multiple Warp Errors
[  +0.000004] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x504e48=0x30009 0x504e50=0x24 0x504e44=0xd3eff2 0x504e4c=0x17f
[  +0.000028] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 2): Illegal Instruction Encoding
[  +0.000005] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Global Exception on (GPC 0, TPC 2): Physical Multiple Warp Errors
[  +0.000004] NVRM: Xid (PCI:0000:03:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x505648=0x20009 0x505650=0x24 0x505644=0xd3eff2 0x50564c=0x17f

Rare:

[  +0.001557] NVRM: Xid (PCI:0000:03:00): 31, pid=75125, name=python3, Ch 00000008, intr 30000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x7f39_280b4000. Fault is of type FAULT_PDE ACCESS_TYPE_WRITE
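
A hedged sketch that scans dmesg on the node (via SSH, assuming it is readable without root) for the patterns shown above; the pattern list is taken from the excerpts in this thread and is likely incomplete:

import re
import subprocess

DMESG_PATTERNS = [
    r"NVRM: Xid",                          # NVIDIA driver Xid errors
    r"blocked for more than \d+ seconds",  # hung-task warnings, e.g. in uvm_ioctl
    r"PCIe Bus Error",                     # AER-reported PCIe errors
]

def dmesg_has_errors(node: str) -> bool:
    try:
        out = subprocess.run(
            ["ssh", node, "dmesg"],
            capture_output=True, text=True, timeout=30,
        ).stdout
    except subprocess.TimeoutExpired:
        return True  # dmesg itself hanging is also a bad sign
    return any(re.search(p, out) for p in DMESG_PATTERNS)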

@michelwi
Contributor

This seems like a highly site-specific problem, so my suggestion would be to implement a generic "detect_hanging" function (working title) that does nothing by default and that can be overridden by users in their settings.py for specific locations.
Then every user can be more or less "aggressive" with their checks.

One problem that I see is that the hang detection would need to run in the manager and not the worker.

Here are some possible hang detection checks. All of them would involve logging on to the node via SSH.

This assumes that passwordless SSH access to all nodes has been set up.

Check whether python -c "import torch; torch.cuda.init()" hangs.

This adds a dependency to the Python env running the manager.

I am not against it. All of my counter points are easily mitigated by making the actual check a user implementation; then the user is responsible for setting up the requirements.
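
A minimal sketch of what this could look like (detect_hanging is the working title from above; the signature and arguments are just an assumption):

# Default in Sisyphus's global settings (sketch): do nothing.
def detect_hanging(job, task, engine_info) -> bool:
    """Return True if the job should be treated as hanging and be restarted.

    The default never reports a hang, so behavior only changes if the user
    overrides this function in their settings.py.
    """
    return False

# In a user's settings.py (sketch), one could then be more "aggressive",
# e.g. SSH to engine_info["host"] and run some of the checks discussed above.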

@albertz
Member Author

albertz commented Jan 29, 2024

highly site-specific problem

Is it? It might be specific to NVIDIA GPUs, and maybe also Linux, but this is rather generic then?

One problem that I see is that the hang detection would need run in the manager and not the worker.

Actually both are possible. In the worker, you could also start another subproc (or thread) which does those checks, and maybe it would touch a file when the checks are successful; then the Sisyphus manager only needs to check whether this file has recently been touched.
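
A sketch of that worker-side variant (names are illustrative; the suggestion above uses a subproc, a thread is shown here just for brevity):

import os
import threading
import time

def heartbeat_loop(heartbeat_file: str, checks_ok, interval: float = 60.0):
    # Touch the heartbeat file only while the checks still pass; the manager
    # then only needs to look at the file's mtime.
    while True:
        if checks_ok():
            with open(heartbeat_file, "a"):
                pass
            os.utime(heartbeat_file, None)
        time.sleep(interval)

# Started from the worker, e.g.:
# threading.Thread(target=heartbeat_loop, args=(path, checks_ok), daemon=True).start()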

But otherwise, why is it a problem if the hang detection runs in the manager?

Here are some possible hang detection checks. All of them would involve logging on to the node via SSH.

This assumes that passwordless SSH access to all nodes has been set up.

Yes, but this is usually the case.

@michelwi
Contributor

highly site-specific problem

Is it? It might be specific to NVIDIA GPUs, and maybe also Linux, but this is rather generic then?

maybe someone who buys server-grade hardware does not have these problems. Maybe other sites have additional problems. Maybe users of other distros have different error messages.

the admins already have some mechanism to detect faulty nodes

maybe they can share the output of this mechanism, and then at i6 one would use a script that queries the admin database of faulty nodes rather than trying to log in to each node.

@albertz
Member Author

albertz commented Jan 29, 2024

Is it? It might be specific to NVIDIA GPUs, and maybe also Linux, but this is rather generic then?

maybe someone who buys server-grade hardware does not have these problems. Maybe other sites have additional problems. Maybe users of other distros have different error messages.

If the problems don't occur, then the check would also not hurt. But actually, at least for DGX with V100-SXM3 (at the ITC), and also V100-SXM2 (at the ITC), and also A10 (at i6), I see very similar problems.

If there are additional often-occurring problems, we can always extend it.

The dmesg messages are only specific to Linux, or maybe to the NVIDIA driver.

I get your point that this should be easily user-configurable and extensible. But I also think a few of those checks are very generic and probably useful for everybody, and it might make sense to provide them directly as part of Sisyphus so that users can easily use them. Maybe a config option for each particular check (e.g. hang_check_nvidia_smi_error etc.).

So, I'm thinking about having a few predefined checks which the user can enable, and then additionally maybe such a function as you said, like hang_check_custom_func or so.

And additionally maybe an option like enable_hang_check.
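
To make this concrete, such options could look roughly like this in settings.py (a sketch only; none of these settings exist in Sisyphus yet, and the upper-case names just follow the existing gs.* convention):

ENABLE_HANG_CHECK = True

# Predefined checks that can be toggled individually:
HANG_CHECK_NVIDIA_SMI_ERROR = True
HANG_CHECK_CUDA_INIT = False
HANG_CHECK_PY_SPY_DUMP = True
HANG_CHECK_DMESG = False

# Optional additional user-defined check (callable returning True if hanging):
# HANG_CHECK_CUSTOM_FUNC = my_hang_check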

the admins already have some mechanism to detect faulty nodes

maybe they can share the output of this mechanism and then at i6 one would use a script that queries the admin database of faulty nodes rather than trying to login to each node.

The faulty nodes are set into drain mode. So we could just check whether a node is in drain mode. But we would need to extend the engine API: Slurm and the other engines each need their own specific mechanism to check that.
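
For Slurm, such an engine-specific check could be a thin wrapper around sinfo, roughly like this (a sketch, not existing Sisyphus engine API; other engines would need their own equivalent):

import subprocess

def slurm_drained_nodes() -> set:
    # Nodes that the admins have set to drain (covers draining and drained).
    out = subprocess.run(
        ["sinfo", "--noheader", "--states=drain", "--format=%N"],
        capture_output=True, text=True, check=True,
    ).stdout
    # %N prints a (possibly compressed) hostlist; a real implementation would
    # expand ranges like "cn-[280-283]", e.g. via "scontrol show hostnames".
    return {name for line in out.split() for name in line.split(",") if name}

The manager could then compare the node of a suspicious job against this set.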

@albertz
Member Author

albertz commented Feb 1, 2024

I just noticed: when I see the warning "Job marked as running but logging file has not been updated", that is when task.running(task_id) returns False. However, when the usage file (usage.run.1) was recently updated, Task.running should return True. And the LoggingThread inside the worker should actually update that file frequently. And the LoggingThread should actually not be affected by the hang? So why do I see this warning "Job marked as running but logging file has not been updated" then?

(I started with some initial implementation, also inside the worker, as another thread, which performs the same check, via task.running(task_id), and now I wonder why this was actually False...)

I just had a hanging case, but not so much related to the other GPU-related hangs here (I think), where the worker proc looks like this:

zeyer@cn-283 ~ % py-spy dump -p 1103757
Process 1103757: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 ./sis worker --engine long work/i6_core/returnn/forward/ReturnnForwardJobV2.TGTaG1BqmX3O run
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)

Thread 1103757 (idle): "MainThread"
    _try_wait (subprocess.py:1955)
    _wait (subprocess.py:1997)
    wait (subprocess.py:1262)
    call (subprocess.py:391)
    check_call (subprocess.py:408)
    run (i6_core/returnn/forward.py:326)
    run (sisyphus/task.py:188)
    worker_helper (sisyphus/worker.py:259)
    worker (sisyphus/worker.py:170)
    main (sisyphus/__main__.py:234)
    <module> (sis:14)
Thread 1103772 (idle): "Thread-2"
    ppid_map (psutil/_pslinux.py:1630)
    children (psutil/__init__.py:928)
    get_job_used_resources (sisyphus/engine.py:137)
    run (sisyphus/worker.py:101)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

And py-spy top:

Total Samples 1600
GIL: 0.00%, Active: 5.00%, Threads: 2

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                        
  5.00%   5.00%    1.14s     1.14s   _try_wait (subprocess.py)
  0.00%   0.00%   0.020s    0.020s   ppid_map (psutil/_pslinux.py)
  0.00%   5.00%   0.000s     1.14s   check_call (subprocess.py)
  0.00%   5.00%   0.000s     1.14s   worker_helper (sisyphus/worker.py)
  0.00%   5.00%   0.000s     1.14s   main (sisyphus/__main__.py)
  0.00%   0.00%   0.000s    0.020s   _bootstrap (threading.py)
  0.00%   0.00%   0.000s    0.020s   run (sisyphus/worker.py)
  0.00%   5.00%   0.000s     1.14s   _wait (subprocess.py)
  0.00%   5.00%   0.000s     1.14s   wait (subprocess.py)
  0.00%   5.00%   0.000s     1.14s   run (i6_core/returnn/forward.py)
  0.00%   0.00%   0.000s    0.020s   get_job_used_resources (sisyphus/engine.py)
  0.00%   5.00%   0.000s     1.14s   run (sisyphus/task.py)
  0.00%   5.00%   0.000s     1.14s   call (subprocess.py)
  0.00%   0.00%   0.000s    0.020s   children (psutil/__init__.py)
  0.00%   5.00%   0.000s     1.14s   <module> (sis)
  0.00%   5.00%   0.000s     1.14s   worker (sisyphus/worker.py)
  0.00%   0.00%   0.000s    0.020s   _bootstrap_inner (threading.py)

Via pystack:

...
    (Python) File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/psutil/__init__.py", line 928, in children
        ppid_map = _ppid_map()
      Arguments:
        recursive: True
        self: <Process at 0x7f5024d6ce90>
    (Python) File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/psutil/_pslinux.py", line 1630, in ppid_map
        data = f.read()
      Locals:
        ppid: 2
        dset: [b"I", b"2", b"0", b"0", b"0", b"-1", b"69238880", b"0", b"0", b"0", ...]
        rpar: 31
        data: b"48 (kworker/5:0H-events_highpri) I 2 0 0 0 -1 69238880 0 0 0 0 0 0 0 0 0 -2...
        f: <_io.BufferedReader at 0x7f5022d797a0>
        pid: 49
        procfs_path: "/proc"
        ret: {1: 0, 2: 0, 3: 2, 4: 2, 5: 2, 7: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, ...}
    (C) File "???", line 0, in _io__Buffered_read (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/lib/libpython3.11.so.1.0)
    (C) File "???", line 0, in _io_FileIO_readall_impl (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/lib/libpython3.11.so.1.0)
    (C) File "../sysdeps/unix/sysv/linux/fstatat64.c", line 168, in fstatat64 (/work/tools/users/zeyer/linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)

...
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/subprocess.py", line 1955, in _try_wait
        (pid, sts) = os.waitpid(self.pid, wait_flags)
      Arguments:                                                                
        wait_flags: 0                                                                                                                                           
        self: <Popen at 0x7f502394aad0>                                         
    (C) File "???", line 0, in os_waitpid (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/lib/libpython3.11.so.1.0)
    (C) File "../sysdeps/unix/sysv/linux/wait4.c", line 30, in wait4 (/work/tools/users/zeyer/linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)

@albertz albertz linked a pull request Feb 1, 2024 that will close this issue
@albertz
Member Author

albertz commented Feb 1, 2024

As you can see, this hangs in ppid_map. So I wonder now: maybe also in all the other cases where we see this warning "Job marked as running but logging file has not been updated", the LoggingThread hangs in a similar way, and that is the reason the usage file is not updated, and thus we see this warning?

This ppid_map iterates over all procs of the system and does open_binary("%s/%s/stat" % (procfs_path, pid)), i.e. it reads stat from /proc. If this hangs, something is really wrong with the node.
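
For reference, a simplified sketch of what psutil's ppid_map does (the real code is in psutil/_pslinux.py; this is just to illustrate where it can block): it walks /proc and reads each process's stat file to build a pid -> ppid mapping.

import os

def ppid_map_sketch(procfs_path: str = "/proc") -> dict:
    ret = {}
    for entry in os.listdir(procfs_path):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(procfs_path, entry, "stat"), "rb") as f:
                data = f.read()  # this is the read that hung for pid 49 in the trace above
        except (FileNotFoundError, ProcessLookupError):
            continue  # process exited in the meantime
        # stat format: "<pid> (<comm>) <state> <ppid> ..."; comm may contain spaces,
        # so parse from the last ")".
        rpar = data.rfind(b")")
        ret[int(entry)] = int(data[rpar + 2:].split()[1])
    return ret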

But that means we basically already have a reliable way to see whether the proc hangs: if we see this warning, it is really hanging. Or not?

I actually sometimes also see the warning for some jobs at the beginning when they are slow at loading (e.g. ReturnnForwardJobV2, where it takes a while to load the model checkpoint). And then once they have loaded the model, it continues. But how can that be? Why does the LoggingThread hang in that case?

Edit: See also the added pystack output. As you can see from there, it hangs when reading for pid: 49, which is 49 (cpuhp/6). How can this hang?

@albertz
Member Author

albertz commented Feb 1, 2024

Hm, very strange. I then attached via strace to debug this further, and it said that the proc got a SIGSTOP? But from whom? Then I sent a SIGCONT to it, and it finished just fine:

zeyer@cn-283 ~/Programmierung/playground
 % strace -p 1103757
strace: Process 1103757 attached
--- stopped by SIGSTOP ---
^Cstrace: Process 1103757 detached

zeyer@cn-283 ~/Programmierung/playground
 % kill -SIGCONT 1103757
zeyer@cn-283 ~/Programmierung/playground
 % strace -p 1103757    
strace: Process 1103757 attached
futex(0x307aea0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
newfstatat(AT_FDCWD, "/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/forward/ReturnnForwardJobV2.TGTaG1BqmX3O/finished.run.1", 0x7ffe72b01810, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/forward/ReturnnForwardJobV2.TGTaG1BqmX3O/finished.run.1", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 8
newfstatat(8, "", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_EMPTY_PATH) = 0
ioctl(8, TCGETS, 0x7ffe72b01700)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(8, 0, SEEK_CUR)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
close(8)                                = 0
write(1, "using temp-dir: /var/tmp/2i07yxk"..., 34) = 34
getpid()                                = 1103757
write(2, "[2024-02-01 13:24:27,202] INFO: "..., 67) = 67
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f5029c23f40}, {sa_handler=0x7f502a06f2a0, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f5029c23f40}, 8) = 0
rt_sigaction(SIGUSR1, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f5029c23f40}, {sa_handler=0x7f502a06f2a0, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f5029c23f40}, 8) = 0
rt_sigaction(SIGUSR2, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f5029c23f40}, {sa_handler=0x7f502a06f2a0, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f5029c23f40}, 8) = 0
munmap(0x7f5023600000, 2616448)         = 0
munmap(0x7f5023200000, 2466984)         = 0
munmap(0x7f5022a00000, 2785456)         = 0
munmap(0x7f5022e00000, 2274048)         = 0
munmap(0x7f5022600000, 2458032)         = 0
munmap(0x7f5022200000, 2130224)         = 0
close(4)                                = 0
close(5)                                = 0
close(3)                                = 0
close(7)                                = 0
munmap(0x7f5029217000, 4096)            = 0
munmap(0x7f5029796000, 16384)           = 0
exit_group(0)                           = ?
+++ exited with 0 +++

albertz added a commit to albertz/playground that referenced this issue Feb 1, 2024
@albertz
Member Author

albertz commented Feb 5, 2024

Again a hanging job. So, now debugging why I actually get the warning "Job marked as running but logging file has not been updated".

Current procs:

% ps a --forest -u $(whoami) -o pid,comm
...
   1755 slurm_script
   1763  \_ python3.11
   1790      \_ python3.11
   1808          \_ python3.11 <defunct>
   1951          |   \_ python3.11 <defunct>
   1809          \_ python3.11 <defunct>
   1810          \_ python3.11 <defunct>
   1941          |   \_ python3.11 <defunct>
   1811          \_ python3.11
   1949              \_ python3.11
   1950              \_ watch memory
...

So, checking the Sisyphus worker:

zeyer@cn-230 ~ % pystack remote 1763 --native-all --locals                       
Traceback for thread 1789 (python3.11) [] (most recent call last):
    (C) File "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 (/work/tools/users/zeyer/linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
    (C) File "/tmp/glibc-20221013-31297-1axye8t/glibc-2.35/nptl/pthread_create.c", line 442, in start_thread (/work/tools/users/zeyer/linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)                                                                                                                                              
    (C) File "???", line 0, in pythread_wrapper (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/lib/libpython3.11.so.1.0)
    (C) File "???", line 0, in thread_run (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/lib/libpython3.11.so.1.0)
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995, in _bootstrap        
        self._bootstrap_inner()                                                                                                                                 
      Arguments:                                                                                                                                                
        self: <LoggingThread at 0x7f44068cae90>                                                                                                                 
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
        self.run()                                                                                                                                              
      Arguments:                                                                                                                                                
        self: <LoggingThread at 0x7f44068cae90>                                                                                                                 
    (Python) File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/worker.py", line 139, in run
        self._cond.wait(gs.PLOGGING_INTERVAL)  
...

This seems ok.

So, current files:

 % ls -la alias/exp2023_04_25_rf/aed/v6-bhv20-nenc17-11gb-f32-bs10k-mgpu4-pavg100-wd1e_4-lrlin1e_5_443k-aux17-dynGradAccumV3a/train/
total 1772
drwxr-xr-x  6 zeyer assi      15 Feb  4 21:59 .
drwxr-xr-x 80 zeyer assi      82 Feb  4 03:45 ..
drwxr-xr-x  2 zeyer assi       5 Feb  4 21:59 engine
-rw-r--r--  1 zeyer assi       0 Jan 29 00:43 finished.create_files.1
-rw-r--r--  1 zeyer assi    1596 Feb  4 21:48 info
drwxr-xr-x  2 zeyer assi       8 Jan 29 00:43 input
-rw-r--r--  1 zeyer assi   15811 Feb  4 21:48 job.save
-rw-r--r--  1 zeyer assi    2537 Jan 29 00:43 log.create_files.1
-rw-r--r--  2 zeyer assi 5513891 Feb  5 00:09 log.run.1
drwxr-xr-x  3 zeyer assi       4 Jan 29 00:43 output
-rw-r--r--  1 zeyer assi     111 Jan 29 00:43 submit_log.create_files
-rw-r--r--  1 zeyer assi     546 Feb  4 21:48 submit_log.run
-rw-r--r--  1 zeyer assi     434 Jan 29 00:43 usage.create_files.1
-rw-r--r--  1 zeyer assi     579 Feb  5 12:24 usage.run.1
drwxr-xr-x  2 zeyer assi       8 Feb  4 23:38 work

So, the usage.run.1 was actually updated. Same listing, 3 minutes later:

-rw-r--r--  1 zeyer assi     579 Feb  5 12:27 usage.run.1

This is exactly my current time.

So I don't understand: why do I get the warning "Job marked as running but logging file has not been updated"? The usage.run.1 file is clearly still being updated. And Task.running checks this usage file first, and if it has been updated, it should return True and then not print the warning?

So, I restarted the Sis manager, to check if/when I get the same warning again:

zeyer@indigo ~/s/c/2021-05-31 (master) [SIGINT]> /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py
[2024-02-05 11:32:11,546] INFO: Loaded config: recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py (loaded module: i6_experiments.users.zeyer.experiments.exp2023_04_25_rf.i6) 
[2024-02-05 11:32:11,546] INFO: Config loaded (time needed: 22.16) 
[2024-02-05 11:32:23,058] INFO: Finished updating job states
...
[2024-02-05 11:32:24,102] INFO: error(2) running(10) waiting(573)
Clear jobs in error state? [y/N] 
Print verbose overview (v), update aliases and outputs (u), start manager (y), or exit (n)? y
[2024-02-05 11:38:37,607] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/aed/v6-bhv20-nenc17-11gb-f32-bs10k-mgpu4-pavg100-wd1e_4-lrlin1e_5_443k-aux17-dynGradAccumV3a/train work/i6_core/returnn/training/ReturnnTrainingJob.WUQ9fEnIbYFl> assume it is running

So, it takes 6 minutes, and then prints the warning again.

In Task.running:

maximal_file_age = gs.WAIT_PERIOD_JOB_FS_SYNC + gs.PLOGGING_UPDATE_FILE_PERIOD + gs.WAIT_PERIOD_JOB_CLEANUP

I have:

WAIT_PERIOD_JOB_FS_SYNC = 1
PLOGGING_UPDATE_FILE_PERIOD = 60
WAIT_PERIOD_JOB_CLEANUP = 10
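
With these values, the threshold works out as follows (a rough sketch of the effective check, not the exact Task.running code):

import os
import time

# WAIT_PERIOD_JOB_FS_SYNC + PLOGGING_UPDATE_FILE_PERIOD + WAIT_PERIOD_JOB_CLEANUP
maximal_file_age = 1 + 60 + 10  # = 71 seconds

def usage_file_is_stale(usage_file: str) -> bool:
    # The warning appears when the usage file's mtime is older than this threshold.
    return time.time() - os.path.getmtime(usage_file) > maximal_file_age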

I wonder: maybe the LoggingThread actually hung as well, but then recovered, and whenever I checked, it was running normally again, so I did not see it hanging? Or maybe NFS was hanging?

@michelwi
Contributor

When you check that the usage.run.1 file is indeed updated, do/did you check on the node of the job (cn-283) or on the node of the manager (indigo) or both?

@albertz
Member Author

albertz commented Feb 12, 2024

When you check that the usage.run.1 file is indeed updated, do/did you check on the node of the job (cn-283) or on the node of the manager (indigo) or both?

I checked from the node of the manager.

My current explanation is: I think the LoggingThread (or NFS? something else?) is sometimes hanging for a bit (over a minute or so). E.g., as you can see from above, there was a case where it was hanging (for quite a long time) in ppid_map. But then it recovers again; otherwise I would see the warning again and again. Sometimes this is the case, but often I only see it once, twice, maybe three times, with varying time gaps, and then it seems to run fine again.

I'm not really sure how to verify this. I would probably need an exact timeline from an independent node to log when the file is updated. At the same time, I would need frequent sampling of the worker with py-spy or so. And then I would need to check those times where the file is not updated in time and see what the py-spy output showed at that point in time. (Although maybe, when it hangs, the py-spy logger would also hang?) But I currently don't have time to set this up. Maybe it's also not so important to really verify this.

The actual question is how to reliably implement the job-hanging check. If the job hangs only temporarily, it should not be restarted. So maybe this just means I need to increase the timeout. Additionally, I could extend the LoggingThread with some of the suggested checks above and also write that information into the usage file. Then, in the manager, I can check the status from the usage file.
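
A hedged sketch of that last idea; the field name hang_check_ok and the JSON format are made up here (the real usage file has its own format), and the combination of staleness and check result is just one possibility:

import json
import os
import time

def write_usage(path: str, used_resources: dict, hang_check_ok: bool) -> None:
    # Worker side (e.g. from the LoggingThread): store the hang-check result
    # alongside the usual resource info.
    with open(path, "w") as f:
        json.dump(dict(used_resources, hang_check_ok=hang_check_ok), f)

def should_restart(path: str, max_age: float = 1800.0) -> bool:
    # Manager side: restart if the file has not been updated for a long time
    # (increased timeout), or if the worker's own checks reported a problem
    # in the last update.
    if time.time() - os.path.getmtime(path) > max_age:
        return True
    with open(path) as f:
        info = json.load(f)
    return not info.get("hang_check_ok", True)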
