Timeout error during training to reproduce results #7

Closed
jameelhassan opened this issue Jan 2, 2023 · 1 comment

jameelhassan commented Jan 2, 2023

Hi,

I have been trying to train the model on the ImageNet dataset using 16 V100 GPUs.
I am getting a timeout error during evaluation in the training script after the first epoch. It occurs at exactly the same point, iteration [2100/10010] of the evaluation loop. Any idea why this is happening?
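
The 1800000 ms in the trace below is the default 30-minute NCCL collective timeout. As a stopgap while debugging, the process group can be initialized with a longer timeout; a minimal sketch, assuming the standard torch.distributed setup rather than anything specific to this repo:

```python
# Sketch only (assumed init call, not the repo's own code): raise the
# collective timeout from NCCL's 30-minute default so a slow evaluation
# step fails later rather than exactly at 1800000 ms.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```

That only postpones the failure, though, so whatever stalls the collective still has to be found.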

STACK TRACE:

Test: [ 2090/10010] eta: 1:53:32 acc1: 73.3203 (76.6279) ema_acc1: 77.7734 (76.4956) time: 0.8568 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:587] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:587] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801862 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801679 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802073 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801667 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801821 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802399 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802025 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802703 milliseconds before timing out.
Test: [ 2100/10010] eta: 1:53:23 acc1: 79.5312 (76.6506) ema_acc1: 81.9922 (76.5298) time: 0.8566 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1546 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1549 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1554 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1555 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1558 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1570 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1574 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 12 (pid: 1567) of binary: /nfs/users/ext_jameel.hassan/anaconda3/envs/must/bin/python
Traceback (most recent call last):
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 15 (local_rank: 15)
exitcode : -6 (pid: 1575)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1575

Root Cause (first observed failure):
[0]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 12 (local_rank: 12)
exitcode : -6 (pid: 1567)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1567

jameelhassan (Author) commented

Found the bug. I had mistakenly used the train images for validation as well, which is what causes the timeout.
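
For anyone hitting the same thing: the 10010 evaluation iterations in the log are already a hint, since they are consistent with iterating over the ~1.28M training images rather than the 50,000-image validation split, and the mismatched split presumably leaves some ranks waiting at an all_gather past NCCL's 30-minute watchdog. A small pre-flight check along these lines would have caught it; this is a hypothetical helper, not part of the repo, and it assumes an ImageFolder-style validation directory and an already-initialized NCCL process group:

```python
# Hypothetical pre-flight check (not from the repo): verify the eval loader
# really points at the ImageNet validation split and that every rank agrees
# on its size before starting distributed evaluation.
import torch
import torch.distributed as dist
from torchvision import datasets

def check_val_split(val_dir, expected_size=50_000):
    dataset = datasets.ImageFolder(val_dir)
    n = len(dataset)
    # ImageNet-1k validation has 50,000 images; ~1.28M here means the
    # training directory was passed by mistake (the bug in this issue).
    assert n == expected_size, f"val split has {n} images, expected {expected_size}"

    if dist.is_available() and dist.is_initialized():
        # Ranks that disagree on the dataset size run different numbers of
        # batches, so a collective (e.g. an all_gather of predictions) can
        # hang until the NCCL watchdog takes the job down.
        local = torch.tensor([n], dtype=torch.long, device="cuda")
        sizes = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
        dist.all_gather(sizes, local)
        assert all(int(s.item()) == n for s in sizes), "ranks see different val sizes"
```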
