Pretrain phase problem #45

Closed

haoshuai714 opened this issue Jan 25, 2022 · 11 comments

@haoshuai714

I have a problem at the pretraining phase: the program stops about halfway through the run with the following output:
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6594 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6600 closing signal SIGTERM
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
self._shutdown(e.sigval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
self._pcontext.close(death_sig)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
self._close(death_sig=death_sig, timeout=timeout)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 709, in _close
if handler.proc.poll() is None:
File "/usr/lib/python3.6/subprocess.py", line 875, in poll
return self._internal_poll()
File "/usr/lib/python3.6/subprocess.py", line 1403, in _internal_poll
pid, sts = _waitpid(self.pid, _WNOHANG)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

Have you ever had a similar problem?
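For context: signal 1 is SIGHUP, which the elastic agent typically receives when the launching terminal or SSH session disconnects; the warnings above then show it shutting the workers down. A minimal check (not from the original thread) that maps the number in the log to a signal name:

import signal

# "Process 6523 got signal: 1" -- on Linux, signal 1 is SIGHUP,
# usually delivered when the controlling terminal or SSH session hangs up.
print(signal.Signals(1).name)  # prints "SIGHUP"

Launching the job under nohup, tmux, or screen keeps SIGHUP from reaching the agent when the session closes.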

@LiJunnan1992
Contributor

Hi, I haven't encountered this problem, and the error message does not seem to point to any part of the pretraining code.

@haoshuai714
Author

Thanks! Maybe it's a Python/PyTorch version mismatch. Could you provide a requirements file listing the Python version, torch version, etc.? Thank you!

@LiJunnan1992
Contributor

This code has been tested with Python 3.8 and PyTorch 1.9.
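A minimal sketch (not from the original thread) for printing the interpreter and library versions a run actually picks up, so they can be compared against the versions listed above:

import sys
import torch

# Report the Python, PyTorch, CUDA, and NCCL versions seen by the process.
print("python:", sys.version.split()[0])
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
if torch.cuda.is_available():
    print("nccl  :", torch.cuda.nccl.version())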

@Junjie-Ye

I also have a problem at the pretraining phase. Two worker processes print the same traceback, interleaved; deduplicated it reads:
Traceback (most recent call last):
  File "Pretrain.py", line 215, in <module>
    main(args, config)
  File "Pretrain.py", line 93, in main
    utils.init_distributed_mode(args)
  File "/root/albef/ALBEF/utils.py", line 257, in init_distributed_mode
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Have you ever had a similar problem? Thanks!
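A common first step for an ncclSystemError is to turn on NCCL's own logging. The sketch below is not from the original thread, and the interface name is an assumption to adapt; the variables are standard NCCL environment variables and must be set before torch.distributed initializes:

import os

# Standard NCCL environment variables (not ALBEF-specific); set them before
# init_process_group / torch.distributed.barrier() is reached.
os.environ.setdefault("NCCL_DEBUG", "INFO")          # verbose NCCL logging to locate the failing system call
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumption: pin NCCL to a real network interface; adjust to yours
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # assumption: disable InfiniBand transport if the nodes have none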

@haoshuai714
Author

> I also have a problem at the pretraining phase ... RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed. Have you ever had a similar problem? Thanks!

Check your Python and PyTorch versions!

@Junjie-Ye

> Check your Python and PyTorch versions!

My python version is 3.7 and my torch version is 1.8.0. Is there anything wrong?
Thanks for your answer.

@haoshuai714
Author

> My Python version is 3.7 and my torch version is 1.8.0. Is there anything wrong? Thanks for your answer.

The author said above: "This code has been tested with Python 3.8 and PyTorch 1.9."

@haoshuai714
Author

Could you provide an "environment.yml" file that lists the install dependencies?
For example: https://github.com/jonmun/EPIC-KITCHENS-100_UDA_TA3N/blob/main/environment.yml
Thanks!
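If a conda environment.yml is not at hand, a minimal sketch (not from the original thread) that dumps the installed pip packages and versions into a shareable file:

import pkg_resources

# Write installed packages and versions to a requirements-style file,
# which can stand in for a conda environment.yml.
with open("requirements-freeze.txt", "w") as f:
    for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
        f.write("{}=={}\n".format(dist.project_name, dist.version))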

@Junjie-Ye

> Could you provide an "environment.yml" file that lists the install dependencies? Thanks!

Thanks. Would you please tell me your e-mail and I'll send the document to you.

@haoshuai714
Author

> Thanks. Would you please tell me your e-mail and I'll send the document to you.

haoxiaoshuai@iie.ac.cn

@Junjie-Ye

> haoxiaoshuai@iie.ac.cn

I have already sent the document to your e-mail. Looking forward to your reply. Thank you!
