Pretrain phase problem #45

Closed

haoshuai714 opened this issue Jan 25, 2022 · 11 comments

@haoshuai714

I have a problem at the pretraining phase: the program stops about halfway through the run with the following output:
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6594 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6600 closing signal SIGTERM
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
self._shutdown(e.sigval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
self._pcontext.close(death_sig)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
self._close(death_sig=death_sig, timeout=timeout)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 709, in _close
if handler.proc.poll() is None:
File "/usr/lib/python3.6/subprocess.py", line 875, in poll
return self._internal_poll()
File "/usr/lib/python3.6/subprocess.py", line 1403, in _internal_poll
pid, sts = _waitpid(self.pid, _WNOHANG)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

Have you ever had a similar problem?
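For context: signal 1 is SIGHUP, which the elastic agent typically receives when the launching terminal or SSH session disconnects; the warnings above then show it shutting the workers down. A minimal check (not from the original thread) that maps the number in the log to a signal name:

import signal

# "Process 6523 got signal: 1" -- on Linux, signal 1 is SIGHUP,
# usually delivered when the controlling terminal or SSH session hangs up.
print(signal.Signals(1).name)  # prints "SIGHUP"

Launching the job under nohup, tmux, or screen keeps SIGHUP from reaching the agent when the session closes.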

@LiJunnan1992
Contributor

Hi, I haven't encountered this problem, and the error message does not seem to point to any part of the pretraining code.

@haoshuai714
Author

Thanks! Maybe it's a Python/PyTorch version mismatch. Could you provide a requirements file listing the Python version, torch version, etc.? Thank you!

@LiJunnan1992
Contributor

This code has been tested with Python 3.8 and PyTorch 1.9.
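A minimal sketch (not from the original thread) for printing the interpreter and library versions a run actually picks up, so they can be compared against the versions listed above:

import sys
import torch

# Report the Python, PyTorch, CUDA, and NCCL versions seen by the process.
print("python:", sys.version.split()[0])
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
if torch.cuda.is_available():
    print("nccl  :", torch.cuda.nccl.version())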

@Junjie-Ye

I also have a problem at the pretraining phase. Two worker processes print the same traceback, interleaved; deduplicated it reads:
Traceback (most recent call last):
  File "Pretrain.py", line 215, in <module>
    main(args, config)
  File "Pretrain.py", line 93, in main
    utils.init_distributed_mode(args)
  File "/root/albef/ALBEF/utils.py", line 257, in init_distributed_mode
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Have you ever had a similar problem? Thanks!
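A common first step for an ncclSystemError is to turn on NCCL's own logging. The sketch below is not from the original thread, and the interface name is an assumption to adapt; the variables are standard NCCL environment variables and must be set before torch.distributed initializes:

import os

# Standard NCCL environment variables (not ALBEF-specific); set them before
# init_process_group / torch.distributed.barrier() is reached.
os.environ.setdefault("NCCL_DEBUG", "INFO")          # verbose NCCL logging to locate the failing system call
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumption: pin NCCL to a real network interface; adjust to yours
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # assumption: disable InfiniBand transport if the nodes have none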

@haoshuai714
Author

> I also have a problem at the pretraining phase ... RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed. Have you ever had a similar problem? Thanks!

Check your Python and PyTorch versions!

@Junjie-Ye

> Check your Python and PyTorch versions!

My python version is 3.7 and my torch version is 1.8.0. Is there anything wrong?
Thanks for your answer.

@haoshuai714
Author

> My Python version is 3.7 and my torch version is 1.8.0. Is there anything wrong? Thanks for your answer.

The author said above: "This code has been tested with Python 3.8 and PyTorch 1.9."

@haoshuai714
Author

Could you provide an "environment.yml" file that lists the install dependencies?
For example: https://github.com/jonmun/EPIC-KITCHENS-100_UDA_TA3N/blob/main/environment.yml
Thanks!
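If a conda environment.yml is not at hand, a minimal sketch (not from the original thread) that dumps the installed pip packages and versions into a shareable file:

import pkg_resources

# Write installed packages and versions to a requirements-style file,
# which can stand in for a conda environment.yml.
with open("requirements-freeze.txt", "w") as f:
    for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
        f.write("{}=={}\n".format(dist.project_name, dist.version))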

@Junjie-Ye

> Could you provide an "environment.yml" file that lists the install dependencies? Thanks!

Thanks. Would you please tell me your e-mail and I'll send the document to you.

@haoshuai714
Author

> Thanks. Would you please tell me your e-mail and I'll send the document to you.

haoxiaoshuai@iie.ac.cn

@Junjie-Ye

> haoxiaoshuai@iie.ac.cn

I have already sent the document to your e-mail. Looking forward to your reply. Thank you!
