Pretrain phase problem #45
Hi, I haven't encountered this problem, and the error message does not seem to point to any part of the pretraining code.
Thanks! Maybe it is a Python/PyTorch version mismatch problem. Could you provide a requirements file with the Python version, torch version, etc.? Thank you!
This code has been tested on Python 3.8 and PyTorch 1.9.
I have a problem in the pretrain phase, at: File "Pretrain.py", line 93, in main. Have you ever had a similar problem? Thanks!
Check your Python and PyTorch versions!
My Python version is 3.7 and my torch version is 1.8.0. Is there anything wrong?
The author's stated versions: this code has been tested on Python 3.8 and PyTorch 1.9.
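For reference, a quick way to print the versions being compared above (this command is not from the thread, just a standard check of a local PyTorch install):
# Print interpreter version, PyTorch version, CUDA build, and GPU availability
python -c "import sys, torch; print(sys.version); print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available())"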
Could you provide an "environment.yml" file that contains the install dependencies?
Thanks. Would you please tell me your e-mail, and I'll send the document to you.
I have already sent the document to your e-mail. Looking forward to your reply. Thank you!
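The dependency file itself was shared over e-mail rather than posted here. For anyone else who needs it, a standard way to export an environment like the one described above is sketched below; the file names are just conventional choices, not confirmed from this repository:
# With conda: produces the environment.yml requested above
conda env export --no-builds > environment.yml
# Or, with pip only: produces a plain requirements list
pip freeze > requirements.txt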
I have a problem in the pretrain phase: when the program is about halfway through the run, it fails with:
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6594 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6600 closing signal SIGTERM
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
self._shutdown(e.sigval)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
self._pcontext.close(death_sig)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
self._close(death_sig=death_sig, timeout=timeout)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 709, in _close
if handler.proc.poll() is None:
File "/usr/lib/python3.6/subprocess.py", line 875, in poll
return self._internal_poll()
File "/usr/lib/python3.6/subprocess.py", line 1403, in _internal_poll
pid, sts = _waitpid(self.pid, _WNOHANG)
File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1
Have you ever had a similar problem?
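This is not confirmed by the maintainers in this thread, but signal 1 in the trace above is SIGHUP, which the torch.distributed.elastic agent receives when the controlling terminal or SSH session closes. A common workaround is to detach the job from the terminal; the launch arguments below are placeholders, so substitute whatever launch command you normally use for Pretrain.py:
# Keep the launcher alive after the SSH session closes (log goes to pretrain.log)
nohup python -m torch.distributed.launch --nproc_per_node=8 Pretrain.py > pretrain.log 2>&1 &
# Or run the job inside a persistent terminal multiplexer session
tmux new -s pretrain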