cfg.DIST_ENABLE false fail #36
I meant that I am trying to test single GPU. But in many other places the flag is not checked: aot-benchmark/networks/managers/trainer.py Lines 342 to 352 in 1c3a5ec
|
E.g. here, by contrast, the code does check it: aot-benchmark/networks/managers/trainer.py Lines 59 to 83 in 1c3a5ec
|
The distributed sampler is useless and meaningless for evaluation, where GPUs run asynchronously instead of synchronously, and the video lengths always differ across GPUs. |
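For clarity, the sampler is one of the pieces that would need gating on the flag in training. A minimal sketch of what that could look like, assuming a cfg object exposing DIST_ENABLE and a standard torch Dataset; this is not the trainer's actual code:

```python
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

def build_train_loader(dataset: Dataset, cfg, batch_size: int = 4) -> DataLoader:
    # DistributedSampler needs an initialized process group to query
    # rank/world size, so it only makes sense when distributed is enabled.
    if cfg.DIST_ENABLE:
        sampler = DistributedSampler(dataset, shuffle=True)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    # Single-GPU training: a plain shuffling DataLoader is enough.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```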
Is this not in the trainer? |
This config is designed for training.
… On Mar 7, 2023, at 23:49, bhack wrote:
Is this not in the trainer?
I am trying to train a single GPU job with cfg.DIST_ENABLE=False
|
Yes, that's what I meant. Aren't we going to have an issue if we don't conditionally wrap torch.nn.parallel.DistributedDataParallel in the trainer? |
I suppose it should be ok.
… On Mar 8, 2023, at 00:00, bhack wrote:
Yes it is what I meant. Are we not going to have issue if we don't conditional wrap torch.nn.parallel.DistributedDataParallel in the trainer?
|
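As an illustration of the conditional wrapping being discussed, here is a minimal sketch, assuming a cfg object exposing DIST_ENABLE and a single target GPU index; the actual trainer code may differ:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def maybe_wrap_ddp(model: torch.nn.Module, cfg, gpu: int) -> torch.nn.Module:
    model = model.cuda(gpu)
    if cfg.DIST_ENABLE:
        # DDP itself requires the default process group to be initialized
        # first (see the init_process_group sketch below), so the wrap
        # must also be gated on the flag.
        model = DDP(model, device_ids=[gpu])
    return model
```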
Isn't that one going to require init_process_group? But that is conditionally wrapped in the trainer: aot-benchmark/networks/managers/trainer.py Lines 59 to 64 in 1c3a5ec
|
Sorry, I don't understand what you mean.
… On Mar 8, 2023, at 00:54, bhack wrote:
Isn't that one going to require init_process_group? But it is conditional wrapped in the trainer.
https://github.com/yoxu515/aot-benchmark/blob/1c3a5ec51d81f3e17ff9092aa1e830206d766132/networks/managers/trainer.py#L59-L64
|
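A rough sketch of the kind of guarded setup those trainer lines (59 to 64) point at, assuming an env-style rendezvous on localhost; the backend, init method, and addresses below are placeholders, not the repository's actual values:

```python
import os
import torch.distributed as dist

def maybe_init_dist(cfg, rank: int, world_size: int) -> None:
    # Create the default process group only when distributed mode is on;
    # DDP and every torch.distributed collective depend on it existing.
    if not cfg.DIST_ENABLE:
        return
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```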
E.g. with aot-benchmark/networks/managers/trainer.py Line 342 in 1c3a5ec I get:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Cause: we haven't completely covered the distributed-only calls with a condition. |
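A sketch of how a call like the one at line 342 could be guarded so it does not raise when the process group was never created; the barrier below is only an illustrative stand-in, not the actual code on that line:

```python
import torch.distributed as dist

def sync_if_distributed(cfg) -> None:
    # Without this guard, any collective (barrier, all_reduce, ...) raises:
    #   RuntimeError: Default process group has not been initialized,
    #   please make sure to call init_process_group.
    if cfg.DIST_ENABLE and dist.is_available() and dist.is_initialized():
        dist.barrier()
```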
Is it clear now? |
I have updated the |
Thanks, I've checked your changes and they match what I had done locally on my side over these past days. What do you think now about this (pytorch/pytorch#37444)? Lines 78 to 79 in d5bd73f
|
Other than my previous comment, I think we still have an issue with aot-benchmark/networks/managers/trainer.py Lines 677 to 682 in d5bd73f:
for key in boards['image'].keys():
AttributeError: 'list' object has no attribute 'keys' |
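A sketch of a type guard that would avoid the AttributeError above, assuming boards['image'] can arrive either as a dict of named images or as a plain list; the repository's actual fix may look different:

```python
def iter_image_boards(boards):
    # The logging code expects boards['image'] to be a dict and calls
    # .keys() on it; in the failing path it was a list, hence the error.
    images = boards.get('image', {})
    if isinstance(images, dict):
        yield from images.items()
    else:
        # Fall back to positional names when a plain list is passed.
        for idx, value in enumerate(images):
            yield str(idx), value
```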
It should be ok now. |
It seems that we have two issues:
|
For the first issue, this is the stack trace of one of the deadlocks, and it could be related to #36 (comment):
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 109, in join
ready = multiprocessing.connection.wait(
File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 936, in wait
ready = selector.select(timeout)
File "/opt/conda/lib/python3.10/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout) |
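For reference, a minimal sketch of the spawn pattern this parent-side trace comes from; the worker body is hypothetical, but it shows how a rank that never reaches a collective the others are waiting on leaves the parent blocked in multiprocessing.connection.wait(), exactly as in the trace above:

```python
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    # If some ranks skip this barrier (e.g. behind a condition that only
    # part of the processes satisfy), the remaining ranks block here and
    # the parent never returns from context.join().
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```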
Yes, it seems ok now.
Yes, sorry, this was a false positive related to the image range in the compositing phase; thanks for the double check. |
The same needs to be fixed in the |
With cfg.DIST_ENABLE false, the distributed-specific parts are not all wrapped by a condition that checks whether distributed mode is enabled or not.