Error running distributed.py #4

Closed
gordonguo98 opened this issue Oct 19, 2022 · 0 comments
gordonguo98 commented Oct 19, 2022

config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json
Have set cuda visible devices to 0,1,2,3,4,5,6,7
The distributed url we use is tcp://0.0.0.0:44507
['train.py', '--config=exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json', '--group_name=group_2022_10_19-021536', '--dist_url=tcp://0.0.0.0:44507', '--rank=0']
['train.py', '--config=exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json', '--group_name=group_2022_10_19-021536', '--dist_url=tcp://0.0.0.0:44507', '--rank=1']
Both rank 0 and rank 1 fail with the same traceback:

Traceback (most recent call last):
  File "train.py", line 714, in <module>
    train(num_gpus, args.config, args.rank, args.group_name, **train_config)
  File "train.py", line 335, in train
    init_distributed(rank, num_gpus, group_name, **dist_config)
  File "/home/hm/guoxiaofan/Point_Diffusion_Refinement/pointnet2/distributed.py", line 57, in init_distributed
    group_name=group_name)
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370116979/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
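For reference, the failing call chain corresponds roughly to the NCCL process-group setup sketched below. This is a minimal sketch, not the repository's actual init_distributed: the function name, the world_size/dist_url parameters, and the explicit torch.cuda.set_device call are assumptions. Pinning each rank to its own GPU before the first collective (the barrier) is a common first thing to check when NCCL reports "unhandled cuda error" inside init_process_group.

```python
# Minimal sketch of an NCCL process-group setup (assumed, not the repo's exact code).
import torch
import torch.distributed as dist

def init_distributed_sketch(rank, world_size, dist_url="tcp://0.0.0.0:44507"):
    # Bind this process to one visible GPU before any collective runs;
    # letting every rank default to GPU 0 is a frequent cause of the
    # "unhandled cuda error" seen in the barrier above.
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # NCCL backend with a TCP rendezvous, matching the tcp://0.0.0.0:44507
    # URL printed in the log.
    dist.init_process_group(
        backend="nccl",
        init_method=dist_url,
        world_size=world_size,
        rank=rank,
    )

    # The traceback fails inside this first collective.
    dist.barrier()
```

In the log above each rank is launched as a separate train.py process with its own --rank argument, so rank here would map onto the GPUs made visible via CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7.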

@gordonguo98 changed the title from "Error running generate_mirrored_partial.py" to "Error running distributed.py" on Oct 19, 2022
@gordonguo98 closed this as not planned on Oct 19, 2022