tandard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json
Have set cuda visible devices to 0,1,2,3,4,5,6,7
The distributed url we use is tcp://0.0.0.0:44507
['train.py', '--config=exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json', '--group_name=group_2022_10_19-021536', '--dist_url=tcp://0.0.0.0:44507', '--rank=0']
['train.py', '--config=exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json', '--group_name=group_2022_10_19-021536', '--dist_url=tcp://0.0.0.0:44507', '--rank=1']

Both ranks (rank 0 and rank 1) fail with the identical traceback, interleaved in the original output:

Traceback (most recent call last):
  File "train.py", line 714, in <module>
    train(num_gpus, args.config, args.rank, args.group_name, **train_config)
  File "train.py", line 335, in train
    init_distributed(rank, num_gpus, group_name, **dist_config)
  File "/home/hm/guoxiaofan/Point_Diffusion_Refinement/pointnet2/distributed.py", line 57, in init_distributed
    group_name=group_name)
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hm/anaconda3/envs/pdr/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370116979/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
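A generic "unhandled cuda error" from `ProcessGroupNCCL` hides the underlying CUDA failure; NCCL only reports the detailed cause when its debug logging is enabled. A minimal sketch of rerunning one rank with that logging turned on, assuming the same launch command as in the log above (the standard NCCL environment variables `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are used; the relaunch line is commented out since it reproduces the failing run):

```shell
# Enable verbose NCCL logging so the underlying CUDA error is printed
# instead of the generic "unhandled cuda error".
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Match the visible-device setup reported in the log above.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Relaunch rank 0 exactly as in the failing run (uncomment to reproduce
# with the extra diagnostics; rank 1 is started the same way):
# python train.py \
#   --config=exp_configs/mvp_configs/config_standard_attention_real_3072_partial_points_rot_90_scale_1.2_translation_0.1.json \
#   --group_name=group_2022_10_19-021536 \
#   --dist_url=tcp://0.0.0.0:44507 \
#   --rank=0
```

With `NCCL_DEBUG=INFO` set, the NCCL layer logs its initialization steps to stderr, which usually narrows the error down to a specific device, driver, or peer-access problem.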
gordonguo98 changed the title from "Error running generate_mirrored_partial.py" to "Error running distributed.py" on Oct 19, 2022.