
There was a problem trying to train the code. #68

Open
hanmoonje opened this issue May 2, 2023 · 0 comments
Hi, great work!

I ran into a problem while trying to train a model with the code.

My environment is as follows.
Ubuntu 18.04
python 3.7
cuda 10.2
pytorch=1.7.1 torchvision=0.8.2
2 x 2080 Ti
I tried to match your environment as closely as possible.
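
For reference, here is a quick sanity check (a minimal sketch) that can be run inside the `meta_detr` conda environment to confirm what PyTorch actually sees; the expected values in the comments come from the environment listed above.

```python
# Quick environment sanity check (sketch): confirm the versions and GPU count PyTorch sees.
import torch

print("torch:", torch.__version__)                 # expected: 1.7.1
print("cuda (build):", torch.version.cuda)         # expected: 10.2
print("gpus visible:", torch.cuda.device_count())  # expected: 2
print("nccl:", torch.cuda.nccl.version())          # NCCL bundled with this PyTorch build
```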

The error occurred when I tried to run the training code as-is, using the following command:

```bash
GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./scripts/run_experiments_coco.sh
```

Running the above command produces the following error:

```
Traceback (most recent call last):
  File "main.py", line 367, in <module>
    main(args)
  File "main.py", line 115, in main
    utils.init_distributed_mode(args)
  File "/home/mjhan/Meta-DETR/util/misc.py", line 427, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
```
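
As I understand it, the rendezvous in `init_process_group` opens a `TCPStore` on `MASTER_ADDR:MASTER_PORT`, so this error means the chosen port is already bound on the machine (29500 is the default port used by `torch.distributed.launch`). A minimal sketch for checking whether a candidate port is free before launching (`is_port_free` is just an illustrative helper, not part of Meta-DETR or PyTorch):

```python
# Sketch: check whether a candidate master port is already bound on this host.
import socket

def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port (illustrative helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(is_port_free(29500))  # default master port for torch.distributed.launch
```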

I then tried changing the master port, but the following error occurred instead:

```
Traceback (most recent call last):
  File "main.py", line 367, in <module>
    main(args)
  File "main.py", line 115, in main
    utils.init_distributed_mode(args)
  File "/home/mjhan/Meta-DETR/util/misc.py", line 427, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/mjhan/anaconda3/envs/meta_detr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
```
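
To isolate the NCCL failure from the Meta-DETR code itself, here is a minimal two-process smoke test I can run outside the repo (a sketch assuming 2 visible GPUs and a free port; setting `NCCL_DEBUG=INFO` in the environment usually gives more detail than "unhandled system error"):

```python
# Minimal NCCL smoke test (sketch): spawn 2 processes, init the nccl backend, run an all_reduce.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"   # any free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)                    # should print tensor([2.]) on both ranks
    print(f"rank {rank}: {t}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```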

Could you tell me how to fix this, or how the code should be modified?
