Server socket error when training while another task already run #93

Closed
robotnc opened this issue Sep 19, 2022 · 3 comments

Comments


robotnc commented Sep 19, 2022

Describe the bug

I have a server with four GTX 1080 GPUs running Ubuntu 16.04.

When I start training with run.sh while another training task is already running, the following error occurs:

Start training ...
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:29400 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:29400 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

How can I solve this?
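
For anyone hitting the same log: errno 98 (EADDRINUSE) means another process already holds port 29400, which is the port named in the messages above. A minimal sketch, using only the Python standard library, for checking this before launching a second job. The helper name `port_in_use` is hypothetical and not part of wekws or PyTorch.

```python
import errno
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding to (host, port) fails with EADDRINUSE.

    Hypothetical helper for illustration; it only probes the port and
    releases it again immediately.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other error (e.g. permissions) is not our case
    return False


if __name__ == "__main__":
    # 29400 is the port reported in the error log above.
    print("port 29400 in use:", port_in_use(29400))
```

If this reports that the port is taken while the first training job is running, the second job needs a different port or a launch method that does not open one.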

Collaborator

mlxu995 commented Sep 19, 2022

This (https://zhuanlan.zhihu.com/p/439077183) may help you.

Author

robotnc commented Sep 23, 2022

That did not work.

Author

robotnc commented Oct 12, 2022

Found the solution. To run multiple different tasks on a single machine, set the following in run.sh:

export LOCAL_RANK=0
export WORLD_SIZE=1

# and replace torchrun with python

# torchrun --standalone --nnodes=1 --nproc_per_node=$num_gpus
python \
  wekws/bin/train.py --gpus $gpus \
  --config $config \
  --train_data data/train/enc_data.list \
  --cv_data data/dev/enc_data.list \
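
The workaround above launches the script with plain python and WORLD_SIZE=1, which appears to run training as a single process. A possible alternative, not tested against wekws, is to keep torchrun but give each job its own rendezvous port instead of the default 29400. The sketch below assumes a torchrun version that accepts `--rdzv_backend=c10d` and `--rdzv_endpoint=localhost:PORT`; the helper names `find_free_port` and `torchrun_command` are hypothetical.

```python
import socket
from typing import List


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        # The port is released when the socket closes, so another process
        # could still grab it before torchrun does (small race window).
        return s.getsockname()[1]


def torchrun_command(num_gpus: int, port: int, train_args: List[str]) -> List[str]:
    """Build a torchrun argument list that uses a per-job rendezvous port."""
    return [
        "torchrun",
        "--nnodes=1",
        f"--nproc_per_node={num_gpus}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint=localhost:{port}",
        *train_args,
    ]


if __name__ == "__main__":
    # The training arguments here are placeholders taken from the thread.
    cmd = torchrun_command(2, find_free_port(), ["wekws/bin/train.py"])
    print(" ".join(cmd))
```

The printed command can be pasted into run.sh in place of the `--standalone` launch line. Recent torchrun documentation also describes passing `localhost:0` as the endpoint so that torchrun picks a free port itself, which avoids the race noted in the comment.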

robotnc closed this as completed Oct 12, 2022