Run code for training got some errors #20
Comments
Our experiments are based on PyTorch 1.6.0. If your PyTorch version is 1.7.0, maybe you should use gloo instead of nccl as the backend.
@LeonWlw Thank you for your feedback. Could you provide more details about your GPU type, the Python version, and the full error backtrace? You could also try torch 1.6.0 as @whiteshirt0429 suggested. A known issue is that torch 1.7 does not support NCCL well on the 2080 Ti.
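The backend suggestion above could be sketched as a small helper. This is only an illustration of the heuristic discussed in this thread, not part of the project's code: `parse_version` and `pick_backend` are hypothetical names, and the 1.7.0 cutoff simply mirrors the advice given here.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a version string such as '1.7.0+cu101' into (1, 7, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def pick_backend(torch_version: str, cuda_available: bool) -> str:
    """Hypothetical helper: choose a torch.distributed backend name.

    Assumption taken from this thread: nccl is used with torch 1.6.0,
    and gloo is the fallback for torch 1.7.0 and later.
    """
    if not cuda_available:
        return "gloo"  # nccl needs CUDA devices
    if parse_version(torch_version) >= (1, 7, 0):
        return "gloo"
    return "nccl"


# Typical use inside a training script (requires torch to be installed):
#   import torch
#   backend = pick_backend(torch.__version__, torch.cuda.is_available())
#   torch.distributed.init_process_group(backend=backend, ...)
```

Switching the backend alone may not be enough, so treat this as one thing to try rather than a guaranteed fix.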
Thank you for your quick reply. I tried gloo with torch 1.7, and it fails with the same errors as nccl. Below is the log: run.sh: init method is file:///lustre_data/user/code/wenet/examples/aishell/s0/exp/sp_spec_aug/ddp_init My environment is Python 3.7.7, CUDA 10.1, a Tesla V100 GPU, and driver version 418.67.
The log
Thanks!
@LeonWlw You should check that you are using torch 1.6.0. Maybe try pip install torch==1.6.0 torchvision==0.7.0. If you use torch 1.7.0, torch.device may cause errors.
I am sorry that we don't have V100 GPUs, but I think these methods may be helpful for you.
Using a local file path for the DDP init method works for DDP training. Thanks a lot for the help.
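For anyone hitting the same problem, here is a minimal sketch of building a file:// init method from a local directory. It assumes that "local" means a path on the node's own disk rather than the shared mount shown in the log; `local_init_method` and the `ddp_init` file name are illustrative choices, not the project's actual code.

```python
import tempfile
from pathlib import Path


def local_init_method(directory: str, name: str = "ddp_init") -> str:
    """Hypothetical helper: return a file:// URI for DDP rendezvous."""
    init_file = Path(directory).resolve() / name
    # A leftover file from an earlier run can confuse the rendezvous,
    # so remove it before starting a new job.
    if init_file.exists():
        init_file.unlink()
    return init_file.as_uri()


if __name__ == "__main__":
    # Use a temporary directory on the local disk for the demo.
    workdir = tempfile.mkdtemp()
    print(local_init_method(workdir))

# Typical use inside a training script (requires torch to be installed):
#   torch.distributed.init_process_group(
#       backend="nccl",
#       init_method=local_init_method("/some/local/dir"),
#       world_size=world_size,
#       rank=rank,
#   )
```

All processes must be able to see the same file, so a node-local path like this only suits single-machine, multi-GPU training.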
Thank you for making your code public. Is it all ready for running now? There are many core dumps when training with DDP. When training on a single GPU, TorchScript also produces errors, such as Unknown type name 'torch.device'. If I skip torch.jit.script, I still get errors. My PyTorch version is 1.7.0.