
Run code for training got some errors #20

Closed
LeonWlw opened this issue Dec 16, 2020 · 8 comments
Comments

@LeonWlw

LeonWlw commented Dec 16, 2020

Thank you for making your code public. Is it ready to run now? I get many core dumps when training with DDP. If I train on a single GPU, TorchScript also raises errors, such as Unknown type name 'torch.device'. If I skip torch.jit.script, I get errors as well. My PyTorch version is 1.7.0.

@whiteshirt0429
Collaborator

Our experiments are based on PyTorch 1.6.0. If your PyTorch version is 1.7.0, you may want to use gloo instead of nccl as the backend.
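A minimal sketch of switching the backend, assuming a standalone script rather than WeNet's actual run.sh entry point; world size 1 and a localhost TCP rendezvous are placeholders used only to keep the snippet self-contained:

```python
# Minimal sketch (not WeNet's training code): bringing up a DDP process
# group with the "gloo" backend instead of "nccl".
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",                       # CPU-side backend; "nccl" is the usual GPU choice
    init_method="tcp://127.0.0.1:29511",  # placeholder rendezvous address
    world_size=1,                         # single process, just for illustration
    rank=0,
)
initialized = dist.is_initialized()
print(initialized)  # True once the process group is up
dist.destroy_process_group()
```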

@placebokkk
Collaborator

@LeonWlw Thank you for your feedback. Could you provide more details about your GPU type, the python version and the full error backtrace message?

You could also try torch 1.6.0 as @whiteshirt0429 suggested.

Also, a known issue is that torch 1.7 does not support NCCL well on the 2080 Ti.

@LeonWlw
Author

LeonWlw commented Dec 17, 2020

Thank you for your quick reply. I tried gloo with torch 1.7; it failed with the same errors as nccl. Below is the log:

run.sh: init method is file:///lustre_data/user/code/wenet/examples/aishell/s0/exp/sp_spec_aug/ddp_init
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
2020-12-17 09:55:43,725 INFO training on multiple gpu, this gpu 6
2020-12-17 09:55:43,726 INFO training on multiple gpu, this gpu 1
2020-12-17 09:55:43,728 INFO training on multiple gpu, this gpu 5
2020-12-17 09:55:43,739 INFO training on multiple gpu, this gpu 2
Traceback (most recent call last):
Traceback (most recent call last):
File "wenet/bin/train.py", line 94, in <module>
Traceback (most recent call last):
File "wenet/bin/train.py", line 94, in <module>
File "wenet/bin/train.py", line 94, in <module>
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
Traceback (most recent call last):
File "wenet/bin/train.py", line 94, in <module>
timeout=timeout)
timeout=timeout)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
timeout=timeout)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
timeout=timeout)
RuntimeError: flock: Function not implemented
timeout=timeout)
RuntimeError: flock: Function not implemented
timeout=timeout)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
timeout=timeout)
RuntimeError: flock: Function not implemented
timeout=timeout)
RuntimeError: flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
run.sh: line 120: 414733 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts
run.sh: line 120: 414732 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts
run.sh: line 120: 414734 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts
run.sh: line 120: 414735 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts

My environment: Python 3.7.7, CUDA 10.1, Tesla V100 GPUs, driver version 418.67.

@placebokkk
Collaborator

@LeonWlw

The log line "RuntimeError: flock: Function not implemented" suggests the errors are caused by a flock issue. Do you use NFS? It seems flock is not supported on your file system.

  1. I am not familiar with file system locks. You could try running "which flock" to check whether flock is installed.
  2. Have you ever successfully run any other torch programs in DDP mode in this environment?
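Since torch's file:// rendezvous takes an exclusive flock on the shared init file, one way to reproduce this failure outside of torch is to probe the target directory directly. This is a hypothetical helper, not part of WeNet:

```python
# Hypothetical probe: does flock() work on files in a given directory?
# torch's file:// rendezvous relies on it; NFS mounts without lock support
# raise OSError "Function not implemented", matching the log above.
import fcntl
import os
import tempfile

def flock_supported(directory):
    probe = os.path.join(directory, ".flock_probe")
    try:
        with open(probe, "w") as f:
            fcntl.flock(f, fcntl.LOCK_EX)  # exclusive lock, as the rendezvous takes
            fcntl.flock(f, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        if os.path.exists(probe):
            os.remove(probe)

print(flock_supported(tempfile.gettempdir()))  # True on a local filesystem
```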

"If training by single gpu, torchscript gets some errors, too."

  1. Could you also provide this single GPU error message?

Thanks!

@LeonWlw
Author

LeonWlw commented Dec 17, 2020

1. I am not familiar with the file system locks. You could try which flock to check if flock is installed.
I use NFS, and flock is already installed.

2. Have you ever run any other torch programs in DDP mode successfully in this environment?
Yes.

3. Could you also provide the single-GPU error message?
When I add torch.cuda.set_device(0) to train.py, single-GPU training runs fine, including TorchScript.
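The workaround above can be sketched as follows; the placement (before any model or CUDA call) is the point. The gpu_id value and the toy model are placeholders, and the snippet falls back to CPU so it runs anywhere:

```python
# Sketch of pinning the process to one GPU before any CUDA work, per the
# workaround described above. Not WeNet's actual train.py.
import torch

gpu_id = 0  # assumed single-GPU setup
if torch.cuda.is_available():
    torch.cuda.set_device(gpu_id)        # must come before model/device calls
    device = torch.device(f"cuda:{gpu_id}")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
print(next(model.parameters()).device.type == device.type)  # True
```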

@wbgxx333

@LeonWlw You should try torch 1.6.0, e.g. pip install torch==1.6.0 torchvision==0.7.0.

If you use torch==1.7.0, torch.device may raise errors.

@whiteshirt0429
Collaborator

I am sorry that we don't have V100 GPUs, but I think these methods may be helpful for you:

  1. try another file path as your experiment path
  2. or try the tcp init_method as the DDP init method if the file path stays the same
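Suggestion 2 amounts to replacing the file:// URL with a tcp:// one, so the rendezvous never touches the (NFS) filesystem. A self-contained sketch, with a placeholder address/port and a single process standing in for the multi-GPU job:

```python
# Sketch: a tcp:// rendezvous instead of file://, so DDP init involves no
# file locks. In run.sh this corresponds to the --ddp.init_method argument.
import torch.distributed as dist

init_method = "tcp://127.0.0.1:29521"  # rank 0's address; any free port works
dist.init_process_group(
    backend="gloo",     # the rendezvous method is independent of the backend
    init_method=init_method,
    world_size=1,       # one process just to keep the sketch runnable
    rank=0,
)
joined = dist.is_initialized()
print(joined)  # True: rendezvous succeeded without any file lock
dist.destroy_process_group()
```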

@LeonWlw
Author

LeonWlw commented Dec 17, 2020

Using a local file path works for DDP training. Thanks a lot for the help.
