
Run code for training got some errors #20

Closed
LeonWlw opened this issue Dec 16, 2020 · 8 comments
Comments

@LeonWlw

LeonWlw commented Dec 16, 2020

Thank you for making your code public. Is it ready to run now? I get many core dumps when training with DDP. If I train on a single GPU, TorchScript also raises errors, such as Unknown type name 'torch.device'. If I skip torch.jit.script, I get errors as well. My PyTorch version is 1.7.0.

@whiteshirt0429
Collaborator

Our experiments are based on PyTorch 1.6.0. If your PyTorch version is 1.7.0, you may want to use gloo instead of nccl as the backend.
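A minimal sketch of switching the backend, assuming a standalone script rather than WeNet's actual run.sh entry point; world size 1 and a localhost TCP rendezvous are placeholders used only to keep the snippet self-contained:

```python
# Minimal sketch (not WeNet's training code): bringing up a DDP process
# group with the "gloo" backend instead of "nccl".
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",                       # CPU-side backend; "nccl" is the usual GPU choice
    init_method="tcp://127.0.0.1:29511",  # placeholder rendezvous address
    world_size=1,                         # single process, just for illustration
    rank=0,
)
initialized = dist.is_initialized()
print(initialized)  # True once the process group is up
dist.destroy_process_group()
```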

@placebokkk
Collaborator

@LeonWlw Thank you for your feedback. Could you provide more details about your GPU type, the python version and the full error backtrace message?

You could also try torch 1.6.0 as @whiteshirt0429 suggested.

Also, a known issue is that torch 1.7 does not support NCCL well on the 2080 Ti.

@LeonWlw
Author

LeonWlw commented Dec 17, 2020

Thank you for your quick reply. I tried gloo with torch 1.7; it failed with the same errors as nccl. Below is the log:

run.sh: init method is file:///lustre_data/user/code/wenet/examples/aishell/s0/exp/sp_spec_aug/ddp_init
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
wenet/bin/train.py:76: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
configs = yaml.load(fin)
2020-12-17 09:55:43,725 INFO training on multiple gpu, this gpu 6
2020-12-17 09:55:43,726 INFO training on multiple gpu, this gpu 1
2020-12-17 09:55:43,728 INFO training on multiple gpu, this gpu 5
2020-12-17 09:55:43,739 INFO training on multiple gpu, this gpu 2
Traceback (most recent call last):
Traceback (most recent call last):
File "wenet/bin/train.py", line 94, in <module>
Traceback (most recent call last):
File "wenet/bin/train.py", line 94, in <module>
File "wenet/bin/train.py", line 94, in <module>
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
Traceback (most recent call last):
File "wenet/bin/train.py", line 94, in <module>
timeout=timeout)
timeout=timeout)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
timeout=timeout)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
rank=args.rank)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
timeout=timeout)
RuntimeError: flock: Function not implemented
timeout=timeout)
RuntimeError: flock: Function not implemented
timeout=timeout)
File "/lustre_data/user/install_dir/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
timeout=timeout)
RuntimeError: flock: Function not implemented
timeout=timeout)
RuntimeError: flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
terminate called after throwing an instance of 'std::system_error'
what(): flock: Function not implemented
run.sh: line 120: 414733 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts
run.sh: line 120: 414732 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts
run.sh: line 120: 414734 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts
run.sh: line 120: 414735 Aborted (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --train_data $feat_dir/$train_set/format.data --cv_data $feat_dir/dev/format.data ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $num_gpus --ddp.rank $i --ddp.dist_backend $dist_backend --num_workers 2 $cmvn_opts

My environment: Python 3.7.7, CUDA 10.1, Tesla V100 GPUs, driver version 418.67.

@placebokkk
Collaborator

@LeonWlw

The log line "RuntimeError: flock: Function not implemented" suggests the errors are caused by a flock issue. Do you use NFS? It seems flock is not supported on your file system.

  1. I am not familiar with file system locks. You could try running "which flock" to check whether flock is installed.
  2. Have you ever successfully run any other torch programs in DDP mode in this environment?
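Since torch's file:// rendezvous takes an exclusive flock on the shared init file, one way to reproduce this failure outside of torch is to probe the target directory directly. This is a hypothetical helper, not part of WeNet:

```python
# Hypothetical probe: does flock() work on files in a given directory?
# torch's file:// rendezvous relies on it; NFS mounts without lock support
# raise OSError "Function not implemented", matching the log above.
import fcntl
import os
import tempfile

def flock_supported(directory):
    probe = os.path.join(directory, ".flock_probe")
    try:
        with open(probe, "w") as f:
            fcntl.flock(f, fcntl.LOCK_EX)  # exclusive lock, as the rendezvous takes
            fcntl.flock(f, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        if os.path.exists(probe):
            os.remove(probe)

print(flock_supported(tempfile.gettempdir()))  # True on a local filesystem
```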

"If training by single gpu, torchscript gets some errors, too."

  1. Could you also provide this single GPU error message?

Thanks!

@LeonWlw
Author

LeonWlw commented Dec 17, 2020

1. I am not familiar with the file system locks. You could try which flock to check if flock is installed.
I use NFS, and flock is already installed.

2. Have you ever run any other torch programs in DDP mode successfully in this environment?
Yes.

3. Could you also provide the single-GPU error message?
When I add torch.cuda.set_device(0) to train.py, single-GPU training runs fine, including TorchScript.
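The workaround above can be sketched as follows; the placement (before any model or CUDA call) is the point. The gpu_id value and the toy model are placeholders, and the snippet falls back to CPU so it runs anywhere:

```python
# Sketch of pinning the process to one GPU before any CUDA work, per the
# workaround described above. Not WeNet's actual train.py.
import torch

gpu_id = 0  # assumed single-GPU setup
if torch.cuda.is_available():
    torch.cuda.set_device(gpu_id)        # must come before model/device calls
    device = torch.device(f"cuda:{gpu_id}")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
print(next(model.parameters()).device.type == device.type)  # True
```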

@wbgxx333

@LeonWlw You should try torch 1.6.0, e.g. pip install torch==1.6.0 torchvision==0.7.0.

If you use torch==1.7.0, torch.device may raise errors.

@whiteshirt0429
Collaborator

I am sorry that we don't have V100 GPUs, but I think these methods may be helpful for you:

  1. try another file path as your experiment path
  2. or try the tcp init_method as the DDP init method if the file path stays the same
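Suggestion 2 amounts to replacing the file:// URL with a tcp:// one, so the rendezvous never touches the (NFS) filesystem. A self-contained sketch, with a placeholder address/port and a single process standing in for the multi-GPU job:

```python
# Sketch: a tcp:// rendezvous instead of file://, so DDP init involves no
# file locks. In run.sh this corresponds to the --ddp.init_method argument.
import torch.distributed as dist

init_method = "tcp://127.0.0.1:29521"  # rank 0's address; any free port works
dist.init_process_group(
    backend="gloo",     # the rendezvous method is independent of the backend
    init_method=init_method,
    world_size=1,       # one process just to keep the sketch runnable
    rank=0,
)
joined = dist.is_initialized()
print(joined)  # True: rendezvous succeeded without any file lock
dist.destroy_process_group()
```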

@LeonWlw
Author

LeonWlw commented Dec 17, 2020

Using a local file path works for DDP training. Thanks a lot for the help.
