Why Lingvo is so slow when training Librispeech960Wpm on a host with 6 GPUs #257
TensorFlow has some profiling guides. One important thing to check is whether training is disk-I/O bound; if that turns out to be the case, you may need to consider moving the data onto an SSD. I believe you should be able to get about 1.2 steps/sec with 16 GPUs, so roughly 0.45 steps/sec with 6 GPUs.
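For reference, the 0.45 steps/sec figure follows from linearly scaling the reported 16-GPU throughput down to 6 GPUs. A quick sanity check (my arithmetic, not from the Lingvo docs; linear scaling is an optimistic assumption, since real multi-GPU scaling is usually sublinear):

```python
# Estimate expected throughput on 6 GPUs by linearly scaling
# the reported 16-GPU throughput.
steps_per_sec_16_gpus = 1.2
expected_6_gpus = steps_per_sec_16_gpus / 16 * 6
print(round(expected_6_gpus, 2))  # → 0.45
```

The poster's observed 0.12 steps/sec is well below this estimate, which is why an I/O or configuration bottleneck is suspected.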
Hi, I have 4 GPUs (V100) and want to try running this model, but I don't know what saver_max_to_keep and worker_replicas mean. Should I set them to the same values you used?
I don't even know exactly what they mean; I'm just guessing from their names:
Yes, that is correct: worker_replicas = the number of machines in your cluster, worker_gpus = the number of GPUs per machine. Actually, rereading #1, I think worker_replicas should be set to 1, because you only have one machine with 6 GPUs in it. I suspect setting it to a larger number still works, though.
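Concretely, for the single-host, 6-GPU setup described later in this thread, the invocation might look like this (a sketch based on the advice above, with worker_replicas set to 1 and the other flags kept from the poster's command):

```shell
bazel-bin/lingvo/trainer \
  --saver_max_to_keep=3 \
  --worker_replicas=1 \
  --worker_gpus=6 \
  --run_locally=gpu \
  --mode=async \
  --model=asr.librispeech.Librispeech960Wpm \
  --logdir=/tmp/lingvo/asr_2 \
  --logtostderr \
  --enable_asserts=false \
  --job=controller,trainer
```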
IIRC, saver_max_to_keep is the maximum number of checkpoint files kept on disk. @yujunlhz
My setup is a single host with 6 GPUs (V100). The speed is about 0.12 steps/sec, so 800,000 steps would take several months.
My running command is as follows:
bazel-bin/lingvo/trainer --saver_max_to_keep=3 --worker_gpus=6 --worker_replicas=3 --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/lingvo/asr_2 --logtostderr --enable_asserts=false --job=controller,trainer
Through nvidia-smi I can see that each GPU's "GPU-Util" is between 30% and 60%.
Any suggestions for the configuration?
Thanks
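To put the training-time estimate in numbers, here is a back-of-the-envelope calculation at the observed rate (my arithmetic, assuming the throughput stays constant for the whole run):

```python
# Rough training-time estimate at the observed throughput.
total_steps = 800_000
steps_per_sec = 0.12           # from the trainer logs below
seconds = total_steps / steps_per_sec
days = seconds / 86_400        # seconds per day
print(f"{days:.0f} days")      # → 77 days
```

At the advised 0.45 steps/sec, the same calculation gives roughly 21 days, which illustrates how much the configuration and I/O issues matter here.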
2021-03-19 00:03:38.408969: I lingvo/core/ops/record_yielder.cc:604] Record 1963: key=00001962
2021-03-19 00:03:39.040448: I lingvo/core/ops/record_yielder.cc:614] Emitted 2551 records from /tmp/lingvo/lingvo/speech_data/train/train.tfrecords-00034-of-00100
I0319 00:03:48.392294 139978034091776 summary_utils.py:398] Steps/second: 0.123965, Examples/second: 71.403573
I0319 00:03:48.394137 139978034091776 trainer_impl.py:199] step: 223, steps/sec: 0.12, examples/sec: 71.40 fraction_of_correct_next_step_preds:0.14076507 fraction_of_correct_next_step_preds/logits:0.14076507 grad_norm/all/loss:0.5464555 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58279711 learning_rate/loss:0.00025000001 log_pplx:6.1487246 log_pplx/logits:6.1487246 loss:6.1487246 loss/logits:6.1487246 num_samples_in_batch:576 token_normed_prob:0.0021416517 token_normed_prob/logits:0.0021416517 var_norm/all/loss:1079.627
2021-03-19 00:03:48.516815: I lingvo/core/ops/record_yielder.cc:604] Record 2094: key=00002093
I0319 00:03:57.226976 139978034091776 summary_utils.py:398] Steps/second: 0.124019, Examples/second: 71.434948
I0319 00:03:57.228809 139978034091776 trainer_impl.py:199] step: 224, steps/sec: 0.12, examples/sec: 71.43 fraction_of_correct_next_step_preds:0.14581797 fraction_of_correct_next_step_preds/logits:0.14581797 grad_norm/all/loss:0.53552681 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58274323 learning_rate/loss:0.00025000001 log_pplx:6.1034293 log_pplx/logits:6.1034293 loss:6.1034293 loss/logits:6.1034293 num_samples_in_batch:576 token_normed_prob:0.0022410566 token_normed_prob/logits:0.0022410566 var_norm/all/loss:1079.577