
Why is Lingvo so slow when training Librispeech960Wpm on a host with 6 GPUs? #257

Open
yujunlhz opened this issue Mar 19, 2021 · 5 comments


@yujunlhz

My setup is a single host with 6 GPUs (V100). The training speed is about 0.12 steps/sec, so the 800,000 steps would take months (800,000 / 0.12 ≈ 6.7 million seconds, roughly 77 days).

My running command is as follows:
bazel-bin/lingvo/trainer --saver_max_to_keep=3 --worker_gpus=6 --worker_replicas=3 --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/lingvo/asr_2 --logtostderr --enable_asserts=false --job=controller,trainer

Through nvidia-smi I can see that each GPU's "GPU-Util" is between 30% and 60%.

Any suggestions for the configuration?

Thanks

2021-03-19 00:03:38.408969: I lingvo/core/ops/record_yielder.cc:604] Record 1963: key=00001962
2021-03-19 00:03:39.040448: I lingvo/core/ops/record_yielder.cc:614] Emitted 2551 records from /tmp/lingvo/lingvo/speech_data/train/train.tfrecords-00034-of-00100
I0319 00:03:48.392294 139978034091776 summary_utils.py:398] Steps/second: 0.123965, Examples/second: 71.403573
I0319 00:03:48.394137 139978034091776 trainer_impl.py:199] step: 223, steps/sec: 0.12, examples/sec: 71.40 fraction_of_correct_next_step_preds:0.14076507 fraction_of_correct_next_step_preds/logits:0.14076507 grad_norm/all/loss:0.5464555 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58279711 learning_rate/loss:0.00025000001 log_pplx:6.1487246 log_pplx/logits:6.1487246 loss:6.1487246 loss/logits:6.1487246 num_samples_in_batch:576 token_normed_prob:0.0021416517 token_normed_prob/logits:0.0021416517 var_norm/all/loss:1079.627
2021-03-19 00:03:48.516815: I lingvo/core/ops/record_yielder.cc:604] Record 2094: key=00002093
I0319 00:03:57.226976 139978034091776 summary_utils.py:398] Steps/second: 0.124019, Examples/second: 71.434948
I0319 00:03:57.228809 139978034091776 trainer_impl.py:199] step: 224, steps/sec: 0.12, examples/sec: 71.43 fraction_of_correct_next_step_preds:0.14581797 fraction_of_correct_next_step_preds/logits:0.14581797 grad_norm/all/loss:0.53552681 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58274323 learning_rate/loss:0.00025000001 log_pplx:6.1034293 log_pplx/logits:6.1034293 loss:6.1034293 loss/logits:6.1034293 num_samples_in_batch:576 token_normed_prob:0.0022410566 token_normed_prob/logits:0.0022410566 var_norm/all/loss:1079.577

@jonathanasdf
Contributor

jonathanasdf commented Mar 19, 2021

TensorFlow has some profiling guides:
https://www.tensorflow.org/guide/profiler
https://www.tensorflow.org/guide/gpu_performance_analysis

One important thing to check is whether the training is disk-I/O bound. If that turns out to be the case, you may need to move the data onto an SSD. I believe you should be able to get about 1.2 steps/sec with 16 GPUs, so roughly 0.45 steps/sec with 6 GPUs.
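Beyond the profiler, one quick standalone check (a rough sketch, not a Lingvo tool) is to benchmark how fast the raw TFRecord shards can be read from disk on their own. If that rate is not far above the ~71 examples/sec reported in the training log, the input pipeline or disk is a likely bottleneck. The shard pattern below is taken from the log output in this issue; the record count is arbitrary.

```python
# Standalone sketch: measure raw TFRecord read throughput, for comparison
# against the ~71 examples/sec seen during training. Adjust the path and
# record count for your setup.
import time

import tensorflow as tf

files = tf.data.Dataset.list_files(
    "/tmp/lingvo/lingvo/speech_data/train/train.tfrecords-*")
ds = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,                      # read several shards in parallel
    num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.prefetch(tf.data.AUTOTUNE)

start = time.time()
count = 0
for _ in ds.take(50000):                 # raw serialized records, no parsing
  count += 1
elapsed = time.time() - start
print("%d records in %.1fs -> %.0f records/sec"
      % (count, elapsed, count / elapsed))
```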

@xsppp

xsppp commented Mar 24, 2021

Hi, I have 4 GPUs (V100) and I want to try running this model, but I don't know what saver_max_to_keep and worker_replicas mean. Should I set the same numbers as you did?

@yujunlhz
Author

I don't know exactly what they mean either. My guess from their names:
saver_max_to_keep: keep only this many historical checkpoints on disk. If it is not set, many old checkpoints are saved, which takes a huge amount of disk space.
worker_replicas: the number of workers that do the actual work of training the model in parallel.
I hope @jonathanasdf can confirm.

@jonathanasdf
Contributor

Yes, that is correct. worker_replicas = the number of machines you have in the cluster, worker_gpus = the number of GPUs per machine. Actually, reading #1, I think worker_replicas should be set to 1, because you only have 1 machine with 6 GPUs in it. But I guess setting it to a larger number still works.
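For reference (untested), that would be the command from the first post with worker_replicas changed to 1 and every other flag left as-is:

bazel-bin/lingvo/trainer --saver_max_to_keep=3 --worker_gpus=6 --worker_replicas=1 --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/lingvo/asr_2 --logtostderr --enable_asserts=false --job=controller,trainer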

@adis98

adis98 commented Mar 24, 2021


IIRC, saver_max_to_keep refers to the maximum number of checkpoint files kept on disk. @yujunlhz
