
Why is Lingvo so slow when training Librispeech960Wpm on a host with 6 GPUs? #257

Open
yujunlhz opened this issue Mar 19, 2021 · 5 comments


@yujunlhz

My setup is a single host with 6 GPUs (V100). The training speed is about 0.12 steps/sec, so the 800,000 steps would take months (800,000 / 0.12 ≈ 6.7 million seconds, roughly 77 days).

My running command is as follows:
bazel-bin/lingvo/trainer --saver_max_to_keep=3 --worker_gpus=6 --worker_replicas=3 --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/lingvo/asr_2 --logtostderr --enable_asserts=false --job=controller,trainer

Through nvidia-smi I can see that each GPU's "GPU-Util" is between 30% and 60%.

Any suggestions for the configuration?

Thanks

2021-03-19 00:03:38.408969: I lingvo/core/ops/record_yielder.cc:604] Record 1963: key=00001962
2021-03-19 00:03:39.040448: I lingvo/core/ops/record_yielder.cc:614] Emitted 2551 records from /tmp/lingvo/lingvo/speech_data/train/train.tfrecords-00034-of-00100
I0319 00:03:48.392294 139978034091776 summary_utils.py:398] Steps/second: 0.123965, Examples/second: 71.403573
I0319 00:03:48.394137 139978034091776 trainer_impl.py:199] step: 223, steps/sec: 0.12, examples/sec: 71.40 fraction_of_correct_next_step_preds:0.14076507 fraction_of_correct_next_step_preds/logits:0.14076507 grad_norm/all/loss:0.5464555 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58279711 learning_rate/loss:0.00025000001 log_pplx:6.1487246 log_pplx/logits:6.1487246 loss:6.1487246 loss/logits:6.1487246 num_samples_in_batch:576 token_normed_prob:0.0021416517 token_normed_prob/logits:0.0021416517 var_norm/all/loss:1079.627
2021-03-19 00:03:48.516815: I lingvo/core/ops/record_yielder.cc:604] Record 2094: key=00002093
I0319 00:03:57.226976 139978034091776 summary_utils.py:398] Steps/second: 0.124019, Examples/second: 71.434948
I0319 00:03:57.228809 139978034091776 trainer_impl.py:199] step: 224, steps/sec: 0.12, examples/sec: 71.43 fraction_of_correct_next_step_preds:0.14581797 fraction_of_correct_next_step_preds/logits:0.14581797 grad_norm/all/loss:0.53552681 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58274323 learning_rate/loss:0.00025000001 log_pplx:6.1034293 log_pplx/logits:6.1034293 loss:6.1034293 loss/logits:6.1034293 num_samples_in_batch:576 token_normed_prob:0.0022410566 token_normed_prob/logits:0.0022410566 var_norm/all/loss:1079.577

@jonathanasdf
Contributor

jonathanasdf commented Mar 19, 2021

TensorFlow has some profiling guides:
https://www.tensorflow.org/guide/profiler
https://www.tensorflow.org/guide/gpu_performance_analysis

One important thing to check is whether the training is disk-I/O bound. If that turns out to be the case, you may need to move the data onto an SSD. I believe you should be able to get about 1.2 steps/sec with 16 GPUs, so roughly 0.45 steps/sec with 6 GPUs.
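Beyond the profiler, one quick standalone check (a rough sketch, not a Lingvo tool) is to benchmark how fast the raw TFRecord shards can be read from disk on their own. If that rate is not far above the ~71 examples/sec reported in the training log, the input pipeline or disk is a likely bottleneck. The shard pattern below is taken from the log output in this issue; the record count is arbitrary.

```python
# Standalone sketch: measure raw TFRecord read throughput, for comparison
# against the ~71 examples/sec seen during training. Adjust the path and
# record count for your setup.
import time

import tensorflow as tf

files = tf.data.Dataset.list_files(
    "/tmp/lingvo/lingvo/speech_data/train/train.tfrecords-*")
ds = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,                      # read several shards in parallel
    num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.prefetch(tf.data.AUTOTUNE)

start = time.time()
count = 0
for _ in ds.take(50000):                 # raw serialized records, no parsing
  count += 1
elapsed = time.time() - start
print("%d records in %.1fs -> %.0f records/sec"
      % (count, elapsed, count / elapsed))
```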

@xsppp

xsppp commented Mar 24, 2021

Hi, I have 4 GPUs (V100) and I want to try running this model, but I don't know what saver_max_to_keep and worker_replicas mean. Should I set the same numbers as you did?

@yujunlhz
Author

I don't know exactly what they mean either. My guess from their names:
saver_max_to_keep: keep only this many historical checkpoints on disk. If it is not set, many old checkpoints are saved, which takes a huge amount of disk space.
worker_replicas: the number of workers that do the actual work of training the model in parallel.
I hope @jonathanasdf can confirm.

@jonathanasdf
Contributor

Yes, that is correct. worker_replicas = the number of machines you have in the cluster, worker_gpus = the number of GPUs per machine. Actually, reading #1, I think worker_replicas should be set to 1, because you only have 1 machine with 6 GPUs in it. But I guess setting it to a larger number still works.
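For reference (untested), that would be the command from the first post with worker_replicas changed to 1 and every other flag left as-is:

bazel-bin/lingvo/trainer --saver_max_to_keep=3 --worker_gpus=6 --worker_replicas=1 --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/lingvo/asr_2 --logtostderr --enable_asserts=false --job=controller,trainer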

@adis98

adis98 commented Mar 24, 2021


IIRC, saver_max_to_keep refers to the maximum number of checkpoint files kept on disk. @yujunlhz
