avoid "divided by zero error" in logging_outputs when --use-bmuf is e… #812

freewym · 2019-06-18T18:41:37Z

… enabled.

When doing multi-GPU training with --use-bmuf turned on and --global-sync-iter > 1, replicas do not necessarily sync with one another at every iteration, so each replica's logging_outputs contains only its own stats. In addition, logging_outputs may be empty at the end of an epoch after "a dummy iteration", because the number of batches in the training data is not always divisible by the number of replicas. When this happens, sample_size and ntokens are 0 on some replica, which causes a "divided by 0" error. This fix sets *loss to 0 when the corresponding sample_size or ntokens is 0.
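A minimal sketch of the kind of guard described above, assuming a criterion-style aggregation function that receives a list of per-batch dicts. The function name, the dict keys, and the base-2 conversion are illustrative assumptions and are not copied from the actual diff:

```python
import math


def aggregate_logging_outputs(logging_outputs):
    """Aggregate per-batch stats from one replica (hypothetical helper).

    logging_outputs may be an empty list, e.g. after a dummy iteration
    at the end of an epoch, so both denominators can be 0.
    """
    loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
    ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
    sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
    return {
        # Fall back to 0 instead of dividing by an empty denominator.
        "loss": loss_sum / sample_size / math.log(2) if sample_size > 0 else 0.0,
        "nll_loss": loss_sum / ntokens / math.log(2) if ntokens > 0 else 0.0,
        "ntokens": ntokens,
        "sample_size": sample_size,
    }
```

With an empty list the function returns zeros rather than raising ZeroDivisionError; with non-empty stats the result is the same as the unguarded division.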


facebook-github-bot

@nayansinghal has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2019-06-25T16:01:35Z

@nayansinghal merged this pull request in b3864b2.

…… (#812)

Summary: … enabled. When doing multi-gpu training with --use-bmuf turned on and --global-sync-iter > 1, each replica may not sync with other replicas at each iteration. So logging_outputs only has stats of their own. On the other hand, logging_outputs may be empty at the end of an epoch after "a dummy iteration" because the number of replicas does not divide the number of batches of the training data. If this happens, sample_size and ntokens would be 0 for some replica and cause "divided by 0" error. This fix sets *loss to 0 if sample_size/ntokens is 0.

Pull Request resolved: facebookresearch/fairseq#812
Reviewed By: myleott, yqwangustc
Differential Revision: D15908614
Pulled By: nayansinghal
fbshipit-source-id: c92e8e095f012bdb4ef753a3c627fd215afa215d


avoid "divided by zero error" in logging_outputs when --use-bmuf is e…

e034a91

…nabled

facebook-github-bot added the CLA Signed label Jun 18, 2019

facebook-github-bot reviewed Jun 19, 2019


facebook-github-bot closed this in b3864b2 Jun 25, 2019

facebook-github-bot added the Merged label Jun 25, 2019

freewym deleted the bug_fix2 branch June 25, 2019 17:05

yfyeung pushed a commit to yfyeung/fairseq that referenced this pull request Dec 6, 2023

Add docs for distillation (facebookresearch#812)

142420b

* add README to docs * update documents for distillation * upload png files
