
Conversation

@freewym (Contributor) commented Jun 18, 2019

… enabled.

When doing multi-GPU training with --use-bmuf turned on and --global-sync-iter > 1, each replica may not sync with the other replicas at every iteration, so logging_outputs contains only that replica's own stats. Moreover, logging_outputs may be empty at the end of an epoch after a "dummy iteration", because the number of replicas does not evenly divide the number of batches in the training data. When this happens, sample_size and ntokens are 0 on some replica, which causes a division-by-zero error. This fix sets *loss to 0 whenever sample_size/ntokens is 0.
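For illustration, here is a minimal sketch of the guard this fix describes, assuming a fairseq-style `aggregate_logging_outputs` in the criterion; the function name and dictionary keys follow fairseq conventions, but this is an illustrative sketch, not the exact diff:

```python
import math

def aggregate_logging_outputs(logging_outputs):
    """Aggregate logging outputs from data-parallel replicas (illustrative sketch)."""
    loss_sum = sum(log.get('loss', 0) for log in logging_outputs)
    ntokens = sum(log.get('ntokens', 0) for log in logging_outputs)
    sample_size = sum(log.get('sample_size', 0) for log in logging_outputs)
    return {
        # Guard: a replica that only ran a dummy batch (or has not yet synced
        # under BMUF) can have sample_size == 0 or ntokens == 0, so report 0
        # instead of dividing by zero.
        'loss': loss_sum / sample_size / math.log(2) if sample_size > 0 else 0.0,
        'nll_loss': loss_sum / ntokens / math.log(2) if ntokens > 0 else 0.0,
        'ntokens': ntokens,
        'sample_size': sample_size,
    }
```

With an empty logging_outputs (e.g., `[]` after a dummy iteration), this returns 0.0 for the losses instead of raising ZeroDivisionError.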

@facebook-github-bot (Contributor) left a comment

@nayansinghal has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@nayansinghal merged this pull request in b3864b2.

@freewym freewym deleted the bug_fix2 branch June 25, 2019 17:05
yzpang pushed a commit to yzpang/gold-off-policy-text-gen-iclr21 that referenced this pull request Feb 19, 2021
…… (#812)

Summary:
… enabled.

When doing multi-GPU training with --use-bmuf turned on and --global-sync-iter > 1, each replica may not sync with the other replicas at every iteration, so logging_outputs contains only that replica's own stats. Moreover, logging_outputs may be empty at the end of an epoch after a "dummy iteration", because the number of replicas does not evenly divide the number of batches in the training data. When this happens, sample_size and ntokens are 0 on some replica, which causes a division-by-zero error. This fix sets *loss to 0 whenever sample_size/ntokens is 0.
Pull Request resolved: facebookresearch/fairseq#812

Reviewed By: myleott, yqwangustc

Differential Revision: D15908614

Pulled By: nayansinghal

fbshipit-source-id: c92e8e095f012bdb4ef753a3c627fd215afa215d
yfyeung pushed a commit to yfyeung/fairseq that referenced this pull request Dec 6, 2023
* add README to docs

* update documents for distillation

* upload png files
Harleen8118 pushed a commit to Harleen8118/IBERT that referenced this pull request Jun 26, 2025
caltia pushed a commit to caltia/fairseq that referenced this pull request Jul 8, 2025