@yqwangustc
Contributor

Summary:
We found that not raising OOM errors during trainer.train_step causes various
issues, including NCCL hangs and gloo sync errors, because gradients are not
synced properly across workers. Until we find the root cause, let's give users
an option to raise OOMs.
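
For readers unfamiliar with the failure mode, here is a minimal sketch of what such an option could look like. It is not the code from this diff: the function `train_step`, the `raise_oom` flag, and the helper `fake_forward_backward` are hypothetical names used only for illustration, and the OOM is simulated with a plain `RuntimeError` so the example runs without a GPU.

```python
def train_step(batches, forward_backward, raise_oom=False):
    """Run forward/backward on each batch (hypothetical sketch).

    By default an out-of-memory error is swallowed and the batch is skipped.
    With raise_oom=True the error propagates, so the worker fails loudly
    instead of silently skipping work that other workers still perform.
    """
    processed, skipped = 0, 0
    for batch in batches:
        try:
            forward_backward(batch)
            processed += 1
        except RuntimeError as err:
            # Only treat errors that look like OOMs specially.
            if "out of memory" not in str(err):
                raise
            if raise_oom:
                raise
            skipped += 1
    return processed, skipped


def fake_forward_backward(batch):
    # Stand-in for a real forward/backward pass: batch 2 "runs out of memory".
    if batch == 2:
        raise RuntimeError("CUDA out of memory (simulated)")
```

With the default setting the worker that hits the simulated OOM completes the step having processed fewer batches than its peers, which is the situation in which gradient synchronization can go wrong. With `raise_oom=True` the same call raises immediately.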

Differential Revision: D15170357

…ain_step (facebookresearch#2)

Summary:
Pull Request resolved: fairinternal/fairspeq#2

Pull Request resolved: facebookresearch#689

We found that not raising OOM errors during trainer.train_step causes various
issues, including NCCL hangs and gloo sync errors, because gradients are not
synced properly across workers. Until we find the root cause, let's give users
an option to raise OOMs.

Reviewed By: jmp84

Differential Revision: D15170357

fbshipit-source-id: 1c3defd70bf97b2f4e2f1b39661c735907258194
@facebook-github-bot
Contributor

This pull request has been merged in a2901f9.

Harleen8118 pushed a commit to Harleen8118/IBERT that referenced this pull request Jun 26, 2025
…ain_step (#2)

Summary:
Pull Request resolved: fairinternal/fairspeq#2

Pull Request resolved: facebookresearch/fairseq#689

We found that not raising OOM errors during trainer.train_step causes various
issues, including NCCL hangs and gloo sync errors, because gradients are not
synced properly across workers. Until we find the root cause, let's give users
an option to raise OOMs.

Reviewed By: jmp84

Differential Revision: D15170357

fbshipit-source-id: 3e15e4e111a8380612157955509c39821a216ec4