fix(nccl): increased nccl timeout#11

Merged
knzo25 merged 1 commit into tier4:main from knzo25:fix/nccl_timeout
Mar 5, 2025

@knzo25 (Contributor) commented Mar 5, 2025

Summary

During the BEVFusion deployment I made several changes that are not yet in the main branch.
To keep track of them, I will split them into several self-contained PRs.

In this one, I increased the NCCL timeout. The timeout was being triggered by the metrics evaluation taking over 10 minutes (600 s seems to be the default value).

Change point

Increased the NCCL timeout
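
For illustration, here is a minimal sketch of how a timeout like this could be raised. It assumes an MMEngine-style config whose `env_cfg.dist_cfg` entries are eventually passed to `torch.distributed.init_process_group`, which takes its `timeout` as a `datetime.timedelta`. The value `3600` and the helper `process_group_kwargs` are hypothetical and are not taken from this PR's diff.

```python
from datetime import timedelta

# Hypothetical value: the PR description does not state the new timeout.
NCCL_TIMEOUT_SECONDS = 3600

# Assumed MMEngine-style environment config; the timeout is given in seconds.
env_cfg = dict(
    dist_cfg=dict(backend="nccl", timeout=NCCL_TIMEOUT_SECONDS),
)


def process_group_kwargs(dist_cfg):
    """Build keyword arguments for torch.distributed.init_process_group.

    Copies the config and converts an integer timeout in seconds into the
    datetime.timedelta that init_process_group expects. The input dict is
    left unmodified.
    """
    kwargs = dict(dist_cfg)
    timeout = kwargs.get("timeout")
    if timeout is not None:
        kwargs["timeout"] = timedelta(seconds=timeout)
    return kwargs


kwargs = process_group_kwargs(env_cfg["dist_cfg"])
# In a real multi-GPU job this would be followed by something like:
#   torch.distributed.init_process_group(**kwargs)
```

The timeout only bounds how long ranks wait on a collective operation, so a larger value should not slow down runs in which every rank finishes evaluation in time.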

Related error:
https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out

Note

It applies to all 3D detection config files, but does not negatively affect their runtime.

Test performed

  • Training with two GPUs worked, whereas previously it failed during evaluation in the first epoch (note: during the first few epochs, evaluation tends to take longer due to a high number of false positives)

…uating the model takes too long

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
@knzo25 knzo25 self-assigned this Mar 5, 2025
@knzo25 knzo25 marked this pull request as ready for review March 5, 2025 01:57
@knzo25 knzo25 requested a review from scepter914 as a code owner March 5, 2025 01:57
@KSeangTan (Collaborator)

I experienced a similar NCCL timeout error during my multi-head experiments, so this is a necessary and important change.

@knzo25 knzo25 merged commit 7c88ff0 into tier4:main Mar 5, 2025
2 checks passed
SamratThapa120 pushed a commit that referenced this pull request Mar 27, 2025
Signed-off-by: scepter914 <scepter914@gmail.com>