Build nccl single-threaded #83173
Closes #82888. This is a tentative fix. NCCL's build is launched by ninja, so it already runs in parallel with other jobs; giving it its own parallel jobs doubles up on parallelism.
✅ No Failures (0 Pending) as of commit a38ae4b. Looks good so far! There are no failures yet. This comment was automatically generated by Dr. CI.
Closes #82888. This is a tentative fix. `make` is called by ninja, so it should already run in parallel with other jobs.
This would unnecessarily slow down the build, wouldn't it? Also, I'm not sure why `-j 6`
causes OOMs after the version update, unless the NCCL build itself now spawns a number of parallel builds.
If you're only building NCCL, it will be slower. However, within the context of a complete PyTorch build, any unused threads will be used for other PyTorch build jobs.
It's running 11 compile jobs in parallel: 6 for NCCL + (6-1) for PyTorch. It may have been very close to running out of memory before, and a small increase in memory consumption by one of the updated files has tipped it over the edge.
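The job count above can be checked with a small sketch. The function name `peak_compile_jobs` and its parameters are hypothetical and not part of the PyTorch build; the sketch assumes ninja spends one of its job slots on the NCCL `make` call while that nested `make` runs its own compilers.

```python
def peak_compile_jobs(ninja_jobs: int, nccl_make_jobs: int) -> int:
    """Estimate peak concurrent compile jobs while the NCCL step is running.

    Assumption: ninja dedicates one job slot to the NCCL `make` call,
    leaving `ninja_jobs - 1` slots for other PyTorch targets, and the
    nested `make` runs `nccl_make_jobs` compilers of its own.
    """
    if ninja_jobs < 1 or nccl_make_jobs < 1:
        raise ValueError("job counts must be at least 1")
    return nccl_make_jobs + (ninja_jobs - 1)


# Nested `make` running 6 jobs versus a single-threaded NCCL build.
before = peak_compile_jobs(ninja_jobs=6, nccl_make_jobs=6)
after = peak_compile_jobs(ninja_jobs=6, nccl_make_jobs=1)
print(before, after)  # prints "11 6"
```

Under these assumptions, building NCCL single-threaded brings the peak back down to the 6 jobs that ninja was asked for.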
OK, what are we losing here?
@pytorchbot merge -g
@janeyx99 is there a dashboard somewhere to monitor typical job runtime? (want to check if this one significantly increases buildtime)
@pytorchbot successfully started a merge job. Check the current status here
Hey @peterbell10.
@malfet This is probably useful to you https://hud.pytorch.org/tts/pytorch/pytorch/master?jobName=trunk%20%2F%20libtorch-linux-bionic-cuda11.6-py3.7-gcc7%20%2F%20build for monitoring the build time for specific jobs
Summary: Closes #82888
This is a tentative fix. make is called by ninja so should be run in parallel with other jobs already.
Pull Request resolved: #83173
Approved by: https://github.com/malfet
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/1c83ec8f61209b1ce51543092d06edc23daffaf8
Reviewed By: seemethere
Differential Revision: D38600293
fbshipit-source-id: c2aca0077b9fbbb088c4a2f4dd9c3bd819c8cca0
Since #83173 was merged, I have noticed some CI jobs being slowed down by the NCCL build step. For example, if there are no C++ changes, sccache compiles everything else very quickly and NCCL becomes the limiting factor. This re-enables parallel builds with some safeguards to protect against oversubscription. When `make` is the parent build system, we can use `$(MAKE)` and the `make` jobserver will coordinate job allocation with the sub-process. For other build systems, this calls `make` with the `-l` flag, which should prevent it from launching jobs when the system load average is already too high.
Summary: Since #83173 was merged I have noticed some CI being slowed down by the nccl building step. e.g. if there are no C++ changes then sccache compiles everything else very quickly and nccl becomes the limiting factor.
This re-enables parallel builds with some safeguards to protect against oversubscription. When `make` is the parent build system, we can use `$(MAKE)` and the `make` jobserver will coordinate job allocation with the sub-process. For other build systems, this calls `make` with the `-l` flag which should prevent it launching jobs when the system load average is already too high.
Pull Request resolved: #83696
Approved by: https://github.com/malfet
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/2000eba4547f885dc937c4335bee4ba1a71b4df5
Reviewed By: weiwangmeta
Differential Revision: D39011903
Pulled By: weiwangmeta
fbshipit-source-id: d2431718f6c32c4ddaed7b252c230bc5680780ad
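The choice described in this summary could be sketched roughly as follows. This is illustrative Python only: the real change lives in the project's build scripts, and the helper name `nccl_make_command`, its parameters, and the choice of a load limit equal to the job count are assumptions rather than what the PR does. Only `$(MAKE)` and the `make` flags `-j` and `-l` come from the description or from GNU make itself.

```python
import os
from typing import List, Optional


def nccl_make_command(parent_is_make: bool, max_jobs: Optional[int] = None) -> List[str]:
    """Pick a command for building NCCL without oversubscribing the machine.

    Hypothetical helper. `parent_is_make` says whether the outer build
    system is `make`; `max_jobs` caps the jobs of a standalone `make`.
    """
    if parent_is_make:
        # `$(MAKE)` is expanded by the parent `make`, which lets the child
        # share job slots through the jobserver instead of adding its own.
        return ["$(MAKE)"]

    jobs = max_jobs if max_jobs is not None else (os.cpu_count() or 1)
    if jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    # `-j` allows parallel jobs; `-l` stops `make` from starting new jobs
    # while the load average is above the limit. Using the same number for
    # both is an assumption made for this sketch.
    return ["make", "-j", str(jobs), "-l", str(jobs)]


print(nccl_make_command(parent_is_make=True))
print(nccl_make_command(parent_is_make=False, max_jobs=6))
```

With a ninja parent, the `-l` limit does not reserve slots the way the jobserver does; it only makes the nested `make` back off when the machine is already busy, which is why the description says it "should" prevent oversubscription.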