Skip to content

Conversation

peterbell10
Copy link
Collaborator

@peterbell10 peterbell10 commented Aug 10, 2022

Stack from ghstack (oldest at bottom):

Closes #82888

This is a tentative fix. make is called by ninja so should be run in
parallel with other jobs already.

Closes #82888

This is a tentative fix. NCCL is built by ninja so should be run in
parallel with other jobs and so we are doubling up on parallelism.

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 10, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit a38ae4b (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@peterbell10 peterbell10 added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 10, 2022
Closes #82888

This is a tentative fix. make is called by ninja so should be run in
parallel with other jobs already.

[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Aug 10, 2022
Closes #82888

This is a tentative fix. NCCL is built by ninja so should be run in
parallel with other jobs and so we are doubling up on parallelism.

ghstack-source-id: c4e65ab
Pull Request resolved: #83173
@peterbell10 peterbell10 marked this pull request as ready for review August 10, 2022 17:26
@peterbell10 peterbell10 requested a review from a team as a code owner August 10, 2022 17:26
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would unnecessarily slows down the build, wouldn't it? Also, I'm not sure why -j 6 causes OOMS after version update, unless NCCL build itself spawns number of parallel builds now

@peterbell10
Copy link
Collaborator Author

peterbell10 commented Aug 10, 2022

This would unnecessarily slows down the build, wouldn't it?

If you're only building nccl, it will be slower. However, within the context of a complete pytorch build any unused threads will be used for other pytorch build jobs.

Also, I'm not sure why -j 6 causes OOMS after version update, unless NCCL build itself spawns number of parallel builds now

It's running 11 compile jobs in parallel, 6 nccl + (6-1) for pytorch. It may have been very close to running OOM before, and a small increase in memory consumption by one of the updated files has tipped it over the edge.

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, what are we loosing here

@malfet
Copy link
Contributor

malfet commented Aug 10, 2022

@pytorchbot merge -g

@malfet
Copy link
Contributor

malfet commented Aug 10, 2022

@janeyx99 is there a dashboard somewhere to monitor typical job runtime? (want to check if this one significantly increases buildtime)

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Copy link
Contributor

Hey @peterbell10.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@huydhn
Copy link
Contributor

huydhn commented Aug 10, 2022

@malfet This is probably useful to you https://hud.pytorch.org/tts/pytorch/pytorch/master?jobName=trunk%20%2F%20libtorch-linux-bionic-cuda11.6-py3.7-gcc7%20%2F%20build for monitoring the build time for specific jobs

facebook-github-bot pushed a commit that referenced this pull request Aug 11, 2022
Summary:
Closes #82888

This is a tentative fix. make is called by ninja so should be run in
parallel with other jobs already.

Pull Request resolved: #83173
Approved by: https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/1c83ec8f61209b1ce51543092d06edc23daffaf8

Reviewed By: seemethere

Differential Revision: D38600293

fbshipit-source-id: c2aca0077b9fbbb088c4a2f4dd9c3bd819c8cca0
@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/381/head branch August 14, 2022 14:18
peterbell10 added a commit that referenced this pull request Aug 18, 2022
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.

[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Aug 18, 2022
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.

ghstack-source-id: 85f5588
Pull Request resolved: #83696
peterbell10 added a commit that referenced this pull request Aug 24, 2022
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.

[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Aug 24, 2022
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.

[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Aug 24, 2022
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.

[ghstack-poisoned]
peterbell10 added a commit that referenced this pull request Aug 24, 2022
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Aug 25, 2022
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.
Pull Request resolved: #83696
Approved by: https://github.com/malfet
facebook-github-bot pushed a commit that referenced this pull request Aug 26, 2022
Summary:
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.

This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.

Pull Request resolved: #83696
Approved by: https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/2000eba4547f885dc937c4335bee4ba1a71b4df5

Reviewed By: weiwangmeta

Differential Revision: D39011903

Pulled By: weiwangmeta

fbshipit-source-id: d2431718f6c32c4ddaed7b252c230bc5680780ad
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged open source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants