only produce tensorboard logs on rank 0 by default #339
Merged
[ghstack-poisoned]
tianyu-l added a commit that referenced this pull request on May 16, 2024
ghstack-source-id: 7bd4cc24d89dcffe95eb512ff236387fa8d1582b
Pull Request resolved: #339
facebook-github-bot added the CLA Signed label on May 16, 2024
For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do tb logging only on rank 0; the straggler effect from tb writes should be small. Users can always turn on all-rank logging for debugging purposes. Plus some minor changes, e.g. requiring 2.4.0.dev for the torch version, since we are using more and more recent changes from core. [ghstack-poisoned]
tianyu-l added a commit that referenced this pull request on May 16, 2024
ghstack-source-id: 1d228f271db275dd229fae61b3ca064141afcacb
Pull Request resolved: #339
1. For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do tb logging only on rank 0; the straggler effect from tb writes should be small. Users can always turn on all-rank logging for debugging purposes.
2. Remove the `torch` dependency from `requirements.txt`, as it cannot work alone and is not used anyway. The README currently suggests that users install the latest nightly, and all the CI tests do the same.
[ghstack-poisoned]
tianyu-l added a commit that referenced this pull request on May 17, 2024
ghstack-source-id: 1a8427b4434d626ab8688fd1605adb35a702068e
Pull Request resolved: #339
Not sure why the 1D compile test is failing...
tianyu-l added a commit that referenced this pull request on May 22, 2024
ghstack-source-id: d21ea029e6ec72596e68d231f5bf74df32e3c663
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request on May 22, 2024
ghstack-source-id: e471ebb034764268da5e15336af9299f1ff2ad46
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request on May 22, 2024
ghstack-source-id: d38148fed2e51654b45b59a086cd5bac03e77179
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request on May 23, 2024
ghstack-source-id: ba3afbd496d80c9b51ab49142de57f1e0a4e7cb1
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request on May 28, 2024
ghstack-source-id: 1fbc146696046326bff72cfeb192625ccfda055e
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request on May 29, 2024
ghstack-source-id: c6cf5ef43918478b27d65944ec1c217cf2794fe2
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request on May 29, 2024
ghstack-source-id: 8d4a50e453d0be2b4a4400ac09a1a793ce8726e5
Pull Request resolved: #339
For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do tb logging only on rank 0; the straggler effect from tb writes should be small. Users can always turn on all-rank logging for debugging purposes. [ghstack-poisoned]
tianyu-l added a commit that referenced this pull request on May 29, 2024
ghstack-source-id: 79d54f750374c8c54460b562a16724b10df547e0
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request on May 29, 2024
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
Pull Request resolved: #339
wanchaol approved these changes on May 29, 2024
lgtm
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request on Aug 16, 2024
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
Pull Request resolved: pytorch#339
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request on Aug 17, 2024
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
Pull Request resolved: pytorch#339
Stack from ghstack (oldest at bottom):
For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do tb logging only on rank 0; the straggler effect from tb writes should be small. Users can always turn on all-rank logging for debugging purposes.