only produce tensorboard logs on rank 0 by default #339

Merged
merged 12 commits into from
May 29, 2024

@tianyu-l tianyu-l commented May 16, 2024

Stack from ghstack (oldest at bottom):

For TensorBoard metrics, we mostly care about loss, memory, and WPS/MFU. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do TensorBoard logging only on rank 0; the straggler effect of the TensorBoard writes should be small. Users can always turn on all-rank logging for debugging purposes.
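
As a rough illustration of the behavior described above, here is a minimal sketch of rank-gated metric logging. It is not the code from this PR: the names `should_log`, `build_writer`, `NoOpWriter`, the `rank_0_only` flag, and the per-rank `rank_<N>` subdirectory are all hypothetical, and the rank is assumed to come from the `RANK` environment variable as set by launchers such as `torchrun`.

```python
import os
from typing import Any, Callable, Optional


class NoOpWriter:
    """Hypothetical stand-in that silently drops metrics on non-logging ranks."""

    def add_scalar(self, tag: str, value: float, step: int) -> None:
        pass

    def close(self) -> None:
        pass


def should_log(rank: int, enabled: bool = True, rank_0_only: bool = True) -> bool:
    """Return True if this rank should write TensorBoard logs."""
    if not enabled:
        return False
    # By default only rank 0 logs; turning rank_0_only off enables every rank.
    return rank == 0 or not rank_0_only


def build_writer(
    make_writer: Callable[[str], Any],
    log_dir: str,
    rank: Optional[int] = None,
    enabled: bool = True,
    rank_0_only: bool = True,
) -> Any:
    """Build a real writer on logging ranks and a no-op writer elsewhere."""
    if rank is None:
        # Assumption: the launcher exports RANK; fall back to 0 for single-process runs.
        rank = int(os.environ.get("RANK", "0"))
    if not should_log(rank, enabled, rank_0_only):
        return NoOpWriter()
    if not rank_0_only:
        # Assumption: when all ranks log, give each rank its own subdirectory.
        log_dir = os.path.join(log_dir, f"rank_{rank}")
    return make_writer(log_dir)
```

In a real training script, `make_writer` could be `torch.utils.tensorboard.SummaryWriter`, which takes the log directory as its first argument. The training loop can then call `writer.add_scalar(...)` unconditionally on every rank, and only the selected ranks touch the disk.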

tianyu-l added a commit that referenced this pull request May 16, 2024
ghstack-source-id: 7bd4cc24d89dcffe95eb512ff236387fa8d1582b
Pull Request resolved: #339
@facebook-github-bot facebook-github-bot added the CLA Signed label May 16, 2024
Plus some minor changes, e.g. requiring torch 2.4.0.dev, since we are using more and more recent changes from core.

tianyu-l added a commit that referenced this pull request May 16, 2024
ghstack-source-id: 1d228f271db275dd229fae61b3ca064141afcacb
Pull Request resolved: #339
@tianyu-l tianyu-l linked an issue May 16, 2024 that may be closed by this pull request
1. For TensorBoard metrics, we mostly care about loss, memory, and WPS/MFU. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do TensorBoard logging only on rank 0; the straggler effect of the TensorBoard writes should be small. Users can always turn on all-rank logging for debugging purposes.

2. Remove the `torch` dependency from `requirements.txt`, as it cannot work alone and is not used anyway. The README currently tells users to install the latest nightly, and all the CI tests do the same.

tianyu-l added a commit that referenced this pull request May 17, 2024
ghstack-source-id: 1a8427b4434d626ab8688fd1605adb35a702068e
Pull Request resolved: #339
@tianyu-l tianyu-l requested a review from wanchaol May 17, 2024 22:58
@tianyu-l
Contributor Author

not sure why the 1D compile test is failing...

tianyu-l added a commit that referenced this pull request May 22, 2024
ghstack-source-id: d21ea029e6ec72596e68d231f5bf74df32e3c663
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 22, 2024
ghstack-source-id: e471ebb034764268da5e15336af9299f1ff2ad46
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 22, 2024
ghstack-source-id: d38148fed2e51654b45b59a086cd5bac03e77179
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 23, 2024
ghstack-source-id: ba3afbd496d80c9b51ab49142de57f1e0a4e7cb1
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 28, 2024
ghstack-source-id: 1fbc146696046326bff72cfeb192625ccfda055e
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 29, 2024
ghstack-source-id: c6cf5ef43918478b27d65944ec1c217cf2794fe2
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 29, 2024
ghstack-source-id: 8d4a50e453d0be2b4a4400ac09a1a793ce8726e5
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 29, 2024
ghstack-source-id: 79d54f750374c8c54460b562a16724b10df547e0
Pull Request resolved: #339
tianyu-l added a commit that referenced this pull request May 29, 2024
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
Pull Request resolved: #339
@wanchaol wanchaol left a comment

lgtm

@tianyu-l tianyu-l merged commit 482f5ae into gh/tianyu-l/12/base May 29, 2024
4 checks passed
@tianyu-l tianyu-l deleted the gh/tianyu-l/12/head branch May 29, 2024 21:51
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
Pull Request resolved: pytorch#339
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
Pull Request resolved: pytorch#339
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Development

Successfully merging this pull request may close these issues.

add config option to only produce tensorboard logs on rank 0
3 participants