only produce tensorboard logs on rank 0 by default #339

tianyu-l · 2024-05-16T22:57:12Z

Stack from ghstack (oldest at bottom):

-> only produce tensorboard logs on rank 0 by default #339

For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do TensorBoard logging only on rank 0; the straggler effect of the writes should be small. Users can always toggle on all-rank logging for debugging purposes.
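As a rough illustration of the gating described here (not the actual diff in this PR), a minimal sketch could look like the following. The names `MetricLogger`, `build_metric_logger`, `rank_0_only`, and the per-rank `rank_{rank}` subdirectory are hypothetical and chosen only for this example; the one thing assumed from the environment is the `RANK` variable that `torchrun` sets for each process.

```python
import os
from typing import Callable, Dict, Optional, Protocol


class ScalarWriter(Protocol):
    """Anything with a SummaryWriter-style add_scalar method."""

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None: ...


class MetricLogger:
    """Hypothetical logger: a no-op on ranks that were not given a writer."""

    def __init__(self, writer: Optional[ScalarWriter]) -> None:
        self.writer = writer

    def log(self, metrics: Dict[str, float], step: int) -> None:
        if self.writer is None:
            return  # logging is disabled on this rank
        for tag, value in metrics.items():
            self.writer.add_scalar(tag, value, step)


def build_metric_logger(
    make_writer: Callable[[str], ScalarWriter],
    log_dir: str,
    rank: Optional[int] = None,
    rank_0_only: bool = True,  # hypothetical toggle name
) -> MetricLogger:
    """Create a writer on rank 0 only, unless all-rank logging is requested."""
    if rank is None:
        # torchrun exports RANK for every process; default to 0 when unset.
        rank = int(os.environ.get("RANK", "0"))
    if rank_0_only and rank != 0:
        return MetricLogger(None)
    if not rank_0_only:
        # Assumption for this sketch: one subdirectory per rank so that
        # event files from different ranks do not get mixed together.
        log_dir = os.path.join(log_dir, f"rank_{rank}")
    return MetricLogger(make_writer(log_dir))
```

In a real run, `make_writer` could be `torch.utils.tensorboard.SummaryWriter`, which accepts the log directory as its first argument; passing the factory in keeps the sketch runnable without TensorBoard installed.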

[ghstack-poisoned]

ghstack-source-id: 7bd4cc24d89dcffe95eb512ff236387fa8d1582b Pull Request resolved: #339

For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do TensorBoard logging only on rank 0; the straggler effect of the writes should be small. Users can always toggle on all-rank logging for debugging purposes. Plus some minor changes, e.g. require 2.4.0.dev for the torch version, since we are using more and more recent changes from core.

[ghstack-poisoned]

ghstack-source-id: 1d228f271db275dd229fae61b3ca064141afcacb Pull Request resolved: #339

1. For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do TensorBoard logging only on rank 0; the straggler effect of the writes should be small. Users can always toggle on all-rank logging for debugging purposes.
2. Remove the `torch` dependency from `requirements.txt`, as it cannot work alone and is not used anyway. We currently suggest in the README that users install the latest nightly, and we do the same in all the CI tests.

[ghstack-poisoned]

ghstack-source-id: 1a8427b4434d626ab8688fd1605adb35a702068e Pull Request resolved: #339

tianyu-l · 2024-05-17T22:58:37Z

not sure why the 1D compile test is failing...

wanchaol

lgtm

only produce tensorboard logs on rank 0 by default

9e07239

[ghstack-poisoned]

tianyu-l added a commit that referenced this pull request May 16, 2024

only produce tensorboard logs on rank 0 by default

ee0b8f4

ghstack-source-id: 7bd4cc24d89dcffe95eb512ff236387fa8d1582b Pull Request resolved: #339

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 16, 2024

tianyu-l added a commit that referenced this pull request May 16, 2024

only produce tensorboard logs on rank 0 by default

0efa88e

ghstack-source-id: 1d228f271db275dd229fae61b3ca064141afcacb Pull Request resolved: #339

tianyu-l linked an issue May 16, 2024 that may be closed by this pull request

add config option to only produce tensorboard logs on rank 0 #304

Closed

tianyu-l added a commit that referenced this pull request May 17, 2024

only produce tensorboard logs on rank 0 by default

36f3d30

ghstack-source-id: 1a8427b4434d626ab8688fd1605adb35a702068e Pull Request resolved: #339

tianyu-l requested a review from wanchaol May 17, 2024 22:58

tianyu-l added a commit that referenced this pull request May 22, 2024

only produce tensorboard logs on rank 0 by default

80553d0

ghstack-source-id: d21ea029e6ec72596e68d231f5bf74df32e3c663 Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 22, 2024

only produce tensorboard logs on rank 0 by default

53b45a0

ghstack-source-id: e471ebb034764268da5e15336af9299f1ff2ad46 Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 22, 2024

only produce tensorboard logs on rank 0 by default

77cff88

ghstack-source-id: d38148fed2e51654b45b59a086cd5bac03e77179 Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 23, 2024

only produce tensorboard logs on rank 0 by default

dbcc6dc

ghstack-source-id: ba3afbd496d80c9b51ab49142de57f1e0a4e7cb1 Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 28, 2024

only produce tensorboard logs on rank 0 by default

98c17df

ghstack-source-id: 1fbc146696046326bff72cfeb192625ccfda055e Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 29, 2024

only produce tensorboard logs on rank 0 by default

54aecd2

ghstack-source-id: c6cf5ef43918478b27d65944ec1c217cf2794fe2 Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 29, 2024

only produce tensorboard logs on rank 0 by default

0f928ee

ghstack-source-id: 8d4a50e453d0be2b4a4400ac09a1a793ce8726e5 Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 29, 2024

only produce tensorboard logs on rank 0 by default

97d00cf

ghstack-source-id: 79d54f750374c8c54460b562a16724b10df547e0 Pull Request resolved: #339

tianyu-l added a commit that referenced this pull request May 29, 2024

only produce tensorboard logs on rank 0 by default

cb4fd58

ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243 Pull Request resolved: #339

wanchaol approved these changes May 29, 2024

tianyu-l merged commit 482f5ae into gh/tianyu-l/12/base May 29, 2024
4 checks passed

tianyu-l added a commit that referenced this pull request May 29, 2024

only produce tensorboard logs on rank 0 by default

6a8455e

ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243 Pull Request resolved: #339

tianyu-l deleted the gh/tianyu-l/12/head branch May 29, 2024 21:51

tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024

only produce tensorboard logs on rank 0 by default

4d28f76

ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243 Pull Request resolved: pytorch#339

philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024

only produce tensorboard logs on rank 0 by default

0779207

ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243 Pull Request resolved: pytorch#339
