Flight Recorder - feature requests #117883
Comments
- 1, 2, 3, 6 - I think we should just implement these; it's clear enough not to need additional design.
- 4 - we already store the sequence id, so what would we store except sequence id - 1? Is this meant to handle the case where an entire set of ops from a process group has fallen off the end of the buffer? Can this realistically happen? Most of the process groups are sequentially tied together.
- 5 - don't all ranks have to initialize process groups in the same order? If not, we should do this.
My read of this was that we don't include info in the dump that decodes the PG mappings. If you have the trainer logs, we should have enough info about the initialization of sub-PGs, but I think we can add this same info into the dump to make it more self-contained. This is easier to do after fixing (1), since we can add a separate key for pg_defs or something.
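As a rough illustration of what a more self-contained dump could look like, here is a sketch in which a separate top-level key records how each sub-PG was created. The key names (`entries`, `pg_config`) and field names are illustrative assumptions, not the actual dump schema:

```python
# Hypothetical self-contained trace dump: alongside the list of collective
# entries, a "pg_config" key records each process group's global rank
# membership, so the dump can be decoded without the trainer logs.
dump = {
    "entries": [
        {"seq_id": 41, "process_group": "pg_1", "profiling_name": "nccl:all_reduce"},
        {"seq_id": 42, "process_group": "pg_0", "profiling_name": "nccl:broadcast"},
    ],
    "pg_config": {
        "pg_0": {"ranks": [0, 1, 2, 3], "backend": "nccl"},  # default (world) group
        "pg_1": {"ranks": [0, 1], "backend": "nccl"},        # sub-group of ranks 0-1
    },
}

def decode_entry(dump, entry):
    """Resolve an entry's process group to its global rank membership."""
    return dump["pg_config"][entry["process_group"]]["ranks"]

print(decode_entry(dump, dump["entries"][0]))  # ranks of pg_1 -> [0, 1]
```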
Agreed that it's potentially more confusing than helpful to log the time our async thread marks a work as complete (which is not closely coupled to the time it actually finished on the GPU). On the other hand, I can think of a useful metric we can compute: if we compare 'duration_ms' to 'completed - started', we can estimate the times when the system is under heavy contention for CUDA APIs and the watchdog loop slows down.
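A sketch of that metric, assuming per-entry fields named `duration_ms` (CUDA-event-measured kernel time) and `time_started_ns` / `time_completed_ns` (watchdog wall-clock timestamps) — the field names are assumptions for illustration:

```python
# Compare the GPU-measured duration of a collective to the wall-clock gap
# between the watchdog marking it started and completed. A lag much larger
# than the GPU duration suggests the watchdog loop was slowed, e.g. by
# contention for CUDA APIs.
def watchdog_lag_ms(entry):
    wall_ms = (entry["time_completed_ns"] - entry["time_started_ns"]) / 1e6
    return wall_ms - entry["duration_ms"]

entries = [
    {"duration_ms": 2.0, "time_started_ns": 0, "time_completed_ns": 3_000_000},
    {"duration_ms": 2.0, "time_started_ns": 0, "time_completed_ns": 250_000_000},
]
lags = [watchdog_lag_ms(e) for e in entries]
# the second entry's much larger lag hints at a slowed watchdog loop
```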
I wonder if we could also dump the latest 'queued' and latest 'completed' collectives for each rank, in addition to the last N collectives. That might make it easy to identify which collective is the last sync point across all ranks, and which rank might be the culprit when a timeout happens (e.g., it never entered the collective).
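A sketch of the analysis this would enable, assuming we can extract each rank's latest completed collective sequence id from its dump (the data shape here is a hypothetical example):

```python
# Given each rank's latest completed collective seq id, the ranks whose
# latest seq id lags the maximum are candidate culprits for a hang
# (e.g., a rank that never entered the collective the others are blocked on).
def find_lagging_ranks(latest_completed_by_rank):
    max_seq = max(latest_completed_by_rank.values())
    return sorted(r for r, s in latest_completed_by_rank.items() if s < max_seq)

latest = {0: 107, 1: 107, 2: 103, 3: 107}  # illustrative per-rank data
print(find_lagging_ranks(latest))  # rank 2 stopped making progress first -> [2]
```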
Putting the list of entries into a particular key of a top-level dict paves the way for adding other metadata as other top level keys. Addresses 1 and 2 from #117883 Pull Request resolved: #118044 Approved by: https://github.com/zdevito
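Consuming the new format could look roughly like this. The sketch assumes the trace is serialized with pickle and that the new top-level dict stores the records under an `"entries"` key (other key names are assumptions):

```python
import pickle

# The new format is a top-level dict whose "entries" key holds the list of
# collective records, leaving room for other top-level metadata keys later.
# A loader can stay backward compatible with the old bare-list format:
def load_entries(raw_bytes):
    dump = pickle.loads(raw_bytes)
    if isinstance(dump, dict):   # new format: dict with an "entries" key
        return dump["entries"]
    return dump                  # old format: bare list of entries

blob = pickle.dumps({"entries": [{"seq_id": 1}], "version": "1.0"})
print(load_entries(blob))
```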
Addresses (3) from #117883 Pull Request resolved: #118046 Approved by: https://github.com/zdevito ghstack dependencies: #118044
Addresses (6) from #117883 Pull Request resolved: #118047 Approved by: https://github.com/zdevito ghstack dependencies: #118044, #118046
I might be missing the reasoning behind this. I think we assume it's easy to tune the ring buffer size so that it will, with high confidence, include all the relevant collectives. It should be pretty unlikely that you manage to schedule, for example, 20k more collectives after one that hangs, and IIUC the CUDA driver doesn't even let you push more than about 2,000 kernels into its queue at a time. If for some reason 20k events in the buffer isn't enough, it should be possible to 10x that without making the files too big either. But maybe I'm missing your point?
Not sure what in particular this was in reference to. But a lot of this feedback was as much about "can you produce the data" as about "is it easy to write the tool". I'd rather not have to write tooling that makes a bunch of assumptions/assertions about the data format, and would generally prefer that the data format be easy to interpret as-is.
In reference to (4) from @bmaurer above
I'm just confused about what this request means. Is the last dropped record just the (N+1)-th element in a queue of size N? If so, you get the same value by increasing the queue size, but then you could still ask for one more. I think I missed the point here.
And from @shuqiangzhang:
This seems like more of an 'up front processing' question. I'm assuming the latest queued and completed collectives will be inside the buffer of size N, since we tune N on the large side. Then it's just whether we maintain the 'latest' per category as a convenience or let users parse it out of the event list later on. I assume parsing is easy enough here (though I think @bmaurer is asking for the opposite). Happy to add it if helpful.
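The "let users parse it out" option could be as simple as a single pass over the event list. The `state` field values below (`'scheduled'`, `'completed'`) are illustrative assumptions about the entry schema:

```python
# Recover the latest entry per state from the event list, e.g. the latest
# queued ('scheduled') and latest 'completed' collectives, without the dump
# having to store them separately.
def latest_per_state(entries):
    latest = {}
    for e in entries:  # entries assumed ordered by seq_id
        latest[e["state"]] = e
    return latest

events = [
    {"seq_id": 1, "state": "completed"},
    {"seq_id": 2, "state": "completed"},
    {"seq_id": 3, "state": "scheduled"},
]
print(latest_per_state(events))
```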
cc @lobanova
From @bmaurer
From @jackphelanmeta
Bugs / code refactors
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang @d4l3k @rohan-varma @zdevito @shuqiangzhang @wconstab