
Flight Recorder - feature requests #117883

Open · 4 of 9 tasks
wconstab opened this issue Jan 19, 2024 · 8 comments
Labels: oncall: distributed, triaged

Comments

wconstab (Contributor) commented Jan 19, 2024

From @bmaurer

  • 1 The top-level pickle object should be a dict rather than an array. Right now we have: [<nccl op 1>, …]. I'm suggesting {operations: [<nccl op 1>, …]} (see the sketch after this list).
  • 2 With the use of a dict above, we should make the file self-describing (i.e., it should have a record of which rank it was, host name, MAST job, attempt, etc.).
  • 3 Consider adding a data version tag (e.g., file format version 3) in the dict to make tooling versioning easier.
  • 4 Different ranks will have different "starting" points for their trace. In other words, rank 1 and rank 2 might both have executed an operation in PG 1 with seq_id 5, but rank 2 might have had this sequence fall off its buffer. Maybe we should record the "last dropped record" for each PG so that we know to avoid analyzing those cases.
  • 5 It's impossible to tell what the members of a given PG ID are (and whether PG IDs are expected to be unique across the job, or whether you might have two PG ID "1"s with distinct sets of ranks; I know we've done some optimizations in initialization that could lead to this).
  • 6 Consider recording time in nanoseconds vs. micros. It's a nit, but nanos is how you get time from the kernel, so anything else is doing extra division and losing data while still being a uint64.
  • 7 Consider adding a completion time (e.g., to help highlight cases where a given rank is always late in joining, or completes the collective "too fast").
  • 9 Had to tweak timeouts to get a dump from all ranks (TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC from 2 sec to 60 sec).
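For reference, a minimal sketch of what (1), (2), (3), and (6) could look like together. Every key name here is an illustrative assumption, not a final schema:

```python
# Hypothetical top-level layout for the flight recorder dump.
# Key names ("version", "rank", "host", "entries") are assumptions
# chosen to illustrate the requests above.
dump = {
    "version": "1.0",             # (3) format version tag for tooling
    "rank": 3,                    # (2) self-describing metadata:
    "host": "trainer-host-0123",  #     rank, host name, job, attempt, ...
    "entries": [                  # (1) the old top-level array, now nested
        {
            "profiling_name": "nccl:all_reduce",
            "seq_id": 5,
            "time_created_ns": 1705700000000000000,  # (6) ns, still a uint64
        },
        # ...
    ],
}
```

Nesting the entry list under a key lets tooling check the version tag before parsing, and new metadata can be added later without breaking old readers.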

From @jackphelanmeta

  • 8 Support writing the dump in Chrome tracing format (see the converter sketch below).
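A rough converter sketch for this request, assuming entries carry a name and start/end timestamps in nanoseconds (the entry field names here are guesses, not the actual dump schema):

```python
import json

def to_chrome_trace(entries, rank, path):
    """Emit Chrome trace 'complete' events (ph='X'), viewable in
    chrome://tracing or Perfetto. Entry field names are assumptions."""
    events = []
    for e in entries:
        start_ns = e.get("time_started_ns")
        end_ns = e.get("time_completed_ns")
        if start_ns is None or end_ns is None:
            continue  # skip entries that never started or completed
        events.append({
            "name": e["profiling_name"],  # e.g. "nccl:all_reduce"
            "ph": "X",                    # complete event: start + duration
            "ts": start_ns / 1e3,         # Chrome expects microseconds
            "dur": (end_ns - start_ns) / 1e3,
            "pid": rank,                  # one process row per rank
            "tid": e.get("pg_id", 0),     # one thread row per process group
        })
    with open(path, "w") as f:
        json.dump({"traceEvents": events}, f)
```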

Bugs / code refactors

  • fut.get() waits forever instead of timing out (fix: [C10D] Fix nccl flightrecorder ignored dump timeout #118142)
  • waitForDumpOrTimeout is overly complicated; we should simplify it so it doesn't need two timevals
  • launchAsyncDebugDump is called from several places. Why not call it from one place in the heartbeat monitor and use TCPStore to signal it?
  • ~ProcessGroupNCCL calls abort, which can hang. We might want to move that to the destroy_process_group API instead (and get it out of the path where exceptions during Python shutdown turn into a collective hang)
  • avoid symbolize() if dumping from ~ProcessGroupNCCL
  • should we check how many ranks wrote the dump and log that? (see the sketch after this list)
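For the last item, a hedged sketch of counting dump writers through the store the ranks already share; Store.add is an existing c10d API, while the key name is made up:

```python
import torch.distributed as dist

def record_dump_written(store: dist.Store, world_size: int) -> None:
    # Store.add atomically increments an integer counter shared by all
    # ranks; "fr_dumps_written" is a hypothetical key name.
    written = store.add("fr_dumps_written", 1)
    print(f"flight recorder: {written}/{world_size} ranks have written a dump")
```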

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang @d4l3k @rohan-varma @zdevito @shuqiangzhang @wconstab

cpuhrsch added the oncall: distributed label Jan 20, 2024
zdevito (Contributor) commented Jan 22, 2024

1, 2, 3, 6 - I think we should just implement these. It's clear enough not to need additional design.

4 - We already store the sequence id, so what would we store except seq_id - 1? Is this meant to handle the case where an entire set of ops from a process group has fallen off the end? Can this realistically happen? Most of the process groups are sequentially tied together.

5 - Don't all ranks have to initialize process groups in the same order? If not, we should do this.

7 - CPU completion time is not GPU completion time. No rank will be marked complete in the logs until the last message comes in that causes all the GPU kernels to exit. We can't measure the last message from the GPU except by changing NCCL. Measuring it here would be misleading: it would look like all ranks basically end at the same time. My preference is to either report the information we'd really want (GPU completion time) or no information at all (the current state). We may be able to measure kernel start time, since the straggler might start late, but this won't catch all late messages.

9 - If we already exit when we see the last dump, we should set this to a minute. It is probably just timing out too early.

wconstab (Contributor, Author):

> 5 It's impossible to tell what the members of a given PG ID are (and whether PG IDs are expected to be unique across the job, or whether you might have two PG ID "1"s with distinct sets of ranks; I know we've done some optimizations in initialization that could lead to this).

My read of this was that we don't include info in the dump that decodes the PG mappings. If you have the trainer logs, we should have enough info about the initialization of sub-PGs, but I think we can add this same info into the dump to make it more self-contained. This is easier to do after fixing (1), since we can add a separate key for pg_defs or something (see the sketch below).
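As a sketch of the idea (the key and field names are hypothetical, riffing on the "pg_defs" suggestion above), the dump could gain a section mapping PG IDs to their members:

```python
# Hypothetical "pg_config" section making the dump self-contained
# about process group membership; all names and fields are assumptions.
dump["pg_config"] = {
    "0": {"desc": "default_pg",      "ranks": [0, 1, 2, 3, 4, 5, 6, 7]},
    "1": {"desc": "tensor_parallel", "ranks": [0, 1, 2, 3]},
    "2": {"desc": "tensor_parallel", "ranks": [4, 5, 6, 7]},
}
```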

wconstab (Contributor, Author):

> 7 - CPU completion time is not GPU completion time. No rank will be marked complete in the logs until the last message comes in that causes all the GPU kernels to exit. We can't measure the last message from the GPU except by changing NCCL. Measuring it here would be misleading: it would look like all ranks basically end at the same time. My preference is to either report the information we'd really want (GPU completion time) or no information at all (the current state). We may be able to measure kernel start time, since the straggler might start late, but this won't catch all late messages.

Agreed, it's potentially more confusing than helpful if we log the time that our async thread marks a work as complete (which is not closely coupled to the time it actually finished on the GPU). On the other hand, I can think of a useful metric we can compute: if we compare 'duration_ms' to 'completed - started', we can estimate the times when the system is under heavy contention for CUDA APIs and the watchdog loop slows down (see the sketch below).
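A sketch of that heuristic, assuming a GPU-measured 'duration_ms' plus watchdog-observed start/complete timestamps (the field names are assumptions):

```python
def watchdog_lag_ms(entry: dict) -> float:
    # Wall time between when the watchdog noticed the start and the
    # completion of the work, minus the GPU-measured kernel duration.
    # A large positive gap suggests the watchdog loop (or the CUDA APIs
    # it polls) was running slowly, e.g. under heavy contention.
    observed_ms = (entry["time_completed_ns"] - entry["time_started_ns"]) / 1e6
    return observed_ms - entry["duration_ms"]
```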

shuqiangzhang (Contributor) commented:

I wonder if we could also dump the latest 'queued' and latest 'completed' collectives for each rank, in addition to the last N collectives. That might make it easy to identify which collective is the last sync point across all ranks, and which rank might be the culprit when a timeout happens (e.g., a rank that never entered the collective).

pytorchmergebot pushed a commit that referenced this issue Jan 23, 2024
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.

Addresses 1 and 2 from #117883

Pull Request resolved: #118044
Approved by: https://github.com/zdevito
pytorchmergebot pushed a commit that referenced this issue Jan 23, 2024
Addresses (3) from #117883

Pull Request resolved: #118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
pytorchmergebot pushed a commit that referenced this issue Jan 23, 2024
Addresses (6) from #117883

Pull Request resolved: #118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
wconstab (Contributor, Author):

I might be missing the reasoning behind this. I think we assume that it's easy to tune the ring buffer size so that it will, with high confidence, include all the relevant collectives. It should be pretty unlikely that you manage to schedule, for example, 20k more collectives after one that hangs, and IIUC the CUDA driver doesn't even let you push more than about 2,000 kernels into its queue at a time. If for some reason 20k events in the buffer isn't enough, it should be possible to 10x that without making the files too big either. But maybe I'm missing your point?

bmaurer commented Jan 26, 2024

> I might be missing the reasoning behind this. I think we assume that it's easy to tune the ring buffer size so that it will, with high confidence, include all the relevant collectives. It should be pretty unlikely that you manage to schedule, for example, 20k more collectives after one that hangs, and IIUC the CUDA driver doesn't even let you push more than about 2,000 kernels into its queue at a time. If for some reason 20k events in the buffer isn't enough, it should be possible to 10x that without making the files too big either. But maybe I'm missing your point?

Not sure what in particular this was in reference to. But a lot of this feedback was as much about "can you produce the data" as "is it easy to write the tool". I'd rather not have to write tooling that makes a bunch of assumptions/assertions about the data format, and would generally prefer that the data format be easy to interpret as-is.

wconstab (Contributor, Author):

In reference to (4) from @bmaurer above

> Maybe we should record the "last dropped record" for each PG so that we can know to avoid analyzing those cases

I'm just confused about what this request means. Is the last dropped record just the (N+1)th element in a queue of size N? If so, you get the same value by increasing the queue size, but then you could still ask for one more. I think I missed the point here.

and from @shuqiangzhang

> Wonder if we could also dump the latest 'queued' and latest 'completed' collectives for each rank, in addition to the last N

This seems like more of an 'up front processing' question -- e.g., I'm assuming the latest queued and completed collectives will be inside the buffer of size N, since we tune N toward the large side. Then it's just a question of whether we maintain the 'latest' per category as a convenience, or let users parse it out of the event list later on. I assume parsing is easy enough here (see the sketch below), but I think @bmaurer is asking for the opposite. Happy to add it if helpful.
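A minimal parse sketch for this, assuming each entry carries a 'state' field, a pg id, and a seq_id (the field names are assumptions):

```python
def latest_by_state(entries):
    # Walk the entry list in order and remember, per process group,
    # the last collective seen in each state.
    latest = {}
    for e in entries:
        slot = latest.setdefault(e.get("pg_id", 0), {})
        slot[e["state"]] = e["seq_id"]  # e.g. "scheduled" -> 41, "completed" -> 40
    return latest
```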

ezyang (Contributor) commented Feb 2, 2024

cc @lobanova

c-p-i-o self-assigned this Jun 6, 2024
c-p-i-o added the triaged label Jun 6, 2024