
Flight Recorder - feature requests #117883

Open · 4 of 9 tasks
wconstab opened this issue Jan 19, 2024 · 8 comments
Labels: oncall: distributed, triaged

Comments

wconstab (Contributor) commented Jan 19, 2024

From @bmaurer

  • 1 The top-level pickle object should be a dict rather than an array. Right now we have: [<nccl op 1>, …]. I'm suggesting {operations: [<nccl op 1>, …]} (see the sketch after this list).
  • 2 With the use of a dict above, we should make the file self-describing (i.e., it should have a record of which rank it was, host name, MAST job, attempt, etc.).
  • 3 Consider adding a data version tag (e.g., file format version 3) in the dict to make tooling versioning easier.
  • 4 Different ranks will have different "starting" points for their trace. In other words, rank 1 and rank 2 might both have executed an operation in PG 1 with seq_id 5, but rank 2 might have had this sequence fall off its buffer. Maybe we should record the "last dropped record" for each PG so that we know to avoid analyzing those cases.
  • 5 It's impossible to tell what the members of a given PG ID are (and whether PG IDs are expected to be unique across the job, or whether you might have two PG ID "1"s with distinct sets of ranks; I know we've done some optimizations in initialization that could lead to this).
  • 6 Consider recording time in nanoseconds vs. micros. It's a nit, but nanos is how you get time from the kernel, so anything else is doing extra division and losing data while still being a uint64.
  • 7 Consider adding a completion time (e.g., to help highlight cases where a given rank is always late in joining, or completes the collective "too fast").
  • 9 Had to tweak timeouts to get a dump from all ranks (TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC from 2 sec to 60 sec).
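For reference, a minimal sketch of what (1), (2), (3), and (6) could look like together. Every key name here is an illustrative assumption, not a final schema:

```python
# Hypothetical top-level layout for the flight recorder dump.
# Key names ("version", "rank", "host", "entries") are assumptions
# chosen to illustrate the requests above.
dump = {
    "version": "1.0",             # (3) format version tag for tooling
    "rank": 3,                    # (2) self-describing metadata:
    "host": "trainer-host-0123",  #     rank, host name, job, attempt, ...
    "entries": [                  # (1) the old top-level array, now nested
        {
            "profiling_name": "nccl:all_reduce",
            "seq_id": 5,
            "time_created_ns": 1705700000000000000,  # (6) ns, still a uint64
        },
        # ...
    ],
}
```

Nesting the entry list under a key lets tooling check the version tag before parsing, and new metadata can be added later without breaking old readers.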

From @jackphelanmeta

  • 8 Support writing the dump in Chrome tracing format (see the converter sketch below).
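A rough converter sketch for this request, assuming entries carry a name and start/end timestamps in nanoseconds (the entry field names here are guesses, not the actual dump schema):

```python
import json

def to_chrome_trace(entries, rank, path):
    """Emit Chrome trace 'complete' events (ph='X'), viewable in
    chrome://tracing or Perfetto. Entry field names are assumptions."""
    events = []
    for e in entries:
        start_ns = e.get("time_started_ns")
        end_ns = e.get("time_completed_ns")
        if start_ns is None or end_ns is None:
            continue  # skip entries that never started or completed
        events.append({
            "name": e["profiling_name"],  # e.g. "nccl:all_reduce"
            "ph": "X",                    # complete event: start + duration
            "ts": start_ns / 1e3,         # Chrome expects microseconds
            "dur": (end_ns - start_ns) / 1e3,
            "pid": rank,                  # one process row per rank
            "tid": e.get("pg_id", 0),     # one thread row per process group
        })
    with open(path, "w") as f:
        json.dump({"traceEvents": events}, f)
```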

Bugs / code refactors

  • fut.get() waits forever instead of timing out (fix: [C10D] Fix nccl flightrecorder ignored dump timeout #118142)
  • waitForDumpOrTimeout is overly complicated; we should simplify it so it doesn't need two timevals
  • launchAsyncDebugDump is called from several places. Why not call it from one place in the heartbeat monitor and use TCPStore to signal it?
  • ~ProcessGroupNCCL calls abort, which can hang. We might want to move that to the destroy_process_group API instead (and get it out of the path where exceptions during Python shutdown turn into a collective hang)
  • avoid symbolize() if dumping from ~ProcessGroupNCCL
  • should we check how many ranks wrote the dump and log that? (see the sketch after this list)
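For the last item, a hedged sketch of counting dump writers through the store the ranks already share; Store.add is an existing c10d API, while the key name is made up:

```python
import torch.distributed as dist

def record_dump_written(store: dist.Store, world_size: int) -> None:
    # Store.add atomically increments an integer counter shared by all
    # ranks; "fr_dumps_written" is a hypothetical key name.
    written = store.add("fr_dumps_written", 1)
    print(f"flight recorder: {written}/{world_size} ranks have written a dump")
```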

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang @d4l3k @rohan-varma @zdevito @shuqiangzhang @wconstab

cpuhrsch added the oncall: distributed label Jan 20, 2024
zdevito (Contributor) commented Jan 22, 2024

1, 2, 3, 6 - I think we should just implement these. It's clear enough not to need additional design.

4 - We already store the sequence id, so what would we store except seq_id - 1? Is this meant to handle the case where an entire set of ops from a process group has fallen off the end? Can this realistically happen? Most of the process groups are sequentially tied together.

5 - Don't all ranks have to initialize process groups in the same order? If not, we should do this.

7 - CPU completion time is not GPU completion time. No rank will be marked complete in the logs until the last message comes in that causes all the GPU kernels to exit. We can't measure the last message from the GPU except by changing NCCL. Measuring it here would be misleading: it would look like all ranks basically end at the same time. My preference is to either report the information we'd really want (GPU completion time) or no information at all (the current state). We may be able to measure kernel start time, since the straggler might start late, but this won't catch all late messages.

9 - If we already exit when we see the last dump, we should set this to a minute. It is probably just timing out too early.

wconstab (Contributor, Author):

> 5 It's impossible to tell what the members of a given PG ID are (and whether PG IDs are expected to be unique across the job, or whether you might have two PG ID "1"s with distinct sets of ranks; I know we've done some optimizations in initialization that could lead to this).

My read of this was that we don't include info in the dump that decodes the PG mappings. If you have the trainer logs, we should have enough info about the initialization of sub-PGs, but I think we can add this same info into the dump to make it more self-contained. This is easier to do after fixing (1), since we can add a separate key for pg_defs or something (see the sketch below).
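As a sketch of the idea (the key and field names are hypothetical, riffing on the "pg_defs" suggestion above), the dump could gain a section mapping PG IDs to their members:

```python
# Hypothetical "pg_config" section making the dump self-contained
# about process group membership; all names and fields are assumptions.
dump["pg_config"] = {
    "0": {"desc": "default_pg",      "ranks": [0, 1, 2, 3, 4, 5, 6, 7]},
    "1": {"desc": "tensor_parallel", "ranks": [0, 1, 2, 3]},
    "2": {"desc": "tensor_parallel", "ranks": [4, 5, 6, 7]},
}
```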

wconstab (Contributor, Author):

> 7 - CPU completion time is not GPU completion time. No rank will be marked complete in the logs until the last message comes in that causes all the GPU kernels to exit. We can't measure the last message from the GPU except by changing NCCL. Measuring it here would be misleading: it would look like all ranks basically end at the same time. My preference is to either report the information we'd really want (GPU completion time) or no information at all (the current state). We may be able to measure kernel start time, since the straggler might start late, but this won't catch all late messages.

Agreed, it's potentially more confusing than helpful if we log the time that our async thread marks a work as complete (which is not closely coupled to the time it actually finished on the GPU). On the other hand, I can think of a useful metric we can compute: if we compare 'duration_ms' to 'completed - started', we can estimate the times when the system is under heavy contention for CUDA APIs and the watchdog loop slows down (see the sketch below).
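A sketch of that heuristic, assuming a GPU-measured 'duration_ms' plus watchdog-observed start/complete timestamps (the field names are assumptions):

```python
def watchdog_lag_ms(entry: dict) -> float:
    # Wall time between when the watchdog noticed the start and the
    # completion of the work, minus the GPU-measured kernel duration.
    # A large positive gap suggests the watchdog loop (or the CUDA APIs
    # it polls) was running slowly, e.g. under heavy contention.
    observed_ms = (entry["time_completed_ns"] - entry["time_started_ns"]) / 1e6
    return observed_ms - entry["duration_ms"]
```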

shuqiangzhang (Contributor) commented:

I wonder if we could also dump the latest 'queued' and latest 'completed' collectives for each rank, in addition to the last N collectives. That might make it easy to identify which collective is the last sync point across all ranks, and which rank might be the culprit when a timeout happens (e.g., a rank that never entered the collective).

pytorchmergebot pushed a commit that referenced this issue Jan 23, 2024
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.

Addresses 1 and 2 from #117883

Pull Request resolved: #118044
Approved by: https://github.com/zdevito
pytorchmergebot pushed a commit that referenced this issue Jan 23, 2024
Addresses (3) from #117883

Pull Request resolved: #118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
pytorchmergebot pushed a commit that referenced this issue Jan 23, 2024
Addresses (6) from #117883

Pull Request resolved: #118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
wconstab (Contributor, Author):

I might be missing the reasoning behind this. I think we assume that it's easy to tune the ring buffer size so that it will, with high confidence, include all the relevant collectives. It should be pretty unlikely that you manage to schedule, for example, 20k more collectives after one that hangs, and IIUC the CUDA driver doesn't even let you push more than about 2,000 kernels into its queue at a time. If for some reason 20k events in the buffer isn't enough, it should be possible to 10x that without making the files too big either. But maybe I'm missing your point?

bmaurer commented Jan 26, 2024

> I might be missing the reasoning behind this. I think we assume that it's easy to tune the ring buffer size so that it will, with high confidence, include all the relevant collectives. It should be pretty unlikely that you manage to schedule, for example, 20k more collectives after one that hangs, and IIUC the CUDA driver doesn't even let you push more than about 2,000 kernels into its queue at a time. If for some reason 20k events in the buffer isn't enough, it should be possible to 10x that without making the files too big either. But maybe I'm missing your point?

Not sure what in particular this was in reference to. But a lot of this feedback was as much about "can you produce the data" as "is it easy to write the tool". I'd rather not have to write tooling that makes a bunch of assumptions/assertions about the data format, and would generally prefer that the data format be easy to interpret as-is.

wconstab (Contributor, Author):

In reference to (4) from @bmaurer above

> Maybe we should record the "last dropped record" for each PG so that we can know to avoid analyzing those cases

I'm just confused about what this request means. Is the last dropped record just the (N+1)th element in a queue of size N? If so, you get the same value by increasing the queue size, but then you could still ask for one more. I think I missed the point here.

and from @shuqiangzhang

> Wonder if we could also dump the latest 'queued' and latest 'completed' collectives for each rank, in addition to the last N

This seems like more of an 'up front processing' question -- e.g., I'm assuming the latest queued and completed collectives will be inside the buffer of size N, since we tune N toward the large side. Then it's just a question of whether we maintain the 'latest' per category as a convenience, or let users parse it out of the event list later on. I assume parsing is easy enough here (see the sketch below), but I think @bmaurer is asking for the opposite. Happy to add it if helpful.
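A minimal parse sketch for this, assuming each entry carries a 'state' field, a pg id, and a seq_id (the field names are assumptions):

```python
def latest_by_state(entries):
    # Walk the entry list in order and remember, per process group,
    # the last collective seen in each state.
    latest = {}
    for e in entries:
        slot = latest.setdefault(e.get("pg_id", 0), {})
        slot[e["state"]] = e["seq_id"]  # e.g. "scheduled" -> 41, "completed" -> 40
    return latest
```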

ezyang (Contributor) commented Feb 2, 2024

cc @lobanova

c-p-i-o self-assigned this Jun 6, 2024
c-p-i-o added the triaged label Jun 6, 2024