
Observability: structured logging + training instrumentation (#3176)#3176

Open
felipemello1 wants to merge 1 commit into pytorch:main from felipemello1:export-D101049878

Conversation


@felipemello1 felipemello1 commented Apr 30, 2026

Summary:

TLDR: Enables structured logging in torchtitan. Time spans, scalars, and events can be logged per rank, with metadata, to a jsonl file or a database, using the Python standard logger. The result can then be converted into a gantt chart. It is cheap, runs on every step, and is useful for finding stragglers and high-level bottlenecks, measuring how asynchronous an RL run is, and debugging timeouts or weird behavior on some rank at some step.

For details of the APIs, please read the added readme in torchtitan/observability/structured_loggger/README.md.
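For flavor, here is a minimal sketch of what span logging through the standard logger could look like. The names `log_trace_span` and `checkpoint_save` appear in this PR's diff; the signature, the record schema, and the context-manager implementation below are assumptions, not the actual torchtitan API:

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("structured")


@contextmanager
def log_trace_span(name, **metadata):
    """Emit a named time span as one JSON line via the standard logger.

    Illustrative sketch only: the real torchtitan helper's signature and
    record format may differ.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        record = {"name": name, "duration_s": time.monotonic() - start, **metadata}
        logger.info(json.dumps(record))


with log_trace_span("checkpoint_save", step=5):
    pass  # the work being timed goes here
```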

Examples:
[screenshot: example gantt chart rendered from the structured logs]
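The jsonl-to-gantt conversion could plausibly target the Chrome trace-event format, which chrome://tracing and Perfetto render as a per-rank timeline. A hypothetical sketch — the record fields `name`/`start_s`/`duration_s`/`rank` and the function name are assumptions; the actual `generate_gantt_trace` may work differently:

```python
import json


def jsonl_spans_to_chrome_trace(jsonl_lines, out_path):
    """Turn span records into Chrome trace-viewer 'complete' events.

    Hypothetical sketch: assumes each jsonl line carries
    name/start_s/duration_s/rank fields.
    """
    events = []
    for line in jsonl_lines:
        rec = json.loads(line)
        events.append({
            "name": rec["name"],
            "ph": "X",                     # "complete" event: has a duration
            "ts": rec["start_s"] * 1e6,    # trace viewer expects microseconds
            "dur": rec["duration_s"] * 1e6,
            "pid": rec["rank"],            # one process row per rank
            "tid": rec["rank"],
        })
    with open(out_path, "w") as f:
        json.dump({"traceEvents": events}, f)
    return events
```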

Other comments

To minimize the impact of lines changed in the trainer.py code:

  • We use decorators where we can, instead of context managers
  • We call tags and spans directly inside some functions, e.g. checkpointing, garbage collection, and the profiler
  • Initialization: we add it once in the base class's build call, so everything that calls build is automatically tracked.
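The decorator style could look roughly like the following. All names here are hypothetical, and spans are collected in a plain list instead of going through logger.info, purely to keep the sketch self-contained:

```python
import functools
import time

SPANS = []  # stand-in sink; the real implementation would call logger.info


def trace_span(name=None):
    """Wrap a function in a timed span without touching its body."""
    def decorator(fn):
        span_name = name or fn.__name__

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                SPANS.append((span_name, time.monotonic() - start))

        return wrapper
    return decorator


@trace_span("build_optimizer")
def build_optimizer():
    return "adamw"  # placeholder body for the sketch
```

Compared with a `with` block, the decorator keeps the function body's indentation unchanged, which is the "minimize lines changed" goal above.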

Is it safe?

The workhorse here is logger.info, so you should treat the structured logger the same way you treat logger.info.

It can be disabled with a global flag.

If the user forgets to call init_structured_logger, we still call logger.info, but nothing is saved anywhere.

It detects is_compiling and skips if true.

Profiler default paths

To align with expectations set by some internal tooling, we also changed the kineto and memory profilers' default paths.

Test

NGPU=8 LOG_RANK=0 ./run_train.sh \
    --module llama3 --config llama3_8b \
    --training.steps 5 \
    --debug.seed=42 --debug.deterministic \
    --dump_folder="$DUMP" \
    --profiler.enable_profiling --profiler.profile_freq=5 \
    --profiler.enable_memory_snapshot

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 30, 2026

meta-codesync Bot commented Apr 30, 2026

@felipemello1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101049878.

@felipemello1 felipemello1 changed the title Observability: structured logging + pre-training instrumentation Observability: structured logging + trainer instrumentation Apr 30, 2026
@meta-codesync meta-codesync Bot changed the title Observability: structured logging + trainer instrumentation Observability: structured logging + pre-training instrumentation (#3176) Apr 30, 2026
felipemello1 added a commit to felipemello1/torchtitan that referenced this pull request Apr 30, 2026
@meta-codesync meta-codesync Bot changed the title Observability: structured logging + pre-training instrumentation (#3176) Observability: structured logging + training instrumentation (#3176) Apr 30, 2026
felipemello1 added a commit to felipemello1/torchtitan that referenced this pull request Apr 30, 2026

rakkit commented Apr 30, 2026

Questions: this generate_gantt_trace("outputs/structured_logs/", "outputs/gantt.json") seems to merge all files in the folder. For a use case where we need to resume a job multiple times (runs on the same dump folder), the logs will come from different runs, and it is weird if we merge them together.

And do we by default expect this to run on all ranks + every step? It's going to be like O(1M) steps per 5T tokens on each rank.


felipemello1 commented Apr 30, 2026

@rakkit, good questions!

like we need to resume a job multiple times (runs on the same dump folder), then the logs will come from different runs, and it is weird if we merge them together.

If we are resuming, I assume it would be ok to have them together, no? But we could think about adding a run_id to the log metadata, so users can filter by it. Another alternative is directing the logs to a different dump_folder.

and do we by default expect this to run on all ranks + every step? It's going to be like O(1M) steps per 5T tokens on each rank.

We expect to have N rank files, yes. For the gantt chart, we could add flags so people can choose to display only N ranks, for example, or only the last N entries. But the main benefit here is being able to query it with some database, e.g. if you have a NCCL timeout, to find where/when it timed out.

On the logger side, we could cycle the file, limiting it to N max lines, or breaking into new files every N entries.
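The file-cycling idea maps fairly directly onto the stdlib's `logging.handlers.RotatingFileHandler`; a sketch (byte-based rotation rather than line counts, since that is what the stdlib offers, and the helper name is hypothetical):

```python
import logging
import logging.handlers


def make_rotating_jsonl_handler(path, max_bytes=50_000_000, backups=5):
    """Rotate the per-rank jsonl file so it never grows unbounded.

    Sketch: keeps at most ``backups`` rolled-over files of ``max_bytes`` each.
    """
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(message)s"))  # raw jsonl lines
    return handler
```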


I thought that these refinements could be added as follow-up as users stress test it. Does it match your intuition?


rakkit commented Apr 30, 2026

I assume it would be ok to have them together.

Yes, putting them together is ok, it just feels weird; at least we should allow filtering by run. On Slurm clusters we usually get different nodes, and when a problem comes we usually know and can easily locate which run is broken; in that case, checking that specific run's log will be much easier and cleaner.

I thought that these refinements could be added as follow-up as users stress test it

Yeah, it's just that we recently suffered a lot from scaling. Everything at 1~2k GPUs x GPFS is a mess, and I got PTSD from this.


felipemello1 commented Apr 30, 2026

I assume it would be ok to have them together.

Yes, putting them together is ok, it just feels weird; at least we should allow filtering by run. On Slurm clusters we usually get different nodes, and when a problem comes we usually know and can easily locate which run is broken; in that case, checking that specific run's log will be much easier and cleaner.

I thought that these refinements could be added as follow-up as users stress test it

Yeah, it's just that we recently suffered a lot from scaling. Everything at 1~2k GPUs x GPFS is a mess, and I got PTSD from this.

Makes sense! I can look into adding a run_id to the metadata. But, as it is, it should be very easy for users to add whatever handler/metadata they prefer (example in the README).

Yeah, it's just that we recently suffered a lot from scaling. Everything at 1~2k GPUs x GPFS is a mess, and I got PTSD from this.

Makes sense! If/when you give it a try, let me know. I will think a bit more about remediating this in a follow up as well.


At first glance, do you see this type of logger being useful for your 1-2k GPU runs?


rakkit commented Apr 30, 2026

Makes sense! I can look into adding a run_id to the metadata. But, as it is, it should be very easy for users to add whatever handler/metadata they prefer (example in the README).

A dumb solution is to take some magic hash, or we broadcast some ID.

Speaking of that, it would help if we could also log info like "I am [global-rank] 0, [fsdp-rank] 0, [dp-rank] 0, [cp-rank] ....." etc. from each parallel dim.

At first glance, do you see this type of logger being useful for your 1-2k gpu run?

Yes, I think it's going to help at both small and large scale for debugging and finding problems. Technically we could write even a short description of the metadata structure and ask Codex/Claude to vibe-code some magic view for diagnosis. For large scale the main concern is the log itself: it should not slow down training and should be friendly to the filesystem.


felipemello1 commented Apr 30, 2026

it would help if we could also log info like "I am [global-rank] 0, [fsdp-rank] 0, [dp-rank] 0, [cp-rank] ....." etc. from each parallel dim.

We log the global rank. I think this can be done in postprocessing if we can map rank to parallel dims.

for large scale the main concern is the log itself: it should not slow down training and should be friendly to the filesystem

Agreed. Internally I've seen people do it two ways:

  1. Log to local host, and a side channel takes it and logs to database
  2. Log directly to database

The jsonl handler is a naive way of doing it, saving directly to the shared FS. For large scale, I assume one would want to change handlers if writing to the FS slows down the run. I would need to check whether Python's logger.info is blocking in this case, but that's basically all we are doing: logger.info with extra metadata.
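For what "logger.info with extra metadata" can mean concretely: the stdlib logger accepts an `extra=` dict whose keys land as attributes on the LogRecord, and a formatter can serialize them into jsonl. A naive sketch under those stdlib semantics, not the PR's actual handler:

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON line, folding in ``extra=`` fields."""

    # Attribute names every LogRecord has, so we can spot the extras.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {"msg": record.getMessage()}
        payload.update(
            {k: v for k, v in vars(record).items() if k not in self._STANDARD}
        )
        return json.dumps(payload)
```

With this formatter attached, `logger.info("forward", extra={"rank": 0, "step": 3})` emits `{"msg": "forward", "rank": 0, "step": 3}`.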


@fegin fegin left a comment


Did a quick browse; mostly looks good to me, as we already discussed internally. I have one high-level question, which I asked in the checkpointer too: if we have structured logging, should we trim some of the logger.info calls that try to record the time span of an action?

Another comment, which we don't have to address in this PR: I think we should review some code blocks. If the indentation is too deep, the block is too long, and it is actually wrapped in a time span, that code block may deserve a helper function and the decorator style, such as model init.

Finally, for @rakkit's scaling question, do you think we can add an option to log only on certain ranks? For example, sl gets the rank using dist.get_rank() and only ranks with rank % X == 0 do the logging. For OSS users, a distributed filesystem is sometimes expensive, unlike at Meta internally.
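The rank-gating suggestion could be as small as this sketch. It falls back to the `RANK` environment variable so it runs without an initialized process group; the real check would use `dist.get_rank()` as described above:

```python
import os


def should_log(rank=None, every=8):
    """Return True only on every ``every``-th rank.

    Sketch of the 'rank % X == 0' gating idea; ``rank`` would come from
    torch.distributed.get_rank() when the process group is initialized.
    """
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    return rank % every == 0
```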

Comment on lines +576 to +577
begin = time.monotonic()
logger.info("Saving the checkpoint (or staging if async is enabled).")


Do we still need this if we have the event logger?

self._save_last_step(curr_step)
return
sl.add_step_tag("checkpoint_save")
with sl.log_trace_span("checkpoint_save"):


Is it bad if we actually use a function decorator? I understand there will be a very small span for every step; I just don't know how bad it would be if we actually did this.


Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported
