📚 Documentation
[Feature Request / Documentation Improvement] Improve PyTorch/XLA Documentation and Clarify SPMD Usage
Hello PyTorch/XLA team,
During my TPU grant I ran into many undocumented pitfalls and unclear behaviors, which made the setup process very time-consuming and confusing.
I'd like to ask for clarification and improvements on several key points that cost me significant time.
The documentation may well seem clear to experienced users, but on a first read it relies on many implicit assumptions and leaves out important explanations.
General Request
Please improve the documentation — make it more explicit and practical, especially for multi-host and SPMD setups.
For example, while it’s indeed mentioned in the Running on TPU Pods section that the code must be launched on all hosts, this information is buried too deep and is not referenced in other critical sections like “Troubleshooting Basics.”
It would be much clearer if you placed a visible note near the top of the documentation saying something like:
⚠️ For multi-host TPU setups, you must launch the code on all hosts simultaneously.
See Running on TPU Pods (multi-host) for details.
This would help avoid confusion, since right now it’s easy to miss and leads to situations where the code just hangs with no clear reason.
Specific Questions and Issues
- What is recommended to use — `.launch` or `spmd`?
- Should SPMD be started on all hosts as well?
- In SPMD, is the batch size global or per-host?
- How is data distributed if each process sees all devices and I have 4 hosts with 4 devices each?
- If the batch size is global, what is the purpose of having multiple hosts? Only for data loading?
- How does XLA decide what data goes to which device — does it shard across all devices globally or only locally per host?
- How to correctly use `scan`/`scan_layers` if the transformer block takes multiple arguments and one of them is of type `torch.bool`?
- `assume_pure` seems to break if the model contains `nn.Parameter`. Is it even correct to use it like that?
- Can I reuse "params and buffers" between steps, or should I retrieve them every time before a training pass?
- `syncfree.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0)` seems to trigger recompilation around step ~323 (possibly due to `beta2`, not sure).
- In SPMD, how to correctly get the process ID? `world_size` and `global_ordinal` don't work. Should I use `process_index`? `is_master_ordinal(local=False)` also doesn't work. (See the logging sketch after this list.)
- Please add a note to the docs: when logging, it's better to use `flush=True`, otherwise logs might not appear (which is confusing). Also, wrap training code in `try/except`, since exceptions sometimes don't log either.
- How can I perform sampling and logging in SPMD mode if I want only one host to handle these tasks (not all hosts)? (Also covered by the logging sketch after this list.)
- Please provide fully explicit examples — with comments, no abstractions, step-by-step explanations of what each part does and how it can be modified.
- Compilation caching seems broken — when trying to load, it says "not implemented."
- Can I pass only one `input_sharding=xs.ShardingSpec(mesh, ('fsdp', None))` to `MpDeviceLoader` if my dataset returns a tuple of 10 tensors with different shapes? (See the loader sketch after this list.)
- `xm.rendezvous` seems to do nothing in SPMD mode (at least before the training loop).
- How to verify that all hosts are actually training one shared model, and not each training separately?
- In the docs, `HybridMesh(ici_mesh_shape, dcn_mesh_shape, ('data', 'fsdp', 'tensor'))` is shown, but in practice it only works if you pass named arguments like `ici_mesh_shape=ici_mesh_shape`, otherwise it errors out. (See the mesh sketch after this list.)
- How to correctly do gradient checkpointing per layer with FSDP?
- How to correctly do gradient clipping? (See the clipping sketch after this list.)
- If model weights are expected to remain in FP32 when using `autocast`, please explicitly state that in the training docs — it would help avoid second-guessing.
- What is a reasonable compilation time during training? Mine can take 20–30 minutes.
- What are the actual intended purposes of `torch_xla.step()` and `torch_xla.compile()`? Since PyTorch/XLA already compiles and executes lazily, it's unclear when and why these should be used explicitly.
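Code Sketches (for Context)
To make some of the questions above more concrete, here are minimal sketches of what I am currently doing. They are not meant as correct reference code; the names and shapes are placeholders and several calls are guesses on my part. First, how I try to get a process ID and restrict logging/sampling to a single host, assuming `torch_xla.runtime.process_index()` is the intended API under SPMD:

```python
import torch_xla.runtime as xr

def is_logging_host() -> bool:
    # Assumption on my side: process_index() is 0 on exactly one host,
    # so it can play the role of a "master ordinal" under SPMD.
    return xr.process_index() == 0

def log(msg: str) -> None:
    if is_logging_host():
        # flush=True, otherwise the output sometimes never shows up
        print(msg, flush=True)
```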
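Next, the data-loading setup behind the `MpDeviceLoader` question. The dataset, batch size, and mesh are placeholders; in my real code each batch is a tuple of 10 tensors with different ranks, and I don't know whether a single `ShardingSpec` is meant to cover all of them (the sketch below may not even run when the ranks differ, which is exactly the question):

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
import torch_xla.distributed.parallel_loader as pl
from torch.utils.data import DataLoader, TensorDataset

xr.use_spmd()
device = xm.xla_device()

# Placeholder 1-D mesh over all global devices, sharded along 'fsdp'.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('fsdp',))

# Placeholder dataset: in reality each item is a tuple of 10 tensors
# with different shapes (including 1-D tensors).
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32)

# A single spec for the whole batch tuple -- is this the intended usage
# when the tuple elements have different ranks?
train_loader = pl.MpDeviceLoader(
    loader,
    device,
    input_sharding=xs.ShardingSpec(mesh, ('fsdp', None)),
)
```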
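The `HybridMesh` call that actually works for me uses keyword arguments; the positional form from the docs errors out. The mesh shapes below are my guess for a single v4-32 slice and may well be wrong (and I'm not 100% sure `axis_names` is the right keyword for the third argument), which is part of the question:

```python
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()

# Works for me only with keyword arguments.
mesh = xs.HybridMesh(
    ici_mesh_shape=(1, 16, 1),  # ('data', 'fsdp', 'tensor') within the slice -- placeholder shape
    dcn_mesh_shape=(1, 1, 1),   # single slice, so no DCN-level parallelism -- placeholder shape
    axis_names=('data', 'fsdp', 'tensor'),
)

# The positional form shown in the docs fails for me:
# mesh = xs.HybridMesh((1, 16, 1), (1, 1, 1), ('data', 'fsdp', 'tensor'))
```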
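Finally, the training step where I am unsure about gradient clipping. This assumes the stock `torch.nn.utils.clip_grad_norm_` is the right tool on XLA/SPMD tensors, which is exactly what I would like the docs to confirm or correct:

```python
import torch
import torch_xla.core.xla_model as xm

def train_step(model, optimizer, batch, max_grad_norm=1.0):
    inputs, labels = batch
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    # Assumption: the stock utility works on sharded XLA tensors as-is.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    xm.mark_step()  # flush the lazily traced graph for this step
    return loss
```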
All of this was tested on a v4-32 TPU.
Maybe some of it is covered somewhere in the docs and I just missed it, but I hope you can clarify and improve the documentation.
Thank you for your time and support.