PyTorch optimizations

This is just an overview and collection of references.

Potential optimizations (speed and/or memory):

  • Distributed / multi-GPU training. RETURNN config: see the torch_distributed option
  • Automatic mixed precision (AMP), e.g. to use float16 (fp16). RETURNN config: torch_amp = "float16"
  • PyTorch scripting and tracing (https://github.com/rwth-i6/returnn/issues/1436)
  • torch.compile (see the standalone PyTorch sketch below)
  • TorchDynamo (the graph-capture frontend that torch.compile builds on)
  • torch.optim._multi_tensor.AdamW (the multi-tensor "foreach" AdamW implementation)
  • apex.optimizers.FusedAdam (might be integrated into PyTorch? https://github.com/pytorch/pytorch/issues/71274)
  • Asynchronous data loading and augmentation. RETURNN config: torch_dataloader_opts = {"num_workers": 1}; if more workers are needed, consider combining this with MultiProcDataset, see here. The RETURNN config options from this list are combined in the sketch below.
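
A minimal sketch of how the RETURNN config options mentioned above could look together in a config file. The values are only examples; in particular, the empty torch_distributed dict (meant to enable the defaults) and the single data-loader worker are assumptions that depend on the setup:

```python
# Hedged RETURNN config fragment; values are illustrative, not recommendations.

# Distributed / multi-GPU training (assumption: an empty dict enables the defaults).
torch_distributed = {}

# Automatic mixed precision with float16.
torch_amp = "float16"

# Asynchronous data loading; for more workers, MultiProcDataset may be needed.
torch_dataloader_opts = {"num_workers": 1}
```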

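For the PyTorch-level items (torch.compile, TorchDynamo, fused AdamW), a minimal standalone sketch outside of RETURNN. It assumes PyTorch 2.x with a CUDA device; the fused=True flag (available in recent PyTorch versions) is closely related to what apex.optimizers.FusedAdam provides:

```python
import torch

model = torch.nn.Linear(128, 128).cuda()

# torch.compile uses TorchDynamo to capture the graph and emits optimized kernels.
compiled_model = torch.compile(model)

# AdamW with the fused CUDA implementation (assumption: the installed PyTorch is
# new enough to support fused=True; parameters must live on the GPU here).
optimizer = torch.optim.AdamW(compiled_model.parameters(), lr=1e-3, fused=True)

x = torch.randn(8, 128, device="cuda")
loss = compiled_model(x).sum()
loss.backward()
optimizer.step()
```
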
References: