Debugging and Troubleshooting

Guides

Tools

  • Debug Tools

  • torch-distributed-gpu-test.py - a torch.distributed diagnostics script that checks that all GPUs in the cluster (one or many nodes) can talk to each other and allocate GPU memory.

  • NicerTrace - an improved version of Python's trace module, with several additional flags added to the constructor and more useful output.
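
For orientation, here is a minimal sketch of the standard-library trace module that NicerTrace is described as improving on. It does not use NicerTrace itself or any of its additional flags. The add function is a hypothetical stand-in for the code you want to inspect, and the counts attribute read at the end is an implementation detail of the stdlib results object rather than a documented interface.

```python
import sys
import trace


def add(a, b):
    # Hypothetical function under inspection.
    total = a + b
    return total


# count=1 records how often each line runs; trace=0 disables the
# line-by-line echo. ignoredirs keeps the standard library out of the report.
tracer = trace.Trace(count=1, trace=0, ignoredirs=[sys.prefix, sys.exec_prefix])

# runfunc calls the function under the tracer and returns its result.
result = tracer.runfunc(add, 2, 3)

# counts maps (filename, lineno) to the number of times that line executed.
# It may be empty when the code is run without a backing source file.
counts = tracer.results().counts
for (filename, lineno), hits in sorted(counts.items()):
    print(f"{filename}:{lineno} executed {hits}x")
```

Setting trace=1 instead prints every line as it executes, which is the mode that gets noisy on large programs and is where extra filtering flags become useful.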