Framework | Description |
---|---|
PyTorch / MXNet / TensorFlow | ... |
JAX | Still mainly used for research (JAX is Autograd plus XLA; see the sketch below) |
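To make the "Autograd and XLA" description concrete, a minimal sketch (the toy loss below is illustrative, not from any of the linked pages): `jax.grad` differentiates a plain Python function and `jax.jit` compiles it with XLA.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Toy mean-squared-error loss; any pure function of arrays works.
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

# grad() builds the gradient function (Autograd); jit() compiles it (XLA).
grad_fn = jax.jit(jax.grad(loss))

w = jnp.ones(3)
x = jnp.arange(6.0).reshape(2, 3)
y = jnp.array([1.0, 2.0])
print(grad_fn(w, x, y))  # gradient of loss w.r.t. w, shape (3,)
```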
Framework | Description |
---|---|
TensorRT | From NVIDIA; specialized support for GPU inference (ONNX export sketch below) |
AITemplate | New from Meta; renders a neural network into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference. |
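TensorRT most commonly ingests an ONNX graph, so a hedged sketch of the usual PyTorch-side first step (the ResNet-18 model, file name, and opset version are placeholders, not anything TensorRT prescribes):

```python
import torch
import torchvision

# Any eval-mode model works; ResNet-18 is just an example.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # example input fixing the graph shapes
torch.onnx.export(model, dummy, "resnet18.onnx", opset_version=17)

# The ONNX file can then be compiled into a TensorRT engine, e.g.:
#   trtexec --onnx=resnet18.onnx --fp16
```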
Framework | Description |
---|---|
triton-inference-server | Optimized cloud and edge inferencing from NVIDIA (client-side sketch below) |
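On the client side, a hedged sketch using the official `tritonclient` HTTP package (`pip install tritonclient[http]`); the model name, tensor names, and shape here are assumptions that must match the deployed model's `config.pbtxt`:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical image batch; "input__0"/"output__0" are placeholder names.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input__0", batch.shape, "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(model_name="resnet18", inputs=[inp])
print(result.as_numpy("output__0").shape)
```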
Page | Description | Possible cause |
---|---|---|
torch.inverse multi-threading RuntimeError: lazy wrapper should be called at most once | Multithreading error | Parallel unit testing; extra CUDA synchronizations |
NotImplementedError when using torch.distributed.launch for multi-GPUs | Data-parallel error | PyTorch-native DistributedDataParallel module |
Option to let DistributedDataParallel know in advance unused parameters at each forward pass | DistributedDataParallel performance | find_unused_parameters=True (first sketch below) |
PyTorch 2.1.0 Performance regression from PyTorch 2.0.1 · Issue #117081 · pytorch/pytorch (github.com) | Speed regression in the newer PyTorch version | ? |
RuntimeError: CUDA error: an illegal memory access was encountered using vmap and model ensembling | Multiple models process multiple batches of data and call into CUDA to process them (second sketch below) | ? |
Segmentation faults in DataLoader (in latest torch version) | Happens with num_workers=16, 12, 8, 4, or 3 | ? |
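For the `find_unused_parameters=True` row above, a minimal sketch of where the flag goes (the tiny model and NCCL backend are placeholders; launch with `torchrun`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = torch.nn.Linear(16, 4).to(local_rank)

# When some parameters receive no gradient in a given forward pass
# (e.g. branches skipped by control flow), this flag stops DDP's
# reducer from hanging while it waits for them, at some speed cost.
ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```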
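For the vmap/model-ensembling row, the standard `torch.func` ensembling pattern (PyTorch >= 2.0) looks roughly like this; the layer sizes and batch shapes are arbitrary:

```python
import copy
import torch
from torch.func import stack_module_state, functional_call

# Four same-architecture models whose parameters get stacked along dim 0.
models = [torch.nn.Linear(8, 2) for _ in range(4)]
params, buffers = stack_module_state(models)

# A stateless "skeleton" module on the meta device supplies the structure.
base = copy.deepcopy(models[0]).to("meta")

def forward(p, b, x):
    return functional_call(base, (p, b), (x,))

x = torch.randn(4, 32, 8)      # one minibatch per model
out = torch.vmap(forward)(params, buffers, x)
print(out.shape)               # torch.Size([4, 32, 2])
```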
Page | From | Description |
---|---|---|
PyTorch Model Performance Analysis and Optimization | Post | Performance |
PyTorch Profiler | | TensorBoard-visualized PyTorch profiler (sketch below) |
(beta) PyTorch Layer Profiler | | Detailed layer-by-layer profiler for PyTorch |
microsoft/AI-System: System for AI Education Resource. (github.com) | Microsoft | An online AI systems course to help students learn the whole stack of systems that support AI |
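As a reminder of how the TensorBoard-visualized profiler in the table is wired up, a minimal `torch.profiler` sketch (log directory and iteration counts are arbitrary):

```python
import torch
from torch.profiler import (
    ProfilerActivity, profile, schedule, tensorboard_trace_handler,
)

model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
) as prof:
    for _ in range(6):
        model(x).sum().backward()
        prof.step()  # advances the wait/warmup/active schedule

# Inspect with: tensorboard --logdir ./log/profiler
```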
Ways | Example | Common Troubleshooting |
---|---|---|
torchrun for multi-machine distributed (training-script sketch follows this table) | torchrun \<br />--nproc_per_node=4 \<br />--nnodes=2 \<br />--node_rank=0 \<br />--rdzv_id=456 \<br />--rdzv_backend=c10d \<br />--rdzv_endpoint=172.31.43.139:29603 \<br />multinode_torchrun.py 50 10 | * nodes communication<br />* network interface (firewall) |
Slurm scheduler | #SBATCH --job-name=multinode-example<br />#SBATCH --nodes=4<br />#SBATCH --ntasks=4<br />#SBATCH --gpus-per-task=1<br />#SBATCH --cpus-per-task=4<br />nodes=( $(scontrol show hostnames $SLURM_JOB_NODELIST) )<br />nodes_array=($nodes)<br />head_node=${nodes_array[0]}<br />head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address) | nodes bandwidth issues |
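A minimal sketch of what a script like `multinode_torchrun.py` from the table contains; the model, loss, and loop body are placeholders, while the env-var and `init_process_group` plumbing is the standard torchrun contract:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous info.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(8, 1).to(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    for _ in range(10):  # placeholder training loop
        x = torch.randn(32, 8, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```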