Framework | Description |
---|---|
PyTorch / MXNet / TensorFlow | ... |
JAX | Still mainly used for research (JAX is Autograd plus XLA; see the sketch below) |
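To make the "Autograd and XLA" description concrete, a minimal sketch (the toy loss below is illustrative, not from any of the linked pages): `jax.grad` differentiates a plain Python function and `jax.jit` compiles it with XLA.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Toy mean-squared-error loss; any pure function of arrays works.
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

# grad() builds the gradient function (Autograd); jit() compiles it (XLA).
grad_fn = jax.jit(jax.grad(loss))

w = jnp.ones(3)
x = jnp.arange(6.0).reshape(2, 3)
y = jnp.array([1.0, 2.0])
print(grad_fn(w, x, y))  # gradient of loss w.r.t. w, shape (3,)
```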
Framework | Description |
---|---|
TensorRT | From NVIDIA; specialized support for GPU inference (ONNX export sketch below) |
AITemplate | New from Meta; renders a neural network into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference. |
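TensorRT most commonly ingests an ONNX graph, so a hedged sketch of the usual PyTorch-side first step (the ResNet-18 model, file name, and opset version are placeholders, not anything TensorRT prescribes):

```python
import torch
import torchvision

# Any eval-mode model works; ResNet-18 is just an example.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # example input fixing the graph shapes
torch.onnx.export(model, dummy, "resnet18.onnx", opset_version=17)

# The ONNX file can then be compiled into a TensorRT engine, e.g.:
#   trtexec --onnx=resnet18.onnx --fp16
```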
Framework | Description |
---|---|
triton-inference-server | Optimized cloud and edge inferencing from NVIDIA (client-side sketch below) |
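On the client side, a hedged sketch using the official `tritonclient` HTTP package (`pip install tritonclient[http]`); the model name, tensor names, and shape here are assumptions that must match the deployed model's `config.pbtxt`:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical image batch; "input__0"/"output__0" are placeholder names.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input__0", batch.shape, "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(model_name="resnet18", inputs=[inp])
print(result.as_numpy("output__0").shape)
```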
Page | Description | Possible cause |
---|---|---|
torch.inverse multi-threading RuntimeError: lazy wrapper should be called at most once | Multithreading error | Parallel unit testing; extra CUDA synchronizations |
NotImplementedError when using torch.distributed.launch for multi-GPUs | Data-parallel error | PyTorch-native DistributedDataParallel module |
Option to let DistributedDataParallel know in advance unused parameters at each forward pass | DistributedDataParallel performance | find_unused_parameters=True (first sketch below) |
PyTorch 2.1.0 Performance regression from PyTorch 2.0.1 · Issue #117081 · pytorch/pytorch (github.com) | Speed regression in the newer PyTorch version | ? |
RuntimeError: CUDA error: an illegal memory access was encountered using vmap and model ensembling | Multiple models process multiple batches of data and call into CUDA to process them (second sketch below) | ? |
Segmentation faults in DataLoader (in latest torch version) | Happens with num_workers=16, 12, 8, 4, or 3 | ? |
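For the `find_unused_parameters=True` row above, a minimal sketch of where the flag goes (the tiny model and NCCL backend are placeholders; launch with `torchrun`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = torch.nn.Linear(16, 4).to(local_rank)

# When some parameters receive no gradient in a given forward pass
# (e.g. branches skipped by control flow), this flag stops DDP's
# reducer from hanging while it waits for them, at some speed cost.
ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```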
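For the vmap/model-ensembling row, the standard `torch.func` ensembling pattern (PyTorch >= 2.0) looks roughly like this; the layer sizes and batch shapes are arbitrary:

```python
import copy
import torch
from torch.func import stack_module_state, functional_call

# Four same-architecture models whose parameters get stacked along dim 0.
models = [torch.nn.Linear(8, 2) for _ in range(4)]
params, buffers = stack_module_state(models)

# A stateless "skeleton" module on the meta device supplies the structure.
base = copy.deepcopy(models[0]).to("meta")

def forward(p, b, x):
    return functional_call(base, (p, b), (x,))

x = torch.randn(4, 32, 8)      # one minibatch per model
out = torch.vmap(forward)(params, buffers, x)
print(out.shape)               # torch.Size([4, 32, 2])
```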
Page | From | Description |
---|---|---|
PyTorch Model Performance Analysis and Optimization | Post | Performance |
PyTorch Profiler | | TensorBoard-visualized PyTorch profiler (sketch below) |
(beta) PyTorch Layer Profiler | | Detailed layer-by-layer profiler for PyTorch |
microsoft/AI-System: System for AI Education Resource. (github.com) | Microsoft | An online AI systems course to help students learn the whole stack of systems that support AI |
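As a reminder of how the TensorBoard-visualized profiler in the table is wired up, a minimal `torch.profiler` sketch (log directory and iteration counts are arbitrary):

```python
import torch
from torch.profiler import (
    ProfilerActivity, profile, schedule, tensorboard_trace_handler,
)

model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
) as prof:
    for _ in range(6):
        model(x).sum().backward()
        prof.step()  # advances the wait/warmup/active schedule

# Inspect with: tensorboard --logdir ./log/profiler
```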
Ways | Example | Common Troubleshooting |
---|---|---|
torchrun for multi-machine distributed (training-script sketch follows this table) | torchrun \<br />--nproc_per_node=4 \<br />--nnodes=2 \<br />--node_rank=0 \<br />--rdzv_id=456 \<br />--rdzv_backend=c10d \<br />--rdzv_endpoint=172.31.43.139:29603 \<br />multinode_torchrun.py 50 10 | * nodes communication<br />* network interface (firewall) |
Slurm scheduler | #SBATCH --job-name=multinode-example<br />#SBATCH --nodes=4<br />#SBATCH --ntasks=4<br />#SBATCH --gpus-per-task=1<br />#SBATCH --cpus-per-task=4<br />nodes=( $(scontrol show hostnames $SLURM_JOB_NODELIST) )<br />nodes_array=($nodes)<br />head_node=${nodes_array[0]}<br />head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address) | nodes bandwidth issues |
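A minimal sketch of what a script like `multinode_torchrun.py` from the table contains; the model, loss, and loop body are placeholders, while the env-var and `init_process_group` plumbing is the standard torchrun contract:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous info.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(8, 1).to(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    for _ in range(10):  # placeholder training loop
        x = torch.randn(32, 8, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```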