Description
Describe the feature request
NVIDIA has introduced a feature called time-slicing on GPUs (see here and here). However, this feature doesn't natively support memory isolation between replicas: unlike Multi-Instance GPU (MIG), there is no memory or fault isolation between replicas, but for some workloads this is better than not being able to share at all.
As far as I can tell, ONNX doesn't currently support safely managing GPU memory under GPU time-slicing; I've seen errors that appear to result from memory interference between replicas. Is this something that folks have considered supporting for ONNX, or have I missed some existing support?
(See https://bruce-lee-ly.medium.com/nvidia-gpu-virtual-memory-management-7fdc4122226b for reference).
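To make the request concrete, here is a minimal sketch of the kind of per-replica memory budgeting I have in mind. It assumes ONNX Runtime's CUDA execution provider options `device_id`, `gpu_mem_limit`, and `arena_extend_strategy`; the helper name `replica_cuda_providers` and the 10% headroom figure are hypothetical, chosen only for illustration. As I understand it, `gpu_mem_limit` only caps the runtime's own memory arena in one process, so this is a cooperative budget and not real isolation between replicas.

```python
def replica_cuda_providers(total_gpu_bytes, replicas, headroom_percent=10, device_id=0):
    """Build a providers list that caps one replica's share of GPU memory.

    Hypothetical helper: splits the GPU's memory evenly across replicas after
    reserving some headroom (an assumed figure) for CUDA contexts and other
    allocations that the arena limit does not cover.
    """
    if replicas <= 0:
        raise ValueError("replicas must be a positive integer")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")

    # Integer arithmetic keeps the budget exact for large byte counts.
    usable_bytes = total_gpu_bytes * (100 - headroom_percent) // 100
    per_replica_bytes = usable_bytes // replicas

    cuda_options = {
        "device_id": device_id,
        # Upper bound on the size of the CUDA memory arena for this session.
        "gpu_mem_limit": per_replica_bytes,
        # Grow the arena by the requested amount instead of by powers of two.
        "arena_extend_strategy": "kSameAsRequested",
    }
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


# Example: three time-sliced replicas sharing one 16 GiB GPU.
providers = replica_cuda_providers(16 * 1024**3, replicas=3)

# Each replica would then create its session with this list, for example:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Even if every replica applies a budget like this, nothing stops a misbehaving replica from exceeding it or from faulting in a way that affects the others, which is why I'm asking whether something more robust is supported or planned.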
Describe scenario use case
Being able to run multiple small services on a single GPU can lead to cost savings across a broad range of applications by avoiding renting more GPUs than necessary. In my particular use case, it would cut the number of required GPUs by about two-thirds. This also has some minor indirect environmental benefits, as it reduces the amount of electricity required for computation.