Open
Labels
performance: issues related to performance regressions
quantization: issues related to quantization
stale: issues that have not been addressed in a while; categorized by a bot
Description
Describe the issue
I observed that ORT takes 11541.5 MB of GPU memory with the CUDAExecutionProvider while quantizing a model of size 1.3 GB. The model has a single input of shape 1x2x1024x2048. I was able to reduce the memory usage with the following options, but it won't go below the figure above.
```python
sess_options.add_session_config_entry("session.use_device_allocator_for_initializers", "1")
providers = [("CUDAExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"})]
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", f"cpu:0;{gpu_str}")
```
Is there a more optimal options configuration that can reduce the GPU memory utilization even further?
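For context, here is a sketch of how the three options above fit together in one session setup. The model path, input name, and the `gpu_mem_limit` value are placeholders; `gpu_mem_limit` is an additional CUDA EP option that hard-caps the arena size and may help bound usage further:

```python
import numpy as np
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Allocate initializers directly with the device allocator instead of the arena.
sess_options.add_session_config_entry(
    "session.use_device_allocator_for_initializers", "1")

providers = [
    ("CUDAExecutionProvider", {
        # Grow the arena only by the requested amount instead of doubling.
        "arena_extend_strategy": "kSameAsRequested",
        # Optional hard cap on the arena, in bytes (here 4 GiB; placeholder value).
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,
    }),
    "CPUExecutionProvider",
]

# "model.onnx" and the input name "input" are placeholders.
sess = ort.InferenceSession("model.onnx", sess_options, providers=providers)

run_options = ort.RunOptions()
# Shrink the CPU and GPU memory arenas back down after each run.
run_options.add_run_config_entry(
    "memory.enable_memory_arena_shrinkage", "cpu:0;gpu:0")

x = np.zeros((1, 2, 1024, 2048), dtype=np.float32)
outputs = sess.run(None, {"input": x}, run_options)
```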
To reproduce
NA
Urgency
NA
Platform
Linux
OS Version
Ubuntu 24.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.22
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.9
Model File
NA
Is this a quantized model?
Yes