automatic model load / unload or a lockable store extension #3583
Comments
We have a feature request for automatic model load / unload based on heuristics, which matches your first suggestion. For your second suggestion, I think it will be easier to have a client dedicated to sending load / unload requests: the other clients ask this proxy client to load / unload models, and the proxy client uses the counter idea that you proposed to decide whether it actually communicates with the server.
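A minimal sketch of that proxy-client idea, assuming Triton runs in explicit model control mode and the tritonclient Python package is available; the ModelProxy class and its method names are illustrative, not an existing API:

```python
# Hypothetical reference-counting proxy around Triton's model control API.
# Assumes --model-control-mode=explicit on the server; names are illustrative.
import threading
from collections import defaultdict

import tritonclient.http as httpclient


class ModelProxy:
    """Serializes load/unload requests and only talks to the server
    when a model's reference count crosses the 0 <-> 1 boundary."""

    def __init__(self, url="localhost:8000"):
        self._client = httpclient.InferenceServerClient(url=url)
        self._lock = threading.Lock()
        self._refcount = defaultdict(int)

    def acquire(self, model_name):
        with self._lock:
            self._refcount[model_name] += 1
            if self._refcount[model_name] == 1:
                # First user of this model: actually load it on the server.
                self._client.load_model(model_name)

    def release(self, model_name):
        with self._lock:
            self._refcount[model_name] -= 1
            if self._refcount[model_name] == 0:
                # Last user is done: safe to unload without breaking others.
                self._client.unload_model(model_name)
```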
@GuanLuo Thanks for your response. After more research, I actually found a way to prevent CUDA OOM 100% of the time using the Python backend and the rate limiter, but it's extremely slow: (1) load the model into host memory during model load, (2) on execute, open a new process to handle all ONNX-related inference, including copying the model from RAM to GPU memory (GRAM), and (3) delete the model from memory during model unload (a sketch of this pattern follows at the end of this comment). I did some profiling with one of my models:
With my largest but shallower model:
Based on my observations, I want to implement a new onnxruntime backend to solve my problem for now:
As suggested in the backend repository, I'm here to ask whether such a backend already exists, and whether you have any suggestions before I start implementing it.
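For reference, a minimal sketch of that workaround as a Triton Python backend model.py; the model file path and the tensor names (INPUT, OUTPUT) are placeholders, not anything from this issue:

```python
# Sketch of the Python-backend workaround described above: keep the ONNX bytes
# in host memory and push all GPU work into a short-lived child process so that
# GPU memory is fully released after every execution.
import multiprocessing as mp

import triton_python_backend_utils as pb_utils


def _run_in_child(model_bytes, input_array, queue):
    # Runs inside the child process; GPU memory is freed when it exits.
    import onnxruntime as ort
    session = ort.InferenceSession(
        model_bytes, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    queue.put(session.run(None, {input_name: input_array})[0])


class TritonPythonModel:
    def initialize(self, args):
        # (1) Load the serialized model into host RAM only.
        with open("/models/my_model/1/model.onnx", "rb") as f:
            self.model_bytes = f.read()

    def execute(self, requests):
        responses = []
        for request in requests:
            input_array = pb_utils.get_input_tensor_by_name(
                request, "INPUT").as_numpy()
            # (2) Do the ONNX Runtime work in a separate process.
            ctx = mp.get_context("spawn")
            queue = ctx.Queue()
            proc = ctx.Process(
                target=_run_in_child,
                args=(self.model_bytes, input_array, queue))
            proc.start()
            output_array = queue.get()
            proc.join()
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT", output_array)]))
        return responses

    def finalize(self):
        # (3) Drop the host copy when the model is unloaded.
        self.model_bytes = None
```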
I second this as a great addition.
No, there is not a backend that does that kind of memory management, and there are no plans to implement it in the onnxruntime backend.
I have some tens of models to serve on a Triton server, and it takes a huge amount of GPU memory (GRAM) if I load them all into the GPU at once, not to mention that some backends like ONNX / torch do not release memory after each execution. Therefore, I have to load / unload models from time to time to prevent CUDA OOM. My current solution is: (1) a client requests the load of all models it needs, (2) the client does inference, and (3) the client requests the unload of all models it used (see the sketch below). This solution fails if client A requests that model X be unloaded after client B requests that model X be loaded but before client B calls model X, in which case client B receives an error saying model X is unavailable.
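A condensed sketch of that flow with the tritonclient HTTP client, assuming the server runs in explicit model control mode; the model and tensor names are placeholders:

```python
# Current (racy) client flow: load everything needed, infer, then unload.
# Assumes tritonserver was started with --model-control-mode=explicit.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
models = ["model_x", "model_y"]

# (1) Request load of all models this client needs.
for name in models:
    client.load_model(name)

# (2) Run inference.
data = np.zeros((1, 3, 224, 224), dtype=np.float32)
inputs = [httpclient.InferInput("INPUT", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
result = client.infer("model_x", inputs)

# (3) Request unload of all models this client used. If another client
# loaded "model_x" in the meantime, this unload breaks that client.
for name in models:
    client.unload_model(name)
```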
The ideal solution for me would be Triton implementing automatic model load / unload: Triton detects a potential CUDA OOM and unloads other models before execution, or, whenever a CUDA OOM occurs, Triton unloads other models and retries the inference. If a model is in the model repository but not loaded, Triton automatically loads it.
If the above cannot be done in the short term, a lockable store (serving as a counter) on the Triton server side (or anything alike) would be very helpful, similar to a Python variable's reference count: with this, I can unload a model only if its ref count is 0. The proposed extension should support 6 methods (a usage sketch follows the list):

- register(key: str, init_val: int)
- list(key: Optional[str])
- add_val(key: str, val: int)
- unregister(key: str)
- acquire_lock(key: str, lock_timeout: int)
- release_lock(key: str)
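To make the intent concrete, here is a hypothetical in-process stand-in for the proposed store; none of this exists in Triton today, and it only illustrates the intended semantics of the six methods plus the unload guard at the bottom:

```python
# Hypothetical stand-in for the proposed lockable store extension.
import threading
from typing import Dict, List, Optional


class LockableStore:
    def __init__(self):
        self._values: Dict[str, int] = {}
        self._locks: Dict[str, threading.Lock] = {}

    def register(self, key: str, init_val: int) -> None:
        self._values[key] = init_val
        self._locks[key] = threading.Lock()

    def list(self, key: Optional[str] = None) -> List[str]:
        if key is not None:
            return [key] if key in self._values else []
        return sorted(self._values)

    def add_val(self, key: str, val: int) -> int:
        self._values[key] += val
        return self._values[key]

    def unregister(self, key: str) -> None:
        del self._values[key], self._locks[key]

    def acquire_lock(self, key: str, lock_timeout: int) -> bool:
        return self._locks[key].acquire(timeout=lock_timeout)

    def release_lock(self, key: str) -> None:
        self._locks[key].release()


# Intended usage: only unload model X when its ref count drops to zero.
store = LockableStore()
store.register("model_x", 0)

store.acquire_lock("model_x", lock_timeout=5)
store.add_val("model_x", 1)        # a client starts using model_x -> load it
store.release_lock("model_x")

# ... inference happens ...

store.acquire_lock("model_x", lock_timeout=5)
if store.add_val("model_x", -1) == 0:
    pass  # safe to call unload_model("model_x") on the server
store.release_lock("model_x")
```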