automatic model load / unload or a lockable store extension #3583

Closed
zeruniverse opened this issue Nov 18, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@zeruniverse

I have several dozen models to serve on one Triton server, and loading them all onto the GPU at once consumes an enormous amount of GPU memory (GRAM), not to mention that some backends such as ONNX / Torch do not release memory after each execution. Therefore, I have to load / unload models from time to time to prevent CUDA OOM. My current solution is: (1) a client requests the load of all models it needs; (2) the client does inference; (3) the client requests the unload of all models it used. This solution fails if client A requests that model X be unloaded after client B has requested that model X be loaded but before client B actually calls model X; in that case client B receives an error saying model X is unavailable.
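For illustration, the flow above might look roughly like this with the `tritonclient` Python package; the model name, input shape, and tensor names are placeholders, and the server is assumed to run with `--model-control-mode=explicit`:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# (1) the client requests the load of the model(s) it needs
client.load_model("model_x")

# (2) the client does inference
inp = httpclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))
result = client.infer("model_x", inputs=[inp])

# (3) the client requests the unload of the model(s) it used
client.unload_model("model_x")

# Race: another client may call unload_model("model_x") between this client's
# load_model() and infer(), making infer() fail with "model unavailable".
```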

The ideal solution for me would be for Triton to implement automatic model load / unload: Triton detects a potential CUDA OOM and unloads other models before execution, or, whenever a CUDA OOM occurs, unloads other models and retries the inference. If a model is in the model repository but not loaded, Triton loads it automatically.

If the above cannot be done in the short term, a lockable store (serving as a counter, similar to a Python variable reference count) on the Triton server side, or anything alike, would be very helpful: with it, I can unload a model only when its reference count is 0. The proposed extension should support six methods: `register(key: str, init_val: int)`, `list(key: Optional[str])`, `add_val(key: str, val: int)`, `unregister(key: str)`, `acquire_lock(key: str, lock_timeout: int)`, `release_lock(key: str)`.
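Purely as an illustration of how the proposed (not yet existing) extension could be used, treating the six methods above as hypothetical client-side calls and assuming `list()` returns the current counter value for a key:

```python
def safe_load(triton, store, model):
    # register() is assumed to be a no-op if the key already exists
    store.register(key=model, init_val=0)
    store.acquire_lock(key=model, lock_timeout=30)
    try:
        if store.list(key=model) == 0:   # nobody else is using the model yet
            triton.load_model(model)
        store.add_val(key=model, val=1)  # one more active user
    finally:
        store.release_lock(key=model)


def safe_unload(triton, store, model):
    store.acquire_lock(key=model, lock_timeout=30)
    try:
        store.add_val(key=model, val=-1)
        if store.list(key=model) == 0:   # last user is done, safe to unload
            triton.unload_model(model)
    finally:
        store.release_lock(key=model)
```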

@GuanLuo
Contributor

GuanLuo commented Nov 23, 2021

We already have a feature request for automatic model load / unload based on heuristics, which matches your first suggestion.

For your second suggestion, I think it would be easier to have a dedicated client for sending load / unload requests: the other clients ask this proxy client to load / unload models, and the proxy client uses the counter idea you proposed to decide whether it actually communicates with the server.
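A minimal sketch of that proxy idea, assuming the other clients call `acquire()` / `release()` on this single process (e.g. over some RPC layer not shown here):

```python
import threading
from collections import Counter

import tritonclient.http as httpclient


class ModelProxy:
    """Single process that owns every load/unload call to the Triton server."""

    def __init__(self, url="localhost:8000"):
        self._client = httpclient.InferenceServerClient(url=url)
        self._counts = Counter()          # model name -> number of active users
        self._lock = threading.Lock()

    def acquire(self, model_name):
        with self._lock:
            if self._counts[model_name] == 0:
                self._client.load_model(model_name)    # first user: load
            self._counts[model_name] += 1

    def release(self, model_name):
        with self._lock:
            self._counts[model_name] -= 1
            if self._counts[model_name] == 0:
                self._client.unload_model(model_name)  # last user: unload
```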

@GuanLuo added the enhancement (New feature or request) label on Nov 23, 2021
@zeruniverse
Author

@GuanLuo Thanks for your response. After more research, I actually found a way to prevent CUDA OOM 100% of the time using the Python backend and the rate limiter, but it is extremely slow: (1) load the model into host memory during model load; (2) on execute, open a new process to handle all ONNX-related inference, including loading the model from RAM to GRAM; (3) delete the model from host memory during model unload (see the sketch after the tables below). I did some profiling with one of my models:

| method | inference time (s) |
| --- | --- |
| regular ONNX backend | 0.296 |
| ONNX backend with the client calling load / unload each time (load from /dev/shm) | 0.663 |
| Python backend | 2.24 |

With my largest (but shallower) model:

| method | inference time (s) |
| --- | --- |
| regular ONNX backend | 0.201 |
| ONNX backend with the client calling load / unload each time (load from /dev/shm) | 1.56 |
| Python backend | 5.83 |
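A rough sketch of the Python-backend workaround measured above (`model.py`); the tensor names `INPUT0` / `OUTPUT0`, the model file path, and the choice of `CUDAExecutionProvider` are placeholders, not taken from my actual models:

```python
import multiprocessing as mp

import triton_python_backend_utils as pb_utils


def _run_in_child(model_bytes, input_array, out_queue):
    # Runs in a fresh process so all GPU memory is released when it exits.
    import onnxruntime as ort
    sess = ort.InferenceSession(model_bytes, providers=["CUDAExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    out_queue.put(sess.run(None, {input_name: input_array})[0])


class TritonPythonModel:
    def initialize(self, args):
        # (1) keep only the serialized model in host RAM
        path = f"{args['model_repository']}/{args['model_version']}/model.onnx"
        with open(path, "rb") as f:
            self.model_bytes = f.read()
        self.ctx = mp.get_context("spawn")

    def execute(self, requests):
        responses = []
        for request in requests:
            arr = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            # (2) load to GRAM, run, and free GRAM inside a short-lived child process
            queue = self.ctx.Queue()
            proc = self.ctx.Process(target=_run_in_child,
                                    args=(self.model_bytes, arr, queue))
            proc.start()
            output = queue.get()
            proc.join()
            out_tensor = pb_utils.Tensor("OUTPUT0", output)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        # (3) drop the in-RAM copy when the model is unloaded
        self.model_bytes = None
```

Most of the overhead comes from spawning a process and pickling the model bytes into it on every request, which matches the slow Python backend numbers above.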

Based on my observations, I want to implement a new onnxruntime backend to solve my problem for now:

  1. during model load, it only loads the model into host memory (or, more simply, copies it to /dev/shm)
  2. during model unload, it only removes the model from host memory
  3. during model execute, it first loads the model from host memory to GRAM, then executes, and finally unloads the model from GRAM

As suggested in the backend repository, I'm here to ask whether such a backend already exists and whether you have any suggestions before I start implementing it.

@lminer

lminer commented Jan 19, 2022

I second this as a great addition.

@deadeyegoodwin
Contributor

No, there is no backend that does that kind of memory management, and there are no plans to implement it in the onnxruntime backend.
