automatic model load / unload or a lockable store extension #3583
Comments
We have a feature request for automatic model load / unload based on heuristics, which matches your first suggestion. For your second suggestion, I think it will be easier to have a client dedicated to sending load / unload requests: the other clients ask this proxy client to load / unload models, and the proxy client uses the counter idea that you proposed to decide whether it actually communicates with the server.
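A minimal sketch of that proxy-client idea, assuming Triton runs in explicit model control mode and the tritonclient Python package is available; the ModelProxy class and its method names are illustrative, not an existing API:

```python
# Hypothetical reference-counting proxy around Triton's model control API.
# Assumes --model-control-mode=explicit on the server; names are illustrative.
import threading
from collections import defaultdict

import tritonclient.http as httpclient


class ModelProxy:
    """Serializes load/unload requests and only talks to the server
    when a model's reference count crosses the 0 <-> 1 boundary."""

    def __init__(self, url="localhost:8000"):
        self._client = httpclient.InferenceServerClient(url=url)
        self._lock = threading.Lock()
        self._refcount = defaultdict(int)

    def acquire(self, model_name):
        with self._lock:
            self._refcount[model_name] += 1
            if self._refcount[model_name] == 1:
                # First user of this model: actually load it on the server.
                self._client.load_model(model_name)

    def release(self, model_name):
        with self._lock:
            self._refcount[model_name] -= 1
            if self._refcount[model_name] == 0:
                # Last user is done: safe to unload without breaking others.
                self._client.unload_model(model_name)
```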
@GuanLuo Thanks for your response. After more research, I actually found a way to prevent CUDA OOM 100% of the time using the Python backend and the rate limiter, but it's extremely slow: (1) load the model into host memory during model load, (2) on execute, open a new process to handle all ONNX-related inference, including copying the model from RAM to GPU memory (GRAM), and (3) delete the model from memory during model unload (a sketch of this pattern follows at the end of this comment). I did some profiling with one of my models:
With my largest but shallower model:
Based on my observations, I want to implement a new onnxruntime backend to solve my problem for now:
As suggested in the backend repository, I'm here to ask whether such a backend already exists, and whether you have any suggestions before I start implementing it.
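For reference, a minimal sketch of that workaround as a Triton Python backend model.py; the model file path and the tensor names (INPUT, OUTPUT) are placeholders, not anything from this issue:

```python
# Sketch of the Python-backend workaround described above: keep the ONNX bytes
# in host memory and push all GPU work into a short-lived child process so that
# GPU memory is fully released after every execution.
import multiprocessing as mp

import triton_python_backend_utils as pb_utils


def _run_in_child(model_bytes, input_array, queue):
    # Runs inside the child process; GPU memory is freed when it exits.
    import onnxruntime as ort
    session = ort.InferenceSession(
        model_bytes, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    queue.put(session.run(None, {input_name: input_array})[0])


class TritonPythonModel:
    def initialize(self, args):
        # (1) Load the serialized model into host RAM only.
        with open("/models/my_model/1/model.onnx", "rb") as f:
            self.model_bytes = f.read()

    def execute(self, requests):
        responses = []
        for request in requests:
            input_array = pb_utils.get_input_tensor_by_name(
                request, "INPUT").as_numpy()
            # (2) Do the ONNX Runtime work in a separate process.
            ctx = mp.get_context("spawn")
            queue = ctx.Queue()
            proc = ctx.Process(
                target=_run_in_child,
                args=(self.model_bytes, input_array, queue))
            proc.start()
            output_array = queue.get()
            proc.join()
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT", output_array)]))
        return responses

    def finalize(self):
        # (3) Drop the host copy when the model is unloaded.
        self.model_bytes = None
```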
I second this as a great addition.
No, there is not a backend that does that kind of memory management, and there are no plans to implement it in the onnxruntime backend.
I have some tens of models to serve on a Triton server, and it takes a huge amount of GPU memory (GRAM) if I load them all into the GPU at once, not to mention that some backends like ONNX / torch do not release memory after each execution. Therefore, I have to load / unload models from time to time to prevent CUDA OOM. My current solution is: (1) a client requests the load of all models it needs, (2) the client does inference, and (3) the client requests the unload of all models it used (see the sketch below). This solution fails if client A requests that model X be unloaded after client B requests that model X be loaded but before client B calls model X, in which case client B receives an error saying model X is unavailable.
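A condensed sketch of that flow with the tritonclient HTTP client, assuming the server runs in explicit model control mode; the model and tensor names are placeholders:

```python
# Current (racy) client flow: load everything needed, infer, then unload.
# Assumes tritonserver was started with --model-control-mode=explicit.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
models = ["model_x", "model_y"]

# (1) Request load of all models this client needs.
for name in models:
    client.load_model(name)

# (2) Run inference.
data = np.zeros((1, 3, 224, 224), dtype=np.float32)
inputs = [httpclient.InferInput("INPUT", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
result = client.infer("model_x", inputs)

# (3) Request unload of all models this client used. If another client
# loaded "model_x" in the meantime, this unload breaks that client.
for name in models:
    client.unload_model(name)
```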
The ideal solution for me would be Triton implementing automatic model load / unload: Triton detects a potential CUDA OOM and unloads other models before execution, or, whenever a CUDA OOM occurs, Triton unloads other models and retries the inference. If a model is in the model repository but not loaded, Triton automatically loads it.
If the above cannot be done in the short term, a lockable store (serving as a counter) on the Triton server side (or anything alike) would be very helpful, similar to a Python variable's reference count: with this, I can unload a model only if its ref count is 0. The proposed extension should support 6 methods (a usage sketch follows the list):

- register(key: str, init_val: int)
- list(key: Optional[str])
- add_val(key: str, val: int)
- unregister(key: str)
- acquire_lock(key: str, lock_timeout: int)
- release_lock(key: str)
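To make the intent concrete, here is a hypothetical in-process stand-in for the proposed store; none of this exists in Triton today, and it only illustrates the intended semantics of the six methods plus the unload guard at the bottom:

```python
# Hypothetical stand-in for the proposed lockable store extension.
import threading
from typing import Dict, List, Optional


class LockableStore:
    def __init__(self):
        self._values: Dict[str, int] = {}
        self._locks: Dict[str, threading.Lock] = {}

    def register(self, key: str, init_val: int) -> None:
        self._values[key] = init_val
        self._locks[key] = threading.Lock()

    def list(self, key: Optional[str] = None) -> List[str]:
        if key is not None:
            return [key] if key in self._values else []
        return sorted(self._values)

    def add_val(self, key: str, val: int) -> int:
        self._values[key] += val
        return self._values[key]

    def unregister(self, key: str) -> None:
        del self._values[key], self._locks[key]

    def acquire_lock(self, key: str, lock_timeout: int) -> bool:
        return self._locks[key].acquire(timeout=lock_timeout)

    def release_lock(self, key: str) -> None:
        self._locks[key].release()


# Intended usage: only unload model X when its ref count drops to zero.
store = LockableStore()
store.register("model_x", 0)

store.acquire_lock("model_x", lock_timeout=5)
store.add_val("model_x", 1)        # a client starts using model_x -> load it
store.release_lock("model_x")

# ... inference happens ...

store.acquire_lock("model_x", lock_timeout=5)
if store.add_val("model_x", -1) == 0:
    pass  # safe to call unload_model("model_x") on the server
store.release_lock("model_x")
```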