Real-time Inference dynamic loading/unloading of models #211

Open
johnml1135 opened this issue Nov 6, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@johnml1135
Collaborator

Assuming that we use clearml-serving for real-time inferencing, we may need to roll our own dynamic loading/unloading algorithm, because the core Triton Inference Server from NVIDIA does not provide this without buying the enterprise plan.

If we were to do this ourselves, we would need to make the explicit calls to the management API and implement a simple algorithm such as the following (a rough Python sketch is included after the list):

  • Assume:
    • All models are of the same size when loaded
    • The max number of instances of an individual model is 1
  • Config:
    • Number of seconds to assess usage over (rule of thumb -> 5x model loading time?)
    • Auto-unload model if not being used for x minutes (default 5?)
    • Number of model unloads within x minutes required before adding a new auto-scaled instance (default 5?)
  • Algorithm:
    • Load the model with the largest number of elements in its queue, and only pull in one at a time
    • If there is not enough space, unload the model with the oldest "last inference" time, provided it is more than n (60?) seconds old
    • Otherwise, unload the model that has an empty queue and the fewest incoming requests over the past n (60?) seconds
    • If the frequency of unloading models is greater than the threshold, add another auto-scaled instance
    • If the loaded models can fit on fewer instances than are currently scaled, gracefully consolidate
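
A minimal Python sketch of the load/unload and scale-up steps above, assuming hypothetical helpers (load_model, unload_model) that would wrap the actual management API calls, and per-model stats (queue depth, last-inference time, recent request count) gathered from the serving layer's metrics:

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class ModelStats:
    # Hypothetical per-model bookkeeping; in practice these numbers would come
    # from the serving layer's metrics.
    name: str
    loaded: bool = False
    queue_len: int = 0
    last_inference: float = 0.0    # unix timestamp of the last inference
    recent_requests: int = 0       # requests over the assessment window


@dataclass
class Config:
    assess_window_s: int = 60         # seconds to assess usage over
    idle_unload_s: int = 5 * 60       # auto-unload if unused for this long
    unloads_before_scale_up: int = 5  # unload churn that triggers scale-up
    max_loaded_models: int = 4        # stand-in for "all models are the same size"


def pick_model_to_load(models: list[ModelStats]) -> ModelStats | None:
    """Load the model with the largest queue, one at a time."""
    waiting = [m for m in models if not m.loaded and m.queue_len > 0]
    return max(waiting, key=lambda m: m.queue_len, default=None)


def pick_model_to_unload(models: list[ModelStats], cfg: Config, now: float) -> ModelStats | None:
    loaded = [m for m in models if m.loaded]
    # Prefer the model with the oldest "last inference" time, if it is old enough.
    stale = [m for m in loaded if now - m.last_inference > cfg.assess_window_s]
    if stale:
        return min(stale, key=lambda m: m.last_inference)
    # Otherwise, an idle model (empty queue) with the fewest recent requests.
    idle = [m for m in loaded if m.queue_len == 0]
    return min(idle, key=lambda m: m.recent_requests, default=None)


def rebalance(models: list[ModelStats], cfg: Config,
              load_model, unload_model, recent_unload_times: list[float]) -> None:
    """One pass of the loop; load_model/unload_model wrap the management API."""
    now = time.time()
    candidate = pick_model_to_load(models)
    if candidate is None:
        return
    if sum(m.loaded for m in models) >= cfg.max_loaded_models:
        victim = pick_model_to_unload(models, cfg, now)
        if victim is None:
            return  # nothing safe to evict; try again next pass
        unload_model(victim.name)
        victim.loaded = False
        recent_unload_times.append(now)
    load_model(candidate.name)
    candidate.loaded = True
    # If models are being unloaded too frequently, ask for another instance.
    recent_unload_times[:] = [t for t in recent_unload_times
                              if now - t < cfg.assess_window_s]
    if len(recent_unload_times) >= cfg.unloads_before_scale_up:
        print("unload churn is high; request another auto-scaled instance")
```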
@johnml1135 johnml1135 added this to the 1.4 NMT Dynamic Suggestions milestone Nov 6, 2023
@johnml1135 johnml1135 added the enhancement New feature or request label Nov 6, 2023
@ddaspit
Contributor

ddaspit commented Nov 6, 2023

We have something like this implemented for SMT models.

@johnml1135
Collaborator Author

ClearML may have it already implemented:

We support that, but this is not dynamic loading; this is just removing and adding models, which does not unload them from GRAM.
That's the main issue: when we unload the model, it is unloaded entirely. To make it dynamic, they need to be able to keep it in RAM and unload it from GRAM; that's the feature that is missing on all Triton deployments.
Does that make sense?
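
The missing piece described above (keep the weights in CPU RAM, free the GPU memory) corresponds roughly to moving a model between devices in PyTorch. A minimal sketch, assuming a CUDA device is available and using a stand-in model:

```python
import torch

# Stand-in for the loaded NMT model.
model = torch.nn.Linear(1024, 1024)

if torch.cuda.is_available():
    model = model.to("cuda")      # weights live in GPU memory ("GRAM")

    # "Unload" from the GPU while keeping the deserialized weights in CPU RAM,
    # so a later reload does not pay the unpickling cost again.
    model = model.to("cpu")
    torch.cuda.empty_cache()      # release PyTorch's cached GPU memory

    # When requests for this model arrive again, moving it back is fast.
    model = model.to("cuda")
```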

@johnml1135
Collaborator Author

Just to confirm: yes, ClearML can do automatic loading/unloading, but each load/unload will take time. There is also deserialization CPU time (imagine unpickling a 20 GB file; this takes time, and it is actually the main bottleneck, not just IO).
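
A small sketch for checking where that time goes, assuming a hypothetical checkpoint path; the raw file read (IO only) and the full load via torch.load (IO plus unpickling/deserialization) are timed separately:

```python
import time

import torch

CHECKPOINT = "model.ckpt"  # hypothetical path to a large serialized model

t0 = time.perf_counter()
with open(CHECKPOINT, "rb") as f:
    _raw = f.read()  # pure IO cost
t1 = time.perf_counter()

state = torch.load(CHECKPOINT, map_location="cpu")  # IO + deserialization
t2 = time.perf_counter()

print(f"raw read (IO only):        {t1 - t0:.1f} s")
print(f"torch.load (IO + deser.):  {t2 - t1:.1f} s")
```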

@johnml1135 johnml1135 removed this from the Serval API 1.3 milestone Dec 2, 2023
@robosina

robosina commented Jan 15, 2024

@johnml1135 @ddaspit Hello, I have a question about this: will this loading/unloading happen automatically, or do we need to do something to enable it? Also, if you could provide me with some links/documentation, I would appreciate it.

@johnml1135
Collaborator Author

@robosina - this is currently a wish-list item and a conceptual design; it has not been implemented in Serval. The core technology that would perform the loading/unloading would be https://github.com/allegroai/clearml-serving, which is a layer on top of https://www.nvidia.com/en-us/ai-data-science/products/triton-management-service/. I would review those products for dynamic loading/unloading.
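
For reference, Triton Inference Server (which clearml-serving drives underneath) exposes an explicit model load/unload HTTP API when the server is started with --model-control-mode=explicit. A minimal sketch using requests, assuming a Triton HTTP endpoint on localhost:8000 and a hypothetical model name:

```python
import requests

TRITON = "http://localhost:8000"  # assumed Triton HTTP endpoint
MODEL = "nmt_model"               # hypothetical model name in the repository

# Explicitly load the model (requires --model-control-mode=explicit).
requests.post(f"{TRITON}/v2/repository/models/{MODEL}/load").raise_for_status()

# ... serve inference requests ...

# Explicitly unload it again to free GPU memory.
requests.post(f"{TRITON}/v2/repository/models/{MODEL}/unload").raise_for_status()

# List repository models and their load state.
print(requests.post(f"{TRITON}/v2/repository/index").json())
```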
