Preload/Unload Ollama models before prompting #116
Comments
So the problem is to know when the preloading process has finished, right? An empty request starts the preload... Could it be that this request is completed once the model is finished loading?
Would say yes. If the model answers, it is loaded.
@bauersimon @zimmski updated the description on "check if model is loaded"
I meant that I believe the API only completes the request once the model is loaded (I think it's happening here). So there is no need for the artificial "respond with y" query. Easy to verify with a server, curl, and two differently sized models: if the response to an empty request is consistently slower for the bigger model, the response must happen only when the loading is done.
Checked again and now I am 100% certain we don't need to send a dummy prompt, as the API also responds to an empty request only after the model is loaded. The Ollama API has a response property for this. I also tried this with some curl requests: an empty request that triggers loading takes noticeably longer than asking an already loaded model.
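The timing check described above can be reproduced with a short script. This is a minimal sketch, assuming the default Ollama endpoint `http://localhost:11434` and the public `/api/generate` route, where a request carrying only a model name (no `prompt`) makes Ollama load that model and reply once loading has finished:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint (assumption)


def preload_payload(model: str) -> dict:
    # A generate request without a "prompt" field asks Ollama to load the
    # model; the HTTP response only arrives once loading has finished.
    return {"model": model}


def preload(model: str) -> None:
    """Blocks until the given model is loaded, per the observation above."""
    data = json.dumps(preload_payload(model)).encode()
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request).close()
```

Timing `preload` for two differently sized models is the experiment proposed above: the bigger model's empty request should consistently take longer to return.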
…e we only measure the raw inference time Closes #116
@Munsio / @ruiAzevedo19 since #121 is merged, is this issue done? Or did you encounter any other things we need to take a look at?
For better measurements we need to preload the Ollama model before prompting it. We also need to clean up afterwards.
Tasks:
- Use `ollama ps` to check all loaded models
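A matching sketch for the cleanup step, assuming Ollama's documented `keep_alive` request parameter (setting it to `0` unloads the model immediately) and the `ollama ps` CLI command for listing loaded models; the endpoint and model name are placeholders:

```python
import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint (assumption)


def unload_payload(model: str) -> dict:
    # "keep_alive": 0 asks Ollama to unload the model right after
    # handling this prompt-less request.
    return {"model": model, "keep_alive": 0}


def unload(model: str) -> None:
    """Ask the Ollama server to unload the model from memory."""
    data = json.dumps(unload_payload(model)).encode()
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request).close()


def loaded_models() -> str:
    # "ollama ps" prints the currently loaded models, which lets the
    # benchmark verify that cleanup actually happened.
    result = subprocess.run(["ollama", "ps"], capture_output=True, text=True)
    return result.stdout
```

Checking `loaded_models()` after `unload(...)` is one way to confirm the cleanup task from the list above.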