[Feature]: Expose implementation details of the KV Cache #776
Comments
Are you sure you need KV cache access? If you just want to pre-process a prompt and re-use it, it sounds more like you want to save/load states. The high-level executors (Instruct/Interact Executor) expose methods to save a file which can be used to load that conversation back up later; I believe that should contain the saved KV cache for that sequence. There are various ways to access the KV cache, exposed in a few different places. The "raw" low-level API is in …
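For example, saving and restoring an executor's conversation might look roughly like this (a sketch from memory; the exact `SaveState`/`LoadState` names and signatures should be treated as assumptions rather than the confirmed API):

```csharp
// Rough sketch only: persist a conversation from a high-level executor and resume it
// later. SaveState/LoadState and the surrounding setup are written from memory of the
// LLamaSharp API and may not match the current surface exactly.
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// ... run the conversation; this populates the KV cache for the sequence ...

// Persist everything needed to resume (this should include the evaluated KV cache).
executor.SaveState("conversation.state");

// Later, or in another process: rebuild the executor and restore the saved state
// instead of re-evaluating the whole prompt.
executor.LoadState("conversation.state");
```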
Yep, I've thought about using those APIs, but I believe my use case is a bit more specific. The prompts themselves aren't static, but parts of the prompt are. For example, with RAG, the prompt might be:
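Roughly like this, where the instruction block stays the same for every request and only the retrieved context and question change (illustrative strings only, not my real prompt):

```csharp
// Illustrative only: the instruction section of the prompt is static across requests,
// while the retrieved documents and the user question are different each time.
const string StaticPrefix =
    "You are a helpful assistant. Answer using only the provided context.\n" +
    "Context:\n";

string BuildPrompt(string retrievedContext, string question) =>
    StaticPrefix + retrievedContext + "\nQuestion: " + question + "\nAnswer: ";
```

Only the `StaticPrefix` part is worth pre-computing; everything after it has to be evaluated per request anyway.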
I've also thought about using the APIs exposed in … I haven't tested this yet, but I'm not sure the …
Actually, I dug a little bit through the llama.cpp code base, and it seems that all the …
For an overall API design, llama-cpp-python exposes a …
Yep, so my suggestion was to evaluate everything before …
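In other words, something like this (a sketch; `LLamaContext.SaveState`/`LoadState` are recalled from memory and the actual evaluation calls are elided):

```csharp
// Sketch of the suggested flow; LLamaContext.SaveState/LoadState are assumed
// from memory of the LLamaSharp API and the evaluation steps are elided.
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);

// Offline, one time: evaluate only the static prefix so its KV cache is built,
// then snapshot the whole context state to disk.
// ... tokenize + decode the static prefix here ...
context.SaveState("static-prefix.state");

// Per request: restore the snapshot instead of paying the prefix evaluation cost,
// then evaluate only the dynamic part (retrieved context + question) and generate.
context.LoadState("static-prefix.state");
// ... tokenize + decode the dynamic suffix, then sample ...
```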
Some of the functions here are specifically for debugging only, but not all of them (that is mentioned in the comments). The debugging ones are exposed in a higher-level wrapper through e.g. …
I can't find any docs on LLamaCache; do you have a link to any? From a look at the implementation here, it looks like it automatically loads and saves states (using …
Yeah, I just read through the code; it's not really documented. Here's where it's actually used though: https://github.com/abetlen/llama-cpp-python/blob/165b4dc6c188f8fda2fc616154e111f710484eba/llama_cpp/llama.py#L1073C1-L1089C1. It seems …
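So the core idea there seems to be a lookup of saved states keyed by the evaluated token prefix. A very rough C# analogue of that idea, purely illustrative and not an existing LLamaSharp type, might be:

```csharp
// Purely illustrative analogue of llama-cpp-python's cache idea: saved context states
// keyed by the token prefix that produced them, so a new prompt that shares a prefix
// can start from the longest matching saved state instead of evaluating from scratch.
using System.Collections.Generic;
using System.Linq;

class PrefixStateCache
{
    // Maps an evaluated token prefix to the serialized context state bytes.
    private readonly Dictionary<string, byte[]> _states = new();

    private static string Key(IEnumerable<int> tokens) => string.Join(",", tokens);

    public void Store(IReadOnlyList<int> prefixTokens, byte[] stateData) =>
        _states[Key(prefixTokens)] = stateData;

    // Returns the saved state with the longest prefix of the new prompt, if any.
    public (int matchedLength, byte[] state)? LookupLongestPrefix(IReadOnlyList<int> promptTokens)
    {
        for (int len = promptTokens.Count; len > 0; len--)
        {
            if (_states.TryGetValue(Key(promptTokens.Take(len)), out var state))
                return (len, state);
        }
        return null;
    }
}
```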
Background & Description
Since I'm using LLamaSharp on the edge, initial prompt evaluation is a high fixed cost. My prompts are (for the most part) static, so I'd like to pre-initialize the KV cache and have end users consume it as a gguf file, trading compute for storage. Based on my read of the code base, there is no specific abstraction for the llama.cpp KV store. This probably ties into #684.
I'm interested in working on this; I'm just noting the idea down to avoid any conflicts and for planning purposes.
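Roughly the shape of usage I'm after (every KV-cache-related name below is hypothetical, nothing like it exists in LLamaSharp today):

```csharp
// Hypothetical usage: pre-compute the KV cache for the static prompt once at build
// time, ship the resulting file with the app, and have end users load it instead of
// paying the prompt evaluation cost on device. Only the model/context setup is real API.
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);

// Build time (hypothetical API): evaluate the static prompt and export its KV cache.
// context.EvaluateAndExportKvCache("static-prompt.kv");

// Run time on the edge device (hypothetical API): import the pre-computed cache.
// context.ImportKvCache("static-prompt.kv");
```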
API & Usage
Are KV caches specific to a model? Maybe there should be a whole separate class that handles the KV cache abstraction; llama.cpp already has APIs designed to manipulate it programmatically. A rough sketch of what that could look like is below.
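For instance, a wrapper along these lines (all type and method names are hypothetical; the comments name the llama.cpp functions each method would forward to):

```csharp
// Hypothetical high-level wrapper over the llama.cpp KV cache; none of these types
// exist in LLamaSharp today. Each method would forward to the corresponding native
// llama_kv_cache_* function on the underlying context handle.
using LLama;

public sealed class KvCache
{
    private readonly LLamaContext _context;

    public KvCache(LLamaContext context) => _context = context;

    // Remove all cells (llama_kv_cache_clear).
    public void Clear() { /* native call */ }

    // Remove the cells for one sequence in positions [start, end) (llama_kv_cache_seq_rm).
    public void Remove(int sequenceId, int start, int end) { /* native call */ }

    // Copy cells from one sequence to another (llama_kv_cache_seq_cp).
    public void Copy(int sourceSequenceId, int destinationSequenceId, int start, int end) { /* native call */ }

    // Keep only the given sequence and drop everything else (llama_kv_cache_seq_keep).
    public void Keep(int sequenceId) { /* native call */ }
}
```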
How to implement
Expose the llama_kv_cache_* functions in a high-level manner.
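To the extent the raw functions aren't already bound, the native declarations would look approximately like this (signatures paraphrased from the llama.cpp header and may differ between versions; the library name is an assumption):

```csharp
// Approximate shape of P/Invoke declarations for the llama.cpp KV cache functions.
// The exact signatures in llama.h (and in LLamaSharp's existing NativeApi) may differ.
using System;
using System.Runtime.InteropServices;

internal static class KvCacheNative
{
    private const string LibraryName = "llama"; // assumption: same native library LLamaSharp loads

    // void llama_kv_cache_clear(struct llama_context * ctx);
    [DllImport(LibraryName)]
    public static extern void llama_kv_cache_clear(IntPtr ctx);

    // bool llama_kv_cache_seq_rm(struct llama_context * ctx, llama_seq_id seq_id, llama_pos p0, llama_pos p1);
    [DllImport(LibraryName)]
    [return: MarshalAs(UnmanagedType.I1)]
    public static extern bool llama_kv_cache_seq_rm(IntPtr ctx, int seq_id, int p0, int p1);

    // void llama_kv_cache_seq_cp(struct llama_context * ctx, llama_seq_id src, llama_seq_id dst, llama_pos p0, llama_pos p1);
    [DllImport(LibraryName)]
    public static extern void llama_kv_cache_seq_cp(IntPtr ctx, int src, int dst, int p0, int p1);

    // void llama_kv_cache_seq_keep(struct llama_context * ctx, llama_seq_id seq_id);
    [DllImport(LibraryName)]
    public static extern void llama_kv_cache_seq_keep(IntPtr ctx, int seq_id);
}
```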