
Conversation

tc-wolf (Owner) commented Oct 15, 2024

Allow for saving / reloading the KV cache without saving the logits, since they're not needed for my use case + sampling strategy.

- Create `reload_from_cache_state` method (see the sketch below)
- Still using `LlamaState` as container
- Use low-level `ctx.get_logits_ith` to get the last calculated logits.
- Add `StateReloadError` so that reloading can be fallible.
- Change Llama class to use this instead of `load_state` directly.
- Default implementation still uses `load_state`.
- Use `ptr.contents`, not `ptr`, in `np.array`
- Get dtype from the annotated return type of the signature
- Explicitly set copy=True and dtype on np.array
  - Should not strictly be necessary since pointer is typed
- Catch `StateReloadError` and add logging if it's run into at runtime.
- Fix loading state (`from_buffer` -> `from_buffer_copy`, since `bytes` aren't mutable)
- Add tests (E2E, errors when should, reloads successfully, logits correct, etc.)

Have to set `LLAMA_TEST_MODEL` to point to the model path in order for these tests to run.
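
For context, here's a rough sketch of what this reload path can look like (not the literal diff). `StateReloadError`, `reload_from_cache_state`, and reading the last logits via the low-level `llama_get_logits_ith` binding come straight from the list above; the helper name `last_logits_from_ctx`, the `llama._ctx.ctx` attribute path, and the exact error handling are illustrative assumptions.

```python
import ctypes

import numpy as np
import llama_cpp
from llama_cpp.llama import Llama, LlamaState


class StateReloadError(Exception):
    """Raised when a cached KV state cannot be restored into the current context."""


def last_logits_from_ctx(llama: Llama) -> np.ndarray:
    """Re-read the logits of the last evaluated token from the llama.cpp context.

    Mirrors the "ptr.contents + copy=True + explicit dtype" handling described
    above: the raw float* is viewed as a fixed-size ctypes array and copied
    into a fresh numpy array.
    """
    n_vocab = llama.n_vocab()
    # llama._ctx.ctx is assumed to be the raw llama_context handle here; -1 asks
    # llama.cpp for the logits of the last token that produced any.
    ptr = llama_cpp.llama_get_logits_ith(llama._ctx.ctx, -1)
    if not ptr:
        raise StateReloadError("No logits available in the context to re-read")
    arr_type = ctypes.c_float * n_vocab
    contents = ctypes.cast(ptr, ctypes.POINTER(arr_type)).contents
    return np.array(contents, dtype=np.single, copy=True)


def reload_from_cache_state(llama: Llama, state: LlamaState) -> np.ndarray:
    """Restore a cached state that may have been saved without logits.

    The default path is still load_state(); the last-token logits are then
    re-derived from the context so sampling can continue even though the full
    scores matrix was not persisted.
    """
    try:
        llama.load_state(state)
    except Exception as exc:
        raise StateReloadError(f"Could not reload cached KV state: {exc}") from exc
    return last_logits_from_ctx(llama)
```

At the call site the `Llama` class catches `StateReloadError` and logs it instead of hard-failing, per the checklist item above.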
- Check when saving that the model doesn't need logits (see the sketch after this list)
- Add note in `reload_from_cache_state` to revisit
- Make default to *not* save logits
  - Error if logits are needed and `save_logits` is False in `build_cache`
- Handle reloading with/without scores as needed and available
- Add more tests
- Make `llama_state` / `small_model` fixtures module-scoped (so they don't need to be reloaded for each test); see the fixture sketch after this list
  - Setting the env var in a `.env` file
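
The save side could then look roughly like the following, assuming a `build_cache_state`-style helper (the actual change here lands in `build_cache`). The `save_logits` default and the "error if logits are needed" behaviour come from the list above; the `context_params.logits_all` check and the empty-`scores` convention are assumptions for illustration.

```python
from llama_cpp.llama import Llama, LlamaState


def build_cache_state(llama: Llama, save_logits: bool = False) -> LlamaState:
    """Snapshot the KV cache; by default the per-token logits are dropped.

    The scores matrix (n_tokens x n_vocab float32) dominates the size of a
    saved state; for greedy / temperature sampling only the last row is ever
    needed, and that can be re-read from the context on reload.
    """
    if not save_logits and llama.context_params.logits_all:
        # logits_all means the caller wants logits for every position (e.g.
        # for logprobs); those can't be reconstructed after a reload.
        raise ValueError("Model needs all logits saved; pass save_logits=True")
    state = llama.save_state()
    if not save_logits:
        # Keep the dtype / column count, but zero rows, so the reload path can
        # tell that scores were not persisted and must be re-derived.
        state.scores = state.scores[:0, :].copy()
    return state
```

And a sketch of the module-scoped test fixtures: `small_model` / `llama_state` are the fixture names from the list above, while the skip marker and constructor arguments are illustrative (reading `LLAMA_TEST_MODEL` from a `.env` file typically relies on something like pytest-dotenv).

```python
import os

import pytest
from llama_cpp import Llama

MODEL_PATH = os.environ.get("LLAMA_TEST_MODEL")

# Skip the whole module rather than failing when no test model is configured.
pytestmark = pytest.mark.skipif(MODEL_PATH is None, reason="LLAMA_TEST_MODEL not set")


@pytest.fixture(scope="module")
def small_model() -> Llama:
    # Module scope: the GGUF file is loaded once and shared by every test in
    # the module instead of being reloaded per test.
    return Llama(model_path=MODEL_PATH, n_ctx=512, verbose=False)


@pytest.fixture(scope="module")
def llama_state(small_model: Llama):
    # Evaluate a short prompt once and snapshot it for the reload tests.
    small_model(prompt="Hello", max_tokens=1)
    return small_model.save_state()
```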
tc-wolf merged commit 6634017 into bumped_llama_cpp_with_disk_cache Oct 21, 2024
tc-wolf deleted the optimize_kv_cache_size branch October 21, 2024 16:35