Use llama-rs for inference, introduce session caching #109

nsarrazin · 2023-03-29T18:20:55Z

The problem

llama.cpp failed to compile on a number of platforms; hopefully this improves things.

Also, recomputing all of the previous context took a long time, slowing chats to a crawl in longer sessions.

Solution

llama-rs, a Rust implementation, might have better compatibility. It also supports session caching, which should hopefully improve performance on longer chats by persisting sessions to disk.
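To illustrate the caching idea only, here is a minimal Python sketch. It is not the llama-rs API and not the code in this PR: `SessionCache` and `feed` are hypothetical names, and a plain token list stands in for the real inference state. It assumes the cached state can be reused only when it is a prefix of the new context, so that just the new tokens need to be evaluated.

```python
import hashlib
import pickle
from pathlib import Path


class SessionCache:
    """Stores per-chat state on disk (hypothetical helper, not a llama-rs API)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, chat_id):
        # Hash the chat id so any string maps to a safe file name.
        digest = hashlib.sha256(chat_id.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.session"

    def load(self, chat_id):
        # Return the tokens already processed for this chat, or [] if none.
        path = self._path(chat_id)
        if not path.exists():
            return []
        with path.open("rb") as f:
            return pickle.load(f)

    def save(self, chat_id, tokens):
        with self._path(chat_id).open("wb") as f:
            pickle.dump(tokens, f)


def feed(cache, chat_id, tokens):
    """Return how many tokens would need to be evaluated for this turn."""
    cached = cache.load(chat_id)
    # Reuse the cache only if it is a prefix of the new context.
    if tokens[:len(cached)] == cached:
        new_tokens = tokens[len(cached):]
    else:
        new_tokens = tokens
    # A real backend would run the model on new_tokens here (assumption).
    cache.save(chat_id, tokens)
    return len(new_tokens)
```

With this sketch, the first turn of a chat evaluates the whole context, while later turns that extend the same context evaluate only the appended tokens.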

nsarrazin · 2023-04-02T15:48:39Z

Not really relevant anymore. Also, llama-rs still uses ggml in C, so it doesn't solve any of the compatibility problems.

nsarrazin added 2 commits March 29, 2023 18:29

use llama-rs for generation

88b60b2

fixed bug in dockerfile

433ec01

nsarrazin mentioned this pull request Mar 29, 2023

error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch #48

Closed

nsarrazin closed this Apr 2, 2023

gaby deleted the feature/llama-rs-caching branch June 19, 2023 12:28
