Integration with llama.cpp #898
Hello,
I'm reading the documentation and it seems that this indexer cannot be used with https://github.com/ggerganov/llama.cpp.
Am I correct?
If so, will it be integrated in the future?
Thanks

Comments
@0x090909 The documentation provides an example of using a custom LLM here: https://gpt-index.readthedocs.io/en/latest/how_to/custom_llms.html#example-using-a-custom-llm-model It will be up to you to handle passing the text to the model and returning the newly generated tokens.
@logan-markewich Thank you
@logan-markewich ... sorry, but that link is dead now
I'm using llama.cpp a lot. C++ inference of 4-bit quantized and optimised models is VERY performant, significantly faster than FP16 Python code. One approach to using llama.cpp with llama_index would be Abetlen's https://github.com/abetlen/llama-cpp-python pip package, which replicates OpenAI's API... though this (I assume) would require running two processes to get the job done. Ideally, there would be a llama_index custom LLM module that used an existing llama.cpp install.
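A rough sketch of what such a module could look like (an assumption, not an existing llama_index component): follow the custom-LLM pattern from the docs link above, wrap the llama-cpp-python bindings in a LangChain `LLM` subclass, and hand it to llama_index via `LLMPredictor`. The class name, model path, and generation parameters below are placeholders.

```python
# Sketch only: a hypothetical custom LLM that runs a local llama.cpp model
# in-process via llama-cpp-python and plugs into llama_index.
from typing import List, Optional

from langchain.llms.base import LLM
from llama_cpp import Llama
from llama_index import LLMPredictor, ServiceContext

# Load the quantized model once (placeholder path).
_llama = Llama(model_path="/path/to/ggml-model-q4_0.bin", n_ctx=2048)


class LlamaCppCustomLLM(LLM):
    """Hypothetical wrapper around a local llama.cpp model."""

    @property
    def _llm_type(self) -> str:
        return "llama_cpp"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # Pass the prompt to llama.cpp and return only the newly generated text.
        output = _llama(prompt, max_tokens=256, stop=stop or [])
        return output["choices"][0]["text"]


llm_predictor = LLMPredictor(llm=LlamaCppCustomLLM())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
```

Because the model runs in-process, this avoids the second (server) process that the OpenAI-compatible route would need.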
Another good, FAST C++ local LLM inference engine is https://github.com/OpenNMT/CTranslate2, which can run Llama-family models as well as other LLM families. (However, it does not support macOS Metal GPUs, which I have.)
You can use any LLM that LangChain offers (which happens to include llama.cpp). Using v0.7.4.
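For reference, a minimal sketch of that route (model path, data directory, and parameters are placeholders; the exact wiring can differ slightly between llama_index releases):

```python
# Sketch: use LangChain's LlamaCpp LLM inside llama_index (assumed v0.7.x-style API).
from langchain.llms import LlamaCpp
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

llm = LlamaCpp(model_path="/path/to/ggml-model-q4_0.bin", temperature=0.1, max_tokens=256)
service_context = ServiceContext.from_defaults(llm=llm)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("What is this document about?"))
```

Note that embeddings still default to OpenAI, so you would also need to pass a local embed_model to the ServiceContext if you want to stay fully offline.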
Ngl but the last time I tried, llama.cpp was still not as fast as I'd like, and I think the context window was only 512, which is quite tiny for llama_index
Thanks @logan-markewich 🙏 So LangChain supports llama.cpp via the llama-cpp-python library... which is fine; it is usually just one release behind llama.cpp. llama-cpp-python and llama.cpp happily run on Mac Arm64 and Metal.
BTW: llama.cpp currently supports a context size of up to 2048 tokens; the C++ devs are working on extending that via RoPE scaling. llama.cpp is by far the best and fastest self-hosted LLM inference I have found for Apple Silicon (Metal).
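If it helps, here is a hedged sketch of pinning those limits explicitly so llama_index's prompt packing stays inside the model's 2048-token window (parameter names are taken from LangChain's LlamaCpp and a v0.7-era ServiceContext; the values are illustrative):

```python
# Sketch: keep llama_index's prompt budget within llama.cpp's 2048-token context.
from langchain.llms import LlamaCpp
from llama_index import ServiceContext

llm = LlamaCpp(
    model_path="/path/to/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=2048,       # llama.cpp's current maximum context size
    n_gpu_layers=1,   # any value > 0 offloads to Metal when built with Metal support
    max_tokens=256,
)
service_context = ServiceContext.from_defaults(
    llm=llm,
    context_window=2048,  # tell llama_index how large the model's window is
    num_output=256,       # reserve room for the generated tokens
)
```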
Here's how I upgraded llama-cpp-python to support the macOS Metal GPU.
BTW: Xcode is required to compile the llama.cpp binary.
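The actual steps don't seem to have been preserved above. For what it's worth, the approach documented in the llama-cpp-python README (an assumption about what was done here, not a quote of it) is to rebuild the wheel with llama.cpp's Metal backend enabled:

```bash
# Install the Xcode command line tools, then rebuild llama-cpp-python with Metal on.
xcode-select --install
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```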