
Integration with llama.cpp #898

Closed

0x090909 opened this issue Mar 26, 2023 · 8 comments

Comments

@0x090909

Hello,

I'm reading the documentation, and it seems that this indexer cannot be used with https://github.com/ggerganov/llama.cpp.

Am I correct?

If so, will it be integrated in the future?

Thanks

@logan-markewich
Collaborator

logan-markewich commented Mar 26, 2023

@0x090909 the documentation provides an example of using a custom LLM here: https://gpt-index.readthedocs.io/en/latest/how_to/custom_llms.html#example-using-a-custom-llm-model

It will be up to you to handle passing the text to the model and returning the newly generated tokens.
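A minimal sketch of that pattern, assuming the LangChain LLM base class that the linked docs used at the time; my_llama_cpp_generate is a placeholder for however you call llama.cpp:

from typing import List, Optional
from langchain.llms.base import LLM

class LlamaCppCustomLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom-llama-cpp"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # Pass the prompt to your llama.cpp process/binding and return only
        # the newly generated text (not the echoed prompt).
        return my_llama_cpp_generate(prompt)  # placeholder function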

@0x090909
Author

@logan-markewich Thank you

@ianscrivener

@logan-markewich .. sorry, but that link is dead now

@ianscrivener

I'm using llama.cpp a lot. C++ inference of 4-bit quantized and optimised models is VERY performant, significantly faster than FP16 Python code.

One approach to using llama.cpp with llama_index would be to use abetlen's https://github.com/abetlen/llama-cpp-python pip package, which replicates OpenAI's API... though this (I assume) would require running two processes to get the job done.
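A hedged sketch of that two-process approach: start llama-cpp-python's OpenAI-compatible server in a separate process (e.g. python3 -m llama_cpp.server --model /path/to/ggml-model-q4_0.bin) and then point the openai client at it; the model path and model name below are placeholders.

import openai

openai.api_key = "sk-no-key-needed"            # the local server does not check the key
openai.api_base = "http://localhost:8000/v1"   # default llama_cpp.server address

resp = openai.Completion.create(
    model="local-llama",                       # placeholder; the local server accepts any name
    prompt="Q: What is llama.cpp? A:",
    max_tokens=32,
)
print(resp["choices"][0]["text"])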

Ideally, there would be a llama_index custom LLM module that used an existing llama.cpp install, e.g. llama_cpp-bin=/somewhere/llama.cpp/bin/main, leveraging the Python/C++ bindings as per llama-cpp-python.
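For the "existing llama.cpp install" idea, a rough sketch that shells out to the llama.cpp main binary, assuming its -m/-p/-n flags as of mid-2023; the binary and model paths are placeholders:

import subprocess

def llama_cpp_complete(prompt: str,
                       binary: str = "/somewhere/llama.cpp/bin/main",   # placeholder path
                       model: str = "/path/to/ggml-model-q4_0.bin",     # placeholder path
                       n_predict: int = 256) -> str:
    # Run the llama.cpp binary; it prints the prompt followed by the
    # completion to stdout (diagnostic logs go to stderr).
    result = subprocess.run(
        [binary, "-m", model, "-p", prompt, "-n", str(n_predict)],
        capture_output=True, text=True, check=True,
    )
    # Strip the echoed prompt; exact output formatting can vary between builds.
    return result.stdout.split(prompt, 1)[-1]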

@ianscrivener

ianscrivener commented Jul 10, 2023

Another good, fast C++ local LLM inference engine is https://github.com/OpenNMT/CTranslate2... which can use Llama-family models as well as LLMs from other families. (However, it does not support macOS Metal GPUs, which is what I have.)

@logan-markewich
Collaborator

logan-markewich commented Jul 10, 2023

You can use any LLM that langchain offers (which happens to include llama.cpp)

Using v0.7.4

from llama_index import ServiceContext, set_global_service_context

llm = <setup langchain llm>
service_context = ServiceContext.from_defaults(llm=llm, context_window=<context window of llm>, chunk_size=<some value about 25% smaller than the context window of the llm>)
set_global_service_context(service_context)
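A concrete (hedged) version of the snippet above, assuming LangChain's LlamaCpp wrapper and the llama_index ~0.7.x ServiceContext API; the model path and sizes are placeholders:

from langchain.llms import LlamaCpp
from llama_index import ServiceContext, set_global_service_context

llm = LlamaCpp(
    model_path="/path/to/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=2048,                                 # the model's context window
    temperature=0.1,
)
service_context = ServiceContext.from_defaults(
    llm=llm,
    context_window=2048,
    chunk_size=1500,   # roughly 25% smaller than the context window
)
set_global_service_context(service_context)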

Ngl, but the last time I tried it, llama.cpp was still not as fast as I'd like, and I think the context window was only 512, which is quite tiny for llama_index.

@ianscrivener

ianscrivener commented Jul 10, 2023

thanks @logan-markewich 🙏

So langchain supports llama.cpp via the llama-cpp-python library... which is fine; it's usually just one release version behind llama.cpp. llama-cpp-python and llama.cpp happily run on Mac Arm64 & Metal.

BTW: llama.cpp currently supports context sizes up to 2048; the C++ devs are working on extending the context size via RoPE scaling.

llama.cpp is by far the best & fastest self-hosted LLM inference I have found for Apple Silicon (Metal).

@ianscrivener

Here's how I upgraded llama-cpp-python to support the macOS Metal GPU:

pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

BTW: Xcode is required to compile the llama.cpp binary.
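A quick (hedged) way to check the Metal build is actually being used, via llama-cpp-python's n_gpu_layers option; the model path is a placeholder, and the Metal initialisation logs appear on stderr when layers are offloaded:

from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/ggml-model-q4_0.bin",  # placeholder path
    n_gpu_layers=1,                             # any value > 0 triggers Metal offload on this build
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])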
