Llama plugin workarounds
I'm trying to summarize multiple threads while a plugin update is pending. Streaming support was already added last night, so it isn't mentioned here. Today is Mon, 11 Sep 2023.
1. Using the new GGUF models
Install the plugin:
llm install -U llm-llama-cpp
Now install the latest llama-cpp-python with Metal support:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 llm install llama-cpp-python
The current documentation gives this advice, but it leads to an error, because GGML .bin files are no longer supported by llama.cpp:
llm llama-cpp download-model https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
To see the error, you can use the plugin's verbose mode:
llm -m llama-2-7b-chat.ggmlv3.q8_0 'five names for a cute pet skunk' -o verbose true
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from /Users/ph/Library/Application Support/io.datasette.llm/llama-cpp/models/llama-2-7b-chat.ggmlv3.q8_0.bin
llama_load_model_from_file: failed to load model
If you try to download the same model in the new GGUF format, the plugin gives an error, because download-model expects a .bin extension.
llm llama-cpp download-model https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf
Usage: llm llama-cpp download-model [OPTIONS] URL
Try 'llm llama-cpp download-model --help' for help.
Error: Invalid value: URL must end with .bin
A fix for this would be for the plugin to accept all formats supported by llama.cpp. See this comment.
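For illustration only, the relaxed check could look something like this (a sketch; the function name and the extension list are my assumptions, not the plugin's actual code):

# Hypothetical validation: accept any extension that llama.cpp can load,
# rather than insisting on .bin.
ALLOWED_EXTENSIONS = (".gguf", ".bin")

def validate_model_url(url: str) -> str:
    if not url.endswith(ALLOWED_EXTENSIONS):
        raise ValueError(f"URL must end with one of: {', '.join(ALLOWED_EXTENSIONS)}")
    return url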
Pending that, you can download the model directly to the correct location and then add it:
cd ~/Library/Application\ Support/io.datasette.llm/llama-cpp/models
wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf
llm llama-cpp add-model llama-2-7b-chat.Q8_0.gguf
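If you would rather stay in Python than use wget, the same file can be fetched with the huggingface_hub package (a sketch, assuming huggingface_hub is installed; the repo and filename are taken from the URL above):

import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

models_dir = Path.home() / "Library/Application Support/io.datasette.llm/llama-cpp/models"
models_dir.mkdir(parents=True, exist_ok=True)

# Download into the Hugging Face cache, then copy the file into the plugin's models folder.
cached = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GGUF",
    filename="llama-2-7b-chat.Q8_0.gguf",
)
shutil.copy(cached, models_dir / "llama-2-7b-chat.Q8_0.gguf")

The llm llama-cpp add-model step afterwards stays the same.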
Now you can run the model as usual:
llm models list
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'
Or in verbose mode:
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk' -o verbose true
But Metal is not active yet. The verbose output contains no lines that begin like this:
ggml_metal_add_buffer: allocated...
2. Activate Metal support
Until there's a new plugin version, you need to patch llm-llama-cpp to activate Metal support. This is quite straightforward.
git clone https://github.com/simonw/llm-llama-cpp.git
cd llm-llama-cpp
Then you need to add n_gpu_layers to the end of this line in llm_llama_cpp.py, as described in Simon's comment:
model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000, n_gpu_layers=1
Now reinstall the plugin from this folder:
llm install .
And try generating content again. In Activity Monitor you should see that the GPU is now being used, and verbose mode (shown above) will include mentions of Metal.
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'
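To double-check the Metal offload outside the plugin, you can also load the same GGUF file directly with llama-cpp-python (a sketch; the model path matches the plugin's models folder shown earlier, and the prompt is just an example):

import os
from llama_cpp import Llama

model_path = os.path.expanduser(
    "~/Library/Application Support/io.datasette.llm/llama-cpp/models/llama-2-7b-chat.Q8_0.gguf"
)

# n_gpu_layers=1 is enough to trigger Metal offload on Apple Silicon;
# verbose=True keeps llama.cpp's load logs, which include the ggml_metal_* lines.
llm_model = Llama(model_path=model_path, n_ctx=4000, n_gpu_layers=1, verbose=True)

result = llm_model("Five names for a cute pet skunk:", max_tokens=200)
print(result["choices"][0]["text"])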
3. Activate longer answers
At this point the answers are still truncated. You need to add a max_tokens argument to the plugin, as described in Simon's comment:
stream = llm_model(prompt_text, stream=True, max_tokens=4000)
Now reinstall the plugin from this folder:
llm install .
And try one more time:
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'
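If you want to confirm the effect of max_tokens outside the plugin, here is a minimal llama-cpp-python sketch of the same kind of streaming call (same model path as in the sketch above; this is my illustration, not the plugin's code):

import os
from llama_cpp import Llama

model_path = os.path.expanduser(
    "~/Library/Application Support/io.datasette.llm/llama-cpp/models/llama-2-7b-chat.Q8_0.gguf"
)
llm_model = Llama(model_path=model_path, n_ctx=4000, n_gpu_layers=1)

# Without max_tokens the completion stops at the library's small default and gets truncated;
# raising it lets longer answers stream through.
for chunk in llm_model("Write a short story about a pet skunk:", stream=True, max_tokens=4000):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()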
All done
That should be the whole list. I'm sure there will be a new plugin version soon, but in the meantime I hope this helps someone.