Llama plugin workarounds
I'm trying to summarize multiple threads while a plugin update is pending. Streaming support was already added last night, so it isn't mentioned here. Today is Mon, 11 Sep 2023.
1. Using the new GGUF models
Install the plugin:
llm install -U llm-llama-cpp
Now install the latest llama-cpp-python with Metal support:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 llm install llama-cpp-python
The current documentation gives this advice, but it leads to an error, because GGML .bin files are no longer supported by llama.cpp:
llm llama-cpp download-model https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
To see the error, you can use the plugin's verbose mode:
llm -m llama-2-7b-chat.ggmlv3.q8_0 'five names for a cute pet skunk' -o verbose true
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from /Users/ph/Library/Application Support/io.datasette.llm/llama-cpp/models/llama-2-7b-chat.ggmlv3.q8_0.bin
llama_load_model_from_file: failed to load model
If you try to download the same model in the new GGUF format, the plugin gives an error, because download-model expects a .bin extension.
llm llama-cpp download-model https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf
Usage: llm llama-cpp download-model [OPTIONS] URL
Try 'llm llama-cpp download-model --help' for help.
Error: Invalid value: URL must end with .bin
A fix for this would be for the plugin to accept all formats supported by llama.cpp. See this comment.
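For illustration only, the relaxed check could look something like this (a sketch; the function name and the extension list are my assumptions, not the plugin's actual code):

# Hypothetical validation: accept any extension that llama.cpp can load,
# rather than insisting on .bin.
ALLOWED_EXTENSIONS = (".gguf", ".bin")

def validate_model_url(url: str) -> str:
    if not url.endswith(ALLOWED_EXTENSIONS):
        raise ValueError(f"URL must end with one of: {', '.join(ALLOWED_EXTENSIONS)}")
    return url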
Pending that, you can download the model directly to the correct location and then add it:
cd ~/Library/Application\ Support/io.datasette.llm/llama-cpp/models
wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf
llm llama-cpp add-model llama-2-7b-chat.Q8_0.gguf
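If you would rather stay in Python than use wget, the same file can be fetched with the huggingface_hub package (a sketch, assuming huggingface_hub is installed; the repo and filename are taken from the URL above):

import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

models_dir = Path.home() / "Library/Application Support/io.datasette.llm/llama-cpp/models"
models_dir.mkdir(parents=True, exist_ok=True)

# Download into the Hugging Face cache, then copy the file into the plugin's models folder.
cached = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GGUF",
    filename="llama-2-7b-chat.Q8_0.gguf",
)
shutil.copy(cached, models_dir / "llama-2-7b-chat.Q8_0.gguf")

The llm llama-cpp add-model step afterwards stays the same.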
Now you can run the model as usual:
llm models list
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'
Or in verbose mode:
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk' -o verbose true
But Metal is not active yet. The verbose output contains no lines that begin like this:
ggml_metal_add_buffer: allocated...
2. Activate Metal support
Until there's a new plugin version, you need to patch llm-llama-cpp to activate Metal support. This is quite straightforward.
git clone https://github.com/simonw/llm-llama-cpp.git
cd llm-llama-cpp
Then you need to add n_gpu_layers to the end of this line in llm_llama_cpp.py, as described in Simon's comment:
model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000, n_gpu_layers=1
Now reinstall the plugin from this folder:
llm install .
And try generating content again. In Activity Monitor you should see that the GPU is now being used, and verbose mode (shown above) will include mentions of Metal.
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'
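To double-check the Metal offload outside the plugin, you can also load the same GGUF file directly with llama-cpp-python (a sketch; the model path matches the plugin's models folder shown earlier, and the prompt is just an example):

import os
from llama_cpp import Llama

model_path = os.path.expanduser(
    "~/Library/Application Support/io.datasette.llm/llama-cpp/models/llama-2-7b-chat.Q8_0.gguf"
)

# n_gpu_layers=1 is enough to trigger Metal offload on Apple Silicon;
# verbose=True keeps llama.cpp's load logs, which include the ggml_metal_* lines.
llm_model = Llama(model_path=model_path, n_ctx=4000, n_gpu_layers=1, verbose=True)

result = llm_model("Five names for a cute pet skunk:", max_tokens=200)
print(result["choices"][0]["text"])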
3. Activate longer answers
At this point the answers are still truncated. You need to add a max_tokens argument to the plugin, as described in Simon's comment:
stream = llm_model(prompt_text, stream=True, max_tokens=4000)
Now reinstall the plugin from this folder:
llm install .
And try one more time:
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'
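If you want to confirm the effect of max_tokens outside the plugin, here is a minimal llama-cpp-python sketch of the same kind of streaming call (same model path as in the sketch above; this is my illustration, not the plugin's code):

import os
from llama_cpp import Llama

model_path = os.path.expanduser(
    "~/Library/Application Support/io.datasette.llm/llama-cpp/models/llama-2-7b-chat.Q8_0.gguf"
)
llm_model = Llama(model_path=model_path, n_ctx=4000, n_gpu_layers=1)

# Without max_tokens the completion stops at the library's small default and gets truncated;
# raising it lets longer answers stream through.
for chunk in llm_model("Write a short story about a pet skunk:", stream=True, max_tokens=4000):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()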
All done
That should be the whole list. I'm sure there will be a new plugin version soon, but in the meantime I hope this helps someone.