LLaVA

Currently this implementation supports llava-v1.5 variants as well as llava-v1.6 variants.

Pre-converted 7b and 13b models are available. For llava-1.6, a variety of prepared gguf models are available as well, ranging from 7b to 34b.

Once the API is confirmed, more models will be supported and uploaded.

Usage

Build with cmake or run make llama-llava-cli to build it.

After building, run: ./llama-llava-cli to see the usage. For example:

./llama-llava-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf --image path/to/an/image.jpg

note: A lower temperature like 0.1 is recommended for better quality. Add --temp 0.1 to the command to do so.

note: For GPU offloading, use the -ngl flag as usual.
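If you drive the CLI from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of llama.cpp: the helper name build_llava_cmd and its parameters are hypothetical, and it only uses the flags mentioned in this section (-m, --mmproj, --image, --temp, -ngl).

```python
from typing import List, Optional


def build_llava_cmd(
    model: str,
    mmproj: str,
    image: str,
    temp: Optional[float] = 0.1,
    n_gpu_layers: Optional[int] = None,
    binary: str = "./llama-llava-cli",
) -> List[str]:
    """Assemble the argument list for llama-llava-cli (hypothetical helper)."""
    cmd = [binary, "-m", model, "--mmproj", mmproj, "--image", image]
    if temp is not None:
        # A low temperature such as 0.1 is the recommendation from the note above.
        cmd += ["--temp", str(temp)]
    if n_gpu_layers is not None:
        # Only pass -ngl when GPU offloading is wanted.
        cmd += ["-ngl", str(n_gpu_layers)]
    return cmd


cmd = build_llava_cmd(
    "../llava-v1.5-7b/ggml-model-f16.gguf",
    "../llava-v1.5-7b/mmproj-model-f16.gguf",
    "path/to/an/image.jpg",
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.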

LLaVA 1.5

Clone a LLaVA and a CLIP model (available options). For example:

git clone https://huggingface.co/liuhaotian/llava-v1.5-7b

git clone https://huggingface.co/openai/clip-vit-large-patch14-336

Install the required Python packages:

pip install -r examples/llava/requirements.txt

Use llava_surgery.py to split the LLaVA model into its LLaMA and multimodal projector constituents:

python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b

Use convert_image_encoder_to_gguf.py to convert the LLaVA image encoder to GGUF:

python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b

Use examples/convert_legacy_llama.py to convert the LLaMA part of LLaVA to GGUF:

python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown

Now both the LLaMA part and the image encoder are in the llava-v1.5-7b directory.
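Before running the CLI, it can help to confirm that both converted files are really there. This is a small sketch under one assumption: the output file names are taken from the example command in the Usage section (ggml-model-f16.gguf and mmproj-model-f16.gguf), and your conversion may produce different names. The helper missing_outputs is hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed names, copied from the example run command in the Usage section.
EXPECTED_FILES = ("ggml-model-f16.gguf", "mmproj-model-f16.gguf")


def missing_outputs(model_dir: str) -> List[str]:
    """Return the expected GGUF files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("../llava-v1.5-7b")
    print("missing:", missing if missing else "none")
```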

LLaVA 1.6 gguf conversion

First clone a LLaVA 1.6 model:

git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b

Install the required Python packages:

pip install -r examples/llava/requirements.txt

Use llava_surgery_v2.py, which also supports llava-1.5 variants, for both PyTorch and safetensors models:

python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/

You will find a llava.projector and a llava.clip file in your model directory.

Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory:

mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
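For scripted setups, the mkdir and cp steps above could also be done from Python. This is only a sketch: prepare_vit_dir is a hypothetical helper, it uses just the standard library, and it deliberately does not download config.json, which still has to be fetched as shown in the curl command.

```python
import shutil
from pathlib import Path


def prepare_vit_dir(model_dir: str, vit_dir: str = "vit") -> Path:
    """Copy llava.clip and llava.projector into vit_dir (hypothetical helper)."""
    src = Path(model_dir)
    dst = Path(vit_dir)
    # Like `mkdir -p`: does not fail if the directory already exists.
    dst.mkdir(parents=True, exist_ok=True)
    # llava.clip is renamed to pytorch_model.bin, matching the cp step above.
    shutil.copyfile(src / "llava.clip", dst / "pytorch_model.bin")
    shutil.copyfile(src / "llava.projector", dst / "llava.projector")
    return dst


# Example call (requires the surgery step to have produced both files):
# prepare_vit_dir("../llava-v1.6-vicuna-7b", "vit")
```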

Create the visual gguf model:

python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision

This is similar to llava-1.5. The difference is that we tell the encoder that we are working with the pure vision model part of CLIP.

Then convert the model to gguf format:

python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown

And finally we can run the llava cli using the 1.6 model version:

./llama-llava-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf --image some-image.jpg -c 4096

note: llava-1.6 needs more context than llava-1.5; at least 3000 tokens are needed (just run it with -c 4096).

note: llava-1.6 greatly benefits from batched prompt processing (the defaults work).

note: if the language model in step 6) is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM from the llava next model:

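import transformers

# Fill in both paths for your setup; they are left elided here on purpose.
model_path = ...
llm_export_path = ...

# Load the tokenizer and the full multimodal (LLaVA-NeXT) model.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)

# Save the tokenizer and only the language model part to the export directory.
tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)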

Then, you can convert the LLM using the convert_hf_to_gguf.py script, which handles more LLM architectures.

llava-cli templating and llava-1.6 prompting

llava-1.5 models all use the same vicuna prompt; here you can just add your image question, like -p "Provide a full description." For llava-1.6 models which are not vicuna (mistral and Yi), you need to adapt the system prompt as well as the user prompt. For this purpose llava-cli has a basic templating system:

For Mistral, using the llava-cli binary, add this: -p "<image>\nUSER:\nProvide a full description.\nASSISTANT:\n" The mistral template for llava-1.6 seems to use no system prompt and a USER/ASSISTANT role.
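When the question changes often, the Mistral-style prompt can be generated instead of typed by hand. The sketch below is illustrative only: mistral_llava_prompt is a hypothetical helper, and it assumes the \n sequences should reach the CLI as literal backslash-n characters, exactly as they appear in the quoted -p argument.

```python
def mistral_llava_prompt(question: str) -> str:
    """Build the -p argument for a Mistral-based llava-1.6 model (hypothetical helper)."""
    # Raw strings keep each \n as two literal characters, as typed on the command line.
    return r"<image>\nUSER:\n" + question + r"\nASSISTANT:\n"


prompt = mistral_llava_prompt("Provide a full description.")
print(prompt)
```

If your build expects real newlines instead of escape sequences, replace the raw strings with ordinary ones.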

How to know if you are running in llava-1.5 or llava-1.6 mode

When running llava-cli, you will see visual information right before the prompt is processed:

Llava-1.5: encode_image_with_clip: image embedding created: 576 tokens

Llava-1.6 (anything above 576): encode_image_with_clip: image embedding created: 2880 tokens

Alternatively, just note how many "tokens" have been used for your prompt; it will also show 1000+ tokens for llava-1.6.
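The check described above can be automated by scanning captured llava-cli output for the embedding line. This is a minimal sketch: the helper names are hypothetical, the log format is assumed to match the two example lines exactly, and treating counts below 576 as llava-1.5 is an assumption not stated in this README.

```python
import re
from typing import Optional

# Pattern copied from the example log lines above.
_EMBED_RE = re.compile(
    r"encode_image_with_clip: image embedding created: (\d+) tokens"
)


def image_token_count(log_text: str) -> Optional[int]:
    """Return the image token count from the log, or None if the line is absent."""
    match = _EMBED_RE.search(log_text)
    return int(match.group(1)) if match else None


def llava_mode(log_text: str) -> Optional[str]:
    """Classify the run: anything above 576 image tokens counts as llava-1.6."""
    count = image_token_count(log_text)
    if count is None:
        return None
    return "llava-1.6" if count > 576 else "llava-1.5"


print(llava_mode("encode_image_with_clip: image embedding created: 2880 tokens"))
```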

TODO

Support non-CPU backend for the image encoding part.
Support different sampling methods.
Support more model variants.
