
Quick setup for the llama-cpp-python backend


The llama-cpp-python project itself has installation notes on its PyPI page, but with some bits missing.
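For a plain CPU-only build, the basic PyPI install is just:

pip install llama-cpp-python

The notes below fill in the platform-specific and GPU-accelerated bits.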

Windows: "how to run llama.cpp on windows"

An M1/M2 Mac llama.cpp install recipe I found. After that you should be able to install llama-cpp-python as a Python package without trouble. There are reports that this recipe also works with WSL2 on Windows.
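I haven't pinned down the exact Mac build line myself, but if you want Metal acceleration the install should look roughly like the Linux CUDA recipe below, assuming your llama-cpp-python version passes through llama.cpp's LLAMA_METAL CMake flag:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U --no-cache-dir llama-cpp-python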

My quick Linux recipe

llama-cpp-python bundles its own copy of llama.cpp, so you need to set the build flags and make sure the wheel actually gets rebuilt rather than pulled from cache:

pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U --no-cache-dir llama-cpp-python

You can then run the server, e.g. like so:

# Replace with whatever GGML model you've downloaded
HOST=0.0.0.0 python3 -m llama_cpp.server --n_gpu_layers=40 --model /opt/mlai/cache/huggingface/dl/TheBloke_Nous-Hermes-13B-GGML/nous-hermes-13b.ggmlv3.q4_K_M.bin

In my demo case, --n_gpu_layers=40 uses under 8 GB of VRAM. Tweak to taste.

Make sure you see BLAS=1 in the server's startup output, to confirm the GPU is being used.
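Once the server is up you can smoke-test it from another terminal. A minimal sketch, assuming the server is on its default port 8000 and using its OpenAI-style /v1/completions endpoint:

# Quick check that the server responds to a completion request
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Q: Name the planets in the solar system. A:", "max_tokens": 32}'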

This issue has helpful info.