
Quick setup for the llama-cpp-python backend


The llama-cpp-python project itself has installation notes on its PyPI page, but with some bits missing.
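For a plain CPU-only build, the basic PyPI install is just:

pip install llama-cpp-python

The notes below fill in the platform-specific and GPU-accelerated bits.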

Windows: "how to run llama.cpp on windows"

An M1/M2 Mac llama.cpp install recipe I found. After that you should be able to install llama-cpp-python as a Python package without trouble. There are reports that this recipe also works with WSL2 on Windows.
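I haven't pinned down the exact Mac build line myself, but if you want Metal acceleration the install should look roughly like the Linux CUDA recipe below, assuming your llama-cpp-python version passes through llama.cpp's LLAMA_METAL CMake flag:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U --no-cache-dir llama-cpp-python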

My quick Linux recipe

llama-cpp-python bundles its own copy of llama.cpp, so you need to set the build flags and make sure the wheel actually gets rebuilt rather than pulled from cache:

pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U --no-cache-dir llama-cpp-python

You can then run the server, e.g. like so:

# Replace with whatever GGML model you've downloaded
HOST=0.0.0.0 python3 -m llama_cpp.server --n_gpu_layers=40 --model /opt/mlai/cache/huggingface/dl/TheBloke_Nous-Hermes-13B-GGML/nous-hermes-13b.ggmlv3.q4_K_M.bin

In my demo case, --n_gpu_layers=40 uses under 8 GB of VRAM. Tweak to taste.

Make sure you see BLAS=1 in the server's startup output, to confirm the GPU is being used.
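Once the server is up you can smoke-test it from another terminal. A minimal sketch, assuming the server is on its default port 8000 and using its OpenAI-style /v1/completions endpoint:

# Quick check that the server responds to a completion request
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Q: Name the planets in the solar system. A:", "max_tokens": 32}'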

This issue has helpful info.