
xllamacpp - a Python wrapper of llama.cpp



This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, likely the most active open-source compiled LLM inference engine.

Comparison with llama-cpp-python

The following table provides an overview of the current implementations / features:

| Implementations / features | xllamacpp           | llama-cpp-python                      |
| -------------------------- | ------------------- | ------------------------------------- |
| Wrapper type               | Cython              | ctypes                                |
| API                        | Server & Params API | Llama API                             |
| Server implementation      | C++                 | Python, through the wrapped Llama API |
| Continuous batching        | yes                 | no                                    |
| Thread safe                | yes                 | no                                    |
| Release package            | prebuilt            | built during installation             |

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of both the original C++ library and Python, the wrapping intentionally follows these guidelines:

  • In general, key structs are implemented as Cython extension classes, with related functions implemented as methods of those classes (a sketch of this pattern follows the list).

  • Be as consistent as possible with llama.cpp's naming of its API elements, except when it makes sense to shorten function names that are used as methods.

  • Minimize non-wrapper python code.
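
As an illustration of the first guideline, a thin Cython wrapping of a C struct typically looks like the sketch below. The header, struct, and function names are hypothetical placeholders for illustration only, not xllamacpp's actual sources:

# Hypothetical names for illustration; not taken from xllamacpp.
cdef extern from "example.h":
    ctypedef struct example_context:
        pass
    example_context* example_init(const char* model_path)
    void example_free(example_context* ctx)

cdef class Context:
    # One extension class per key struct; related C functions become methods.
    cdef example_context* _ctx

    def __cinit__(self, str model_path):
        self._ctx = example_init(model_path.encode("utf-8"))
        if self._ctx == NULL:
            raise MemoryError("failed to create example_context")

    def __dealloc__(self):
        if self._ctx != NULL:
            example_free(self._ctx)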

Prerequisites for Prebuilt Wheels

Before pip installing xllamacpp, please ensure your system meets the following requirements based on your build type:

  • CPU (aarch64):

    • Requires ARMv8-A or later architecture
    • For best performance, build from source if your CPU supports advanced instruction sets
  • CUDA (Linux):

    • Requires glibc 2.35 or later (a quick version check is shown after this list)
    • Compatible NVIDIA GPU with appropriate drivers (CUDA 12.4 or 12.8)
  • ROCm (Linux):

    • Requires glibc 2.35 or later
    • Requires gcc 10 or later (ROCm libraries have this dependency)
    • Compatible AMD GPU with ROCm support (ROCm 6.3.4)
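
For the Linux GPU wheels, the glibc requirement can be checked from Python using only the standard library:

import platform

# Reports the libc the interpreter runs against, e.g. ('glibc', '2.35').
# The CUDA and ROCm wheels need glibc 2.35 or later.
print(platform.libc_ver())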

Install

Note on Performance and Compatibility

For maximum performance, you can build xllamacpp from source to optimize for your specific native CPU architecture. The pre-built wheels are designed for broad compatibility.

Specifically, the aarch64 wheels are built for the ARMv8-A architecture. This ensures they run on a wide range of ARM64 devices, but it means that more advanced CPU instruction sets (like SVE) are not enabled. If your CPU supports these advanced features, building from source will provide better performance.
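
If you are unsure whether your ARM64 CPU exposes such features, a rough check on Linux is to look for them in /proc/cpuinfo (this assumes a typical aarch64 Linux system):

# Rough feature probe; "sve" appears in the Features line of /proc/cpuinfo
# on aarch64 Linux when the CPU supports the Scalable Vector Extension.
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read().lower()
print("sve" in cpuinfo)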

  • From PyPI for CPU or macOS:
pip install -U xllamacpp
  • From the GitHub-hosted package index for CUDA (use --force-reinstall to replace the installed CPU version):

    • CUDA 12.4

      pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
    • CUDA 12.8

      pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu128
  • From the GitHub-hosted package index for HIP AMD GPU (use --force-reinstall to replace the installed CPU version):

pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.3.4
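
Whichever wheel you installed, a quick smoke test is to make sure the package imports cleanly:

python -c "import xllamacpp"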

Build from Source

To build xllamacpp:

  1. Make sure you have a recent version of Python 3 (testing is done on Python 3.12).

  2. Git clone the latest version of xllamacpp:

git clone git@github.com:xorbitsai/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update

  3. Install the dependencies (Cython, setuptools, and pytest for testing):

pip install -r requirements.txt

  4. Type make in the terminal.

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish LLM in the .gguf format from Hugging Face. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. To create the models directory if it doesn't exist and download this model, just type:

make download

This basically just does:

cd xllamacpp
mkdir models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf 
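
Alternatively, if you prefer Python to wget, the same file can be fetched with the huggingface_hub package (not a dependency of xllamacpp, so install it separately if you want to use it):

from huggingface_hub import hf_hub_download

# Download the GGUF file into the local models/ directory.
hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q8_0.gguf",
    local_dir="models",
)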

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

You can also run the test suite with pytest by typing pytest, or:

make test
