llama-go: Run LLMs locally with Go

Go bindings for llama.cpp, enabling you to run large language models locally with GPU acceleration. The library is production-ready, with thread-safe concurrent inference and comprehensive test coverage. Integrate LLM inference directly into Go applications with a clean, idiomatic API.

This is an active fork of go-skynet/go-llama.cpp, which hasn't been maintained since October 2023. The goal is to keep Go developers up to date with llama.cpp whilst offering a lighter, more performant alternative to Python-based ML stacks such as PyTorch or vLLM.

Documentation: See the getting started guide, building guide, API guide, examples, and Go package docs, plus the llama.cpp documentation for model format and engine details.

Quick start

# Clone with submodules
git clone --recurse-submodules https://github.com/tcpipuk/llama-go
cd llama-go

# Build the library
make libbinding.a

# Download a test model
wget https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

# Run an example
export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD
go run ./examples/simple -m Qwen3-0.6B-Q8_0.gguf -p "Hello world" -n 50

Basic usage

package main

import (
    "fmt"
    llama "github.com/tcpipuk/llama-go"
)

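func main() {
    // Load a GGUF model with half-precision memory and a 512-token context.
    model, err := llama.LoadModel(
        "/path/to/model.gguf",
        llama.WithF16Memory(),
        llama.WithContext(512),
    )
    if err != nil {
        panic(err)
    }
    // Release the model's native resources when main returns.
    defer model.Close()

    // Generate at most 50 tokens in response to the prompt.
    response, err := model.Generate("Hello world", llama.WithMaxTokens(50))
    if err != nil {
        panic(err)
    }

    fmt.Println(response)
}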

When building, run the following from the llama-go checkout so that the compiler and linker can find the headers and libbinding.a:

export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD

Key capabilities

Text generation and embeddings: Generate text with LLMs or extract embeddings for semantic search, clustering, and similarity tasks.

GPU acceleration: Supports NVIDIA (CUDA), AMD (ROCm), Apple Silicon (Metal), Intel (SYCL), and cross-platform acceleration (Vulkan, OpenCL). Eight backend options cover virtually all modern GPU hardware, plus distributed inference via RPC.

Production ready: Comprehensive test suite with almost 400 test cases and CI validation including CUDA builds. Development actively tracks llama.cpp releases; the project is maintained for production use, not as a demo.

Advanced features: Cache common prompt prefixes to avoid recomputing system prompts across thousands of generations. Serve multiple concurrent requests with a single model loaded in VRAM (no weight duplication). Stream tokens as they are generated for ChatGPT-style typing effects. Use speculative decoding for a 2-3× generation speedup.
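The idea behind prefix caching can be sketched in a few lines: if a new prompt starts with the same tokens as the previous one, only the tokens after the shared prefix need to be evaluated. The sketch below is an illustration of that idea under assumed inputs, not the library's implementation; `commonPrefixLen` and the token IDs are hypothetical.

```go
package main

import "fmt"

// commonPrefixLen returns how many leading tokens the cached sequence
// and the new prompt have in common.
func commonPrefixLen(cached, prompt []int) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n] == prompt[n] {
		n++
	}
	return n
}

func main() {
	// Hypothetical token IDs: the first three tokens stand in for a
	// shared system prompt.
	cached := []int{1, 15, 27, 42, 8}
	prompt := []int{1, 15, 27, 99, 3, 4}

	reuse := commonPrefixLen(cached, prompt)
	fmt.Println("tokens reused:", reuse)
	fmt.Println("tokens to evaluate:", len(prompt)-reuse)
}
```

With a long system prompt and short user messages, most of each request falls inside the reused prefix, which is where the saving comes from.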

Architecture

The library bridges Go and C++ using CGO, keeping the heavy computation in llama.cpp's optimised C++ code whilst providing a clean Go API. This minimises CGO overhead whilst maximising performance.

Key components:

wrapper.cpp/wrapper.h - CGO interface to llama.cpp
Clean Go API with comprehensive godoc comments
llama.cpp/ - Git submodule tracking upstream releases

The design uses functional options for configuration, dynamic context pooling for thread safety, automatic KV cache prefix reuse for performance, resource management with finalizers, and streaming callbacks via cgo.Handle for safe Go-C interaction.
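For readers unfamiliar with functional options, the pattern can be sketched as follows. This is a self-contained illustration, not the library's source: the `config` struct, its fields, the default value, and `newConfig` are assumptions, and the option names merely mirror those used in the basic usage example.

```go
package main

import "fmt"

// config is a hypothetical settings struct for this sketch.
type config struct {
	contextSize int
	f16Memory   bool
}

// Option is a function that modifies a config.
type Option func(*config)

// WithContext sets the context size (name mirrors the README example;
// the body is an assumption).
func WithContext(n int) Option {
	return func(c *config) { c.contextSize = n }
}

// WithF16Memory enables half-precision memory (body is an assumption).
func WithF16Memory() Option {
	return func(c *config) { c.f16Memory = true }
}

// newConfig starts from defaults and applies each option in order.
func newConfig(opts ...Option) config {
	c := config{contextSize: 2048} // assumed default for illustration
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newConfig())
	fmt.Printf("%+v\n", newConfig(WithContext(512), WithF16Memory()))
}
```

The appeal of the pattern is that callers name only the settings they want to change, and new options can be added later without breaking existing call sites.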

Licence

MIT
