Download the model
hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./R1-Distill-1.5B
A minimal C++ example that loads a Hugging Face transformer checkpoint
(the DeepSeek-R1-Distill-Qwen-1.5B model in R1-Distill-1.5B/) and runs
token-by-token inference, with no Python in the hot path.
It is built on top of llama.cpp: the project pins a specific tag, fetches it
via CMake FetchContent, links a tiny main.cpp against libllama, and uses
Apple Metal / Accelerate / BLAS where available.
prompt ──► [main.cpp]
│ llama_model_load_from_file
│ llama_init_from_model
│ llama_tokenize
│ llama_decode ◄── KV cache, attention, MLP (libllama + ggml)
│ llama_sampler_sample
▼
tokens (streamed to stdout)
.
├── CMakeLists.txt # fetches llama.cpp@b9048, builds bin/run_model
├── Makefile # configure / build / convert / run wrapper
├── main.cpp # ~200-line C++ inference driver (llama.cpp C API)
├── R1-Distill-1.5B/ # the original Hugging Face checkpoint (safetensors)
└── R1-Distill-1.5B.gguf # produced by `make convert` (gitignored)
| Tool | Minimum | Notes |
|---|---|---|
| CMake | 3.18 | brew install cmake or your distro package |
| C++ toolchain | C++17 capable | AppleClang / clang / gcc all work |
| Git | any recent version | needed for FetchContent to clone llama.cpp |
| Python | 3.10 – 3.12 | only for the one-time safetensors → gguf conversion |
| Disk space | ~8 GB free | source + build artefacts + GGUF model |
Why a specific Python range? llama.cpp's converter pins numpy~=1.26 and
torch~=2.6. Those wheels are only published for Python ≤ 3.12. The Makefile
auto-detects python3.12 / 3.11 / 3.10 on PATH and falls back to plain
python3; override with make convert PYTHON=/path/to/python.
The C++ binary itself has no Python or PyTorch dependency at runtime — it
only depends on libllama / libggml* which are statically/dynamically
linked alongside it.
# 1. Build the C++ binary (clones llama.cpp@b9048 on first run, ~2 minutes).
make build
# 2. Convert the HF safetensors checkpoint to GGUF (one-time, ~2 minutes).
make convert
# 3. Run inference.
make run PROMPT="Why is the sky blue?" N_PREDICT=200
Or call the binary directly:
./build/bin/run_model \
-m R1-Distill-1.5B.gguf \
-p "The capital of France is" \
-n 128 \
--temp 0.8
Sample output (greedy, --temp 0):
The capital of France is Paris, the capital of Germany is Berlin,
the capital of Italy is Rome, the capital of Spain is Madrid, ...
| Target | What it does |
|---|---|
| make help | Print the target list and current variable values |
| make configure | Run CMake configure; clones llama.cpp into build/_deps/ |
| make build | Configure + compile build/bin/run_model |
| make convert | Create .venv, install converter deps, write R1-Distill-1.5B.gguf |
| make run | Build (if needed) and run inference with the variables below |
| make clean | Delete build/ |
| make distclean | Also delete .venv/ and the generated .gguf |
All of the variables below can be set on the make command line (for example, make run TEMP=0).
| Variable | Default | Meaning |
|---|---|---|
| MODEL_DIR | R1-Distill-1.5B | HF checkpoint directory (must contain config.json) |
| GGUF | $(MODEL_DIR).gguf | Output path for the converted model |
| OUTTYPE | f16 | Conversion dtype: f16, bf16, q8_0, q4_k_m, tq1_0, … |
| PROMPT | Hello, my name is | Prompt fed to the model |
| N_PREDICT | 128 | Max tokens to generate |
| TEMP | 0.8 | Sampling temperature (<=0 = greedy) |
| CTX | 2048 | Context window (model trained for 131072) |
| NGL | 999 | GPU layers to offload (0 = CPU only) |
| PYTHON | auto-detected | Python interpreter used to create .venv for conversion |
| CMAKE_ARGS | (empty) | Extra flags passed to cmake -S . -B build |
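The gap between the default CTX of 2048 and the 131072 tokens the model was trained for is largely a memory question, because the KV cache grows linearly with the context size. The following is a rough back-of-the-envelope sketch, not something taken from main.cpp. The helper name kv_cache_bytes is hypothetical, and the shape values used in the note below (28 layers, 2 KV heads, head dimension 128, f16 cache) are assumptions that should be checked against R1-Distill-1.5B/config.json.

```cpp
#include <cstdint>

// Hypothetical helper: approximate KV cache size in bytes.
// The factor 2 accounts for storing both keys and values in every layer.
// All shape parameters are caller-supplied assumptions, not values read
// from the model file.
constexpr std::uint64_t kv_cache_bytes(std::uint64_t n_ctx,
                                       std::uint64_t n_layers,
                                       std::uint64_t n_kv_heads,
                                       std::uint64_t head_dim,
                                       std::uint64_t bytes_per_elem) {
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx;
}
```

Under those assumptions, kv_cache_bytes(2048, 28, 2, 128, 2) is 58,720,256 bytes (56 MiB), while kv_cache_bytes(131072, 28, 2, 128, 2) is 3,758,096,384 bytes (3.5 GiB), before any compute buffers are counted.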
Pass anything CMake / llama.cpp understands via CMAKE_ARGS:
# CPU only, no Metal:
make build CMAKE_ARGS="-DGGML_METAL=OFF"
# CUDA on Linux:
make build CMAKE_ARGS="-DGGML_CUDA=ON"
# Use a different llama.cpp tag:
make build CMAKE_ARGS="-DLLAMA_CPP_TAG=b8517"
# Debug build:
make build CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Debug"
run_model is intentionally tiny; see main.cpp for the full source.
Usage: run_model -m <model.gguf> [options]
-m <path> path to GGUF model (required)
-p <text> prompt (default: "Hello, my name is")
-n <int> number of tokens to generate (default: 128)
-c <int> context size (default: 2048)
-t <int> threads (default: auto)
-ngl <int> GPU layers to offload (default: 999)
--temp <float> temperature, <=0 = greedy (default: 0.8)
--top-k <int> top-k (default: 40)
--top-p <float> top-p (default: 0.95)
--seed <uint> RNG seed (default: random)
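The option list above maps naturally onto a small struct plus a flag/value loop. The following is a minimal sketch of that idea and not the actual main.cpp: the names Options and parse_args, and the use of 0 to mean "auto" for threads, are assumptions made for illustration. Only the flags and their defaults come from the usage text.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical options struct; defaults mirror the usage text above.
struct Options {
    std::string model_path;                    // -m (required)
    std::string prompt = "Hello, my name is";  // -p
    int   n_predict    = 128;                  // -n
    int   n_ctx        = 2048;                 // -c
    int   n_threads    = 0;                    // -t (0 = auto; an assumption)
    int   n_gpu_layers = 999;                  // -ngl
    float temp         = 0.8f;                 // --temp (<= 0 means greedy)
    int   top_k        = 40;                   // --top-k
    float top_p        = 0.95f;                // --top-p
    std::optional<std::uint32_t> seed;         // --seed (unset = random)
};

// Parses "flag value" pairs (argv without the program name).
// Throws std::invalid_argument on unknown flags, missing values,
// non-numeric values, or a missing -m.
Options parse_args(const std::vector<std::string>& args) {
    Options o;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        if (i + 1 >= args.size())
            throw std::invalid_argument("missing value for " + flag);
        const std::string& val = args[++i];
        if      (flag == "-m")      o.model_path   = val;
        else if (flag == "-p")      o.prompt       = val;
        else if (flag == "-n")      o.n_predict    = std::stoi(val);
        else if (flag == "-c")      o.n_ctx        = std::stoi(val);
        else if (flag == "-t")      o.n_threads    = std::stoi(val);
        else if (flag == "-ngl")    o.n_gpu_layers = std::stoi(val);
        else if (flag == "--temp")  o.temp         = std::stof(val);
        else if (flag == "--top-k") o.top_k        = std::stoi(val);
        else if (flag == "--top-p") o.top_p        = std::stof(val);
        else if (flag == "--seed")  o.seed = static_cast<std::uint32_t>(std::stoul(val));
        else throw std::invalid_argument("unknown option " + flag);
    }
    if (o.model_path.empty())
        throw std::invalid_argument("-m <model.gguf> is required");
    return o;
}
```

A caller would typically build the vector from argv + 1 up to argv + argc and print the usage text when parse_args throws.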
safetensors is just a tensor container — it doesn't carry the architecture
description, the tokenizer merges, the chat template, or the metadata that
libllama needs at load time. The first time you run make convert we:
- Create a Python venv at .venv/ using a compatible Python.
- Install llama.cpp/requirements/requirements-convert_hf_to_gguf.txt.
- Run convert_hf_to_gguf.py R1-Distill-1.5B/ --outfile R1-Distill-1.5B.gguf --outtype f16.
The script reads config.json, the *.safetensors shards, and
tokenizer.json, and emits a single self-describing .gguf file. After that
Python is no longer needed.
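A quick way to sanity-check the converter's output before loading it is to look at the file header: a GGUF file starts with the four ASCII bytes "GGUF", followed by a 32-bit format version. The sketch below is illustrative only. The helper name looks_like_gguf is hypothetical, and it assumes the version field is stored little-endian, which is the usual case for files produced on common desktop hardware.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical helper: returns true if `path` starts with the GGUF magic.
// If `version` is non-null, it receives the uint32 that follows the magic,
// decoded as little-endian (an assumption; see the lead-in).
bool looks_like_gguf(const std::string& path, std::uint32_t* version = nullptr) {
    std::ifstream in(path, std::ios::binary);
    char magic[4];
    // A stream that failed to open also fails this read, so we return false.
    if (!in.read(magic, 4) || std::memcmp(magic, "GGUF", 4) != 0) return false;
    unsigned char v[4];
    if (!in.read(reinterpret_cast<char*>(v), 4)) return false;
    if (version) {
        *version = static_cast<std::uint32_t>(v[0])
                 | (static_cast<std::uint32_t>(v[1]) << 8)
                 | (static_cast<std::uint32_t>(v[2]) << 16)
                 | (static_cast<std::uint32_t>(v[3]) << 24);
    }
    return true;
}
```

This only checks the first eight bytes. It does not validate the tensors or metadata, so a file that passes can still fail inside llama_model_load_from_file.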
CMakeLists.txt uses FetchContent to clone llama.cpp at the pinned tag
(b9048) into build/_deps/llama_cpp-src, builds its CMake targets
(llama, ggml, plus backend libs), and links a single executable
run_model against them. On macOS the binary is placed in build/bin/
alongside the libllama.dylib / libggml*.dylib files and uses
@executable_path rpath so it runs without DYLD_LIBRARY_PATH.
The C++ driver is roughly:
ggml_backend_load_all(); // register backends
llama_model* model = llama_model_load_from_file(path, mp); // load weights
const llama_vocab* vocab = llama_model_get_vocab(model);
llama_context* ctx = llama_init_from_model(model, cp); // KV cache, etc.
llama_sampler* smpl = llama_sampler_chain_init(...); // top-k/p + temp + dist
auto tokens = tokenize(vocab, prompt, /*add_special=*/true);
llama_decode(ctx, llama_batch_get_one(tokens.data(), tokens.size())); // prefill
while (generated < n_predict) {
llama_token id = llama_sampler_sample(smpl, ctx, -1);
if (llama_vocab_is_eog(vocab, id)) break;
fputs(detokenize(vocab, id).c_str(), stdout);
llama_decode(ctx, llama_batch_get_one(&id, 1)); // 1-token decode
}
That's the entire generation loop; everything else (RoPE, GQA, RMSNorm,
SwiGLU, KV cache, Metal kernels, …) lives inside libllama and libggml.
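To make the "top-k/p + temp + dist" sampler comment concrete, here is a plain C++ sketch of what such a chain does to a logits vector. It is a simplified illustration and not libllama's implementation: the function names softmax_sorted and sample_token are hypothetical, and the step order (top-k, then top-p, then temperature, then a random draw) is an assumption based on that comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Candidate = std::pair<float, int>;  // (logit, token id)

// Softmax over candidates sorted by descending logit, at temperature `temp`.
std::vector<double> softmax_sorted(const std::vector<Candidate>& c, float temp) {
    std::vector<double> p(c.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        p[i] = std::exp((c[i].first - c[0].first) / temp);  // subtract max for stability
        sum += p[i];
    }
    for (double& x : p) x /= sum;
    return p;
}

// Hypothetical sampler: temp <= 0 is greedy; top_k <= 0 disables top-k.
int sample_token(const std::vector<float>& logits, float temp, int top_k,
                 float top_p, std::mt19937& rng) {
    assert(!logits.empty());
    if (temp <= 0.0f)  // greedy: pick the highest logit
        return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());

    std::vector<Candidate> cand;
    cand.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i)
        cand.emplace_back(logits[i], static_cast<int>(i));
    std::sort(cand.begin(), cand.end(),
              [](const Candidate& a, const Candidate& b) { return a.first > b.first; });

    // top-k: keep only the k most likely tokens.
    if (top_k > 0 && static_cast<std::size_t>(top_k) < cand.size()) cand.resize(top_k);

    // top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if (top_p < 1.0f) {
        const std::vector<double> p = softmax_sorted(cand, 1.0f);
        double cum = 0.0;
        std::size_t keep = cand.size();
        for (std::size_t i = 0; i < p.size(); ++i) {
            cum += p[i];
            if (cum >= top_p) { keep = i + 1; break; }
        }
        cand.resize(keep);
    }

    // temp + dist: rescale the survivors and draw one at random.
    const std::vector<double> p = softmax_sorted(cand, temp);
    std::discrete_distribution<int> dist(p.begin(), p.end());
    return cand[dist(rng)].second;
}
```

With temp <= 0 the function returns the argmax, which matches the greedy behaviour described for --temp 0. Seeding the std::mt19937 with a fixed value plays the role of --seed.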
make convert fails with No matching distribution found for torch~=2.6.0.
Your default python3 is probably 3.13+. Re-run with a 3.10 – 3.12 interpreter:
brew install python@3.12
make convert PYTHON="$(brew --prefix python@3.12)/bin/python3.12"
Failed to clone repository during make configure.
The CMake FetchContent step needs network access and write access to
build/_deps/llama_cpp-src/.git/hooks/. Re-run outside any sandbox / with
permission to write there.
prompt (N tokens) >= context (M).
Increase CTX: make run CTX=8192 PROMPT="…" (max 131072 for this model).
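The check behind this message is simple arithmetic on token counts: the prompt must leave at least one free slot in the context for generation. A minimal sketch follows, using a hypothetical helper clamp_n_predict. Whether main.cpp also clamps the generation length to the remaining space is an assumption made here for illustration.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper: rejects prompts that fill the whole context and
// otherwise returns how many tokens can still be generated.
int clamp_n_predict(int n_prompt, int n_predict, int n_ctx) {
    if (n_prompt >= n_ctx) {
        throw std::runtime_error("prompt (" + std::to_string(n_prompt) +
                                 " tokens) >= context (" + std::to_string(n_ctx) + ")");
    }
    // Assumption: generation stops once the context is full.
    return std::min(n_predict, n_ctx - n_prompt);
}
```

For example, a 2000-token prompt with CTX=2048 leaves room for only 48 generated tokens, so raising CTX helps even when the prompt itself technically fits.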
llama_decode failed or model loads but produces gibberish.
Make sure the GGUF was produced by convert_hf_to_gguf.py from the same
llama.cpp tag you're building against. If you bump LLAMA_CPP_TAG, run
make distclean && make convert to regenerate the GGUF.
Slow CPU-only inference.
On Apple Silicon make sure Metal stayed enabled (-DGGML_METAL=ON is the
default in CMakeLists.txt); verify by looking for Metal framework found
in the configure log. Otherwise quantize: make distclean && make convert OUTTYPE=q4_k_m.
- This wrapper code (CMakeLists.txt, Makefile, main.cpp): released into the public domain / 0-BSD; do whatever you want.
- llama.cpp is MIT-licensed (fetched at build time, not vendored).
- The model weights in R1-Distill-1.5B/ are governed by their original Hugging Face license; see R1-Distill-1.5B/LICENSE.