run-raw-model

Download the model

hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./R1-Distill-1.5B

A minimal C++ example that loads a Hugging Face transformer checkpoint (the DeepSeek-R1-Distill-Qwen-1.5B model in R1-Distill-1.5B/) and runs token-by-token inference, with no Python in the hot path.

It is built on top of llama.cpp: the project pins a specific tag, fetches it via CMake FetchContent, links a tiny main.cpp against libllama, and uses Apple Metal / Accelerate / BLAS where available.

prompt ──► [main.cpp]
              │   llama_model_load_from_file
              │   llama_init_from_model
              │   llama_tokenize
              │   llama_decode  ◄── KV cache, attention, MLP (libllama + ggml)
              │   llama_sampler_sample
              ▼
            tokens (streamed to stdout)

.
├── CMakeLists.txt        # fetches llama.cpp@b9048, builds bin/run_model
├── Makefile              # configure / build / convert / run wrapper
├── main.cpp              # ~200-line C++ inference driver (llama.cpp C API)
├── R1-Distill-1.5B/      # the original Hugging Face checkpoint (safetensors)
└── R1-Distill-1.5B.gguf  # produced by `make convert` (gitignored)

Prerequisites

Tool	Minimum	Notes
CMake	3.18	`brew install cmake` or your distro package
C++ toolchain	C++17 capable	AppleClang / clang / gcc all work
Git	any recent version	needed for `FetchContent` to clone llama.cpp
Python	3.10 – 3.12	only for the one-time `safetensors → gguf` conversion
Disk space	~8 GB free	source + build artefacts + GGUF model

Why a specific Python range? llama.cpp's converter pins numpy~=1.26 and torch~=2.6. Those wheels are only published for Python ≤ 3.12. The Makefile auto-detects python3.12 / 3.11 / 3.10 on PATH and falls back to plain python3; override with make convert PYTHON=/path/to/python.

The C++ binary itself has no Python or PyTorch dependency at runtime. It depends only on libllama / libggml*, which are built alongside it and linked either statically or as shared libraries.

Quick start

# 1. Build the C++ binary (clones llama.cpp@b9048 on first run, ~2 minutes).
make build

# 2. Convert the HF safetensors checkpoint to GGUF (one-time, ~2 minutes).
make convert

# 3. Run inference.
make run PROMPT="Why is the sky blue?" N_PREDICT=200

Or call the binary directly:

./build/bin/run_model \
    -m R1-Distill-1.5B.gguf \
    -p "The capital of France is" \
    -n 128 \
    --temp 0.8

Sample output (greedy, --temp 0):

The capital of France is Paris, the capital of Germany is Berlin,
the capital of Italy is Rome, the capital of Spain is Madrid, ...

Makefile targets

Target	What it does
`make help`	Print the target list and current variable values
`make configure`	Run CMake configure; clones `llama.cpp` into `build/_deps/`
`make build`	Configure + compile `build/bin/run_model`
`make convert`	Create `.venv`, install converter deps, write `R1-Distill-1.5B.gguf`
`make run`	Build (if needed) and run inference with the variables below
`make clean`	Delete `build/`
`make distclean`	Also delete `.venv/` and the generated `.gguf`

Tunable variables

All can be set on the make command line (make run TEMP=0).

Variable	Default	Meaning
`MODEL_DIR`	`R1-Distill-1.5B`	HF checkpoint directory (must contain `config.json`)
`GGUF`	`$(MODEL_DIR).gguf`	Output path for the converted model
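`OUTTYPE`	`f16`	Conversion dtype: `f32`, `f16`, `bf16`, `q8_0`, `tq1_0`, … (K-quants such as `q4_k_m` are not converter output types; they need llama.cpp's separate `llama-quantize` tool)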
`PROMPT`	`Hello, my name is`	Prompt fed to the model
`N_PREDICT`	`128`	Max tokens to generate
`TEMP`	`0.8`	Sampling temperature (`<=0` = greedy)
`CTX`	`2048`	Context window (model trained for 131072)
`NGL`	`999`	GPU layers to offload (`0` = CPU only)
`PYTHON`	auto-detected	Python interpreter used to create `.venv` for conversion
`CMAKE_ARGS`	(empty)	Extra flags passed to `cmake -S . -B build`
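
The CTX row matters mostly because the KV cache grows linearly with the context size. The sketch below estimates that cost. `KvShape`, `kv_cache_bytes`, and `kAssumedShape` are hypothetical names that do not appear in main.cpp, and the layer / head counts are assumptions about this model that should be checked against `R1-Distill-1.5B/config.json`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of main.cpp or llama.cpp): rough size of an
// unquantized KV cache.
struct KvShape {
    std::uint64_t n_layers;           // transformer blocks
    std::uint64_t n_kv_heads;         // KV heads (GQA shares them across query heads)
    std::uint64_t head_dim;           // dimension of each head
    std::uint64_t bytes_per_element;  // 2 for an f16 cache
};

// Keys and values are both stored per layer, per KV head, per position,
// hence the leading factor of 2.
constexpr std::uint64_t kv_cache_bytes(const KvShape& s, std::uint64_t n_ctx) {
    return 2 * s.n_layers * s.n_kv_heads * s.head_dim * s.bytes_per_element * n_ctx;
}

// ASSUMED values for DeepSeek-R1-Distill-Qwen-1.5B; confirm against config.json
// (num_hidden_layers, num_key_value_heads, hidden_size / num_attention_heads).
constexpr KvShape kAssumedShape{28, 2, 128, 2};
```

Under those assumed values, `kv_cache_bytes(kAssumedShape, 2048)` comes to 56 MiB, while the full 131072-token window would need about 3.5 GiB for the cache alone, which is one reason to keep the default small.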

Build-time switches

Pass anything CMake / llama.cpp understands via CMAKE_ARGS:

# CPU only, no Metal:
make build CMAKE_ARGS="-DGGML_METAL=OFF"

# CUDA on Linux:
make build CMAKE_ARGS="-DGGML_CUDA=ON"

# Use a different llama.cpp tag:
make build CMAKE_ARGS="-DLLAMA_CPP_TAG=b8517"

# Debug build:
make build CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Debug"

CLI reference

run_model is intentionally tiny — see main.cpp for the full source.

Usage: run_model -m <model.gguf> [options]
  -m   <path>     path to GGUF model (required)
  -p   <text>     prompt (default: "Hello, my name is")
  -n   <int>      number of tokens to generate (default: 128)
  -c   <int>      context size (default: 2048)
  -t   <int>      threads (default: auto)
  -ngl <int>      GPU layers to offload (default: 999)
  --temp <float>  temperature, <=0 = greedy (default: 0.8)
  --top-k <int>   top-k (default: 40)
  --top-p <float> top-p (default: 0.95)
  --seed <uint>   RNG seed (default: random)
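
As an illustration of how little machinery these flags need, here is a minimal sketch of a parser for them. It is not the code in main.cpp: `Options` and `parse_args` are hypothetical names, and only the flag spellings and defaults are taken from the usage text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical options struct; defaults mirror the usage text.
struct Options {
    std::string model;                         // -m (required)
    std::string prompt = "Hello, my name is";  // -p
    int n_predict = 128;                       // -n
    int n_ctx = 2048;                          // -c
    int n_threads = 0;                         // -t, 0 stands for "auto" in this sketch
    int n_gpu_layers = 999;                    // -ngl
    float temp = 0.8f;                         // --temp
    int top_k = 40;                            // --top-k
    float top_p = 0.95f;                       // --top-p
    std::optional<std::uint32_t> seed;         // --seed, empty means "random"
};

// Parses "flag value" pairs. `args` excludes the program name (argv[0]).
Options parse_args(const std::vector<std::string>& args) {
    Options o;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        if (i + 1 >= args.size()) throw std::invalid_argument("missing value for " + flag);
        const std::string& v = args[++i];
        if      (flag == "-m")      o.model = v;
        else if (flag == "-p")      o.prompt = v;
        else if (flag == "-n")      o.n_predict = std::stoi(v);
        else if (flag == "-c")      o.n_ctx = std::stoi(v);
        else if (flag == "-t")      o.n_threads = std::stoi(v);
        else if (flag == "-ngl")    o.n_gpu_layers = std::stoi(v);
        else if (flag == "--temp")  o.temp = std::stof(v);
        else if (flag == "--top-k") o.top_k = std::stoi(v);
        else if (flag == "--top-p") o.top_p = std::stof(v);
        else if (flag == "--seed")  o.seed = static_cast<std::uint32_t>(std::stoul(v));
        else throw std::invalid_argument("unknown option " + flag);
    }
    if (o.model.empty()) throw std::invalid_argument("-m <model.gguf> is required");
    return o;
}
```

From `main`, this would be called as `parse_args(std::vector<std::string>(argv + 1, argv + argc))`, printing the usage text when it throws.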

How it works

1. Conversion (`make convert`)

safetensors is just a tensor container — it doesn't carry the architecture description, the tokenizer merges, the chat template, or the metadata that libllama needs at load time. The first time you run make convert we:

1. Create a Python venv at .venv/ using a compatible Python.
2. Install llama.cpp/requirements/requirements-convert_hf_to_gguf.txt.
3. Run convert_hf_to_gguf.py R1-Distill-1.5B/ --outfile R1-Distill-1.5B.gguf --outtype f16.

The script reads config.json, the *.safetensors shards, and tokenizer.json, and emits a single self-describing .gguf file. After that Python is no longer needed.
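
To see why safetensors alone is not enough, it helps to look at what the file actually contains. A .safetensors file starts with an 8-byte little-endian length, followed by that many bytes of JSON listing each tensor's name, dtype, shape, and byte offsets; the raw tensor bytes follow. The sketch below reads that JSON header. `read_safetensors_header` is a hypothetical helper, not part of this project, and the 100 MiB cap is an arbitrary guard chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Returns the JSON header of a .safetensors file as a string.
// Layout: [8-byte little-endian N][N bytes of UTF-8 JSON][raw tensor data].
std::string read_safetensors_header(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    unsigned char len_bytes[8];
    if (!in.read(reinterpret_cast<char*>(len_bytes), 8))
        throw std::runtime_error("file too short for a header length");

    // Assemble the length byte by byte so the code is endian-independent.
    std::uint64_t n = 0;
    for (int i = 7; i >= 0; --i) n = (n << 8) | len_bytes[i];

    // Arbitrary sanity cap for this sketch, to avoid huge allocations on bad input.
    if (n > (100ULL << 20)) throw std::runtime_error("implausible header length");

    std::string json(static_cast<std::size_t>(n), '\0');
    if (!in.read(&json[0], static_cast<std::streamsize>(n)))
        throw std::runtime_error("truncated header");
    return json;
}
```

Printing the result for one of the shards in R1-Distill-1.5B/ should show tensor names, dtypes, shapes, and offsets, but no tokenizer or architecture hyperparameters; those come from the separate JSON files the converter reads.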

2. Build (`make build`)

CMakeLists.txt uses FetchContent to clone llama.cpp at the pinned tag (b9048) into build/_deps/llama_cpp-src, builds its CMake targets (llama, ggml, plus backend libs), and links a single executable run_model against them. On macOS the binary is placed in build/bin/ alongside the libllama.dylib / libggml*.dylib files and uses @executable_path rpath so it runs without DYLD_LIBRARY_PATH.

3. Inference (`main.cpp`)

The C++ driver is roughly:

ggml_backend_load_all();                                    // register backends
llama_model* model = llama_model_load_from_file(path, mp);  // load weights
const llama_vocab* vocab = llama_model_get_vocab(model);
llama_context* ctx = llama_init_from_model(model, cp);      // KV cache, etc.

llama_sampler* smpl = llama_sampler_chain_init(...);        // top-k/p + temp + dist
auto tokens = tokenize(vocab, prompt, /*add_special=*/true);

llama_decode(ctx, llama_batch_get_one(tokens.data(), tokens.size())); // prefill

for (int generated = 0; generated < n_predict; ++generated) {
    llama_token id = llama_sampler_sample(smpl, ctx, -1);
    if (llama_vocab_is_eog(vocab, id)) break;               // stop at end-of-generation
    fputs(detokenize(vocab, id).c_str(), stdout);
    fflush(stdout);                                         // stream each token immediately
    llama_decode(ctx, llama_batch_get_one(&id, 1));         // 1-token decode
}

That's the entire generation loop — everything else (RoPE, GQA, RMSNorm, SwiGLU, KV cache, Metal kernels, …) lives inside libllama and libggml.
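
The sampler chain is the one step in that loop whose behaviour the CLI flags directly control, so a standalone sketch may help. The code below mimics a top-k, top-p, temperature, then random-draw pipeline over a plain vector of logits. It is an illustration in standard C++ only, not llama.cpp's implementation; `Candidate`, `softmax`, and `sample_token` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Candidate {
    int id;       // token id
    float logit;  // unnormalized score
};

// Softmax over the candidates' logits (max subtracted for numerical stability).
static std::vector<double> softmax(const std::vector<Candidate>& c) {
    double max_logit = c.front().logit;
    for (const Candidate& x : c) max_logit = std::max(max_logit, static_cast<double>(x.logit));
    std::vector<double> p(c.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        p[i] = std::exp(c[i].logit - max_logit);
        sum += p[i];
    }
    for (double& v : p) v /= sum;
    return p;
}

int sample_token(const std::vector<float>& logits, int top_k, float top_p,
                 float temp, std::mt19937& rng) {
    assert(!logits.empty());
    // temp <= 0 means greedy: return the highest-scoring token.
    if (temp <= 0.0f) {
        return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    std::vector<Candidate> c(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) c[i] = {static_cast<int>(i), logits[i]};

    // top-k: sort descending and keep the k best candidates.
    std::sort(c.begin(), c.end(),
              [](const Candidate& a, const Candidate& b) { return a.logit > b.logit; });
    if (top_k > 0 && static_cast<std::size_t>(top_k) < c.size()) c.resize(static_cast<std::size_t>(top_k));

    // top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if (top_p < 1.0f) {
        const std::vector<double> p = softmax(c);
        double cumulative = 0.0;
        std::size_t keep = c.size();
        for (std::size_t i = 0; i < p.size(); ++i) {
            cumulative += p[i];
            if (cumulative >= top_p) { keep = i + 1; break; }
        }
        c.resize(keep);
    }

    // temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    for (Candidate& x : c) x.logit /= temp;

    // dist: draw one of the survivors in proportion to its probability.
    const std::vector<double> p = softmax(c);
    std::discrete_distribution<int> dist(p.begin(), p.end());
    return c[static_cast<std::size_t>(dist(rng))].id;
}
```

With `top_k = 1` or `temp <= 0` the result is deterministic, which matches the greedy behaviour described for `--temp 0`; a fixed seed for the `std::mt19937` plays the role of `--seed`.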

Troubleshooting

make convert fails with No matching distribution found for torch~=2.6.0. Your default python3 is probably 3.13+. Re-run with a 3.10 – 3.12 interpreter:

brew install python@3.12
make convert PYTHON="$(brew --prefix python@3.12)/bin/python3.12"

Failed to clone repository during make configure. The CMake FetchContent step needs network access and write access to build/_deps/llama_cpp-src/.git/hooks/. Re-run outside any sandbox / with permission to write there.

prompt (N tokens) >= context (M). Increase CTX: make run CTX=8192 PROMPT="…" (max 131072 for this model).
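
The check behind that message can be sketched as below. Both functions are hypothetical helpers, not copied from main.cpp; in particular, whether the real driver also clamps the number of generated tokens to the remaining context is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>

// Returns an error message when the prompt cannot fit in the context window,
// or an empty optional when it fits.
std::optional<std::string> check_prompt_fits(int n_prompt, int n_ctx) {
    if (n_prompt >= n_ctx) {
        return "prompt (" + std::to_string(n_prompt) + " tokens) >= context (" +
               std::to_string(n_ctx) + ")";
    }
    return std::nullopt;
}

// Tokens that can still be generated before the context is full (assumed policy).
int max_new_tokens(int n_prompt, int n_predict, int n_ctx) {
    return std::max(0, std::min(n_predict, n_ctx - n_prompt));
}
```

For example, a 2000-token prompt with the default CTX of 2048 passes the check but leaves room for only 48 new tokens, so raising CTX can be worthwhile even when no error is printed.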

llama_decode failed or model loads but produces gibberish. Make sure the GGUF was produced by convert_hf_to_gguf.py from the same llama.cpp tag you're building against. If you bump LLAMA_CPP_TAG, run make distclean && make convert to regenerate the GGUF.
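
Before regenerating, a quick sanity check is to confirm the file is at least a well-formed GGUF. A GGUF file begins with the four bytes `GGUF`, then a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count, all little-endian. The sketch below reads just those fields; `GgufHeader`, `read_le`, and `read_gguf_header` are hypothetical names, and the check cannot tell which llama.cpp tag produced the file.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

struct GgufHeader {
    std::uint32_t version;
    std::uint64_t tensor_count;
    std::uint64_t metadata_kv_count;
};

// Reads an n_bytes-wide little-endian unsigned integer; false on short read.
static bool read_le(std::istream& in, int n_bytes, std::uint64_t& out) {
    out = 0;
    for (int i = 0; i < n_bytes; ++i) {
        const int byte = in.get();
        if (byte == std::char_traits<char>::eof()) return false;
        out |= static_cast<std::uint64_t>(static_cast<unsigned char>(byte)) << (8 * i);
    }
    return true;
}

// Returns the fixed-size header fields, or an empty optional if the file is
// missing, truncated, or does not start with the GGUF magic.
std::optional<GgufHeader> read_gguf_header(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char magic[4] = {};
    if (!in.read(magic, 4) || std::string(magic, 4) != "GGUF") return std::nullopt;

    std::uint64_t version = 0, tensors = 0, kvs = 0;
    if (!read_le(in, 4, version) || !read_le(in, 8, tensors) || !read_le(in, 8, kvs))
        return std::nullopt;
    return GgufHeader{static_cast<std::uint32_t>(version), tensors, kvs};
}
```

An empty result, or a tensor count of zero, points to an interrupted or failed conversion rather than a version mismatch.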

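Slow CPU-only inference. On Apple Silicon make sure Metal stayed enabled (-DGGML_METAL=ON is the default in CMakeLists.txt); verify by looking for Metal framework found in the configure log. Otherwise convert to a smaller type: make distclean && make convert OUTTYPE=q8_0. K-quants such as q4_k_m are not produced by convert_hf_to_gguf.py; they require llama.cpp's separate llama-quantize tool.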
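
Token generation on CPU is largely limited by how many bytes of weights must be read per token, so the output type translates fairly directly into speed. The sketch below gives a rough size estimate for the simple converter types. `bytes_per_weight` and `estimated_gib` are hypothetical helpers; the estimate ignores metadata and the tensors a quantized file keeps at higher precision, and K-quants are deliberately not covered.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Approximate storage cost of one weight for a given output type.
double bytes_per_weight(const std::string& outtype) {
    if (outtype == "f32") return 4.0;
    if (outtype == "f16" || outtype == "bf16") return 2.0;
    // q8_0 stores blocks of 32 int8 weights plus one f16 scale: 34 bytes per 32 weights.
    if (outtype == "q8_0") return 34.0 / 32.0;
    throw std::invalid_argument("outtype not handled by this sketch: " + outtype);
}

// Rough model size in GiB for n_params weights stored as `outtype`.
double estimated_gib(std::uint64_t n_params, const std::string& outtype) {
    return static_cast<double>(n_params) * bytes_per_weight(outtype) /
           (1024.0 * 1024.0 * 1024.0);
}
```

Assuming roughly 1.8 billion parameters for this model (an assumption; check the model card), f16 works out to a little over 3 GiB and q8_0 to a little under 2 GiB.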

License

This wrapper code (CMakeLists.txt, Makefile, main.cpp): released into the public domain / 0-BSD — do whatever you want.
llama.cpp is MIT-licensed (fetched at build time, not vendored).
The model weights in R1-Distill-1.5B/ are governed by their original Hugging Face license — see R1-Distill-1.5B/LICENSE.
