quantrpeter/run-raw-model

run-raw-model

Download the model

hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./R1-Distill-1.5B

A minimal C++ example that loads a Hugging Face transformer checkpoint (the DeepSeek-R1-Distill-Qwen-1.5B model in R1-Distill-1.5B/) and runs token-by-token inference, with no Python in the hot path.

It is built on top of llama.cpp: the project pins a specific tag, fetches it via CMake FetchContent, links a tiny main.cpp against libllama, and uses Apple Metal / Accelerate / BLAS where available.

prompt ──► [main.cpp]
              │   llama_model_load_from_file
              │   llama_init_from_model
              │   llama_tokenize
              │   llama_decode  ◄── KV cache, attention, MLP (libllama + ggml)
              │   llama_sampler_sample
              ▼
            tokens (streamed to stdout)

Contents

.
├── CMakeLists.txt        # fetches llama.cpp@b9048, builds bin/run_model
├── Makefile              # configure / build / convert / run wrapper
├── main.cpp              # ~200-line C++ inference driver (llama.cpp C API)
├── R1-Distill-1.5B/      # the original Hugging Face checkpoint (safetensors)
└── R1-Distill-1.5B.gguf  # produced by `make convert` (gitignored)

Prerequisites

Tool           Minimum             Notes
CMake          3.18                brew install cmake or your distro package
C++ toolchain  C++17 capable       AppleClang / clang / gcc all work
Git            any recent version  needed for FetchContent to clone llama.cpp
Python         3.10 – 3.12         only for the one-time safetensors → gguf conversion
Disk space     ~8 GB free          source + build artefacts + GGUF model

Why a specific Python range? llama.cpp's converter pins numpy~=1.26 and torch~=2.6. Those wheels are only published for Python ≤ 3.12. The Makefile auto-detects python3.12 / 3.11 / 3.10 on PATH and falls back to plain python3; override with make convert PYTHON=/path/to/python.

The C++ binary itself has no Python or PyTorch dependency at runtime. It depends only on libllama and libggml*, which are built with it and either linked in statically or shipped next to it as shared libraries.


Quick start

# 1. Build the C++ binary (clones llama.cpp@b9048 on first run, ~2 minutes).
make build

# 2. Convert the HF safetensors checkpoint to GGUF (one-time, ~2 minutes).
make convert

# 3. Run inference.
make run PROMPT="Why is the sky blue?" N_PREDICT=200

Or call the binary directly:

./build/bin/run_model \
    -m R1-Distill-1.5B.gguf \
    -p "The capital of France is" \
    -n 128 \
    --temp 0.8

Sample output for the same prompt with greedy decoding (--temp 0 instead of --temp 0.8):

The capital of France is Paris, the capital of Germany is Berlin,
the capital of Italy is Rome, the capital of Spain is Madrid, ...

Makefile targets

Target          What it does
make help       Print the target list and current variable values
make configure  Run CMake configure; clones llama.cpp into build/_deps/
make build      Configure + compile build/bin/run_model
make convert    Create .venv, install converter deps, write R1-Distill-1.5B.gguf
make run        Build (if needed) and run inference with the variables below
make clean      Delete build/
make distclean  Also delete .venv/ and the generated .gguf

Tunable variables

All can be set on the make command line (make run TEMP=0).

Variable    Default            Meaning
MODEL_DIR   R1-Distill-1.5B    HF checkpoint directory (must contain config.json)
GGUF        $(MODEL_DIR).gguf  Output path for the converted model
OUTTYPE     f16                Conversion dtype for --outtype: f32, f16, bf16, q8_0, tq1_0, tq2_0, auto
PROMPT      Hello, my name is  Prompt fed to the model
N_PREDICT   128                Max tokens to generate
TEMP        0.8                Sampling temperature (<=0 = greedy)
CTX         2048               Context window (model trained for 131072)
NGL         999                GPU layers to offload (0 = CPU only)
PYTHON      auto-detected      Python interpreter used to create .venv for conversion
CMAKE_ARGS  (empty)            Extra flags passed to cmake -S . -B build
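
The default CTX is far below the 131072 tokens the model was trained for, largely because the KV cache grows linearly with the context size. The sketch below shows the arithmetic only. The dimensions (28 layers, 2 KV heads, head size 128, f16 cache entries) are assumptions to check against R1-Distill-1.5B/config.json, and the names KvDims and kv_cache_bytes are hypothetical, not part of main.cpp or llama.cpp.

```cpp
#include <cstdint>

// Rough KV-cache size estimate: one K and one V vector per layer per token.
// All dimensions below are illustrative assumptions, not read from the model.
struct KvDims {
    std::uint64_t n_layer      = 28;   // assumed number of transformer layers
    std::uint64_t n_kv_head    = 2;    // assumed number of KV heads (GQA)
    std::uint64_t head_dim     = 128;  // assumed per-head dimension
    std::uint64_t bytes_per_el = 2;    // f16 cache entries
};

// Bytes needed to hold K and V for n_ctx tokens.
inline std::uint64_t kv_cache_bytes(std::uint64_t n_ctx, const KvDims& d = KvDims{}) {
    return 2 * d.n_layer * d.n_kv_head * d.head_dim * d.bytes_per_el * n_ctx;
}
```

Under those assumptions the cache needs about 56 MiB at CTX=2048 and about 3.5 GiB at CTX=131072, so raise CTX only as far as your prompts require.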

Build-time switches

Pass anything CMake / llama.cpp understands via CMAKE_ARGS:

# CPU only, no Metal:
make build CMAKE_ARGS="-DGGML_METAL=OFF"

# CUDA on Linux:
make build CMAKE_ARGS="-DGGML_CUDA=ON"

# Use a different llama.cpp tag:
make build CMAKE_ARGS="-DLLAMA_CPP_TAG=b8517"

# Debug build:
make build CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Debug"

CLI reference

run_model is intentionally tiny — see main.cpp for the full source.

Usage: run_model -m <model.gguf> [options]
  -m   <path>     path to GGUF model (required)
  -p   <text>     prompt (default: "Hello, my name is")
  -n   <int>      number of tokens to generate (default: 128)
  -c   <int>      context size (default: 2048)
  -t   <int>      threads (default: auto)
  -ngl <int>      GPU layers to offload (default: 999)
  --temp <float>  temperature, <=0 = greedy (default: 0.8)
  --top-k <int>   top-k (default: 40)
  --top-p <float> top-p (default: 0.95)
  --seed <uint>   RNG seed (default: random)
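
For orientation, the flags above could be parsed with something like the following. This is a minimal sketch, not the code in main.cpp: the Options struct and parse_args function are hypothetical names, and only the flag spellings and defaults come from the usage text.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical option struct mirroring the documented flags and defaults.
struct Options {
    std::string model;                         // -m (required)
    std::string prompt = "Hello, my name is";  // -p
    int   n_predict    = 128;                  // -n
    int   n_ctx        = 2048;                 // -c
    int   n_threads    = 0;                    // -t, 0 = let the runtime decide
    int   n_gpu_layers = 999;                  // -ngl
    float temp         = 0.8f;                 // --temp, <= 0 means greedy
    int   top_k        = 40;                   // --top-k
    float top_p        = 0.95f;                // --top-p
    std::optional<std::uint32_t> seed;         // --seed, empty = random
};

// Parse "flag value" pairs; throws std::invalid_argument on bad input.
inline Options parse_args(const std::vector<std::string>& args) {
    Options o;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        if (i + 1 >= args.size())
            throw std::invalid_argument("missing value for " + flag);
        const std::string& val = args[++i];
        if      (flag == "-m")      o.model        = val;
        else if (flag == "-p")      o.prompt       = val;
        else if (flag == "-n")      o.n_predict    = std::stoi(val);
        else if (flag == "-c")      o.n_ctx        = std::stoi(val);
        else if (flag == "-t")      o.n_threads    = std::stoi(val);
        else if (flag == "-ngl")    o.n_gpu_layers = std::stoi(val);
        else if (flag == "--temp")  o.temp         = std::stof(val);
        else if (flag == "--top-k") o.top_k        = std::stoi(val);
        else if (flag == "--top-p") o.top_p        = std::stof(val);
        else if (flag == "--seed")  o.seed = static_cast<std::uint32_t>(std::stoul(val));
        else throw std::invalid_argument("unknown flag " + flag);
    }
    if (o.model.empty())
        throw std::invalid_argument("-m <model.gguf> is required");
    return o;
}
```

In a real main, the argument vector would be built as std::vector<std::string>(argv + 1, argv + argc) before calling parse_args.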

How it works

1. Conversion (make convert)

safetensors is just a tensor container — it doesn't carry the architecture description, the tokenizer merges, the chat template, or the metadata that libllama needs at load time. The first time you run make convert we:

  1. Create a Python venv at .venv/ using a compatible Python.
  2. Install llama.cpp/requirements/requirements-convert_hf_to_gguf.txt.
  3. Run convert_hf_to_gguf.py R1-Distill-1.5B/ --outfile R1-Distill-1.5B.gguf --outtype f16.

The script reads config.json, the *.safetensors shards, and tokenizer.json, and emits a single self-describing .gguf file. After that Python is no longer needed.

2. Build (make build)

CMakeLists.txt uses FetchContent to clone llama.cpp at the pinned tag (b9048) into build/_deps/llama_cpp-src, builds its CMake targets (llama, ggml, plus backend libs), and links a single executable run_model against them. On macOS the binary is placed in build/bin/ alongside the libllama.dylib / libggml*.dylib files and uses @executable_path rpath so it runs without DYLD_LIBRARY_PATH.

3. Inference (main.cpp)

The C++ driver is roughly:

ggml_backend_load_all();                                    // register backends
llama_model* model = llama_model_load_from_file(path, mp);  // load weights
const llama_vocab* vocab = llama_model_get_vocab(model);
llama_context* ctx = llama_init_from_model(model, cp);      // KV cache, etc.

llama_sampler* smpl = llama_sampler_chain_init(...);        // top-k/p + temp + dist
auto tokens = tokenize(vocab, prompt, /*add_special=*/true);

llama_decode(ctx, llama_batch_get_one(tokens.data(), tokens.size())); // prefill

for (int generated = 0; generated < n_predict; ++generated) {
    llama_token id = llama_sampler_sample(smpl, ctx, -1);   // sample from the last logits
    if (llama_vocab_is_eog(vocab, id)) break;               // stop at end-of-generation
    fputs(detokenize(vocab, id).c_str(), stdout);           // stream the piece to stdout
    llama_decode(ctx, llama_batch_get_one(&id, 1));         // 1-token decode
}

That's the entire generation loop — everything else (RoPE, GQA, RMSNorm, SwiGLU, KV cache, Metal kernels, …) lives inside libllama and libggml.
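
To make the "top-k/p + temp + dist" chain less opaque, here is a plain-C++ illustration of what such a chain does to a logits vector. It is a conceptual sketch only and not libllama's implementation; Candidate, softmax, and sample_token are hypothetical names, and the filter order simply follows the comment in the driver outline.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One vocabulary entry with its raw score.
struct Candidate {
    int   id;
    float logit;
};

// Softmax over candidates sorted by descending logit, at temperature t > 0.
inline std::vector<double> softmax(const std::vector<Candidate>& c, float t) {
    std::vector<double> p(c.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        p[i] = std::exp((c[i].logit - c[0].logit) / t);  // subtract max for stability
        sum += p[i];
    }
    for (double& x : p) x /= sum;
    return p;
}

// Precondition: logits is non-empty. temp <= 0 means greedy (argmax).
inline int sample_token(const std::vector<float>& logits, int top_k, float top_p,
                        float temp, std::mt19937& rng) {
    std::vector<Candidate> c(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i)
        c[i] = Candidate{static_cast<int>(i), logits[i]};
    std::sort(c.begin(), c.end(),
              [](const Candidate& a, const Candidate& b) { return a.logit > b.logit; });

    if (temp <= 0.0f) return c.front().id;              // greedy decoding

    if (top_k > 0 && static_cast<std::size_t>(top_k) < c.size())
        c.resize(static_cast<std::size_t>(top_k));      // top-k: keep the k best

    const std::vector<double> p = softmax(c, 1.0f);     // top-p works on probabilities
    double cum = 0.0;
    std::size_t keep = c.size();
    for (std::size_t i = 0; i < p.size(); ++i) {
        cum += p[i];
        if (cum >= top_p) { keep = i + 1; break; }      // smallest prefix covering top_p
    }
    c.resize(keep);

    const std::vector<double> q = softmax(c, temp);     // temperature rescales the logits
    std::discrete_distribution<int> dist(q.begin(), q.end());
    return c[static_cast<std::size_t>(dist(rng))].id;   // draw from what is left
}
```

With temp <= 0 the function reduces to an argmax, which matches the "<=0 = greedy" convention of the --temp flag.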


Troubleshooting

make convert fails with No matching distribution found for torch~=2.6.0. Your default python3 is probably 3.13+. Re-run with a 3.10 – 3.12 interpreter:

brew install python@3.12
make convert PYTHON="$(brew --prefix python@3.12)/bin/python3.12"

Failed to clone repository during make configure. The CMake FetchContent step needs network access and write access to build/_deps/llama_cpp-src/.git/hooks/. Re-run outside any sandbox / with permission to write there.

prompt (N tokens) >= context (M). Increase CTX: make run CTX=8192 PROMPT="…" (max 131072 for this model).
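
The check behind that message can be pictured as follows. This is a sketch of the idea, assuming the driver simply compares the prompt length with the context size; generation_budget is a hypothetical name and main.cpp may handle the limit differently.

```cpp
#include <algorithm>

// Returns how many new tokens fit after the prompt, capped at n_predict,
// or -1 when the prompt alone already fills the context window.
inline int generation_budget(int n_prompt, int n_ctx, int n_predict) {
    if (n_prompt >= n_ctx) return -1;               // "prompt >= context" case
    return std::min(n_predict, n_ctx - n_prompt);   // room left for generation
}
```

For example, a 2000-token prompt with CTX=2048 leaves room for at most 48 generated tokens, whatever N_PREDICT says.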

llama_decode failed or model loads but produces gibberish. Make sure the GGUF was produced by convert_hf_to_gguf.py from the same llama.cpp tag you're building against. If you bump LLAMA_CPP_TAG, run make distclean && make convert to regenerate the GGUF.

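Slow CPU-only inference. On Apple Silicon, make sure Metal stayed enabled (-DGGML_METAL=ON is the default in CMakeLists.txt); verify by looking for Metal framework found in the configure log. Otherwise convert to a smaller dtype: make distclean && make convert OUTTYPE=q8_0. Note that convert_hf_to_gguf.py's --outtype does not accept k-quants such as q4_k_m; those are produced from an f16 GGUF with llama.cpp's llama-quantize tool.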


License

  • This wrapper code (CMakeLists.txt, Makefile, main.cpp): released into the public domain / 0-BSD — do whatever you want.
  • llama.cpp is MIT-licensed (fetched at build time, not vendored).
  • The model weights in R1-Distill-1.5B/ are governed by their original Hugging Face license — see R1-Distill-1.5B/LICENSE.
