reft.cpp


License: CC BY-NC 4.0

What is reft.cpp

  • A high-performance, easy-to-use LLM/LM serving tool for inference and training. Ops, op fusion, op optimization, the serving framework, and the training tools are all implemented in C++ without Python/PyTorch, inspired by Andrej Karpathy's llm.c.

  • A single "reft" executable plus model weights is all you need to run a reft-supported model on your GPU(s) with better performance.

  • Our deliverables target enterprises, institutes, individuals, GPU/NPU chipset vendors, and AI data centers (AIDC) seeking better performance, cost efficiency, and ease of use for LLM/LMs.


Quick start

To deploy and run an LLM/LM on on-premises or cloud GPUs, all reft.cpp needs is GPUs + Linux + Docker.
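Before pulling the reft image, you can confirm that Docker can see your GPUs. A minimal check, assuming an NVIDIA setup with the NVIDIA Container Toolkit installed (the CUDA base image tag below is just an example, not part of reft.cpp):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this prints the familiar nvidia-smi table, the container runtime can reach your GPUs.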


Example model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

1. Download model weights

mkdir -p models
hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
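To sanity-check the download (the exact file list varies by model, but a Hugging Face checkpoint normally includes config.json, tokenizer files, and *.safetensors shards):

ls ./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B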

2. Download and launch reft (OpenAI-compatible API server)

Command

docker run --rm -it --gpus all --net=host --ipc=host \
  -v "$(pwd)/models:/workspace/models" ghcr.io/reft-ai/reft:latest \
  /workspace/reft serve \
  --model /workspace/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --served_model_name DeepSeek-R1-Distill-Qwen-1.5B \
  --chat_template ds-distill-qwen2

Output

  ████████████████████████████████████████▏ 100.0% [ 199/ 199 | 476.2 Hz | 0s<0s]  
[2025-11-11 07:02:50.007] [GPUModelRunner(rank=0)][7] [info] Model loaded
[2025-11-11 07:02:50.007] [GPUWorker(rank=0)][7] [info] GPU Mem Info(free=18.4 GB, total=23.5 GB)
[2025-11-11 07:02:50.007] [Serve][1] [info] Available GPU Memory: 14.4 GB
[2025-11-11 07:02:50.007] [Serve][1] [info] max_num_layers: 1
[2025-11-11 07:02:50.008] [Serve][1] [info] All layers have the same page size: 1835008 bytes
[2025-11-11 07:02:50.008] [Serve][1] [info] GPU KV cache: 8438 blocks
[2025-11-11 07:02:50.008] [Serve][1] [info] GPU KV cache size: 540032 tokens
[2025-11-11 07:02:50.008] [Serve][1] [info] Maximum concurrency for 4096 tokens per request: 131.84x
[2025-11-11 07:02:50.008] [Serve][1] [info] The KV cache size required by each layer: 15483797504 bytes
[2025-11-11 07:02:50.008] [GPUModelRunner(rank=0)][7] [info] Initialize KV cache | num_blocks: 8438
[2025-11-11 07:02:50.009] [GPUModelRunner(rank=0)][7] [info] CUDA graph capture sizes: 1
[2025-11-11 07:02:50.152] [GPUModelRunner(rank=0)][7] [info] Graph test begin, size: 1
[2025-11-11 07:02:50.243] [GPUModelRunner(rank=0)][7] [info] Graph test passed, size: 1
[2025-11-11 07:02:50.243] [GPUModelRunner(rank=0)][7] [info] Graph capturing finished in 0 secs, took 0.04 GiB
[2025-11-11 07:02:50.243] [Serve][1] [info] Init engine (profile, create_kv_cache, warmup model) took 0.23 seconds
[2025-11-11 07:02:50.244] [Serve][1] [info] Starting API server ...
[2025-11-11 07:02:50.245] [Serve][1] [info] HTTP server listening on 0.0.0.0:8888 ...
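Before wiring up a client, you can probe the server. The OpenAI API surface includes GET /v1/models for listing served models; whether reft implements that exact route is an assumption here, but the request is a cheap readiness check:

curl -s http://127.0.0.1:8888/v1/models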

3. Start chatting

Chat via CLI

Command

curl -Ns http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
	"model": "DeepSeek-R1-Distill-Qwen-1.5B",
	"messages": [{"role":"user", "content": "<|begin▁of▁sentence|><|User|>Who are you?<|Assistant|><think>\\n"}],
	"max_tokens": 24,
	"temperature": 0.6,
	"stream": true
  }'

Output

data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"usage":null}

data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"Greetings"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}

data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"!"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}

data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":" I"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}

data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"'m"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}

data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":" Deep"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}

data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"Seek"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}
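For scripting, it is often easier to disable streaming and read one JSON body instead of SSE chunks. A sketch against the same endpoint with "stream": false, assuming the non-streaming response follows the usual OpenAI shape (choices[0].message.content); jq is optional and assumed installed:

curl -s http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [{"role":"user", "content": "<|begin▁of▁sentence|><|User|>Who are you?<|Assistant|><think>\\n"}],
    "max_tokens": 24,
    "temperature": 0.6,
    "stream": false
  }' | jq -r '.choices[0].message.content'
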
Chat via WebUI app

1. Download and install the app

DeepChat

2. Set up an OpenAI provider

In DeepChat, set the provider's API base URL to http://127.0.0.1:8888/v1 and the model name to DeepSeek-R1-Distill-Qwen-1.5B; as the curl example above shows, no API key is required.

3. Now, enjoy chatting!


Supported models

  • ✅ : Done
  • ☕ : To-Do

LLM

| Models | Nvidia GPU | AMD GPU | Hexagon NPU | Moore Threads GPU | MetaX GPU |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-0.5/1.5/3/7/14/32/72B(-Instruct) | | | | | |
| Qwen2.5-Math-1.5/7/72B(-Instruct) | | | | | |
| Qwen2.5-Coder-0.5/1.5/3/7/14/32B(-Instruct) | | | | | |
| DeepSeek-R1 | | | | | |
| DeepSeek-V3 | | | | | |
| DeepSeek-R1-Distill-Qwen-1.5/7/14/32B | | | | | |

Vision LM

| Models | Nvidia GPU | AMD GPU | Hexagon NPU | Moore Threads GPU | MetaX GPU |
| --- | --- | --- | --- | --- | --- |
| SAM | | | | | |
| ViT | | | | | |

Audio LM

| Models | Nvidia GPU | AMD GPU | Hexagon NPU | Moore Threads GPU | MetaX GPU |
| --- | --- | --- | --- | --- | --- |
| Whisper | | | | | |
| OpenVoice | | | | | |
| MeloTTS-English/... | | | | | |
