- A high-performance, easy-to-use LLM/LM serving tool for inference and training. All ops, op fusion, and op optimizations, the serving framework, and the training tools are written in C++ without Python/PyTorch, inspired by Andrej Karpathy's llm.c.
- A single "reft" executable plus the model weights is all you need to run a reft-supported model on your GPU(s) with better performance.
- Our deliverables are aimed at enterprises, institutes, individuals, GPU/NPU chipset vendors, and AIDCs looking for better-performing, cost-efficient, easy-to-use LLM/LM serving.
Example model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
```bash
mkdir -p models
hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --local-dir ./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

docker run --rm -it --gpus all --net=host --ipc=host \
  -v ./models:/workspace/models ghcr.io/reft-ai/reft:latest \
  /workspace/reft serve \
  --model /workspace/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --served_model_name DeepSeek-R1-Distill-Qwen-1.5B \
  --chat_template ds-distill-qwen2
```
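The `-v ./models:/workspace/models` mount maps the host download directory into the container, so the directory passed to `--model` has to exist under `./models` on the host. A quick sanity check, using the paths from the example above:

```bash
# Should list config.json, the tokenizer files, and the safetensors weights that
# the container will see under /workspace/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
ls ./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
```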

```text
████████████████████████████████████████▏ 100.0% [ 199/ 199 | 476.2 Hz | 0s<0s]
[2025-11-11 07:02:50.007] [GPUModelRunner(rank=0)][7] [info] Model loaded
[2025-11-11 07:02:50.007] [GPUWorker(rank=0)][7] [info] GPU Mem Info(free=18.4 GB, total=23.5 GB)
[2025-11-11 07:02:50.007] [Serve][1] [info] Available GPU Memory: 14.4 GB
[2025-11-11 07:02:50.007] [Serve][1] [info] max_num_layers: 1
[2025-11-11 07:02:50.008] [Serve][1] [info] All layers have the same page size: 1835008 bytes
[2025-11-11 07:02:50.008] [Serve][1] [info] GPU KV cache: 8438 blocks
[2025-11-11 07:02:50.008] [Serve][1] [info] GPU KV cache size: 540032 tokens
[2025-11-11 07:02:50.008] [Serve][1] [info] Maximum concurrency for 4096 tokens per request: 131.84x
[2025-11-11 07:02:50.008] [Serve][1] [info] The KV cache size required by each layer: 15483797504 bytes
[2025-11-11 07:02:50.008] [GPUModelRunner(rank=0)][7] [info] Initialize KV cache | num_blocks: 8438
[2025-11-11 07:02:50.009] [GPUModelRunner(rank=0)][7] [info] CUDA graph capture sizes: 1
[2025-11-11 07:02:50.152] [GPUModelRunner(rank=0)][7] [info] Graph test begin, size: 1
[2025-11-11 07:02:50.243] [GPUModelRunner(rank=0)][7] [info] Graph test passed, size: 1
[2025-11-11 07:02:50.243] [GPUModelRunner(rank=0)][7] [info] Graph capturing finished in 0 secs, took 0.04 GiB
[2025-11-11 07:02:50.243] [Serve][1] [info] Init engine (profile, create_kv_cache, warmup model) took 0.23 seconds
[2025-11-11 07:02:50.244] [Serve][1] [info] Starting API server ...
[2025-11-11 07:02:50.245] [Serve][1] [info] HTTP server listening on 0.0.0.0:8888 ...
```
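
For reference, the KV-cache figures in the startup log are mutually consistent: 8438 blocks of 1835008 bytes each account for the 15483797504 bytes reported per layer, the 540032-token capacity implies 64 tokens per block, and the 131.84x maximum concurrency is that capacity divided by the 4096-token budget per request. A quick check of the arithmetic (all values copied from the log; the 64-token block size is derived, not printed there):

```bash
awk 'BEGIN {
  blocks = 8438; page_bytes = 1835008; kv_tokens = 540032; max_tokens = 4096;
  printf "tokens per block: %d\n",    kv_tokens / blocks;       # 64
  printf "KV cache bytes:   %.0f\n",  blocks * page_bytes;      # 15483797504
  printf "max concurrency:  %.2fx\n", kv_tokens / max_tokens;   # 131.84
}'
```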

Chat via CLI

```bash
curl -Ns http://127.0.0.1:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-R1-Distill-Qwen-1.5B",
"messages": [{"role":"user", "content": "<|begin▁of▁sentence|><|User|>Who are you?<|Assistant|><think>\\n"}],
"max_tokens": 24,
"temperature": 0.6,
"stream": true
}'
```

```text
data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"usage":null}
data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"Greetings"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}
data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"!"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}
data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":" I"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}
data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"'m"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}
data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":" Deep"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}
data: {"id":"d971c92d-8505-4152-b8b3-cf9726e19127","object":"chat.completion.chunk","created":1589478378,"model":"DeepSeek-R1-Distill-Qwen-1.5B","system_fingerprint":"fp_44709d6fcb","choices":[{"delta":{"content":"Seek"},"index":0,"logprobs":null,"finish_reason":""}],"usage":null}
- ✅ : Done
- ☕ : To-Do

| Models | Nvidia GPU | AMD GPU | Hexagon NPU | Moore Threads GPU | MetaX GPU |
|---|---|---|---|---|---|
| Qwen2.5-0.5/1.5/3/7/14/32/72B(-Instruct) | ✅ | ☕ | ☕ | ☕ | ☕ |
| Qwen2.5-Math-1.5/7/72B(-Instruct) | ✅ | ☕ | ☕ | ☕ | ☕ |
| Qwen2.5-Coder-0.5/1.5/3/7/14/32B(-Instruct) | ✅ | ☕ | ☕ | ☕ | ☕ |
| DeepSeek-R1 | ✅ | ☕ | ☕ | ☕ | ☕ |
| DeepSeek-V3 | ✅ | ☕ | ☕ | ☕ | ☕ |
| DeepSeek-R1-Distill-Qwen-1.5/7/14/32B | ✅ | ☕ | ☕ | ☕ | ☕ |

| Models | Nvidia GPU | AMD GPU | Hexagon NPU | Moore Threads GPU | MetaX GPU |
|---|---|---|---|---|---|
| SAM | ✅ | ☕ | ☕ | ☕ | ☕ |
| ViT | ✅ | ☕ | ☕ | ☕ | ☕ |

| Models | Nvidia GPU | AMD GPU | Hexagon NPU | Moore Threads GPU | MetaX GPU |
|---|---|---|---|---|---|
| Whisper | ✅ | ☕ | ☕ | ☕ | ☕ |
| OpenVoice | ✅ | ☕ | ☕ | ☕ | ☕ |
| MeloTTS-English/... | ✅ | ☕ | ☕ | ☕ | ☕ |


