refft.cpp is a build tool that compiles LLM/LM inference and training for a designated cloud-GPU or edge-NPU backend into a native executable, covering the API, inference serving, training, the model, its ops, etc.
- 20%+ faster inference and training on average than Python/PyTorch-based stacks (at the same quantization/precision and for the same use cases)
- Zero runtime dependencies beyond the Linux/Android/macOS system and the GPU/NPU backend
- Native Compilation -- compile the whole inference/training pipeline of an LLM/LM into a native executable
- OpenAI-Compatible API -- seamless integration with existing tools
- Custom Training via Plugins -- data loader, optimizer, model layers, loss function
- Multi-Modal Support -- text, vision, audio, etc.
- Native VRAM Management -- native memory management instead of GC, lowering peak memory occupancy and allocation overhead
- Mixed-Precision Quantization -- FP16, w4a16, w8a16, etc., selectable per tensor/channel/block
- NPU Dynamics -- dynamic shapes, MoE, control flow, and flexible heterogeneous compute on NPUs
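As a rough illustration of what per-block weight quantization in a w4a16-style scheme means, here is a minimal NumPy sketch. The block size, the symmetric int4 mapping, and the function names are assumptions for illustration; the actual refft.cpp scheme may differ.

```python
import numpy as np

def quantize_w4_per_block(w, block_size=32):
    """Symmetric int4 weight quantization with one fp16 scale per block
    (illustrative sketch of the weight side of a w4a16-style scheme)."""
    blocks = w.reshape(-1, block_size).astype(np.float32)
    # map each block's max magnitude onto the signed int4 range [-7, 7]
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_w4(q, scales):
    """Reconstruct fp32 weights from int4 codes and per-block fp16 scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)
q, scales = quantize_w4_per_block(w)
w_hat = dequantize_w4(q, scales)
max_err = float(np.abs(w - w_hat).max())  # bounded by half a quantization step per block
```

Storing one scale per block (rather than per tensor) keeps the quantization error proportional to each block's local magnitude, which is why per-block schemes tolerate outlier weights better than per-tensor ones.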
The refft.cpp build tools produce executables that can be used as in the following examples.
```shell
./bin/refft-cli --model qwen3 --model_dir /path/to/model --prompt "Who are you?" --max_new_tokens 64
```

If the binary was built for a fixed build-matrix tuple, the backend/runner/precision flags normally do not need to be repeated.
Useful options:
```shell
--ignore_eos
--do_sample
--temperature
--top_k
--top_p
--speculative_mode ngram
--speculative_max_draft_tokens 4
--speculative_ngram_size 3
```
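The `--speculative_mode ngram` options above correspond to n-gram speculative decoding: draft tokens are guessed by pattern-matching the recent context against earlier occurrences, then verified by the target model in a single forward pass. A minimal Python sketch of the idea, where the function names and the verification step are illustrative assumptions rather than refft.cpp's actual API:

```python
def draft_ngram(context, ngram_size=3, max_draft=4):
    """Propose draft tokens by matching the last (ngram_size - 1) tokens
    against an earlier occurrence in the context and copying what followed.
    Sketch of the n-gram speculative idea, not the refft.cpp implementation."""
    k = ngram_size - 1
    if len(context) < ngram_size:
        return []
    suffix = tuple(context[-k:])
    # scan right-to-left so the most recent earlier match wins
    for start in range(len(context) - k - 1, -1, -1):
        if tuple(context[start:start + k]) == suffix:
            return list(context[start + k:start + k + max_draft])
    return []

def accept_drafts(drafts, target_tokens):
    """Keep the longest prefix of the draft that the target model agrees with."""
    accepted = []
    for d, t in zip(drafts, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted
```

Because accepted drafts are checked token-by-token against the target model's own outputs, the generated text is identical to plain decoding; the drafts only buy throughput when they happen to match.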
Minimal server usage:
```shell
./bin/refft-server --model qwen3 --model_dir /path/to/model --port 8000
```

Health check:
```shell
curl http://127.0.0.1:8000/health
```

OpenAI-compatible chat request:
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Explain KV cache reuse briefly."}],
    "max_tokens": 128,
    "stream": false
  }'
```

Full install and usage guide:
| Model Package | Description |
|---|---|
| refft-android-aarch64-qnn-qwen3 | 0.6B/1.7B/4B/8B/14B/32B supported.<br>FlashAttention ops supported.<br>Quantization can be set to w4a16, w8a16, w4afp16, w8afp16, or fp16 (default: fp16).<br>Tested on OnePlus 15 / SM8850 / 16 GB DDR |
| refft-android-aarch64-qnn-qwen3-moe | 30B-A3B supported.<br>MoE and FlashAttention ops supported.<br>TP supported on multi-HTP backends.<br>Quantization can be set to w4a16, w8a16, w4afp16, w8afp16, or fp16 (default: fp16).<br>Tested on OnePlus 15 / SM8850 / 16 GB DDR |
| Model Package | Description |
|---|---|
| refft-linux-x64-cuda-qwen3-20260323.tar.xz | 0.6B/1.7B/4B/8B/14B/32B supported |
| refft-ubuntu2404-x64-cuda-qwen3-20260323.deb | 0.6B/1.7B/4B/8B/14B/32B supported |
| refft-linux-x64-cuda-qwen3-moe-20260323.tar.xz | 30B-A3B/235B-A22B supported |
| refft-ubuntu2404-x64-cuda-qwen3-moe-20260323.deb | 30B-A3B/235B-A22B supported |
Note: Please contact us for multi-node support
| Model Package | Description |
|---|---|
| refft-macos-arm64-mlx-qwen3-20260323.tar.xz | 0.6B/1.7B/4B/8B/14B/32B supported |
| refft-macos-arm64-mlx-qwen3-moe-20260323.tar.xz | 30B-A3B/235B-A22B supported |
Download public datasets or use your own:
```shell
# Example datasets: CCI3-HQ, Alpaca GPT4 and FineWeb-Edu
hf download HuggingFaceFW/fineweb-edu --repo-type=dataset --local-dir ./datasets/HuggingFaceFW/fineweb-edu
hf download BAAI/CCI3-HQ --repo-type=dataset --local-dir ./datasets/BAAI/CCI3-HQ
hf download llamafactory/alpaca_gpt4_en --repo-type=dataset --local-dir ./datasets/llamafactory/alpaca_gpt4_en
```

Train an LLM via pre-train/full-SFT/freeze-SFT/LoRA/RL:
```shell
mkdir -p output
refft train \
  --cutoff_len 512 \
  --model ./models/Qwen/Qwen3-0.6B \
  --block_size 512 \
  --test_every 200 \
  --batch_size 4 \
  --fine_tuning_type full \
  --weight_decay 0.1 \
  --warmup_steps 100 \
  --lr_scheduler_type step \
  --learning_rate 4e-5 \
  --epochs 100 \
  --learning_rate_decay_frac 0.0 \
  --use_bf16 \
  --stage sft \
  --checkpoint_dir ./output/checkpoints/sft-Qwen3-0.6B-full \
  --save_every 20000 \
  --grad_accumulation_steps 32 \
  --resume \
  --load_pretrained \
  --tensor_parallels 1 \
  --pipeline_parallels 1 \
  --data_parallels 1 \
  --nodes 1 \
  --gpus_per_node 1 \
  --chat_template qwen3 \
  --datasets cci3@./datasets/BAAI/CCI3-HQ/data \
  --datasets alpaca@./datasets/llamafactory/alpaca_gpt4_en/alpaca-gpt4-data-en.json \
  --datasets fineweb@./datasets/HuggingFaceFW/fineweb-edu/data/CC-MAIN-2025-26
```

Output:
[1][2025-11-30 09:20:15][I][ train_main.cc: 186] Reft: v1.0.0, 5301f2a4fb303fd647fe783aa326522efde8ceb4
[1][2025-11-30 09:20:15][I][ train_main.cc: 187] Build Time: Sun Nov 30 08:37:07 CST 202
[2025-11-30 09:20:15.895] [info] Apply chat template: qwen2
[1][2025-11-30 09:20:15][I][ train_main.cc: 525] [0/1] Building tokenizer ...
[2025-11-30 09:20:16.102] [info] Vocab size: 151669
[2025-11-30 09:20:16.102] [info] ids: [[9707,1879,0]
[1][2025-11-30 09:20:16][I][sequence_dataloader_builders.cc: 108] URL: huatuo@/assets/data/huatuo-100.jsonl
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 101/ 101 | 84.1 kHz | 0s<0s] Parsing lines ...
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 100/ 100 | 127.3 kHz | 0s<0s] Loading dataset ...
[1][2025-11-30 09:20:16][I][sequence_dataloader_builders.cc: 108] URL: alpaca@/assets/data/alpaca-style/reft_ai.json
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 5/ 5 | 6.6 kHz | 0s<0s] Loading dataset ...
[1][2025-11-30 09:20:16][I][sequence_dataloader_builders.cc: 24] Dataset has 200 examples in total
[2025-11-30 09:20:16.115] [info] Found the loader for architecture: Qwen3ForCausalLM
[2025-11-30 09:20:16.115] [info] KV cache block size: 512
[2025-11-30 09:20:16.115] [info] KV cache allocator is created
[2025-11-30 09:20:16.124] [info] Creating Qwen model ...
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 28/ 28 | 12.9 kHz | 0s<0s] Construct blocks ...
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 311/ 311 | 29.3 Hz | 11s<0s]
[2025-11-30 09:20:26.787] [info] Weights are loaded
[1][2025-11-30 09:20:26][I][ train_main.cc: 658] [0/1] Model loaded
[2025-11-30 09:20:26.801] [info] Last done steps: 0
[1][2025-11-30 09:20:26][I][ train_main.cc: 737] [0/1] Let's start training now!!!
[1][2025-11-30 09:20:26][I][ train_main.cc: 740] [0/1] Fine tuning type: full
[1][2025-11-30 09:20:26][I][ train_main.cc: 753] [0/1] Building trainer ...
[1][2025-11-30 09:20:26][I][ train_main.cc: 831] [0/1] Trainer is ready
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 24] ++++++++++++++++++++++ Training +++++++++++++++++++++++
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 25] Start from steps: 0, total steps: 5000, total epochs: 100
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 27] Options: ignore_idx: 151643, grad_accumulate_steps: 32
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 30] Resuming dataloader to 0 ...
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [0/5000] loss: 3.64062, lr: 0.0000400, seq_len: 288
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [1/5000] loss: 2.12500, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [2/5000] loss: 1.88281, lr: 0.0000400, seq_len: 360
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [3/5000] loss: 1.42969, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [4/5000] loss: 1.96875, lr: 0.0000400, seq_len: 512
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [5/5000] loss: 1.08594, lr: 0.0000400, seq_len: 320
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [6/5000] loss: 1.39062, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [7/5000] loss: 1.75781, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:28][I][ sft_trainer.cc: 170] [0/1] [8/5000] loss: 1.57812, lr: 0.0000400, seq_len: 512
[1][2025-11-30 09:20:28][I][ sft_trainer.cc: 170] [0/1] [9/5000] loss: 1.12500, lr: 0.0000400, seq_len: 384

Why refft.cpp implements all of modeling, serving and training in C++
Mainly for better performance and ease of use compared to Python/PyTorch-based stacks, and for scalability on edge NPUs.
Why Triton is not used in refft.cpp
Because Triton kernels reach only up to 78% of the performance of the CUDA implementations on the H100 and up to 82% on the A100.
How to support multi-node GPU/NPU
Technically refft.cpp supports multi-node inference and training, but multi-node setups have not been tested due to a lack of hardware resources. Please contact us if needed.

Strict equivalence of computational precision matters most in LLM/LM ops and serving optimization
- https://epoch.ai/gradient-updates/why-benchmarking-is-hard
- https://blog.vllm.ai/2025/10/28/Kimi-K2-Accuracy.html
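To see why strict precision equivalence matters, here is a small NumPy demo (illustrative only, not refft.cpp code): accumulating many small terms in an fp16 accumulator stalls once the running sum's ULP exceeds the addend, while an fp32 accumulator stays close to the exact value. An optimized kernel that silently changes accumulation precision can therefore change model outputs even when every input is bit-identical.

```python
import numpy as np

# 20,000 terms of ~1e-3 should sum to ~20.
terms = np.full(20000, 1e-3, dtype=np.float16)

acc16 = np.float16(0.0)
for t in terms:
    acc16 = np.float16(acc16 + t)   # round to fp16 after every add: stalls early

acc32 = np.float32(0.0)
for t in terms:
    acc32 = np.float32(acc32 + np.float32(t))  # fp32 accumulator: stays near 20

print(float(acc16), float(acc32))   # the fp16 sum lands far below 20.0
```

This is the same failure mode behind accuracy regressions in low-precision attention and reduction kernels, which is why op-level optimizations need to be validated for numerical equivalence, not just speed.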
Please contact us via haiteng@refinefuture.ai for commercial use, technical consulting, sponsorship/partnership opportunities, etc.
refft.cpp was inspired by Andrej Karpathy's llm.c, and also drew on HuggingFace, PyTorch, vLLM, SGLang, FlashAttention, and FlashInfer.

