refft.cpp is a build tool that compiles LLM/LM inference and training for a designated cloud-GPU or edge-NPU backend into a native executable, covering the API, inference serving, training, the model, its ops, etc.
- 20%+ faster inference and training on average than Python/PyTorch-based stacks (at the same quantization/precision and for the same use cases)
- Zero runtime dependencies beyond the Linux/Android/macOS system and the GPU/NPU backend
- Native Compilation -- compile the whole inference/training pipeline of an LLM/LM into a native executable
- OpenAI-Compatible API -- seamless integration with existing tools
- Custom Training via Plugins -- data loader, optimizer, model layers, loss function
- Multi-Modal Support -- text, vision, audio, etc.
- Native VRAM Management -- native memory management instead of GC, lowering peak memory occupancy and allocation overhead
- Mixed-Precision Quantization -- FP16, w4a16, w8a16, etc., selectable per tensor/channel/block
- NPU Dynamics -- dynamic shapes, MoE, control flow, and flexible heterogeneous compute on NPUs
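As a rough illustration of what per-block weight quantization in a w4a16-style scheme means, here is a minimal NumPy sketch. The block size, the symmetric int4 mapping, and the function names are assumptions for illustration; the actual refft.cpp scheme may differ.

```python
import numpy as np

def quantize_w4_per_block(w, block_size=32):
    """Symmetric int4 weight quantization with one fp16 scale per block
    (illustrative sketch of the weight side of a w4a16-style scheme)."""
    blocks = w.reshape(-1, block_size).astype(np.float32)
    # map each block's max magnitude onto the signed int4 range [-7, 7]
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_w4(q, scales):
    """Reconstruct fp32 weights from int4 codes and per-block fp16 scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)
q, scales = quantize_w4_per_block(w)
w_hat = dequantize_w4(q, scales)
max_err = float(np.abs(w - w_hat).max())  # bounded by half a quantization step per block
```

Storing one scale per block (rather than per tensor) keeps the quantization error proportional to each block's local magnitude, which is why per-block schemes tolerate outlier weights better than per-tensor ones.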
The refft.cpp build tools produce executables that can be used as in the following examples.
```shell
./bin/refft-cli --model qwen3 --model_dir /path/to/model --prompt "Who are you?" --max_new_tokens 64
```

If the binary was built for a fixed build-matrix tuple, the backend/runner/precision flags normally do not need to be repeated.
Useful options:
```shell
--ignore_eos
--do_sample
--temperature
--top_k
--top_p
--speculative_mode ngram
--speculative_max_draft_tokens 4
--speculative_ngram_size 3
```
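The `--speculative_mode ngram` options above correspond to n-gram speculative decoding: draft tokens are guessed by pattern-matching the recent context against earlier occurrences, then verified by the target model in a single forward pass. A minimal Python sketch of the idea, where the function names and the verification step are illustrative assumptions rather than refft.cpp's actual API:

```python
def draft_ngram(context, ngram_size=3, max_draft=4):
    """Propose draft tokens by matching the last (ngram_size - 1) tokens
    against an earlier occurrence in the context and copying what followed.
    Sketch of the n-gram speculative idea, not the refft.cpp implementation."""
    k = ngram_size - 1
    if len(context) < ngram_size:
        return []
    suffix = tuple(context[-k:])
    # scan right-to-left so the most recent earlier match wins
    for start in range(len(context) - k - 1, -1, -1):
        if tuple(context[start:start + k]) == suffix:
            return list(context[start + k:start + k + max_draft])
    return []

def accept_drafts(drafts, target_tokens):
    """Keep the longest prefix of the draft that the target model agrees with."""
    accepted = []
    for d, t in zip(drafts, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted
```

Because accepted drafts are checked token-by-token against the target model's own outputs, the generated text is identical to plain decoding; the drafts only buy throughput when they happen to match.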
Minimal server usage:
```shell
./bin/refft-server --model qwen3 --model_dir /path/to/model --port 8000
```

Health check:
```shell
curl http://127.0.0.1:8000/health
```

OpenAI-compatible chat request:
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Explain KV cache reuse briefly."}],
    "max_tokens": 128,
    "stream": false
  }'
```

Full install and usage guide:
| Model Package | Description |
|---|---|
| refft-android-aarch64-qnn-qwen3 | 0.6B/1.7B/4B/8B/14B/32B supported.<br>FlashAttention ops supported.<br>Quantization can be set to w4a16, w8a16, w4afp16, w8afp16, or fp16 (default: fp16).<br>Tested on OnePlus 15 / SM8850 / 16 GB DDR |
| refft-android-aarch64-qnn-qwen3-moe | 30B-A3B supported.<br>MoE and FlashAttention ops supported.<br>TP supported on multi-HTP backends.<br>Quantization can be set to w4a16, w8a16, w4afp16, w8afp16, or fp16 (default: fp16).<br>Tested on OnePlus 15 / SM8850 / 16 GB DDR |
| Model Package | Description |
|---|---|
| refft-linux-x64-cuda-qwen3-20260323.tar.xz | 0.6B/1.7B/4B/8B/14B/32B supported |
| refft-ubuntu2404-x64-cuda-qwen3-20260323.deb | 0.6B/1.7B/4B/8B/14B/32B supported |
| refft-linux-x64-cuda-qwen3-moe-20260323.tar.xz | 30B-A3B/235B-A22B supported |
| refft-ubuntu2404-x64-cuda-qwen3-moe-20260323.deb | 30B-A3B/235B-A22B supported |
Note: Please contact us for multi-node support
| Model Package | Description |
|---|---|
| refft-macos-arm64-mlx-qwen3-20260323.tar.xz | 0.6B/1.7B/4B/8B/14B/32B supported |
| refft-macos-arm64-mlx-qwen3-moe-20260323.tar.xz | 30B-A3B/235B-A22B supported |
Download public datasets or use your own:
```shell
# Example datasets: CCI3-HQ, Alpaca GPT4 and FineWeb-Edu
hf download HuggingFaceFW/fineweb-edu --repo-type=dataset --local-dir ./datasets/HuggingFaceFW/fineweb-edu
hf download BAAI/CCI3-HQ --repo-type=dataset --local-dir ./datasets/BAAI/CCI3-HQ
hf download llamafactory/alpaca_gpt4_en --repo-type=dataset --local-dir ./datasets/llamafactory/alpaca_gpt4_en
```

Train an LLM via pre-train/full-SFT/freeze-SFT/LoRA/RL:
```shell
mkdir -p output
refft train \
  --cutoff_len 512 \
  --model ./models/Qwen/Qwen3-0.6B \
  --block_size 512 \
  --test_every 200 \
  --batch_size 4 \
  --fine_tuning_type full \
  --weight_decay 0.1 \
  --warmup_steps 100 \
  --lr_scheduler_type step \
  --learning_rate 4e-5 \
  --epochs 100 \
  --learning_rate_decay_frac 0.0 \
  --use_bf16 \
  --stage sft \
  --checkpoint_dir ./output/checkpoints/sft-Qwen3-0.6B-full \
  --save_every 20000 \
  --grad_accumulation_steps 32 \
  --resume \
  --load_pretrained \
  --tensor_parallels 1 \
  --pipeline_parallels 1 \
  --data_parallels 1 \
  --nodes 1 \
  --gpus_per_node 1 \
  --chat_template qwen3 \
  --datasets cci3@./datasets/BAAI/CCI3-HQ/data \
  --datasets alpaca@./datasets/llamafactory/alpaca_gpt4_en/alpaca-gpt4-data-en.json \
  --datasets fineweb@./datasets/HuggingFaceFW/fineweb-edu/data/CC-MAIN-2025-26
```

Output:
[1][2025-11-30 09:20:15][I][ train_main.cc: 186] Reft: v1.0.0, 5301f2a4fb303fd647fe783aa326522efde8ceb4
[1][2025-11-30 09:20:15][I][ train_main.cc: 187] Build Time: Sun Nov 30 08:37:07 CST 202
[2025-11-30 09:20:15.895] [info] Apply chat template: qwen2
[1][2025-11-30 09:20:15][I][ train_main.cc: 525] [0/1] Building tokenizer ...
[2025-11-30 09:20:16.102] [info] Vocab size: 151669
[2025-11-30 09:20:16.102] [info] ids: [[9707,1879,0]
[1][2025-11-30 09:20:16][I][sequence_dataloader_builders.cc: 108] URL: huatuo@/assets/data/huatuo-100.jsonl
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 101/ 101 | 84.1 kHz | 0s<0s] Parsing lines ...
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 100/ 100 | 127.3 kHz | 0s<0s] Loading dataset ...
[1][2025-11-30 09:20:16][I][sequence_dataloader_builders.cc: 108] URL: alpaca@/assets/data/alpaca-style/reft_ai.json
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 5/ 5 | 6.6 kHz | 0s<0s] Loading dataset ...
[1][2025-11-30 09:20:16][I][sequence_dataloader_builders.cc: 24] Dataset has 200 examples in total
[2025-11-30 09:20:16.115] [info] Found the loader for architecture: Qwen3ForCausalLM
[2025-11-30 09:20:16.115] [info] KV cache block size: 512
[2025-11-30 09:20:16.115] [info] KV cache allocator is created
[2025-11-30 09:20:16.124] [info] Creating Qwen model ...
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 28/ 28 | 12.9 kHz | 0s<0s] Construct blocks ...
βββββββββββββββββββββββββββββββββββββββββ 100.0% [ 311/ 311 | 29.3 Hz | 11s<0s]
[2025-11-30 09:20:26.787] [info] Weights are loaded
[1][2025-11-30 09:20:26][I][ train_main.cc: 658] [0/1] Model loaded
[2025-11-30 09:20:26.801] [info] Last done steps: 0
[1][2025-11-30 09:20:26][I][ train_main.cc: 737] [0/1] Let's start training now!!!
[1][2025-11-30 09:20:26][I][ train_main.cc: 740] [0/1] Fine tuning type: full
[1][2025-11-30 09:20:26][I][ train_main.cc: 753] [0/1] Building trainer ...
[1][2025-11-30 09:20:26][I][ train_main.cc: 831] [0/1] Trainer is ready
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 24] ++++++++++++++++++++++ Training +++++++++++++++++++++++
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 25] Start from steps: 0, total steps: 5000, total epochs: 100
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 27] Options: ignore_idx: 151643, grad_accumulate_steps: 32
[1][2025-11-30 09:20:26][I][ sft_trainer.cc: 30] Resuming dataloader to 0 ...
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [0/5000] loss: 3.64062, lr: 0.0000400, seq_len: 288
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [1/5000] loss: 2.12500, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [2/5000] loss: 1.88281, lr: 0.0000400, seq_len: 360
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [3/5000] loss: 1.42969, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [4/5000] loss: 1.96875, lr: 0.0000400, seq_len: 512
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [5/5000] loss: 1.08594, lr: 0.0000400, seq_len: 320
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [6/5000] loss: 1.39062, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:27][I][ sft_trainer.cc: 170] [0/1] [7/5000] loss: 1.75781, lr: 0.0000400, seq_len: 384
[1][2025-11-30 09:20:28][I][ sft_trainer.cc: 170] [0/1] [8/5000] loss: 1.57812, lr: 0.0000400, seq_len: 512
[1][2025-11-30 09:20:28][I][ sft_trainer.cc: 170] [0/1] [9/5000] loss: 1.12500, lr: 0.0000400, seq_len: 384

Why refft.cpp implements all of modeling, serving and training in C++
Mainly for better performance and ease of use compared to Python/PyTorch-based stacks, and for scalability on edge NPUs.
Why Triton is not used in refft.cpp
Because Triton kernels reach only up to 78% of the performance of the CUDA implementations on the H100 and up to 82% on the A100.
How to support multi-node GPU/NPU
Technically refft.cpp supports multi-node inference and training, but multi-node setups have not been tested due to a lack of hardware resources. Please contact us if needed.

Strict equivalence of computational precision matters most in LLM/LM ops and serving optimization
- https://epoch.ai/gradient-updates/why-benchmarking-is-hard
- https://blog.vllm.ai/2025/10/28/Kimi-K2-Accuracy.html
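To see why strict precision equivalence matters, here is a small NumPy demo (illustrative only, not refft.cpp code): accumulating many small terms in an fp16 accumulator stalls once the running sum's ULP exceeds the addend, while an fp32 accumulator stays close to the exact value. An optimized kernel that silently changes accumulation precision can therefore change model outputs even when every input is bit-identical.

```python
import numpy as np

# 20,000 terms of ~1e-3 should sum to ~20.
terms = np.full(20000, 1e-3, dtype=np.float16)

acc16 = np.float16(0.0)
for t in terms:
    acc16 = np.float16(acc16 + t)   # round to fp16 after every add: stalls early

acc32 = np.float32(0.0)
for t in terms:
    acc32 = np.float32(acc32 + np.float32(t))  # fp32 accumulator: stays near 20

print(float(acc16), float(acc32))   # the fp16 sum lands far below 20.0
```

This is the same failure mode behind accuracy regressions in low-precision attention and reduction kernels, which is why op-level optimizations need to be validated for numerical equivalence, not just speed.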
Please contact us via haiteng@refinefuture.ai for commercial use, technical consulting, sponsorship/partnership opportunities, etc.
refft.cpp was inspired by Andrej Karpathy's llm.c, and also drew on HuggingFace, PyTorch, vLLM, SGLang, FlashAttention, and FlashInfer.

