QCFuse

QCFuse is a pipeline-constrained, query-aware KV cache fusion system for efficient long-context RAG generation. This repository contains the QCFuse research release described in arXiv:2606.05875.

✨ Highlights

  • Full-prefill-level quality. QCFuse preserves the quality of full prefill.
  • Matched-quality speedup. QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
  • Fastest under strict quality control. Under a 2% relative quality drop criterion, QCFuse is the fastest of all compared methods that satisfy the constraint, with a 2.1x average speedup.
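
The 2% criterion in the last highlight can be read as a bound on quality loss relative to full prefill. The following is a minimal sketch, assuming the drop is computed as (full-prefill score - method score) / full-prefill score. That definition, the function names, and the example scores are illustrative assumptions, not taken from the QCFuse code or paper.

```python
def relative_quality_drop(full_prefill_score: float, method_score: float) -> float:
    """Fractional quality loss relative to full prefill (assumed definition)."""
    if full_prefill_score <= 0:
        raise ValueError("full_prefill_score must be positive")
    return (full_prefill_score - method_score) / full_prefill_score


def satisfies_constraint(full_prefill_score: float, method_score: float,
                         max_drop: float = 0.02) -> bool:
    """True when the method stays within the allowed relative drop."""
    return relative_quality_drop(full_prefill_score, method_score) <= max_drop


# Hypothetical scores, for illustration only.
print(satisfies_constraint(50.0, 49.5))  # relative drop of 0.01
print(satisfies_constraint(50.0, 48.0))  # relative drop of 0.04
```

Under this reading, a method qualifies only if its score stays within 2% of the full-prefill score on the same task, and speed is compared only among the methods that qualify.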

📊 Results

Figure: Quality and TTFT trade-off on LongBench and RULER. Lower TTFT and higher quality are better.

🧪 Datasets

The release runner expects each evaluation split as a local JSONL file named {dataset}.jsonl under --data_dir.

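| Benchmark | Official source | Tasks used in this release                 |
|-----------|-----------------|--------------------------------------------|
| LongBench | THUDM/LongBench | musique, 2wikimqa, hotpotqa                |
| RULER     | NVIDIA/RULER    | ruler_mv (MV), ruler_mq (MQ), ruler_vt (VT) |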
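
Before launching a run, it can help to confirm that the data directory follows the {dataset}.jsonl naming convention described above. The sketch below is a hypothetical helper, not part of the release: the function name is an assumption, and the dataset names are the supported --dataset values listed in this README.

```python
from pathlib import Path

# Dataset names accepted by --dataset, as listed in this README.
DATASETS = ["hotpotqa", "2wikimqa", "musique", "ruler_mv", "ruler_mq", "ruler_vt"]


def missing_dataset_files(data_dir, datasets=DATASETS):
    """Return the expected {dataset}.jsonl paths that are absent under data_dir."""
    root = Path(data_dir)
    expected = [root / f"{name}.jsonl" for name in datasets]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Replace with the directory you pass as --data_dir.
    for path in missing_dataset_files("/path/to/data"):
        print(f"missing: {path}")
```

An empty result means every split the runner may ask for is present. The check only looks at file names and does not validate the JSONL contents.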

⚙️ Installation

Install SGLang 0.5.4:

git clone -b v0.5.4 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"

Install the evaluation dependencies used by the Blend runner:

pip install rouge-score

Use a CUDA/PyTorch environment compatible with your GPU and SGLang 0.5.4. The runner expects local model files and local JSONL datasets.

🚀 Running QCFuse

Run the SSD-backed QCFuse method:

python blend/sglang_blend_ssd.py \
  --model qwen3-8b \
  --model_dir /path/to/models \
  --data_dir /path/to/data \
  --dataset hotpotqa \
  --baseline ours \
  --size 200 \
  --cache_dir /path/to/cache

--cache_dir stores the SSD-backed chunk and query caches. With --baseline ours, the runner performs offline cache preparation before the online evaluation pass.

Run the full-prefill baseline:

python blend/sglang_blend_ssd.py \
  --model qwen3-8b \
  --model_dir /path/to/models \
  --data_dir /path/to/data \
  --dataset hotpotqa \
  --baseline fullcomp \
  --size 200 \
  --cache_dir /path/to/cache

Supported --baseline values are ours and fullcomp. Supported --dataset values are hotpotqa, 2wikimqa, musique, ruler_mv, ruler_mq, and ruler_vt.
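
To sweep every supported dataset and baseline, the two commands above can be generated programmatically. This is a minimal sketch that uses only the flags shown in this README. The helper names (build_command, run_all) and the dry-run behaviour are assumptions for illustration, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys

# Values listed as supported in this README.
BASELINES = ["ours", "fullcomp"]
DATASETS = ["hotpotqa", "2wikimqa", "musique", "ruler_mv", "ruler_mq", "ruler_vt"]


def build_command(dataset, baseline, model_dir, data_dir, cache_dir,
                  model="qwen3-8b", size=200):
    """Build the argument list for one blend/sglang_blend_ssd.py run."""
    if dataset not in DATASETS:
        raise ValueError(f"unsupported dataset: {dataset}")
    if baseline not in BASELINES:
        raise ValueError(f"unsupported baseline: {baseline}")
    return [
        sys.executable, "blend/sglang_blend_ssd.py",
        "--model", model,
        "--model_dir", model_dir,
        "--data_dir", data_dir,
        "--dataset", dataset,
        "--baseline", baseline,
        "--size", str(size),
        "--cache_dir", cache_dir,
    ]


def run_all(model_dir, data_dir, cache_dir, dry_run=True):
    """Print (dry_run=True) or execute every dataset/baseline combination."""
    for dataset in DATASETS:
        for baseline in BASELINES:
            cmd = build_command(dataset, baseline, model_dir, data_dir, cache_dir)
            if dry_run:
                print(" ".join(cmd))
            else:
                # check=True stops the sweep on the first failing run.
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all("/path/to/models", "/path/to/data", "/path/to/cache")
```

Running it as written only prints the twelve commands. Pass dry_run=False to execute them once the paths point at real model, data, and cache directories.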

📚 Citation

If you find QCFuse useful, please cite:

@misc{yan2026qcfusequeryawarecachefusion,
      title={QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving},
      author={Jianxin Yan and Wangze Ni and Zhenxin Li and Jiabao Jin and Zhitao Shen and Haoyang Li and Jia Zhu and Peng Cheng and Xuemin Lin and Lei Chen and Kui Ren},
      year={2026},
      eprint={2606.05875},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.05875},
}
