SGLang Roadmap — 2026 Q1

Contributions and feedback are welcome. Join Slack.

Focus
Feature compatibility & reliability: Full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, HiCache, and load balancing.
Usability: Easy installation on NVIDIA/AMD/TPU/CPU; simple large-scale deployment (Kubernetes, OME).
Kernel optimization for next-gen hardware (GB300/GB200, B300/B200, MI350/MI355, TPU).
Reinforcement learning framework integration and training-inference mismatch mitigation.
Multimodal: Enhance diffusion models for video and image generation; Omni model support.
Base Engine Features
Turn on overlap scheduler for speculative decoding by default
PoC: @hnyls2002
Slack: #spec-decoding
Issue: [Feature] Overlap Spec Support #11762
Turn on prefill CUDA graph by default
PoC: @Oasis-Git @ispobock @BBuf
Slack: #piecewise-cuda-graph
Issue: [Feature] Roadmap for Prefill (Piecewise) CUDA Graph #11490
General memory pool and prefix cache for hybrid models
PoC: @cctry @xiezhq-hermann
Slack: #prefix-cache, #kv-cache-store
Issue: [Feature] Memory Cache System Refactoring Road Map (Mem Cache V2) #12587
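A prefix cache of this kind revolves around matching the longest cached token prefix of an incoming request so its KV entries can be reused. A minimal sketch of that matching idea (simplified and hypothetical — not the actual Mem Cache V2 API):

```python
# Minimal token-level prefix matching, the core idea behind a radix/prefix
# cache: find how many leading tokens of a new request are already cached.

class PrefixCacheNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixCacheNode

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens):
        """Record a token sequence as cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixCacheNode())

    def match_prefix(self, tokens):
        """Return the number of leading tokens already present in the cache."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])              # e.g. a cached system prompt
hit = cache.match_prefix([1, 2, 3, 9])  # shares a 3-token prefix
```

A real implementation compresses runs of tokens into radix-tree edges and attaches KV block handles to nodes, but the lookup logic is the same.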
Mixed chunked prefill refactor
PoC: @hzh0425 @yizhang2077
Issue: [Feature] Mixed ChunkPrefill Optimization Roadmap #13626
Torch compile stack (Looking for PoC)
Slack: #torch-compile
PR: [WIP] Support torch compile based pass manager framework #10987
Issue: [RFC] SGLang unified kernel fusion and torch compile optimisations #10118
SRT core/plugin refactor
Goal: make the core reusable so users can customize it easily and maintain their out-of-tree code.
DP attention and attention backend refactor
Goal: make attention backends fully stateless and unify the synchronization points of DP attention.
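"Stateless" here means a backend receives all per-batch state as explicit arguments instead of caching it on the backend object between calls. A hypothetical illustration of the contrast (not SGLang's actual interface):

```python
# Hypothetical contrast between a stateful and a stateless attention
# backend. The stateless variant's output is fully determined by its
# inputs, so one instance can serve any DP rank or CUDA-graph replay
# without hidden mutable fields or call-order dependencies.

class StatefulBackend:
    def init_forward_metadata(self, batch):
        self.metadata = {"seq_lens": batch["seq_lens"]}  # hidden state

    def forward(self, q, kv_cache):
        return (q, kv_cache, self.metadata)  # depends on a prior call

class StatelessBackend:
    def build_metadata(self, batch):
        return {"seq_lens": batch["seq_lens"]}  # returned, never stored

    def forward(self, q, kv_cache, metadata):
        return (q, kv_cache, metadata)  # pure function of its arguments

batch = {"seq_lens": [5, 7]}
backend = StatelessBackend()
md = backend.build_metadata(batch)
out = backend.forward("q", "kv", md)
```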
Parallelism
Pipeline parallelism refactor for long-context prefill and high-throughput decoding
PoC: @ShangmingCai
Slack: #pipeline-parallel
Issue: [Roadmap] Pipeline parallelism roadmap #11857
Expert parallelism refactor
PoC: @ch-wan
Slack: #expert-parallel
Issue: [Roadmap] MoE Refactor #8715
Elastic parallel PRs: [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP #10423, [4/N] Elastic EP support deepep backend #11837
Context parallelism
Prefill CP: [Feature] Support context parallel for Qwen3 model #16632
Megatron SP: [WIP][Feature] support tp-sp on qwen2/3 & deepseek v2/3/3.2 #12820
Decode CP:
Compatibility goals
GB200/GB300 NVL72 optimizations
PoC: @Fridge003 @fzyzcjy
More details in PD Disaggregation/Large Scale Serving section of SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130
Slack: #deepseek-large-scale-serving
Server Reliability
Kernel
JIT kernels
Roadmap: [Roadmap] JIT kernel development #17035
Issue: [Feature] sgl-kernel wheel slimming plan tracking #17865
PoC: @DarkSharpness
Integrate Flashinfer kernels
More details in Flashinfer section of SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130
Slack: #flashinfer-kernels
Tune FP8 gemm in Cutlass
Slack: #kernel-dev
Communication kernel work
Slack: #kernel-dev
Automated nightly fusion detection
Workflow: https://github.com/sgl-project/sglang/actions/runs/19004823026
Slack: #ci-cd-build-release
Speculative Decoding
PD Disaggregation
KV Cache System & Memory Pool
PoC: @xiezhq-hermann
Issue: [Feature] HiCache for Hybrid and Sparse LLMs #12826
Slack: #kv-cache-store
Sparse attention and KV cache scheduler for GPU/CPU
PR: [Feature] Support Sparse Attention and KV cache scheduling between CPU and GPU for GQA/DSA. #11191
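Scheduling KV cache between GPU and CPU generally amounts to keeping hot sequences' KV blocks on the GPU and offloading cold ones to host memory. A simplified LRU-style sketch of that policy (hypothetical, not the PR's implementation):

```python
from collections import OrderedDict

# Simplified LRU scheduler for KV blocks: a fixed GPU block budget, with
# the least-recently-used sequence's KV "offloaded" to CPU when full.

class KVScheduler:
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.on_gpu = OrderedDict()  # seq_id -> num_blocks, in LRU order
        self.on_cpu = {}

    def touch(self, seq_id, num_blocks):
        """Mark a sequence active: fetch it from CPU if offloaded, then
        admit it on GPU, evicting LRU sequences until within budget."""
        if seq_id in self.on_cpu:
            num_blocks = self.on_cpu.pop(seq_id)
        self.on_gpu[seq_id] = num_blocks
        self.on_gpu.move_to_end(seq_id)
        while sum(self.on_gpu.values()) > self.gpu_capacity:
            victim, blocks = self.on_gpu.popitem(last=False)  # evict LRU
            self.on_cpu[victim] = blocks

sched = KVScheduler(gpu_capacity=4)
sched.touch("a", 2)
sched.touch("b", 2)
sched.touch("a", 2)  # "a" becomes most recently used
sched.touch("c", 2)  # over budget -> "b" is offloaded to CPU
```

The real scheduler additionally accounts for sparse-attention block selection (GQA/DSA) and overlaps the CPU-GPU transfers with compute; this sketch shows only the residency decision.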
Diffusion (Multimodal Generation)
Multimodal Models
Day-0 support for major models; add more OCR models
Contributors: @mick @JustinTong0323 @yuan-luo
Performance improvements: better prefix & embedding cache
Faster CUDA IPC in MQ for large video/images
PR: [FEAT] Shared mem pool based cuda ipc for multi-modal data transport #11917
Omni support
RFC: [RFC] SGLang-Omni Design #16546
Slack: #multi-modal
Quantization
General support for various quantization formats and refactor
Issue: [Roadmap] Quantization Modifications #15194
ModelOpt support
PoC: @Edwardf0t1
More details in Model Optimizer section of SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130
Slack: #modelopt
Communication quantization (fp4/fp8 allreduce/allgather/alltoall)
Slack: #quantization
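Quantized collectives trade precision for bandwidth: each rank quantizes its tensor to a low-bit format before communication, and the result is dequantized after the reduction. A toy sketch of the idea using int8 with a per-tensor scale (pure Python, no real communication; illustrative only):

```python
# Toy illustration of a quantized allreduce(sum): each rank quantizes its
# vector to int8 plus one fp scale, roughly halving bytes on the wire
# versus fp16 payloads, at the cost of rounding error.

def quantize_int8(vec):
    """Symmetric per-tensor int8 quantization: value ~= q * scale."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(q, scale):
    return [x * scale for x in q]

def quantized_allreduce(rank_vectors):
    """Simulate the collective: quantize per rank, sum dequantized values."""
    payloads = [quantize_int8(v) for v in rank_vectors]
    out = [0.0] * len(rank_vectors[0])
    for q, scale in payloads:
        for i, x in enumerate(dequantize(q, scale)):
            out[i] += x
    return out

# Two simulated ranks; exact sum would be [1.5, -1.0, -0.5].
result = quantized_allreduce([[1.0, -2.0, 0.5], [0.5, 1.0, -1.0]])
```

Real fp4/fp8 collectives quantize inside the fused communication kernel and often use per-block scales, but the quantize-reduce-dequantize structure is the same.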
Multi-LoRA Serving
PoC: @Fridge003
Contributors: @ConnorLi96 @lifuhuang @glenliu21 @Jonahcb
Slack: #lora
Prefill-Only
PoC: @sundar24295s
Slack: #prefill-only
RL Framework Integration
Slack: #reinforcement-learning, #slime-rl-framework
Diffusion Language Models (DLLMs)
PoC: Zehuan Li, Jinwei Yao, Chenyang Zhao
RFC: Block Diffusion Large Language Model (dLLM) Framework
Roadmap: [Roadmap] Diffusion LLMs (2025 Q4 & 2026 Q1) #14199
Hardware
Model Coverage
PoC: @wisclmy0611 @JustinTong0323
Slack: #dev
Model Gateway & API Layer
Support multimodality and image processor in gRPC mode
Support PII detection and a classify API for classifying the intent and complexity of the input
Semantic Routing Support
Allow Gateway to actively listen to SGLang server's KV cache events to better handle routing decisions in gRPC mode
Allow SGLang server to start with both gRPC and HTTP server
Model Gateway terminal UI
Reactive UI to launch workers remotely, supporting both local and remote machines
Natively support the Anthropic Messages API instead of wrapping around chat completions in gRPC mode
Gateway SDK supporting Go, Python, and Node.js for every Rust crate (policies, tokenizer, parsers, etc.)
Metrics enhancement, including tracing and model-specific metrics (TTFT, TPOT, etc.)
PoC: @slin1237 @CatherineSue
Issue: SGLang Autonomous Model Gateway Roadmap #13098
Slack: #router-sig
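The model-specific metrics above can be derived from per-token timestamps: TTFT is the delay from request start to the first output token, and TPOT is the mean gap between subsequent tokens. A minimal sketch of the computation (illustrative, not the gateway's code):

```python
# Compute TTFT (time to first token) and TPOT (time per output token)
# from a request's start time and the arrival timestamps of its tokens.

def compute_ttft_tpot(request_start, token_timestamps):
    """TPOT averages the inter-token gaps after the first token; it is
    undefined (None) when fewer than two tokens were produced."""
    ttft = token_timestamps[0] - request_start
    if len(token_timestamps) < 2:
        return ttft, None
    decode_time = token_timestamps[-1] - token_timestamps[0]
    tpot = decode_time / (len(token_timestamps) - 1)
    return ttft, tpot

# A request starting at t=0.0 whose first token arrives at 0.2 s,
# followed by one token every 50 ms:
ttft, tpot = compute_ttft_tpot(0.0, [0.2, 0.25, 0.3, 0.35])
```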
Tracing and Profiling
Advanced Priority Scheduling
PoC: @harrisonlimh
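Priority scheduling typically orders waiting requests by an explicit priority, with arrival order as the tiebreaker so equal-priority requests stay FIFO. A minimal heap-based sketch of that policy (hypothetical, not SGLang's scheduler):

```python
import heapq
import itertools

# Minimal priority scheduler: lower priority value runs first, and
# requests with equal priority are served in arrival (FIFO) order,
# enforced by a monotonically increasing sequence number in the heap key.

class PriorityScheduler:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tiebreaker

    def submit(self, request_id, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def next_request(self):
        """Pop the highest-priority (lowest value), oldest request."""
        return heapq.heappop(self._heap)[2]

sched = PriorityScheduler()
sched.submit("bulk-job", priority=10)
sched.submit("interactive", priority=0)
sched.submit("interactive-2", priority=0)
order = [sched.next_request() for _ in range(3)]
```

An "advanced" version would add starvation avoidance (e.g. aging low-priority requests) and preemption, but the ordering core looks like this.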
CI / Release / Maintenance
CI suites refactor: [Roadmap] CI suites organization #13808
Improve CI monitor workflow
Improve nightly tests
Full feature coverage CI with all combinations (every two days)
Coverage of latest hardware (B300/GB200)
More details in CI/CD section of SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130
Slack: #ci-cd-build-release, #help-desk