
| Documentation | Blog |
- [2025/05] - Arctic Inference w. Shift Parallelism: The Fastest Open Source Inference System for Enterprise AI
- [2025/05] - Scaling vLLM for Embeddings: 16x Throughput and Cost Reduction
- [2025/05] - Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training
- [2025/04] - Low-Latency and High-Throughput Inference for Long Context w. Sequence Parallelism (Ulysses)
Arctic Inference is an open-source vLLM plugin that brings Snowflake’s inference innovations to the community, delivering the fastest and most cost-effective open-source inference for LLMs and Embeddings.
Arctic Inference achieves high throughput and low latency through a holistic set of inference optimizations:
| Advanced Parallelism | Speculative Decoding | Model Optimization | Other Optimizations |
|---|---|---|---|
| Arctic Ulysses (blog)<br>Shift Parallelism (blog) | Arctic Speculator (blog)<br>Suffix Decoding (blog, paper) | SwiftKV (blog, paper) | Embeddings (blog)<br>Reasoning (blog, paper) |
For real-world LLM workloads, a single deployment of Arctic Inference + vLLM achieves:
- 3.4x faster request completion and 1.06x higher throughput compared to the best throughput-optimized deployment (TP=1, DP=8)
- 1.7x higher throughput and 1.28x faster request completion compared to the best latency-optimized deployment (TP=8, DP=1)
Arctic Inference + vLLM achieves the elusive "trifecta" of quicker response, higher throughput, and faster generation in a single deployment:
- 2.25x faster response time (prefill throughput per request)
- 1.75x faster generation per request
- SOTA combined throughput
See our blog for evaluation details.
For embeddings, Arctic Inference + vLLM delivers a whopping 1.4M toks/sec per GPU:
- 16x faster than plain vLLM on short sequences and 4.2x faster on long sequences
- 2.4x faster than Text Embeddings Inference (TEI) on short sequences and at parity for long sequences
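Embedding models run through vLLM's standard pooling interfaces once Arctic Inference is installed (installation is shown next). As a minimal, illustrative sketch using vLLM's `LLM.embed` API — the specific model name and the `task="embed"` argument are assumptions for illustration, not the benchmark setup:

```python
import vllm
from vllm import LLM

# Load Arctic Inference's vLLM plugin (same step as in the generation example below).
vllm.plugins.load_general_plugins()

# Illustrative embedding model; any vLLM-supported embedding model can be used.
llm = LLM(model="Snowflake/snowflake-arctic-embed-m-v1.5", task="embed")

outputs = llm.embed(["Snowflake is a data cloud company."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```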
```console
$ pip install arctic-inference[vllm]
```
Once installed, Arctic Inference automatically patches vLLM to use Shift Parallelism and the other optimizations listed above, and users can continue to use their familiar vLLM APIs and CLI. It’s easy to get started!
The examples below enable Shift Parallelism, Speculative Decoding, and SwiftKV all at once!
```bash
vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
    --quantization "fp8" \
    --tensor-parallel-size 1 \
    --ulysses-sequence-parallel-size 2 \
    --enable-shift-parallel \
    --speculative-config '{
        "method": "arctic",
        "model": "Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
        "num_speculative_tokens": 3,
        "enable_suffix_decoding": true,
        "disable_by_batch_size": 64
    }'
```
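Once the server is running, it exposes vLLM's standard OpenAI-compatible API. As a minimal sketch of how to query it (the port 8000 and the placeholder API key are vLLM's defaults; adjust them to your deployment):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct",
    messages=[
        {"role": "user", "content": "Write an essay about the importance of higher education."}
    ],
    temperature=0.0,
    max_tokens=800,
)
print(response.choices[0].message.content)
```

The same optimizations are also available for offline inference directly from Python: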
```python
import vllm
from vllm import LLM, SamplingParams

vllm.plugins.load_general_plugins()

llm = LLM(
    model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct",
    quantization="fp8",
    tensor_parallel_size=1,
    ulysses_sequence_parallel_size=2,
    enable_shift_parallel=True,
    speculative_config={
        "method": "arctic",
        "model": "Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
        "num_speculative_tokens": 3,
        "enable_suffix_decoding": True,
        "disable_by_batch_size": 64,
    },
)

conversation = [
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

sampling_params = SamplingParams(temperature=0.0, max_tokens=800)

outputs = llm.chat(conversation, sampling_params=sampling_params)
```
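`llm.chat` returns a list of vLLM `RequestOutput` objects; a minimal way to inspect the generated text:

```python
# Each RequestOutput holds one or more CompletionOutputs; print the generated text.
for output in outputs:
    print(output.outputs[0].text)
```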