

Arctic Inference

Arctic Inference is an open-source vLLM plugin that brings Snowflake’s inference innovations to the community, delivering the fastest and most cost-effective open-source inference for LLMs and Embeddings.

Arctic Inference achieves high throughput and low latency through a holistic set of inference optimizations:

  • Advanced Parallelism: Arctic Ulysses (blog), Shift Parallelism (blog)
  • Speculative Decoding: Arctic Speculator (blog), Suffix Decoding (blog, paper)
  • Model Optimization: SwiftKV (blog, paper)
  • Other Optimizations: Embeddings (blog), Reasoning (blog, paper)

Optimized LLM Inference

For real-world LLM workloads, a single deployment of Arctic Inference + vLLM achieves:

  • 3.4x faster request completion and 1.06x higher throughput compared to the best throughput-optimized deployment (TP=1, DP=8)
  • 1.7x higher throughput and 1.28x faster request completion compared to the best latency-optimized deployment (TP=8, DP=1)

Arctic Inference + vLLM achieves the elusive "trifecta" of quicker response, higher throughput, and faster generation in a single deployment:

  • 2.25x faster response time (prefill throughput per request)
  • 1.75x faster generation per request
  • SOTA combined throughput

Optimized Embeddings


See our blog for evaluation details.

For embeddings, Arctic Inference + vLLM delivers a whopping 1.4M toks/sec per GPU:

  • 16x faster than plain vLLM on short sequences and 4.2x faster on long sequences
  • 2.4x faster than Text Embeddings Inference (TEI) on short sequences and at parity for long sequences
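
Embedding models run through the same vLLM entry points. Below is a minimal offline sketch using vLLM's embedding ("pooling") runner; the model name and the task="embed" setting are illustrative vLLM options, not Arctic-specific flags:

import vllm
from vllm import LLM

# Load vLLM general plugins so Arctic Inference's patches apply
# (same step as in the offline LLM example further below).
vllm.plugins.load_general_plugins()

# Any embedding model supported by vLLM can be used here.
llm = LLM(model="Snowflake/snowflake-arctic-embed-m-v1.5", task="embed")

outputs = llm.embed(["Snowflake is the AI Data Cloud."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality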

Quick Start

$ pip install arctic-inference[vllm]

Once installed, Arctic Inference automatically patches vLLM with Shift Parallelism and the other optimizations listed above, and users can continue to use their familiar vLLM APIs and CLI. It's easy to get started!
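
vLLM discovers general plugins through Python entry points. As a quick sanity check (assuming Arctic Inference registers itself under vLLM's vllm.general_plugins entry-point group, and Python 3.10+ for the group= keyword), you can list what vLLM will see:

from importlib.metadata import entry_points

# Plugins registered under vLLM's general-plugin entry-point group.
# Arctic Inference should appear here after `pip install arctic-inference[vllm]`.
plugins = entry_points(group="vllm.general_plugins")
print([ep.name for ep in plugins])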

Running Arctic Inference with vLLM

The examples below enable Shift Parallelism, Speculative Decoding, and SwiftKV all at once!

Serving

vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
    --quantization "fp8" \
    --tensor-parallel-size 1 \
    --ulysses-sequence-parallel-size 2 \
    --enable-shift-parallel \
    --speculative-config '{
        "method": "arctic",
        "model":"Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
        "num_speculative_tokens": 3,
        "enable_suffix_decoding": true,
        "disable_by_batch_size": 64
    }'
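
Once the server is up, it exposes vLLM's standard OpenAI-compatible API (on port 8000 by default), so existing clients keep working. A minimal sketch using the openai Python client; the base URL and model name simply mirror the serve command above:

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the API key is ignored by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct",
    messages=[
        {"role": "user", "content": "Write an essay about the importance of higher education."}
    ],
    max_tokens=800,
    temperature=0.0,
)
print(response.choices[0].message.content)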

Offline

import vllm
from vllm import LLM, SamplingParams

vllm.plugins.load_general_plugins()

llm = LLM(
    model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct",
    quantization="fp8",
    tensor_parallel_size=1,
    ulysses_sequence_parallel_size=2,
    enable_shift_parallel=True,
    speculative_config={
        "method": "arctic",
        "model": "Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
        "num_speculative_tokens": 3,
        "enable_suffix_decoding": True,
        "disable_by_batch_size": 64,
    },
)

conversation = [
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

sampling_params = SamplingParams(temperature=0.0, max_tokens=800)

outputs = llm.chat(conversation, sampling_params=sampling_params)
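
Each returned RequestOutput carries the generated completions, so the text can be printed directly:

for output in outputs:
    # Each RequestOutput holds one or more completions; print the first.
    print(output.outputs[0].text)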
