A collection of available inference and serving solutions for LLMs.
Name | Organization | Description | Supported Hardware | Key Features | License |
---|---|---|---|---|---|
vLLM | UC Berkeley | High-throughput and memory-efficient inference and serving engine for LLMs (usage sketch after the table). | CPU, GPU | PagedAttention for optimized memory management, high-throughput serving. | Apache 2.0 |
Text-Generation-Inference | Hugging Face 🤗 | Efficient and scalable text generation inference for LLMs (client sketch after the table). | CPU, GPU | Continuous batching, token streaming, tensor parallelism, optimized transformer kernels. | Apache 2.0 |
llm-engine | Scale AI | Scale AI's open-source engine for fine-tuning and serving LLMs. | CPU, GPU | Scalable deployment, monitoring tools, integration with Scale AI services. | Apache 2.0 |
DeepSpeed | Microsoft | Deep learning optimization library for easy, efficient, and effective distributed training and inference. | CPU, GPU | ZeRO redundancy optimizer, mixed-precision training, model parallelism. | MIT |
OpenLLM | BentoML | Operating LLMs in production with ease. | CPU, GPU | Model serving, deployment orchestration, integration with BentoML. | Apache 2.0 |
LMDeploy | InternLM Team | Toolkit for compressing, deploying, and serving LLMs. | CPU, GPU | Model compression, deployment automation, serving optimization. | Apache 2.0 |
FlexFlow | CMU, Stanford, UCSD | Distributed deep learning framework; its FlexFlow Serve component targets low-latency LLM serving. | CPU, GPU, TPU | Automatic parallelization, support for complex models, scalability. | Apache 2.0 |
CTranslate2 | OpenNMT | Fast inference engine for Transformer models. | CPU, GPU | Int8 quantization, multi-threaded execution, optimized for translation models. | MIT |
FastChat | lm-sys | Open platform for training, serving, and evaluating large language models; release repo for Vicuna and Chatbot Arena. | CPU, GPU | Chatbot framework, multi-turn conversations, evaluation tools. | Apache 2.0 |
Triton Inference Server | NVIDIA | Optimized cloud and edge inferencing solution. | CPU, GPU | Model ensemble, dynamic batching, support for multiple frameworks. | BSD-3-Clause |
Lepton.AI | lepton.ai | Pythonic framework to simplify AI service building. | CPU, GPU | Service orchestration, API generation, scalability. | MIT |
ScaleLLM | Vectorch | High-performance inference system for LLMs, designed for production environments. | CPU, GPU | Low-latency serving, high throughput, production-ready. | Apache 2.0 |
Lorax | Predibase | Serve hundreds of fine-tuned LLMs in production for the cost of one. | CPU, GPU | Model multiplexing, cost-efficient serving, scalability. | Apache 2.0 |
TensorRT-LLM | NVIDIA | Provides users with an easy-to-use Python API to define LLMs and build TensorRT engines. | GPU | TensorRT optimization, high-performance inference, integration with NVIDIA GPUs. | Apache 2.0 |
mistral.rs | mistral.rs | Blazingly fast LLM inference. | CPU, GPU | Rust-based implementation, performance optimization, lightweight. | MIT |
NanoFlow | NanoFlow | Throughput-oriented high-performance serving framework for LLMs. | CPU, GPU | High throughput, low latency, optimized for large-scale deployments. | Apache 2.0 |
LMCache | LMCache | KV-cache layer for fast and cost-efficient LLM serving. | CPU, GPU | KV-cache reuse across queries, cost optimization, scalable serving. | Apache 2.0 |
LitServe | Lightning.AI | Lightning-fast serving engine for AI models; flexible, easy, enterprise-scale (server sketch after the table). | CPU, GPU | Rapid deployment, flexible architecture, enterprise integration. | Apache 2.0 |
DeepSeek Inference System Overview | DeepSeek | Design overview of DeepSeek's production inference system, built for higher throughput and lower latency. | CPU, GPU | Optimized performance, low latency, high throughput. | Proprietary |
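To give a concrete feel for what these engines look like in code, here is a minimal offline-generation sketch for vLLM, following its documented quickstart. The model ID is a placeholder; any Hugging Face model vLLM supports can be substituted.

```python
# Minimal vLLM offline-generation sketch (assumes `pip install vllm` and a supported GPU).
from vllm import LLM, SamplingParams

# The engine is backed by PagedAttention; the model ID below is a placeholder.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() accepts a batch of prompts and schedules them for throughput.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```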
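Text-Generation-Inference is typically run as a standalone server (for example via its official Docker image) and queried over HTTP. The sketch below assumes a TGI instance is already listening on localhost:8080 and uses the `huggingface_hub` client to call it.

```python
# Query a running Text-Generation-Inference server (assumed at localhost:8080).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# text_generation() calls the server's generate endpoint; the parameters
# mirror TGI's request schema (max_new_tokens, temperature, ...).
result = client.text_generation(
    "The capital of France is",
    max_new_tokens=32,
    temperature=0.7,
)
print(result)
```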
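LitServe structures a service as a small class with setup/decode/predict/encode hooks. The sketch below follows its documented pattern but substitutes a trivial echo `predict` as a placeholder for a real model.

```python
# Minimal LitServe server sketch (assumes `pip install litserve`).
import litserve as ls

class EchoAPI(ls.LitAPI):
    def setup(self, device):
        # A real deployment would load the model onto `device` here.
        self.prefix = "echo: "

    def decode_request(self, request):
        # Pull the payload field out of the incoming JSON request.
        return request["input"]

    def predict(self, x):
        # Placeholder inference step; stands in for a model forward pass.
        return self.prefix + str(x)

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(EchoAPI(), accelerator="auto")
    server.run(port=8000)
```

Once running, the server exposes a `/predict` endpoint by default, so the service can be exercised with a plain POST request carrying `{"input": "..."}`.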