llm-inference-solutions

A collection of available inference and serving solutions for LLMs.

| Name | Organization | Description | Supported Hardware | Key Features | License |
|------|--------------|-------------|--------------------|--------------|---------|
| vLLM | UC Berkeley | High-throughput and memory-efficient inference and serving engine for LLMs. | CPU, GPU | PagedAttention for optimized memory management, high-throughput serving. | Apache 2.0 |
| Text-Generation-Inference | Hugging Face 🤗 | Efficient and scalable text generation inference for LLMs. | CPU, GPU | Multi-model serving, dynamic batching, optimized for transformers. | Apache 2.0 |
| llm-engine | Scale AI | Scale LLM Engine public repository for efficient inference. | CPU, GPU | Scalable deployment, monitoring tools, integration with Scale AI services. | Apache 2.0 |
| DeepSpeed | Microsoft | Deep learning optimization library for easy, efficient, and effective distributed training and inference. | CPU, GPU | ZeRO redundancy optimizer, mixed-precision training, model parallelism. | MIT |
| OpenLLM | BentoML | Operating LLMs in production with ease. | CPU, GPU | Model serving, deployment orchestration, integration with BentoML. | Apache 2.0 |
| LMDeploy | InternLM Team | Toolkit for compressing, deploying, and serving LLMs. | CPU, GPU | Model compression, deployment automation, serving optimization. | Apache 2.0 |
| FlexFlow | CMU, Stanford, UCSD | A distributed deep learning framework. | CPU, GPU, TPU | Automatic parallelization, support for complex models, scalability. | Apache 2.0 |
| CTranslate2 | OpenNMT | Fast inference engine for Transformer models. | CPU, GPU | Int8 quantization, multi-threaded execution, optimized for translation models. | MIT |
| FastChat | lm-sys | Open platform for training, serving, and evaluating large language models; release repo for Vicuna and Chatbot Arena. | CPU, GPU | Chatbot framework, multi-turn conversations, evaluation tools. | Apache 2.0 |
| Triton Inference Server | NVIDIA | Optimized cloud and edge inferencing solution. | CPU, GPU | Model ensemble, dynamic batching, support for multiple frameworks. | BSD-3-Clause |
| Lepton.AI | lepton.ai | Pythonic framework to simplify AI service building. | CPU, GPU | Service orchestration, API generation, scalability. | MIT |
| ScaleLLM | Vectorch | High-performance inference system for LLMs, designed for production environments. | CPU, GPU | Low-latency serving, high throughput, production-ready. | Apache 2.0 |
| Lorax | Predibase | Serve hundreds of fine-tuned LLMs in production for the cost of one. | CPU, GPU | Model multiplexing, cost-efficient serving, scalability. | Apache 2.0 |
| TensorRT-LLM | NVIDIA | Provides an easy-to-use Python API to define LLMs and build TensorRT engines. | GPU | TensorRT optimization, high-performance inference, integration with NVIDIA GPUs. | Apache 2.0 |
| mistral.rs | mistral.rs | Blazingly fast LLM inference. | CPU, GPU | Rust-based implementation, performance optimization, lightweight. | MIT |
| NanoFlow | NanoFlow | Throughput-oriented high-performance serving framework for LLMs. | CPU, GPU | High throughput, low latency, optimized for large-scale deployments. | Apache 2.0 |
| LMCache | LMCache | Fast and cost-efficient inference. | CPU, GPU | Caching mechanisms, cost optimization, scalable serving. | Apache 2.0 |
| LitServe | Lightning.AI | Lightning-fast serving engine for AI models; flexible, easy, enterprise-scale. | CPU, GPU | Rapid deployment, flexible architecture, enterprise integration. | Apache 2.0 |
| DeepSeek Inference System Overview | DeepSeek | Inference system designed for higher throughput and lower latency. | CPU, GPU | Optimized performance, low latency, high throughput. | Proprietary |
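
As a quick orientation, most of the engines above expose either a Python API or an OpenAI-compatible HTTP server. Below is a minimal sketch of offline batch inference with vLLM (the first entry in the table), assuming vLLM is installed (`pip install vllm`); the model name is only a placeholder, not a recommendation.

```python
# Minimal offline batch inference sketch with vLLM (assumes `pip install vllm`).
from vllm import LLM, SamplingParams

# Load a model; "facebook/opt-125m" is just a small placeholder model.
llm = LLM(model="facebook/opt-125m")

# Sampling configuration for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts.
prompts = ["What is PagedAttention?", "Summarize why batching helps LLM serving."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```

Most of the other serving-oriented entries (Text-Generation-Inference, OpenLLM, LMDeploy, Lorax, LitServe, etc.) follow a similar pattern: point the server at a model, then send requests over HTTP or gRPC; consult each project's documentation for exact commands.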
