llm-inference-solutions

A collection of available inference and serving solutions for LLMs.

| Name | Organization | Description | Supported Hardware | Key Features | License |
|------|--------------|-------------|--------------------|--------------|---------|
| vLLM | UC Berkeley | High-throughput and memory-efficient inference and serving engine for LLMs. | CPU, GPU | PagedAttention for optimized memory management, high-throughput serving. | Apache 2.0 |
| Text-Generation-Inference | Hugging Face 🤗 | Efficient and scalable text generation inference for LLMs. | CPU, GPU | Multi-model serving, dynamic batching, optimized for transformers. | Apache 2.0 |
| llm-engine | Scale AI | Scale LLM Engine public repository for efficient inference. | CPU, GPU | Scalable deployment, monitoring tools, integration with Scale AI services. | Apache 2.0 |
| DeepSpeed | Microsoft | Deep learning optimization library for easy, efficient, and effective distributed training and inference. | CPU, GPU | ZeRO redundancy optimizer, mixed-precision training, model parallelism. | MIT |
| OpenLLM | BentoML | Operating LLMs in production with ease. | CPU, GPU | Model serving, deployment orchestration, integration with BentoML. | Apache 2.0 |
| LMDeploy | InternLM Team | Toolkit for compressing, deploying, and serving LLMs. | CPU, GPU | Model compression, deployment automation, serving optimization. | Apache 2.0 |
| FlexFlow | CMU, Stanford, UCSD | A distributed deep learning framework. | CPU, GPU, TPU | Automatic parallelization, support for complex models, scalability. | Apache 2.0 |
| CTranslate2 | OpenNMT | Fast inference engine for Transformer models. | CPU, GPU | Int8 quantization, multi-threaded execution, optimized for translation models. | MIT |
| FastChat | lm-sys | Open platform for training, serving, and evaluating large language models; release repo for Vicuna and Chatbot Arena. | CPU, GPU | Chatbot framework, multi-turn conversations, evaluation tools. | Apache 2.0 |
| Triton Inference Server | NVIDIA | Optimized cloud and edge inferencing solution. | CPU, GPU | Model ensemble, dynamic batching, support for multiple frameworks. | BSD-3-Clause |
| Lepton.AI | lepton.ai | Pythonic framework to simplify AI service building. | CPU, GPU | Service orchestration, API generation, scalability. | MIT |
| ScaleLLM | Vectorch | High-performance inference system for LLMs, designed for production environments. | CPU, GPU | Low-latency serving, high throughput, production-ready. | Apache 2.0 |
| Lorax | Predibase | Serve hundreds of fine-tuned LLMs in production for the cost of one. | CPU, GPU | Model multiplexing, cost-efficient serving, scalability. | Apache 2.0 |
| TensorRT-LLM | NVIDIA | Provides an easy-to-use Python API to define LLMs and build TensorRT engines. | GPU | TensorRT optimization, high-performance inference, integration with NVIDIA GPUs. | Apache 2.0 |
| mistral.rs | mistral.rs | Blazingly fast LLM inference. | CPU, GPU | Rust-based implementation, performance optimization, lightweight. | MIT |
| NanoFlow | NanoFlow | Throughput-oriented high-performance serving framework for LLMs. | CPU, GPU | High throughput, low latency, optimized for large-scale deployments. | Apache 2.0 |
| LMCache | LMCache | Fast and cost-efficient inference. | CPU, GPU | Caching mechanisms, cost optimization, scalable serving. | Apache 2.0 |
| LitServe | Lightning.AI | Lightning-fast serving engine for AI models; flexible, easy, enterprise-scale. | CPU, GPU | Rapid deployment, flexible architecture, enterprise integration. | Apache 2.0 |
| DeepSeek Inference System Overview | DeepSeek | Inference system designed for higher throughput and lower latency. | CPU, GPU | Optimized performance, low latency, high throughput. | Proprietary |
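
As a quick orientation, most of the engines above expose either a Python API or an OpenAI-compatible HTTP server. Below is a minimal sketch of offline batch inference with vLLM (the first entry in the table), assuming vLLM is installed (`pip install vllm`); the model name is only a placeholder, not a recommendation.

```python
# Minimal offline batch inference sketch with vLLM (assumes `pip install vllm`).
from vllm import LLM, SamplingParams

# Load a model; "facebook/opt-125m" is just a small placeholder model.
llm = LLM(model="facebook/opt-125m")

# Sampling configuration for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts.
prompts = ["What is PagedAttention?", "Summarize why batching helps LLM serving."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```

Most of the other serving-oriented entries (Text-Generation-Inference, OpenLLM, LMDeploy, Lorax, LitServe, etc.) follow a similar pattern: point the server at a model, then send requests over HTTP or gRPC; consult each project's documentation for exact commands.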
