vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs
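A minimal offline-inference sketch using vLLM's Python API; the model ID and sampling values here are arbitrary examples, not recommendations:

```python
from vllm import LLM, SamplingParams

# Load a model (any Hugging Face model ID works; opt-125m is just a small example).
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() batches prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.prompt, out.outputs[0].text)
```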
The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!
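A minimal BentoML service sketch using the 1.2-style decorators; the service name and the upper-casing "model" are placeholder assumptions standing in for real inference:

```python
import bentoml

@bentoml.service
class Echo:
    # A trivial API endpoint; a real service would load a model once
    # at startup and call it here.
    @bentoml.api
    def predict(self, text: str) -> str:
        return text.upper()
```

Assuming the file is named service.py, it can be served locally with `bentoml serve service:Echo`.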
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Standardized Serverless ML Inference Platform on Kubernetes
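Once a model is deployed, KServe exposes the standard V1 inference protocol over HTTP (`POST /v1/models/<name>:predict` with a JSON "instances" payload); a client sketch, with the hostname and model name as placeholder assumptions:

```python
import requests

# Hypothetical service host and model name for a deployed InferenceService.
url = "http://sklearn-iris.default.example.com/v1/models/sklearn-iris:predict"
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}

resp = requests.post(url, json=payload, timeout=10)
print(resp.json())  # e.g. {"predictions": [...]}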
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
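A hedged client sketch against LightLLM's HTTP generate endpoint, assuming an api_server is already running locally on port 8000 and accepts the project's inputs/parameters JSON schema:

```python
import requests

# Assumes a LightLLM api_server is listening on localhost:8000.
resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"inputs": "What is AI?", "parameters": {"max_new_tokens": 32}},
)
print(resp.json())
```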
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
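A hedged request sketch showing the multi-LoRA idea: the server multiplexes many adapters over one shared base model, and each request selects an adapter. The host, port, and adapter ID below are hypothetical:

```python
import requests

# Assumes a LoRAX server on localhost:8080; adapter_id picks which
# fine-tuned LoRA adapter to apply on top of the shared base model.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Summarize: LoRA adapters share one base model.",
        "parameters": {"max_new_tokens": 64, "adapter_id": "my-org/my-adapter"},
    },
)
print(resp.json())
```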
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
The simplest way to serve AI/ML models in production
A high-performance ML model serving framework offering dynamic batching and CPU/GPU pipelines to fully exploit your compute resources
Python + Inference - A model deployment library in Python, built to be the simplest possible model inference server.
Learn to serve Stable Diffusion models on cloud infrastructure at scale. This Lightning App shows load balancing, orchestration, pre-provisioning, dynamic batching, GPU inference, and micro-services working together via the Lightning Apps framework.
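A minimal worker sketch using mosec's Server/Worker API; the echo logic is a placeholder for real model inference:

```python
from mosec import Server, Worker

class Inference(Worker):
    # forward() receives one deserialized request payload (or a batch,
    # when dynamic batching is enabled) and returns the response payload.
    def forward(self, data: dict) -> dict:
        return {"echo": data}

if __name__ == "__main__":
    server = Server()
    server.append_worker(Inference)
    server.run()
```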
FastAPI Skeleton App to serve machine learning models production-ready.
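A minimal FastAPI serving skeleton of the kind such a project provides; the endpoint shape and the stand-in "model" are illustrative assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # Placeholder: a real app would call a loaded ML model here.
    return {"prediction": sum(req.features)}
```

Assuming the file is main.py, run it with `uvicorn main:app`.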
OneDiffusion: Run any Stable Diffusion models and fine-tuned weights with ease
A multi-functional library for full-stack Deep Learning. Simplifies Model Building, API development, and Model Deployment.
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (GPUs to come; PRs welcome).
BentoML Example Projects 🎨
ClearML - Model-Serving Orchestration and Repository Solution
Deploy DL/ML inference pipelines with minimal extra code.
The official python package for NimbleBox. Exposes all APIs as CLIs and contains modules to make ML 🌸