# llm-serving

Here are 109 public repositories matching this topic...

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Updated Jul 22, 2025
Python

ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Updated Jul 22, 2025
Python

liguodongiot / llm-action

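This project aims to share the technical principles behind large models along with hands-on experience (large-model engineering and deploying large-model applications).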

llm llmops llm-serving llm-training llm-inference

Updated Jul 10, 2025
HTML

sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

Updated Jul 22, 2025
Python

bentoml / OpenLLM

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

llama mistral fine-tuning mlops bentoml vicuna llm model-inference llmops llm-serving llm-inference open-source-llm llama2 openllm llm-ops llama3-1 llama3-2 llama3-2-vision

Updated Jul 21, 2025
Python

NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

cuda pytorch moe blackwell llm-serving

Updated Jul 22, 2025
C++

skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Updated Jul 22, 2025
Python

bentoml / BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

python machine-learning deep-learning model-serving multimodal mlops ml-engineering ai-inference llm generative-ai llmops llm-serving model-inference-service llm-inference inference-platform

Updated Jul 22, 2025
Python

superduper-io / superduper

Superduper: End-to-end framework for building custom AI applications and agents.

Updated Jul 22, 2025
Python

PaddlePaddle / FastDeploy

High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

inference openai serving ernie llm llm-serving vllm ernie-45 ernie-45-vl

Updated Jul 22, 2025
Python

predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

transformers pytorch llama gpt lora model-serving fine-tuning llm llmops llm-serving llm-inference

Updated May 21, 2025
Python

gpustack / gpustack

Simple, scalable AI model deployment on GPU clusters

Updated Jul 22, 2025
Python

microsoft / aici

AICI: Prompts as (Wasm) Programs

rust ai wasm inference transformer language-model model-serving wasmtime llm llmops llm-serving llm-inference llm-framework

Updated Jan 22, 2025
Rust

MoonshotAI / MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

pytorch transformer moe llm llm-serving llm-training flash-attention

Updated Apr 3, 2025
Python

ray-project / ray-llm

RayLLM - LLMs on Ray (Archived). Read README for more info.

ray llm llm-serving

Updated Mar 13, 2025

thu-pacman / chitu

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

gpu pytorch model-serving llm llm-serving deepseek

Updated Jul 22, 2025
Python

vllm-project / vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

inference transformer model-serving mlops ascend llm llmops llm-serving vllm

Updated Jul 22, 2025
Python

zhihu / ZhiLight

A highly optimized LLM inference acceleration engine for Llama and its variants.

cuda pytorch llama gpt inference-engine model-serving llm llm-serving llm-inference deepseek-r1

Updated Jul 10, 2025
C++

mosecorg / mosec

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

python rust machine-learning deep-learning mxnet tensorflow gpu cv pytorch tts hacktoberfest model-serving nerual-network machine-learning-platform jax mlops llm llm-serving

Updated Jul 11, 2025
Python

efeslab / Nanoflow

A throughput-oriented high-performance serving framework for LLMs

cuda inference model-serving llm llm-serving llama2

Updated Jul 9, 2025
Jupyter Notebook
